Context engineering is the art and science of managing what information is included in the context window of large language models (LLMs) and AI agents at each step of their operation. As LLMs become more capable and agents more autonomous, effective context engineering is essential for performance, cost, and reliability.

What Context Engineering Actually Is

Context engineering is the systematic design of AI systems that:

  1. Capture relevant information from every interaction
  2. Store this information in structured, retrievable formats
  3. Analyze patterns and relationships across data points
  4. Apply insights to improve future interactions
  5. Evolve understanding over time through continuous learning
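The five steps above can be sketched as a minimal pipeline. All names here are illustrative, not a standard API; a real system would back this with a database and an LLM-driven analyzer.

```python
from collections import defaultdict

class ContextStore:
    """Minimal sketch of the capture -> store -> analyze -> apply loop."""

    def __init__(self):
        # Step 2: structured, retrievable storage keyed by user.
        self.events = defaultdict(list)

    def capture(self, user_id, event):
        """Step 1: record relevant information from an interaction."""
        self.events[user_id].append(event)

    def analyze(self, user_id):
        """Step 3: find a simple pattern, e.g. the user's most common topic."""
        topics = [e["topic"] for e in self.events[user_id]]
        return max(set(topics), key=topics.count) if topics else None

    def apply(self, user_id, prompt):
        """Step 4: fold the learned pattern back into the next prompt."""
        topic = self.analyze(user_id)
        return f"[user often asks about {topic}] {prompt}" if topic else prompt

store = ContextStore()
store.capture("u1", {"topic": "billing"})
store.capture("u1", {"topic": "billing"})
store.capture("u1", {"topic": "login"})
enriched = store.apply("u1", "How do I update my card?")
```

Step 5 (evolution) falls out of the loop itself: every `capture` changes what the next `analyze` returns.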

Context Data Types and Collection Strategies

```json
{
  "customer_id": "unique_identifier",
  "preferences": {
    "communication_style": "direct",
    "preferred_channels": ["email", "sms"],
    "timezone": "EST",
    "language": "en-US",
    "accessibility_needs": []
  },
  "behavioral_patterns": {
    "peak_activity_times": ["9-11am", "2-4pm"],
    "typical_session_length": "15-20 minutes",
    "decision_making_style": "analytical",
    "response_time_expectations": "immediate"
  },
  "relationship_history": {
    "customer_since": "2022-03-15",
    "lifetime_value": 15000,
    "satisfaction_score": 4.7,
    "escalation_triggers": ["billing issues", "technical problems"]
  }
}
```

Six Specialized Memory Types:

  • Core Memory: Essential system context
  • Episodic Memory: User-specific events and experiences
  • Semantic Memory: Concepts and named entities
  • Procedural Memory: Step-by-step instructions
  • Resource Memory: Documents and media
  • Knowledge Vault: Critical verbatim information

Extending the LLM Context

Long Context Window Extensions

Technical Methods:

  • Modified Positional Encoding: Techniques like RoPE (Rotary Position Embedding), ALiBi, and Context-Adaptive Positional Encoding (CAPE) that allow models to handle longer sequences

  • Efficient Attention Mechanisms: Methods like Core Context Aware Attention (CCA-Attention), which reduces computational complexity while maintaining performance

  • Memory-Efficient Transformers: Approaches like Emma and MemTree that segment documents while maintaining fixed GPU memory usage
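RoPE, the first technique listed above, encodes position by rotating each pair of query/key channels by an angle proportional to the token's position, so attention scores depend on relative offsets. A minimal NumPy sketch (this uses the half-split pairing convention; the original paper interleaves adjacent channels, but the idea is the same):

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply Rotary Position Embedding to x of shape (seq_len, dim), dim even."""
    seq_len, dim = x.shape
    half = dim // 2
    # Rotation frequencies fall off geometrically across channel pairs.
    freqs = base ** (-np.arange(half) / half)        # (half,)
    angles = np.outer(np.arange(seq_len), freqs)     # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Standard 2D rotation applied to each (x1_i, x2_i) channel pair.
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

q = np.random.randn(8, 64)
q_rot = rope(q)  # position 0 is rotated by angle 0, i.e. left unchanged
```

Because the rotation angle grows linearly with position, the dot product between a rotated query and key depends only on their relative distance, which is what makes extrapolation tricks for longer contexts possible.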

A Survey of Context Engineering for Large Language Models

Components of Context Engineering

  • Context Retrieval and Generation
  • Context Processing
  • Context Management

StreamingLLM

When a conversation grows past the context window, the model can no longer attend to the evicted early messages, so its responses lose track of them. StreamingLLM addresses this.

StreamingLLM permanently keeps a small number of initial tokens, called attention sinks, in the KV cache, alongside a sliding window of recent tokens (for example, the last 512). Attention scores concentrate disproportionately on those first tokens, so retaining them keeps generation stable while older tokens are evicted, letting the model process streams far longer than its context window.

For example, suppose the user says "I am going to London tonight." After many more turns, a plain sliding-window model evicts that message and can no longer answer questions about it. StreamingLLM keeps generation fluent over such long streams; note, though, that evicted content is still gone, so facts that must survive (like the London trip) need an explicit memory mechanism on top.
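The eviction policy can be sketched in a few lines: keep the first few tokens forever, keep a rolling window of recent tokens, drop everything in between. Token counts here are tiny for illustration.

```python
from collections import deque

class StreamingKVCache:
    """Sketch of StreamingLLM's eviction policy: always keep the first
    `n_sinks` tokens (attention sinks) plus a sliding window of the most
    recent `window` tokens; everything in between is evicted."""

    def __init__(self, n_sinks=4, window=512):
        self.n_sinks = n_sinks
        self.sinks = []                     # first tokens, kept forever
        self.recent = deque(maxlen=window)  # rolling window of recent tokens

    def append(self, token):
        if len(self.sinks) < self.n_sinks:
            self.sinks.append(token)
        else:
            self.recent.append(token)       # deque drops the oldest itself

    def visible(self):
        """Tokens the model can still attend to."""
        return self.sinks + list(self.recent)

cache = StreamingKVCache(n_sinks=2, window=3)
for t in ["a", "b", "c", "d", "e", "f", "g"]:
    cache.append(t)
# Sinks "a", "b" survive; the window keeps "e", "f", "g"; "c", "d" are gone.
```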

Best resources to check

LongMemEval benchmark

LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

| Memory Component | Purpose and Function | Key Information Stored |
| --- | --- | --- |
| Core Memory | Stores high-priority, persistent information that must always remain visible to the agent, divided into persona (agent identity) and human (user identity) blocks. It maintains compactness by triggering a controlled rewrite process if capacity exceeds 90%. | User preferences, enduring facts (name, cuisine enjoyment). |
| Episodic Memory | Captures time-stamped events and temporally grounded interactions, functioning as a structured log to track change over time. | Events, experiences, user routines, including event type, summary, details, actor, and timestamp. |
| Semantic Memory | Maintains abstract knowledge and factual information independent of time, serving as a knowledge base for general concepts, entities, and relationships. | Concepts, named entities, relationships (e.g., social graph), organized hierarchically in a tree structure. |
| Procedural Memory | Stores structured, goal-directed processes and actionable knowledge that assists with complex tasks. | How-to guides, operational workflows, step-by-step instructions. |
| Resource Memory | Handles full or partial documents and multi-modal files that the user is actively engaged with but that do not fit other categories. | Documents, transcripts, images, or voice transcripts. |
| Knowledge Vault | A secure repository for verbatim and sensitive information that must be preserved exactly, protected via access control for high-sensitivity entries. | Credentials, addresses, phone numbers, API keys. |
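A system with these memory types needs a router that decides where each incoming item belongs. The keyword heuristics below are purely illustrative (a real system would use an LLM classifier), but they show the shape of the decision:

```python
import re

def route_memory(item: str) -> str:
    """Toy router for the memory types above; rules are illustrative only."""
    if re.search(r"\b(password|api key|credential|ssn)\b", item, re.I):
        return "knowledge_vault"     # verbatim, access-controlled
    if re.search(r"\b(step \d|first.*then|how to)\b", item, re.I):
        return "procedural"          # step-by-step instructions
    if re.search(r"\b(yesterday|today|on \d{4}-\d{2}-\d{2})\b", item, re.I):
        return "episodic"            # time-stamped events
    if re.search(r"\b(document|transcript|image)\b", item, re.I):
        return "resource"            # files the user is working with
    return "semantic"                # default: facts and concepts

route_memory("My API key is sk-123")        # -> "knowledge_vault"
route_memory("Yesterday I flew to Berlin")  # -> "episodic"
```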

Context Compressor

Agentic Context Engineering

The framework uses three agents:

  • Generator: This is the agent’s “brain” during execution. It receives the original task (e.g., “Split the cable bill among roommates”) and attempts to solve it using the current context: the accumulated playbook. It doesn’t have the full answer or know every rule; it only has what has been written so far in the playbook, a collection of past bullets. The Generator reads the playbook, applies any relevant insights, generates a reasoning trace, and outputs code or a response.

  • Reflector: This is where the learning begins. The Reflector receives the Generator’s full trajectory—and, if available, the ground truth. It’s not a chatbot summarizing what happened. It’s an expert auditor.

  • Curator: The Curator takes the Reflector’s diagnostic, together with the current playbook, and decides what to add, not what to rewrite.
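The three roles compose into a simple loop. In this sketch the "agents" are plain functions standing in for LLM calls, and the playbook is an append-only list of bullet strings:

```python
def generator(task, playbook):
    """Solve the task using whatever insights the playbook holds so far."""
    return {"task": task, "used": list(playbook), "answer": f"plan for {task}"}

def reflector(trajectory, ground_truth=None):
    """Audit the trajectory; emit a diagnostic (here: a canned lesson)."""
    if ground_truth and trajectory["answer"] != ground_truth:
        return f"When solving '{trajectory['task']}', verify against {ground_truth!r}."
    return None

def curator(playbook, diagnostic):
    """Append new insights only; never rewrite existing bullets."""
    if diagnostic and diagnostic not in playbook:
        playbook.append(diagnostic)
    return playbook

playbook = []
traj = generator("split the cable bill", playbook)
lesson = reflector(traj, ground_truth="each pays $20")
playbook = curator(playbook, lesson)  # playbook now holds one learned bullet
```

The append-only Curator is the key design choice: it avoids the "context collapse" that happens when an LLM is asked to rewrite its whole memory each round.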


Techniques

  • Thin Tool Layer: Instead of binding dozens of tools to the model, keep the tool definition layer minimal to save tokens.

  • Progressive Disclosure: Do not dump all instructions or tools into the context at once. Use “skills” (standard operating procedures stored as markdown files) that the agent can read from the file system only when needed.

  • Context Offloading: When a tool generates a massive output (e.g., a long search result), do not put the whole text in the chat history. Save it to a file and give the model a pointer or a summary. This prevents the context window from being flooded.

  • Context Caching: Cache the invariant parts of the prompt (like the system prompt and history) to reduce costs and latency.

  • The File System Primitive: Leading agents (like Claude Code and Manus) use the file system as their primary memory and workspace. This allows them to offload context, manage “skills,” and manipulate data without bloating the context window.

  • The “Ralph Wiggum” Loop: For long-running tasks, use a serial loop where an agent picks a task from a plan file, executes it, updates a progress file, and then terminates. A fresh agent instance then picks up the next task by reading the files. This isolates context and prevents “rot”.

  • The Orchestrator-Worker Pattern: Use a smart, “slow thinking” model (like Opus) as the orchestrator and delegate specific tasks to faster models (like Sonnet or specialized coding models).

  • Sub-Agents for Parallelism: Break tasks into parallelizable chunks (e.g., “check these 5 files”) and spawn independent sub-agents for each. This isolates their contexts and speeds up execution.

  • Explicit Memory Files: “Continual learning” is effectively achieved by having the agent update specific files (e.g., project.md, skills.md, or a “diary”) after a session. This allows the agent to “remember” preferences and lessons by reading these files in future sessions.

  • Code Execution over Tool Calling: Instead of forcing models to choose from predefined tools (tool calling), this approach lets agents write and execute code directly to solve tasks. This works better because LLMs are natively trained on programming and scripting, so code is a more natural and expressive interface than fixed APIs. Giving agents a “computer” (shell + filesystem) allows flexible composition, control flow, and problem-solving without exploding tool definitions.

  • Like the “Jeepa” idea, this work analyzes agent trajectories to find repeatable action patterns, then stores them as reusable skills.

  • These skills are saved in a structured form (e.g., markdown) so the agent can recall and apply them in future tasks.
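Progressive disclosure with file-based skills can be sketched concretely. The directory layout and the name-matching rule below are illustrative; real agents typically shell out to the filesystem rather than using an in-process helper:

```python
import tempfile
from pathlib import Path

def load_relevant_skills(task: str, skills_dir: Path) -> str:
    """Read only the skill files whose name appears in the task,
    so unused SOPs never enter the context window."""
    picked = [p.read_text() for p in sorted(skills_dir.glob("*.md"))
              if p.stem.replace("_", " ") in task.lower()]
    return "\n\n".join(picked)

# Demo with a throwaway skills directory.
skills = Path(tempfile.mkdtemp())
(skills / "deploy_service.md").write_text("# Deploy service\n1. build 2. push")
(skills / "rotate_keys.md").write_text("# Rotate keys\n1. revoke 2. reissue")

ctx = load_relevant_skills("please deploy service to staging", skills)
# Only the deploy skill is loaded; the key-rotation SOP stays on disk.
```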

  1. Stop Generating “Flat” Plans: If your agent performs complex repetitive tasks (e.g., “Onboard Employee”), do not ask the LLM to generate the steps every time. Mine the steps from successful logs, convert them into a graph/process model, and force the agent to execute that graph.

  2. Implement Hybrid Skill Retrieval: Do not rely solely on vector databases (semantic search) to pick tools. Implement a two-stage retrieval:

    ◦ Stage 1: Semantic search to narrow the field.

    ◦ Stage 2: Structural/logic check (conformance checking) to ensure the tool actually fits the workflow.

  3. Optimize for Parallelism: Use process mining tools (like PM4Py) to analyze your agent’s history. Identify steps that usually happen sequentially but could happen in parallel to reduce user wait time.
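The two-stage retrieval can be sketched as follows. A word-overlap score stands in for embedding similarity, and the "conformance check" is reduced to verifying that a tool's declared inputs are available at this point in the workflow; both simplifications are mine:

```python
def similarity(query: str, text: str) -> float:
    """Jaccard word overlap, standing in for embedding similarity."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t)

def retrieve(query, tools, available_inputs, k=3):
    # Stage 1: semantic narrowing to the top-k candidates.
    ranked = sorted(tools, key=lambda t: similarity(query, t["desc"]),
                    reverse=True)[:k]
    # Stage 2: structural check against the current workflow state.
    return [t for t in ranked if set(t["requires"]) <= available_inputs]

tools = [
    {"name": "send_invoice", "desc": "send an invoice email",
     "requires": {"email"}},
    {"name": "charge_card", "desc": "charge the card on file",
     "requires": {"card_token"}},
]
fit = retrieve("send the customer an invoice", tools,
               available_inputs={"email"})
# charge_card survives stage 1 but fails stage 2: no card_token is available.
```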

1. Offload Context

  • Store tool results externally: Save full tool results to the filesystem (not in context) and access on demand with utilities like glob and grep
  • Push actions to the sandbox: Use a small set of function calls (Bash, filesystem access) that can execute many utilities in the sandbox rather than binding every utility as a tool
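A minimal sketch of external tool-result storage: each result goes to its own file, only the path enters the context, and the agent searches results on demand (the in-process `grep` below stands in for shelling out to the real utility):

```python
import re, tempfile
from pathlib import Path

workdir = Path(tempfile.mkdtemp())  # the agent's scratch filesystem

def save_result(name: str, text: str) -> str:
    """Persist a full tool result; only the pointer enters the context."""
    path = workdir / f"{name}.txt"
    path.write_text(text)
    return str(path)

def grep(pattern: str) -> list:
    """Search all saved results, like running `grep -r` in the sandbox."""
    hits = []
    for p in sorted(workdir.glob("*.txt")):
        hits += [f"{p.name}: {line}" for line in p.read_text().splitlines()
                 if re.search(pattern, line)]
    return hits

save_result("search_001", "title: pricing page\nstatus: 200\nlatency: 90ms")
hits = grep(r"status")  # pulls back only the matching lines
```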

2. Reduce Context

  • Compact stale results: Replace older tool results with references (e.g., file paths) as context fills; keep recent results in full to guide the next decision
  • Summarize when needed: Once compaction reaches diminishing returns, apply schema-based summarization to the full trajectory
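Compaction is mechanical enough to sketch directly: walk the message history and swap older tool results for their file references, keeping the most recent ones verbatim. The message schema here is illustrative:

```python
def compact(history, keep_recent=2):
    """Replace tool results older than the last `keep_recent` with
    file references; recent results stay in full to guide the next step."""
    cutoff = len(history) - keep_recent
    out = []
    for i, msg in enumerate(history):
        if msg["role"] == "tool" and i < cutoff:
            out.append({"role": "tool", "content": f"[see {msg['path']}]"})
        else:
            out.append(msg)
    return out

history = [
    {"role": "tool", "content": "A" * 1000, "path": "r1.txt"},
    {"role": "tool", "content": "B" * 1000, "path": "r2.txt"},
    {"role": "tool", "content": "C" * 1000, "path": "r3.txt"},
]
compacted = compact(history, keep_recent=1)
# The two older results shrink to pointers; the latest stays in full.
```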

3. Isolate Context

  • Use sub-agents for discrete tasks: Assign tasks to sub-agents with their own context windows, primarily to isolate context (not to divide labor by role)
  • Share context deliberately: Pass only instructions for simple tasks; pass full context (e.g., trajectory and shared filesystem) for complex tasks where sub-agents need more context

“Before writing the solution, analyze where you usually make mistakes on tasks like this.
Then propose tools or scripts that could help you avoid those mistakes.”

The author let the model diagnose its own weaknesses, build real tools to address them, use those tools in practice, and reflect on the results — and still observed no persistent change in behavior across tasks.

PROMPT OPTIMIZATION

Instead of treating the prompt as a static “setup,” the speaker suggests treating it as a dynamic memory bank. The use of “English feedback” allows the system to understand why it failed, rather than just knowing that it failed. This mimics human learning (e.g., taking a mental note to handle a situation differently next time) much more closely than mathematical gradient descent does.

Part 3: The Methodology (The Meta-Prompt Loop)

Summary: The speaker outlines the specific workflow for implementing System Prompt Learning:

  1. Inputs: You take the original data (inputs and outputs) and the original system prompt.

  2. Explanations: You include annotations or explanations of why a previous output was correct or incorrect.

  3. The Meta-Prompt: All of this is fed into a “meta-prompt” (a prompt designed to write prompts).

  4. Output: The result is a new, updated system prompt containing new rules or guidelines.

Analysis: This approach automates prompt optimization. Rather than a human engineer manually tweaking the prompt based on intuition, an LLM (driven by the meta-prompt) analyzes the failure cases and the natural-language feedback and inserts the necessary behavioral corrections into the system instructions.
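The four-step loop can be sketched as a single update function. Here `call_llm` is a placeholder returning a canned rule; in practice it would be a real model call fed with the meta-prompt:

```python
def call_llm(prompt: str) -> str:
    """Stand-in for an LLM API call; returns a canned rule for the demo."""
    return "Rule: always state amounts in the user's local currency."

def update_system_prompt(system_prompt, examples):
    """One iteration of the meta-prompt loop: inputs/outputs plus
    explanations go into a prompt-writing prompt; out comes a new rule."""
    failures = [e for e in examples if not e["correct"]]
    meta_prompt = (
        "You write system prompts. Current prompt:\n"
        f"{system_prompt}\n\nFailures and why they were wrong:\n"
        + "\n".join(f"- {e['output']} :: {e['explanation']}" for e in failures)
        + "\n\nAdd a rule that prevents these failures."
    )
    return system_prompt + "\n" + call_llm(meta_prompt)

examples = [
    {"output": "That costs 12", "correct": False,
     "explanation": "No currency given; user is in the UK."},
]
new_prompt = update_system_prompt("You are a billing assistant.", examples)
# The prompt grows by one learned rule per iteration of the loop.
```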

Resources