
Context Engineering

Context windows got bigger. Two million tokens, then ten million (e.g. Llama 4 Scout). The marketing said the problem was solved. Then your agent started forgetting things in turn fifty, hallucinating details that were never in the prompt, and degrading on tasks it nailed an hour ago.

The bottleneck moved from getting a bigger window to filling the one you have with the right tokens, in the right order, at the right time. Andrej Karpathy pinned a name on it in mid-2025: context engineering. The working definition: the strategies for curating and maintaining the optimal set of tokens the model sees during inference. Prompt engineering was about wording. Context engineering is about everything that lands in the model's view.

Prompts still matter

The prompt is a subset of context, and it's the part you most directly control. A bad prompt still wrecks everything downstream, no matter how clever the rest of your pipeline is.

Two ways prompts go wrong.

The first: cramming the prompt with hard rules that try to cover every case. The agent becomes brittle and falls apart the first time a user does something the rules didn't anticipate.

You are a customer support agent.
If user mentions billing AND tier=enterprise AND issue has "refund": template B.
If issue has "cancel": template C. If user is angry: escalate immediately.
If question is technical: route to engineering. If sentiment < 0.3: apologize first.
...

The second: writing the prompt so vaguely that the model has nothing specific to go on.

You are a helpful, accurate, and friendly assistant.

The version that works sits in between. Be specific about what the agent does, what tools it has, and how it should behave, but leave room for the model to handle situations you didn't think of. Splitting the prompt into clearly-marked sections (XML tags or Markdown headers) helps the model keep instructions, tool descriptions, and style notes from bleeding into each other.

<role>
You triage customer support issues for a B2B SaaS billing product.
</role>

<tools>
- search_kb(query): pull articles from the help center
- escalate(reason): hand off to a human agent
</tools>

<style>
Match the user's tone. Acknowledge their problem before answering.
Cite the article you used.
</style>

Few-shot examples

Few-shot prompting means putting a handful of example input-output pairs into the prompt before the real input. The model picks up the pattern and applies it. For tasks where the desired behavior is hard to describe but easy to demonstrate, examples beat instructions.

Classify the following emails as spam or not spam.

Email: "Win a free iPhone! Click the link now!"
Label: spam

Email: "Hi, can we move the meeting to 3pm?"
Label: not spam

Email: "URGENT: Your account has been compromised. Verify now."
Label: spam

Email: "Looped in @sarah on the proposal. Take a look when you get a sec."
Label:

The trap is over-stuffing. If you find yourself adding examples for every edge case you can think of, you're probably teaching the model the wrong thing. A small set of diverse, clean examples works better than twenty messy ones.
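If you're calling a chat API, the same examples slot in as alternating user/assistant messages rather than one block of text. Here's a rough sketch with the OpenAI Python client; the model name is a placeholder and the system line is just one way to phrase the task.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Each few-shot example becomes a user/assistant pair in the message list.
examples = [
    ("Win a free iPhone! Click the link now!", "spam"),
    ("Hi, can we move the meeting to 3pm?", "not spam"),
    ("URGENT: Your account has been compromised. Verify now.", "spam"),
]

messages = [{"role": "system",
             "content": "Classify each email as spam or not spam. Reply with the label only."}]
for email, label in examples:
    messages.append({"role": "user", "content": f'Email: "{email}"'})
    messages.append({"role": "assistant", "content": label})

# The real input goes last, in the same shape as the examples.
messages.append({"role": "user",
                 "content": 'Email: "Looped in @sarah on the proposal. Take a look when you get a sec."'})

response = client.chat.completions.create(model="gpt-4.1-mini", messages=messages)
print(response.choices[0].message.content)  # expected: not spam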

Why context isn't free

A million-token context window doesn't mean a million tokens of usable working memory. A 2025 study of 18 LLMs (including GPT-4.1, Claude 4, Gemini 2.5, and Qwen3) showed performance degrading as input grew, even on simple tasks like retrieval and word repetition. The 10,000th token is not handled like the 100th. For most current models, the "effective" window where quality stays high tops out around 256k tokens, well short of what the API advertises.

[Chart: Repeated Words task, score by input length. Chroma 2025 study: models replicate a sequence of repeated words with one unique word inserted; 50 is the random/refusal floor, 100 is perfect replication. Higher is better. Most models hit the floor by 8k tokens, well below their advertised context limit.]

More material in the context means the model's attention has to spread thinner across all of it. These models were also trained on shorter sequences than the ones you're now feeding them, so the longer the input, the further from familiar ground.

A token in the context window has a cost, even when you're not paying it in dollars.

Four buckets: write, select, compress, isolate

Lance Martin sorted the toolkit into four buckets. Most context-engineering work boils down to one of these.

Write. Save context outside the window. The agent dumps notes, intermediate state, or plans into a file or a state object during a session. Across sessions, it can remember facts about you and pull them back when relevant. Anything that doesn't need to live in active memory can sit on disk and be loaded back when the agent asks for it.
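A bare-bones sketch of the idea: the scratchpad file, tool names, and confirmation strings below are made up for illustration, not taken from any particular framework.

from pathlib import Path

SCRATCH = Path("agent_scratchpad.md")

def write_note(note: str) -> str:
    """Append a note to the on-disk scratchpad instead of keeping it in context."""
    with SCRATCH.open("a") as f:
        f.write(note.rstrip() + "\n")
    return f"noted ({SCRATCH})"  # only this short stub goes back into the window

def read_notes() -> str:
    """Load the scratchpad back in only when the agent asks for it."""
    return SCRATCH.read_text() if SCRATCH.exists() else "(no notes yet)"

# During a long session the agent writes a note after each milestone; the
# running context only ever holds the short confirmation strings.
write_note("User is on the enterprise tier; billing region is EU.")
write_note("Refunds over $500 need manager approval (KB-1042).")
print(read_notes())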

Select. Pull only what's needed into the window. Either you always load it (project rules, coding conventions, persona) or you query for it on demand (search a knowledge base or codebase, pull back the chunks that match). Embedding search alone often falls short in practice; production setups combine semantic search with keyword search and a re-ranking step. The same idea applies to tools. If your agent has dozens, retrieving the relevant subset per turn beats showing all of them every time.
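A toy sketch of per-turn tool selection: the word-overlap scoring below stands in for the embedding-plus-keyword retrieval and re-ranking you'd use for real, and the tool set is invented.

# Score each tool description against the user's message by word overlap.
# Production setups swap this for embedding search plus keyword (BM25)
# search and a re-ranking step; the flow stays the same.
TOOLS = {
    "search_kb": "search the help center knowledge base for billing and account articles",
    "escalate": "hand the conversation off to a human support agent",
    "issue_refund": "issue a refund to the customer's payment method",
    "run_sql": "run a read-only sql query against the analytics warehouse",
}

def select_tools(user_message: str, k: int = 2) -> list[str]:
    """Return the k tool names whose descriptions best match this turn."""
    words = set(user_message.lower().split())
    scored = sorted(
        TOOLS,
        key=lambda name: len(words & set(TOOLS[name].split())),
        reverse=True,
    )
    return scored[:k]

# Only the selected descriptions get injected into this turn's context;
# the other tools stay out of the window entirely.
print(select_tools("I was double charged, can I get a refund on my account?"))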

Memory is the long-term version of select, and worth a separate read: Memory in AI Agents covers how different memory types work, how they complement each other, and how to build your own.

Compress. Shrink the history. Sometimes you can do it without losing anything: replace a 500-line file dump in the chat history with a path stub like output saved to /src/main.py; the agent re-reads the file if it needs to. Other times you have to actually summarize old turns and accept some detail loss. Try the lossless version first.
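Here's a rough sketch of the lossless version; the history format, and the convention that tool results carry the path they came from, are assumptions rather than any framework's actual schema.

MAX_TOOL_OUTPUT_CHARS = 500

def compress_turn(turn: dict) -> dict:
    """Swap a long tool output for a stub pointing at the file on disk.
    Nothing is lost: the agent re-reads the file if it needs the content."""
    content = turn.get("content", "")
    if turn.get("role") == "tool" and turn.get("path") and len(content) > MAX_TOOL_OUTPUT_CHARS:
        return {**turn, "content": f"[output saved to {turn['path']}; re-read the file if needed]"}
    return turn

history = [
    {"role": "user", "content": "Refactor the retry logic in main.py"},
    {"role": "tool", "path": "/src/main.py", "content": "def main():\n    ..." * 400},
    {"role": "assistant", "content": "Done. Retries now use exponential backoff."},
]

compressed = [compress_turn(t) for t in history]
print(compressed[1]["content"])  # [output saved to /src/main.py; re-read the file if needed]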

Isolate. Split context across separate windows. Sub-agents are the canonical form: a lead agent dispatches sub-agents that explore independently and return short summaries from tens of thousands of tokens of work. The lead never sees the exploration tokens.
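A bare-bones sketch of that pattern using the OpenAI chat API; the model name, the five-bullet summary format, and the example tasks are arbitrary choices.

from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4.1-mini"  # placeholder; any chat model works here

def run_subagent(task: str) -> str:
    """Each sub-agent starts from a fresh window with only its own task.
    It can burn thousands of tokens exploring; only the summary comes back."""
    messages = [
        {"role": "system",
         "content": "Work through the task and reply with a summary of at most 5 bullet points."},
        {"role": "user", "content": task},
    ]
    reply = client.chat.completions.create(model=MODEL, messages=messages)
    return reply.choices[0].message.content

# The lead agent's context only ever holds the summaries,
# never the sub-agents' exploration tokens.
lead_context = []
for task in ["Survey how competitors price usage-based billing",
             "List the refund rules in our current terms of service"]:
    lead_context.append({"role": "user",
                         "content": f"Sub-agent report on '{task}':\n{run_subagent(task)}"})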

Isolation is expensive too: each sub-agent pays for its own setup and exploration tokens, and anything that doesn't make it into the summary is lost to the lead agent.

[Diagram: Four buckets on the same starting context.
01 Write: move context out of the window onto disk so it can be loaded back later.
02 Select: pull only the items that matter for this turn; leave the rest behind.
03 Compress: shrink what is already in the window; same items, smaller footprint.
04 Isolate: split the work across separate windows; each one stays small.
Same five-block window, four different ways to make room.]