Every coding agent has a dirty secret: it’s not one model. What looks like a single AI assistant from the outside is actually a system of specialized components - a reasoning model for planning, a fast model for tool calls, a search model for retrieval, and increasingly, a compaction model that summarizes the conversation history so the reasoning model doesn’t choke on a 200,000-token context window.
This chapter is about that last piece. Context compaction is the most underappreciated subagent in production AI systems, and it’s where the engineering gets interesting.
The problem: context windows are a lie
Modern LLMs advertise context windows of 128K, 200K, even a million tokens. But in practice, stuffing the full conversation history into every inference call creates three problems:
The first problem is latency. Transformer attention scales quadratically with sequence length - a 150K-token context window means the model spends most of its compute just attending to old conversation, not generating useful output. In production coding agents, this manifests as 2-3 minute waits for what should be a one-second response.
The second is cost. You pay per input token on every API call. In a 20-turn coding session, the context accumulates to 150K+ tokens - and you re-send all of it with every new message. At roughly $3 per million input tokens (the rate these figures imply), a 150K-token request costs about $0.45, and a session costs up to $9 if all 20 turns re-send a context that large.
The third is subtle but devastating: quality actually degrades with more context. The “lost in the middle” phenomenon is well-documented - models attend strongly to the beginning and end of their context window but lose track of information in the middle. Your agent forgets the architectural decision it made in turn 7 by the time it reaches turn 15.
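The cost figures in the second point are easy to check. The sketch below assumes a price of $3 per million input tokens (inferred from the $0.45 figure, not a quoted price) and compares the worst case with a hypothetical session whose context grows linearly; all names here are illustrative.

```typescript
// Assumed price, not a quoted one: $3 per million input tokens,
// the rate implied by $0.45 for a 150K-token request.
const ASSUMED_USD_PER_MILLION_INPUT_TOKENS = 3;

// Cost of sending `tokens` input tokens in a single request.
function inputCostUsd(tokens: number): number {
  return (tokens / 1e6) * ASSUMED_USD_PER_MILLION_INPUT_TOKENS;
}

// Total input cost of a session, given the context size sent on each turn.
function sessionCostUsd(contextTokensPerTurn: number[]): number {
  return contextTokensPerTurn.reduce((sum, t) => sum + inputCostUsd(t), 0);
}

const TURNS = 20;
const fullEveryTurn: number[] = [];
const linearGrowth: number[] = [];
for (let turn = 1; turn <= TURNS; turn++) {
  fullEveryTurn.push(150000);     // worst case: every turn re-sends 150K tokens
  linearGrowth.push(turn * 7500); // hypothetical: context reaches 150K on turn 20
}

const upperBoundUsd = sessionCostUsd(fullEveryTurn);  // about 9.00
const linearGrowthUsd = sessionCostUsd(linearGrowth); // about 4.73
```

Under these assumptions the $9 figure is an upper bound: a session that only reaches 150K tokens on its final turn pays roughly half that, which is still far more than sending a compacted context each time.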
The solution: a compaction subagent
Context compaction solves all three problems at once. Instead of passing the raw conversation history to the reasoning model, you first pass it through a fast, cheap compaction model that produces a condensed summary - preserving the key facts, decisions, and artifacts while dropping the noise.
The compaction model reads the full 150K-token history and produces an 8K-token summary that captures everything the reasoning model needs: which files were modified, what decisions were made, what errors were encountered, and what the current state of the task is. The reasoning model then operates on 8K tokens instead of 150K - faster, cheaper, and with better recall because the relevant information is dense and front-loaded.
In production, this pattern yields 82% latency reduction and 90% cost reduction while maintaining quality - because the compacted context is actually better than the raw history for downstream reasoning.
What makes a good compaction model
A compaction model needs three properties:
Fast. The whole point is to reduce latency. If compaction itself takes 30 seconds, you’ve lost most of the benefit. This is where diffusion-based LLMs become interesting - models like Mercury 2 generate at 1,000+ tokens per second, so even an 8K-token summary takes seconds rather than the 80 seconds an autoregressive model needs at 10ms per token.
Accurate on recall. The compaction must preserve facts faithfully. If the user decided in turn 4 to use PostgreSQL instead of MongoDB, that decision must survive compaction. This isn’t summarization - it’s selective information preservation.
Cheap. Once the history passes the compaction threshold, compaction runs on every turn. If the compaction model costs as much as the reasoning model, you’ve just moved the cost instead of reducing it. The ideal compaction model is 10-20x cheaper per token than the frontier reasoning model.
Evaluating compaction quality
How do you know if your compaction is actually preserving the information the reasoning model needs? The approach used in production:
- Sample 250 multi-turn agent trajectories from real coding sessions
- Generate 4 probe questions per trajectory, covering:
  - Factual recall - “What database was chosen and why?”
  - Decision tracking - “Why was approach A abandoned in favor of B?”
  - Artifact management - “Which files were modified and what changed?”
  - Logical continuation - “What should happen next given the current state?”
- Generate ground-truth answers from the full context
- Score the compacted context’s ability to answer the same questions on a 0–5 scale
A good compaction model scores within 0.3 points of the full context on all four categories. When the gap in any category is wider than that, the downstream reasoning model starts making decisions based on incomplete information.
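The pass/fail check described here can be sketched as follows. This is a minimal illustration, assuming each probe has already been graded twice on the 0–5 scale (once with the full context, once with the compacted one); the `ProbeResult` shape and function names are hypothetical, and the grading itself is out of scope.

```typescript
const CATEGORIES = [
  "factual_recall",
  "decision_tracking",
  "artifact_management",
  "logical_continuation",
] as const;
type ProbeCategory = (typeof CATEGORIES)[number];

interface ProbeResult {
  category: ProbeCategory;
  fullContextScore: number; // 0-5, answer produced from the full history
  compactedScore: number;   // 0-5, answer produced from the compacted context
}

// Mean (full - compacted) score gap for each category.
function meanGapByCategory(results: ProbeResult[]): Record<ProbeCategory, number> {
  const gaps = {} as Record<ProbeCategory, number>;
  for (const category of CATEGORIES) {
    const inCategory = results.filter((r) => r.category === category);
    const total = inCategory.reduce(
      (sum, r) => sum + (r.fullContextScore - r.compactedScore),
      0
    );
    // NaN marks a category with no probes, which fails the check below.
    gaps[category] = inCategory.length > 0 ? total / inCategory.length : NaN;
  }
  return gaps;
}

// Passes only if every category's mean gap is within the tolerance.
function passesCompactionEval(results: ProbeResult[], maxGap = 0.3): boolean {
  const gaps = meanGapByCategory(results);
  return CATEGORIES.every((c) => gaps[c] <= maxGap);
}
```

Checking each category separately matters: an average across all probes can hide a compaction model that recalls facts well but drops the reasons behind decisions.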
The multi-model production stack
Context compaction is just one example of a broader pattern: production AI agents are multi-model systems. Companies running coding agents in production typically deploy 7-10 different models, each optimized for a specific subtask:
| Subtask | Model Profile | Why |
|---|---|---|
| Planning/Reasoning | Frontier (Opus 4.6, GPT-5.5) | Needs deep reasoning, tolerates latency |
| Context compaction | Fast diffusion (Mercury 2) | Speed critical, runs every turn |
| Tool search | Small retrieval model | Sub-second response needed |
| Code completion | Specialized coder | High throughput, domain-specific |
| Validation | Cheap classifier | Binary pass/fail, runs frequently |
The key insight is that no single model optimizes all three axes of speed, quality, and cost. A frontier model is high-quality but slow and expensive. A small model is fast and cheap but low-quality. The engineering challenge is routing each subtask to the right point on that tradeoff surface.
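In code, the routing described above often starts as nothing more than a lookup table. The sketch below mirrors the table's rows; the `Subtask` names, the `ModelProfile` shape, and `routeSubtask` are illustrative assumptions, and a real router would map to concrete model endpoints.

```typescript
type Subtask =
  | "planning"
  | "compaction"
  | "tool_search"
  | "code_completion"
  | "validation";

interface ModelProfile {
  profile: string; // kind of model, taken from the table above
  reason: string;  // why this profile fits the subtask
}

const ROUTES: Record<Subtask, ModelProfile> = {
  planning: { profile: "frontier model", reason: "needs deep reasoning, tolerates latency" },
  compaction: { profile: "fast diffusion model", reason: "speed critical, runs every turn" },
  tool_search: { profile: "small retrieval model", reason: "sub-second response needed" },
  code_completion: { profile: "specialized coder", reason: "high throughput, domain-specific" },
  validation: { profile: "cheap classifier", reason: "binary pass/fail, runs frequently" },
};

// Look up which model profile should handle a subtask.
function routeSubtask(subtask: Subtask): ModelProfile {
  return ROUTES[subtask];
}
```

Typing the table as `Record<Subtask, ModelProfile>` means the compiler rejects a new subtask that has no route, which is a cheap way to keep the stack's routing decisions explicit.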
Diffusion LLMs: why they matter for this pattern
Traditional autoregressive LLMs generate one token at a time. Each token depends on the previous one: The → cat → sat → on. This creates a sequential bottleneck. If you need 8,000 output tokens at 10ms per token, that’s 80 seconds of wall-clock time.
Diffusion LLMs flip this entirely. Instead of generating tokens left-to-right, they start with a fully masked sequence and iteratively unmask it in parallel, refining all positions simultaneously across 8-20 denoising steps.
How masking-based corruption works
Image diffusion models (Stable Diffusion, DALL-E 2) corrupt images by adding Gaussian noise to continuous values (pixels or latents). But text is discrete - you can’t add noise to the word “cat.” So diffusion LLMs use masking instead of noise.
During training, the model sees complete sentences with random tokens replaced by [MASK]. It learns to predict what the masked tokens should be, given the surrounding unmasked context. The corruption rate varies - sometimes 20% of tokens are masked, sometimes 80%. This teaches the model to reconstruct text at every difficulty level.
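A minimal sketch of that corruption step, assuming each token is masked independently with probability equal to the sampled rate (real training recipes may choose positions differently). The seeded generator, `corrupt`, and `sampleMaskRate` are illustrative names rather than part of any particular library.

```typescript
const MASK = "[MASK]";

// Small seeded linear congruential generator so runs are reproducible.
function makeRng(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296; // value in [0, 1)
  };
}

// Pick a corruption rate for one training example, e.g. between 20% and 80%.
function sampleMaskRate(rand: () => number, min = 0.2, max = 0.8): number {
  return min + rand() * (max - min);
}

// Replace each token with [MASK] independently with probability `maskRate`.
function corrupt(tokens: string[], maskRate: number, rand: () => number): string[] {
  return tokens.map((tok) => (rand() < maskRate ? MASK : tok));
}

// One training pair: the corrupted input and the original tokens as targets.
const rng = makeRng(7);
const original = ["def", "add", "(", "a", ",", "b", ")", ":"];
const trainingInput = corrupt(original, sampleMaskRate(rng), rng);
```

The model is then trained to predict `original` at the positions where `trainingInput` holds `[MASK]`; varying the rate per example is what exposes it to both nearly complete and nearly empty sequences.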
During inference, generation works in reverse: the model starts from a fully masked sequence and fills it in over several passes.
The key insight: at each step, the model predicts all masked positions simultaneously. It then keeps the predictions it’s most confident about and re-masks the rest for another pass. Early steps resolve the easy, high-confidence tokens (function words, obvious syntax). Later steps handle the harder, context-dependent tokens.
This is why the step count adapts dynamically. Simple structured output like JSON needs only 8 steps because most tokens are predictable ({, "key", :, etc.). Complex reasoning or creative text needs 16-20 steps because more positions require iterative refinement.
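The unmasking loop can be sketched in a few lines. This is a toy illustration, not how any specific diffusion LLM is implemented: the `Predictor` interface is hypothetical, the step count is fixed rather than adaptive, and the "model" is a stub that returns fixed tokens and confidences.

```typescript
const MASK_TOKEN = "[MASK]";

interface Prediction {
  token: string;
  confidence: number; // assumed to lie in [0, 1]
}

// Hypothetical model interface: one prediction per position in the sequence.
type Predictor = (sequence: string[]) => Prediction[];

function diffusionDecode(length: number, steps: number, predict: Predictor): string[] {
  const sequence: string[] = [];
  for (let i = 0; i < length; i++) sequence.push(MASK_TOKEN);

  for (let step = 0; step < steps; step++) {
    const maskedPositions: number[] = [];
    sequence.forEach((tok, i) => {
      if (tok === MASK_TOKEN) maskedPositions.push(i);
    });
    if (maskedPositions.length === 0) break;

    // Predict every position in one pass, then rank masked ones by confidence.
    const predictions = predict(sequence);
    maskedPositions.sort((a, b) => predictions[b].confidence - predictions[a].confidence);

    // Commit the most confident share; the rest stay masked for the next pass.
    const stepsLeft = steps - step;
    const commitCount = Math.ceil(maskedPositions.length / stepsLeft);
    for (const i of maskedPositions.slice(0, commitCount)) {
      sequence[i] = predictions[i].token;
    }
  }
  return sequence;
}

// Toy stand-in for a model: fixed tokens with fixed confidences.
const TARGET = ["{", "\"key\"", ":", "1", "}"];
const CONFIDENCE = [0.99, 0.6, 0.95, 0.4, 0.98];
const toyPredictor: Predictor = () =>
  TARGET.map((token, i) => ({ token, confidence: CONFIDENCE[i] }));

const decoded = diffusionDecode(TARGET.length, 3, toyPredictor);
```

With three steps, the toy run commits the braces first, then the colon and key, and the low-confidence value last, which is the easy-tokens-first ordering described above.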
The result: 1,000+ tokens per second on standard GPU hardware. For context compaction, a 1,000-token summary adds roughly one second to the pipeline instead of the ten an autoregressive model would need at 10ms per token.
```text
# Autoregressive: sequential, one token at a time
# 8,000 tokens × 10ms/token = 80 seconds

# Diffusion: parallel, all tokens refined together
# 16 denoising steps × 60ms/step = ~1 second
```
Building a compaction subagent
If you’re building a multi-agent coding system, here’s the practical architecture for context compaction:
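```typescript
// Assumes countTokens, compactionModel, COMPACTION_THRESHOLD, COMPACTION_PROMPT,
// and formatTurnsForCompaction are defined elsewhere in the agent.
const RECENT_TURNS_KEPT = 3;

async function compactContext(
  turns: ConversationTurn[],
  currentTask: string
): Promise<CompactedContext> {
  const rawTokens = countTokens(turns);

  // Only compact if context exceeds threshold
  if (rawTokens < COMPACTION_THRESHOLD) {
    return { turns, wasCompacted: false };
  }

  // Keep the most recent turns verbatim; compact only the older history
  const cut = Math.max(0, turns.length - RECENT_TURNS_KEPT);
  const olderTurns = turns.slice(0, cut);
  const recentTurns = turns.slice(cut);

  // Use fast model for compaction, telling it what the user is working on
  const summary = await compactionModel.generate({
    system: `${COMPACTION_PROMPT}\n\nCurrent task: ${currentTask}`,
    messages: [
      {
        role: "user",
        content: formatTurnsForCompaction(olderTurns),
      },
    ],
    // Token budgets must be integers: 10% of the input, capped at 8,000
    maxTokens: Math.min(Math.floor(rawTokens * 0.1), 8000),
  });

  return {
    summary,
    recentTurns,
    wasCompacted: true,
    compressionRatio: rawTokens / countTokens(summary),
  };
}
```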
The key design decisions:
Keep the last N turns verbatim. The most recent 2-3 turns contain the immediate context the user is thinking about. Compacting these loses conversational continuity. Only compact the older history.
Set a compaction threshold. Don’t compact short conversations - the overhead isn’t worth it below 20-30K tokens. Above that, compact aggressively.
Target roughly 10:1 compression. A good compaction model can compress 150K tokens to 15K while preserving all four probe-tested recall categories. Compressing to less than about 5% of the original size (beyond 20:1) starts losing critical facts.
Include the current task in the compaction prompt. The compaction model should know what the user is currently trying to accomplish - this biases it toward preserving information relevant to the active task.
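The helpers these decisions imply might look like the sketch below. Everything in it is an assumption made for illustration: the `ConversationTurn` shape, the prompt wording, and the function bodies are not taken from any production system.

```typescript
// Hypothetical turn shape; a real agent will carry more fields.
interface ConversationTurn {
  role: "user" | "assistant" | "tool";
  content: string;
}

// Split history into the older part to compact and the recent part to keep.
function splitForCompaction(turns: ConversationTurn[], keep = 3) {
  const cut = Math.max(0, turns.length - keep);
  return { older: turns.slice(0, cut), recent: turns.slice(cut) };
}

// Flatten turns into numbered lines the compaction model can refer back to.
function formatTurnsForCompaction(turns: ConversationTurn[]): string {
  return turns
    .map((t, i) => `[turn ${i + 1}] ${t.role}: ${t.content}`)
    .join("\n");
}

// Build a system prompt that names what to preserve and the active task.
function buildCompactionPrompt(currentTask: string): string {
  return [
    "Condense the conversation history below for a coding agent.",
    "Preserve: files modified, decisions made and their reasons,",
    "errors encountered, and the current state of the task.",
    `Prioritize information relevant to the current task: ${currentTask}`,
    "Do not add facts that are not in the history.",
  ].join("\n");
}
```

Computing the cut index explicitly, rather than calling `turns.slice(-keep)`, avoids a JavaScript quirk: `slice(-0)` returns the whole array, so a `keep` of zero would otherwise keep every turn verbatim.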
The broader principle
Context compaction is an instance of a more general pattern in production AI: use fast, cheap models for utility work, and reserve expensive models for reasoning. The same principle applies to tool search (sub-second retrieval via small embedding models), code validation (binary pass/fail via classifiers), and routing (selecting which model handles a request).
The engineering lesson is that building with AI is less about picking the best model and more about designing the right system of models - each one doing what it’s best at, composed into a pipeline that’s fast, cheap, and high-quality simultaneously.
No single model can be all three. But a well-designed system of models can.