Context Compaction in Claude Code: A Five-Layer Cascade and the Art of Free Summaries
A 200K context window sounds generous — until you're in a moderately complex coding session. Read a few dozen files, run several rounds of grep, execute some bash commands, and you've already burned through most of it. Compaction is inevitable, but compaction itself costs money: you need an LLM call to generate a summary, and the input to that call is the very context you're trying to compress. This creates a fascinating engineering trade-off: compact too early and you lose useful information; compact too late and the window overflows; and the cost of compaction itself can't be ignored. Claude Code's answer is a multi-layer cascade: avoid compaction if you can, do it cheaply if you must, and only call the LLM as a last resort.
Layer 0: Persisting Large Results to Disk
Before any compaction happens, Claude Code controls the volume of data entering the context at the source. When a tool returns a result exceeding a threshold (50K characters by default), the full result is written to a file on disk, and only a ~2KB preview plus the file path is kept in the context. The threshold can be overridden per tool via remote configuration, but there's always a global fallback. Writes use O_CREAT|O_EXCL mode — since each tool_use_id is unique, the same result is never written twice, which avoids duplicate writes when microcompact replays messages. The pseudocode looks roughly like this:
```python
# Sketch of the persistence path described above (names illustrative)
def maybe_persist_large_result(tool_result, tool_name, threshold):
    # Per-tool override via remote config; 50K-char global fallback.
    # Read sets its threshold to Infinity and is effectively exempt.
    limit = per_tool_thresholds.get(tool_name, threshold)
    if len(tool_result.content) <= limit:
        return tool_result
    path = results_dir / f"{tool_result.tool_use_id}.txt"
    try:
        # O_CREAT|O_EXCL: tool_use_id is unique, so microcompact replays
        # can never write the same result twice
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
        with os.fdopen(fd, "w") as f:
            f.write(tool_result.content)
    except FileExistsError:
        pass  # already persisted during an earlier replay
    preview = tool_result.content[:2048]  # ~2KB preview stays in context
    return tool_result.with_content(
        f"{preview}\n[Full {len(tool_result.content)}-char result saved to {path}]")
```
One detail worth noting: the Read tool explicitly sets its threshold to Infinity, exempting itself from the persistence mechanism. The reason is straightforward — persisting Read's output to a file and then having the model Read that file creates a circular dependency. Read controls its own output size via a maxTokens parameter, so it doesn't need external management.
A similar idea shows up as an aggregate budget at the message level: the total size of all tool_result blocks within a single user message is capped at 200K characters. This prevents N parallel tools from each returning 40K and combining into a 400K monster. The aggregate budget has a carefully designed state management scheme: once a tool_result has been "seen" (whether or not it was replaced), its fate is frozen. Results that weren't replaced before will never be replaced later, because doing so would alter an already-cached prompt prefix. Results that were replaced use the exact same replacement text every time (read from a cached Map, zero I/O), guaranteeing byte-level consistency.
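The frozen-fate bookkeeping can be sketched as follows — a minimal model, with `enforce_aggregate_budget`, the `seen` map, and the replacement text all assumed names rather than the actual implementation:

```python
AGGREGATE_BUDGET = 200_000  # max total chars of tool_results per user message
REPLACEMENT = "[Result persisted to {path}; omitted to stay under message budget]"

# Frozen state: once a result id has been seen, its fate never changes,
# so an already-cached prompt prefix is reproduced byte-for-byte.
seen = {}  # tool_use_id -> replacement text, or None if kept as-is

def enforce_aggregate_budget(results):
    total, out = 0, []
    for r in results:
        rid, content = r["id"], r["content"]
        if rid not in seen:  # first sighting: decide and freeze
            if total + len(content) > AGGREGATE_BUDGET:
                seen[rid] = REPLACEMENT.format(path=r["path"])
            else:
                seen[rid] = None  # kept; will never be replaced later
        text = content if seen[rid] is None else seen[rid]
        out.append(text)
        total += len(text)
    return out
```

Because the decision is cached per `tool_use_id`, re-running the function over the same message yields byte-identical output with zero I/O, which is what keeps the cached prompt prefix stable.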
Layer 1: Cached Microcompact
This is the most elegant layer (its mechanics are inferred from observed behavior rather than documented). Claude Code leverages the Anthropic API's cache_edits capability to delete old tool results directly from the server-side cache without invalidating the cached prefix. Local messages remain completely untouched.
```
API Server Cache   [prefix][ old tool results ][recent messages]
                            └── deleted via cache_edits (prefix stays valid)
Local Messages     [prefix][ old tool results ][recent messages]   (untouched)
```
Only results from specific tools are eligible for cleanup: Bash, Read, Grep, Glob, WebFetch, WebSearch, FileEdit, and FileWrite. The system maintains a time-ordered list of tool call IDs. When the count exceeds a threshold, it keeps the most recent N and deletes the rest via cache_edits. Because only the server-side cache is modified — local message content stays the same — this operation has virtually no impact on prompt cache hit rates. That's precisely why it exists.
There's an isolation concern worth mentioning: cached microcompact runs only on the main thread. If a forked agent (such as session_memory or prompt_suggestion) were to register tool_results in the shared global state, the main thread would try to delete tools that don't exist in its own conversation, causing confusion. After a successful deletion, it also notifies the prompt cache break detector that an upcoming drop in cache reads is expected and shouldn't trigger an alert.
```python
# Sketch of the mechanism described above; runs on the main thread only
CLEANABLE_TOOLS = {"Bash", "Read", "Grep", "Glob", "WebFetch",
                   "WebSearch", "FileEdit", "FileWrite"}

def cached_microcompact(messages, state, config):
    if state.is_forked_agent:  # session_memory / prompt_suggestion forks skip
        return None
    eligible = [t for t in state.tool_calls_in_order  # time-ordered IDs
                if t.tool_name in CLEANABLE_TOOLS]
    if len(eligible) <= config.trigger_count:
        return None
    stale = eligible[:-config.keep_most_recent]
    edits = cache_edits_for(stale)  # server-side deletion; local messages untouched
    # An upcoming drop in cache reads is expected — don't flag it as a break
    state.cache_break_detector.expect_drop()
    return edits
```
Layer 2: Time-Based Microcompact
When a user steps away for more than 60 minutes and comes back, the server-side prompt cache has almost certainly expired (Anthropic's cache TTL is one hour). Since the cache is cold and the entire prefix needs to be rewritten anyway, this is a good opportunity to clear out old tool results and reduce the amount of data being rewritten.
Beyond tool result cleanup, there's also a separate thinking block
cleanup strategy at the API layer. Under normal conditions all thinking
blocks are preserved (they record the model's reasoning process and help
with subsequent response quality), but when an idle gap of over one hour
is detected, only the most recent turn's thinking is retained. This is
implemented via the API's clear_thinking_20251015 context
edit, which is independent of tool cleanup — the two strategies can be
combined.
```python
ONE_HOUR = 60 * 60  # Anthropic's prompt cache TTL

def time_based_microcompact(messages, query_source):
    if seconds_since_last_activity(messages) < ONE_HOUR:
        return messages  # cache may still be warm; leave it to Layer 1
    # Cache is cold — rewriting local content costs nothing extra
    messages = drop_old_tool_results(messages)      # same tool set as Layer 1
    messages = keep_only_latest_thinking(messages)  # clear_thinking_20251015
    reset_cached_microcompact_state()  # server-side entries are gone; stale
                                       # IDs would cause phantom deletions
    return messages
```
Unlike cached microcompact, this layer modifies local message content directly: since the cache is cold anyway, there is no prefix consistency left to protect. The two layers' trigger conditions are mutually exclusive — time-based runs first, and if it fires, cached microcompact is skipped, since there's no point issuing cache_edits against a cold cache. After firing, it also resets cached microcompact's global state: the previously registered tool IDs correspond to server-side cache entries that no longer exist, and keeping them around would only cause phantom deletions later.
Layer 3: Session Memory Compact
This is the most interesting design in the entire system.
Session Memory is a background process that
continuously maintains a structured markdown notes file throughout the
session. When compaction is needed, it simply uses these notes as the
summary — no additional LLM call required. The extraction process is
wrapped in sequential() to ensure no two extractions run
concurrently. The template is user-customizable: place your own template
at ~/.claude/session-memory/config/template.md, and the
extraction prompt can be overridden as well.
The notes file template looks like this:
```markdown
# Session Title

<!-- section names below are illustrative; the template is user-customizable -->
## Current State
## Key Decisions
## Files and Changes
## Next Steps
```
The background extraction process runs as a forked agent that shares the main session's prompt cache prefix. It's restricted to using only the Edit tool on the notes file — nothing else. Extraction is triggered by a dual threshold on token count and tool call count, with the additional requirement that the most recent assistant response contains no tool calls (extract during conversation lulls, not mid-workflow). An initialization threshold prevents premature extraction at the start of a session, and the update interval is dynamically tuned via remote configuration.
When calculating which messages to keep, there's an API invariant to maintain: tool_use and tool_result must appear in pairs. If the split point falls between a tool_use/tool_result pair, it needs to be extended backward to include the corresponding tool_use. Similarly, multiple assistant messages with the same message.id produced during streaming (thinking, tool_use, etc.) can't be split apart, or normalizeMessagesForAPI would lose thinking blocks during merging.
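The backward extension of the split point can be sketched like this — a simplified model where messages are flat dicts and `adjust_split` is an assumed name, not the real function:

```python
def adjust_split(messages, split):
    """Move `split` backward until it no longer separates a tool_result
    from its tool_use, nor streamed chunks sharing one message.id."""
    def breaks_invariant(i):
        if i <= 0 or i >= len(messages):
            return False
        prev, cur = messages[i - 1], messages[i]
        # tool_result must stay with its tool_use
        if cur["type"] == "tool_result" and prev["type"] == "tool_use":
            return True
        # chunks streamed under the same message.id must stay together
        if prev.get("message_id") is not None and \
           prev.get("message_id") == cur.get("message_id"):
            return True
        return False
    while breaks_invariant(split):
        split -= 1
    return split
```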
The compaction logic itself is quite refined — it doesn't discard all old messages, but retains a recent tail:
```python
# Sketch reconstructed from the description; RECENT_TAIL and helper
# names are illustrative
def session_memory_compact(messages, last_summarized_id):
    notes = read_notes_file()
    if is_empty_template(notes):
        raise FallBackToFullCompact("session too short for any extraction")
    if last_summarized_id is None:
        # Resumed session: no marker, but existing notes still work as summary
        split = len(messages) - RECENT_TAIL
    else:
        idx = index_of(messages, last_summarized_id)
        if idx is None:
            raise FallBackToFullCompact("marker missing from message list")
        # Compact only what the notes already cover, and keep a recent tail
        split = min(idx + 1, len(messages) - RECENT_TAIL)
    # Never split a tool_use/tool_result pair or chunks sharing a message.id
    split = extend_split_backward(messages, split)
    return [summary_message(notes)] + messages[split:]
```
There's a neat edge case here: in a resumed session (where the user reopens a previous session), lastSummarizedMessageId doesn't exist, but the notes file may already have content. In this case all messages are treated as candidates for compaction, but the existing notes are still used as the summary.
Each section also has a size cap (2000 tokens per section, 12000 tokens total). Anything exceeding the limit is truncated, with the full notes file path appended so the model can consult it if needed.
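The capping logic might look like this — a sketch in which the function name and the chars/4 token approximation are assumptions, not the actual implementation:

```python
PER_SECTION_CAP = 2_000   # tokens per section
TOTAL_CAP = 12_000        # tokens across all sections

def truncate_notes(sections, notes_path, count_tokens=lambda s: len(s) // 4):
    """Cap each section, then the running total; append the notes file
    path whenever anything was cut so the model can consult the full file."""
    out, total, truncated = [], 0, False
    for title, body in sections:
        if count_tokens(body) > PER_SECTION_CAP:
            body = body[:PER_SECTION_CAP * 4]  # crude chars-per-token cut
            truncated = True
        if total + count_tokens(body) > TOTAL_CAP:
            truncated = True
            break
        out.append((title, body))
        total += count_tokens(body)
    text = "\n\n".join(f"## {t}\n{b}" for t, b in out)
    if truncated:
        text += f"\n\n[Truncated — full notes at {notes_path}]"
    return text
```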
Layer 4: Full Compact
When Session Memory is unavailable or post-compaction token count still exceeds the threshold, the system falls back to a full LLM summary. This is the most expensive but most thorough form of compaction. Session Memory Compact can fail for several reasons: the feature flag isn't enabled, the notes file is still an empty template (the session was too short for any extraction), the last-extracted message ID can't be found in the current message list (messages were modified), or the post-compaction token count still exceeds the auto-compact threshold (notes plus retained messages are too large).
Full Compact uses a forked agent, and the core
purpose of this design is to share the main session's prompt cache
prefix. The fork inherits the main session's system prompt, tools, and
complete message history as context, appending only a compaction
instruction at the end. This way, most of the API call's input hits the
existing cache — you only pay for the compaction instruction and the
output. A key constraint is that the fork cannot set maxOutputTokens,
because this parameter affects budget_tokens in the thinking config via
Math.min(budget, maxOutputTokens-1), and the thinking
config is part of the cache key — any mismatch invalidates the cache. If
the forked agent path fails (e.g., no text is returned), the system
falls back to a regular streaming path, where maxOutputTokens can safely
be set since cache sharing is no longer a concern.
The summary prompt uses a two-phase structure. The first phase is an
<analysis> thinking block that has the model trace
the conversation timeline; the second phase is the
<summary> proper, divided into 9 sections: Primary
Request, Key Technical Concepts, Files and Code Sections, Errors and
Fixes, Problem Solving, All User Messages, Pending Tasks, Current Work,
Optional Next Step. Crucially, the analysis block is stripped out after
the summary is extracted — it serves purely as a scratchpad to improve
summary quality:
```python
import re

def format_compact_summary(raw_summary):
    # The <analysis> block is a scratchpad only — strip it, keep <summary>
    without_analysis = re.sub(r"<analysis>.*?</analysis>", "", raw_summary,
                              flags=re.DOTALL)
    m = re.search(r"<summary>(.*?)</summary>", without_analysis, re.DOTALL)
    return (m.group(1) if m else without_analysis).strip()
```
Messages sent to the compaction API also undergo preprocessing: image
blocks are replaced with [image] text markers, and document
blocks with [document]. Images don't help generate text
summaries, and in CCD (Claude Code Desktop) sessions where users
frequently paste screenshots, leaving them in could make the compaction
request itself exceed length limits. Previously compacted skill
discovery/listing attachments are also filtered out, since they'll be
re-injected after compaction.
The compaction prompt opens with a very emphatic NO TOOLS declaration:
```
CRITICAL: Respond with TEXT ONLY. Do NOT call any tools.
```
This isn't paranoia. Because the fork inherits the main session's full tool set (for cache key matching), the model sometimes can't resist calling tools. Under the maxTurns: 1 constraint, a single rejected tool call means no text output, and the entire compaction fails. From observed behavior, this is especially problematic on Sonnet 4.6, with a failure rate of about 2.79%, compared to just 0.01% on 4.5. Hence the tool prohibition appears at both the beginning and end of the prompt, bookend-style.
The compaction request itself can also trigger a prompt-too-long error — after all, you're sending a context that's about to overflow to the API for summarization, and the compaction instructions add even more. The fix is to group messages by API round and discard from the oldest round forward until enough tokens are freed. Up to 3 retries are allowed, and discarded messages are not restored. If even discarding isn't enough, an error is raised. It's a lossy escape hatch, but better than getting stuck.
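The retry loop reduces to a few lines — a sketch where `PromptTooLongError` and the callable `send` stand in for the actual transport:

```python
MAX_RETRIES = 3

class PromptTooLongError(Exception):
    pass

def send_compaction_request(rounds, send):
    """`rounds` is the conversation grouped by API round, oldest first.
    On prompt-too-long, drop the oldest round and retry (lossy, up to 3x);
    discarded rounds are never restored."""
    for attempt in range(MAX_RETRIES + 1):
        try:
            return send(rounds)
        except PromptTooLongError:
            if attempt == MAX_RETRIES or len(rounds) <= 1:
                raise  # even discarding wasn't enough
            rounds = rounds[1:]
```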
The Cascade Decision Flow
With each layer understood, let's see how they fit together. autoCompactIfNeeded is the entry point for automatic compaction, called after each query loop iteration. Its design embodies an important principle: forked agents for session memory and compaction must not trigger recursive compaction (when querySource is session_memory or compact, it returns immediately), or you get a deadlock.
```python
# Sketch of the entry point described above (helper names illustrative)
def auto_compact_if_needed(messages, context):
    # Forked agents must never trigger recursive compaction
    if context.query_source in ("session_memory", "compact"):
        return messages
    # Circuit breaker: give up after 3 consecutive failures
    if context.consecutive_compact_failures >= 3:
        return messages
    threshold = context.window - context.max_output_tokens - 13_000
    if count_tokens(messages) < threshold:
        return messages
    try:
        result = session_memory_compact(messages, context.last_summarized_id)
        if count_tokens(result) >= threshold:
            raise FallBackToFullCompact("notes + retained tail still too large")
    except FallBackToFullCompact:
        result = full_compact(messages, context)  # Layer 4: LLM summary
    context.consecutive_compact_failures = 0  # a single success resets
    return result
```
The auto-compaction threshold is calculated as
context_window - max_output_tokens - 13K buffer. For a 200K
window, this triggers at roughly 167K. The manual /compact
command follows a similar path but isn't subject to the circuit breaker
and supports custom compaction instructions.
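The threshold arithmetic is simple enough to state as code; the 20K output reservation below is an assumption back-derived from the ~167K figure, not a documented constant:

```python
def auto_compact_threshold(context_window, max_output_tokens, buffer=13_000):
    # threshold = window size minus reserved output minus safety buffer
    return context_window - max_output_tokens - buffer
```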
The circuit breaker was introduced based on concrete data: 1,279 sessions had experienced 50+ consecutive compaction failures (with a maximum of 3,272), collectively wasting approximately 250,000 API calls per day. Setting a cap of 3 consecutive failures stops failed sessions from retrying pointlessly. A single success resets the counter, so normal intermittent failure recovery isn't affected.
The relationship between microcompact and autoCompact is also worth clarifying. Microcompact (Layers 1 and 2) runs before each API call as preprocessing; autoCompact (Layers 3 and 4) runs after each API call, checking whether more aggressive compaction is needed. Their responsibilities are complementary: microcompact handles lightweight incremental cleanup, while autoCompact handles heavyweight global compaction.
The compaction process itself needs keepalive mechanisms. A full compact API call can take 5–10 seconds or longer, during which no other messages flow through the WebSocket connection. For remote sessions, the server might disconnect due to idle timeout. The system sends a heartbeat every 30 seconds: both an HTTP PUT /worker heartbeat and a re-emission of the compacting status event to keep the SDK event stream alive. After compaction completes, the summary message includes the transcript file path, telling the model that if it needs precise details from before compaction (code snippets, error messages), it can go read the full transcript.
Post-Compaction Recovery
Compaction isn't just about deleting old messages. After a full compact, the model's context contains only a summary and a small number of retained messages. All previously read file states, loaded tool instructions, and plan file contents are gone. Without recovery, the model's first action would almost certainly be to re-Read files it was just looking at.
Recovery operations include: re-reading recently accessed files (up to 5, with a total token budget of 50K and a per-file cap of 5K tokens), so the model doesn't have to manually re-Read code it was just viewing. File selection is based on timestamp ordering from readFileState, and excludes plan files and CLAUDE.md (which are restored through other mechanisms). If the retained messages already contain a Read result for a given file, it's skipped to avoid wasting tokens on duplicate injection.
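The selection logic might be sketched as follows — function and field names are assumptions, and tokens are approximated as chars/4:

```python
MAX_FILES = 5
TOTAL_BUDGET = 50_000   # tokens across all restored files
PER_FILE_CAP = 5_000    # tokens per file

def pick_files_to_restore(read_file_state, excluded, already_in_tail,
                          token_count):
    """Choose up to 5 most recently read files under the budgets above,
    skipping plan files / CLAUDE.md (restored elsewhere) and files whose
    Read results already survive in the retained tail."""
    by_recency = sorted(read_file_state.items(),
                        key=lambda kv: kv[1]["timestamp"], reverse=True)
    picked, total = [], 0
    for path, info in by_recency:
        if path in excluded or path in already_in_tail:
            continue
        cost = min(token_count(info["content"]), PER_FILE_CAP)
        if len(picked) >= MAX_FILES or total + cost > TOTAL_BUDGET:
            break
        picked.append(path)
        total += cost
    return picked
```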
Other recovery steps: re-inject the plan file (if one exists), re-inject loaded skill content (sorted by most recently used, 5K tokens per skill, 25K total), re-inject delta attachments for deferred tools and MCP instructions, and execute SessionStart hooks to restore CLAUDE.md and other context. The truncation strategy for skills preserves the file header, since usage instructions are typically at the top.
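The keep-the-header truncation for skills can be sketched in a few lines, again with an assumed name and a chars/4 token approximation:

```python
def truncate_skill_content(text, cap_tokens=5_000,
                           token_count=lambda s: len(s) // 4):
    """Preserve the head of a skill file when it exceeds its budget —
    usage instructions typically sit at the top."""
    if token_count(text) <= cap_tokens:
        return text
    return text[:cap_tokens * 4].rstrip() + "\n[skill content truncated]"
```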
There's another easily overlooked detail: after compaction, session
metadata (custom titles, tags) is re-appended to ensure it falls within
the 16KB tail window. Otherwise, subsequent messages would push the
metadata out of the window, causing --resume to display an
auto-generated title instead of the user's custom name. Yet another
detail: the prompt cache break detector is notified to reset its
baseline after compaction, preventing the compaction-induced drop in
cache reads from being flagged as anomalous.
Design Philosophy
| Layer | Mechanism | Cost |
|---|---|---|
| 0 | Persist large results to disk | zero (local I/O only) |
| 1 | Cached microcompact | near-zero (server-side cache_edits) |
| 2 | Time-based microcompact | near-zero (cache already cold) |
| 3 | Session Memory compact | no extra LLM call (amortized in background) |
| 4 | Full compact | one full LLM summarization call |
The core idea behind the entire system can be summed up as "defer as long as possible, keep it as cheap as possible, escalate in stages." From zero-cost disk persistence, to near-zero-cost cache edits, to zero-LLM-cost Session Memory, and only then to a full LLM summary. Each layer tries to avoid invoking the next. An additional benefit of this layered strategy is fault tolerance: if one layer fails, the system naturally falls through to the next rather than erroring out.
Another notable design choice is the extreme deference to prompt cache. The entire raison d'être of Cached Microcompact is to avoid breaking the cached prefix. Full Compact shares the cache via a forked agent. Session Memory's background extraction also shares the cache via a fork. Time-Based Microcompact only modifies content after confirming the cache has expired. Behind almost every design decision lurks the question: "will this cause a cache miss?" In a world where tokens are billed by volume, prompt cache hit rates directly affect cost — and if a compaction system destroys the cache in order to save context, it's just robbing Peter to pay Paul.
Then there's a category of less glamorous but very practical
engineering decisions: handling empty tool results. Some tools (silent
shell commands, MCP servers returning content:[]) produce
empty output. This seems harmless, but on certain models, an empty
tool_result at the end of the prompt can cause the model to misinterpret
it as a conversation boundary and stop responding prematurely. The
system replaces all empty results with a
(toolName completed with no output) marker, giving the
model an unambiguous anchor.
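The normalization itself is tiny — a sketch treating tool results as plain dicts, with the function name assumed:

```python
def normalize_empty_result(tool_result, tool_name):
    """Replace empty tool output with an explicit marker so the model never
    sees a bare empty tool_result that could read as a conversation boundary."""
    content = tool_result.get("content")
    if not content or (isinstance(content, str) and not content.strip()):
        tool_result = dict(tool_result,
                           content=f"({tool_name} completed with no output)")
    return tool_result
```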
The Session Memory design is especially worth savoring. It essentially amortizes the expensive operation of understanding session content across the entire session lifecycle — performing small incremental extractions at regular intervals, so that when compaction is finally needed, the payoff is a free summary. It transforms summarization from "how do I compress 200K of context" into "how do I maintain a set of always-up-to-date notes." If this sounds like the LLM equivalent of incremental backups, that's more or less the idea.
Of course, beyond this elegant cascade sits a blunt reality: 200K tokens simply isn't enough for heavy coding sessions, and compaction always means information loss. All of this engineering effort has a single ultimate goal: to make "not enough" happen a little later, cost a little less, and bother the user a little less. In a sense, this entire system is a love letter to the finitude of the context window — and the more beautifully it's written, the more it proves the problem has no real solution.