Context Compaction in Claude Code: A Five-Layer Cascade and the Art of Free Summaries
A 200K context window sounds generous — until you're in a moderately complex coding session. Read a few dozen files, run several rounds of grep, execute some bash commands, and you've already burned through most of it. Compaction is inevitable, but compaction itself costs money: you need an LLM call to generate a summary, and the input to that call is the very context you're trying to compress. This creates a fascinating engineering trade-off: compact too early and you lose useful information; compact too late and the window overflows; and the cost of compaction itself can't be ignored. Claude Code's answer is a multi-layer cascade: avoid compaction if you can, do it cheaply if you must, and only call the LLM as a last resort.
Layer 0: Persisting Large Results to Disk
Before any compaction happens, Claude Code controls the volume of data entering the context at the source. When a tool returns a result exceeding a threshold (50K characters by default), the full result is written to a file on disk, and only a ~2KB preview plus the file path is kept in the context. The threshold can be overridden per tool via remote configuration, but there's always a global fallback. Writes use O_CREAT|O_EXCL mode — since each tool_use_id is unique, the same result is never written twice, which avoids duplicate writes when microcompact replays messages. The pseudocode looks roughly like this:
```python
# Sketch of the persistence path described above (names illustrative)
def maybe_persist_large_result(tool_result, tool_name, threshold):
    # Per-tool override via remote config; 50K-char global fallback.
    # Read sets its threshold to Infinity and is effectively exempt.
    limit = per_tool_thresholds.get(tool_name, threshold)
    if len(tool_result.content) <= limit:
        return tool_result
    path = results_dir / f"{tool_result.tool_use_id}.txt"
    try:
        # O_CREAT|O_EXCL: tool_use_id is unique, so microcompact replays
        # can never write the same result twice
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
        with os.fdopen(fd, "w") as f:
            f.write(tool_result.content)
    except FileExistsError:
        pass  # already persisted during an earlier replay
    preview = tool_result.content[:2048]  # ~2KB preview stays in context
    return tool_result.with_content(
        f"{preview}\n[Full {len(tool_result.content)}-char result saved to {path}]")
```
One detail worth noting: the Read tool explicitly sets its threshold to Infinity, exempting itself from the persistence mechanism. The reason is straightforward — persisting Read's output to a file and then having the model Read that file creates a circular dependency. Read controls its own output size via a maxTokens parameter, so it doesn't need external management.
A similar idea shows up as an aggregate budget at the message level: the total size of all tool_result blocks within a single user message is capped at 200K characters. This prevents N parallel tools from each returning 40K and combining into a 400K monster. The aggregate budget has a carefully designed state management scheme: once a tool_result has been "seen" (whether or not it was replaced), its fate is frozen. Results that weren't replaced before will never be replaced later, because doing so would alter an already-cached prompt prefix. Results that were replaced use the exact same replacement text every time (read from a cached Map, zero I/O), guaranteeing byte-level consistency.
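The frozen-fate bookkeeping can be sketched as follows — a minimal model, with `enforce_aggregate_budget`, the `seen` map, and the replacement text all assumed names rather than the actual implementation:

```python
AGGREGATE_BUDGET = 200_000  # max total chars of tool_results per user message
REPLACEMENT = "[Result persisted to {path}; omitted to stay under message budget]"

# Frozen state: once a result id has been seen, its fate never changes,
# so an already-cached prompt prefix is reproduced byte-for-byte.
seen = {}  # tool_use_id -> replacement text, or None if kept as-is

def enforce_aggregate_budget(results):
    total, out = 0, []
    for r in results:
        rid, content = r["id"], r["content"]
        if rid not in seen:  # first sighting: decide and freeze
            if total + len(content) > AGGREGATE_BUDGET:
                seen[rid] = REPLACEMENT.format(path=r["path"])
            else:
                seen[rid] = None  # kept; will never be replaced later
        text = content if seen[rid] is None else seen[rid]
        out.append(text)
        total += len(text)
    return out
```

Because the decision is cached per `tool_use_id`, re-running the function over the same message yields byte-identical output with zero I/O, which is what keeps the cached prompt prefix stable.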
Layer 1: Cached Microcompact
This is the most elegant layer (its mechanics are inferred from observed behavior rather than documented). Claude Code leverages the Anthropic API's cache_edits capability to delete old tool results directly from the server-side cache without invalidating the cached prefix. Local messages remain completely untouched.
```
API Server Cache   [prefix][ old tool results ][recent messages]
                            └── deleted via cache_edits (prefix stays valid)
Local Messages     [prefix][ old tool results ][recent messages]   (untouched)
```
Only results from specific tools are eligible for cleanup: Bash, Read, Grep, Glob, WebFetch, WebSearch, FileEdit, and FileWrite. The system maintains a time-ordered list of tool call IDs. When the count exceeds a threshold, it keeps the most recent N and deletes the rest via cache_edits. Because only the server-side cache is modified — local message content stays the same — this operation has virtually no impact on prompt cache hit rates. That's precisely why it exists.
There's an isolation concern worth mentioning: cached microcompact runs only on the main thread. If a forked agent (such as session_memory or prompt_suggestion) were to register tool_results in the shared global state, the main thread would try to delete tools that don't exist in its own conversation, causing confusion. After a successful deletion, it also notifies the prompt cache break detector that an upcoming drop in cache reads is expected and shouldn't trigger an alert.
```python
# Sketch of the mechanism described above; runs on the main thread only
CLEANABLE_TOOLS = {"Bash", "Read", "Grep", "Glob", "WebFetch",
                   "WebSearch", "FileEdit", "FileWrite"}

def cached_microcompact(messages, state, config):
    if state.is_forked_agent:  # session_memory / prompt_suggestion forks skip
        return None
    eligible = [t for t in state.tool_calls_in_order  # time-ordered IDs
                if t.tool_name in CLEANABLE_TOOLS]
    if len(eligible) <= config.trigger_count:
        return None
    stale = eligible[:-config.keep_most_recent]
    edits = cache_edits_for(stale)  # server-side deletion; local messages untouched
    # An upcoming drop in cache reads is expected — don't flag it as a break
    state.cache_break_detector.expect_drop()
    return edits
```
Layer 2: Time-Based Microcompact
When a user steps away for more than 60 minutes and comes back, the server-side prompt cache has almost certainly expired (Anthropic's cache TTL is one hour). Since the cache is cold and the entire prefix needs to be rewritten anyway, this is a good opportunity to clear out old tool results and reduce the amount of data being rewritten.
Beyond tool result cleanup, there's also a separate thinking block
cleanup strategy at the API layer. Under normal conditions all thinking
blocks are preserved (they record the model's reasoning process and help
with subsequent response quality), but when an idle gap of over one hour
is detected, only the most recent turn's thinking is retained. This is
implemented via the API's clear_thinking_20251015 context
edit, which is independent of tool cleanup — the two strategies can be
combined.
```python
ONE_HOUR = 60 * 60  # Anthropic's prompt cache TTL

def time_based_microcompact(messages, query_source):
    if seconds_since_last_activity(messages) < ONE_HOUR:
        return messages  # cache may still be warm; leave it to Layer 1
    # Cache is cold — rewriting local content costs nothing extra
    messages = drop_old_tool_results(messages)      # same tool set as Layer 1
    messages = keep_only_latest_thinking(messages)  # clear_thinking_20251015
    reset_cached_microcompact_state()  # server-side entries are gone; stale
                                       # IDs would cause phantom deletions
    return messages
```
Unlike cached microcompact, this layer modifies local message content directly: since the cache is cold anyway, there is no prefix consistency left to protect. The two layers' trigger conditions are mutually exclusive — time-based runs first, and if it fires, cached microcompact is skipped, since there's no point issuing cache_edits against a cold cache. After firing, it also resets cached microcompact's global state: the previously registered tool IDs correspond to server-side cache entries that no longer exist, and keeping them around would only cause phantom deletions later.
Layer 3: Session Memory Compact
This is the most interesting design in the entire system.
Session Memory is a background process that
continuously maintains a structured markdown notes file throughout the
session. When compaction is needed, it simply uses these notes as the
summary — no additional LLM call required. The extraction process is
wrapped in sequential() to ensure no two extractions run
concurrently. The template is user-customizable: place your own template
at ~/.claude/session-memory/config/template.md, and the
extraction prompt can be overridden as well.
The notes file template looks like this:
```markdown
# Session Title

<!-- section names below are illustrative; the template is user-customizable -->
## Current State
## Key Decisions
## Files and Changes
## Next Steps
```
The background extraction process runs as a forked agent that shares the main session's prompt cache prefix. It's restricted to using only the Edit tool on the notes file — nothing else. Extraction is triggered by a dual threshold on token count and tool call count, with the additional requirement that the most recent assistant response contains no tool calls (extract during conversation lulls, not mid-workflow). An initialization threshold prevents premature extraction at the start of a session, and the update interval is dynamically tuned via remote configuration.
When calculating which messages to keep, there's an API invariant to maintain: tool_use and tool_result must appear in pairs. If the split point falls between a tool_use/tool_result pair, it needs to be extended backward to include the corresponding tool_use. Similarly, multiple assistant messages with the same message.id produced during streaming (thinking, tool_use, etc.) can't be split apart, or normalizeMessagesForAPI would lose thinking blocks during merging.
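The backward extension of the split point can be sketched like this — a simplified model where messages are flat dicts and `adjust_split` is an assumed name, not the real function:

```python
def adjust_split(messages, split):
    """Move `split` backward until it no longer separates a tool_result
    from its tool_use, nor streamed chunks sharing one message.id."""
    def breaks_invariant(i):
        if i <= 0 or i >= len(messages):
            return False
        prev, cur = messages[i - 1], messages[i]
        # tool_result must stay with its tool_use
        if cur["type"] == "tool_result" and prev["type"] == "tool_use":
            return True
        # chunks streamed under the same message.id must stay together
        if prev.get("message_id") is not None and \
           prev.get("message_id") == cur.get("message_id"):
            return True
        return False
    while breaks_invariant(split):
        split -= 1
    return split
```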
The compaction logic itself is quite refined — it doesn't discard all old messages, but retains a recent tail:
```python
# Sketch reconstructed from the description; RECENT_TAIL and helper
# names are illustrative
def session_memory_compact(messages, last_summarized_id):
    notes = read_notes_file()
    if is_empty_template(notes):
        raise FallBackToFullCompact("session too short for any extraction")
    if last_summarized_id is None:
        # Resumed session: no marker, but existing notes still work as summary
        split = len(messages) - RECENT_TAIL
    else:
        idx = index_of(messages, last_summarized_id)
        if idx is None:
            raise FallBackToFullCompact("marker missing from message list")
        # Compact only what the notes already cover, and keep a recent tail
        split = min(idx + 1, len(messages) - RECENT_TAIL)
    # Never split a tool_use/tool_result pair or chunks sharing a message.id
    split = extend_split_backward(messages, split)
    return [summary_message(notes)] + messages[split:]
```
There's a neat edge case here: in a resumed session (where the user reopens a previous session), lastSummarizedMessageId doesn't exist, but the notes file may already have content. In this case all messages are treated as candidates for compaction, but the existing notes are still used as the summary.
Each section also has a size cap (2000 tokens per section, 12000 tokens total). Anything exceeding the limit is truncated, with the full notes file path appended so the model can consult it if needed.
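The capping logic might look like this — a sketch in which the function name and the chars/4 token approximation are assumptions, not the actual implementation:

```python
PER_SECTION_CAP = 2_000   # tokens per section
TOTAL_CAP = 12_000        # tokens across all sections

def truncate_notes(sections, notes_path, count_tokens=lambda s: len(s) // 4):
    """Cap each section, then the running total; append the notes file
    path whenever anything was cut so the model can consult the full file."""
    out, total, truncated = [], 0, False
    for title, body in sections:
        if count_tokens(body) > PER_SECTION_CAP:
            body = body[:PER_SECTION_CAP * 4]  # crude chars-per-token cut
            truncated = True
        if total + count_tokens(body) > TOTAL_CAP:
            truncated = True
            break
        out.append((title, body))
        total += count_tokens(body)
    text = "\n\n".join(f"## {t}\n{b}" for t, b in out)
    if truncated:
        text += f"\n\n[Truncated — full notes at {notes_path}]"
    return text
```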
Layer 4: Full Compact
When Session Memory is unavailable or post-compaction token count still exceeds the threshold, the system falls back to a full LLM summary. This is the most expensive but most thorough form of compaction. Session Memory Compact can fail for several reasons: the feature flag isn't enabled, the notes file is still an empty template (the session was too short for any extraction), the last-extracted message ID can't be found in the current message list (messages were modified), or the post-compaction token count still exceeds the auto-compact threshold (notes plus retained messages are too large).
Full Compact uses a forked agent, and the core
purpose of this design is to share the main session's prompt cache
prefix. The fork inherits the main session's system prompt, tools, and
complete message history as context, appending only a compaction
instruction at the end. This way, most of the API call's input hits the
existing cache — you only pay for the compaction instruction and the
output. A key constraint is that the fork cannot set maxOutputTokens,
because this parameter affects budget_tokens in the thinking config via
Math.min(budget, maxOutputTokens-1), and the thinking
config is part of the cache key — any mismatch invalidates the cache. If
the forked agent path fails (e.g., no text is returned), the system
falls back to a regular streaming path, where maxOutputTokens can safely
be set since cache sharing is no longer a concern.
The summary prompt uses a two-phase structure. The first phase is an
<analysis> thinking block that has the model trace
the conversation timeline; the second phase is the
<summary> proper, divided into 9 sections: Primary
Request, Key Technical Concepts, Files and Code Sections, Errors and
Fixes, Problem Solving, All User Messages, Pending Tasks, Current Work,
Optional Next Step. Crucially, the analysis block is stripped out after
the summary is extracted — it serves purely as a scratchpad to improve
summary quality:
```python
import re

def format_compact_summary(raw_summary):
    # The <analysis> block is a scratchpad only — strip it, keep <summary>
    without_analysis = re.sub(r"<analysis>.*?</analysis>", "", raw_summary,
                              flags=re.DOTALL)
    m = re.search(r"<summary>(.*?)</summary>", without_analysis, re.DOTALL)
    return (m.group(1) if m else without_analysis).strip()
```
Messages sent to the compaction API also undergo preprocessing: image
blocks are replaced with [image] text markers, and document
blocks with [document]. Images don't help generate text
summaries, and in CCD (Claude Code Desktop) sessions where users
frequently paste screenshots, leaving them in could make the compaction
request itself exceed length limits. Previously compacted skill
discovery/listing attachments are also filtered out, since they'll be
re-injected after compaction.
The compaction prompt opens with a very emphatic NO TOOLS declaration:
```
CRITICAL: Respond with TEXT ONLY. Do NOT call any tools.
```
This isn't paranoia. Because the fork inherits the main session's full tool set (for cache key matching), the model sometimes can't resist calling tools. Under the maxTurns: 1 constraint, a single rejected tool call means no text output, and the entire compaction fails. From observed behavior, this is especially problematic on Sonnet 4.6, with a failure rate of about 2.79%, compared to just 0.01% on 4.5. Hence the tool prohibition appears at both the beginning and end of the prompt, bookend-style.
The compaction request itself can also trigger a prompt-too-long error — after all, you're sending a context that's about to overflow to the API for summarization, and the compaction instructions add even more. The fix is to group messages by API round and discard from the oldest round forward until enough tokens are freed. Up to 3 retries are allowed, and discarded messages are not restored. If even discarding isn't enough, an error is raised. It's a lossy escape hatch, but better than getting stuck.
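The retry loop reduces to a few lines — a sketch where `PromptTooLongError` and the callable `send` stand in for the actual transport:

```python
MAX_RETRIES = 3

class PromptTooLongError(Exception):
    pass

def send_compaction_request(rounds, send):
    """`rounds` is the conversation grouped by API round, oldest first.
    On prompt-too-long, drop the oldest round and retry (lossy, up to 3x);
    discarded rounds are never restored."""
    for attempt in range(MAX_RETRIES + 1):
        try:
            return send(rounds)
        except PromptTooLongError:
            if attempt == MAX_RETRIES or len(rounds) <= 1:
                raise  # even discarding wasn't enough
            rounds = rounds[1:]
```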
The Cascade Decision Flow
With each layer understood, let's see how they fit together. autoCompactIfNeeded is the entry point for automatic compaction, called after each query loop iteration. Its design embodies an important principle: forked agents for session memory and compaction must not trigger recursive compaction (when querySource is session_memory or compact, it returns immediately), or you get a deadlock.
```python
# Sketch of the entry point described above (helper names illustrative)
def auto_compact_if_needed(messages, context):
    # Forked agents must never trigger recursive compaction
    if context.query_source in ("session_memory", "compact"):
        return messages
    # Circuit breaker: give up after 3 consecutive failures
    if context.consecutive_compact_failures >= 3:
        return messages
    threshold = context.window - context.max_output_tokens - 13_000
    if count_tokens(messages) < threshold:
        return messages
    try:
        result = session_memory_compact(messages, context.last_summarized_id)
        if count_tokens(result) >= threshold:
            raise FallBackToFullCompact("notes + retained tail still too large")
    except FallBackToFullCompact:
        result = full_compact(messages, context)  # Layer 4: LLM summary
    context.consecutive_compact_failures = 0  # a single success resets
    return result
```
The auto-compaction threshold is calculated as
context_window - max_output_tokens - 13K buffer. For a 200K
window, this triggers at roughly 167K. The manual /compact
command follows a similar path but isn't subject to the circuit breaker
and supports custom compaction instructions.
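The threshold arithmetic is simple enough to state as code; the 20K output reservation below is an assumption back-derived from the ~167K figure, not a documented constant:

```python
def auto_compact_threshold(context_window, max_output_tokens, buffer=13_000):
    # threshold = window size minus reserved output minus safety buffer
    return context_window - max_output_tokens - buffer
```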
The circuit breaker was introduced based on concrete data: 1,279 sessions had experienced 50+ consecutive compaction failures (with a maximum of 3,272), collectively wasting approximately 250,000 API calls per day. Setting a cap of 3 consecutive failures stops failed sessions from retrying pointlessly. A single success resets the counter, so normal intermittent failure recovery isn't affected.
The relationship between microcompact and autoCompact is also worth clarifying. Microcompact (Layers 1 and 2) runs before each API call as preprocessing; autoCompact (Layers 3 and 4) runs after each API call, checking whether more aggressive compaction is needed. Their responsibilities are complementary: microcompact handles lightweight incremental cleanup, while autoCompact handles heavyweight global compaction.
The compaction process itself needs keepalive mechanisms. A full compact API call can take 5–10 seconds or longer, during which no other messages flow through the WebSocket connection. For remote sessions, the server might disconnect due to idle timeout. The system sends a heartbeat every 30 seconds: both an HTTP PUT /worker heartbeat and a re-emission of the compacting status event to keep the SDK event stream alive. After compaction completes, the summary message includes the transcript file path, telling the model that if it needs precise details from before compaction (code snippets, error messages), it can go read the full transcript.
Post-Compaction Recovery
Compaction isn't just about deleting old messages. After a full compact, the model's context contains only a summary and a small number of retained messages. All previously read file states, loaded tool instructions, and plan file contents are gone. Without recovery, the model's first action would almost certainly be to re-Read files it was just looking at.
Recovery operations include: re-reading recently accessed files (up to 5, with a total token budget of 50K and a per-file cap of 5K tokens), so the model doesn't have to manually re-Read code it was just viewing. File selection is based on timestamp ordering from readFileState, and excludes plan files and CLAUDE.md (which are restored through other mechanisms). If the retained messages already contain a Read result for a given file, it's skipped to avoid wasting tokens on duplicate injection.
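The selection logic might be sketched as follows — function and field names are assumptions, and tokens are approximated as chars/4:

```python
MAX_FILES = 5
TOTAL_BUDGET = 50_000   # tokens across all restored files
PER_FILE_CAP = 5_000    # tokens per file

def pick_files_to_restore(read_file_state, excluded, already_in_tail,
                          token_count):
    """Choose up to 5 most recently read files under the budgets above,
    skipping plan files / CLAUDE.md (restored elsewhere) and files whose
    Read results already survive in the retained tail."""
    by_recency = sorted(read_file_state.items(),
                        key=lambda kv: kv[1]["timestamp"], reverse=True)
    picked, total = [], 0
    for path, info in by_recency:
        if path in excluded or path in already_in_tail:
            continue
        cost = min(token_count(info["content"]), PER_FILE_CAP)
        if len(picked) >= MAX_FILES or total + cost > TOTAL_BUDGET:
            break
        picked.append(path)
        total += cost
    return picked
```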
Other recovery steps: re-inject the plan file (if one exists), re-inject loaded skill content (sorted by most recently used, 5K tokens per skill, 25K total), re-inject delta attachments for deferred tools and MCP instructions, and execute SessionStart hooks to restore CLAUDE.md and other context. The truncation strategy for skills preserves the file header, since usage instructions are typically at the top.
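The keep-the-header truncation for skills can be sketched in a few lines, again with an assumed name and a chars/4 token approximation:

```python
def truncate_skill_content(text, cap_tokens=5_000,
                           token_count=lambda s: len(s) // 4):
    """Preserve the head of a skill file when it exceeds its budget —
    usage instructions typically sit at the top."""
    if token_count(text) <= cap_tokens:
        return text
    return text[:cap_tokens * 4].rstrip() + "\n[skill content truncated]"
```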
There's another easily overlooked detail: after compaction, session
metadata (custom titles, tags) is re-appended to ensure it falls within
the 16KB tail window. Otherwise, subsequent messages would push the
metadata out of the window, causing --resume to display an
auto-generated title instead of the user's custom name. Yet another
detail: the prompt cache break detector is notified to reset its
baseline after compaction, preventing the compaction-induced drop in
cache reads from being flagged as anomalous.
Design Philosophy
| Layer | Mechanism | Cost |
|---|---|---|
| 0 | Persist large results to disk | zero (local I/O only) |
| 1 | Cached microcompact | near-zero (server-side cache_edits) |
| 2 | Time-based microcompact | near-zero (cache already cold) |
| 3 | Session Memory compact | no extra LLM call (amortized in background) |
| 4 | Full compact | one full LLM summarization call |
The core idea behind the entire system can be summed up as "defer as long as possible, keep it as cheap as possible, escalate in stages." From zero-cost disk persistence, to near-zero-cost cache edits, to zero-LLM-cost Session Memory, and only then to a full LLM summary. Each layer tries to avoid invoking the next. An additional benefit of this layered strategy is fault tolerance: if one layer fails, the system naturally falls through to the next rather than erroring out.
Another notable design choice is the extreme deference to prompt cache. The entire raison d'être of Cached Microcompact is to avoid breaking the cached prefix. Full Compact shares the cache via a forked agent. Session Memory's background extraction also shares the cache via a fork. Time-Based Microcompact only modifies content after confirming the cache has expired. Behind almost every design decision lurks the question: "will this cause a cache miss?" In a world where tokens are billed by volume, prompt cache hit rates directly affect cost — and if a compaction system destroys the cache in order to save context, it's just robbing Peter to pay Paul.
Then there's a category of less glamorous but very practical
engineering decisions: handling empty tool results. Some tools (silent
shell commands, MCP servers returning content:[]) produce
empty output. This seems harmless, but on certain models, an empty
tool_result at the end of the prompt can cause the model to misinterpret
it as a conversation boundary and stop responding prematurely. The
system replaces all empty results with a
(toolName completed with no output) marker, giving the
model an unambiguous anchor.
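The normalization itself is tiny — a sketch treating tool results as plain dicts, with the function name assumed:

```python
def normalize_empty_result(tool_result, tool_name):
    """Replace empty tool output with an explicit marker so the model never
    sees a bare empty tool_result that could read as a conversation boundary."""
    content = tool_result.get("content")
    if not content or (isinstance(content, str) and not content.strip()):
        tool_result = dict(tool_result,
                           content=f"({tool_name} completed with no output)")
    return tool_result
```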
The Session Memory design is especially worth savoring. It essentially amortizes the expensive operation of understanding session content across the entire session lifecycle — performing small incremental extractions at regular intervals, so that when compaction is finally needed, the payoff is a free summary. It transforms summarization from "how do I compress 200K of context" into "how do I maintain a set of always-up-to-date notes." If this sounds like the LLM equivalent of incremental backups, that's more or less the idea.
Of course, beyond this elegant cascade sits a blunt reality: 200K tokens simply isn't enough for heavy coding sessions, and compaction always means information loss. All of this engineering effort has a single ultimate goal: to make "not enough" happen a little later, cost a little less, and bother the user a little less. In a sense, this entire system is a love letter to the finitude of the context window — and the more beautifully it's written, the more it proves the problem has no real solution.