How Claude Code's Forked Sub-Agents Share Prompt Cache

Posted on 2026-04-05 Edited on 2026-04-06 In NLP 评论: Views:

When tackling complex tasks, Claude Code spawns multiple sub-agents in parallel, each of which needs the full parent conversation context to do its job effectively. This creates a real cost problem: if the parent conversation has accumulated 100K tokens of context and three sub-agents are spawned simultaneously, a naive implementation would charge 100K tokens of input for each one — 300K total. Anthropic's API offers a Prompt Cache mechanism that gives a 90% discount on the cached prefix portion, but only if the prefix bytes are exactly identical across requests. From observable behavior, Claude Code's forked sub-agents are carefully constructed so that over 99% of the bytes are identical across all parallel forks, compressing the effective input cost of three sub-agents to roughly 120K token-equivalent (100K at full price + 2 × 100K × 10%).

How Prompt Cache Works

A quick refresher on Anthropic's prompt cache. When you send an API request, you can tag a content block with cache_control: {type: "ephemeral"}, telling the server to cache everything up to that point. If a subsequent request has a byte-identical prefix, it hits the cache, and the cached tokens are billed at 10% of the normal price. The cache key is determined by several dimensions together: the full content of the system prompt, the JSON serialization of the tools definitions, the model name, the prefix of the messages array, and the thinking config. A byte-level difference in any single dimension causes a cache miss.

"Byte-level identical" is the key phrase here. Not semantically equivalent, not logically equal — the JSON byte sequence on the wire must be exactly the same. An extra space, a change in field ordering, a boolean that differs because a feature flag was read at a different moment — any of these will invalidate the cache. This constraint shapes the entire fork cache-sharing design.

Two Sub-Agent Paths

From observable behavior, Claude Code has two distinct sub-agent spawning paths. The first is named sub-agents, including Explore and Plan, which have their own system prompts, use smaller or cheaper models (Explore uses Haiku for external users), carry a trimmed tool set (read-only tools, no file editing), and omit the project conventions from CLAUDE.md to save tokens. The second is fork sub-agents, which inherit the parent agent's full conversation context, full tool set, and same model — essentially clones of the parent, each executing a different directive.

The caching strategies for these two paths are completely different. Named sub-agents have their own independent cache chains; their system prompts, tool definitions, and models all differ from the parent's, so cache sharing with the parent is impossible. Fork sub-agents take a different approach: they are deliberately constructed to share a single cache prefix with both the parent and each other. This doesn't happen naturally — it requires intentional alignment at every stage of API request construction.

Constructing Byte-Identical Prefixes

This is the core of the entire design. When the parent agent emits three parallel tool calls (tool_use) in a single response, Claude Code constructs the following message sequence for each fork sub-agent. Pay attention to the similarities and differences among the three forks:

Fork 1:
  [...parent_history,
   assistant { tool_use_1, tool_use_2, tool_use_3 },
   user { tool_result_1="Fork started...",
          tool_result_2="Fork started...",
          tool_result_3="Fork started...",
          <fork-directive> directive_1 </fork-directive> }]

Fork 2:
  [...parent_history,
   assistant { tool_use_1, tool_use_2, tool_use_3 },
   user { tool_result_1="Fork started...",
          tool_result_2="Fork started...",
          tool_result_3="Fork started...",
          <fork-directive> directive_2 </fork-directive> }]

Fork 3:
  [...parent_history,
   assistant { tool_use_1, tool_use_2, tool_use_3 },
   user { tool_result_1="Fork started...",
          tool_result_2="Fork started...",
          tool_result_3="Fork started...",
          <fork-directive> directive_3 </fork-directive> }]

Except for the directive text block at the very end, the entire message sequence is byte-for-byte identical across all three forks. In pseudocode:

FORK_PLACEHOLDER_RESULT = "Fork started - processing in background"

def build_forked_messages(directive, assistant_message):
    # Clone the full assistant message (all tool_use blocks, thinking, text)
    full_assistant = clone(assistant_message)

    # Collect ALL tool_use blocks from the assistant message
    tool_use_blocks = [b for b in assistant_message.content
                       if b.type == "tool_use"]

    # Build IDENTICAL placeholder results for every tool_use
    tool_results = [
        {"type": "tool_result",
         "tool_use_id": block.id,
         "content": [{"type": "text",
                       "text": FORK_PLACEHOLDER_RESULT}]}
        for block in tool_use_blocks
    ]

    # Single user message: all placeholder results + per-child directive
    user_message = create_user_message(
        content=[*tool_results,
                 {"type": "text",
                  "text": build_child_directive(directive)}]
    )

    return [full_assistant, user_message]

Several design decisions in this logic are worth calling out.

First, the placeholder text FORK_PLACEHOLDER_RESULT is a constant string — every tool_result across every fork uses the exact same text. You might wonder: why not put the actual result for each fork's corresponding tool_use in there? Because forks are launched in parallel. When Fork 1 starts executing, Fork 2 and Fork 3's results don't exist yet. And even if they did, inserting different results would break prefix consistency. So a constant placeholder is used, identical across all forks.

Second, the parent's assistant message is preserved in full, including all tool_use blocks — not just the one corresponding to the current fork. This seems counterintuitive at first: Fork 1 logically only needs to execute directive_1, so why does it need to see tool_use_2 and tool_use_3? The answer, again, is caching. If Fork 1 only saw tool_use_1 while Fork 2 only saw tool_use_2, the three forks would diverge starting from the assistant message, and prefix consistency would break right there. Retaining all tool_use blocks is a necessary condition for cache sharing.

Third, all fork sub-agents share the same number and ordering of tool_results. Regardless of which of the three tool_uses a fork actually cares about, its user message contains all three tool_results. This ensures every byte from the tool_result sequence up to the directive is identical.

One more detail: the fork's directive text is wrapped in an XML tag <fork-boilerplate>, which contains a fixed instruction template telling the sub-agent that it's a forked worker, that it shouldn't spawn further sub-agents, shouldn't engage in small talk, and should use tools to execute its task and then report back. This template is the same across all forks. The only difference is the specific task instruction appended at the end of the template. This design further compresses the divergence zone: even if the directive contains several hundred tokens of template text, all of it is shared, and the actual difference is just the last few dozen tokens of the specific task description.

Five-Dimensional Cache Alignment

The prompt cache key is composed of five dimensions: system prompt, tools definition, model, messages prefix, and thinking config. Message prefix consistency is solved by the construction above, but the other four dimensions also need to be aligned one by one. If even a single byte differs in any dimension, the cache is completely invalidated — there's no such thing as a partial hit.

System prompt: Fork sub-agents don't use their own system prompt; instead, they directly use the bytes of the system prompt already rendered by the parent agent. From observable behavior, there's a subtle consideration here: system prompt rendering may depend on feature flag state (e.g., GrowthBook), and flag state could change between the start of the parent's turn and the creation of a fork sub-agent. If the fork sub-agent re-rendered the system prompt, even with the same logic, a change in flag state could produce different bytes. So the correct approach is to pass the parent's already-rendered bytes directly, rather than re-rendering. There's a fallback path in the code that recomputes when the pre-rendered bytes are unavailable, but it's annotated as potentially causing cache invalidation.

Tools definition: Fork sub-agents receive the parent agent's complete tool pool via a useExactTools=true flag, with no filtering or reordering. Regular sub-agents would reassemble their tool set based on their own permissionMode, and the resulting serialization could break the cache at the first point of difference. The fork path skips this step entirely and uses the parent's tools array reference directly.

ThinkingConfig: Fork sub-agents inherit the parent's thinking configuration. This config is part of the cache key. From observable behavior, if a fork sub-agent set a different maxOutputTokens, it could indirectly change budget_tokens (through downstream clamping logic), causing the byte representation of the thinking config to change and thus invalidating the cache. So the fork path doesn't set maxOutputTokens — it inherits the full configuration as-is.

ContentReplacementState: This dimension is a bit more subtle. Claude Code has a tool result budget management mechanism that truncates or replaces overly long tool results. These replacement decisions affect the message content actually sent over the wire — for the same tool_use_id, a fresh state and a cloned state might make different replacement decisions. Fork sub-agents clone the parent's replacement state (rather than creating a fresh one), so that when processing inherited parent messages, they make identical replacement decisions for the same tool_use_ids, keeping the wire prefix consistent. In the code's own words: "A fresh state would see them as unseen and make divergent replacement decisions, a clone makes identical decisions."

The CacheSafeParams Pattern

Parallel forks aren't the only scenario that needs cache sharing. After each turn of the main loop, Claude Code also launches sidecar forks — session memory extraction, prompt suggestion generation, /btw side-questions, and so on. These tasks also need to piggyback on the main loop's cache as much as possible. From observable behavior, the system saves a cache-safe parameter snapshot after each main loop turn:

class CacheSafeParams:
    system_prompt: SystemPrompt
    user_context: dict
    system_context: dict
    tool_use_context: ToolUseContext
    fork_context_messages: list[Message]

# Global slot, written after each main-loop turn
last_cache_safe_params: CacheSafeParams | None = None

def save_cache_safe_params(params):
    global last_cache_safe_params
    last_cache_safe_params = params

def get_last_cache_safe_params():
    return last_cache_safe_params

This global slot is updated in the stop hooks of each main loop turn. Subsequent sidecar forks read from this snapshot to construct their API requests, rather than each caller independently re-collecting the parameters. This avoids tiny discrepancies caused by different collection times — for instance, the system context might change between two collection passes. Only queries originating from the main thread (repl_main_thread) or the SDK write to this snapshot; sub-agents' own turns do not overwrite it.

A notable design point is that the forkContextMessages in CacheSafeParams is a reference to the main loop's message array. Sidecar forks clone the necessary mutable state via createSubagentContext when they use it, but the message prefix itself remains a shared reference, ensuring byte consistency.

The practical problem this pattern solves is timing window divergence. Suppose the main loop finishes a turn at T0, session memory extraction starts at T0+100ms, and prompt suggestion generation starts at T0+200ms. If each independently collected the system prompt and context, feature flags might change between T0 and T0+200ms, GrowthBook configs might update, or the system prompt template itself might be hot-reloaded. Using a single snapshot eliminates this timing window risk entirely.

Recursive Fork Prevention

Fork sub-agents retain the Agent tool in their tool pool. This isn't so they can actually fork again — it's to maintain byte consistency in the tool definitions. If the Agent tool were removed from a fork sub-agent's tool pool, the tool schema would differ from the parent's, and the cache prefix wouldn't align on the tools dimension.

Recursive fork detection works in two layers. The first layer checks whether querySource matches the fork agent's identifier — this check survives autocompact (automatic compaction rewrites messages but doesn't change context.options, because querySource is set on options at spawn time). The second layer scans message history for the fork boilerplate tag, serving as a fallback in case querySource wasn't properly propagated:

def is_in_fork_child(messages):
    """Check if we're already inside a fork child."""
    for msg in messages:
        if msg.type != "user":
            continue
        for block in msg.content:
            if block.type == "text" and "<fork-boilerplate>" in block.text:
                return True
    return False

# In AgentTool.call():
def handle_agent_call(input, context):
    effective_type = input.subagent_type or (
        None if is_fork_enabled() else "general-purpose"
    )
    is_fork_path = effective_type is None

    if is_fork_path:
        # Primary guard: querySource (survives autocompact)
        if context.options.query_source == "agent:builtin:fork":
            raise Error("Fork not available inside a forked worker")
        # Fallback guard: message scan
        if is_in_fork_child(context.messages):
            raise Error("Fork not available inside a forked worker")

The existence of two layers of defense highlights autocompact as a scenario requiring special attention. When a conversation grows long enough to trigger automatic compaction, messages get rewritten, and the fork boilerplate tag may be compressed away. If detection relied solely on message scanning, an autocompacted fork sub-agent could bypass the recursion check. So querySource, as a property on options (unaffected by message rewriting), serves as the primary detection mechanism, with message scanning demoted to a fallback. This is a good example of defense in depth: each layer has blind spots, but together they cover all paths.

Cost Optimization for Named Sub-Agents

Named sub-agents take a completely different optimization path. Rather than trying to share cache with the parent, they reduce absolute cost by trimming their context. The idea is straightforward: if a sub-agent doesn't need the full context, don't give it the full context.

The Explore agent uses the Haiku model for external users, retains only read-only tools (Glob, Grep, Read, Bash), and doesn't load the CLAUDE.md project conventions. From observable behavior, stripping CLAUDE.md saves a considerable amount of tokens at fleet scale, presumably because CLAUDE.md typically contains thousands to tens of thousands of tokens of project conventions, and Explore — as a read-only search agent — has no need for coding style guides or CI pipeline details. This optimization has a kill switch via feature flag, indicating it was validated through A/B testing.

Similarly, both Explore and Plan omit gitStatus from the system context. This information is collected at session start, can be tens of thousands of tokens, and may be stale by the time a sub-agent runs. If sub-agents need git information, they can run git status themselves to get real-time data. Dropping a stale, large text block both saves tokens and prevents outdated information from misleading sub-agent decisions.

These two optimization strategies — fork cache sharing vs. named agent context trimming — suit different scenarios. Forks are ideal for parallel tasks that need the full context, such as simultaneously modifying multiple related files where sub-agents need to understand the entire conversation history to make correct editing decisions. Named agents are ideal for well-scoped single tasks like searching the codebase or drafting a plan, where the full parent conversation history isn't needed, and a smaller model with less context can get the job done. From a cost perspective, named agents are cheaper per individual call (smaller model + shorter context), while forks amortize marginal cost across multiple parallel calls through cache sharing.

Putting all the mechanisms together, here's what a typical cache-sharing flow looks like for a parallel fork execution:

Parent turn N:
  API Request:
    system_prompt  [cached from turn N-1]
    tools          [cached from turn N-1]
    messages[0..K] [cached from turn N-1]
    messages[K+1]  [new user message, cache_control: ephemeral]
  Response:
    assistant { thinking, text, tool_use_1, tool_use_2, tool_use_3 }

Fork 1 (first request):
  API Request:
    system_prompt  [CACHE HIT - same bytes as parent]
    tools          [CACHE HIT - same tool pool, useExactTools]
    messages[0..K] [CACHE HIT - same parent history]
    messages[K+1]  [CACHE HIT - same user message]
    messages[K+2]  [assistant with all 3 tool_uses]
    messages[K+3]  [placeholder results + directive_1]
  -> Pays: full price for messages[K+2..K+3], 90% off on [0..K+1]

Fork 2 (second request, moments later):
  API Request:
    system_prompt  [CACHE HIT]
    tools          [CACHE HIT]
    messages[0..K+2] [CACHE HIT, including the assistant message]
    messages[K+3]    [placeholder results + directive_2]
  -> Pays: full price only for the directive_2 tail,
           90% discount on everything including messages[K+2]

Fork 3 (third request):
  -> Same pattern as Fork 2, nearly everything cached

The first fork creates the cache entry; all subsequent forks hit it. The key insight is that Fork 2 and Fork 3 can even cache-hit on the assistant message and placeholder results that Fork 1 added, because this content is byte-identical across all forks — only the last few dozen tokens of the directive differ. This incremental caching effect means that adding more forks barely increases marginal cost.

Let's do the math. Suppose parent_history is 100K tokens, the assistant message with three tool_uses is 500 tokens, the placeholder results total 200 tokens, and each directive is about 100 tokens. Fork 1 pays full price for the assistant message + placeholders + directive (~800 tokens), with the preceding 100K hitting cache at 10%. Fork 2 and Fork 3 cache-hit even on the assistant message and placeholders, paying full price only for the 100-token directive. The total input cost across three forks is roughly: 100K × 10% × 3 (cached portion) + 800 (Fork 1's new content) + 100 × 2 (Fork 2/3 directives) = ~31K token-equivalent. Compared to 300K+ without cache sharing, that's nearly 90% savings.

This benefit scales with conversation context size. When parent_history grows from 100K to 200K, the absolute token savings from cache sharing double as well. Given that Claude Code sessions routinely reach tens of thousands to over a hundred thousand tokens of context, the cost impact of fork cache sharing is quite significant. Conversely, without this mechanism, frequent parallel forks would cause API costs to scale linearly, quickly becoming prohibitive in long conversations.

Isolation Model: Shared Prefix, Isolated Runtime

Sharing cache doesn't mean sharing state. Each fork sub-agent gets a fully isolated context at execution time: the file state cache is cloned (not a shared reference), the abort controller is freshly created (but linked to the parent's — parent abort propagates down), and mutation callbacks like setAppState default to no-ops. Multiple forks don't interfere with each other; one fork modifying a file won't pollute another fork's readFileState cache.

After a fork completes, isolated resources are explicitly released: readFileState.clear() empties the cloned file cache, and initialMessages.length = 0 frees the cloned context message array. This matters for long sessions — otherwise, every fork would leave behind a memory footprint from a 100K+ token message copy. Background shell tasks registered by fork sub-agents, Perfetto tracing entries, todo state, and other resources are also cleaned up in a finally block to prevent resource leaks in long sessions.

Forks also support a worktree isolation mode, which creates an independent git worktree for the sub-agent. In this case, an additional note is appended to the sub-agent's messages, informing it that it's working in a worktree, that it needs to translate inherited paths to the worktree root, and that it should re-read files before editing (since the parent agent may have modified them). This note is appended after the fork directive, so it doesn't affect the common prefix shared among all forks.

One more isolation detail worth mentioning: although a fork sub-agent's abort controller is freshly created, it's linked to the parent's controller, so an abort signal propagates downward. This means if the user interrupts at the parent agent level, all parallel forks receive the signal. But the reverse isn't true: a fork aborting on its own doesn't affect other forks or the parent. This unidirectional propagation makes sense for parallel execution — users need the ability to kill all sub-tasks at once, but a single sub-task's failure shouldn't cascade to others.

From observable behavior, a fork sub-agent's results are presented to the parent agent as notifications. Enabling forks also changes the overall behavior of the Agent tool: all agent spawns become asynchronous, using a unified task-notification interaction model. This means the parent agent doesn't block waiting after issuing fork directives — it continues processing other work or waits for all forks to complete. This design choice makes forks naturally suited for parallel scenarios, but it also means fork sub-agent results aren't inserted directly into the parent's conversation flow the way synchronous sub-agent results would be.

A Design Observation

The entire fork cache-sharing mechanism is essentially solving a constraint satisfaction problem: the API's cache key requires byte-level identity, while fork sub-agents inherently need differentiated directives. The solution is to push all differences to the very end of the message sequence — a region two to three orders of magnitude smaller than the prefix. It's reminiscent of clustered index design in databases: put the high-cardinality column last so the common prefix is as long as possible.

Abstracting this design principle, it's really a general multi-tenant cache-sharing pattern: when N requests share a long prefix and differ only in a short tail, push the differences to the end and let the cache cover all the common content. You can find echoes of this pattern in CDN vary-header design, database query plan caching, and even CPU instruction cache prefetch strategies. What makes Claude Code's implementation distinctive is that it must maintain consistency across five independent dimensions simultaneously (system prompt, tools, model, messages, thinking config) — a discrepancy in any single dimension invalidates the entire cache. This makes it more like walking a tightrope in five-dimensional space.

For developers building their own LLM agent systems, there's a directly applicable design principle here: if your system needs to issue multiple LLM calls in parallel that share substantial common context, it's worth investing the effort to ensure those calls have byte-identical API request prefixes. Concretely: use constants for all placeholder content, avoid re-rendering dynamic sections that might produce differences, retain rather than trim shared schemas, and clone rather than reconstruct stateful processors. The prompt cache discount (90%) is large enough that even relatively involved alignment work delivers excellent ROI.

To keep tool definitions byte-identical, fork sub-agents retain the Agent tool but are forbidden from calling it. A tool's presence exists purely for the bytes it produces during schema serialization — completely unrelated to its functionality. In this system, a tool can have zero chance of being invoked, but it cannot lack the necessity of being serialized. This is perhaps the ultimate expression of what byte-level compatibility demands of engineering aesthetics.