Lazy-Loading Tools in Claude Code: Discovery Over Declaration

Claude Code registers over 40 tools, yet the average request only uses 3–4 of them. Stuffing every tool's JSON Schema into the system prompt is expensive: each tool's description plus parameter definitions runs about 500 tokens, so 40 tools cost roughly 20K tokens. When all the user wants is to read a file, paying for WebSearch, NotebookEdit, and CronCreate — tools that have nothing to do with the task — is a bad deal.

Claude Code's answer is deferred loading: hide infrequently used tools, expose only their names, and let the model pull in full schemas on demand through a discovery tool. This turns the token cost of tool definitions from a fixed overhead into a pay-as-you-go expense.

Two Classes of Tools: Always-Loaded and Deferred

Judging from observed behavior, Claude Code splits its tools into two classes. The first is always-loaded tools whose full schemas ship with every API request:

  • Bash, Read, Edit, Write, Glob, Grep
  • Agent, ToolSearch, Skill

These are high-frequency tools that almost every task touches, or infrastructure the system itself depends on (ToolSearch, for instance, can't be deferred — otherwise nothing could load anything else).

The second class is deferred tools, for which only the name is sent — no full schema. Based on observed behavior, this class includes:

  • WebSearch, WebFetch, TodoWrite, NotebookEdit
  • EnterWorktree, ExitWorktree
  • EnterPlanMode, ExitPlanMode
  • CronCreate, CronDelete, CronList
  • ConfigTool, LSPTool, AskUserQuestion
  • All MCP tools (unless explicitly marked alwaysLoad)

The heuristic is intuitive: tools that might be needed on the very first turn stay always-loaded; tools that might never be used in an entire session get deferred. MCP tools are natural candidates for deferral — they're configured per-project, their count is unbounded, and most requests never need to call an external service at all.

The Discovery Mechanism: ToolSearch and tool_reference

At the heart of deferred loading is the ToolSearch tool. When the model decides it needs a deferred tool, it calls ToolSearch to look it up. ToolSearch supports two query modes: exact selection and keyword search.

ToolSearch("select:WebSearch")   # exact selection
ToolSearch("notebook jupyter")   # keyword search
ToolSearch("+slack send")        # require "slack", rank by "send"

The keyword scorer tokenizes tool names (splitting CamelCase or mcp__server__action patterns), then matches query terms against names and descriptions. An exact match on an MCP server name carries the highest weight (12 points); a partial match on a tool name scores 5–6 points; a word-boundary match in the description scores the least (2 points). A + prefix marks a term as mandatory for pre-filtering.
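The scoring heuristic can be made concrete with a runnable sketch. The weights (12 / 5–6 / 2) come from the description above; the tokenizer, function names, and everything else here are illustrative assumptions, not Claude Code's actual code:

```python
import re

SERVER_EXACT, NAME_PARTIAL, DESC_BOUNDARY = 12, 5, 2  # weights from the text

def tokenize(name):
    """Split CamelCase and mcp__server__action names into lowercase words."""
    words = []
    for chunk in name.split("__"):
        words += re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", chunk)
    return [w.lower() for w in words]

def score(query, tool_name, description, server=None):
    terms = query.lower().split()
    required = [t[1:] for t in terms if t.startswith("+")]
    scored = [t.lstrip("+") for t in terms]
    name_words = tokenize(tool_name)
    haystack = " ".join(name_words) + " " + description.lower()
    if server:
        haystack += " " + server.lower()
    # '+'-prefixed terms are mandatory pre-filters: any miss disqualifies
    if any(t not in haystack for t in required):
        return 0
    total = 0
    for term in scored:
        if server and term == server.lower():
            total += SERVER_EXACT        # exact MCP server-name match
        elif any(term in w for w in name_words):
            total += NAME_PARTIAL        # partial tool-name match
        elif re.search(r"\b" + re.escape(term), description.lower()):
            total += DESC_BOUNDARY       # word-boundary match in description
    return total
```

Under this sketch, `score("+slack send", "mcp__slack__send_message", "Send a message to a Slack channel", server="slack")` hits both the server name and the tool name, while the same query against WebSearch fails the mandatory `+slack` filter and scores zero.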

ToolSearch doesn't return the schema as text. Instead, it returns tool_reference content blocks:

# ToolSearch returns this as tool_result
{
  "type": "tool_result",
  "tool_use_id": "...",
  "content": [
    {"type": "tool_reference", "tool_name": "WebSearch"},
    {"type": "tool_reference", "tool_name": "WebFetch"}
  ]
}

tool_reference is an API-level protocol. The client sends this block; the API server expands it into the full tool schema within the model's context. The model can then call WebSearch just like any other tool. Crucially, schema expansion happens server-side — the client never needs to stuff the schema into the tool_result itself.

defer_loading: API-Level Coordination

In the API request's tools array, deferred tools carry a defer_loading: true field:

# Always-loaded tool (full schema)
{
  "name": "Read",
  "description": "Reads a file...",
  "input_schema": {"type": "object", "properties": {...}}
}

# Deferred tool (name only; schema hidden from model)
{
  "name": "WebSearch",
  "description": "...",
  "input_schema": {"type": "object", "properties": {...}},
  "defer_loading": True  # API hides this from model context
}

Note that deferred tools still include their full schema in the tools array, but when the API server sees the defer_loading flag, it withholds the schema from the model. The model only knows the tool exists by name. This differs from not sending the tool at all: the tool is still registered on the API side, so the moment a tool_reference fires, the server can expand the full definition immediately. Using this feature requires a special beta header and is currently supported only by the first-party Anthropic API and Foundry. Haiku does not support it.

Tracking Discovered Tools

Once a tool has been discovered, it needs to remain available for the rest of the conversation. The implementation scans the message history for tool_reference blocks:

def extract_discovered_tools(messages):
    discovered = set()
    for msg in messages:
        # compact boundary carries pre-compact state
        if msg.type == "system" and msg.subtype == "compact_boundary":
            carried = msg.compact_metadata.get("pre_compact_discovered_tools", [])
            discovered.update(carried)
            continue

        if msg.type != "user":
            continue
        for block in msg.content:
            if block.type == "tool_result":
                for item in block.content:
                    if item.type == "tool_reference":
                        discovered.add(item.tool_name)
    return discovered

When building each API request, always-loaded tools are sent as usual. Among deferred tools, only those present in the discovered set are included. ToolSearch itself is always included. This way, each turn sends only the subset of tools the model actually needs.
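The per-turn filtering reduces to a small function. A minimal sketch, assuming dict-shaped tool entries; the function name and data shapes are hypothetical:

```python
def build_tools_for_request(always_loaded, deferred, discovered):
    """Assemble one turn's tools array: every always-loaded tool (which
    includes ToolSearch), plus only those deferred tools whose names the
    model has already discovered via tool_reference blocks."""
    tools = list(always_loaded)
    tools += [t for t in deferred if t["name"] in discovered]
    return tools

always = [{"name": "Bash"}, {"name": "Read"}, {"name": "ToolSearch"}]
deferred = [{"name": "WebSearch"}, {"name": "mcp__slack__post"}]

# After WebSearch has been discovered, it rides along; the Slack tool
# stays out of the request until a tool_reference names it.
turn_tools = build_tools_for_request(always, deferred, {"WebSearch"})
```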

Three Modes: Always-Defer, Auto, and Never-Defer

Deferred loading behavior is controlled by the ENABLE_TOOL_SEARCH environment variable, which maps to three modes:

Value           Mode       Behavior
unset / true    tst        Always defer. MCP tools and built-in tools marked shouldDefer are all deferred.
auto / auto:N   tst-auto   Auto. Deferral kicks in only when the total tokens of deferrable tool schemas exceed N% of the context window (default 10%).
false           standard   Never defer. All tools are inlined; ToolSearch is not loaded.

Auto mode is an elegant design. It first uses the token-counting API to compute the total token cost of all deferrable schemas; if that API is unavailable, it falls back to dividing character count by 2.5 as an approximation. It then compares the result against the percentage threshold of the context window. A small project with two or three MCP tools totaling under 1,500 tokens? Not worth the overhead. A large project wired up to a dozen MCP servers where tool schemas easily blow past 10K tokens? Deferred loading pays for itself. Auto mode lets the system adapt instead of forcing a one-size-fits-all policy.
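The auto-mode decision is a threshold check. A sketch, assuming a `count_tokens` callable standing in for the token-counting API; the chars/2.5 fallback is from the description above, and all names are hypothetical:

```python
def should_defer(deferrable_schemas, context_window, threshold_pct=10.0,
                 count_tokens=None):
    """Return True when the deferrable schemas cost more than threshold_pct
    of the context window; fall back to chars / 2.5 without a token counter."""
    text = "".join(deferrable_schemas)
    tokens = count_tokens(text) if count_tokens else len(text) / 2.5
    return tokens > context_window * (threshold_pct / 100)

# Two small MCP schemas (~1,200 estimated tokens): below 10% of 200K, no deferral.
small = should_defer(["x" * 1500, "x" * 1500], 200_000)

# A dozen larger schemas (~24,000 estimated tokens): over the threshold, defer.
large = should_defer(["x" * 5000] * 12, 200_000)
```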

Recovery After Compaction

As conversations grow, Claude Code performs context compaction — compressing message history into a summary. The problem is that compaction destroys tool_reference blocks, causing previously discovered tools to vanish from subsequent requests. The solution has two layers.

The first layer is a compact boundary marker. Before compaction runs, the system scans the current message history for all discovered tool names and stores them in the boundary message's metadata (preCompactDiscoveredTools). When extract_discovered_tools later encounters this boundary message, it adds those names back to the discovered set.

The second layer is the deferred_tools_delta attachment. This is an incremental notification mechanism: before each turn, the system diffs the current pool of deferrable tools against the set it has already notified the model about. If there are additions or removals, it injects a delta message into the conversation telling the model which deferred tools are now available and which are no longer available (e.g., because an MCP server disconnected). After compaction, the delta is regenerated, ensuring the model's awareness of available tools isn't broken by the compression.

compact boundary:
  preCompactDiscoveredTools: ["WebSearch", "mcp__slack__send_message"]

post-compact delta attachment (system-reminder):
  "The following deferred tools are now available via ToolSearch:
   mcp__github__create_issue
   mcp__slack__send_message"

The two layers complement each other: the boundary ensures correct tool filtering at the API level; the delta ensures the model's awareness stays correct.
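The delta itself is a set diff between the current pool and what the model has already been told. A minimal sketch (function and variable names are assumptions):

```python
def deferred_tools_delta(current_pool, already_notified):
    """Diff the current deferrable-tool pool against the set the model has
    already been notified about; returns (added, removed) name lists."""
    current = set(current_pool)
    added = sorted(current - already_notified)
    removed = sorted(already_notified - current)
    return added, removed

# One MCP server connected, another disconnected since the last notice:
added, removed = deferred_tools_delta(
    {"WebSearch", "mcp__github__create_issue"},
    {"WebSearch", "mcp__slack__send_message"},
)
```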

Special Handling for MCP Tools

MCP tools are deferred by default, but an MCP server can exempt a tool by setting _meta["anthropic/alwaysLoad"] in its definition. This is an escape hatch for critical tools the model needs to know about from the very first turn.

From observed behavior, deferred loading of MCP tools has an additional wrinkle: MCP servers may connect asynchronously. When ToolSearch finds no matching tools, it checks whether any MCP servers are still in the process of connecting. If so, it tells the model "some MCP servers are still connecting, try searching again." This is a graceful retry hint — the model doesn't need to understand the MCP connection state machine; it just needs to know "nothing found yet, try again shortly."
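That fallback can be sketched as a simple check on connection state. The message text paraphrases the observed behavior, and all names here are illustrative:

```python
def search_result(matches, mcp_server_states):
    """Return tool_reference blocks for hits; on a miss, hint at a retry
    if any MCP server is still in the process of connecting."""
    if matches:
        return [{"type": "tool_reference", "tool_name": name} for name in matches]
    if "connecting" in mcp_server_states.values():
        return [{"type": "text",
                 "text": "No matching tools found, but some MCP servers are "
                         "still connecting. Try searching again."}]
    return [{"type": "text", "text": "No matching tools found."}]
```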

The Full Picture

Putting all the pieces together, a typical deferred tool invocation looks like this:

       API Request (Turn 1)
+---------------------------------------------+
| tools: [                                    |
|   {name: "Bash", ...full schema...},        |
|   {name: "Read", ...full schema...},        |
|   {name: "ToolSearch", ...full schema...},  |
|   {name: "WebSearch", defer_loading: true}, |
|   {name: "mcp__slack__post", defer: true}   |
| ]                                           |
+---------------------------------------------+
                |
                v
Model sees: Bash, Read, ToolSearch
Model sees names: WebSearch, mcp__slack__post
                |
                v
User: "search for Claude Code release notes"
                |
                v
Model calls: ToolSearch("select:WebSearch")
                |
                v
ToolSearch returns:
  {type: "tool_reference", tool_name: "WebSearch"}
                |
                v
API expands WebSearch schema in model context
                |
                v
Model calls: WebSearch("Claude Code release notes")
                |
                v
Turn 2: WebSearch included (discovered);
        mcp__slack__post still deferred

From a token-economics perspective, if 30 tools are deferred at 500 tokens each, that's 15K tokens saved from the system prompt. For a model with a 200K context window, that means 7.5% more usable context. The cost is one extra ToolSearch round-trip — roughly 200 input tokens plus a few dozen output tokens. As long as the model isn't searching for tools on every single turn, the net benefit is positive.

Edge Cases Worth Noting

A few details are worth calling out. First, third-party API proxies (configured via ANTHROPIC_BASE_URL) have deferred loading disabled by default, because tool_reference is a beta feature that proxies may not recognize. However, if the user explicitly sets ENABLE_TOOL_SEARCH=true, the system assumes the proxy can handle it. Second, LSP tools have special logic: if the LSP is still initializing, the tool is temporarily deferred even if it isn't marked shouldDefer, and it reverts to normal once initialization completes. Finally, sub-agent conversations (via the Agent tool) are isolated — their ToolSearch discoveries don't propagate to the main thread. This is also why deferred_tools_delta distinguishes call sites for sub-agents: a sub-agent seeing prior=0 on its first turn is expected; the main thread seeing prior=0 on its first turn is likely a bug.

The fascinating thing about the entire deferred loading system is that it's essentially reinventing dynamic linking — inside an LLM's context window. A program doesn't load every shared library at startup; it dlopens them on first call. Here the "address space" is the token budget, the "symbol table" is the tool schema, and the "dynamic linker" is ToolSearch. Back in the day, Unix used lazy loading to save memory. Now Claude Code uses lazy loading to save tokens. Computer science has no new problems — only new resource constraints.