# Lazy-Loading Tools in Claude Code: Discovery Over Declaration
Claude Code registers over 40 tools, yet the average request only uses 3–4 of them. Stuffing every tool's JSON Schema into the system prompt is expensive: each tool's description plus parameter definitions runs about 500 tokens, so 40 tools cost roughly 20K tokens. When all the user wants is to read a file, paying for WebSearch, NotebookEdit, and CronCreate — tools that have nothing to do with the task — is a bad deal.
Claude Code's answer is deferred loading: hide infrequently used tools, expose only their names, and let the model pull in full schemas on demand through a discovery tool. This turns the token cost of tool definitions from a fixed overhead into a pay-as-you-go expense.
## Two Classes of Tools: Always-Loaded and Deferred
Judging from observed behavior, Claude Code splits its tools into two classes. The first is always-loaded tools whose full schemas ship with every API request:
- Bash, Read, Edit, Write, Glob, Grep
- Agent, ToolSearch, Skill
These are high-frequency tools that almost every task touches, or infrastructure the system itself depends on (ToolSearch, for instance, can't be deferred — otherwise nothing could load anything else).
The second class is deferred tools, for which only the name is sent — no full schema. Based on observed behavior, this class includes:
- WebSearch, WebFetch, TodoWrite, NotebookEdit
- EnterWorktree, ExitWorktree
- EnterPlanMode, ExitPlanMode
- CronCreate, CronDelete, CronList
- ConfigTool, LSPTool, AskUserQuestion
- All MCP tools (unless explicitly marked `alwaysLoad`)
The heuristic is intuitive: tools that might be needed on the very first turn stay always-loaded; tools that might never be used in an entire session get deferred. MCP tools are natural candidates for deferral — they're configured per-project, their count is unbounded, and most requests never need to call an external service at all.
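The split can be sketched as a partition over the tool registry. This is a hypothetical reconstruction: the `should_defer` and `always_load` flags and the registry shape are assumptions, with only the tool names and the MCP escape hatch taken from the lists above.

```python
# Hypothetical sketch of the two-class split described above.
ALWAYS_LOADED = {"Bash", "Read", "Edit", "Write", "Glob", "Grep",
                 "Agent", "ToolSearch", "Skill"}

def partition_tools(registry):
    """Split a tool registry into always-loaded and deferred lists."""
    always, deferred = [], []
    for tool in registry:
        is_mcp = tool["name"].startswith("mcp__")
        if tool["name"] in ALWAYS_LOADED:
            always.append(tool)               # high-frequency built-ins
        elif is_mcp and tool.get("always_load"):
            always.append(tool)               # MCP escape hatch
        elif is_mcp or tool.get("should_defer"):
            deferred.append(tool)             # name-only until discovered
        else:
            always.append(tool)
    return always, deferred
```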
## The Discovery Mechanism: ToolSearch and tool_reference
At the heart of deferred loading is the ToolSearch tool. When the model decides it needs a deferred tool, it calls ToolSearch to look it up. ToolSearch supports two query modes: exact selection and keyword search.
```python
ToolSearch("select:WebSearch")   # exact selection
ToolSearch("web search +fetch")  # keyword search (illustrative query)
```
The keyword scorer tokenizes tool names (splitting CamelCase or `mcp__server__action` patterns), then matches query terms against names and descriptions. An exact match on an MCP server name carries the highest weight (12 points); a partial match on a tool name scores 5–6 points; a word-boundary match in the description scores the least (2 points). A `+` prefix marks a term as mandatory for pre-filtering.
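The scoring rules above can be condensed into a short sketch. This is an illustrative reconstruction, not the actual scorer: the tokenizer, the pre-filter, and the point values follow the description, but the function names and tie-breaking details are assumptions.

```python
import re

def tokenize(name):
    """Split CamelCase and mcp__server__action names into lowercase terms."""
    terms = []
    for part in re.split(r"__|_", name):
        terms += re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", part)
    return [t.lower() for t in terms]

def score_tool(query, tool):
    """Score one tool against a query; None means a mandatory +term failed."""
    terms = query.lower().split()
    required = [t[1:] for t in terms if t.startswith("+")]
    optional = [t for t in terms if not t.startswith("+")]
    name = tool["name"].lower()
    desc = tool.get("description", "").lower()
    server = name.split("__")[1] if name.startswith("mcp__") else ""
    # Pre-filter: every +term must appear somewhere in the name or description.
    for t in required:
        if t not in name and t not in desc:
            return None
    score = 0
    for t in required + optional:
        if t == server:
            score += 12          # exact match on an MCP server name
        elif t in tokenize(tool["name"]):
            score += 6           # whole-token match in the tool name
        elif t in name:
            score += 5           # partial match on the tool name
        elif re.search(rf"\b{re.escape(t)}\b", desc):
            score += 2           # word-boundary match in the description
    return score
```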
ToolSearch doesn't return the schema as text. Instead, it returns tool_reference content blocks:
```python
# ToolSearch returns this as tool_result (field names illustrative)
{"type": "tool_reference", "tool_name": "WebSearch"}
```
tool_reference is an API-level protocol. The client sends this block; the API server expands it into the full tool schema within the model's context. The model can then call WebSearch just like any other tool. Crucially, schema expansion happens server-side — the client never needs to stuff the schema into the tool_result itself.
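Concretely, the client side of the exchange might look like the following sketch. The field names beyond the `tool_reference` type are assumptions; the point is what is absent: no schema text appears anywhere in the result.

```python
# Hedged sketch of the tool_result a client sends back for a ToolSearch call.
tool_search_result = {
    "type": "tool_result",
    "tool_use_id": "toolu_01abc",  # hypothetical id
    "content": [
        # No schema text here: the API server expands each reference
        # into the full tool definition inside the model's context.
        {"type": "tool_reference", "tool_name": "WebSearch"},
    ],
}
```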
## defer_loading: API-Level Coordination
In the API request's `tools` array, deferred tools carry a `defer_loading: true` field:
```python
# Always-loaded tool (full schema)
{"name": "Read", "input_schema": {...}}
```
Note that deferred tools still include their full schema in the `tools` array, but when the API server sees the `defer_loading` flag, it withholds the schema from the model. The model only knows the tool exists by name. This differs from not sending the tool at all: the tool is still registered on the API side, so the moment a `tool_reference` fires, the server can expand the full definition immediately. Using this feature requires a special beta header and is currently supported only by the first-party Anthropic API and Foundry. Haiku does not support it.
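A `tools` array mixing the two classes might look like the following sketch. The schemas are illustrative placeholders; only the `defer_loading` flag itself comes from the behavior described above.

```python
tools = [
    {   # Always-loaded: the model sees this schema on every request.
        "name": "Read",
        "description": "Read a file from disk",
        "input_schema": {"type": "object",
                         "properties": {"path": {"type": "string"}},
                         "required": ["path"]},
    },
    {   # Deferred: the full schema is still sent, but the API server
        # withholds it until a tool_reference expands it for the model.
        "name": "WebSearch",
        "description": "Search the web",
        "input_schema": {"type": "object",
                         "properties": {"query": {"type": "string"}},
                         "required": ["query"]},
        "defer_loading": True,
    },
]
```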
## Tracking Discovered Tools
Once a tool has been discovered, it needs to remain available for the rest of the conversation. The implementation scans the message history for tool_reference blocks:
```python
def extract_discovered_tools(messages):
    ...
```
When building each API request, always-loaded tools are sent as usual. Among deferred tools, only those present in the discovered set are included. ToolSearch itself is always included. This way, each turn sends only the subset of tools the model actually needs.
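A fuller sketch of the truncated function above, assuming dict-shaped messages; the `preCompactDiscoveredTools` boundary metadata is the compaction-recovery mechanism described later in this article.

```python
def extract_discovered_tools(messages):
    """Collect names of deferred tools discovered so far (message shapes assumed)."""
    discovered = set()
    for msg in messages:
        # Compact-boundary messages carry the names that were discovered
        # before compaction destroyed their tool_reference blocks.
        meta = msg.get("metadata", {})
        discovered.update(meta.get("preCompactDiscoveredTools", []))
        # Normal case: ToolSearch results contain tool_reference blocks.
        content = msg.get("content", [])
        if isinstance(content, list):
            for block in content:
                if isinstance(block, dict) and block.get("type") == "tool_reference":
                    discovered.add(block["tool_name"])
    return discovered
```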
## Three Modes: Always-Defer, Auto, and Never-Defer
Deferred loading behavior is controlled by the `ENABLE_TOOL_SEARCH` environment variable, which maps to three modes:
| Value | Mode | Behavior |
|---|---|---|
| unset / `true` | tst | Always defer. MCP tools and built-in tools marked `shouldDefer` are all deferred |
| `auto` / `auto:N` | tst-auto | Auto. Deferral kicks in only when the total tokens of deferrable tool schemas exceed N% of the context window (default 10%) |
| `false` | standard | Never defer. All tools are inlined; ToolSearch is not loaded |
Auto mode is an elegant design. It first uses the token-counting API to compute the total token cost of all deferrable schemas; if that API is unavailable, it falls back to dividing character count by 2.5 as an approximation. It then compares the result against the percentage threshold of the context window. A small project with two or three MCP tools totaling under 1,500 tokens? Not worth the overhead. A large project wired up to a dozen MCP servers where tool schemas easily blow past 10K tokens? Deferred loading pays for itself. Auto mode lets the system adapt instead of forcing a one-size-fits-all policy.
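Auto mode's decision can be sketched in a few lines. The chars-divided-by-2.5 fallback and the 10% default come from the behavior described above; `count_tokens_api` is a stand-in name for the real token-counting endpoint, not its actual interface.

```python
def should_defer_tools(deferrable_schemas, context_window,
                       threshold_pct=10, count_tokens_api=None):
    """Return True when deferrable schemas exceed the context-window threshold."""
    if count_tokens_api is not None:
        total = count_tokens_api(deferrable_schemas)   # precise count
    else:
        # Fallback approximation: characters / 2.5
        total = sum(len(s) for s in deferrable_schemas) / 2.5
    return total > context_window * threshold_pct / 100
```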
## Recovery After Compaction
As conversations grow, Claude Code performs context compaction — compressing message history into a summary. The problem is that compaction destroys tool_reference blocks, causing previously discovered tools to vanish from subsequent requests. The solution has two layers.
The first layer is a compact boundary marker. Before compaction runs, the system scans the current message history for all discovered tool names and stores them in the boundary message's metadata (`preCompactDiscoveredTools`). When `extract_discovered_tools` later encounters this boundary message, it adds those names back to the discovered set.
The second layer is the deferred_tools_delta attachment. This is an incremental notification mechanism: before each turn, the system diffs the current pool of deferrable tools against the set it has already notified the model about. If there are additions or removals, it injects a delta message into the conversation telling the model which deferred tools are now available and which are no longer available (e.g., because an MCP server disconnected). After compaction, the delta is regenerated, ensuring the model's awareness of available tools isn't broken by the compression.
```
# compact boundary:
```
The two layers complement each other: the boundary ensures correct tool filtering at the API level; the delta ensures the model's awareness stays correct.
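The delta computation itself reduces to a set difference. A minimal sketch, assuming set-valued bookkeeping and an injected user-role message (both assumptions; the real message format is not documented here):

```python
def build_deferred_tools_delta(current_pool, already_notified):
    """Diff the deferrable-tool pool against what the model was already told."""
    added = sorted(current_pool - already_notified)
    removed = sorted(already_notified - current_pool)
    if not added and not removed:
        return None  # nothing changed; inject no message this turn
    lines = []
    if added:
        lines.append("Deferred tools now available: " + ", ".join(added))
    if removed:
        lines.append("Deferred tools no longer available: " + ", ".join(removed))
    return {"role": "user", "content": "\n".join(lines)}
```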
## Special Handling for MCP Tools
MCP tools are deferred by default, but an MCP server can exempt a tool by setting `_meta["anthropic/alwaysLoad"]` in its definition. This is an escape hatch for critical tools the model needs to know about from the very first turn.
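Such a definition might look like the following; everything except the `_meta["anthropic/alwaysLoad"]` key is an illustrative placeholder.

```python
mcp_tool_definition = {
    "name": "lookup_ticket",                       # hypothetical tool
    "description": "Fetch a ticket from the issue tracker",
    "inputSchema": {"type": "object",
                    "properties": {"id": {"type": "string"}}},
    "_meta": {"anthropic/alwaysLoad": True},       # exempt from deferral
}
```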
From observed behavior, deferred loading of MCP tools has an additional wrinkle: MCP servers may connect asynchronously. When ToolSearch finds no matching tools, it checks whether any MCP servers are still in the process of connecting. If so, it tells the model "some MCP servers are still connecting, try searching again." This is a graceful retry hint — the model doesn't need to understand the MCP connection state machine; it just needs to know "nothing found yet, try again shortly."
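The retry hint can be sketched as a small fallback after the search returns empty. The server-state dicts and function name are assumptions; only the "still connecting, try again" behavior comes from the observation above.

```python
def tool_search_fallback(matches, mcp_servers):
    """Return matches, or a retry hint if empty results may be transient."""
    if matches:
        return matches
    if any(s["status"] == "connecting" for s in mcp_servers):
        return ("No matching tools yet; some MCP servers are still "
                "connecting. Try searching again shortly.")
    return "No matching tools found."
```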
## The Full Picture
Putting all the pieces together, a typical deferred tool invocation looks like this:
```
API Request (Turn 1)
```
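Reconstructed from the mechanics in the earlier sections, the truncated trace above might read roughly like this (message shapes and the example tool are illustrative):

```
Turn 1  request:  full schemas for Bash, Read, ..., ToolSearch
                  + names only for deferred tools (defer_loading: true)
        model:    calls ToolSearch("web search")
        client:   returns tool_result containing tool_reference → WebSearch
Turn 2  request:  API server expands WebSearch's full schema into context
        model:    calls WebSearch like any other tool
```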
From a token-economics perspective, if 30 tools are deferred at 500 tokens each, that's 15K tokens saved from the system prompt. For a model with a 200K context window, that means 7.5% more usable context. The cost is one extra ToolSearch round-trip — roughly 200 input tokens plus a few dozen output tokens. As long as the model isn't searching for tools on every single turn, the net benefit is positive.
## Edge Cases Worth Noting
A few details are worth calling out. First, third-party API proxies (configured via `ANTHROPIC_BASE_URL`) have deferred loading disabled by default, because `tool_reference` is a beta feature that proxies may not recognize. However, if the user explicitly sets `ENABLE_TOOL_SEARCH=true`, the system assumes the proxy can handle it. Second, LSP tools have special logic: if the LSP is still initializing, the tool is temporarily deferred even if it isn't marked `shouldDefer`, and it reverts to normal once initialization completes. Finally, sub-agent conversations (via the Agent tool) are isolated; their ToolSearch discoveries don't propagate to the main thread. This is also why `deferred_tools_delta` distinguishes call sites for sub-agents: a sub-agent seeing `prior=0` on its first turn is expected, while the main thread seeing `prior=0` on its first turn is likely a bug.
The fascinating thing about the entire deferred loading system is that it's essentially reinventing dynamic linking inside an LLM's context window. A program doesn't load every shared library at startup; it `dlopen`s them on first call. Here the "address space" is the token budget, the "symbol table" is the tool schema, and the "dynamic linker" is ToolSearch. Back in the day, Unix used lazy loading to save memory. Now Claude Code uses lazy loading to save tokens. Computer science has no new problems, only new resource constraints.