Deferred Tool Loading in Claude Code
When you use Claude Code, there’s something you probably never notice: it has over 40 registered tools, but when you ask it to read a file or edit a few lines of code, it only uses three or four. The definitions for the remaining 30-plus tools, each around 500 tokens, add up to over 10,000 tokens of fixed overhead per request. You just want to change one line of CSS, but you’re paying for WebSearch, NotebookEdit, CronCreate, and a bunch of tools you’ll never touch.
This problem has a well-known solution in traditional software: dynamic linking. Instead of loading every shared library at startup, the program waits until a function is actually called before loading its library. Claude Code does something similar, except it’s not managing memory address space. It’s managing token budget.
How Big Is the Problem
The more capable an LLM agent becomes, the more tools it registers. File reading, writing, searching, and command execution are the basics. Add notebook editing, scheduled tasks, plan mode, web search, and various MCP extension tools, and the count easily passes 40. Every API request must include the name, description, and parameter schema of every tool as part of the context sent to the model.
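To make that concrete, here is the rough shape of one definition in the Anthropic Messages API tools format. The description and parameters shown for WebSearch are illustrative, not Claude Code's actual definition:

```typescript
// One entry in the `tools` array sent with every request
// (Anthropic Messages API format). The prose description plus the
// JSON parameter schema is what pushes each tool toward ~500 tokens.
const webSearchTool = {
  name: "WebSearch",
  description:
    "Search the web and return relevant results. Use this when the user " +
    "asks about information that may be newer than your training data.",
  input_schema: {
    type: "object",
    properties: {
      query: { type: "string", description: "The search query" },
      max_results: { type: "number", description: "Maximum number of results" },
    },
    required: ["query"],
  },
};
```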
In a 200K context window, these tool definitions alone account for nearly 7%: 30-plus tools at roughly 500 tokens each comes to on the order of 14,000 tokens. This isn't a one-time cost; it's paid every turn. And since Anthropic charges by the token, even with cache hits these definitions consume cache space, reducing caching efficiency for everything else.
More critically, the more tokens tool definitions consume, the less room there is for actual conversation and code. In long sessions, this 7% fixed overhead can be the difference between triggering an extra compression and not.
Two Classes of Tools
Claude Code’s solution is to split tools into two categories.
The first category is always-loaded tools, included with full definitions in every request.
These are high-frequency tools used in almost every task. Note that ToolSearch itself is always loaded because it’s the entry point for loading other tools. If it were deferred, nothing could load anything.
The second category is deferred tools. Only their names are sent, not their full definitions. The model knows these tools exist but can’t see their parameters or usage. This includes WebSearch, TodoWrite, NotebookEdit, all Cron tools, plan mode tools, and every MCP extension tool.
The heuristic is intuitive: tools likely needed in the first turn stay loaded; tools that might go unused the entire session are deferred. MCP tools are natural candidates for deferral. They’re configured per-project, their count is unpredictable, and some projects connect a dozen MCP servers with tool definitions easily exceeding 10,000 tokens.
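As a rough sketch of how the request-assembly step might implement this split (the tool names in ALWAYS_LOADED and the function shapes here are assumptions for illustration, not Claude Code's actual internals):

```typescript
interface ToolDef {
  name: string;
  description: string;
  input_schema: object;
}

// Illustrative set: high-frequency tools that ship with full definitions.
const ALWAYS_LOADED = new Set(["Read", "Write", "Edit", "Grep", "Bash", "ToolSearch"]);

function buildToolList(all: ToolDef[], loadedBySearch: Set<string>) {
  return all.map((tool) =>
    ALWAYS_LOADED.has(tool.name) || loadedBySearch.has(tool.name)
      ? tool // full definition: name, description, parameter schema
      : { name: tool.name } // deferred: the model knows it exists, nothing more
  );
}
```

The loadedBySearch set is what makes discovery sticky: once ToolSearch has loaded a tool, later requests inline its full definition.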
The Discovery Mechanism
How does the model load a deferred tool? By calling ToolSearch.
Say the user asks “search for Claude Code release notes.” The model determines it needs WebSearch, but the context only contains the tool’s name with no parameter schema. The model first calls ToolSearch, specifying it wants WebSearch. ToolSearch returns a special reference marker. When the API server sees this marker, it injects WebSearch’s full definition into the model’s context. The model can then call WebSearch normally.
The flow looks roughly like this:

1. User: "search for Claude Code release notes"
2. Model calls ToolSearch, asking for WebSearch
3. ToolSearch returns a reference marker
4. The API server sees the marker and injects WebSearch's full definition into context
5. Model calls WebSearch with the actual query
The cost is one extra ToolSearch call, roughly 200 input tokens. But the savings are the fixed overhead of 30-plus tool definitions, over 10,000 tokens. As long as you’re not searching for new tools every turn, the net benefit is positive. And once a tool has been loaded, it stays loaded for subsequent requests.
ToolSearch also supports fuzzy search. When the model isn’t sure of a tool’s exact name, it can query by keyword. Searching “notebook jupyter” finds NotebookEdit, and “slack send” finds the corresponding MCP tool.
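A hypothetical sketch of such a handler; the keyword scoring and the tool_reference marker shape are assumptions here, since the real ranking and wire format aren't documented in this post:

```typescript
function toolSearch(
  query: string,
  deferred: { name: string; description: string }[],
  loaded: Set<string>
) {
  // Naive keyword match: score each deferred tool by how many query
  // words appear in its name or description.
  const keywords = query.toLowerCase().split(/\s+/);
  const scored = deferred
    .map((tool) => {
      const haystack = `${tool.name} ${tool.description}`.toLowerCase();
      return { tool, score: keywords.filter((kw) => haystack.includes(kw)).length };
    })
    .filter((entry) => entry.score > 0)
    .sort((a, b) => b.score - a.score);

  // Mark matches as loaded so subsequent requests inline their full
  // definitions, and return markers for the server to expand in context.
  for (const { tool } of scored) loaded.add(tool.name);
  return scored.map(({ tool }) => ({ type: "tool_reference", name: tool.name }));
}
```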
Surviving Compression
A previous post discussed Claude Code’s context compression mechanism. Compression condenses message history into summaries, but tool references loaded earlier get lost in the process. Without handling this, the model forgets it ever loaded WebSearch after compression and has to search for it again.
Claude Code solves this with two mechanisms. The first records the list of loaded tools in the compaction boundary message; after compression, the system scans this boundary and restores the loaded set. The second is incremental notification: before each turn, the system compares the currently available deferred tools against what the model has already been told about. If anything has changed, such as an MCP server disconnecting or a new one connecting, it generates a message to inform the model.
Together, these two layers ensure that the API-level tool filtering stays correct and the model’s awareness of available tools remains intact. Compression doesn’t break tool state.
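In code, the two layers might look something like this; the CompactionBoundary shape is assumed for illustration:

```typescript
interface CompactionBoundary {
  type: "compaction_boundary";
  summary: string;       // the condensed message history
  loadedTools: string[]; // tool references that survive compression
}

// Mechanism 1: after compression, scan boundary messages and restore
// the set of tools the model had already loaded.
function restoreLoadedTools(history: { type: string }[]): Set<string> {
  const loaded = new Set<string>();
  for (const msg of history) {
    if (msg.type === "compaction_boundary") {
      for (const name of (msg as CompactionBoundary).loadedTools) loaded.add(name);
    }
  }
  return loaded;
}

// Mechanism 2: before each turn, diff the available deferred tools
// against what the model was last told (e.g. an MCP server connected
// or dropped) and emit a notification message if anything changed.
function deferredToolsNotice(current: Set<string>, announced: Set<string>): string | null {
  const added = [...current].filter((n) => !announced.has(n));
  const removed = [...announced].filter((n) => !current.has(n));
  if (added.length === 0 && removed.length === 0) return null;
  const parts: string[] = [];
  if (added.length) parts.push(`New tools available: ${added.join(", ")}`);
  if (removed.length) parts.push(`Tools no longer available: ${removed.join(", ")}`);
  return parts.join(". ");
}
```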
Adaptive Activation
Not every scenario needs deferred loading. If a project only uses two or three MCP tools totaling under 1,500 tokens, the extra ToolSearch call is just wasted time.
Claude Code has an automatic mode: it calculates the total token count of all deferrable tool definitions, and if it exceeds 10% of the context window, deferred loading kicks in. Below that threshold, everything is inlined. Small projects with few tools run more efficiently with everything loaded; large projects with many tools save more with deferred loading. The system decides on its own. Users don’t need to think about it.
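The heuristic itself is a few lines. This sketch assumes each deferrable tool's definition token count has already been measured:

```typescript
// Deferred loading activates only when the deferrable definitions
// would eat more than 10% of the context window.
const DEFER_THRESHOLD = 0.10;

function shouldDeferTools(
  deferrable: { name: string; definitionTokens: number }[],
  contextWindow: number // e.g. 200_000
): boolean {
  const total = deferrable.reduce((sum, t) => sum + t.definitionTokens, 0);
  return total > contextWindow * DEFER_THRESHOLD;
}
```

At a 200K window, the threshold works out to 20,000 tokens of definitions.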
Third-party API proxies have this feature disabled by default, since the tool reference marker is an Anthropic-specific capability that proxies may not support. But if users confirm their proxy can handle it, they can enable it manually.
Why This Design Is Interesting
Deferred tool loading looks like a very specific engineering optimization, but the tension behind it is universal: the more capable an agent becomes, the more tools it registers, the larger the context overhead, and the less room is left for actual work. Capability itself becomes a burden.
This tension showed up in traditional software long ago. The more functions a program can call, the larger and slower it gets if everything is bundled in. The solution was lazy loading: load each library only when it’s first called, trading one extra level of indirection for dramatically reduced resource consumption. What Claude Code does with tools is exactly the same thing, except it’s not saving memory. It’s saving tokens.