Dissecting Claude Code's RAG Mechanism
Claude Code has no vector database and no embedding index, yet it can pinpoint the exact file you need in a million-line codebase. Behind this is a retrieval architecture completely different from traditional RAG.
This Isn't the RAG You Know
If you've used RAG before, the pipeline should be familiar: build an offline index, user asks a question, vector-search for Top-K chunks, inject into prompt, generate an answer. A straight line, one pass, done.
Claude Code doesn't work like that at all. It has no offline index. The model itself drives the retrieval process.
Traditional RAG:
+--------+ +-----------+ +----------+ +--------+
| Query | --> | Vector DB | --> | Top-K | --> | LLM |
| | | (offline | | chunks | | answer |
| | | indexed) | | injected | | |
+--------+ +-----------+ +----------+ +--------+
One-shot: retrieve once, generate once.
Claude Code (Agentic RAG):
+--------+ +------------------+ +--------+
| Query | --> | LLM decides what | --> | Tool |
| | | to search for | | result |
+--------+ +-------+----------+ +---+----+
^ |
| not enough? |
+--------------------+
loop until satisfied
Multi-hop: model drives retrieval in a loop.
Traditional RAG has a fixed retrieval strategy, hardcoded in the pipeline. Claude Code's retrieval strategy is dynamic: the model decides what to search for, how many times, and which tool to use, all based on the current context.
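The loop can be sketched in a dozen lines. Everything here — `model_step`, the action dict shape, the tool table — is a hypothetical stand-in for illustration, not Claude Code's actual interface:

```python
def agentic_retrieval(query, tools, model_step, max_turns=10):
    """Let the model drive retrieval: it picks a tool, sees the
    result, and decides whether to search again or answer."""
    context = [{"role": "user", "content": query}]
    for _ in range(max_turns):
        action = model_step(context)           # model decides the next step
        if action["type"] == "answer":         # enough context gathered
            return action["text"]
        result = tools[action["tool"]](**action["args"])
        context.append({"role": "tool", "content": result})
    return None  # turn budget exhausted without an answer
```

The contrast with traditional RAG is the `for` loop: retrieval happens zero, one, or many times, and the model — not the pipeline — decides when to stop.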
Four-Layer Retrieval Architecture
Claude Code's context isn't assembled all at once. It's injected progressively across four layers:
+===============================================================+
| CONTEXT WINDOW |
+===============================================================+
| |
| Layer 0: STATIC CONTEXT (loaded once at session start) |
| +---------------------------------------------------------+ |
| | System Prompt | CLAUDE.md | Git Status | Memory Index | |
| +---------------------------------------------------------+ |
| |
| Layer 1: SMART PRE-INJECTION (before model sees the query) |
| +---------------------------------------------------------+ |
| | Sonnet Memory Recall | @file mentions | Skill Discovery | |
| +---------------------------------------------------------+ |
| |
| Layer 2: MODEL-DRIVEN RETRIEVAL (tool use loop) |
| +---------------------------------------------------------+ |
| | Glob -> Grep -> Read -> ... (model decides) | |
| +---------------------------------------------------------+ |
| |
| Layer 3: DELEGATED RETRIEVAL (sub-agents) |
| +---------------------------------------------------------+ |
| | Explore Agent | Fork Agent (parallel research) | |
| +---------------------------------------------------------+ |
| |
+===============================================================+
Layer 0: Static Context
At the start of every session, the system automatically loads a set of resident context: the System Prompt (tool usage guide, behavioral rules, environment info), CLAUDE.md (project-level configuration), Git Status (current branch, last 5 commits), and the MEMORY.md index file.
There's a key design detail here: the System Prompt is split into a static region and a dynamic region, separated by a boundary marker.
system_prompt = [
# --- Static (cross-session cacheable) ---
intro_section, # identity & rules
system_section, # tool instructions
doing_tasks_section, # task guidelines
actions_section, # safety rules
using_tools_section, # tool selection guide
tone_and_style, # output style
output_efficiency, # brevity rules
"__SYSTEM_PROMPT_DYNAMIC_BOUNDARY__", # <-- cache boundary
# --- Dynamic (changes per session) ---
session_guidance, # session-specific guidance
memory_prompt, # persistent memory
env_info, # environment info
language, # language preference
mcp_instructions, # MCP server instructions
]
Everything before the boundary marker can share Prompt Cache across sessions, with all users sharing the same cached copy. Everything after differs per session. The static portion costs essentially nothing.
Layer 1: Smart Pre-Injection
After the user sends a message but before the model starts reasoning, the system runs a batch of pre-retrieval operations in parallel:
User sends message: "Fix the auth bug in payment API"
|
v (parallel, before model sees anything)
+-----------+-----------+-----------+
| | | |
v v v v
@mention Memory Skill Agent
file scan Prefetch Discovery Listing
| | | |
+-----+-----+-----+-----+-----+----+
| |
v v
[Attachment Messages injected into conversation]
The most elegant piece is Memory Prefetch. It uses the Sonnet model to run a lightweight sideQuery, selecting up to 5 relevant files from persistent memory:
def prefetch_relevant_memories(user_query, messages):
    already_surfaced = collect_surfaced_memories(messages)
    recent_tools = collect_recent_successful_tools(messages)
    # Don't re-offer memories already surfaced earlier in the session
    candidates = [m for m in memory_manifest if m not in already_surfaced]
    selected = sonnet_side_query(
        system="Select up to 5 memories useful for the query",
        user=f"Query: {user_query}\nAvailable: {candidates}",
        recent_tools=recent_tools,
    )
    return load_selected_memories(selected)
The sideQuery is a key pattern in Claude Code: outside the main loop, a smaller model makes auxiliary decisions, and the results are injected as attachments into the next conversation turn. The cost is minimal (Sonnet is much cheaper than Opus), but context relevance improves dramatically.
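A minimal sketch of the pattern — the `cheap_model` callable, the manifest format, and the attachment-message shape are all invented for illustration:

```python
def side_query_select(cheap_model, user_query, manifest, k=5):
    """Ask a cheap model to pick up to k memory files for the query."""
    prompt = (f"Select up to {k} memory files useful for this query.\n"
              f"Query: {user_query}\nAvailable:\n" + "\n".join(manifest))
    picks = cheap_model(prompt)
    # Validate against the manifest and enforce the cap
    return [p for p in picks if p in manifest][:k]

def inject_as_attachment(messages, files, load):
    """Inject prefetched memories as an attachment message for the next turn."""
    body = "\n\n".join(f"<memory file={f}>\n{load(f)}\n</memory>" for f in files)
    return messages + [{"role": "user", "content": body, "attachment": True}]
```

The validation step matters: a side-query model can hallucinate file names, so its picks are filtered against the known manifest before anything is loaded.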
Layer 2: Model-Driven Retrieval
This is the core of Agentic RAG. The model has three search tools and autonomously decides the calling order and frequency:
| Tool | Capability | Typical Use |
|---|---|---|
| Glob | Match files by name pattern | src/**/*.ts, find file structure |
| Grep | Search content by regex | function handleAuth, find implementation |
| Read | Read file content | Read target file precisely |
The model typically narrows scope in a funnel pattern of Glob → Grep → Read:
Model thinks: "I need to find the auth handler"
|
v
Glob("src/**/*auth*.ts") --> 8 files found
|
v
Grep("handleAuth", path="src/") --> 3 files match
|
v
Read("src/api/auth/handler.ts") --> full content
|
v
(enough context, start working)
The key point is that this isn't a fixed pipeline. The model might Read directly (if the user provided a file path), Grep multiple times (first attempt missed, try different keywords), or launch multiple parallel searches simultaneously. This flexibility is something traditional RAG simply can't match.
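The funnel itself needs nothing beyond the standard library. Here is a runnable sketch over a made-up directory layout (file names and contents are illustrative):

```python
import re
import tempfile
from pathlib import Path

def glob_step(root, pattern):            # Glob: narrow by file name
    return sorted(root.glob(pattern))

def grep_step(paths, regex):             # Grep: narrow by file content
    rx = re.compile(regex)
    return [p for p in paths if rx.search(p.read_text())]

def read_step(path):                     # Read: full content of one file
    return path.read_text()

root = Path(tempfile.mkdtemp())
(root / "src/api/auth").mkdir(parents=True)
(root / "src/api/auth/auth_handler.py").write_text("def handle_auth(): ...\n")
(root / "src/api/auth/auth_types.py").write_text("AUTH_SCHEMES = ['basic']\n")

candidates = glob_step(root, "src/**/*auth*.py")   # 2 files by name
hits = grep_step(candidates, r"handle_auth")       # 1 file by content
content = read_step(hits[0])                       # the surviving file
```

Each stage is cheap relative to the next: listing names costs almost nothing, scanning content costs more, and reading full files costs the most tokens — which is exactly why the funnel narrows in that order.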
Layer 3: Sub-Agent Delegated Retrieval
When the search task is heavy (e.g., "help me understand this project's authentication architecture"), the model can spin up an Explore Agent to do the work:
Main Agent context: Explore Agent context:
+---------------------------+ +---------------------------+
| User query | | Search directive |
| ... (valuable context) | | Glob, Grep, Read results |
| | | ... (raw search output) |
| [Agent tool call] -------|----> | ... (lots of content) |
| | +---------------------------+
| [Agent result: summary] <-|---- | Final summary (concise) |
| | +---------------------------+
| (context stays clean!) |
+---------------------------+
The Explore Agent's design is deliberately restrained: read-only (can't create, modify, or delete any files), runs on Haiku (the fastest model), keeps all raw search results in its own context, returns only a refined summary, and doesn't even load CLAUDE.md (the main Agent already has it).
This solves a core problem with Agentic RAG: the search process itself consumes massive context. If all Glob/Grep/Read results stayed in the main context, a few rounds of searching would fill it up. The sub-agent acts as a context firewall.
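The firewall boils down to a scoping rule, sketched here with invented `search.plan`/`search.run` helpers standing in for the sub-agent's tool loop:

```python
def explore_agent(directive, search, summarize):
    scratch = []                           # sub-agent's private context
    for query in search.plan(directive):
        scratch.append(search.run(query))  # raw results accumulate here
    return summarize(scratch)              # only the summary crosses back

def main_agent_turn(context, directive, explore):
    summary = explore(directive)           # delegate the heavy search
    context.append({"role": "tool", "content": summary})
    return context                         # raw output never entered here
```

`scratch` is a local variable: when `explore_agent` returns, every raw Glob/Grep/Read result is discarded, and the main context pays only for the summary.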
Engineering Highlights
Search Result Token Budget Controls
Every search tool has a result trimming mechanism to prevent a single search from flooding the context:
+-- Search Result Budget Controls --+
| |
| Grep: |
| Default head_limit = 250 lines |
| Max result size = 20KB chars |
| Max line width = 500 chars |
| (base64/minified auto-trimmed) |
| |
| Glob: |
| Max 100 files returned |
| Sorted by mtime (newest first) |
| Paths relativized to save toks |
| |
| Read: |
| Max 25,000 tokens per read |
| Max 256KB file size |
| Default 2000 lines |
| Supports offset + limit paging |
| |
+-----------------------------------+
Grep's head_limit=250 is a well-chosen default: large enough to cover the vast majority of exploratory searches, small enough to avoid dumping 6000+ tokens of search results in one shot. If the model knows it needs more, it can pass head_limit=0 (unlimited) or paginate with offset.
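A sketch of such a trimming function, using the article's numbers as defaults — the function itself is illustrative, not Claude Code's implementation:

```python
def trim_grep_result(lines, head_limit=250, max_bytes=20_000, max_width=500):
    """Trim search output: cap line count, line width, and total size.
    head_limit=0 means unlimited, matching the convention described above."""
    out, used = [], 0
    for line in lines:
        if head_limit and len(out) >= head_limit:
            out.append(f"... [truncated at {head_limit} lines]")
            break
        line = line[:max_width]              # cap pathological lines
        if used + len(line) > max_bytes:     # cap total result size
            out.append("... [truncated at size limit]")
            break
        out.append(line)
        used += len(line)
    return out
```

Note that truncation is announced in-band: the model sees the marker line and can decide to re-search with a higher limit or an offset, rather than silently working from a partial result.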
Read Tool Deduplication
Reading the same file twice is common — first discovered via search, then re-read before editing to confirm. Claude Code detects this automatically:
def read_file(path, offset, limit):
    existing = read_file_state.get(path)
    mtime = get_file_mtime(path)
    if existing and existing.offset == offset and existing.limit == limit:
        if mtime == existing.timestamp:
            # dedup hit: saves ~25K tokens per skipped re-read
            return "File unchanged since last read."
    content = read_file_content(path, offset, limit)
    read_file_state.set(path, content, mtime, offset, limit)
    return content
Roughly 18% of Read calls hit the dedup cache, saving an entire file's worth of tokens each time.
Path Relativization
An inconspicuous but ubiquitous optimization: all absolute paths in search results are converted to relative paths.
# Before (absolute paths waste tokens):
/Users/john/projects/my-app/src/components/auth/LoginForm.tsx
/Users/john/projects/my-app/src/components/auth/AuthProvider.tsx
# After (relative to cwd):
src/components/auth/LoginForm.tsx
src/components/auth/AuthProvider.tsx
Seems trivial, but across a session with dozens of searches, this saves thousands of tokens.
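The transformation is a one-liner with `os.path.relpath`; a minimal sketch using the paths from the snippet above:

```python
import os

def relativize(paths, cwd):
    """Convert absolute paths to paths relative to the working directory."""
    return [os.path.relpath(p, cwd) for p in paths]

cwd = "/Users/john/projects/my-app"
paths = ["/Users/john/projects/my-app/src/components/auth/LoginForm.tsx",
         "/Users/john/projects/my-app/src/components/auth/AuthProvider.tsx"]
rel = relativize(paths, cwd)
# rel[0] == 'src/components/auth/LoginForm.tsx'
```

Each converted path here drops 28 characters of repeated prefix — roughly 7 tokens per path, multiplied by every file in every search result.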
System Prompt Cache Partitioning
The __SYSTEM_PROMPT_DYNAMIC_BOUNDARY__ mentioned earlier works with the Anthropic API's scope: 'global' caching strategy to share the static portion across all users:
API Request:
+--------------------------------------------------+
| system[0]: intro + tools + rules | scope: 'global'
| system[1]: actions + style | (shared across ALL users)
| |
| ---- DYNAMIC BOUNDARY ---- |
| |
| system[2]: session guidance | scope: 'session'
| system[3]: memory + env info | (per-user)
+--------------------------------------------------+
| messages: [user, assistant, ...] |
+--------------------------------------------------+
Everything before the boundary (tool descriptions, behavioral rules, etc.) is identical across all users and only needs to be cached once. Your very first request might hit a system prompt prefix that someone else already cached. This is deduplication at the global level.
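The partitioning can be sketched as a split at the boundary marker. The `scope` values follow the article's description, not a verified public API field, and the block shape is simplified for illustration:

```python
BOUNDARY = "__SYSTEM_PROMPT_DYNAMIC_BOUNDARY__"

def partition_system_prompt(sections):
    """Split system prompt sections at the boundary marker and tag each
    block with its assumed cache scope."""
    idx = sections.index(BOUNDARY)
    static = [{"text": s, "cache_control": {"scope": "global"}}
              for s in sections[:idx]]       # identical for all users
    dynamic = [{"text": s, "cache_control": {"scope": "session"}}
               for s in sections[idx + 1:]]  # varies per session
    return static + dynamic
```

The boundary marker itself is consumed by the split: it exists only to tell the assembler where the globally cacheable prefix ends.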
End-to-End Flow
SESSION START
|
+-- Load system prompt (static + dynamic)
+-- Load CLAUDE.md into user context
+-- Load MEMORY.md index
+-- Snapshot git status + recent commits
|
USER SENDS: "Fix the auth bug in payment API"
|
+-- [Parallel prefetch, before model runs]
| +-- Sonnet memory recall -> api_gotchas.md, user_prefs.md
| +-- @file scan -> (none mentioned)
| +-- Skill discovery -> (no matching skills)
|
+-- Prefetch results injected as attachment messages
|
MODEL TURN 1:
| Thinks: "I need to find auth-related files"
| +-- Grep("auth.*bug|payment.*auth", type="ts") --> 5 files
| +-- Read("src/api/payment/auth.ts") --> 800 lines
| +-- Read("src/api/payment/__tests__/auth.test.ts") --> 400 lines
|
MODEL TURN 2:
| Thinks: "Found the bug, now fix it"
| +-- Edit("src/api/payment/auth.ts", ...)
| +-- Bash("npm test -- auth")
|
DONE (2 turns of retrieval + 1 turn of action)
Comparison
| Dimension | Traditional RAG | Claude Code |
|---|---|---|
| Index | Offline vectorization | No index, real-time search |
| Retrieval strategy | Fixed (Top-K) | Dynamic (model decides) |
| Retrieval rounds | Single | Multi-round loop |
| Retrieval granularity | Fixed chunks | Filename → content → line-level |
| Context protection | None | Sub-agent isolation |
| Result trimming | Truncation | Multi-layer budget controls |
| Caching | Vector cache | Prompt Cache partitioning + Read dedup |
Claude Code's RAG is fundamentally not "retrieval-augmented generation" but "generation-driven retrieval": the model first understands the need, then decides what to search, evaluates whether the results are sufficient, and searches again if not. It trades a bit of latency for precision and flexibility far beyond traditional RAG.