Dissecting Claude Code's RAG Mechanism
Claude Code has no vector database and no embedding index, yet it can pinpoint the exact file you need in a million-line codebase. Behind this is a retrieval architecture completely different from traditional RAG.
This Isn't the RAG You Know
If you've used RAG before, the pipeline should be familiar: build an offline index, user asks a question, vector-search for Top-K chunks, inject into prompt, generate an answer. A straight line, one pass, done.
Claude Code doesn't work like that at all. It has no offline index. The model itself drives the retrieval process.
Traditional RAG:
+--------+ +-----------+ +----------+ +--------+
| Query | --> | Vector DB | --> | Top-K | --> | LLM |
| | | (offline | | chunks | | answer |
| | | indexed) | | injected | | |
+--------+ +-----------+ +----------+ +--------+
One-shot: retrieve once, generate once.
Claude Code (Agentic RAG):
+--------+ +------------------+ +--------+
| Query | --> | LLM decides what | --> | Tool |
| | | to search for | | result |
+--------+ +-------+----------+ +---+----+
^ |
| not enough? |
+--------------------+
loop until satisfied
Multi-hop: model drives retrieval in a loop.
Traditional RAG has a fixed retrieval strategy, hardcoded in the pipeline. Claude Code's retrieval strategy is dynamic: the model decides what to search for, how many times, and which tool to use, all based on the current context.
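The loop can be sketched in a dozen lines. Everything here — `model_step`, the action dict shape, the tool table — is a hypothetical stand-in for illustration, not Claude Code's actual interface:

```python
def agentic_retrieval(query, tools, model_step, max_turns=10):
    """Let the model drive retrieval: it picks a tool, sees the
    result, and decides whether to search again or answer."""
    context = [{"role": "user", "content": query}]
    for _ in range(max_turns):
        action = model_step(context)           # model decides the next step
        if action["type"] == "answer":         # enough context gathered
            return action["text"]
        result = tools[action["tool"]](**action["args"])
        context.append({"role": "tool", "content": result})
    return None  # turn budget exhausted without an answer
```

The contrast with traditional RAG is the `for` loop: retrieval happens zero, one, or many times, and the model — not the pipeline — decides when to stop.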
Four-Layer Retrieval Architecture
Claude Code's context isn't assembled all at once. It's injected progressively across four layers:
+===============================================================+
| CONTEXT WINDOW |
+===============================================================+
| |
| Layer 0: STATIC CONTEXT (loaded once at session start) |
| +---------------------------------------------------------+ |
| | System Prompt | CLAUDE.md | Git Status | Memory Index | |
| +---------------------------------------------------------+ |
| |
| Layer 1: SMART PRE-INJECTION (before model sees the query) |
| +---------------------------------------------------------+ |
| | Sonnet Memory Recall | @file mentions | Skill Discovery | |
| +---------------------------------------------------------+ |
| |
| Layer 2: MODEL-DRIVEN RETRIEVAL (tool use loop) |
| +---------------------------------------------------------+ |
| | Glob -> Grep -> Read -> ... (model decides) | |
| +---------------------------------------------------------+ |
| |
| Layer 3: DELEGATED RETRIEVAL (sub-agents) |
| +---------------------------------------------------------+ |
| | Explore Agent | Fork Agent (parallel research) | |
| +---------------------------------------------------------+ |
| |
+===============================================================+
Layer 0: Static Context
At the start of every session, the system automatically loads a set of resident context: the System Prompt (tool usage guide, behavioral rules, environment info), CLAUDE.md (project-level configuration), Git Status (current branch, last 5 commits), and the MEMORY.md index file.
There's a key design detail here: the System Prompt is split into a static region and a dynamic region, separated by a boundary marker.
system_prompt = [
# --- Static (cross-session cacheable) ---
intro_section, # identity & rules
system_section, # tool instructions
doing_tasks_section, # task guidelines
actions_section, # safety rules
using_tools_section, # tool selection guide
tone_and_style, # output style
output_efficiency, # brevity rules
"__SYSTEM_PROMPT_DYNAMIC_BOUNDARY__", # <-- cache boundary
# --- Dynamic (changes per session) ---
session_guidance, # session-specific guidance
memory_prompt, # persistent memory
env_info, # environment info
language, # language preference
mcp_instructions, # MCP server instructions
]
Everything before the boundary marker can share Prompt Cache across sessions, with all users sharing the same cached copy. Everything after differs per session. The static portion costs essentially nothing.
Layer 1: Smart Pre-Injection
After the user sends a message but before the model starts reasoning, the system runs a batch of pre-retrieval operations in parallel:
User sends message: "Fix the auth bug in payment API"
|
v (parallel, before model sees anything)
+-----------+-----------+-----------+
| | | |
v v v v
@mention Memory Skill Agent
file scan Prefetch Discovery Listing
| | | |
+-----+-----+-----+-----+-----+----+
| |
v v
[Attachment Messages injected into conversation]
The most elegant piece is Memory Prefetch. It uses the Sonnet model to run a lightweight sideQuery, selecting up to 5 relevant files from persistent memory:
def prefetch_relevant_memories(user_query, messages):
    already_surfaced = collect_surfaced_memories(messages)
    recent_tools = collect_recent_successful_tools(messages)
    # Don't re-offer memories already surfaced earlier in the session
    candidates = [m for m in memory_manifest if m not in already_surfaced]
    selected = sonnet_side_query(
        system="Select up to 5 memories useful for the query",
        user=f"Query: {user_query}\nAvailable: {candidates}",
        recent_tools=recent_tools,
    )
    return load_selected_memories(selected)
The sideQuery is a key pattern in Claude Code: outside the main loop, a smaller model makes auxiliary decisions, and the results are injected as attachments into the next conversation turn. The cost is minimal (Sonnet is much cheaper than Opus), but context relevance improves dramatically.
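A minimal sketch of the pattern — the `cheap_model` callable, the manifest format, and the attachment-message shape are all invented for illustration:

```python
def side_query_select(cheap_model, user_query, manifest, k=5):
    """Ask a cheap model to pick up to k memory files for the query."""
    prompt = (f"Select up to {k} memory files useful for this query.\n"
              f"Query: {user_query}\nAvailable:\n" + "\n".join(manifest))
    picks = cheap_model(prompt)
    # Validate against the manifest and enforce the cap
    return [p for p in picks if p in manifest][:k]

def inject_as_attachment(messages, files, load):
    """Inject prefetched memories as an attachment message for the next turn."""
    body = "\n\n".join(f"<memory file={f}>\n{load(f)}\n</memory>" for f in files)
    return messages + [{"role": "user", "content": body, "attachment": True}]
```

The validation step matters: a side-query model can hallucinate file names, so its picks are filtered against the known manifest before anything is loaded.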
Layer 2: Model-Driven Retrieval
This is the core of Agentic RAG. The model has three search tools and autonomously decides the calling order and frequency:
| Tool | Capability | Typical Use |
|---|---|---|
| Glob | Match files by name pattern | src/**/*.ts, find file structure |
| Grep | Search content by regex | function handleAuth, find implementation |
| Read | Read file content | Read target file precisely |
The model typically narrows scope in a funnel pattern of Glob → Grep → Read:
Model thinks: "I need to find the auth handler"
|
v
Glob("src/**/*auth*.ts") --> 8 files found
|
v
Grep("handleAuth", path="src/") --> 3 files match
|
v
Read("src/api/auth/handler.ts") --> full content
|
v
(enough context, start working)
The key point is that this isn't a fixed pipeline. The model might Read directly (if the user provided a file path), Grep multiple times (first attempt missed, try different keywords), or launch multiple parallel searches simultaneously. This flexibility is something traditional RAG simply can't match.
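The funnel itself needs nothing beyond the standard library. Here is a runnable sketch over a made-up directory layout (file names and contents are illustrative):

```python
import re
import tempfile
from pathlib import Path

def glob_step(root, pattern):            # Glob: narrow by file name
    return sorted(root.glob(pattern))

def grep_step(paths, regex):             # Grep: narrow by file content
    rx = re.compile(regex)
    return [p for p in paths if rx.search(p.read_text())]

def read_step(path):                     # Read: full content of one file
    return path.read_text()

root = Path(tempfile.mkdtemp())
(root / "src/api/auth").mkdir(parents=True)
(root / "src/api/auth/auth_handler.py").write_text("def handle_auth(): ...\n")
(root / "src/api/auth/auth_types.py").write_text("AUTH_SCHEMES = ['basic']\n")

candidates = glob_step(root, "src/**/*auth*.py")   # 2 files by name
hits = grep_step(candidates, r"handle_auth")       # 1 file by content
content = read_step(hits[0])                       # the surviving file
```

Each stage is cheap relative to the next: listing names costs almost nothing, scanning content costs more, and reading full files costs the most tokens — which is exactly why the funnel narrows in that order.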
Layer 3: Sub-Agent Delegated Retrieval
When the search task is heavy (e.g., "help me understand this project's authentication architecture"), the model can spin up an Explore Agent to do the work:
Main Agent context: Explore Agent context:
+---------------------------+ +---------------------------+
| User query | | Search directive |
| ... (valuable context) | | Glob, Grep, Read results |
| | | ... (raw search output) |
| [Agent tool call] -------|----> | ... (lots of content) |
| | +---------------------------+
| [Agent result: summary] <-|---- | Final summary (concise) |
| | +---------------------------+
| (context stays clean!) |
+---------------------------+
The Explore Agent's design is deliberately restrained: read-only (can't create, modify, or delete any files), runs on Haiku (the fastest model), keeps all raw search results in its own context, returns only a refined summary, and doesn't even load CLAUDE.md (the main Agent already has it).
This solves a core problem with Agentic RAG: the search process itself consumes massive context. If all Glob/Grep/Read results stayed in the main context, a few rounds of searching would fill it up. The sub-agent acts as a context firewall.
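The firewall boils down to a scoping rule, sketched here with invented `search.plan`/`search.run` helpers standing in for the sub-agent's tool loop:

```python
def explore_agent(directive, search, summarize):
    scratch = []                           # sub-agent's private context
    for query in search.plan(directive):
        scratch.append(search.run(query))  # raw results accumulate here
    return summarize(scratch)              # only the summary crosses back

def main_agent_turn(context, directive, explore):
    summary = explore(directive)           # delegate the heavy search
    context.append({"role": "tool", "content": summary})
    return context                         # raw output never entered here
```

`scratch` is a local variable: when `explore_agent` returns, every raw Glob/Grep/Read result is discarded, and the main context pays only for the summary.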
Engineering Highlights
Search Result Token Budget Controls
Every search tool has a result trimming mechanism to prevent a single search from flooding the context:
+-- Search Result Budget Controls --+
| |
| Grep: |
| Default head_limit = 250 lines |
| Max result size = 20KB chars |
| Max line width = 500 chars |
| (base64/minified auto-trimmed) |
| |
| Glob: |
| Max 100 files returned |
| Sorted by mtime (newest first) |
| Paths relativized to save toks |
| |
| Read: |
| Max 25,000 tokens per read |
| Max 256KB file size |
| Default 2000 lines |
| Supports offset + limit paging |
| |
+-----------------------------------+
Grep's head_limit=250 is a well-chosen default: large enough to cover the vast majority of exploratory searches, small enough to avoid dumping 6000+ tokens of search results in one shot. If the model knows it needs more, it can pass head_limit=0 (unlimited) or paginate with offset.
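A sketch of such a trimming function, using the article's numbers as defaults — the function itself is illustrative, not Claude Code's implementation:

```python
def trim_grep_result(lines, head_limit=250, max_bytes=20_000, max_width=500):
    """Trim search output: cap line count, line width, and total size.
    head_limit=0 means unlimited, matching the convention described above."""
    out, used = [], 0
    for line in lines:
        if head_limit and len(out) >= head_limit:
            out.append(f"... [truncated at {head_limit} lines]")
            break
        line = line[:max_width]              # cap pathological lines
        if used + len(line) > max_bytes:     # cap total result size
            out.append("... [truncated at size limit]")
            break
        out.append(line)
        used += len(line)
    return out
```

Note that truncation is announced in-band: the model sees the marker line and can decide to re-search with a higher limit or an offset, rather than silently working from a partial result.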
Read Tool Deduplication
Reading the same file twice is common — first discovered via search, then re-read before editing to confirm. Claude Code detects this automatically:
def read_file(path, offset, limit):
    existing = read_file_state.get(path)
    mtime = get_file_mtime(path)
    if existing and existing.offset == offset and existing.limit == limit:
        if mtime == existing.timestamp:
            # dedup hit: saves ~25K tokens per skipped re-read
            return "File unchanged since last read."
    content = read_file_content(path, offset, limit)
    read_file_state.set(path, content, mtime, offset, limit)
    return content
Roughly 18% of Read calls hit the dedup cache, saving an entire file's worth of tokens each time.
Path Relativization
An inconspicuous but ubiquitous optimization: all absolute paths in search results are converted to relative paths.
# Before (absolute paths waste tokens):
/Users/john/projects/my-app/src/components/auth/LoginForm.tsx
/Users/john/projects/my-app/src/components/auth/AuthProvider.tsx
# After (relative to cwd):
src/components/auth/LoginForm.tsx
src/components/auth/AuthProvider.tsx
Seems trivial, but across a session with dozens of searches, this saves thousands of tokens.
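The transformation is a one-liner with `os.path.relpath`; a minimal sketch using the paths from the snippet above:

```python
import os

def relativize(paths, cwd):
    """Convert absolute paths to paths relative to the working directory."""
    return [os.path.relpath(p, cwd) for p in paths]

cwd = "/Users/john/projects/my-app"
paths = ["/Users/john/projects/my-app/src/components/auth/LoginForm.tsx",
         "/Users/john/projects/my-app/src/components/auth/AuthProvider.tsx"]
rel = relativize(paths, cwd)
# rel[0] == 'src/components/auth/LoginForm.tsx'
```

Each converted path here drops 28 characters of repeated prefix — roughly 7 tokens per path, multiplied by every file in every search result.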
System Prompt Cache Partitioning
The __SYSTEM_PROMPT_DYNAMIC_BOUNDARY__ mentioned earlier works with the Anthropic API's scope: 'global' caching strategy to share the static portion across all users:
API Request:
+--------------------------------------------------+
| system[0]: intro + tools + rules | scope: 'global'
| system[1]: actions + style | (shared across ALL users)
| |
| ---- DYNAMIC BOUNDARY ---- |
| |
| system[2]: session guidance | scope: 'session'
| system[3]: memory + env info | (per-user)
+--------------------------------------------------+
| messages: [user, assistant, ...] |
+--------------------------------------------------+
Everything before the boundary (tool descriptions, behavioral rules, etc.) is identical across all users and only needs to be cached once. Your very first request might hit a system prompt prefix that someone else already cached. This is deduplication at the global level.
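The partitioning can be sketched as a split at the boundary marker. The `scope` values follow the article's description, not a verified public API field, and the block shape is simplified for illustration:

```python
BOUNDARY = "__SYSTEM_PROMPT_DYNAMIC_BOUNDARY__"

def partition_system_prompt(sections):
    """Split system prompt sections at the boundary marker and tag each
    block with its assumed cache scope."""
    idx = sections.index(BOUNDARY)
    static = [{"text": s, "cache_control": {"scope": "global"}}
              for s in sections[:idx]]       # identical for all users
    dynamic = [{"text": s, "cache_control": {"scope": "session"}}
               for s in sections[idx + 1:]]  # varies per session
    return static + dynamic
```

The boundary marker itself is consumed by the split: it exists only to tell the assembler where the globally cacheable prefix ends.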
End-to-End Flow
SESSION START
|
+-- Load system prompt (static + dynamic)
+-- Load CLAUDE.md into user context
+-- Load MEMORY.md index
+-- Snapshot git status + recent commits
|
USER SENDS: "Fix the auth bug in payment API"
|
+-- [Parallel prefetch, before model runs]
| +-- Sonnet memory recall -> api_gotchas.md, user_prefs.md
| +-- @file scan -> (none mentioned)
| +-- Skill discovery -> (no matching skills)
|
+-- Prefetch results injected as attachment messages
|
MODEL TURN 1:
| Thinks: "I need to find auth-related files"
| +-- Grep("auth.*bug|payment.*auth", type="ts") --> 5 files
| +-- Read("src/api/payment/auth.ts") --> 800 lines
| +-- Read("src/api/payment/__tests__/auth.test.ts") --> 400 lines
|
MODEL TURN 2:
| Thinks: "Found the bug, now fix it"
| +-- Edit("src/api/payment/auth.ts", ...)
| +-- Bash("npm test -- auth")
|
DONE (2 turns of retrieval + 1 turn of action)
Comparison
| Dimension | Traditional RAG | Claude Code |
|---|---|---|
| Index | Offline vectorization | No index, real-time search |
| Retrieval strategy | Fixed (Top-K) | Dynamic (model decides) |
| Retrieval rounds | Single | Multi-round loop |
| Retrieval granularity | Fixed chunks | Filename → content → line-level |
| Context protection | None | Sub-agent isolation |
| Result trimming | Truncation | Multi-layer budget controls |
| Caching | Vector cache | Prompt Cache partitioning + Read dedup |
Claude Code's RAG is fundamentally not "retrieval-augmented generation" but "generation-driven retrieval": the model first understands the need, then decides what to search, evaluates whether the results are sufficient, and searches again if not. It trades a bit of latency for precision and flexibility far beyond traditional RAG.