Dissecting Claude Code's RAG Mechanism

Published: 2026-04-02 | Category: NLP | Reading time: ≈ 6 minutes

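Claude Code has no vector database and no embedding index, yet it can pinpoint the files you need in a codebase with millions of lines. Behind this is a retrieval architecture that differs fundamentally from traditional RAG.

This Is Not the RAG You Know

Anyone who has used RAG will know the pipeline well: build an index offline, take the user's question, retrieve the Top-K chunks by vector similarity, splice them into the prompt, and generate an answer. It is a straight line that runs once and is done.

Claude Code does not work this way at all. It has no offline index, and the retrieval process is driven by the model itself.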

Traditional RAG:
+--------+     +-----------+     +----------+     +--------+
| Query  | --> | Vector DB | --> | Top-K    | --> | LLM    |
|        |     | (offline  |     | chunks   |     | answer |
|        |     |  indexed) |     | injected |     |        |
+--------+     +-----------+     +----------+     +--------+

      One-shot: retrieve once, generate once.


Claude Code (Agentic RAG):
+--------+     +------------------+     +--------+
| Query  | --> | LLM decides what | --> | Tool   |
|        |     | to search for    |     | result |
+--------+     +-------+----------+     +---+----+
                       ^                    |
                       |   not enough?      |
                       +--------------------+
                       loop until satisfied

      Multi-hop: model drives retrieval in a loop.

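In traditional RAG the retrieval strategy is fixed and hard-coded. In Claude Code it is dynamic: the model decides, based on the current context, what to search for, how many times to search, and which tool to use.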
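To make the contrast concrete, here is a minimal sketch of a model-driven retrieval loop. It is not Claude Code's implementation: the toy corpus, the two tool functions, and `scripted_model` (a hand-written stand-in for the LLM's decision step) are all assumptions made for illustration.

```python
from typing import Callable

# Hypothetical toy corpus standing in for a code base.
FILES = {
    "src/api/auth/handler.ts": "export function handleAuth(req) { /* ... */ }",
    "src/api/payment/charge.ts": "import { handleAuth } from '../auth/handler'",
    "README.md": "Payment service",
}

def grep(pattern: str) -> list[str]:
    """Return the paths whose content contains the literal pattern."""
    return sorted(p for p, text in FILES.items() if pattern in text)

def read(path: str) -> str:
    """Return the full content of one file."""
    return FILES[path]

TOOLS: dict[str, Callable] = {"grep": grep, "read": read}

def scripted_model(history):
    """Stand-in for the LLM: choose the next tool call from what was seen so far."""
    if not history:
        return ("grep", "authenticate")      # first guess, finds nothing
    last_tool, _last_arg, last_result = history[-1]
    if last_tool == "grep" and not last_result:
        return ("grep", "handleAuth")        # nothing found: try another keyword
    if last_tool == "grep":
        return ("read", last_result[0])      # narrow down to a single file
    return None                              # enough context: stop searching

def agentic_retrieve(model, max_steps: int = 8):
    """Loop until the model stops asking for tools or the step budget runs out."""
    history = []
    for _ in range(max_steps):
        action = model(history)
        if action is None:
            break
        tool, arg = action
        history.append((tool, arg, TOOLS[tool](arg)))
    return history

steps = agentic_retrieve(scripted_model)
print([(tool, arg) for tool, arg, _ in steps])
```

The number of steps and the choice of tool are not fixed in `agentic_retrieve`; they fall out of what the decision function sees in `history`, which is the property a one-shot Top-K pipeline lacks.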

A Four-Layer Retrieval Architecture

Claude Code's context is not assembled in one shot; it is injected progressively across four layers:

+===============================================================+
|                     CONTEXT WINDOW                            |
+===============================================================+
|                                                               |
|  Layer 0: STATIC CONTEXT (loaded once at session start)       |
|  +---------------------------------------------------------+  |
|  | System Prompt | CLAUDE.md | Git Status | Memory Index   |  |
|  +---------------------------------------------------------+  |
|                                                               |
|  Layer 1: SMART PRE-INJECTION (before model sees the query)   |
|  +---------------------------------------------------------+  |
|  | Sonnet Memory Recall | @file mentions | Skill Discovery |  |
|  +---------------------------------------------------------+  |
|                                                               |
|  Layer 2: MODEL-DRIVEN RETRIEVAL (tool use loop)              |
|  +---------------------------------------------------------+  |
|  | Glob -> Grep -> Read -> ... (model decides)             |  |
|  +---------------------------------------------------------+  |
|                                                               |
|  Layer 3: DELEGATED RETRIEVAL (sub-agents)                    |
|  +---------------------------------------------------------+  |
|  | Explore Agent | Fork Agent (parallel research)          |  |
|  +---------------------------------------------------------+  |
|                                                               |
+===============================================================+

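Layer 0: Static Context

At the start of every session, the system automatically loads a set of resident context: the System Prompt (tool usage guide, behavioral rules, environment information), CLAUDE.md (project-level configuration), Git Status (current branch and the 5 most recent commits), and the MEMORY.md index file.

One design decision is key here: the System Prompt is split into a static region and a dynamic region, separated by a boundary marker.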

system_prompt = [
    # --- Static (cross-session cacheable) ---
    intro_section,          # identity & rules
    system_section,         # tool instructions
    doing_tasks_section,    # task guidelines
    actions_section,        # safety rules
    using_tools_section,    # tool selection guide
    tone_and_style,         # output style
    output_efficiency,      # brevity rules

    "__SYSTEM_PROMPT_DYNAMIC_BOUNDARY__",  # <-- cache boundary

    # --- Dynamic (changes per session) ---
    session_guidance,       # session-specific guidance
    memory_prompt,          # persistent memory
    env_info,               # environment info
    language,               # language preference
    mcp_instructions,       # MCP server instructions
]

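Content before the boundary marker can share the Prompt Cache across sessions, with all users sharing the same cached copy. Content after it differs per session. Once cached, the static part costs almost nothing.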
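Why the split makes the prefix shareable can be shown with a small sketch. The section strings, the `build_prompt` helper, and the idea of keying the cache on a SHA-256 hash of the static prefix are assumptions for illustration; only the boundary marker itself comes from the listing above.

```python
import hashlib

BOUNDARY = "__SYSTEM_PROMPT_DYNAMIC_BOUNDARY__"

def split_at_boundary(sections: list[str]) -> tuple[list[str], list[str]]:
    """Return (static sections, dynamic sections), dropping the marker itself."""
    idx = sections.index(BOUNDARY)
    return sections[:idx], sections[idx + 1:]

def prefix_cache_key(static_sections: list[str]) -> str:
    """Hypothetical cache key: a hash of the static prefix only."""
    joined = "\n".join(static_sections)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def build_prompt(env_info: str) -> list[str]:
    """Toy prompt: two static sections, the marker, two per-session sections."""
    return ["identity & rules", "tool instructions", BOUNDARY,
            "session guidance", env_info]

alice_static, alice_dynamic = split_at_boundary(build_prompt("cwd=/home/alice/app"))
bob_static, bob_dynamic = split_at_boundary(build_prompt("cwd=/home/bob/site"))

# Same static prefix means the same key, even though the sessions differ.
print(prefix_cache_key(alice_static) == prefix_cache_key(bob_static))
print(alice_dynamic == bob_dynamic)
```

Anything that varies per user or per session has to sit after the marker; a single per-session value placed before it would change the key and defeat the sharing.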

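Layer 1: Smart Pre-Injection

After the user sends a message and before the model starts reasoning, the system runs a batch of pre-retrieval operations in parallel: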

User sends message: "Fix the auth bug in payment API"
                |
                v  (parallel, before model sees anything)
    +-----------+-----------+-----------+
    |           |           |           |
    v           v           v           v
 @mention   Memory      Skill       Agent
 file scan  Prefetch    Discovery   Listing
    |           |           |           |
    +-----+-----+-----+-----+-----+----+
          |                        |
          v                        v
   [Attachment Messages injected into conversation]
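The fan-out in the diagram can be sketched with `asyncio.gather`. This is a minimal illustration rather than Claude Code's code: the three coroutine names, their stubbed return values, and the shape of the attachment dictionaries are assumptions made for the example.

```python
import asyncio
import re

async def scan_file_mentions(message: str) -> list[str]:
    """Collect @path mentions from the user message."""
    return re.findall(r"@([\w./-]+)", message)

async def prefetch_memories(message: str) -> list[str]:
    """Stub for the memory side query; the sleep stands in for a model round trip."""
    await asyncio.sleep(0.01)
    return ["api_gotchas.md"] if "auth" in message.lower() else []

async def discover_skills(message: str) -> list[str]:
    """Stub for skill discovery; finds nothing in this example."""
    await asyncio.sleep(0.01)
    return []

async def prefetch_all(message: str) -> list[dict]:
    # Run all lookups concurrently; gather returns results in call order.
    mentions, memories, skills = await asyncio.gather(
        scan_file_mentions(message),
        prefetch_memories(message),
        discover_skills(message),
    )
    results = {"file_mentions": mentions, "memories": memories, "skills": skills}
    # Only non-empty results become attachment messages.
    return [{"type": kind, "items": items} for kind, items in results.items() if items]

attachments = asyncio.run(prefetch_all("Fix the auth bug in @src/api/payment/auth.ts"))
print(attachments)
```

Because the lookups are independent, running them concurrently means the total wait is roughly that of the slowest one rather than the sum of all of them.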

The most elegant of these is Memory Prefetch. It uses the Sonnet model to run a lightweight sideQuery that picks up to 5 relevant files from persistent memory:

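def prefetch_relevant_memories(user_query, messages):
    # Memories already shown in this conversation are not offered again.
    already_surfaced = collect_surfaced_memories(messages)
    recent_tools = collect_recent_successful_tools(messages)
    # build_memory_manifest is a hypothetical helper that lists the stored
    # memory files, minus the ones in already_surfaced.
    memory_manifest = build_memory_manifest(exclude=already_surfaced)

    # Ask the smaller model to pick at most 5 relevant memory files.
    selected = sonnet_side_query(
        system="Select up to 5 memories useful for the query",
        user=f"Query: {user_query}\nAvailable: {memory_manifest}",
        recent_tools=recent_tools,
    )
    return load_selected_memories(selected)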

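sideQuery is a key pattern in Claude Code: outside the main loop, a smaller model makes an auxiliary decision, and the result is injected into the next turn as an attachment. The cost is very low (Sonnet is much cheaper than Opus), yet context relevance improves substantially.

Layer 2: Model-Driven Retrieval

This is the core of Agentic RAG. The model has three search tools and decides on its own in what order and how many times to call them: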

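Tool	Capability	Typical use
Glob	Match by file-name pattern	`src/**/*.ts`, to find the file structure
Grep	Regex search over content	`function handleAuth`, to find an implementation
Read	Read file contents	Read the target file precisely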

The model typically narrows the scope funnel-style, in the order Glob → Grep → Read:

Model thinks: "I need to find the auth handler"
      |
      v
Glob("src/**/*auth*.ts")        --> 8 files found
      |
      v
Grep("handleAuth", path="src/") --> 3 files match
      |
      v
Read("src/api/auth/handler.ts") --> full content
      |
      v
(enough context, start working)

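The key point is that this is not a fixed pipeline. The model may Read directly (the user gave a file path), may Grep several times (the first attempt found nothing, so it tries another keyword), or may launch several searches in parallel. A traditional RAG pipeline with a hard-coded retrieval step cannot offer this flexibility.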
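For readers who want to try the funnel locally, the sketch below reproduces the Glob → Grep → Read narrowing with Python's standard library on a throwaway directory. The helper names and the demo file layout are assumptions for the example; they mimic the roles of the three tools, not their real implementations.

```python
import re
import tempfile
from pathlib import Path

def glob_files(root: Path, pattern: str) -> list[Path]:
    """Glob step: match by file-name pattern only, without opening files."""
    return sorted(p for p in root.glob(pattern) if p.is_file())

def grep_files(paths: list[Path], regex: str) -> list[Path]:
    """Grep step: keep the files whose content matches the regex."""
    rx = re.compile(regex)
    return [p for p in paths if rx.search(p.read_text(encoding="utf-8"))]

def read_file(path: Path, offset: int = 0, limit: int = 2000) -> str:
    """Read step: return at most `limit` lines starting at line `offset`."""
    lines = path.read_text(encoding="utf-8").splitlines()
    return "\n".join(lines[offset:offset + limit])

def build_demo_repo(root: Path) -> None:
    """Create a tiny hypothetical project to search."""
    files = {
        "src/api/auth.ts": "export function handleAuth(req) {\n  return true;\n}\n",
        "src/ui/auth-banner.ts": "export const banner = 'please sign in';\n",
        "src/api/payment/charge.ts": "export function charge() {}\n",
    }
    for rel, text in files.items():
        target = root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(text, encoding="utf-8")

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    build_demo_repo(root)
    candidates = glob_files(root, "src/**/*auth*.ts")        # by name
    matches = grep_files(candidates, r"function\s+handleAuth")  # by content
    content = read_file(matches[0])                            # one file
    funnel = [len(candidates), len(matches), len(content.splitlines())]

print(funnel)
```

Each step is cheaper than the next and shrinks the candidate set, so file content is only pulled in for what survives the first two filters.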

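Layer 3: Delegated Retrieval via Sub-Agents

When a search task is heavy (for example, "help me understand this project's authentication architecture"), the model can launch an Explore Agent to do the work on its behalf: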

Main Agent context:                Explore Agent context:
+---------------------------+      +---------------------------+
| User query                |      | Search directive          |
| ... (valuable context)    |      | Glob, Grep, Read results  |
|                           |      | ... (raw search output)   |
| [Agent tool call]  -------|----> | ... (lots of content)     |
|                           |      +---------------------------+
| [Agent result: summary] <-|----  | Final summary (concise)   |
|                           |      +---------------------------+
| (context stays clean!)    |
+---------------------------+

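The Explore Agent is deliberately restrained: it is read-only (it cannot create, modify, or delete any file), it runs on Haiku (the fastest model), all raw search results stay in the sub-agent's context with only a condensed summary returned, and it does not even load CLAUDE.md (the main agent already has it).

This solves a core problem of Agentic RAG: the search process itself consumes a lot of context. If every Glob/Grep/Read result stayed in the main context, it would fill up after a few rounds of searching. The sub-agent acts as a context firewall.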
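The firewall effect is easy to see in a toy model where a context is just a list of strings. Everything here is hypothetical: `explore_agent`, the keyword test that stands in for real search and summarization, and the padded corpus exist only to compare how many characters land in each context.

```python
def explore_agent(directive: str, corpus: dict[str, str]) -> tuple[str, int]:
    """Hypothetical read-only sub-agent: search in a private context, return a summary."""
    sub_context: list[str] = [directive]   # private to the sub-agent
    hits = []
    for path, text in corpus.items():
        sub_context.append(text)           # raw output accumulates here only
        if "auth" in text.lower():
            hits.append(path)
    summary = f"{len(hits)} relevant file(s): " + ", ".join(sorted(hits))
    raw_chars = sum(len(chunk) for chunk in sub_context)
    return summary, raw_chars

# Padded files simulate large raw search output.
corpus = {
    "src/auth/session.ts": "// auth session store\n" + "x" * 5000,
    "src/auth/token.ts": "// auth token signing\n" + "y" * 5000,
    "src/billing/invoice.ts": "// invoices\n" + "z" * 5000,
}

main_context = ["User: help me understand the auth architecture"]
summary, raw_chars = explore_agent("Map the auth architecture", corpus)
main_context.append(f"[Agent result] {summary}")   # only the summary crosses over
main_chars = sum(len(m) for m in main_context)

print(summary)
print(raw_chars, main_chars)
```

The sub-agent's context grows with every file it touches, while the main context grows only by the length of the summary.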

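Engineering Highlights

Token Budget Control for Search Results

Every search tool has a result-trimming mechanism to keep a single search from flooding the context: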

+-- Search Result Budget Controls --+
|                                   |
|  Grep:                            |
|    Default head_limit = 250 lines |
|    Max result size = 20KB chars   |
|    Max line width = 500 chars     |
|    (base64/minified auto-trimmed) |
|                                   |
|  Glob:                            |
|    Max 100 files returned         |
|    Sorted by mtime (newest first) |
|    Paths relativized to save toks |
|                                   |
|  Read:                            |
|    Max 25,000 tokens per read     |
|    Max 256KB file size            |
|    Default 2000 lines             |
|    Supports offset + limit paging |
|                                   |
+-----------------------------------+

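Grep's head_limit=250 is a well-chosen default: large enough to cover the vast majority of exploratory searches, yet small enough not to pour 6000+ tokens of search results into the context at once. If the model knows it needs more, it can pass head_limit=0 (unlimited) or page through with offset.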
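A trimming function with these semantics might look like the sketch below. The defaults of 250 lines and 500 characters and the meaning of head_limit=0 come from the description above; the function name, the "..." truncation marker, and the trailing notice line are assumptions for illustration.

```python
def trim_results(lines: list[str], head_limit: int = 250, offset: int = 0,
                 max_line_width: int = 500) -> list[str]:
    """Apply paging and per-line truncation to raw search output.

    head_limit=0 means "no limit"; offset skips lines already seen.
    """
    if head_limit == 0:
        window = lines[offset:]
    else:
        window = lines[offset:offset + head_limit]
    # Clip overly long lines (minified code, base64 blobs and the like).
    trimmed = [ln if len(ln) <= max_line_width else ln[:max_line_width] + "..."
               for ln in window]
    omitted = len(lines) - offset - len(window)
    if omitted > 0:
        # Tell the caller that more results exist and how to get them.
        trimmed.append(f"[{omitted} more lines; raise head_limit or page with offset]")
    return trimmed

raw = [f"src/file_{i}.ts:{i}: match" for i in range(600)]
raw[3] = "x" * 2000                      # one pathological long line

page1 = trim_results(raw)                # first 250 lines plus a notice
page3 = trim_results(raw, offset=500)    # last 100 lines, nothing omitted
print(len(page1), len(page3), page1[-1])
```

Appending a notice instead of truncating silently matters in an agentic loop: it gives the model the signal it needs to decide whether to page further or refine the query.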

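Deduplication in the Read Tool

Reading the same file twice is common: it is found during search and then re-checked before editing. Claude Code detects this automatically: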

def read_file(path, offset, limit):
    existing = read_file_state.get(path)
    # Take the mtime up front so it is also available when the cache misses.
    mtime = get_file_mtime(path)

    # Same range requested and file untouched: skip re-sending the content
    # (saves up to ~25K tokens per hit).
    if existing and existing.offset == offset and existing.limit == limit:
        if mtime == existing.timestamp:
            return "File unchanged since last read."

    content = read_file_content(path, offset, limit)
    read_file_state.set(path, content, mtime, offset, limit)
    return content

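About 18% of Read calls hit the deduplication path, each saving the token cost of an entire file.

Path Relativization

An unremarkable but ubiquitous optimization: every absolute path in search results is converted to a relative path.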

# Before (absolute paths waste tokens):
/Users/john/projects/my-app/src/components/auth/LoginForm.tsx
/Users/john/projects/my-app/src/components/auth/AuthProvider.tsx

# After (relative to cwd):
src/components/auth/LoginForm.tsx
src/components/auth/AuthProvider.tsx

It looks trivial, but in a session with dozens of searches this can save several thousand tokens.
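The conversion itself is a one-liner with the standard library. In this sketch the `relativize` helper and the choice to leave paths outside the working directory untouched are assumptions; `posixpath` is used so the output is the same on every platform.

```python
import posixpath

def relativize(paths: list[str], cwd: str) -> list[str]:
    """Rewrite absolute paths relative to cwd; keep paths outside cwd as they are."""
    out = []
    for p in paths:
        rel = posixpath.relpath(p, start=cwd)
        # A result starting with ".." points outside cwd, where the
        # absolute form is usually clearer (and often shorter).
        out.append(p if rel.startswith("..") else rel)
    return out

cwd = "/Users/john/projects/my-app"
absolute = [
    "/Users/john/projects/my-app/src/components/auth/LoginForm.tsx",
    "/Users/john/projects/my-app/src/components/auth/AuthProvider.tsx",
    "/etc/hosts",
]
relative = relativize(absolute, cwd)
saved_chars = sum(len(a) - len(r) for a, r in zip(absolute, relative))
print(relative)
print(saved_chars)
```

The saving per path is just the length of the working-directory prefix, which is why it only adds up over many result lines.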

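Cache Partitioning of the System Prompt

The __SYSTEM_PROMPT_DYNAMIC_BOUNDARY__ mentioned earlier, combined with the Anthropic API's scope: 'global' caching strategy, lets the static part share a cache across all users: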

API Request:
+--------------------------------------------------+
| system[0]: intro + tools + rules                 |  scope: 'global'
| system[1]: actions + style                       |  (shared across ALL users)
|                                                  |
|  ---- DYNAMIC BOUNDARY ----                      |
|                                                  |
| system[2]: session guidance                      |  scope: 'session'
| system[3]: memory + env info                     |  (per-user)
+--------------------------------------------------+
| messages: [user, assistant, ...]                 |
+--------------------------------------------------+

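The content before the boundary (tool descriptions, behavioral guidelines, and so on) is identical for every user and only needs to be cached once. Your very first request may hit a system prompt prefix that someone else has already cached. This is deduplication at a global level.

End-to-End Flow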

SESSION START
|
+-- Load system prompt (static + dynamic)
+-- Load CLAUDE.md into user context
+-- Load MEMORY.md index
+-- Snapshot git status + recent commits
|
USER SENDS: "Fix the auth bug in payment API"
|
+-- [Parallel prefetch, before model runs]
|   +-- Sonnet memory recall -> api_gotchas.md, user_prefs.md
|   +-- @file scan -> (none mentioned)
|   +-- Skill discovery -> (no matching skills)
|
+-- Prefetch results injected as attachment messages
|
MODEL TURN 1:
|   Thinks: "I need to find auth-related files"
|   +-- Grep("auth.*bug|payment.*auth", type="ts")    --> 5 files
|   +-- Read("src/api/payment/auth.ts")                --> 800 lines
|   +-- Read("src/api/payment/__tests__/auth.test.ts") --> 400 lines
|
MODEL TURN 2:
|   Thinks: "Found the bug, now fix it"
|   +-- Edit("src/api/payment/auth.ts", ...)
|   +-- Bash("npm test -- auth")
|
DONE (1 turn of retrieval + 1 turn of action)

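Comparison

Dimension	Traditional RAG	Claude Code
Index	Offline vectorization	No index, live search
Retrieval strategy	Fixed (Top-K)	Dynamic (model decides)
Retrieval rounds	Single round	Multi-round loop
Retrieval granularity	Fixed chunks	File name → content → line level
Context protection	None	Sub-agent isolation
Result trimming	Truncation	Multi-layer budget control
Caching	Vector cache	Prompt Cache partitioning + Read deduplication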

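Claude Code's RAG is in essence not "retrieval-augmented generation" but "generation-driven retrieval": the model first understands the request, then decides what to search for, judges whether the results are enough, and searches again if they are not. It trades a little latency for precision and flexibility well beyond those of traditional RAG.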