A Deep Dive into Claude Code's Memory Management

Anyone who has used Claude Code has probably noticed this: even after modifying dozens of files in a single marathon conversation, it still seems to "remember" everything it has done. Even more impressively, if you told it in a previous conversation "I prefer bun over npm", it automatically follows that preference next time.

Behind this sits a carefully engineered memory management system. In this post, we take Claude Code's memory machinery apart, piece by piece.

Overall Architecture: A Three-Layer Memory System

Claude Code's memory management maps nicely onto the human memory system:

+----------------------------------------------------------+
|          Persistent Memory (Long-term Memory)            |
|  memdir: MEMORY.md index + topic files                   |
|  Persisted across sessions                               |
+----------------------------------------------------------+
|          Session Memory (Working Memory)                 |
|  Background sub-agent maintains Markdown summary         |
|  Valid within a single session                           |
+----------------------------------------------------------+
|          Context Window (Short-term Memory)              |
|  Raw messages + tool call results in current conversation|
|  Immediately available                                   |
+----------------------------------------------------------+

Let's work through each layer in turn.


Layer 1: Context Window Management, a Three-Level Compaction Strategy

1.1 When does compaction trigger?

The auto-compaction trigger logic, expressed as pseudocode:

BUFFER_TOKENS = 13_000  # reserved safety buffer

def get_auto_compact_threshold(model):
    """Compute the auto-compact threshold: roughly ~93% of the effective window"""
    effective_window = get_context_window(model) - MAX_OUTPUT_TOKENS
    return effective_window - BUFFER_TOKENS

def should_auto_compact(messages, model):
    token_count = estimate_token_count(messages)
    threshold = get_auto_compact_threshold(model)
    return token_count >= threshold

With a 200K context window, compaction kicks in at roughly 93% utilization.
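Plugging in concrete numbers makes the threshold tangible. A runnable sketch, where the 14K output reservation is a hypothetical value chosen for illustration:

```python
BUFFER_TOKENS = 13_000  # reserved safety buffer

def get_auto_compact_threshold(context_window: int, max_output_tokens: int) -> int:
    """Threshold = (window - output reservation) - safety buffer."""
    effective_window = context_window - max_output_tokens
    return effective_window - BUFFER_TOKENS

# Hypothetical numbers for a 200K-window model
threshold = get_auto_compact_threshold(200_000, 14_000)
print(threshold)                      # → 173000
print(round(threshold / 186_000, 2))  # → 0.93 (~93% of the effective window)
```

With these assumed numbers, the effective window is 186K tokens and compaction triggers at 173K, i.e. about 93% of it.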

There is also a circuit breaker: after 3 consecutive failures the system stops retrying, preventing a "death loop" of wasted API calls:

MAX_FAILURES = 3

def auto_compact_if_needed(messages, model, tracking):
    # Circuit breaker: too many consecutive failures, give up
    if tracking.consecutive_failures >= MAX_FAILURES:
        return NOT_COMPACTED

    if not should_auto_compact(messages, model):
        return NOT_COMPACTED

    # ... perform compaction

1.2 The three compaction levels, in order

When compaction triggers, the system tries three strategies in priority order:

  +--------------------------+
  |  Context reaching ~93%   |
  +------------+-------------+
               |
               v
+---------------------------------+
|  Level 1: SM Compact            |
|  Replace old msgs with          |  <-- Zero API calls
|  pre-built summary              |
+------+----------------+---------+
  OK   |                | Fail / N/A
       v                v
   return   +---------------------------------+
            |  Level 2: Full Compact          |
            |  Call API to generate           |  <-- 1 API call
            |  structured summary             |
            +------+----------------+---------+
              OK   |                | prompt-too-long
                   v                v
               return   +---------------------------------+
                        |  Level 3: PTL Retry             |
                        |  Truncate oldest messages,      |  <-- Max 3 retries
                        |  retry compaction               |
                        +---------------------------------+

The overall flow in pseudocode:

def auto_compact_if_needed(messages, context):
    # === Try Session Memory compaction first (zero cost) ===
    result = try_session_memory_compaction(messages)
    if result:
        return result  # Success, without a single API call

    # === Fall back to traditional API compaction ===
    result = full_compact_conversation(messages, context)
    return result

1.3 Level 1: Session Memory Compact, the zero-cost path

This is the cleverest of the three strategies. When a Session Memory file is available, it simply replaces the old conversation history, with no extra API call at all.

Which messages are kept is controlled by three parameters:

SM_COMPACT_CONFIG = {
    "min_tokens": 10_000,        # keep at least 10K tokens of raw messages
    "min_text_messages": 5,      # keep at least 5 messages containing text
    "max_tokens": 40_000,        # keep at most 40K tokens (hard cap)
}

Before and after compaction:

Before (context window almost full):
+-----------------------------------------------------+
| msg1 | msg2 | ... | msg50 | msg51 | ... | msg80     |
|<-- covered by SM summary -->|<-- recent messages -->|
+-----------------------------------------------------+

                       | SM Compact
                       v

After (lots of free space):
+-----------------------------------------------------+
| [boundary] | SM summary  | msg51 | ... | msg80      |
|            | (~12K toks) |<-- kept raw messages --> |
|            |             |   (10K ~ 40K tokens)     |
| <----- used -----------> |<---- free space -------->|
+-----------------------------------------------------+

The core algorithm that decides which messages to keep:

def calculate_messages_to_keep(messages, last_summarized_idx):
    """Start after the last summarized message and extend backward until the limits are met"""
    config = SM_COMPACT_CONFIG
    start = last_summarized_idx + 1
    total_tokens = 0
    text_msg_count = 0

    # First, count what we already have after `start`
    for msg in messages[start:]:
        total_tokens += estimate_tokens(msg)
        if has_text_content(msg):
            text_msg_count += 1

    # If we already hit the hard cap, or satisfy both lower bounds, return
    if total_tokens >= config["max_tokens"]:
        return start
    if total_tokens >= config["min_tokens"] and \
       text_msg_count >= config["min_text_messages"]:
        return start

    # Otherwise extend backward, pulling in more messages
    for i in range(start - 1, -1, -1):
        total_tokens += estimate_tokens(messages[i])
        if has_text_content(messages[i]):
            text_msg_count += 1
        start = i

        if total_tokens >= config["max_tokens"]:
            break
        if total_tokens >= config["min_tokens"] and \
           text_msg_count >= config["min_text_messages"]:
            break

    # Finally: make sure tool_use/tool_result pairs are not split apart
    return adjust_for_tool_pairs(messages, start)

There is a subtle but important detail here: tool_use/tool_result pairs must never be split apart.

def adjust_for_tool_pairs(messages, start_index):
    """Ensure every tool_result among the kept messages has its matching tool_use"""
    # Collect all tool_result IDs within the kept range
    needed_tool_use_ids = set()
    for msg in messages[start_index:]:
        for block in msg.content:
            if block.type == "tool_result":
                needed_tool_use_ids.add(block.tool_use_id)

    # Scan backward and pull the matching tool_use messages into the kept range
    for i in range(start_index - 1, -1, -1):
        if has_matching_tool_use(messages[i], needed_tool_use_ids):
            start_index = i  # extend the kept range

    return start_index

Why bother? The Claude API requires every tool_result to have a matching tool_use and rejects the request otherwise. If compaction happens to cut right between the two halves of a tool call, the resulting "orphan tool_result" makes every subsequent API call fail.
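To make the pairing rule concrete, here is a self-contained, runnable version of the backward scan, using plain dicts in place of real message objects (the exact message shape is an assumption for illustration):

```python
def adjust_for_tool_pairs(messages, start_index):
    """Extend start_index backward so no kept tool_result loses its tool_use."""
    # IDs of every tool_result inside the kept range
    needed = {
        block["tool_use_id"]
        for msg in messages[start_index:]
        for block in msg["content"]
        if block["type"] == "tool_result"
    }
    # Pull in any earlier message that carries a matching tool_use
    for i in range(start_index - 1, -1, -1):
        has_match = any(
            block["type"] == "tool_use" and block["id"] in needed
            for block in messages[i]["content"]
        )
        if has_match:
            start_index = i
    return start_index

messages = [
    {"content": [{"type": "text"}]},                              # 0
    {"content": [{"type": "tool_use", "id": "t1"}]},              # 1
    {"content": [{"type": "tool_result", "tool_use_id": "t1"}]},  # 2
]
# Cutting at index 2 would orphan the tool_result, so the kept
# range is extended back to the tool_use at index 1.
print(adjust_for_tool_pairs(messages, 2))  # → 1
```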

1.4 Level 2: Full Compact, the traditional API path

When Session Memory is unavailable, the system calls the Claude API to summarize the whole conversation.

The compaction prompt asks for a structured summary with 9 sections:

+------- Full Compact Summary Structure -------+
|                                              |
|  1. Primary Request and Intent               |
|  2. Key Technical Concepts                   |
|  3. Files and Code Sections                  |
|  4. Errors and Fixes                         |
|  5. Problem Solving                          |
|  6. All User Messages  <-- verbatim!         |
|  7. Pending Tasks                            |
|  8. Current Work                             |
|  9. Optional Next Step                       |
|                                              |
+----------------------------------------------+
        ^                          ^
        |                          |
  <analysis> scratchpad     #6 preserves user's
  think-then-summarize      exact words to prevent
  stripped before storing   intent drift

Three design choices worth noting:

Design 1: the <analysis> scratchpad. The model first thinks freely inside an <analysis> tag, then writes the final summary inside a <summary> tag. The scratchpad is stripped programmatically and never enters the context, but it measurably improves summary quality.

Design 2: no tools allowed. The compaction agent is strictly limited to text output. It has a budget of exactly 1 turn, so any attempt to call a tool would waste that turn.

Design 3: prompt cache sharing. The compaction agent is launched via a "Fork" mechanism and shares the main conversation's system prompt, tool list, and message prefix, so it reuses the prompt cache and costs far less.
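A minimal sketch of the scratchpad-stripping step from Design 1, assuming the model's output wraps the two parts in literal <analysis> and <summary> tags as described above:

```python
import re

def extract_summary(model_output: str) -> str:
    """Drop the <analysis> scratchpad; keep only the <summary> body."""
    # Prefer the explicit <summary> block if present
    match = re.search(r"<summary>(.*?)</summary>", model_output, re.DOTALL)
    if match:
        return match.group(1).strip()
    # Fall back to removing the scratchpad and keeping the rest
    return re.sub(r"<analysis>.*?</analysis>", "", model_output, flags=re.DOTALL).strip()

output = """<analysis>
The user mostly edited auth middleware...
</analysis>
<summary>
1. Primary Request and Intent: fix auth logic
</summary>"""
print(extract_summary(output))  # → 1. Primary Request and Intent: fix auth logic
```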

1.5 Level 3: PTL retry, the last safety net

If the compaction request itself exceeds the token limit (prompt-too-long), truncate-and-retry kicks in:

MAX_PTL_RETRIES = 3

def compact_with_retry(messages, summary_request):
    for attempt in range(MAX_PTL_RETRIES):
        response = call_api(messages + [summary_request])

        if not is_prompt_too_long(response):
            return response  # success

        # Group messages by API round and drop from the oldest end
        groups = group_by_api_round(messages)
        token_gap = get_token_gap(response)  # how far over the limit (None if not reported)

        if token_gap is not None:
            # Exact mode: compute how many groups to drop from the overage
            drop_count = calculate_exact_drop(groups, token_gap)
        else:
            # Fallback mode: drop the oldest 20%
            drop_count = max(1, len(groups) // 5)

        messages = flatten(groups[drop_count:])

    raise Error("Conversation too long to compact")
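A runnable sketch of the fallback path, with a toy definition of "API round" (each user message starts a new round; this grouping rule is an assumption for illustration):

```python
def group_by_api_round(messages):
    """Split a flat message list into rounds, each starting at a user message."""
    groups = []
    for msg in messages:
        if msg["role"] == "user" or not groups:
            groups.append([])
        groups[-1].append(msg)
    return groups

def drop_oldest_rounds(messages, fraction=0.2):
    """Fallback mode: drop the oldest `fraction` of rounds (at least one)."""
    groups = group_by_api_round(messages)
    drop_count = max(1, int(len(groups) * fraction))
    return [msg for group in groups[drop_count:] for msg in group]

messages = [{"role": r, "i": i} for i, r in enumerate(
    ["user", "assistant", "user", "assistant", "user", "assistant"])]
# 3 rounds -> drop max(1, 0) = 1 oldest round, i.e. 2 messages
print(len(drop_oldest_rounds(messages)))  # → 4
```

Dropping whole rounds rather than individual messages is what keeps tool_use/tool_result pairs intact during truncation.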

1.6 State restoration after compaction

Compaction is more than deleting messages. Once old messages are dropped, a lot of implicit context goes with them, so the system has to restore key state:

+--- 6 Types of State to Restore After Compaction ----+
|                                                     |
|  1. Recent files (max 5, budget 50K tokens)         |
|     Avoid re-reading same files after compact       |
|                                                     |
|  2. Current Plan file                               |
|     Keep the plan if one is in progress             |
|                                                     |
|  3. Invoked Skills content (max 5K tokens each)     |
|     Skill instructions must survive compact         |
|                                                     |
|  4. Async Agent status                              |
|     Is a background agent done? Where's the result? |
|                                                     |
|  5. Deferred tool definitions                       |
|     Re-register tools discovered before compact     |
|                                                     |
|  6. SessionStart hooks (CLAUDE.md etc.)             |
|     Project config must be re-injected into context |
|                                                     |
+-----------------------------------------------------+

There is also a neat deduplication rule: if a file's content is still visible in the kept messages, restoring it is skipped to avoid wasting tokens:

def restore_files_after_compact(read_file_cache, kept_messages):
    # Collect the paths of Read results still present in the kept messages
    already_visible = collect_read_paths_from(kept_messages)

    recent_files = sorted(read_file_cache, key=lambda f: f.timestamp, reverse=True)

    budget_used = 0
    for file in recent_files[:5]:
        if file.path in already_visible:
            continue  # skip: it is already in the messages
        if budget_used + file.tokens > 50_000:
            break
        restore_as_attachment(file)
        budget_used += file.tokens

Layer 2: Session Memory, a Continuous Background Summarization Engine

2.1 The core mechanism

Session Memory is one of Claude Code's most innovative designs. A separate sub-agent runs in the background, continuously distilling the conversation into a structured Markdown notes file.

+-- Main Conversation Loop ----------------------------------+
|                                                            |
|  User query -> Model response -> Tool call -> ... loop     |
|                     |                                      |
|           post_sampling_hook fires                         |
|           after each response                              |
|                     |                                      |
|                     v                                      |
|  +--------------------------------------------+            |
|  | Should Session Memory update?              |            |
|  | (check token threshold & tool call count)  |            |
|  +--------+---------------------------+-------+            |
|      Yes  |                           | No                 |
|           v                           +--- skip            |
|  +------------------------+                                |
|  | Fork sub-agent         | <-- separate context,          |
|  | only allowed to Edit   |     non-blocking               |
|  | the summary file       |                                |
|  +-----------+------------+                                |
|              v                                             |
|  Update session_memory.md                                  |
|                                                            |
+------------------------------------------------------------+

2.2 Trigger conditions: dual-threshold gating

The summary is not updated on every turn; a carefully designed set of conditions must hold:

def should_extract_memory(messages):
    current_tokens = estimate_total_tokens(messages)

    # Gate 0: the conversation just started, too little content to extract
    if not initialized:
        if current_tokens < INIT_THRESHOLD:
            return False
        initialized = True

    # Gate 1 (hard): has token growth since the last extraction been large enough?
    token_threshold_met = (
        current_tokens - last_extraction_tokens >= MIN_TOKENS_BETWEEN_UPDATE
    )

    # Gate 2 (soft): enough tool calls since the last extraction?
    tool_call_threshold_met = (
        count_tool_calls_since(last_extraction_msg) >= TOOL_CALLS_BETWEEN_UPDATES
    )

    # Gate 3 (opportunistic): did the last turn have no tool calls? (a natural pause)
    at_natural_break = not has_tool_calls_in_last_turn(messages)

    # Trigger rule: the token threshold is mandatory; either of the other two suffices
    return token_threshold_met and (tool_call_threshold_met or at_natural_break)

The design philosophy: the token threshold is a hard requirement (extraction that fires too often wastes money), while the tool-call count and "natural pause" are soft triggers, so extraction happens in the gaps where the model pauses to think and never interrupts the workflow.

2.3 Structure of the summary file

Session Memory maintains a Markdown file with a fixed template of 10 sections:

+--------- session_memory.md ----------+
|                                      |
|  # Session Title      (5-10 words)   |
|  # Current State      (important!)   |
|  # Task Specification                |
|  # Files and Functions               |
|  # Workflow                          |
|  # Errors & Corrections              |
|  # Codebase and System Docs          |
|  # Learnings                         |
|  # Key Results                       |
|  # Worklog                           |
|                                      |
|  Per-section limit : ~2000 tokens    |
|  Total file limit  : ~12000 tokens   |
+--------------------------------------+

When the file exceeds its budget, the update prompt tells the sub-agent to trim it down:

def build_update_prompt(current_notes, notes_path):
    prompt = render_template(TEMPLATE, current_notes, notes_path)

    # Analyze the size of each section
    section_sizes = analyze_sections(current_notes)
    total = estimate_tokens(current_notes)

    # Over the total budget -> force trimming
    if total > 12_000:
        prompt += f"""
        CRITICAL: the file is currently ~{total} tokens, over the 12000 limit.
        You must trim it aggressively, keeping Current State and Errors first."""

    # A single section over its limit -> ask to trim that section
    oversized = [s.name for s in section_sizes if s.tokens > 2000]
    if oversized:
        prompt += f"\nThe following sections need trimming: {oversized}"

    return prompt

2.4 The update path: a Fork Agent with permission isolation

Session Memory updates run through a restricted Fork Agent. The key design decision: it may edit the summary file and do nothing else.

def create_memory_file_permission(memory_path):
    """Build a permission function: only the Edit tool, only on the summary file"""
    def can_use_tool(tool, input):
        if tool.name == "FileEdit" and input["file_path"] == memory_path:
            return ALLOW
        return DENY  # everything else is rejected

    return can_use_tool

# Launch the Fork Agent
run_forked_agent(
    prompt       = build_update_prompt(current_notes, path),
    can_use_tool = create_memory_file_permission(path),  # strict permissions
    query_source = "session_memory",
    # The Fork Agent shares the main conversation's prompt cache, lowering cost
)

2.5 User customization

Session Memory supports custom templates and prompts, which users can drop into:

~/.claude/session-memory/config/
├── template.md    ← custom summary template (replaces the default 10 sections)
└── prompt.md      ← custom update instructions (supports variables such as {{currentNotes}})
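A minimal sketch of how such a prompt template might be rendered, assuming simple `{{variable}}` substitution (any variable name besides {{currentNotes}} here is hypothetical):

```python
import re

def render_prompt(template: str, variables: dict) -> str:
    """Replace every {{name}} placeholder with its value; leave unknown names intact."""
    def substitute(match):
        name = match.group(1)
        return str(variables.get(name, match.group(0)))
    return re.sub(r"\{\{(\w+)\}\}", substitute, template)

template = "Update the notes below.\n{{currentNotes}}\nWrite to {{notesPath}}."
rendered = render_prompt(template, {
    "currentNotes": "# Current State\nAuth refactor in progress.",
    "notesPath": "session_memory.md",
})
print(rendered)
```

Leaving unknown placeholders untouched makes a half-filled template easy to spot during debugging.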

Layer 3: Persistent Memory (memdir), Long-Term Memory Across Sessions

3.1 Architecture

Persistent memory is a filesystem-based knowledge base:

~/.claude/projects/<project>/memory/
│
├── MEMORY.md                ← index file (always loaded into the system prompt)
│   Example content:
│   - [User Preferences](user_preferences.md) — Prefers bun, uses Go
│   - [Project Setup](project_setup.md) — K8s deploy, GH Actions CI
│   - [API Gotchas](api_gotchas.md) — Auth token must be refreshed
│
├── user_preferences.md      ← user preferences
├── project_setup.md         ← project configuration
├── feedback_testing.md      ← feedback notes
├── api_gotchas.md           ← reference information
│
└── team/                    ← team-shared memory (optional)
    ├── MEMORY.md
    └── coding_standards.md

3.2 Four memory types

The system defines a strict taxonomy with clear boundaries for each type:

+-----------+----------------------------+-------------------------------+
|   Type    |       Description          |         Example               |
+-----------+----------------------------+-------------------------------+
|  user     | Identity, role, prefs      | "Backend dev, prefers Go"     |
+-----------+----------------------------+-------------------------------+
| feedback  | Corrections & feedback     | "Use const, not var"          |
+-----------+----------------------------+-------------------------------+
| project   | Non-code project knowledge | "Deploy: k8s, CI: GH Actions" |
+-----------+----------------------------+-------------------------------+
| reference | Tool / API usage notes     | "Internal API auth method"    |
+-----------+----------------------------+-------------------------------+

     X  NOT saved: info derivable from code (architecture, git history)

Each memory file carries YAML frontmatter with its metadata:

---
name: user_preferences
description: User's coding style preferences
type: feedback
---
Use bun instead of npm for all package management.
Prefer functional components over class components.
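A small sketch of reading that frontmatter without a YAML library, assuming the simple `key: value` form shown above:

```python
def parse_frontmatter(text: str):
    """Split a memory file into (metadata dict, body) based on --- fences."""
    lines = text.strip().split("\n")
    if lines[0] != "---":
        return {}, text
    end = lines.index("---", 1)  # closing fence
    meta = {}
    for line in lines[1:end]:
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:])
    return meta, body

doc = """---
name: user_preferences
description: User's coding style preferences
type: feedback
---
Use bun instead of npm for all package management."""
meta, body = parse_frontmatter(doc)
print(meta["type"])  # → feedback
```

Only the frontmatter (name + description + type) is scanned during retrieval; the body is loaded on demand.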

3.3 Limits on the index file

MEMORY.md is always loaded into the system prompt, so strict limits keep it from growing without bound:

MAX_LINES = 200       # at most 200 lines
MAX_BYTES = 25_000    # at most 25KB

def truncate_memory_index(raw_content):
    lines = raw_content.strip().split("\n")
    truncated_by_lines = truncated_by_bytes = False

    # Truncate by line count first
    if len(lines) > MAX_LINES:
        lines = lines[:MAX_LINES]
        truncated_by_lines = True

    # Then truncate by byte count (guards against extremely long single lines)
    content = "\n".join(lines)
    if len(content) > MAX_BYTES:
        cut_at = content.rfind("\n", 0, MAX_BYTES)  # cut at a line boundary
        content = content[:cut_at]
        truncated_by_bytes = True

    if truncated_by_lines or truncated_by_bytes:
        content += "\n\n> WARNING: MEMORY.md is too large; only part of it was loaded."
        content += "\n> Keep index entries to one line of ~200 chars; put details in topic files."

    return content

3.4 Smart memory retrieval

Not every conversation loads every memory file. A lightweight Sonnet model performs real-time retrieval, loading only the most relevant memories on demand:

User sends: "Help me fix auth logic in the payment API"
    |
    v
+----------------------------------------------+
| Scan memory/ dir for all .md files           |
| Read each file's frontmatter                 |
| (name + description + type)                  |
| Exclude files already shown this turn        |
+----------------------+-----------------------+
                       |
                       v
+----------------------------------------------+
| Call Sonnet (lightweight, fast)              |
|                                              |
| System: "Select up to 5 memories that        |
|  will be useful for the current query"       |
|                                              |
| User: "Query: fix payment API auth logic     |
|  Available memories:                         |
|  - api_gotchas.md: API auth caveats    [HIT] |
|  - user_prefs.md: coding preferences   [HIT] |
|  - old_deploy.md: old deploy notes    [SKIP] |
|  Recent tools: FileEdit, Bash"               |
|                                              |
| -> Returns: ["api_gotchas.md",               |
|              "user_prefs.md"]                |
+----------------------+-----------------------+
                       |
                       v
         Load selected memory files into context

Two clever filtering rules are applied here:

  • Deduplication: a memory file already shown in an earlier turn is not recommended again, leaving its slot for a fresh candidate
  • Tool awareness: if the user has recently been using some tool (say mcp__X__spawn), its usage documentation is not recommended (the conversation already contains live examples), but memories containing warnings/gotchas for that tool still are
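A sketch of the candidate-selection step that runs before the Sonnet call, combining the frontmatter scan with the two filters above (the field names `tool` and `has_gotchas` are assumptions for illustration):

```python
def build_candidates(memory_files, already_shown, recent_tools):
    """Filter memory files before asking the retrieval model to rank them."""
    candidates = []
    for f in memory_files:
        if f["name"] in already_shown:
            continue  # dedup: leave the slot for a fresh candidate
        # Tool awareness: skip pure usage docs for tools already in active use,
        # but keep anything flagged as containing warnings/gotchas
        if f.get("tool") in recent_tools and not f.get("has_gotchas"):
            continue
        candidates.append(f)
    return candidates

files = [
    {"name": "api_gotchas.md", "tool": "mcp__X__spawn", "has_gotchas": True},
    {"name": "spawn_usage.md", "tool": "mcp__X__spawn", "has_gotchas": False},
    {"name": "user_prefs.md"},
    {"name": "old_deploy.md"},
]
result = build_candidates(files, already_shown={"old_deploy.md"},
                          recent_tools={"mcp__X__spawn"})
print([f["name"] for f in result])  # → ['api_gotchas.md', 'user_prefs.md']
```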

3.5 Background memory extraction

Beyond the user explicitly saying "remember this", Claude Code also extracts memorable content in the background. The extraction agent operates under strict constraints:

+--- Memory Extraction Agent Rules --------+
|                                          |
|  ALLOWED:                                |
|  +- Read / Grep / Glob memory dir files  |
|  +- Read-only Bash (ls/find/cat/stat)    |
|  +- Edit / Write files inside memory dir |
|                                          |
|  DENIED:                                 |
|  +- X  Read or modify source code files  |
|  +- X  Write-capable Bash commands       |
|  +- X  MCP, Agent, or other tools        |
|  +- X  Bash rm (no deleting files)       |
|                                          |
|  EFFICIENCY (limited turn budget):       |
|  +- Turn 1: all Read calls in parallel   |
|  +- Turn 2: all Write/Edit in parallel   |
|                                          |
|  CONTENT CONSTRAINT:                     |
|  +- Extract from conversation only,      |
|     never investigate source code        |
|                                          |
+------------------------------------------+
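The ALLOWED/DENIED rules above can be sketched as a permission function in the same style as the Session Memory one (the tool names and rule encoding are assumptions for illustration):

```python
READ_ONLY_BASH = {"ls", "find", "cat", "stat"}

def can_extraction_agent_use(tool_name, tool_input, memory_dir):
    """Permission check for the memory extraction agent."""
    path = tool_input.get("file_path", "")
    if tool_name in ("Read", "Grep", "Glob"):
        return path.startswith(memory_dir)  # read only inside the memory dir
    if tool_name in ("Edit", "Write"):
        return path.startswith(memory_dir)  # write only inside the memory dir
    if tool_name == "Bash":
        command = tool_input.get("command", "").split()
        return bool(command) and command[0] in READ_ONLY_BASH  # no rm, no writes
    return False  # MCP, Agent, everything else: denied

mem = "/home/u/.claude/projects/p/memory/"
print(can_extraction_agent_use("Edit", {"file_path": mem + "MEMORY.md"}, mem))  # → True
print(can_extraction_agent_use("Bash", {"command": "rm -rf /"}, mem))           # → False
print(can_extraction_agent_use("Edit", {"file_path": "/src/main.go"}, mem))     # → False
```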

How Do the Three Layers Work Together?

A full timeline shows how the three layers cooperate:

Timeline ─────────────────────────────────────────────────────►

▎ SESSION START
▎ ├─ Load MEMORY.md index into system prompt ··········  Persistent Memory
▎ ├─ Sonnet retrieves relevant memory files ···········  Persistent Memory
▎
▎ CONVERSATION IN PROGRESS...
▎ ├─ Each model response -> post_sampling_hook ········  Session Memory
▎ │                                                     (background update)
▎ │
▎ │  [Conversation grows ~5K tokens]
▎ │  └─ Session Memory updates session_memory.md
▎ │
▎ │  [Conversation continues...]
▎ │  └─ Memory Extraction Agent checks for ············  Persistent Memory
▎ │     save-worthy content (preferences, feedback)      (background write)
▎ │
▎ │  [Context approaching 93%]
▎ │  ├─ Try SM Compact ································  Context Management
▎ │  │   ├─ Success -> keep recent 10K-40K tokens
▎ │  │   └─ Fail -> Full Compact -> PTL Retry
▎ │  └─ Restore files, Plan, Skills, Agent state
▎ │
▎ │  [Conversation continues... (loop)]
▎
▎ SESSION END
▎ ├─ session_memory.md preserved (resumable) ··········  Session Memory
▎ └─ memory/*.md persisted permanently ················  Persistent Memory
▎
▎ NEXT SESSION START
▎ ├─ Reload MEMORY.md
▎ ├─ Retrieve relevant memory files
▎ └─ If resuming -> read session_memory.md as context

Engineering Details Worth Highlighting

Prompt cache protection

Compaction is the natural enemy of the prompt cache: when all the messages change, the cache is invalidated. The system's countermeasures:

Before compaction:
  Main prompt cache: [system_prompt | tools | msg1...msgN]
                     <---- cache hit ---->

After compaction:
  New messages: [system_prompt | tools | summary | recent_msgs]
                <-- still hit! -->
                (system + tools unchanged)

  + Notify cache monitor: "compaction just happened,
    cache hit-rate drop is expected, suppress alerts"

The Fork Agent design also exists to maximize cache reuse: the compaction agent sends a request whose prefix (system prompt + tools + message prefix) is identical to the main conversation's, so it hits the same cache.
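The cache-friendliness property boils down to prefix stability, which can be checked mechanically. A toy sketch (the request shape here is an assumption, not the real wire format):

```python
def shared_cache_prefix_len(request_a, request_b):
    """Count how many leading request segments are identical.
    Only this shared prefix can be served from the prompt cache."""
    n = 0
    for a, b in zip(request_a, request_b):
        if a != b:
            break
        n += 1
    return n

main = ["<system prompt>", "<tool defs>", "msg1", "msg2", "msg3"]
fork = ["<system prompt>", "<tool defs>", "msg1", "msg2", "<summary request>"]
post_compact = ["<system prompt>", "<tool defs>", "<summary>", "recent msgs"]

print(shared_cache_prefix_len(main, fork))          # → 4  (fork reuses almost everything)
print(shared_cache_prefix_len(main, post_compact))  # → 2  (system + tools still hit)
```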

Handling images and media

Before compaction, images and document attachments are stripped (they are useless for a text summary but consume huge numbers of tokens):

def strip_images(messages):
    """Images and documents -> text placeholders"""
    for msg in messages:
        for block in msg.content:
            if block.type == "image":
                replace_with("[image]")
            elif block.type == "document":
                replace_with("[document]")
    return messages

Persisting session metadata

An easy detail to miss: after compaction, the session metadata (custom title, tags) must be re-appended. The --resume feature reads the last 16KB of the log file to get session information; if compaction generates a flood of new messages, the metadata gets pushed out of that window, and the session shows an auto-generated title instead of the user's chosen name.
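A sketch of why re-appending matters, modeling the log as a JSONL file whose last 16KB is scanned for the newest metadata record (the record format is an assumption for illustration):

```python
import json

TAIL_BYTES = 16 * 1024

def read_session_metadata(log_bytes: bytes):
    """Return the last metadata record found in the final 16KB of the log."""
    tail = log_bytes[-TAIL_BYTES:].decode("utf-8", errors="ignore")
    metadata = None
    for line in tail.splitlines():
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # the first tail line may be a cut-off record
        if record.get("type") == "metadata":
            metadata = record
    return metadata

log = json.dumps({"type": "metadata", "title": "Auth refactor"}).encode() + b"\n"
log += b"\n".join(json.dumps({"type": "message", "i": i}).encode() for i in range(2000))
# ~2000 message records (~50KB) push the metadata out of the 16KB tail window,
# so re-appending it after compaction is the only way --resume can find it.
print(read_session_metadata(log))  # → None
```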


Summary

Claude Code's memory management system is a carefully designed three-layer architecture:

+-------------------+------------------------------------------+---------------------+---------------------------------+
|       Layer       |                Mechanism                 |      Lifetime       |              Cost               |
+-------------------+------------------------------------------+---------------------+---------------------------------+
| Context window    | Three-level compact (SM -> Full -> PTL)  | Single conversation | SM: zero, Full: 1 API call      |
| Session Memory    | Background Fork Agent summarizer         | Single session      | 1 light API call per ~5K tokens |
| Persistent memory | Filesystem + Sonnet smart retrieval      | Permanent           | Retrieval: 1 Sonnet call        |
+-------------------+------------------------------------------+---------------------+---------------------------------+

The core design philosophy: rather than chasing an ever-larger context window, make a bounded window remember what matters, forget the right things, and find what it needs.

  • Remember what matters: Session Memory continuously distills the conversation, so no key information is lost
  • Forget the right things: three-level compaction keeps what is most important and discards redundant tool-call detail
  • Find what you need: memdir organizes memories by type, and Sonnet retrieval recalls them precisely

This system gives Claude Code an "effective memory capacity" far beyond its 200K context window.


All code blocks in this post are Python pseudocode or sketches inferred from observed behavior and public information, intended to aid understanding of Claude Code's internals.