A Deep Dive into Claude Code's Memory Management

Anyone who has used Claude Code has probably noticed this: even after modifying dozens of files in a single marathon conversation, it still seems to "remember" everything it has done. Even more impressively, if you told it in a previous conversation "I prefer bun over npm", it automatically follows that preference next time.

Behind this sits a carefully engineered memory management system. In this post, we take Claude Code's memory machinery apart, piece by piece.

Overall Architecture: A Three-Layer Memory System

Claude Code's memory management maps nicely onto the human memory system:

+----------------------------------------------------------+
|          Persistent Memory (Long-term Memory)            |
|  memdir: MEMORY.md index + topic files                   |
|  Persisted across sessions                               |
+----------------------------------------------------------+
|          Session Memory (Working Memory)                 |
|  Background sub-agent maintains Markdown summary         |
|  Valid within a single session                           |
+----------------------------------------------------------+
|          Context Window (Short-term Memory)              |
|  Raw messages + tool call results in current conversation|
|  Immediately available                                   |
+----------------------------------------------------------+

Let's work through each layer in turn.


Layer 1: Context Window Management, a Three-Level Compaction Strategy

1.1 When does compaction trigger?

The auto-compaction trigger logic, expressed as pseudocode:

BUFFER_TOKENS = 13_000  # reserved safety buffer

def get_auto_compact_threshold(model):
    """Compute the auto-compact threshold: roughly ~93% of the effective window"""
    effective_window = get_context_window(model) - MAX_OUTPUT_TOKENS
    return effective_window - BUFFER_TOKENS

def should_auto_compact(messages, model):
    token_count = estimate_token_count(messages)
    threshold = get_auto_compact_threshold(model)
    return token_count >= threshold

With a 200K context window, compaction kicks in at roughly 93% utilization.
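Plugging in concrete numbers makes the threshold tangible. A runnable sketch, where the 14K output reservation is a hypothetical value chosen for illustration:

```python
BUFFER_TOKENS = 13_000  # reserved safety buffer

def get_auto_compact_threshold(context_window: int, max_output_tokens: int) -> int:
    """Threshold = (window - output reservation) - safety buffer."""
    effective_window = context_window - max_output_tokens
    return effective_window - BUFFER_TOKENS

# Hypothetical numbers for a 200K-window model
threshold = get_auto_compact_threshold(200_000, 14_000)
print(threshold)                      # → 173000
print(round(threshold / 186_000, 2))  # → 0.93 (~93% of the effective window)
```

With these assumed numbers, the effective window is 186K tokens and compaction triggers at 173K, i.e. about 93% of it.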

There is also a circuit breaker: after 3 consecutive failures the system stops retrying, preventing a "death loop" of wasted API calls:

MAX_FAILURES = 3

def auto_compact_if_needed(messages, model, tracking):
    # Circuit breaker: too many consecutive failures, give up
    if tracking.consecutive_failures >= MAX_FAILURES:
        return NOT_COMPACTED

    if not should_auto_compact(messages, model):
        return NOT_COMPACTED

    # ... perform compaction

1.2 The three compaction levels, in order

When compaction triggers, the system tries three strategies in priority order:

  +--------------------------+
  |  Context reaching ~93%   |
  +------------+-------------+
               |
               v
+---------------------------------+
|  Level 1: SM Compact            |
|  Replace old msgs with          |  <-- Zero API calls
|  pre-built summary              |
+------+----------------+---------+
  OK   |                | Fail / N/A
       v                v
   return   +---------------------------------+
            |  Level 2: Full Compact          |
            |  Call API to generate           |  <-- 1 API call
            |  structured summary             |
            +------+----------------+---------+
              OK   |                | prompt-too-long
                   v                v
               return   +---------------------------------+
                        |  Level 3: PTL Retry             |
                        |  Truncate oldest messages,      |  <-- Max 3 retries
                        |  retry compaction               |
                        +---------------------------------+

The overall flow in pseudocode:

def auto_compact_if_needed(messages, context):
    # === Try Session Memory compaction first (zero cost) ===
    result = try_session_memory_compaction(messages)
    if result:
        return result  # Success, without a single API call

    # === Fall back to traditional API compaction ===
    result = full_compact_conversation(messages, context)
    return result

1.3 Level 1: Session Memory Compact, the zero-cost path

This is the cleverest of the three strategies. When a Session Memory file is available, it simply replaces the old conversation history, with no extra API call at all.

Which messages are kept is controlled by three parameters:

SM_COMPACT_CONFIG = {
    "min_tokens": 10_000,        # keep at least 10K tokens of raw messages
    "min_text_messages": 5,      # keep at least 5 messages containing text
    "max_tokens": 40_000,        # keep at most 40K tokens (hard cap)
}

Before and after compaction:

Before (context window almost full):
+-----------------------------------------------------+
| msg1 | msg2 | ... | msg50 | msg51 | ... | msg80     |
|<-- covered by SM summary -->|<-- recent messages -->|
+-----------------------------------------------------+

                       | SM Compact
                       v

After (lots of free space):
+-----------------------------------------------------+
| [boundary] | SM summary  | msg51 | ... | msg80      |
|            | (~12K toks) |<-- kept raw messages --> |
|            |             |   (10K ~ 40K tokens)     |
| <----- used -----------> |<---- free space -------->|
+-----------------------------------------------------+

The core algorithm that decides which messages to keep:

def calculate_messages_to_keep(messages, last_summarized_idx):
    """Start after the last summarized message and extend backward until the limits are met"""
    config = SM_COMPACT_CONFIG
    start = last_summarized_idx + 1
    total_tokens = 0
    text_msg_count = 0

    # First, count what we already have after `start`
    for msg in messages[start:]:
        total_tokens += estimate_tokens(msg)
        if has_text_content(msg):
            text_msg_count += 1

    # If we already hit the hard cap, or satisfy both lower bounds, return
    if total_tokens >= config["max_tokens"]:
        return start
    if total_tokens >= config["min_tokens"] and \
       text_msg_count >= config["min_text_messages"]:
        return start

    # Otherwise extend backward, pulling in more messages
    for i in range(start - 1, -1, -1):
        total_tokens += estimate_tokens(messages[i])
        if has_text_content(messages[i]):
            text_msg_count += 1
        start = i

        if total_tokens >= config["max_tokens"]:
            break
        if total_tokens >= config["min_tokens"] and \
           text_msg_count >= config["min_text_messages"]:
            break

    # Finally: make sure tool_use/tool_result pairs are not split apart
    return adjust_for_tool_pairs(messages, start)

There is a subtle but important detail here: tool_use/tool_result pairs must never be split apart.

def adjust_for_tool_pairs(messages, start_index):
    """Ensure every tool_result among the kept messages has its matching tool_use"""
    # Collect all tool_result IDs within the kept range
    needed_tool_use_ids = set()
    for msg in messages[start_index:]:
        for block in msg.content:
            if block.type == "tool_result":
                needed_tool_use_ids.add(block.tool_use_id)

    # Scan backward and pull the matching tool_use messages into the kept range
    for i in range(start_index - 1, -1, -1):
        if has_matching_tool_use(messages[i], needed_tool_use_ids):
            start_index = i  # extend the kept range

    return start_index

Why bother? The Claude API requires every tool_result to have a matching tool_use and rejects the request otherwise. If compaction happens to cut right between the two halves of a tool call, the resulting "orphan tool_result" makes every subsequent API call fail.
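To make the pairing rule concrete, here is a self-contained, runnable version of the backward scan, using plain dicts in place of real message objects (the exact message shape is an assumption for illustration):

```python
def adjust_for_tool_pairs(messages, start_index):
    """Extend start_index backward so no kept tool_result loses its tool_use."""
    # IDs of every tool_result inside the kept range
    needed = {
        block["tool_use_id"]
        for msg in messages[start_index:]
        for block in msg["content"]
        if block["type"] == "tool_result"
    }
    # Pull in any earlier message that carries a matching tool_use
    for i in range(start_index - 1, -1, -1):
        has_match = any(
            block["type"] == "tool_use" and block["id"] in needed
            for block in messages[i]["content"]
        )
        if has_match:
            start_index = i
    return start_index

messages = [
    {"content": [{"type": "text"}]},                              # 0
    {"content": [{"type": "tool_use", "id": "t1"}]},              # 1
    {"content": [{"type": "tool_result", "tool_use_id": "t1"}]},  # 2
]
# Cutting at index 2 would orphan the tool_result, so the kept
# range is extended back to the tool_use at index 1.
print(adjust_for_tool_pairs(messages, 2))  # → 1
```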

1.4 Level 2: Full Compact, the traditional API path

When Session Memory is unavailable, the system calls the Claude API to summarize the whole conversation.

The compaction prompt asks for a structured summary with 9 sections:

+------- Full Compact Summary Structure -------+
|                                              |
|  1. Primary Request and Intent               |
|  2. Key Technical Concepts                   |
|  3. Files and Code Sections                  |
|  4. Errors and Fixes                         |
|  5. Problem Solving                          |
|  6. All User Messages  <-- verbatim!         |
|  7. Pending Tasks                            |
|  8. Current Work                             |
|  9. Optional Next Step                       |
|                                              |
+----------------------------------------------+
        ^                          ^
        |                          |
  <analysis> scratchpad     #6 preserves user's
  think-then-summarize      exact words to prevent
  stripped before storing   intent drift

Three design choices worth noting:

Design 1: the <analysis> scratchpad. The model first thinks freely inside an <analysis> tag, then writes the final summary inside a <summary> tag. The scratchpad is stripped programmatically and never enters the context, but it measurably improves summary quality.

Design 2: no tools allowed. The compaction agent is strictly limited to text output. It has a budget of exactly 1 turn, so any attempt to call a tool would waste that turn.

Design 3: prompt cache sharing. The compaction agent is launched via a "Fork" mechanism and shares the main conversation's system prompt, tool list, and message prefix, so it reuses the prompt cache and costs far less.
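A minimal sketch of the scratchpad-stripping step from Design 1, assuming the model's output wraps the two parts in literal <analysis> and <summary> tags as described above:

```python
import re

def extract_summary(model_output: str) -> str:
    """Drop the <analysis> scratchpad; keep only the <summary> body."""
    # Prefer the explicit <summary> block if present
    match = re.search(r"<summary>(.*?)</summary>", model_output, re.DOTALL)
    if match:
        return match.group(1).strip()
    # Fall back to removing the scratchpad and keeping the rest
    return re.sub(r"<analysis>.*?</analysis>", "", model_output, flags=re.DOTALL).strip()

output = """<analysis>
The user mostly edited auth middleware...
</analysis>
<summary>
1. Primary Request and Intent: fix auth logic
</summary>"""
print(extract_summary(output))  # → 1. Primary Request and Intent: fix auth logic
```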

1.5 Level 3: PTL retry, the last safety net

If the compaction request itself exceeds the token limit (prompt-too-long), truncate-and-retry kicks in:

MAX_PTL_RETRIES = 3

def compact_with_retry(messages, summary_request):
    for attempt in range(MAX_PTL_RETRIES):
        response = call_api(messages + [summary_request])

        if not is_prompt_too_long(response):
            return response  # success

        # Group messages by API round and drop from the oldest end
        groups = group_by_api_round(messages)
        token_gap = get_token_gap(response)  # how far over the limit (None if not reported)

        if token_gap is not None:
            # Exact mode: compute how many groups to drop from the overage
            drop_count = calculate_exact_drop(groups, token_gap)
        else:
            # Fallback mode: drop the oldest 20%
            drop_count = max(1, len(groups) // 5)

        messages = flatten(groups[drop_count:])

    raise Error("Conversation too long to compact")
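A runnable sketch of the fallback path, with a toy definition of "API round" (each user message starts a new round; this grouping rule is an assumption for illustration):

```python
def group_by_api_round(messages):
    """Split a flat message list into rounds, each starting at a user message."""
    groups = []
    for msg in messages:
        if msg["role"] == "user" or not groups:
            groups.append([])
        groups[-1].append(msg)
    return groups

def drop_oldest_rounds(messages, fraction=0.2):
    """Fallback mode: drop the oldest `fraction` of rounds (at least one)."""
    groups = group_by_api_round(messages)
    drop_count = max(1, int(len(groups) * fraction))
    return [msg for group in groups[drop_count:] for msg in group]

messages = [{"role": r, "i": i} for i, r in enumerate(
    ["user", "assistant", "user", "assistant", "user", "assistant"])]
# 3 rounds -> drop max(1, 0) = 1 oldest round, i.e. 2 messages
print(len(drop_oldest_rounds(messages)))  # → 4
```

Dropping whole rounds rather than individual messages is what keeps tool_use/tool_result pairs intact during truncation.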

1.6 State restoration after compaction

Compaction is more than deleting messages. Once old messages are dropped, a lot of implicit context goes with them, so the system has to restore key state:

+--- 6 Types of State to Restore After Compaction ----+
|                                                     |
|  1. Recent files (max 5, budget 50K tokens)         |
|     Avoid re-reading same files after compact       |
|                                                     |
|  2. Current Plan file                               |
|     Keep the plan if one is in progress             |
|                                                     |
|  3. Invoked Skills content (max 5K tokens each)     |
|     Skill instructions must survive compact         |
|                                                     |
|  4. Async Agent status                              |
|     Is a background agent done? Where's the result? |
|                                                     |
|  5. Deferred tool definitions                       |
|     Re-register tools discovered before compact     |
|                                                     |
|  6. SessionStart hooks (CLAUDE.md etc.)             |
|     Project config must be re-injected into context |
|                                                     |
+-----------------------------------------------------+

There is also a neat deduplication rule: if a file's content is still visible in the kept messages, restoring it is skipped to avoid wasting tokens:

def restore_files_after_compact(read_file_cache, kept_messages):
    # Collect the paths of Read results still present in the kept messages
    already_visible = collect_read_paths_from(kept_messages)

    recent_files = sorted(read_file_cache, key=lambda f: f.timestamp, reverse=True)

    budget_used = 0
    for file in recent_files[:5]:
        if file.path in already_visible:
            continue  # skip: it is already in the messages
        if budget_used + file.tokens > 50_000:
            break
        restore_as_attachment(file)
        budget_used += file.tokens

Layer 2: Session Memory, a Continuous Background Summarization Engine

2.1 The core mechanism

Session Memory is one of Claude Code's most innovative designs. A separate sub-agent runs in the background, continuously distilling the conversation into a structured Markdown notes file.

+-- Main Conversation Loop ----------------------------------+
|                                                            |
|  User query -> Model response -> Tool call -> ... loop     |
|                     |                                      |
|           post_sampling_hook fires                         |
|           after each response                              |
|                     |                                      |
|                     v                                      |
|  +--------------------------------------------+            |
|  | Should Session Memory update?              |            |
|  | (check token threshold & tool call count)  |            |
|  +--------+---------------------------+-------+            |
|      Yes  |                           | No                 |
|           v                           +--- skip            |
|  +------------------------+                                |
|  | Fork sub-agent         | <-- separate context,          |
|  | only allowed to Edit   |     non-blocking               |
|  | the summary file       |                                |
|  +-----------+------------+                                |
|              v                                             |
|  Update session_memory.md                                  |
|                                                            |
+------------------------------------------------------------+

2.2 Trigger conditions: dual-threshold gating

The summary is not updated on every turn; a carefully designed set of conditions must hold:

def should_extract_memory(messages):
    current_tokens = estimate_total_tokens(messages)

    # Gate 0: the conversation just started, too little content to extract
    if not initialized:
        if current_tokens < INIT_THRESHOLD:
            return False
        initialized = True

    # Gate 1 (hard): has token growth since the last extraction been large enough?
    token_threshold_met = (
        current_tokens - last_extraction_tokens >= MIN_TOKENS_BETWEEN_UPDATE
    )

    # Gate 2 (soft): enough tool calls since the last extraction?
    tool_call_threshold_met = (
        count_tool_calls_since(last_extraction_msg) >= TOOL_CALLS_BETWEEN_UPDATES
    )

    # Gate 3 (opportunistic): did the last turn have no tool calls? (a natural pause)
    at_natural_break = not has_tool_calls_in_last_turn(messages)

    # Trigger rule: the token threshold is mandatory; either of the other two suffices
    return token_threshold_met and (tool_call_threshold_met or at_natural_break)

The design philosophy: the token threshold is a hard requirement (extraction that fires too often wastes money), while the tool-call count and "natural pause" are soft triggers, so extraction happens in the gaps where the model pauses to think and never interrupts the workflow.

2.3 Structure of the summary file

Session Memory maintains a Markdown file with a fixed template of 10 sections:

+--------- session_memory.md ----------+
|                                      |
|  # Session Title      (5-10 words)   |
|  # Current State      (important!)   |
|  # Task Specification                |
|  # Files and Functions               |
|  # Workflow                          |
|  # Errors & Corrections              |
|  # Codebase and System Docs          |
|  # Learnings                         |
|  # Key Results                       |
|  # Worklog                           |
|                                      |
|  Per-section limit : ~2000 tokens    |
|  Total file limit  : ~12000 tokens   |
+--------------------------------------+

When the file exceeds its budget, the update prompt tells the sub-agent to trim it down:

def build_update_prompt(current_notes, notes_path):
    prompt = render_template(TEMPLATE, current_notes, notes_path)

    # Analyze the size of each section
    section_sizes = analyze_sections(current_notes)
    total = estimate_tokens(current_notes)

    # Over the total budget -> force trimming
    if total > 12_000:
        prompt += f"""
        CRITICAL: the file is currently ~{total} tokens, over the 12000 limit.
        You must trim it aggressively, keeping Current State and Errors first."""

    # A single section over its limit -> ask to trim that section
    oversized = [s.name for s in section_sizes if s.tokens > 2000]
    if oversized:
        prompt += f"\nThe following sections need trimming: {oversized}"

    return prompt

2.4 The update path: a Fork Agent with permission isolation

Session Memory updates run through a restricted Fork Agent. The key design decision: it may edit the summary file and do nothing else.

def create_memory_file_permission(memory_path):
    """Build a permission function: only the Edit tool, only on the summary file"""
    def can_use_tool(tool, input):
        if tool.name == "FileEdit" and input["file_path"] == memory_path:
            return ALLOW
        return DENY  # everything else is rejected

    return can_use_tool

# Launch the Fork Agent
run_forked_agent(
    prompt       = build_update_prompt(current_notes, path),
    can_use_tool = create_memory_file_permission(path),  # strict permissions
    query_source = "session_memory",
    # The Fork Agent shares the main conversation's prompt cache, lowering cost
)

2.5 User customization

Session Memory supports custom templates and prompts, which users can drop into:

~/.claude/session-memory/config/
├── template.md    ← custom summary template (replaces the default 10 sections)
└── prompt.md      ← custom update instructions (supports variables such as {{currentNotes}})
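A minimal sketch of how such a prompt template might be rendered, assuming simple `{{variable}}` substitution (any variable name besides {{currentNotes}} here is hypothetical):

```python
import re

def render_prompt(template: str, variables: dict) -> str:
    """Replace every {{name}} placeholder with its value; leave unknown names intact."""
    def substitute(match):
        name = match.group(1)
        return str(variables.get(name, match.group(0)))
    return re.sub(r"\{\{(\w+)\}\}", substitute, template)

template = "Update the notes below.\n{{currentNotes}}\nWrite to {{notesPath}}."
rendered = render_prompt(template, {
    "currentNotes": "# Current State\nAuth refactor in progress.",
    "notesPath": "session_memory.md",
})
print(rendered)
```

Leaving unknown placeholders untouched makes a half-filled template easy to spot during debugging.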

Layer 3: Persistent Memory (memdir), Long-Term Memory Across Sessions

3.1 Architecture

Persistent memory is a filesystem-based knowledge base:

~/.claude/projects/<project>/memory/
│
├── MEMORY.md                ← index file (always loaded into the system prompt)
│   Example content:
│   - [User Preferences](user_preferences.md) — Prefers bun, uses Go
│   - [Project Setup](project_setup.md) — K8s deploy, GH Actions CI
│   - [API Gotchas](api_gotchas.md) — Auth token must be refreshed
│
├── user_preferences.md      ← user preferences
├── project_setup.md         ← project configuration
├── feedback_testing.md      ← feedback notes
├── api_gotchas.md           ← reference information
│
└── team/                    ← team-shared memory (optional)
    ├── MEMORY.md
    └── coding_standards.md

3.2 Four memory types

The system defines a strict taxonomy with clear boundaries for each type:

+-----------+----------------------------+-------------------------------+
|   Type    |       Description          |         Example               |
+-----------+----------------------------+-------------------------------+
|  user     | Identity, role, prefs      | "Backend dev, prefers Go"     |
+-----------+----------------------------+-------------------------------+
| feedback  | Corrections & feedback     | "Use const, not var"          |
+-----------+----------------------------+-------------------------------+
| project   | Non-code project knowledge | "Deploy: k8s, CI: GH Actions" |
+-----------+----------------------------+-------------------------------+
| reference | Tool / API usage notes     | "Internal API auth method"    |
+-----------+----------------------------+-------------------------------+

     X  NOT saved: info derivable from code (architecture, git history)

Each memory file carries YAML frontmatter with its metadata:

---
name: user_preferences
description: User's coding style preferences
type: feedback
---
Use bun instead of npm for all package management.
Prefer functional components over class components.
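A small sketch of reading that frontmatter without a YAML library, assuming the simple `key: value` form shown above:

```python
def parse_frontmatter(text: str):
    """Split a memory file into (metadata dict, body) based on --- fences."""
    lines = text.strip().split("\n")
    if lines[0] != "---":
        return {}, text
    end = lines.index("---", 1)  # closing fence
    meta = {}
    for line in lines[1:end]:
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:])
    return meta, body

doc = """---
name: user_preferences
description: User's coding style preferences
type: feedback
---
Use bun instead of npm for all package management."""
meta, body = parse_frontmatter(doc)
print(meta["type"])  # → feedback
```

Only the frontmatter (name + description + type) is scanned during retrieval; the body is loaded on demand.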

3.3 Limits on the index file

MEMORY.md is always loaded into the system prompt, so strict limits keep it from growing without bound:

MAX_LINES = 200       # at most 200 lines
MAX_BYTES = 25_000    # at most 25KB

def truncate_memory_index(raw_content):
    lines = raw_content.strip().split("\n")
    truncated_by_lines = truncated_by_bytes = False

    # Truncate by line count first
    if len(lines) > MAX_LINES:
        lines = lines[:MAX_LINES]
        truncated_by_lines = True

    # Then truncate by byte count (guards against extremely long single lines)
    content = "\n".join(lines)
    if len(content) > MAX_BYTES:
        cut_at = content.rfind("\n", 0, MAX_BYTES)  # cut at a line boundary
        content = content[:cut_at]
        truncated_by_bytes = True

    if truncated_by_lines or truncated_by_bytes:
        content += "\n\n> WARNING: MEMORY.md is too large; only part of it was loaded."
        content += "\n> Keep index entries to one line of ~200 chars; put details in topic files."

    return content

3.4 Smart memory retrieval

Not every conversation loads every memory file. A lightweight Sonnet model performs real-time retrieval, loading only the most relevant memories on demand:

User sends: "Help me fix auth logic in the payment API"
    |
    v
+----------------------------------------------+
| Scan memory/ dir for all .md files           |
| Read each file's frontmatter                 |
| (name + description + type)                  |
| Exclude files already shown this turn        |
+----------------------+-----------------------+
                       |
                       v
+----------------------------------------------+
| Call Sonnet (lightweight, fast)              |
|                                              |
| System: "Select up to 5 memories that        |
|  will be useful for the current query"       |
|                                              |
| User: "Query: fix payment API auth logic     |
|  Available memories:                         |
|  - api_gotchas.md: API auth caveats    [HIT] |
|  - user_prefs.md: coding preferences   [HIT] |
|  - old_deploy.md: old deploy notes    [SKIP] |
|  Recent tools: FileEdit, Bash"               |
|                                              |
| -> Returns: ["api_gotchas.md",               |
|              "user_prefs.md"]                |
+----------------------+-----------------------+
                       |
                       v
         Load selected memory files into context

Two clever filtering rules are applied here:

  • Deduplication: a memory file already shown in an earlier turn is not recommended again, leaving its slot for a fresh candidate
  • Tool awareness: if the user has recently been using some tool (say mcp__X__spawn), its usage documentation is not recommended (the conversation already contains live examples), but memories containing warnings/gotchas for that tool still are
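A sketch of the candidate-selection step that runs before the Sonnet call, combining the frontmatter scan with the two filters above (the field names `tool` and `has_gotchas` are assumptions for illustration):

```python
def build_candidates(memory_files, already_shown, recent_tools):
    """Filter memory files before asking the retrieval model to rank them."""
    candidates = []
    for f in memory_files:
        if f["name"] in already_shown:
            continue  # dedup: leave the slot for a fresh candidate
        # Tool awareness: skip pure usage docs for tools already in active use,
        # but keep anything flagged as containing warnings/gotchas
        if f.get("tool") in recent_tools and not f.get("has_gotchas"):
            continue
        candidates.append(f)
    return candidates

files = [
    {"name": "api_gotchas.md", "tool": "mcp__X__spawn", "has_gotchas": True},
    {"name": "spawn_usage.md", "tool": "mcp__X__spawn", "has_gotchas": False},
    {"name": "user_prefs.md"},
    {"name": "old_deploy.md"},
]
result = build_candidates(files, already_shown={"old_deploy.md"},
                          recent_tools={"mcp__X__spawn"})
print([f["name"] for f in result])  # → ['api_gotchas.md', 'user_prefs.md']
```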

3.5 Background memory extraction

Beyond the user explicitly saying "remember this", Claude Code also extracts memorable content in the background. The extraction agent operates under strict constraints:

+--- Memory Extraction Agent Rules --------+
|                                          |
|  ALLOWED:                                |
|  +- Read / Grep / Glob memory dir files  |
|  +- Read-only Bash (ls/find/cat/stat)    |
|  +- Edit / Write files inside memory dir |
|                                          |
|  DENIED:                                 |
|  +- X  Read or modify source code files  |
|  +- X  Write-capable Bash commands       |
|  +- X  MCP, Agent, or other tools        |
|  +- X  Bash rm (no deleting files)       |
|                                          |
|  EFFICIENCY (limited turn budget):       |
|  +- Turn 1: all Read calls in parallel   |
|  +- Turn 2: all Write/Edit in parallel   |
|                                          |
|  CONTENT CONSTRAINT:                     |
|  +- Extract from conversation only,      |
|     never investigate source code        |
|                                          |
+------------------------------------------+
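The ALLOWED/DENIED rules above can be sketched as a permission function in the same style as the Session Memory one (the tool names and rule encoding are assumptions for illustration):

```python
READ_ONLY_BASH = {"ls", "find", "cat", "stat"}

def can_extraction_agent_use(tool_name, tool_input, memory_dir):
    """Permission check for the memory extraction agent."""
    path = tool_input.get("file_path", "")
    if tool_name in ("Read", "Grep", "Glob"):
        return path.startswith(memory_dir)  # read only inside the memory dir
    if tool_name in ("Edit", "Write"):
        return path.startswith(memory_dir)  # write only inside the memory dir
    if tool_name == "Bash":
        command = tool_input.get("command", "").split()
        return bool(command) and command[0] in READ_ONLY_BASH  # no rm, no writes
    return False  # MCP, Agent, everything else: denied

mem = "/home/u/.claude/projects/p/memory/"
print(can_extraction_agent_use("Edit", {"file_path": mem + "MEMORY.md"}, mem))  # → True
print(can_extraction_agent_use("Bash", {"command": "rm -rf /"}, mem))           # → False
print(can_extraction_agent_use("Edit", {"file_path": "/src/main.go"}, mem))     # → False
```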

How Do the Three Layers Work Together?

A full timeline shows how the three layers cooperate:

Timeline ─────────────────────────────────────────────────────►

▎ SESSION START
▎ ├─ Load MEMORY.md index into system prompt ··········  Persistent Memory
▎ ├─ Sonnet retrieves relevant memory files ···········  Persistent Memory
▎
▎ CONVERSATION IN PROGRESS...
▎ ├─ Each model response -> post_sampling_hook ········  Session Memory
▎ │                                                     (background update)
▎ │
▎ │  [Conversation grows ~5K tokens]
▎ │  └─ Session Memory updates session_memory.md
▎ │
▎ │  [Conversation continues...]
▎ │  └─ Memory Extraction Agent checks for ············  Persistent Memory
▎ │     save-worthy content (preferences, feedback)      (background write)
▎ │
▎ │  [Context approaching 93%]
▎ │  ├─ Try SM Compact ································  Context Management
▎ │  │   ├─ Success -> keep recent 10K-40K tokens
▎ │  │   └─ Fail -> Full Compact -> PTL Retry
▎ │  └─ Restore files, Plan, Skills, Agent state
▎ │
▎ │  [Conversation continues... (loop)]
▎
▎ SESSION END
▎ ├─ session_memory.md preserved (resumable) ··········  Session Memory
▎ └─ memory/*.md persisted permanently ················  Persistent Memory
▎
▎ NEXT SESSION START
▎ ├─ Reload MEMORY.md
▎ ├─ Retrieve relevant memory files
▎ └─ If resuming -> read session_memory.md as context

Engineering Details Worth Highlighting

Prompt cache protection

Compaction is the natural enemy of the prompt cache: when all the messages change, the cache is invalidated. The system's countermeasures:

Before compaction:
  Main prompt cache: [system_prompt | tools | msg1...msgN]
                     <---- cache hit ---->

After compaction:
  New messages: [system_prompt | tools | summary | recent_msgs]
                <-- still hit! -->
                (system + tools unchanged)

  + Notify cache monitor: "compaction just happened,
    cache hit-rate drop is expected, suppress alerts"

The Fork Agent design also exists to maximize cache reuse: the compaction agent sends a request whose prefix (system prompt + tools + message prefix) is identical to the main conversation's, so it hits the same cache.
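The cache-friendliness property boils down to prefix stability, which can be checked mechanically. A toy sketch (the request shape here is an assumption, not the real wire format):

```python
def shared_cache_prefix_len(request_a, request_b):
    """Count how many leading request segments are identical.
    Only this shared prefix can be served from the prompt cache."""
    n = 0
    for a, b in zip(request_a, request_b):
        if a != b:
            break
        n += 1
    return n

main = ["<system prompt>", "<tool defs>", "msg1", "msg2", "msg3"]
fork = ["<system prompt>", "<tool defs>", "msg1", "msg2", "<summary request>"]
post_compact = ["<system prompt>", "<tool defs>", "<summary>", "recent msgs"]

print(shared_cache_prefix_len(main, fork))          # → 4  (fork reuses almost everything)
print(shared_cache_prefix_len(main, post_compact))  # → 2  (system + tools still hit)
```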

Handling images and media

Before compaction, images and document attachments are stripped (they are useless for a text summary but consume huge numbers of tokens):

def strip_images(messages):
    """Images and documents -> text placeholders"""
    for msg in messages:
        for block in msg.content:
            if block.type == "image":
                replace_with("[image]")
            elif block.type == "document":
                replace_with("[document]")
    return messages

Persisting session metadata

An easy detail to miss: after compaction, the session metadata (custom title, tags) must be re-appended. The --resume feature reads the last 16KB of the log file to get session information; if compaction generates a flood of new messages, the metadata gets pushed out of that window, and the session shows an auto-generated title instead of the user's chosen name.
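A sketch of why re-appending matters, modeling the log as a JSONL file whose last 16KB is scanned for the newest metadata record (the record format is an assumption for illustration):

```python
import json

TAIL_BYTES = 16 * 1024

def read_session_metadata(log_bytes: bytes):
    """Return the last metadata record found in the final 16KB of the log."""
    tail = log_bytes[-TAIL_BYTES:].decode("utf-8", errors="ignore")
    metadata = None
    for line in tail.splitlines():
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # the first tail line may be a cut-off record
        if record.get("type") == "metadata":
            metadata = record
    return metadata

log = json.dumps({"type": "metadata", "title": "Auth refactor"}).encode() + b"\n"
log += b"\n".join(json.dumps({"type": "message", "i": i}).encode() for i in range(2000))
# ~2000 message records (~50KB) push the metadata out of the 16KB tail window,
# so re-appending it after compaction is the only way --resume can find it.
print(read_session_metadata(log))  # → None
```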


Summary

Claude Code's memory management system is a carefully designed three-layer architecture:

+-------------------+------------------------------------------+---------------------+---------------------------------+
|       Layer       |                Mechanism                 |      Lifetime       |              Cost               |
+-------------------+------------------------------------------+---------------------+---------------------------------+
| Context window    | Three-level compact (SM -> Full -> PTL)  | Single conversation | SM: zero, Full: 1 API call      |
| Session Memory    | Background Fork Agent summarizer         | Single session      | 1 light API call per ~5K tokens |
| Persistent memory | Filesystem + Sonnet smart retrieval      | Permanent           | Retrieval: 1 Sonnet call        |
+-------------------+------------------------------------------+---------------------+---------------------------------+

The core design philosophy: rather than chasing an ever-larger context window, make a bounded window remember what matters, forget the right things, and find what it needs.

  • Remember what matters: Session Memory continuously distills the conversation, so no key information is lost
  • Forget the right things: three-level compaction keeps what is most important and discards redundant tool-call detail
  • Find what you need: memdir organizes memories by type, and Sonnet retrieval recalls them precisely

This system gives Claude Code an "effective memory capacity" far beyond its 200K context window.


All code blocks in this post are Python pseudocode or sketches inferred from observed behavior and public information, intended to aid understanding of Claude Code's internals.