A Deep Dive into Claude Code's Memory Management
Developers who use Claude Code have probably noticed this: even after modifying dozens of files in one very long conversation, it still seems to "remember" what it did earlier. More impressively, if you told it "I prefer bun over npm" in a previous conversation, it automatically follows that preference the next time.
Behind this is a carefully engineered memory management system. In this post, we take Claude Code's memory machinery apart layer by layer.
Global Architecture: A Three-Tier Memory System
Claude Code's memory management maps naturally onto the human memory system:
+----------------------------------------------------------+
| Persistent Memory (Long-term Memory) |
| memdir: MEMORY.md index + topic files |
| Persisted across sessions |
+----------------------------------------------------------+
| Session Memory (Working Memory) |
| Background sub-agent maintains Markdown summary |
| Valid within a single session |
+----------------------------------------------------------+
| Context Window (Short-term Memory) |
| Raw messages + tool call results in current conversation|
| Immediately available |
+----------------------------------------------------------+
Let's take each layer apart.
Layer 1: Context Window Management and the Three-Level Compaction Strategy
1.1 When is compaction triggered?
The auto-compaction trigger logic, in pseudocode:

BUFFER_TOKENS = 13_000  # reserved buffer

def get_auto_compact_threshold(model):
    """Compute the auto-compaction threshold: roughly 93% of the effective window."""
    effective_window = get_context_window(model) - MAX_OUTPUT_TOKENS
    return effective_window - BUFFER_TOKENS

def should_auto_compact(messages, model):
    token_count = estimate_token_count(messages)
    threshold = get_auto_compact_threshold(model)
    return token_count >= threshold

With a 200K context window, compaction kicks in at roughly 93% utilization.
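To check the arithmetic, assume a hypothetical MAX_OUTPUT_TOKENS of 8,192 (the real constant is not public and may differ):

```python
BUFFER_TOKENS = 13_000
MAX_OUTPUT_TOKENS = 8_192  # assumed value, for illustration only

context_window = 200_000
effective_window = context_window - MAX_OUTPUT_TOKENS  # 191,808
threshold = effective_window - BUFFER_TOKENS           # 178,808

print(threshold)                               # 178808
print(round(threshold / effective_window, 3))  # 0.932, i.e. ~93%
```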
There is also a circuit breaker: after 3 consecutive failures, retries stop, preventing an endless loop of wasted API calls:

MAX_FAILURES = 3

def auto_compact_if_needed(messages, tracking):
    # Circuit breaker: too many consecutive failures, give up
    if tracking.consecutive_failures >= MAX_FAILURES:
        return NOT_COMPACTED
    if not should_auto_compact(messages):
        return NOT_COMPACTED
    # ... perform the compaction

1.2 The Execution Order of the Three Compaction Levels
When compaction is triggered, the system tries three strategies in priority order:
+--------------------------+
| Context reaching ~93% |
+------------+-------------+
|
v
+---------------------------------+
| Level 1: SM Compact |
| Replace old msgs with | <-- Zero API calls
| pre-built summary |
+------+----------------+---------+
OK | | Fail / N/A
v v
return +---------------------------------+
| Level 2: Full Compact |
| Call API to generate | <-- 1 API call
| structured summary |
+------+----------------+---------+
OK | | prompt-too-long
v v
return +---------------------------------+
| Level 3: PTL Retry |
| Truncate oldest messages, | <-- Max 3 retries
| retry compaction |
+---------------------------------+
The overall flow in pseudocode:

def auto_compact_if_needed(messages, context):
    # === Try Session Memory compaction first (zero cost) ===
    result = try_session_memory_compaction(messages)
    if result:
        return result  # success, without a single API call
    # === Fall back to traditional API compaction ===
    result = full_compact_conversation(messages, context)
    return result

1.3 Level 1: Session Memory Compact (Zero-Cost Compaction)
This is the cleverest of the three strategies. When the Session Memory file is available, it simply replaces the old conversation history, with no extra API call at all.
Which messages are kept is controlled by three parameters:

SM_COMPACT_CONFIG = {
    "min_tokens": 10_000,     # keep at least 10K tokens of raw messages
    "min_text_messages": 5,   # keep at least 5 messages containing text
    "max_tokens": 40_000,     # keep at most 40K tokens (hard cap)
}

Before and after compaction:
Before (context window almost full):
+-----------------------------------------------------+
| msg1 | msg2 | ... | msg50 | msg51 | ... | msg80 |
|<-- covered by SM summary -->|<-- recent messages -->|
+-----------------------------------------------------+
| SM Compact
v
After (lots of free space):
+-----------------------------------------------------+
| [boundary] | SM summary | msg51 | ... | msg80 |
| | (~12K toks) |<-- kept raw messages --> |
| | | (10K ~ 40K tokens) |
| <----- used -----------> |<---- free space -------->|
+-----------------------------------------------------+
The core algorithm that decides which messages to keep:

def calculate_messages_to_keep(messages, last_summarized_idx):
    """Starting after the last summarized message, expand backward until the limits are met."""
    config = SM_COMPACT_CONFIG
    start = last_summarized_idx + 1
    total_tokens = 0
    text_msg_count = 0
    # First, count what we already have after `start`
    for msg in messages[start:]:
        total_tokens += estimate_tokens(msg)
        if has_text_content(msg):
            text_msg_count += 1
    # If the hard cap is reached, or both lower bounds are met, return immediately
    if total_tokens >= config["max_tokens"]:
        return start
    if total_tokens >= config["min_tokens"] and \
       text_msg_count >= config["min_text_messages"]:
        return start
    # Otherwise expand backward, pulling in more messages
    for i in range(start - 1, -1, -1):
        total_tokens += estimate_tokens(messages[i])
        if has_text_content(messages[i]):
            text_msg_count += 1
        start = i
        if total_tokens >= config["max_tokens"]:
            break
        if total_tokens >= config["min_tokens"] and \
           text_msg_count >= config["min_text_messages"]:
            break
    # Finally: protect tool_use/tool_result pairs from being split
    return adjust_for_tool_pairs(messages, start)

There is a subtle but important detail here: tool_use/tool_result pairs must not be split apart:

def adjust_for_tool_pairs(messages, start_index):
    """Ensure every tool_result in the kept range has its matching tool_use."""
    # Collect the IDs referenced by all tool_results within the kept range
    needed_tool_use_ids = set()
    for msg in messages[start_index:]:
        for block in msg.content:
            if block.type == "tool_result":
                needed_tool_use_ids.add(block.tool_use_id)
    # Search backward and pull the matching tool_use messages into the kept range
    for i in range(start_index - 1, -1, -1):
        if has_matching_tool_use(messages[i], needed_tool_use_ids):
            start_index = i  # expand the kept range
    return start_index

Why does this matter? The Claude API requires every tool_result to have a matching tool_use; otherwise the request fails. If compaction happens to cut right between a tool-call pair, the resulting "orphaned tool_result" makes every subsequent API call fail.
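To make the failure mode concrete, here is a minimal sketch of detecting an orphaned tool_result (the message shapes are simplified dicts, not the real API types):

```python
def find_orphan_tool_results(messages):
    """Return tool_use_ids referenced by a tool_result with no matching tool_use."""
    seen_tool_use = set()
    orphans = []
    for msg in messages:
        for block in msg.get("content", []):
            if block["type"] == "tool_use":
                seen_tool_use.add(block["id"])
            elif block["type"] == "tool_result" and block["tool_use_id"] not in seen_tool_use:
                orphans.append(block["tool_use_id"])
    return orphans

# Cutting between a tool_use and its tool_result leaves an orphan behind:
kept = [
    {"content": [{"type": "tool_result", "tool_use_id": "toolu_01"}]},
    {"content": [{"type": "text", "text": "Done."}]},
]
print(find_orphan_tool_results(kept))  # ['toolu_01']
```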
1.4 Level 2: Full Compact (Traditional API Compaction)
When Session Memory is unavailable, the Claude API is called to produce a summary of the whole conversation.
The compaction prompt asks for a structured summary with 9 sections:
+------- Full Compact Summary Structure -------+
| |
| 1. Primary Request and Intent |
| 2. Key Technical Concepts |
| 3. Files and Code Sections |
| 4. Errors and Fixes |
| 5. Problem Solving |
| 6. All User Messages <-- verbatim! |
| 7. Pending Tasks |
| 8. Current Work |
| 9. Optional Next Step |
| |
+----------------------------------------------+
^ ^
| |
<analysis> scratchpad #6 preserves user's
think-then-summarize exact words to prevent
stripped before storing intent drift
Three design decisions worth noting:
Design 1: the <analysis> scratchpad. The model is asked to first think freely inside an <analysis> tag, then emit the final summary inside a <summary> tag. The scratchpad is stripped programmatically and never enters the context, but it noticeably improves summary quality.
Design 2: no tools allowed. The compaction agent is strictly limited to text output. It has a budget of exactly 1 turn; if it tried to call a tool, that turn would be wasted.
Design 3: shared prompt cache. The compaction agent is started via a "fork" mechanism and shares the same system prompt, tool list, and message prefix as the main conversation, so it reuses the prompt cache and costs far less.
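Design 1 implies a post-processing step. A sketch of how the scratchpad might be stripped before the summary is stored (the tag names come from the article; the extraction code itself is my assumption):

```python
import re

def extract_summary(model_output: str) -> str:
    """Drop the <analysis> scratchpad; keep only the <summary> body."""
    match = re.search(r"<summary>(.*?)</summary>", model_output, re.DOTALL)
    if match:
        return match.group(1).strip()
    # Fallback: no <summary> tags found, strip any analysis block and keep the rest
    return re.sub(r"<analysis>.*?</analysis>", "", model_output, flags=re.DOTALL).strip()

raw = "<analysis>Reviewing the session...</analysis>\n<summary>1. Primary Request: ...</summary>"
print(extract_summary(raw))  # 1. Primary Request: ...
```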
1.5 Level 3: PTL Retry (The Last Safety Net)
If the compaction request itself exceeds the token limit (prompt-too-long), a truncate-and-retry loop kicks in:
MAX_PTL_RETRIES = 3

def compact_with_retry(messages, summary_request):
    for attempt in range(MAX_PTL_RETRIES):
        response = call_api(messages + [summary_request])
        if not is_prompt_too_long(response):
            return response  # success
        # Group messages by API round, then drop from the oldest
        groups = group_by_api_round(messages)
        token_gap = parse_token_overage(response)  # how far over the limit, if the error reports it
        if token_gap is not None:
            # Exact mode: compute how many groups to drop from the overage
            drop_count = calculate_exact_drop(groups, token_gap)
        else:
            # Fallback mode: drop the oldest 20%
            drop_count = max(1, int(len(groups) * 0.2))
        messages = flatten(groups[drop_count:])
    raise Error("Conversation too long to compact")

1.6 State Recovery After Compaction
Compaction is not just deleting messages. Once the old messages are gone, a lot of implicit context goes with them, so the system has to restore the key pieces of state:
+--- 6 Types of State to Restore After Compaction ----+
| |
| 1. Recent files (max 5, budget 50K tokens) |
| Avoid re-reading same files after compact |
| |
| 2. Current Plan file |
| Keep the plan if one is in progress |
| |
| 3. Invoked Skills content (max 5K tokens each) |
| Skill instructions must survive compact |
| |
| 4. Async Agent status |
| Is a background agent done? Where's the result? |
| |
| 5. Deferred tool definitions |
| Re-register tools discovered before compact |
| |
| 6. SessionStart hooks (CLAUDE.md etc.) |
| Project config must be re-injected into context |
| |
+-----------------------------------------------------+
There is also a clever deduplication rule: if a file's content is already visible in the kept messages, restoring it is skipped to avoid wasting tokens:

def restore_files_after_compact(read_file_cache, kept_messages):
    # Collect paths whose Read results are already present in the kept messages
    already_visible = collect_read_paths_from(kept_messages)
    recent_files = sorted(read_file_cache, key=lambda f: f.timestamp, reverse=True)
    budget_used = 0
    for file in recent_files[:5]:
        if file.path in already_visible:
            continue  # skip: the messages already contain it
        if budget_used + file.tokens > 50_000:
            break
        restore_as_attachment(file)
        budget_used += file.tokens

Layer 2: Session Memory, a Continuous Background Summarization Engine
2.1 Core Mechanism
Session Memory is one of Claude Code's most innovative designs. An independent sub-agent runs in the background, continuously distilling the conversation into a structured Markdown notes file.
+-- Main Conversation Loop ----------------------------------+
| |
| User query -> Model response -> Tool call -> ... loop |
| | |
| post_sampling_hook fires |
| after each response |
| | |
| v |
| +--------------------------------------------+ |
| | Should Session Memory update? | |
| | (check token threshold & tool call count) | |
| +--------+---------------------------+-------+ |
| Yes | | No |
| v +--- skip |
| +------------------------+ |
| | Fork sub-agent | <-- separate context, |
| | only allowed to Edit | non-blocking |
| | the summary file | |
| +-----------+------------+ |
| v |
| Update session_memory.md |
| |
+------------------------------------------------------------+
2.2 Trigger Conditions: Dual-Threshold Gating
Not every turn updates the summary; a carefully designed set of conditions must hold (here `state` carries the extraction bookkeeping across calls):

def should_extract_memory(messages, state):
    current_tokens = estimate_total_tokens(messages)
    # Gate 0: the conversation just started; too little content to extract
    if not state.initialized:
        if current_tokens < INIT_THRESHOLD:
            return False
        state.initialized = True
    # Gate 1 (hard): has the token count grown enough since the last extraction?
    token_threshold_met = (
        current_tokens - state.last_extraction_tokens >= MIN_TOKENS_BETWEEN_UPDATE
    )
    # Gate 2 (soft): enough tool calls since the last extraction?
    tool_call_threshold_met = (
        count_tool_calls_since(state.last_extraction_msg) >= TOOL_CALLS_BETWEEN_UPDATES
    )
    # Gate 3 (opportunistic): did the last turn make no tool calls? (a natural pause)
    at_natural_break = not has_tool_calls_in_last_turn(messages)
    # Trigger rule: the token threshold is mandatory; either of the other two suffices
    return token_threshold_met and (tool_call_threshold_met or at_natural_break)

Design philosophy: the token threshold is a hard requirement (preventing overly frequent, costly extraction), while the tool-call count and the "natural pause" are soft triggers (extraction happens in the gaps while the model pauses to think, never interrupting the workflow).
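To see how the gates combine, here is a self-contained toy with assumed constants (the real threshold values are not public):

```python
MIN_TOKENS_BETWEEN_UPDATE = 5_000  # assumed
TOOL_CALLS_BETWEEN_UPDATES = 4     # assumed

def should_extract(tokens_grown, tool_calls_since, last_turn_had_tools):
    token_ok = tokens_grown >= MIN_TOKENS_BETWEEN_UPDATE        # hard gate
    tool_ok = tool_calls_since >= TOOL_CALLS_BETWEEN_UPDATES    # soft gate
    natural_break = not last_turn_had_tools                     # soft gate
    return token_ok and (tool_ok or natural_break)

print(should_extract(6_000, 1, True))   # False: tokens grew, but mid-work with few tool calls
print(should_extract(6_000, 5, True))   # True: tokens + enough tool calls
print(should_extract(6_000, 1, False))  # True: tokens + a natural pause
print(should_extract(2_000, 9, False))  # False: the token threshold is mandatory
```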
2.3 Structure of the Summary File
Session Memory maintains a fixed-template Markdown file with 10 sections:
+--------- session_memory.md ----------+
| |
| # Session Title (5-10 words) |
| # Current State (important!) |
| # Task Specification |
| # Files and Functions |
| # Workflow |
| # Errors & Corrections |
| # Codebase and System Docs |
| # Learnings |
| # Key Results |
| # Worklog |
| |
| Per-section limit : ~2000 tokens |
| Total file limit : ~12000 tokens |
+--------------------------------------+
When the file exceeds its budget, the update prompt asks the sub-agent to trim it down:

def build_update_prompt(current_notes, notes_path):
    prompt = render_template(TEMPLATE, current_notes, notes_path)
    # Measure each section's size
    section_sizes = analyze_sections(current_notes)
    total = estimate_tokens(current_notes)
    # Over the total budget -> force trimming
    if total > 12_000:
        prompt += f"""
CRITICAL: The file is currently ~{total} tokens, over the 12000 limit.
You must trim it substantially, prioritizing Current State and Errors."""
    # A single section over its limit -> ask to trim that section
    oversized = [s for s in section_sizes if s.tokens > 2000]
    if oversized:
        prompt += f"\nThe following sections need trimming: {oversized}"
    return prompt
2.4 Updates via a Fork Agent with Permission Isolation
Session Memory updates run through a restricted fork agent. The key design decision: it may edit the summary file, and nothing else:

def create_memory_file_permission(memory_path):
    """Create a permission function: only the Edit tool, only on the summary file."""
    def can_use_tool(tool, input):
        if tool.name == "FileEdit" and input["file_path"] == memory_path:
            return ALLOW
        return DENY  # every other operation is rejected
    return can_use_tool

# Launch the fork agent
run_forked_agent(
    prompt = build_update_prompt(current_notes, path),
    can_use_tool = create_memory_file_permission(path),  # strict permissions
    query_source = "session_memory",
    # The fork agent shares the main conversation's prompt cache, lowering cost
)

2.5 User Customization
Session Memory supports custom templates and prompts, which users can place at:
~/.claude/session-memory/config/
├── template.md ← custom summary template (replaces the default 10 sections)
└── prompt.md ← custom update instructions (supports variables like {{currentNotes}})
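For illustration only, a custom prompt.md might look like this; the {{currentNotes}} variable comes from the article, while the wording and any other structure are invented:

```markdown
Update the session notes below. Keep "Current State" first and under 1500 tokens.

Current notes:
{{currentNotes}}

Write the updated notes back to the file; output nothing else.
```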
Layer 3: Persistent Memory (memdir), Long-Term Memory Across Sessions
3.1 Architecture
Persistent memory is a filesystem-based knowledge base:

~/.claude/projects/<project>/memory/
│
├── MEMORY.md ← index file (always loaded into the system prompt)
│   Example contents:
│   - [User Preferences](user_preferences.md) — Prefers bun, uses Go
│   - [Project Setup](project_setup.md) — K8s deploy, GH Actions CI
│   - [API Gotchas](api_gotchas.md) — Auth token must be refreshed
│
├── user_preferences.md ← user preferences
├── project_setup.md ← project configuration
├── feedback_testing.md ← feedback records
├── api_gotchas.md ← reference notes
│
└── team/ ← team-shared memory (optional)
    ├── MEMORY.md
    └── coding_standards.md
3.2 The Four Memory Types
The system defines a strict taxonomy, with clear boundaries for each type:
+-----------+----------------------------+-------------------------------+
| Type | Description | Example |
+-----------+----------------------------+-------------------------------+
| user | Identity, role, prefs | "Backend dev, prefers Go" |
+-----------+----------------------------+-------------------------------+
| feedback | Corrections & feedback | "Use const, not var" |
+-----------+----------------------------+-------------------------------+
| project | Non-code project knowledge | "Deploy: k8s, CI: GH Actions" |
+-----------+----------------------------+-------------------------------+
| reference | Tool / API usage notes | "Internal API auth method" |
+-----------+----------------------------+-------------------------------+
X NOT saved: info derivable from code (architecture, git history)
Each memory file carries metadata in YAML frontmatter:
---
name: user_preferences
description: User's coding style preferences
type: feedback
---
Use bun instead of npm for all package management.
Prefer functional components over class components.
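These frontmatter fields are enough to regenerate MEMORY.md index entries mechanically. A minimal sketch (the generation step is my assumption; only simple `key: value` frontmatter is handled):

```python
def parse_frontmatter(text: str) -> dict:
    """Parse the minimal YAML frontmatter used by memory files (key: value pairs only)."""
    meta = {}
    if text.startswith("---"):
        _, header, _body = text.split("---", 2)
        for line in header.strip().splitlines():
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta

doc = """---
name: user_preferences
description: User's coding style preferences
type: feedback
---
Use bun instead of npm."""

meta = parse_frontmatter(doc)
print(f"- [{meta['name']}]({meta['name']}.md) — {meta['description']}")
```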
3.3 Limits on the Index File
Because MEMORY.md is always loaded into the system prompt, strict limits keep it from growing without bound:

MAX_LINES = 200       # at most 200 lines
MAX_BYTES = 25_000    # at most 25KB

def truncate_memory_index(raw_content):
    lines = raw_content.strip().split("\n")
    truncated_by_lines = False
    truncated_by_bytes = False
    # First truncate by line count
    if len(lines) > MAX_LINES:
        lines = lines[:MAX_LINES]
        truncated_by_lines = True
    # Then truncate by byte count (guards against single very long lines)
    content = "\n".join(lines)
    if len(content) > MAX_BYTES:
        cut_at = content.rfind("\n", 0, MAX_BYTES)  # cut at a line boundary
        content = content[:cut_at]
        truncated_by_bytes = True
    if truncated_by_lines or truncated_by_bytes:
        content += "\n\n> WARNING: MEMORY.md is too large; only part of it was loaded."
        content += "\n> Keep index entries to ~200 characters per line; move details into topic files."
    return content
3.4 Smart Memory Retrieval
Not every conversation loads every memory file. The system uses a lightweight Sonnet model for on-the-fly retrieval, loading only the most relevant memories on demand:
User sends: "Help me fix auth logic in the payment API"
|
v
+----------------------------------------------+
| Scan memory/ dir for all .md files |
| Read each file's frontmatter |
| (name + description + type) |
| Exclude files already shown this turn |
+----------------------+-----------------------+
|
v
+----------------------------------------------+
| Call Sonnet (lightweight, fast) |
| |
| System: "Select up to 5 memories that |
| will be useful for the current query" |
| |
| User: "Query: fix payment API auth logic |
| Available memories: |
| - api_gotchas.md: API auth caveats [HIT] |
| - user_prefs.md: coding preferences [HIT] |
| - old_deploy.md: old deploy notes [SKIP] |
| Recent tools: FileEdit, Bash" |
| |
| -> Returns: ["api_gotchas.md", |
| "user_prefs.md"] |
+----------------------+-----------------------+
|
v
Load selected memory files into context
Two clever filtering rules are applied here:
- Deduplication: if a memory file was already shown in a previous turn, it is not recommended again, leaving the slot for a new candidate
- Tool awareness: if the user has recently been using a tool (say mcp__X__spawn), that tool's usage documentation is not recommended (the conversation already contains live usage examples), but memories containing warnings/gotchas still are
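The two rules above can be sketched as a single filter pass (the candidate record shape and field names are assumptions, not the real data model):

```python
def filter_candidates(candidates, shown_this_turn, recent_tools):
    """Drop already-shown memories and usage docs for tools already in active use."""
    picked = []
    for mem in candidates:
        if mem["name"] in shown_this_turn:
            continue  # dedupe: leave the slot for a new candidate
        doc_for = mem.get("documents_tool")
        if doc_for in recent_tools and not mem.get("has_gotchas"):
            continue  # tool-aware: usage examples are already visible in-context
        picked.append(mem)
    return picked

candidates = [
    {"name": "api_gotchas", "documents_tool": "mcp__X__spawn", "has_gotchas": True},
    {"name": "spawn_usage", "documents_tool": "mcp__X__spawn", "has_gotchas": False},
    {"name": "user_prefs"},
]
result = filter_candidates(candidates,
                           shown_this_turn={"user_prefs"},
                           recent_tools={"mcp__X__spawn"})
print([m["name"] for m in result])  # ['api_gotchas']
```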
3.5 Background Memory Extraction
Beyond the user explicitly saying "remember this", Claude Code also extracts memorable content automatically in the background. The extraction agent operates under strict constraints:
+--- Memory Extraction Agent Rules --------+
| |
| ALLOWED: |
| +- Read / Grep / Glob memory dir files |
| +- Read-only Bash (ls/find/cat/stat) |
| +- Edit / Write files inside memory dir |
| |
| DENIED: |
| +- X Read or modify source code files |
| +- X Write-capable Bash commands |
| +- X MCP, Agent, or other tools |
| +- X Bash rm (no deleting files) |
| |
| EFFICIENCY (limited turn budget): |
| +- Turn 1: all Read calls in parallel |
| +- Turn 2: all Write/Edit in parallel |
| |
| CONTENT CONSTRAINT: |
| +- Extract from conversation only, |
| never investigate source code |
| |
+------------------------------------------+
How Do the Three Layers Work Together?
A complete timeline shows how the three memory layers cooperate:
Timeline ─────────────────────────────────────────────────────►
▎ SESSION START
▎ ├─ Load MEMORY.md index into system prompt ·········· Persistent Memory
▎ ├─ Sonnet retrieves relevant memory files ··········· Persistent Memory
▎
▎ CONVERSATION IN PROGRESS...
▎ ├─ Each model response -> post_sampling_hook ········ Session Memory
▎ │ (background update)
▎ │
▎ │ [Conversation grows ~5K tokens]
▎ │ └─ Session Memory updates session_memory.md
▎ │
▎ │ [Conversation continues...]
▎ │ └─ Memory Extraction Agent checks for ············ Persistent Memory
▎ │ save-worthy content (preferences, feedback) (background write)
▎ │
▎ │ [Context approaching 93%]
▎ │ ├─ Try SM Compact ································ Context Management
▎ │ │ ├─ Success -> keep recent 10K-40K tokens
▎ │ │ └─ Fail -> Full Compact -> PTL Retry
▎ │ └─ Restore files, Plan, Skills, Agent state
▎ │
▎ │ [Conversation continues... (loop)]
▎
▎ SESSION END
▎ ├─ session_memory.md preserved (resumable) ·········· Session Memory
▎ └─ memory/*.md persisted permanently ················ Persistent Memory
▎
▎ NEXT SESSION START
▎ ├─ Reload MEMORY.md
▎ ├─ Retrieve relevant memory files
▎ └─ If resuming -> read session_memory.md as context
Engineering Highlights
Prompt Cache Protection
Compaction is the natural enemy of the prompt cache: once the messages all change, the cache is invalidated. The system's countermeasures:
Before compaction:
Main prompt cache: [system_prompt | tools | msg1...msgN]
<---- cache hit ---->
After compaction:
New messages: [system_prompt | tools | summary | recent_msgs]
<-- still hit! -->
(system + tools unchanged)
+ Notify cache monitor: "compaction just happened,
cache hit-rate drop is expected, suppress alerts"
The fork agent design also maximizes cache reuse: the compaction agent's request prefix is identical to the main conversation's (system prompt + tools + message prefix), so it hits the same cache.
Handling Images and Media
Images and document attachments are stripped before compaction (they are useless for producing a text summary but consume a large number of tokens):

def strip_images(messages):
    """Replace images and documents with text placeholders."""
    for msg in messages:
        for block in msg.content:
            if block.type == "image":
                replace_with(block, "[image]")
            elif block.type == "document":
                replace_with(block, "[document]")
    return messages

Persisting Session Metadata
An easy-to-miss detail: after compaction, session metadata (custom title, tags) must be re-appended. The --resume feature reads the last 16KB of the log file to find session info; if compaction produces a flood of new messages, the metadata gets pushed out of that window, and the session then shows an auto-generated title instead of the user's chosen name.
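A sketch of that tail-read constraint (the 16KB figure comes from the behavior described above; the JSONL format and the "type": "metadata" key are my assumptions):

```python
import io
import json

TAIL_BYTES = 16 * 1024

def read_session_metadata(path: str):
    """Scan only the last 16KB of a JSONL session log for the newest metadata record."""
    with open(path, "rb") as f:
        f.seek(0, io.SEEK_END)
        size = f.tell()
        f.seek(max(0, size - TAIL_BYTES))
        tail = f.read().decode("utf-8", errors="ignore")
    metadata = None
    for line in tail.splitlines():
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # partial first line after the seek, or a non-JSON record
        if isinstance(record, dict) and record.get("type") == "metadata":
            metadata = record  # later records win: keep the newest metadata
    return metadata
```

If compaction appends more than 16KB of records after the metadata line, this scan comes up empty, which is exactly why the metadata must be re-appended after compaction.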
Summary
Claude Code's memory management is a carefully designed three-tier architecture:
| Layer | Mechanism | Lifetime | Cost |
|---|---|---|---|
| Context window | Three-level compaction (SM → Full → PTL) | Single conversation | SM: zero; Full: 1 API call |
| Session Memory | Continuous background fork-agent summaries | Single session | One lightweight API call per ~5K tokens |
| Persistent memory | Filesystem + Sonnet smart retrieval | Permanent | Retrieval: 1 Sonnet call |
The core design philosophy: not a bigger context window, but making a bounded window remember well, forget well, and retrieve well.
- Remember well: Session Memory continuously distills the conversation so key information is never lost
- Forget well: three-level compaction keeps what matters most and drops redundant tool-call detail
- Retrieve well: memdir's typed organization plus Sonnet retrieval recalls exactly what's needed
Together, this gives Claude Code an "effective memory capacity" far beyond its 200K context window.
All code blocks in this post are Python pseudocode or illustrative diagrams inferred from observed behavior and public information, intended to help explain Claude Code's internals.