A Deep Dive into Claude Code's Memory Management
Developers who've used Claude Code probably share this experience: even in an ultra-long conversation where dozens of files have been modified, it seems to always "remember" what it did before. Even more remarkably, if you told it "I prefer bun over npm" in a previous session, it automatically follows that preference next time.
Behind this is a sophisticated memory management system. Let's tear apart Claude Code's memory mechanism layer by layer.
Global Architecture: A Three-Layer Memory System
Claude Code's memory management can be compared to the human memory system:
+----------------------------------------------------------+
| Persistent Memory (Long-term Memory) |
| memdir: MEMORY.md index + topic files |
| Persisted across sessions |
+----------------------------------------------------------+
| Session Memory (Working Memory) |
| Background sub-agent maintains Markdown summary |
| Valid within a single session |
+----------------------------------------------------------+
| Context Window (Short-term Memory) |
| Raw messages + tool call results in current conversation|
| Immediately available |
+----------------------------------------------------------+
Let's examine each layer.
Layer 1: Context Window Management — A Three-Level Compaction Strategy
1.1 When Does Compaction Trigger?
The auto-compaction trigger logic, expressed in pseudocode:
BUFFER_TOKENS = 13_000  # Reserved buffer

def get_auto_compact_threshold(model):
    """Calculate the auto-compact threshold ≈ ~93% of the effective window."""
    effective_window = get_context_window(model) - MAX_OUTPUT_TOKENS
    return effective_window - BUFFER_TOKENS

def should_auto_compact(messages, model):
    token_count = estimate_token_count(messages)
    threshold = get_auto_compact_threshold(model)
    return token_count >= threshold
Using a 200K context window as an example, compaction kicks in at roughly 93% utilization.
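Plugging in concrete numbers makes the ~93% figure tangible. A minimal sketch, assuming a 200K-token window and an 8K-token max-output reservation (both are illustrative values, not confirmed constants):

```python
BUFFER_TOKENS = 13_000  # reserved buffer, from the pseudocode above

def auto_compact_threshold(context_window: int, max_output_tokens: int) -> int:
    # Effective window = what's left after reserving room for the model's output
    effective_window = context_window - max_output_tokens
    return effective_window - BUFFER_TOKENS

# With the assumed values: 200_000 - 8_000 = 192_000 effective tokens,
# 192_000 - 13_000 = 179_000 threshold -> ~93% of the effective window.
threshold = auto_compact_threshold(200_000, 8_000)
ratio = threshold / 192_000  # ≈ 0.932
```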
There's also a circuit breaker — after 3 consecutive failures it stops retrying, preventing an endless loop of wasted API calls:
MAX_FAILURES = 3

def auto_compact_if_needed(messages, model, tracking):
    # Circuit breaker: too many consecutive failures, give up
    if tracking.consecutive_failures >= MAX_FAILURES:
        return NOT_COMPACTED
    if not should_auto_compact(messages, model):
        return NOT_COMPACTED
    # ... execute compaction
1.2 Three-Level Compaction Execution Order
When compaction triggers, the system tries three strategies in priority order:
+--------------------------+
| Context reaching ~93% |
+------------+-------------+
|
v
+---------------------------------+
| Level 1: SM Compact |
| Replace old msgs with | <-- Zero API calls
| pre-built summary |
+------+----------------+---------+
OK | | Fail / N/A
v v
return +---------------------------------+
| Level 2: Full Compact |
| Call API to generate | <-- 1 API call
| structured summary |
+------+----------------+---------+
OK | | prompt-too-long
v v
return +---------------------------------+
| Level 3: PTL Retry |
| Truncate oldest messages, | <-- Max 3 retries
| retry compaction |
+---------------------------------+
The overall flow in pseudocode:
def auto_compact_if_needed(messages, context):
    # === Try Session Memory compaction first (zero cost) ===
    result = try_session_memory_compaction(messages)
    if result:
        return result  # Success! No API calls made
    # === Fall back to traditional API compaction ===
    return full_compact_conversation(messages, context)
1.3 Level 1: Session Memory Compact — Zero-Cost Compaction
This is the most elegant compaction strategy. When a Session Memory file is available, it directly replaces old conversation history without any additional API calls.
The message retention policy is controlled by three parameters:
SM_COMPACT_CONFIG = {
    "min_tokens": 10_000,     # Keep at least 10K tokens of raw messages
    "min_text_messages": 5,   # Keep at least 5 text-containing messages
    "max_tokens": 40_000,     # Keep at most 40K tokens (hard cap)
}
Before and after comparison:
Before (context window almost full):
+-----------------------------------------------------+
| msg1 | msg2 | ... | msg50 | msg51 | ... | msg80 |
|<-- covered by SM summary -->|<-- recent messages -->|
+-----------------------------------------------------+
| SM Compact
v
After (lots of free space):
+-----------------------------------------------------+
| [boundary] | SM summary | msg51 | ... | msg80 |
| | (~12K toks) |<-- kept raw messages --> |
| | | (10K ~ 40K tokens) |
| <----- used -----------> |<---- free space -------->|
+-----------------------------------------------------+
The core algorithm for deciding which messages to keep:
def calculate_messages_to_keep(messages, last_summarized_idx):
    """Start from the last summarized message, expand backward until conditions are met."""
    config = SM_COMPACT_CONFIG
    start = last_summarized_idx + 1
    total_tokens = 0
    text_msg_count = 0

    # First count what's already after start
    for msg in messages[start:]:
        total_tokens += estimate_tokens(msg)
        if has_text_content(msg):
            text_msg_count += 1

    # If already at the hard cap, or both minimums satisfied, return
    if total_tokens >= config["max_tokens"]:
        return start
    if total_tokens >= config["min_tokens"] and \
       text_msg_count >= config["min_text_messages"]:
        return start

    # Otherwise expand backward, pulling in more messages
    for i in range(start - 1, -1, -1):
        total_tokens += estimate_tokens(messages[i])
        if has_text_content(messages[i]):
            text_msg_count += 1
        start = i
        if total_tokens >= config["max_tokens"]:
            break
        if total_tokens >= config["min_tokens"] and \
           text_msg_count >= config["min_text_messages"]:
            break

    # Finally: protect tool_use/tool_result pairs from being split
    return adjust_for_tool_pairs(messages, start)
There's an elegant detail here — protecting tool_use/tool_result pairs from being split apart:
def adjust_for_tool_pairs(messages, start_index):
    """Ensure every tool_result in the retained range has its matching tool_use."""
    # Collect all tool_result IDs in the retained range
    needed_tool_use_ids = set()
    for msg in messages[start_index:]:
        for block in msg.content:
            if block.type == "tool_result":
                needed_tool_use_ids.add(block.tool_use_id)
    # Search backward, pulling matching tool_use messages into the retained range
    for i in range(start_index - 1, -1, -1):
        if has_matching_tool_use(messages[i], needed_tool_use_ids):
            start_index = i  # Expand retained range
    return start_index
Why is this necessary? Because the Claude API requires every tool_result to have a corresponding tool_use, or it throws an error. If compaction happens to "cut" in the middle of a tool call pair, an "orphan tool_result" would cause subsequent API calls to fail.
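The invariant can be checked directly. A hedged sketch (the message shape here is simplified from the real API's role + content-block structure):

```python
# Detect "orphan" tool_results after a candidate cut point: any retained
# tool_result whose matching tool_use falls before the cut makes the
# resulting message list invalid for the API.
def has_orphan_tool_results(messages: list[dict], start_index: int) -> bool:
    kept = messages[start_index:]
    tool_use_ids = {
        block["id"]
        for msg in kept
        for block in msg["content"]
        if block["type"] == "tool_use"
    }
    return any(
        block["tool_use_id"] not in tool_use_ids
        for msg in kept
        for block in msg["content"]
        if block["type"] == "tool_result"
    )
```

Cutting between a tool_use and its tool_result makes this return True, which is exactly the situation adjust_for_tool_pairs expands the retained range to avoid.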
1.4 Level 2: Full Compact — Traditional API Compaction
When Session Memory isn't available, it calls the Claude API to generate a summary of the entire conversation.
The compaction prompt requires a structured summary with 9 sections:
+------- Full Compact Summary Structure -------+
| |
| 1. Primary Request and Intent |
| 2. Key Technical Concepts |
| 3. Files and Code Sections |
| 4. Errors and Fixes |
| 5. Problem Solving |
| 6. All User Messages <-- verbatim! |
| 7. Pending Tasks |
| 8. Current Work |
| 9. Optional Next Step |
| |
+----------------------------------------------+
^ ^
| |
<analysis> scratchpad #6 preserves user's
think-then-summarize exact words to prevent
stripped before storing intent drift
Three noteworthy design choices:
Design 1: <analysis> scratchpad — The model first freely analyzes within <analysis> tags, then outputs the final summary within <summary> tags. The scratchpad is stripped by the program and never enters the context, but it significantly improves summary quality.
Design 2: Tool use prohibited — The compaction agent is strictly limited to text output only. Since it has a budget of just 1 turn, attempting a tool call would waste that turn entirely.
Design 3: Prompt cache sharing — The compaction agent launches via a "Fork" mechanism, sharing the same system prompt, tool list, and message prefix with the main conversation, thereby reusing the prompt cache and significantly reducing costs.
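The Fork idea reduces to a simple invariant: the compaction request reuses the byte-identical prefix (system prompt + tools + messages) of the main conversation, so only the appended summary instruction is uncached. A hedged sketch with hypothetical field names:

```python
# Build a forked request that shares the main conversation's cacheable prefix.
# Only the final appended message differs, so the provider's prompt cache
# can serve everything before it.
def build_fork_request(main_request: dict, extra_user_message: dict) -> dict:
    return {
        "system": main_request["system"],      # byte-identical prefix
        "tools": main_request["tools"],        # byte-identical prefix
        "messages": main_request["messages"] + [extra_user_message],
    }
```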
1.5 Level 3: PTL Retry — The Last Safety Net
If the compaction request itself exceeds the token limit (prompt-too-long), truncation retry kicks in:
MAX_PTL_RETRIES = 3

def compact_with_retry(messages, summary_request):
    for attempt in range(MAX_PTL_RETRIES):
        response = call_api(messages + [summary_request])
        if not is_prompt_too_long(response):
            return response  # Success
        # Group by API round, discard from the oldest
        groups = group_by_api_round(messages)
        token_gap = get_token_gap(response)  # how far over the limit, if the API reports it
        if token_gap is not None:
            # Precise mode: calculate exactly how many groups to drop
            drop_count = calculate_exact_drop(groups, token_gap)
        else:
            # Fallback mode: drop the oldest 20%
            drop_count = max(1, int(len(groups) * 0.2))
        messages = flatten(groups[drop_count:])
    raise Error("Conversation too long, cannot compact")
1.6 State Restoration After Compaction
Compaction isn't just about deleting messages. When old messages are discarded, much implicit context is lost too. The system needs to restore critical state:
+--- 6 Types of State to Restore After Compaction ----+
| |
| 1. Recent files (max 5, budget 50K tokens) |
| Avoid re-reading same files after compact |
| |
| 2. Current Plan file |
| Keep the plan if one is in progress |
| |
| 3. Invoked Skills content (max 5K tokens each) |
| Skill instructions must survive compact |
| |
| 4. Async Agent status |
| Is a background agent done? Where's the result? |
| |
| 5. Deferred tool definitions |
| Re-register tools discovered before compact |
| |
| 6. SessionStart hooks (CLAUDE.md etc.) |
| Project config must be re-injected into context |
| |
+-----------------------------------------------------+
There's also a clever deduplication logic — if file contents are already visible in the retained messages, skip restoration to avoid wasting tokens:
def restore_files_after_compact(read_file_cache, kept_messages):
    # Collect Read result paths already present in the retained messages
    already_visible = collect_read_paths_from(kept_messages)
    recent_files = sorted(read_file_cache, key=lambda f: f.timestamp, reverse=True)
    budget_used = 0
    for file in recent_files[:5]:
        if file.path in already_visible:
            continue  # Skip! Already in the messages
        if budget_used + file.tokens > 50_000:
            break
        restore_as_attachment(file)
        budget_used += file.tokens
Layer 2: Session Memory — The Background Continuous Summarization Engine
2.1 Core Mechanism
Session Memory is one of Claude Code's most innovative designs. It runs an independent sub-agent in the background that continuously distills conversation content into a structured Markdown notes file.
+-- Main Conversation Loop ----------------------------------+
| |
| User query -> Model response -> Tool call -> ... loop |
| | |
| post_sampling_hook fires |
| after each response |
| | |
| v |
| +--------------------------------------------+ |
| | Should Session Memory update? | |
| | (check token threshold & tool call count) | |
| +--------+---------------------------+-------+ |
| Yes | | No |
| v +--- skip |
| +------------------------+ |
| | Fork sub-agent | <-- separate context, |
| | only allowed to Edit | non-blocking |
| | the summary file | |
| +-----------+------------+ |
| v |
| Update session_memory.md |
| |
+------------------------------------------------------------+
2.2 Trigger Conditions: Dual-Threshold Gating
Not every conversation turn triggers a summary update — carefully designed conditions must be met:
def should_extract_memory(messages, state):
    # `state` holds per-session extraction bookkeeping (initialized flag,
    # token count and message index at the last extraction)
    current_tokens = estimate_total_tokens(messages)

    # Threshold 0: conversation just started, too little content
    if not state.initialized:
        if current_tokens < INIT_THRESHOLD:
            return False
        state.initialized = True

    # Threshold 1 (hard): has token growth since the last extraction been sufficient?
    token_threshold_met = (
        current_tokens - state.last_extraction_tokens >= MIN_TOKENS_BETWEEN_UPDATES
    )

    # Threshold 2 (soft): have there been enough tool calls since the last extraction?
    tool_call_threshold_met = (
        count_tool_calls_since(state.last_extraction_msg) >= TOOL_CALLS_BETWEEN_UPDATES
    )

    # Threshold 3 (opportunistic): did the last turn have no tool calls? (natural pause)
    at_natural_break = not has_tool_calls_in_last_turn(messages)

    # Trigger: the token threshold is mandatory, plus either of the other two
    return token_threshold_met and (tool_call_threshold_met or at_natural_break)
Design philosophy: The token threshold is a hard requirement (preventing overly frequent extraction that wastes costs), while tool call count and "natural pause" are soft triggers (extracting during moments when the model pauses to think, without interrupting the workflow).
2.3 Summary File Structure
Session Memory maintains a Markdown file with a fixed template containing 10 sections:
+--------- session_memory.md ----------+
| |
| # Session Title (5-10 words) |
| # Current State (important!) |
| # Task Specification |
| # Files and Functions |
| # Workflow |
| # Errors & Corrections |
| # Codebase and System Docs |
| # Learnings |
| # Key Results |
| # Worklog |
| |
| Per-section limit : ~2000 tokens |
| Total file limit : ~12000 tokens |
+--------------------------------------+
When the file exceeds its budget, the update prompt instructs the sub-agent to proactively trim:
def build_update_prompt(current_notes, notes_path):
    prompt = render_template(TEMPLATE, current_notes, notes_path)

    # Analyze each section's size
    section_sizes = analyze_sections(current_notes)
    total = estimate_tokens(current_notes)

    # Over the total budget → force trimming
    if total > 12_000:
        prompt += f"""
CRITICAL: File is currently ~{total} tokens, exceeding the 12000 limit.
You must significantly trim, prioritizing Current State and Errors."""

    # Individual section over its limit → remind to trim that section
    oversized = [s for s in section_sizes if s.tokens > 2000]
    if oversized:
        prompt += f"\nThe following sections need trimming: {oversized}"
    return prompt
2.4 Update Method: Fork Agent with Permission Isolation
Session Memory updates are executed through a restricted Fork Agent. The key design — only allowed to edit the summary file, nothing else:
def create_memory_file_permission(memory_path):
    """Create a permission function: only allow the Edit tool on the specified summary file."""
    def can_use_tool(tool, input):
        if tool.name == "FileEdit" and input["file_path"] == memory_path:
            return ALLOW
        return DENY  # Deny any other operation
    return can_use_tool

# Launch the Fork Agent
run_forked_agent(
    prompt=build_update_prompt(current_notes, path),
    can_use_tool=create_memory_file_permission(path),  # Strict permissions
    query_source="session_memory",
    # The Fork Agent shares the main conversation's prompt cache, lowering cost
)
2.5 User Customization
Session Memory supports custom templates and prompts, which users can place in:
~/.claude/session-memory/config/
├── template.md ← Custom summary template (replaces the default 10 sections)
└── prompt.md ← Custom update instructions (supports {{currentNotes}} and other variables)
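Resolution of these files presumably falls back to the built-in defaults when they're absent. A minimal sketch of that lookup, with assumed file names taken from the tree above and simple {{currentNotes}}-style substitution (the real variable set is not documented here):

```python
from pathlib import Path

DEFAULT_TEMPLATE = "# Session Title\n# Current State\n# Task Specification\n..."  # built-in 10 sections

def load_session_memory_template(config_dir: Path) -> str:
    # Prefer the user's template.md; otherwise use the built-in default
    custom = config_dir / "template.md"
    if custom.is_file():
        return custom.read_text()
    return DEFAULT_TEMPLATE

def render_prompt(prompt_template: str, current_notes: str) -> str:
    # Naive {{variable}} substitution for illustration
    return prompt_template.replace("{{currentNotes}}", current_notes)
```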
Layer 3: Persistent Memory (memdir) — Cross-Session Long-Term Memory
3.1 Architecture
Persistent memory is a file-system-based knowledge base:
~/.claude/projects/<project>/memory/
│
├── MEMORY.md ← Index file (always loaded into system prompt)
│ Example contents:
│ - [User Preferences](user_preferences.md) — Prefers bun, uses Go
│ - [Project Setup](project_setup.md) — K8s deploy, GH Actions CI
│ - [API Gotchas](api_gotchas.md) — Auth token must be refreshed
│
├── user_preferences.md ← User preferences
├── project_setup.md ← Project config
├── feedback_testing.md ← Feedback records
├── api_gotchas.md ← Reference info
│
└── team/ ← Shared team memory (optional)
├── MEMORY.md
└── coding_standards.md
3.2 Four Memory Types
The system defines a strict taxonomy where each type has clear boundaries:
+-----------+----------------------------+-------------------------------+
| Type | Description | Example |
+-----------+----------------------------+-------------------------------+
| user | Identity, role, prefs | "Backend dev, prefers Go" |
+-----------+----------------------------+-------------------------------+
| feedback | Corrections & feedback | "Use const, not var" |
+-----------+----------------------------+-------------------------------+
| project | Non-code project knowledge | "Deploy: k8s, CI: GH Actions" |
+-----------+----------------------------+-------------------------------+
| reference | Tool / API usage notes | "Internal API auth method" |
+-----------+----------------------------+-------------------------------+
X NOT saved: info derivable from code (architecture, git history)
Each memory file uses YAML frontmatter to annotate metadata:
---
name: user_preferences
description: User's coding style preferences
type: feedback
---
Use bun instead of npm for all package management.
Prefer functional components over class components.
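Parsing this format is straightforward. A minimal frontmatter parser as a sketch — the real implementation may well use a full YAML library, and this one only handles flat string values:

```python
# Split '---'-delimited frontmatter from the markdown body.
def parse_memory_file(text: str) -> tuple[dict, str]:
    if not text.startswith("---\n"):
        return {}, text  # no frontmatter block
    header, _, body = text[4:].partition("\n---\n")
    meta = {}
    for line in header.splitlines():
        key, _, value = line.partition(":")
        if key.strip():
            meta[key.strip()] = value.strip()
    return meta, body.lstrip("\n")
```

The name + description + type triple is exactly what the retrieval step (described next) feeds to the selection model, so keeping descriptions short and specific directly improves recall quality.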
3.3 Index File Constraints
MEMORY.md is always loaded into the system prompt, so it has strict limits to prevent unbounded growth:
MAX_LINES = 200       # Maximum 200 lines
MAX_BYTES = 25_000    # Maximum 25KB

def truncate_memory_index(raw_content):
    lines = raw_content.strip().split("\n")
    truncated_by_lines = False
    truncated_by_bytes = False

    # First truncate by line count
    if len(lines) > MAX_LINES:
        lines = lines[:MAX_LINES]
        truncated_by_lines = True

    # Then truncate by byte count (guards against very long individual lines)
    content = "\n".join(lines)
    if len(content) > MAX_BYTES:
        cut_at = content.rfind("\n", 0, MAX_BYTES)  # Truncate at a line boundary
        content = content[:cut_at]
        truncated_by_bytes = True

    if truncated_by_lines or truncated_by_bytes:
        content += "\n\n> WARNING: MEMORY.md is too large, only partial content loaded."
        content += "\n> Keep index entries under ~200 chars per line, put details in topic files."
    return content
3.4 Intelligent Memory Retrieval
Not all memory files are loaded for every conversation. The system uses a lightweight Sonnet model for real-time retrieval, loading only the most relevant memories on demand:
User sends: "Help me fix auth logic in the payment API"
|
v
+----------------------------------------------+
| Scan memory/ dir for all .md files |
| Read each file's frontmatter |
| (name + description + type) |
| Exclude files already shown this turn |
+----------------------+-----------------------+
|
v
+----------------------------------------------+
| Call Sonnet (lightweight, fast) |
| |
| System: "Select up to 5 memories that |
| will be useful for the current query" |
| |
| User: "Query: fix payment API auth logic |
| Available memories: |
| - api_gotchas.md: API auth caveats [HIT] |
| - user_prefs.md: coding preferences [HIT] |
| - old_deploy.md: old deploy notes [SKIP] |
| Recent tools: FileEdit, Bash" |
| |
| -> Returns: ["api_gotchas.md", |
| "user_prefs.md"] |
+----------------------+-----------------------+
|
v
Load selected memory files into context
Two clever filtering mechanisms:
- Deduplication: If a memory file was already shown in a previous turn, it won't be recommended again, freeing up slots for new candidates
- Tool awareness: If the user recently used a specific tool (e.g., mcp__X__spawn), the system won't recommend that tool's usage docs (since usage examples are already in the conversation), but it still recommends memories containing warnings/gotchas
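Both filters can run cheaply before the Sonnet call ever happens. A hedged sketch of that pre-filtering step; the field names (tool, has_gotchas) are assumptions for illustration:

```python
# Pre-filter memory candidates before asking the selection model:
#   1. dedup  — drop files already shown in an earlier turn
#   2. tool-aware — drop pure usage docs for tools the user just used,
#      but keep files that carry warnings/gotchas
def filter_candidates(memories: list[dict], already_shown: set, recent_tools: set) -> list[dict]:
    out = []
    for m in memories:
        if m["name"] in already_shown:
            continue  # was recommended in a previous turn
        if m.get("tool") in recent_tools and not m.get("has_gotchas"):
            continue  # usage examples are already visible in the conversation
        out.append(m)
    return out
```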
3.5 Background Memory Extraction
Beyond users explicitly saying "remember this," Claude Code also automatically extracts noteworthy content in the background. The extraction agent has strict constraints:
+--- Memory Extraction Agent Rules --------+
| |
| ALLOWED: |
| +- Read / Grep / Glob memory dir files |
| +- Read-only Bash (ls/find/cat/stat) |
| +- Edit / Write files inside memory dir |
| |
| DENIED: |
| +- X Read or modify source code files |
| +- X Write-capable Bash commands |
| +- X MCP, Agent, or other tools |
| +- X Bash rm (no deleting files) |
| |
| EFFICIENCY (limited turn budget): |
| +- Turn 1: all Read calls in parallel |
| +- Turn 2: all Write/Edit in parallel |
| |
| CONTENT CONSTRAINT: |
| +- Extract from conversation only, |
| never investigate source code |
| |
+------------------------------------------+
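These rules translate naturally into a permission function in the same spirit as create_memory_file_permission above. A sketch — the read-only command allowlist and tool names are assumptions inferred from the box:

```python
# Permission check for the memory extraction agent: read tools are free,
# writes are confined to the memory dir, Bash is restricted to a read-only
# allowlist, and everything else (MCP, sub-agents, rm) is denied.
READ_ONLY_BASH = {"ls", "find", "cat", "stat"}

def extraction_agent_can_use(tool_name: str, tool_input: dict, memory_dir: str) -> bool:
    if tool_name in ("Read", "Grep", "Glob"):
        return True
    if tool_name in ("Edit", "Write"):
        return tool_input["file_path"].startswith(memory_dir)
    if tool_name == "Bash":
        command = tool_input["command"].split()
        return bool(command) and command[0] in READ_ONLY_BASH  # rm etc. denied
    return False  # no MCP, no sub-agents, nothing else
```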
How the Three Layers Work Together
Here's a complete timeline showing how the three memory layers collaborate:
Timeline ─────────────────────────────────────────────────────►
▎ SESSION START
▎ ├─ Load MEMORY.md index into system prompt ·········· Persistent Memory
▎ ├─ Sonnet retrieves relevant memory files ··········· Persistent Memory
▎
▎ CONVERSATION IN PROGRESS...
▎ ├─ Each model response -> post_sampling_hook ········ Session Memory
▎ │ (background update)
▎ │
▎ │ [Conversation grows ~5K tokens]
▎ │ └─ Session Memory updates session_memory.md
▎ │
▎ │ [Conversation continues...]
▎ │ └─ Memory Extraction Agent checks for ············ Persistent Memory
▎ │ save-worthy content (preferences, feedback) (background write)
▎ │
▎ │ [Context approaching 93%]
▎ │ ├─ Try SM Compact ································ Context Management
▎ │ │ ├─ Success -> keep recent 10K-40K tokens
▎ │ │ └─ Fail -> Full Compact -> PTL Retry
▎ │ └─ Restore files, Plan, Skills, Agent state
▎ │
▎ │ [Conversation continues... (loop)]
▎
▎ SESSION END
▎ ├─ session_memory.md preserved (resumable) ·········· Session Memory
▎ └─ memory/*.md persisted permanently ················ Persistent Memory
▎
▎ NEXT SESSION START
▎ ├─ Reload MEMORY.md
▎ ├─ Retrieve relevant memory files
▎ └─ If resuming -> read session_memory.md as context
Engineering Highlights
Prompt Cache Protection
Compaction is the natural enemy of prompt cache — all messages change, so the cache invalidates. The system's counterstrategy:
Before compaction:
Main prompt cache: [system_prompt | tools | msg1...msgN]
<---- cache hit ---->
After compaction:
New messages: [system_prompt | tools | summary | recent_msgs]
<-- still hit! -->
(system + tools unchanged)
+ Notify cache monitor: "compaction just happened,
cache hit-rate drop is expected, suppress alerts"
The Fork Agent design also maximizes cache reuse — the compaction agent sends requests with a prefix identical to the main conversation (system prompt + tools + message prefix), thereby hitting the same cache.
Image and Media Handling
Before compaction, images and document attachments are stripped (they're useless for generating text summaries but consume significant tokens):
def strip_images(messages):
    """Replace images and documents with text placeholders before compaction."""
    for msg in messages:
        msg.content = [
            text_block("[image]") if block.type == "image"
            else text_block("[document]") if block.type == "document"
            else block
            for block in msg.content
        ]
    return messages
Session Metadata Persistence
An easily overlooked detail: session metadata (custom titles, tags) must be re-appended after compaction. The --resume feature reads the last 16KB of the log file to retrieve session info; if compaction generates a large number of new messages, the metadata gets pushed out of this window, causing auto-generated titles to display instead of user-set names.
Summary
Claude Code's memory management system is a carefully designed three-layer architecture:
| Layer | Mechanism | Lifecycle | Cost |
|---|---|---|---|
| Context Window | Three-level compaction (SM → Full → PTL) | Single conversation | SM: zero, Full: 1 API call |
| Session Memory | Background Fork Agent continuous summary | Single session | ~1 lightweight API call per ~5K tokens |
| Persistent Memory | File system + Sonnet intelligent retrieval | Permanent | Retrieval: 1 Sonnet call |
The core design philosophy: Rather than pursuing larger context windows, achieve "remembers well, forgets right, retrieves accurately" within a limited window.
- Remembers well: Session Memory continuously distills, never missing key information
- Forgets right: Three-level compaction retains the most important content, discards redundant tool call details
- Retrieves accurately: memdir organized by type + Sonnet intelligent retrieval for precise recall
This system enables Claude Code to achieve an "effective memory capacity" far exceeding 200K within its 200K context window.
All code blocks in this article are Python pseudocode or diagrams inferred from actual behavior and publicly available information, intended to aid understanding of Claude Code's internal mechanisms.