A Deep Dive into Claude Code's Memory Management

Developers who've used Claude Code probably share this experience: even in an ultra-long conversation where dozens of files have been modified, it always seems to "remember" what it did before. Even more remarkably, if you told it "I prefer bun over npm" in a previous session, it automatically follows that preference the next time.

Behind this is a sophisticated memory management system. Let's tear apart Claude Code's memory mechanism layer by layer.

Global Architecture: A Three-Layer Memory System

Claude Code's memory management can be compared to the human memory system:

+----------------------------------------------------------+
|          Persistent Memory (Long-term Memory)            |
|  memdir: MEMORY.md index + topic files                   |
|  Persisted across sessions                               |
+----------------------------------------------------------+
|          Session Memory (Working Memory)                 |
|  Background sub-agent maintains Markdown summary         |
|  Valid within a single session                           |
+----------------------------------------------------------+
|          Context Window (Short-term Memory)              |
|  Raw messages + tool call results in current conversation|
|  Immediately available                                   |
+----------------------------------------------------------+

Let's examine each layer.


Layer 1: Context Window Management — A Three-Level Compaction Strategy

1.1 When Does Compaction Trigger?

The auto-compaction trigger logic, expressed in pseudocode:

BUFFER_TOKENS = 13_000  # Reserved buffer

def get_auto_compact_threshold(model):
    """Auto-compact threshold ≈ 93% of the effective window"""
    effective_window = get_context_window(model) - MAX_OUTPUT_TOKENS
    return effective_window - BUFFER_TOKENS

def should_auto_compact(messages, model):
    token_count = estimate_token_count(messages)
    threshold = get_auto_compact_threshold(model)
    return token_count >= threshold

Using a 200K context window as an example, compaction kicks in at roughly 93% utilization.
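To make the arithmetic concrete, here is the same calculation with assumed numbers. The 32K output reservation is an illustration only, not a documented value; the exact trigger point shifts with whatever reservation the real implementation uses.

```python
BUFFER_TOKENS = 13_000
MAX_OUTPUT_TOKENS = 32_000   # assumed output reservation, for illustration only
CONTEXT_WINDOW = 200_000

effective_window = CONTEXT_WINDOW - MAX_OUTPUT_TOKENS   # 168_000
threshold = effective_window - BUFFER_TOKENS            # 155_000
utilization = threshold / effective_window              # ~0.92 under these assumptions
```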

There's also a circuit breaker — after 3 consecutive failures, it stops retrying to avoid burning API calls in an endless retry loop:

MAX_FAILURES = 3

def auto_compact_if_needed(messages, model, tracking):
    # Circuit breaker: too many consecutive failures, give up
    if tracking.consecutive_failures >= MAX_FAILURES:
        return NOT_COMPACTED

    if not should_auto_compact(messages, model):
        return NOT_COMPACTED

    # ... execute compaction
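The failure tracking behind the breaker can be sketched as a small runnable example (the `tracking` object's shape is an assumption):

```python
from dataclasses import dataclass

MAX_FAILURES = 3

@dataclass
class CompactionTracking:
    consecutive_failures: int = 0

def record_compaction_result(tracking, success):
    """Reset the counter on success; otherwise count toward the breaker."""
    tracking.consecutive_failures = 0 if success else tracking.consecutive_failures + 1

def breaker_open(tracking):
    return tracking.consecutive_failures >= MAX_FAILURES

t = CompactionTracking()
for _ in range(3):
    record_compaction_result(t, success=False)
breaker_open(t)   # → True: further auto-compaction attempts are skipped
```

One successful compaction resets the counter, so the breaker only trips on a genuine streak of failures.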

1.2 Three-Level Compaction Execution Order

When compaction triggers, the system tries three strategies in priority order:

  +--------------------------+
  |  Context reaching ~93%   |
  +------------+-------------+
               |
               v
+---------------------------------+
|  Level 1: SM Compact            |
|  Replace old msgs with          |  <-- Zero API calls
|  pre-built summary              |
+------+----------------+---------+
  OK   |                | Fail / N/A
       v                v
   return   +---------------------------------+
            |  Level 2: Full Compact          |
            |  Call API to generate           |  <-- 1 API call
            |  structured summary             |
            +------+----------------+---------+
              OK   |                | prompt-too-long
                   v                v
               return   +---------------------------------+
                        |  Level 3: PTL Retry             |
                        |  Truncate oldest messages,      |  <-- Max 3 retries
                        |  retry compaction               |
                        +---------------------------------+

The overall flow in pseudocode:

def auto_compact_if_needed(messages, context):
    # === Try Session Memory compaction first (zero cost) ===
    result = try_session_memory_compaction(messages)
    if result:
        return result  # Success! No API calls made

    # === Fall back to traditional API compaction ===
    result = full_compact_conversation(messages, context)
    return result

1.3 Level 1: Session Memory Compact — Zero-Cost Compaction

This is the most elegant compaction strategy. When a Session Memory file is available, it directly replaces old conversation history without any additional API calls.

The message retention policy is controlled by three parameters:

SM_COMPACT_CONFIG = {
    "min_tokens": 10_000,           # Keep at least 10K tokens of raw messages
    "min_text_messages": 5,          # Keep at least 5 text-containing messages
    "max_tokens": 40_000,           # Keep at most 40K tokens (hard cap)
}

Before and after comparison:

Before (context window almost full):
+-----------------------------------------------------+
| msg1 | msg2 | ... | msg50 | msg51 | ... | msg80     |
|<-- covered by SM summary -->|<-- recent messages -->|
+-----------------------------------------------------+

                       | SM Compact
                       v

After (lots of free space):
+-----------------------------------------------------+
| [boundary] | SM summary  | msg51 | ... | msg80      |
|            | (~12K toks) |<-- kept raw messages --> |
|            |             |   (10K ~ 40K tokens)     |
| <----- used -----------> |<---- free space -------->|
+-----------------------------------------------------+

The core algorithm for deciding which messages to keep:

def calculate_messages_to_keep(messages, last_summarized_idx):
    """Start just after the last summarized message; expand backward until the retention conditions are met"""
    config = SM_COMPACT_CONFIG
    start = last_summarized_idx + 1
    total_tokens = 0
    text_msg_count = 0

    # First count what's already after start
    for msg in messages[start:]:
        total_tokens += estimate_tokens(msg)
        if has_text_content(msg):
            text_msg_count += 1

    # If already at hard cap or both minimums satisfied, return
    if total_tokens >= config["max_tokens"]:
        return start
    if total_tokens >= config["min_tokens"] and \
       text_msg_count >= config["min_text_messages"]:
        return start

    # Otherwise expand backward, pulling in more messages
    for i in range(start - 1, -1, -1):
        total_tokens += estimate_tokens(messages[i])
        if has_text_content(messages[i]):
            text_msg_count += 1
        start = i

        if total_tokens >= config["max_tokens"]:
            break
        if total_tokens >= config["min_tokens"] and \
           text_msg_count >= config["min_text_messages"]:
            break

    # Finally: protect tool_use/tool_result pairs from being split
    return adjust_for_tool_pairs(messages, start)
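The retention logic can be exercised end-to-end with toy data. This is a runnable rendering of the pseudocode above, with a crude one-token-per-word estimate (an assumption) standing in for the real tokenizer:

```python
CONFIG = {"min_tokens": 10_000, "min_text_messages": 5, "max_tokens": 40_000}

def estimate_tokens(msg):
    # Crude proxy: one token per whitespace-separated word
    return len(msg["text"].split())

def calculate_messages_to_keep(messages, last_summarized_idx, config=CONFIG):
    start = last_summarized_idx + 1
    total = sum(estimate_tokens(m) for m in messages[start:])
    texts = sum(1 for m in messages[start:] if m["text"])

    # Already at the hard cap, or both minimums satisfied → keep as-is
    if total >= config["max_tokens"]:
        return start
    if total >= config["min_tokens"] and texts >= config["min_text_messages"]:
        return start

    # Expand backward, pulling in more messages
    for i in range(start - 1, -1, -1):
        total += estimate_tokens(messages[i])
        if messages[i]["text"]:
            texts += 1
        start = i
        if total >= config["max_tokens"]:
            break
        if total >= config["min_tokens"] and texts >= config["min_text_messages"]:
            break
    return start

msgs = [{"text": "word " * 3000} for _ in range(10)]   # ten ~3K-token messages
keep_from = calculate_messages_to_keep(msgs, last_summarized_idx=7)
```

With a summary covering the first eight messages, only ~6K tokens sit after the boundary, so the window expands backward to index 5, where both minimums (10K tokens, 5 text messages) finally hold.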

There's an elegant detail here — protecting tool_use/tool_result pairs from being split apart:

def adjust_for_tool_pairs(messages, start_index):
    """Ensure every tool_result in the retained range has its matching tool_use"""
    # Collect all tool_result IDs in the retained range
    needed_tool_use_ids = set()
    for msg in messages[start_index:]:
        for block in msg.content:
            if block.type == "tool_result":
                needed_tool_use_ids.add(block.tool_use_id)

    # Search backward, pulling matching tool_use messages into retained range
    for i in range(start_index - 1, -1, -1):
        if has_matching_tool_use(messages[i], needed_tool_use_ids):
            start_index = i  # Expand retained range

    return start_index

Why is this necessary? Because the Claude API requires every tool_result to have a corresponding tool_use, or it throws an error. If compaction happens to "cut" in the middle of a tool call pair, an "orphan tool_result" would cause subsequent API calls to fail.
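A toy illustration of the orphan problem (the minimal block shapes here are assumptions): cutting between a pair leaves a `tool_result` whose `tool_use` is gone, and the next API call would be rejected.

```python
def find_orphan_results(messages, start):
    """Return tool_use_ids of tool_results kept without their matching tool_use."""
    seen_uses = {b["id"] for m in messages[start:]
                 for b in m["blocks"] if b["type"] == "tool_use"}
    return [b["tool_use_id"] for m in messages[start:]
            for b in m["blocks"]
            if b["type"] == "tool_result" and b["tool_use_id"] not in seen_uses]

msgs = [
    {"blocks": [{"type": "tool_use", "id": "t1"}]},
    {"blocks": [{"type": "tool_result", "tool_use_id": "t1"}]},
]
find_orphan_results(msgs, start=1)   # → ["t1"]: the cut split the pair
find_orphan_results(msgs, start=0)   # → []: pair intact
```

`adjust_for_tool_pairs` works in the opposite direction: instead of detecting orphans after the fact, it moves the cut point so none can exist.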

1.4 Level 2: Full Compact — Traditional API Compaction

When Session Memory isn't available, it calls the Claude API to generate a summary of the entire conversation.

The compaction prompt requires a structured summary with 9 sections:

+------- Full Compact Summary Structure -------+
|                                              |
|  1. Primary Request and Intent               |
|  2. Key Technical Concepts                   |
|  3. Files and Code Sections                  |
|  4. Errors and Fixes                         |
|  5. Problem Solving                          |
|  6. All User Messages  <-- verbatim!         |
|  7. Pending Tasks                            |
|  8. Current Work                             |
|  9. Optional Next Step                       |
|                                              |
+----------------------------------------------+
        ^                          ^
        |                          |
  <analysis> scratchpad     #6 preserves user's
  think-then-summarize      exact words to prevent
  stripped before storing   intent drift

Three noteworthy design choices:

Design 1: <analysis> scratchpad — The model first freely analyzes within <analysis> tags, then outputs the final summary within <summary> tags. The scratchpad is stripped by the program and never enters the context, but it significantly improves summary quality.
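A minimal sketch of how the stripping might work — the `<analysis>`/`<summary>` tag names come from the text above, but the regex-based parsing is an assumption, not the actual implementation:

```python
import re

def extract_summary(model_output):
    # Drop everything inside <analysis>…</analysis> (the scratchpad)
    without_analysis = re.sub(r"<analysis>.*?</analysis>", "",
                              model_output, flags=re.DOTALL)
    # Keep only the <summary> body if present, else the remaining text
    m = re.search(r"<summary>(.*?)</summary>", without_analysis, flags=re.DOTALL)
    return (m.group(1) if m else without_analysis).strip()

extract_summary("<analysis>thinking...</analysis><summary>1. Primary Request...</summary>")
# → "1. Primary Request..."
```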

Design 2: Tool use prohibited — The compaction agent is strictly limited to text output only. Since it has a budget of just 1 turn, attempting a tool call would waste that turn entirely.

Design 3: Prompt cache sharing — The compaction agent launches via a "Fork" mechanism, sharing the same system prompt, tool list, and message prefix with the main conversation, thereby reusing the prompt cache and significantly reducing costs.

1.5 Level 3: PTL Retry — The Last Safety Net

If the compaction request itself exceeds the token limit (prompt-too-long), truncation retry kicks in:

MAX_PTL_RETRIES = 3

def compact_with_retry(messages, summary_request):
    for attempt in range(MAX_PTL_RETRIES):
        response = call_api(messages + [summary_request])

        if not is_prompt_too_long(response):
            return response  # Success

        # Group by API round, discard from oldest
        groups = group_by_api_round(messages)

        if token_gap is not None:
            # Precise mode: calculate exactly how many groups to drop
            drop_count = calculate_exact_drop(groups, token_gap)
        else:
            # Fallback mode: drop oldest 20%
            drop_count = max(1, len(groups) * 0.2)

        messages = flatten(groups[drop_count:])

    raise Error("Conversation too long, cannot compact")
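The grouping helper can be sketched as follows, under the assumption that each non-tool-result user message opens a new API round:

```python
def group_by_api_round(messages):
    """Split messages into [user turn + following responses/tool traffic] groups."""
    groups = []
    for msg in messages:
        if msg["role"] == "user" and not msg.get("is_tool_result"):
            groups.append([])        # a real user message starts a new round
        if not groups:               # defensive: history starting mid-round
            groups.append([])
        groups[-1].append(msg)
    return groups

history = [
    {"role": "user", "text": "fix the bug"},
    {"role": "assistant", "text": "reading file..."},
    {"role": "user", "is_tool_result": True},
    {"role": "assistant", "text": "done"},
    {"role": "user", "text": "now add tests"},
]
[len(g) for g in group_by_api_round(history)]   # → [4, 1]
```

Dropping whole groups, rather than individual messages, keeps tool pairs and request/response structure intact during truncation.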

1.6 State Restoration After Compaction

Compaction isn't just about deleting messages. When old messages are discarded, much implicit context is lost too. The system needs to restore critical state:

+--- 6 Types of State to Restore After Compaction ----+
|                                                     |
|  1. Recent files (max 5, budget 50K tokens)         |
|     Avoid re-reading same files after compact       |
|                                                     |
|  2. Current Plan file                               |
|     Keep the plan if one is in progress             |
|                                                     |
|  3. Invoked Skills content (max 5K tokens each)     |
|     Skill instructions must survive compact         |
|                                                     |
|  4. Async Agent status                              |
|     Is a background agent done? Where's the result? |
|                                                     |
|  5. Deferred tool definitions                       |
|     Re-register tools discovered before compact     |
|                                                     |
|  6. SessionStart hooks (CLAUDE.md etc.)             |
|     Project config must be re-injected into context |
|                                                     |
+-----------------------------------------------------+

There's also a clever deduplication logic — if file contents are already visible in the retained messages, skip restoration to avoid wasting tokens:

def restore_files_after_compact(read_file_cache, kept_messages):
    # Collect Read result paths already in retained messages
    already_visible = collect_read_paths_from(kept_messages)

    recent_files = sorted(read_file_cache, key=lambda f: f.timestamp, reverse=True)

    budget_used = 0
    for file in recent_files[:5]:
        if file.path in already_visible:
            continue  # Skip! Already in the messages
        if budget_used + file.tokens > 50_000:
            break
        restore_as_attachment(file)
        budget_used += file.tokens
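A toy run of the dedup-and-budget rule (the cached-file record shape is an assumption):

```python
from dataclasses import dataclass

@dataclass
class CachedFile:
    path: str
    tokens: int
    timestamp: int

cache = [
    CachedFile("a.py", 20_000, 3),
    CachedFile("b.py", 40_000, 2),
    CachedFile("c.py", 5_000, 1),
]
already_visible = {"a.py"}   # a.py survives in the kept messages → skip it

restored, budget = [], 0
for f in sorted(cache, key=lambda f: f.timestamp, reverse=True)[:5]:
    if f.path in already_visible:
        continue                       # already in the retained messages
    if budget + f.tokens > 50_000:
        break                          # 50K-token restoration budget
    restored.append(f.path)
    budget += f.tokens
# restored == ["b.py", "c.py"], budget == 45_000
```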

Layer 2: Session Memory — The Background Continuous Summarization Engine

2.1 Core Mechanism

Session Memory is one of Claude Code's most innovative designs. It runs an independent sub-agent in the background that continuously distills conversation content into a structured Markdown notes file.

+-- Main Conversation Loop ----------------------------------+
|                                                            |
|  User query -> Model response -> Tool call -> ... loop     |
|                     |                                      |
|           post_sampling_hook fires                         |
|           after each response                              |
|                     |                                      |
|                     v                                      |
|  +--------------------------------------------+            |
|  | Should Session Memory update?              |            |
|  | (check token threshold & tool call count)  |            |
|  +--------+---------------------------+-------+            |
|      Yes  |                           | No                 |
|           v                           +--- skip            |
|  +------------------------+                                |
|  | Fork sub-agent         | <-- separate context,          |
|  | only allowed to Edit   |     non-blocking               |
|  | the summary file       |                                |
|  +-----------+------------+                                |
|              v                                             |
|  Update session_memory.md                                  |
|                                                            |
+------------------------------------------------------------+

2.2 Trigger Conditions: Dual-Threshold Gating

Not every conversation turn triggers a summary update — carefully designed conditions must be met:

def should_extract_memory(messages):
    current_tokens = estimate_total_tokens(messages)

    # Threshold 0: conversation just started, too little content
    if not initialized:
        if current_tokens < INIT_THRESHOLD:
            return False
        initialized = True

    # Threshold 1 (hard): has token growth since last extraction been sufficient?
    token_threshold_met = (
        current_tokens - last_extraction_tokens >= MIN_TOKENS_BETWEEN_UPDATES
    )

    # Threshold 2 (soft): have there been enough tool calls since last extraction?
    tool_call_threshold_met = (
        count_tool_calls_since(last_extraction_msg) >= TOOL_CALLS_BETWEEN_UPDATES
    )

    # Threshold 3 (opportunistic): did the last turn have no tool calls? (natural pause)
    at_natural_break = not has_tool_calls_in_last_turn(messages)

    # Trigger: token threshold is mandatory, plus either of the other two
    return token_threshold_met and (tool_call_threshold_met or at_natural_break)

Design philosophy: The token threshold is a hard requirement (preventing overly frequent extraction that wastes costs), while tool call count and "natural pause" are soft triggers (extracting during moments when the model pauses to think, without interrupting the workflow).
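The gate can be made runnable with a small state object; the threshold values here are placeholders, not the real constants:

```python
from dataclasses import dataclass

INIT_THRESHOLD = 5_000               # assumed values, for illustration
MIN_TOKENS_BETWEEN_UPDATES = 5_000
TOOL_CALLS_BETWEEN_UPDATES = 10

@dataclass
class ExtractionState:
    initialized: bool = False
    last_extraction_tokens: int = 0

def should_extract(state, current_tokens, tool_calls_since, last_turn_had_tools):
    # Threshold 0: conversation just started, too little content
    if not state.initialized:
        if current_tokens < INIT_THRESHOLD:
            return False
        state.initialized = True
    # Hard gate: enough token growth since last extraction
    token_ok = current_tokens - state.last_extraction_tokens >= MIN_TOKENS_BETWEEN_UPDATES
    # Soft gates: enough tool calls, or a natural pause
    tools_ok = tool_calls_since >= TOOL_CALLS_BETWEEN_UPDATES
    at_break = not last_turn_had_tools
    return token_ok and (tools_ok or at_break)
```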

2.3 Summary File Structure

Session Memory maintains a Markdown file with a fixed template containing 10 sections:

+--------- session_memory.md ----------+
|                                      |
|  # Session Title      (5-10 words)   |
|  # Current State      (important!)   |
|  # Task Specification                |
|  # Files and Functions               |
|  # Workflow                          |
|  # Errors & Corrections              |
|  # Codebase and System Docs          |
|  # Learnings                         |
|  # Key Results                       |
|  # Worklog                           |
|                                      |
|  Per-section limit : ~2000 tokens    |
|  Total file limit  : ~12000 tokens   |
+--------------------------------------+

When the file exceeds its budget, the update prompt instructs the sub-agent to proactively trim:

def build_update_prompt(current_notes, notes_path):
    prompt = render_template(TEMPLATE, current_notes, notes_path)

    # Analyze each section's size
    section_sizes = analyze_sections(current_notes)
    total = estimate_tokens(current_notes)

    # Over total budget → force trimming
    if total > 12_000:
        prompt += f"""
        CRITICAL: File is currently ~{total} tokens, exceeding the 12000 limit.
        You must significantly trim, prioritizing Current State and Errors."""

    # Individual section over limit → remind to trim that section
    oversized = [s for s in section_sizes if s.tokens > 2000]
    if oversized:
        prompt += f"\nThe following sections need trimming: {oversized}"

    return prompt
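One way to implement the section-size analysis — the heading-splitting approach and the word-count token proxy are both assumptions:

```python
def analyze_sections(markdown):
    """Split a notes file on '# ' headings; return section name → rough token count."""
    sections, current, name = {}, [], None
    for line in markdown.splitlines():
        if line.startswith("# "):
            if name:
                sections[name] = len(" ".join(current).split())
            name, current = line[2:], []
        else:
            current.append(line)
    if name:
        sections[name] = len(" ".join(current).split())
    return sections
```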

2.4 Update Method: Fork Agent with Permission Isolation

Session Memory updates are executed through a restricted Fork Agent. The key design — only allowed to edit the summary file, nothing else:

def create_memory_file_permission(memory_path):
    """Create permission function: only allow Edit tool on the specified summary file"""
    def can_use_tool(tool, input):
        if tool.name == "FileEdit" and input["file_path"] == memory_path:
            return ALLOW
        return DENY  # Deny any other operation

    return can_use_tool

# Launch Fork Agent
run_forked_agent(
    prompt       = build_update_prompt(current_notes, path),
    can_use_tool = create_memory_file_permission(path),  # Strict permissions
    query_source = "session_memory",
    # Fork Agent shares main conversation's prompt cache, lower cost
)
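A toy demonstration of the allow-list behavior (the string-based tool names and return values are simplifications of the pseudocode above):

```python
ALLOW, DENY = "allow", "deny"

def create_memory_file_permission(memory_path):
    """Permission closure: only FileEdit on the exact summary path is allowed."""
    def can_use_tool(tool_name, tool_input):
        if tool_name == "FileEdit" and tool_input.get("file_path") == memory_path:
            return ALLOW
        return DENY
    return can_use_tool

check = create_memory_file_permission("/tmp/session_memory.md")
check("FileEdit", {"file_path": "/tmp/session_memory.md"})  # → "allow"
check("FileEdit", {"file_path": "/etc/passwd"})             # → "deny"
check("Bash", {"command": "ls"})                            # → "deny"
```

The default-deny closure means new tools are blocked automatically; nothing needs to be enumerated except the single allowed operation.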

2.5 User Customization

Session Memory supports custom templates and prompts, which users can place in:

~/.claude/session-memory/config/
├── template.md    ← Custom summary template (replaces the default 10 sections)
└── prompt.md      ← Custom update instructions (supports {{currentNotes}} and other variables)

Layer 3: Persistent Memory (memdir) — Cross-Session Long-Term Memory

3.1 Architecture

Persistent memory is a file-system-based knowledge base:

~/.claude/projects/<project>/memory/
│
├── MEMORY.md                ← Index file (always loaded into system prompt)
│   Example contents:
│   - [User Preferences](user_preferences.md) — Prefers bun, uses Go
│   - [Project Setup](project_setup.md) — K8s deploy, GH Actions CI
│   - [API Gotchas](api_gotchas.md) — Auth token must be refreshed
│
├── user_preferences.md      ← User preferences
├── project_setup.md         ← Project config
├── feedback_testing.md      ← Feedback records
├── api_gotchas.md           ← Reference info
│
└── team/                    ← Shared team memory (optional)
    ├── MEMORY.md
    └── coding_standards.md

3.2 Four Memory Types

The system defines a strict taxonomy where each type has clear boundaries:

+-----------+----------------------------+-------------------------------+
|   Type    |       Description          |         Example               |
+-----------+----------------------------+-------------------------------+
|  user     | Identity, role, prefs      | "Backend dev, prefers Go"     |
+-----------+----------------------------+-------------------------------+
| feedback  | Corrections & feedback     | "Use const, not var"          |
+-----------+----------------------------+-------------------------------+
| project   | Non-code project knowledge | "Deploy: k8s, CI: GH Actions" |
+-----------+----------------------------+-------------------------------+
| reference | Tool / API usage notes     | "Internal API auth method"    |
+-----------+----------------------------+-------------------------------+

     X  NOT saved: info derivable from code (architecture, git history)

Each memory file uses YAML frontmatter to annotate metadata:

---
name: user_preferences
description: User's coding style preferences
type: user
---
Use bun instead of npm for all package management.
Prefer functional components over class components.
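The metadata can be read without a YAML dependency; this minimal frontmatter reader is an assumed sketch of what the retrieval step needs (name, description, type):

```python
def read_frontmatter(text):
    """Split '---'-delimited frontmatter from the body; parse 'key: value' lines."""
    if not text.startswith("---"):
        return {}, text
    header, _, body = text[3:].partition("\n---")
    meta = {}
    for line in header.strip().splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta, body.lstrip("\n")
```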

3.3 Index File Constraints

MEMORY.md is always loaded into the system prompt, so it has strict limits to prevent unbounded growth:

MAX_LINES = 200       # Maximum 200 lines
MAX_BYTES = 25_000    # Maximum 25KB

def truncate_memory_index(raw_content):
    lines = raw_content.strip().split("\n")
    truncated_by_lines = False
    truncated_by_bytes = False

    # First truncate by line count
    if len(lines) > MAX_LINES:
        lines = lines[:MAX_LINES]
        truncated_by_lines = True

    # Then truncate by byte count (guards against very long individual lines)
    content = "\n".join(lines)
    if len(content) > MAX_BYTES:
        cut_at = content.rfind("\n", 0, MAX_BYTES)  # Truncate at line boundary
        content = content[:cut_at]
        truncated_by_bytes = True

    if truncated_by_lines or truncated_by_bytes:
        content += "\n\n> WARNING: MEMORY.md is too large, only partial content loaded."
        content += "\n> Keep index entries under ~200 chars per line, put details in topic files."

    return content

3.4 Intelligent Memory Retrieval

Not all memory files are loaded for every conversation. The system uses a lightweight Sonnet model for real-time retrieval, loading only the most relevant memories on demand:

User sends: "Help me fix auth logic in the payment API"
    |
    v
+----------------------------------------------+
| Scan memory/ dir for all .md files           |
| Read each file's frontmatter                 |
| (name + description + type)                  |
| Exclude files already shown this turn        |
+----------------------+-----------------------+
                       |
                       v
+----------------------------------------------+
| Call Sonnet (lightweight, fast)              |
|                                              |
| System: "Select up to 5 memories that        |
|  will be useful for the current query"       |
|                                              |
| User: "Query: fix payment API auth logic     |
|  Available memories:                         |
|  - api_gotchas.md: API auth caveats    [HIT] |
|  - user_prefs.md: coding preferences   [HIT] |
|  - old_deploy.md: old deploy notes    [SKIP] |
|  Recent tools: FileEdit, Bash"               |
|                                              |
| -> Returns: ["api_gotchas.md",               |
|              "user_prefs.md"]                |
+----------------------+-----------------------+
                       |
                       v
         Load selected memory files into context

Two clever filtering mechanisms:

  • Deduplication: If a memory file was already shown in a previous turn, it won't be recommended again, freeing up slots for new candidates
  • Tool awareness: If the user recently used a specific tool (e.g., mcp__X__spawn), the system won't recommend that tool's usage docs (since usage examples are already in the conversation), but it still recommends memories containing warnings/gotchas

3.5 Background Memory Extraction

Beyond users explicitly saying "remember this," Claude Code also automatically extracts noteworthy content in the background. The extraction agent has strict constraints:

+--- Memory Extraction Agent Rules --------+
|                                          |
|  ALLOWED:                                |
|  +- Read / Grep / Glob memory dir files  |
|  +- Read-only Bash (ls/find/cat/stat)    |
|  +- Edit / Write files inside memory dir |
|                                          |
|  DENIED:                                 |
|  +- X  Read or modify source code files  |
|  +- X  Write-capable Bash commands       |
|  +- X  MCP, Agent, or other tools        |
|  +- X  Bash rm (no deleting files)       |
|                                          |
|  EFFICIENCY (limited turn budget):       |
|  +- Turn 1: all Read calls in parallel   |
|  +- Turn 2: all Write/Edit in parallel   |
|                                          |
|  CONTENT CONSTRAINT:                     |
|  +- Extract from conversation only,      |
|     never investigate source code        |
|                                          |
+------------------------------------------+

How the Three Layers Work Together

Here's a complete timeline showing how the three memory layers collaborate:

Timeline ─────────────────────────────────────────────────────►

▎ SESSION START
▎ ├─ Load MEMORY.md index into system prompt ··········  Persistent Memory
▎ ├─ Sonnet retrieves relevant memory files ···········  Persistent Memory
▎
▎ CONVERSATION IN PROGRESS...
▎ ├─ Each model response -> post_sampling_hook ········  Session Memory
▎ │                                                     (background update)
▎ │
▎ │  [Conversation grows ~5K tokens]
▎ │  └─ Session Memory updates session_memory.md
▎ │
▎ │  [Conversation continues...]
▎ │  └─ Memory Extraction Agent checks for ············  Persistent Memory
▎ │     save-worthy content (preferences, feedback)      (background write)
▎ │
▎ │  [Context approaching 93%]
▎ │  ├─ Try SM Compact ································  Context Management
▎ │  │   ├─ Success -> keep recent 10K-40K tokens
▎ │  │   └─ Fail -> Full Compact -> PTL Retry
▎ │  └─ Restore files, Plan, Skills, Agent state
▎ │
▎ │  [Conversation continues... (loop)]
▎
▎ SESSION END
▎ ├─ session_memory.md preserved (resumable) ··········  Session Memory
▎ └─ memory/*.md persisted permanently ················  Persistent Memory
▎
▎ NEXT SESSION START
▎ ├─ Reload MEMORY.md
▎ ├─ Retrieve relevant memory files
▎ └─ If resuming -> read session_memory.md as context

Engineering Highlights

Prompt Cache Protection

Compaction is the natural enemy of prompt cache — all messages change, so the cache invalidates. The system's counterstrategy:

Before compaction:
  Main prompt cache: [system_prompt | tools | msg1...msgN]
                     <---- cache hit ---->

After compaction:
  New messages: [system_prompt | tools | summary | recent_msgs]
                <-- still hit! -->
                (system + tools unchanged)

  + Notify cache monitor: "compaction just happened,
    cache hit-rate drop is expected, suppress alerts"

The Fork Agent design also maximizes cache reuse — the compaction agent sends requests with a prefix identical to the main conversation (system prompt + tools + message prefix), thereby hitting the same cache.

Image and Media Handling

Before compaction, images and document attachments are stripped (they're useless for generating text summaries but consume significant tokens):

def strip_images(messages):
    """Images and documents → text placeholders"""
    for msg in messages:
        for block in msg.content:
            if block.type == "image":
                replace_with(block, "[image]")
            elif block.type == "document":
                replace_with(block, "[document]")
    return messages

Session Metadata Persistence

An easily overlooked detail: session metadata (custom titles, tags) must be re-appended after compaction. The --resume feature retrieves session info by reading the last 16KB of the log file; if compaction generates a large batch of new messages, the metadata gets pushed out of that window, and auto-generated titles display instead of user-set names.


Summary

Claude Code's memory management system is a carefully designed three-layer architecture:

+-------------------+-------------------------------------------+---------------------+----------------------------------------+
|       Layer       |                 Mechanism                 |      Lifecycle      |                  Cost                  |
+-------------------+-------------------------------------------+---------------------+----------------------------------------+
| Context Window    | Three-level compaction (SM → Full → PTL)  | Single conversation | SM: zero, Full: 1 API call             |
+-------------------+-------------------------------------------+---------------------+----------------------------------------+
| Session Memory    | Background Fork Agent continuous summary  | Single session      | ~1 lightweight API call per ~5K tokens |
+-------------------+-------------------------------------------+---------------------+----------------------------------------+
| Persistent Memory | File system + Sonnet intelligent retrieval| Permanent           | Retrieval: 1 Sonnet call               |
+-------------------+-------------------------------------------+---------------------+----------------------------------------+

The core design philosophy: Rather than pursuing larger context windows, achieve "remembers well, forgets right, retrieves accurately" within a limited window.

  • Remembers well: Session Memory continuously distills, never missing key information
  • Forgets right: Three-level compaction retains the most important content, discards redundant tool call details
  • Retrieves accurately: memdir organized by type + Sonnet intelligent retrieval for precise recall

This system enables Claude Code to achieve an "effective memory capacity" far exceeding 200K within its 200K context window.


All code blocks in this article are Python pseudocode or diagrams inferred from actual behavior and publicly available information, intended to aid understanding of Claude Code's internal mechanisms.