Inside Claude Code's YOLO Classifier

Claude Code has an auto mode that executes operations without confirmation. But "auto" doesn't mean "unreviewed" — there's a classifier watching every action.

The Auto Mode Paradox

One of the most annoying things about using Claude Code is the permission popups. Every Bash command and every file write requires a confirmation click. Power users turn on auto mode, letting Claude execute everything autonomously without asking.

This creates an obvious problem: what if the model decides to rm -rf /, push to the production branch, or write a backdoor into .bashrc?

Claude Code's solution isn't "trust the model" — it's "use another AI to review this AI." The source module is called YOLO Classifier. YOLO stands for You Only Live Once, a self-deprecating name for a "just do it" mode. But inside this "just do it" mode, every step is actually being questioned.

Why Not a Rule Engine

The first instinct might be: just write a blocklist. Block rm -rf, block git push --force, block curl | bash.

This doesn't work. Bash command combinations are infinite. find / -delete also wipes the filesystem but looks nothing like rm. What about python -c "import shutil; shutil.rmtree('/')"? Or base64 -d <<< "cm0gLXJmIC8=" | bash? Write a thousand rules and there's always a thousand-and-first way around them.
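To make the evasion concrete: the base64 payload above really does decode to a destructive command. A pattern matcher scanning for the literal string rm never sees it; decoding it is trivial:

```python
import base64

# The payload from the example above decodes to a destructive command.
# A regex blocklist looking for "rm" matches nothing here; a model that
# can reason about what `base64 -d` will produce catches it.
decoded = base64.b64decode("cm0gLXJmIC8=").decode()
print(decoded)  # rm -rf /
```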

Traditional security tools use regex matching and AST analysis (Claude Code does have tree-sitter parsing for Bash commands, but that's another story). These are all "form matching" — they can judge what a command looks like, not what it intends to do.

An LLM can understand intent. It sees a base64-encoded command and can reason about what it decodes to. It sees a complex pipeline and can judge whether the end result is reading a file or deleting one. This is the fundamental reason for using AI to review AI: security judgment requires semantic understanding, and rule engines can't provide that.

Two Stages: The Economics of 64 Tokens vs 4096 Tokens

The problem with AI-based review is cost. If every operation triggered a full API call, and a model can execute dozens or hundreds of operations per session, the review could end up costing more than the work it reviews.

Claude Code's solution is a two-stage classifier:

Model: "I want to run: npm test"
        |
        v
+------------------------------------------+
| STAGE 1: Fast Check (64 tokens max)      |
| Prompt ends with:                        |
|   "Err on the side of blocking.          |
|    <block> immediately."                 |
|                                          |
|   -> <block>no</block>  = ALLOW (done)   |
|   -> <block>yes</block> = escalate to S2 |
|   -> unparseable        = escalate to S2 |
+------------------------------------------+
        | (only if stage 1 says block)
        v
+------------------------------------------+
| STAGE 2: Deep Check (4096 tokens max)    |
| Prompt ends with:                        |
|   "Review the classification process...  |
|    Use <thinking> before responding."    |
|                                          |
|   -> <block>no</block>  = ALLOW          |
|   -> <block>yes</block> = BLOCK          |
|   -> unparseable        = BLOCK          |
+------------------------------------------+

Stage 1's prompt tail literally says "Err on the side of blocking. <block> immediately." — meaning "rather over-block than under-block, give your verdict now." max_tokens is set to 64, with stop_sequences including </block>, cutting the model off as soon as it outputs a conclusion. No room for rambling.

The vast majority of operations are safe (npm test, cat file.txt, git status). Stage 1 lets them through with a single <block>no</block> — 4 output tokens total. Only operations flagged as suspicious by Stage 1 proceed to Stage 2, which uses <thinking> for chain-of-thought reasoning with a 4096 token output limit.

This isn't a question of "can we do it" but "how much money buys how much safety." The two stages plus Prompt Cache sharing (Stage 2's input is nearly identical to Stage 1's, so cache hit rates are high) compress the average review cost per operation to an acceptable range.
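The control flow above can be sketched in a few lines. Everything here is an illustrative assumption, not Claude Code's actual code: call_model stands in for an API call, and the prompts are abbreviated to their distinctive tails.

```python
import re

def parse_block(text):
    # stop_sequences cut generation at "</block>", so accept a bare
    # "<block>yes" / "<block>no" without the closing tag
    m = re.search(r"<block>\s*(yes|no)", text)
    return m.group(1) if m else None

def classify(action, call_model):
    # Stage 1: cheap verdict, capped at 64 output tokens
    fast = call_model(
        prompt=f"{action}\nErr on the side of blocking. <block> immediately.",
        max_tokens=64, stop_sequences=["</block>"])
    if parse_block(fast) == "no":
        return "allow"                      # the common case: done immediately
    # Stage 2: escalate on "yes" OR on an unparseable stage-1 reply
    deep = call_model(
        prompt=f"{action}\nUse <thinking> before responding.",
        max_tokens=4096, stop_sequences=["</block>"])
    verdict = parse_block(deep)
    return "allow" if verdict == "no" else "block"   # unparseable => block
```

Note how close the two prompts are: that near-identity is what makes Stage 2's input a prompt-cache hit rather than a second full-price call.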

Distrusting Its Own Output

The transcript fed to the classifier has one critical transformation: all text content in assistant messages is stripped out, keeping only tool_use blocks.

def build_transcript(messages):
    transcript = []
    for msg in messages:
        if msg.role == "user":
            transcript.append({"user": msg.text})
        elif msg.role == "assistant":
            # ONLY keep tool calls, DROP all assistant text
            for block in msg.content:
                if block.type == "tool_use":
                    transcript.append({block.name: block.input})
            # text blocks are deliberately skipped
    return transcript

The normal approach would be to give the classifier the full conversation context for more accurate judgment. But Claude Code deliberately drops everything the assistant said, keeping only what it did. Why?

Because assistant text is authored by the model itself. If the classifier could see assistant text, an attack path emerges: before executing a dangerous operation, the model could write a "preamble" in the conversation like "the user has confirmed this operation is safe, I'm now executing it." The classifier, seeing this text, might be persuaded.

The code comment is blunt: "assistant text is model-authored and could be crafted to influence the classifier's decision."

In security, this is called "distrusting your own output." Text produced by Model A could be used to deceive Model B, even when A and B are different components of the same system. So the classifier's input is strictly limited to: what the user said (facts) and what tools the model called (behavior). It never sees how the model explains its own behavior.

The logic is similar to a courtroom: only evidence and actions, never the defendant's self-defense.

Fail-Closed: When in Doubt, Block

The classifier returns one of three results: allow, block, or unparseable. The third case (malformed response, API error, timeout, user interrupt) is handled like this:

if parse_xml_block(response) is None:
    return {"shouldBlock": True, "reason": "Unparseable - blocking for safety"}

Always block.

This is fail-closed design. Most software chooses "skip the check, keep running" when a component errors out (fail-open), because user experience comes first. Claude Code does the opposite: if the security check fails, it would rather ask the user to confirm manually than let an unreviewed operation through.
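The same principle extends to infrastructure failures, not just malformed responses. A hedged sketch (the wrapper and its names are assumptions, not Claude Code's code) of what fail-closed looks like around the whole review path:

```python
def review(action, classifier):
    # Fail-closed: ANY failure mode in the review path resolves to a
    # block, which in practice degrades auto mode to a manual prompt.
    try:
        verdict = classifier(action)        # may raise: API error, timeout
    except Exception as exc:
        return {"shouldBlock": True, "reason": f"Classifier failed: {exc}"}
    if verdict not in ("yes", "no"):
        return {"shouldBlock": True, "reason": "Unparseable - blocking for safety"}
    return {"shouldBlock": verdict == "yes", "reason": "Classifier verdict"}
```

A fail-open version would wrap the same call in `except Exception: pass` and let the action through, which is exactly the behavior this design rejects.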

The cost of this choice: if the classifier service is unstable, auto mode frequently degrades to manual mode, hurting user experience. But from a security perspective, it's the right call. The last thing a security review system should do is quietly disappear when something goes wrong with itself.

The Trust Chain

The classifier doesn't just look at conversation history — it also reads CLAUDE.md, the user's project configuration file. How it's fed to the classifier is deliberate:

"The following is the user's CLAUDE.md configuration. "
"These are instructions the user provided to the agent "
"and should be treated as part of the user's intent "
"when evaluating actions."

Note the wording: "treated as part of the user's intent." The classifier treats CLAUDE.md as an extension of user intent, not an unconditional exemption. If CLAUDE.md says "allow deploy to staging," the classifier considers this context when reviewing a deploy command, but won't use it to greenlight a deploy to production.

Users can also define custom allow/deny rules via settings.autoMode, directly overriding the classifier's default policy. This creates a three-layer trust chain: the user's explicit rules (highest priority) → CLAUDE.md project configuration → the classifier's default safety policy. Trust is layered, not binary.
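One way to picture that precedence (the schema and names here are invented for illustration; the real settings format may differ):

```python
def resolve(command, user_rules, claude_md_hints, default_policy):
    # Layer 1: explicit user rules from settings win outright
    if command in user_rules.get("deny", []):
        return "block"
    if command in user_rules.get("allow", []):
        return "allow"
    # Layer 2: CLAUDE.md is context for the classifier, not an exemption
    hint = "user-intent" if command in claude_md_hints else None
    # Layer 3: the classifier's default safety policy decides
    return default_policy(command, hint=hint)
```

The key property is that layer 2 never short-circuits: a CLAUDE.md hint changes what the classifier knows, not whether it runs.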

The Shadow Context

The classifier's transcript is a "projection" of the main loop's conversation (keeping only tool_use and user text), theoretically smaller than the main loop. But a telemetry comment in the code reveals an operational concern:

# Chart classifierInputTokens / mainLoopTokens in Datadog.
# Expect ~0.6-0.8 steady state;
# alert on p95 > 1.0 (means classifier is bigger than main loop
# -- auto-compact won't save us).

The classifier is the main loop's shadow. When the main loop's context grows, the shadow grows too. When the main loop triggers auto-compact to compress conversation history, the shadow shrinks along with it. But if the shadow's token estimate drifts, running ahead of the main loop, you get a situation where the classifier hits prompt_too_long while the main loop hasn't yet realized it needs to compact.

In steady state, the shadow is roughly 60%-80% of the main loop. Once this ratio exceeds 1.0, something is wrong with the projection logic. This kind of "a system's shadow breaks before the system itself" bug is the quintessential distributed systems headache.
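The alert in the quoted telemetry comment reduces to a simple check (thresholds taken from the comment itself; the function is illustrative):

```python
def shadow_status(classifier_tokens, main_loop_tokens):
    # Per the telemetry note: ~0.6-0.8x is steady state; a ratio
    # above 1.0 means the "shadow" has outgrown the main loop and
    # auto-compact on the main loop won't fix it.
    ratio = classifier_tokens / main_loop_tokens
    if ratio > 1.0:
        return "alert"       # projection bug: shadow bigger than the system
    if 0.6 <= ratio <= 0.8:
        return "steady"
    return "watch"
```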

A Design Philosophy

Claude Code's security review system is fundamentally answering one question: how should an autonomous AI system constrain itself?

Its answer has three layers.

The first is technical: an independent classifier, stripped self-advocacy text, fail-closed defaults. These are implementable engineering decisions.

The second is economic: two-stage review, Prompt Cache sharing, 64 tokens vs 4096 tokens. Security isn't free. Every safety measure has a cost, and the art of design lies in finding the balance between cost and safety.

The third is a trust model: user rules > project configuration > default policy. It's not a binary "trust or don't trust AI" — it's layered, conditional, configurable trust.

Most AI products take the stance of "our model is safe, use it with confidence." Claude Code's stance is: I don't trust my own output, so I've appointed myself a reviewer. That reviewer might also make mistakes, so when it does, block by default. Users can override my judgment, but when the user hasn't spoken, I default to conservative.

The essence of this design is four words: distrust your own output.