Finisky Garden

NLP, Software Engineering, Product Design


RAG re-retrieves, re-assembles, and re-reasons on every query. Ask something that requires synthesizing five documents and the model has to find all five, stitch them together, and derive the answer from scratch. Ask ten times, retrieve ten times. Nothing accumulates.

Karpathy recently posted a gist called LLM Wiki proposing a different approach: instead of retrieving at query time, have the LLM pre-compile knowledge into a structured wiki and query the compiled result.

Dense retrieval has become the default first stage in RAG pipelines. Encode documents into vectors, encode queries into vectors, compute cosine similarity, done. But a basic question rarely gets asked: for a d-dimensional embedding, how many distinct top-k retrieval results can it actually represent?
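The "encode, compare, done" pipeline is simple enough to sketch in a few lines. This is a minimal illustration of top-k dense retrieval with cosine similarity over toy 2-dimensional vectors, not any particular production system:

```python
import numpy as np

def top_k(query_vec, doc_vecs, k=2):
    """Rank documents by cosine similarity to the query vector."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                      # cosine similarity per document
    return np.argsort(-scores)[:k]      # indices of the k best matches

docs = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
print(top_k(np.array([1.0, 0.1]), docs, k=2))
```

The paper's question is about this exact mechanism: every possible query maps to some ordering of those dot products, and a d-dimensional space can only realize so many distinct orderings.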

An ICLR 2026 paper from Google DeepMind and JHU, “On the Theoretical Limitations of Embedding-Based Retrieval”, gives a mathematical answer: not enough. Not even close.

Justin Sun (the crypto guy) recently dropped a hot take: “It’s 2026 already — if you can talk to AI, stop talking to humans.” He also said something about deleting contacts born before 1990 and WeChat being for old people. Classic Justin Sun. Take it with a grain of salt.

But strip away the absurd parts, and “talk to AI more” is something I actually agree with as a heavy user. Here’s why, and where it falls apart.

There’s a module inside Claude Code called autoDream. Its prompt title reads “Dream: Memory Consolidation.”

This isn’t a metaphor. Claude Code actually spins up a background sub-agent that reviews transcripts from past sessions, consolidates scattered memories — merging, deduplicating, correcting — and writes them back to disk. The whole thing is invisible unless you dig into the background task list.
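The consolidation step itself (merge, deduplicate, keep the freshest version) has a recognizable shape. A toy sketch of that shape, with the caveat that the real pipeline delegates this judgment to an LLM sub-agent rather than string matching, and `consolidate` is a hypothetical name:

```python
def consolidate(memories: list[str]) -> list[str]:
    """Toy consolidation pass: drop near-duplicate entries, letting the
    newest phrasing win. (The real system uses an LLM to merge and
    correct; this only shows the dedup/merge shape.)"""
    seen, merged = set(), []
    for note in reversed(memories):          # newest entries win
        key = " ".join(note.lower().split())
        if key not in seen:
            seen.add(key)
            merged.append(note)
    return list(reversed(merged))

notes = ["Prefers bun over npm", "prefers bun  over npm", "Repo uses pytest"]
print(consolidate(notes))
```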

Tech job boards in 2025 send contradictory signals. Traditional software engineering roles are shrinking. “AI”-prefixed positions are expanding. Same company, same quarter — cutting junior devs and project managers on one side, opening Agent orchestration engineer and AI application architect roles on the other.

Last June, Shopify CEO Tobi Lutke posted that he preferred “context engineering” over “prompt engineering.” Karpathy retweeted with a +1. Simon Willison wrote a blog post saying the term might actually stick. Phil Schmid published a full definition. Half the AI community switched terminology within a week.

By early 2026, Phil Schmid introduced another term: Agent Harness. It didn’t generate the same buzz, but anyone building Coding Agents quietly nodded along.

After OpenClaw crossed 350K stars, a narrative started forming in the community: since both run on Opus 4.6 under the hood, the open-source option should be on par with Claude Code. Anyone who has actually used both probably shares the same observation — in long sessions, OpenClaw starts losing context, forgetting files it already read, redoing work it already did. Claude Code does too, but noticeably later, and it recovers much better.

Same model, different experience. Why?

Cursor’s parent company Anysphere has about 150 employees. In November 2025, its ARR crossed $1 billion. OpenAI, as of early 2026, has 4,500 employees. Its 2025 revenue was $13.1 billion, but according to Fortune, it lost roughly $9 billion and doesn’t expect to turn profitable until 2028.

An application company that trains zero models is outproducing, per capita, the company that trains them. This is the most telling signal in AI for 2025.

In late November 2025, an open-source project called OpenClaw went live on GitHub. Four and a half months later, it had 350K stars, 70K forks, 81 releases, and sponsorships from OpenAI, NVIDIA, and Vercel. For comparison: Open WebUI took two and a half years to reach 130K stars; NextChat took three years to hit 88K. Growth like OpenClaw’s is rare in GitHub’s history.

It isn’t a new model, a training framework, or even a “technical breakthrough” in the traditional sense. It’s a personal AI assistant that runs on your own machine and talks to you through the chat apps you already use: WhatsApp, Telegram, Slack, Discord, WeChat, Feishu, iMessage, Matrix, and over 25 platforms in total, all connected to a single backend.

Why did it break out of the developer bubble?

When you use Claude Code, there’s something you probably never notice: it has over 40 registered tools, but when you ask it to read a file or edit a few lines of code, it only uses three or four. The definitions for the remaining 30-plus tools, each around 500 tokens, add up to over 10,000 tokens of fixed overhead per request. You just want to change one line of CSS, but you’re paying for WebSearch, NotebookEdit, CronCreate, and a bunch of tools you’ll never touch.
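The arithmetic behind that overhead is worth making explicit. Using the approximate figures from above (roughly 40 registered tools, ~500 tokens per definition, only 3 or 4 actually used):

```python
# Rough fixed-overhead arithmetic from the figures above (all approximate).
TOOLS_REGISTERED = 40
TOKENS_PER_TOOL_DEF = 500      # rough average size of one tool definition
TOOLS_ACTUALLY_USED = 4

unused = TOOLS_REGISTERED - TOOLS_ACTUALLY_USED
overhead = unused * TOKENS_PER_TOOL_DEF
print(f"~{overhead} tokens of unused tool definitions per request")
```

With these numbers the dead weight lands well above the 10,000-token floor mentioned above, and it is paid on every single request.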

Claude Code’s Edit tool has a deceptively simple interface: give it an old_string, give it a new_string, and it finds the former in a file and replaces it with the latter. Sounds like nothing more than a str.replace(). But in the context of an LLM Agent, this seemingly trivial operation is backed by an entire engineering pipeline spanning everything from string sanitization to concurrency safety. The model stuffs line numbers into its replacement strings. It conjures curly quotes out of thin air. External tools modify the target file while the user is still reviewing the permission dialog. The Edit tool has to stay correct through all of this — far more than find-and-replace can handle.

From observing its behavior, the Edit tool’s execution breaks down into three phases: API-layer preprocessing (before the tool even receives input), input validation (before the permission dialog is shown), and the actual write (after the user approves). Each phase handles a distinct class of problems and maintains deliberate sync/async boundaries.
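To make the first two phases concrete, here is a minimal sketch of what sanitization plus validation might look like. `apply_edit` is a hypothetical function, and the quote normalization shown is just one of the many cleanups a real pipeline would need:

```python
def apply_edit(text: str, old_string: str, new_string: str) -> str:
    """Hypothetical sketch: sanitize the model's strings, validate the
    target, then perform a single replacement."""
    # Phase 1 (sketch): normalize curly quotes the model may have conjured.
    trans = str.maketrans({"\u201c": '"', "\u201d": '"',
                           "\u2018": "'", "\u2019": "'"})
    old_string = old_string.translate(trans)
    new_string = new_string.translate(trans)

    # Phase 2 (sketch): the target must exist and be unambiguous.
    count = text.count(old_string)
    if count == 0:
        raise ValueError("old_string not found in file")
    if count > 1:
        raise ValueError(f"old_string matches {count} locations; ambiguous")

    # Phase 3 (sketch): the actual write, replacing exactly one occurrence.
    return text.replace(old_string, new_string, 1)

src = "color: red;\ncolor: blue;\n"
print(apply_edit(src, "color: red;", "color: green;"))
```

What the sketch cannot show is the concurrency side: between validation and the write, the real tool also has to detect that the file on disk is still the file it validated.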

Claude Code has a mode that appears in no documentation whatsoever. When active, it systematically erases every trace of AI involvement. No Co-Authored-By trailer, no “Generated with Claude Code” footer, and the system prompt itself doesn’t even tell the model what it is. This mode is called Undercover Mode. It exists only in Anthropic’s internal builds; external users will never see it, because dead code elimination strips the entire feature out during public builds.

The behavioral implications are telling: this mechanism exists because Anthropic employees routinely use Claude Code to commit to public repositories. Without some form of protection, commit messages might contain unreleased model codenames, PR descriptions might expose internal project names, and model identifiers in the system prompt could leak through some vector or another. Undercover Mode is designed to plug all of these holes.

When tackling complex tasks, Claude Code spawns multiple sub-agents in parallel, each needing the full parent conversation context. If the parent has accumulated 100K tokens and three sub-agents are spawned, a naive implementation charges 300K tokens of input.

Anyone familiar with LLM inference optimization will recognize this immediately: it’s a KV Cache sharing problem. When multiple requests share the same prefix, the Attention layer’s Key/Value tensors can be reused, skipping redundant computation. Anthropic exposes this capability to API users as Prompt Cache, offering a 90% discount on cached prefix portions — but only if the prefix bytes are exactly identical across requests. Claude Code’s fork sub-agents are deliberately constructed so that over 99% of the bytes are identical, compressing the effective input cost of three sub-agents to roughly 120K token-equivalent (100K full price + 2 × 100K × 10%).
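The cost math from the parenthetical above can be written out directly. A simplified model that assumes the forked prefix is entirely byte-identical and cache-hit (`forked_input_cost` is an illustrative name, not an API):

```python
def forked_input_cost(parent_tokens: int, n_subagents: int,
                      cache_discount: float = 0.10) -> float:
    """Effective input cost in token-equivalents when sub-agents share a
    cached prefix: the first request pays full price (and warms the
    cache); the rest pay only the discounted cached rate."""
    return parent_tokens + (n_subagents - 1) * parent_tokens * cache_discount

print(forked_input_cost(100_000, 3))   # vs. 300,000 for the naive version
```

With a 100K-token parent and three sub-agents, this reproduces the ~120K figure: 100K at full price plus two forks at 10%.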

A 200K context window sounds generous — until you’re in a moderately complex coding session. Read a few dozen files, run several rounds of grep, execute some bash commands, and you’ve already burned through most of it. Compaction is inevitable, but compaction itself costs money: you need an LLM call to generate a summary, and the input to that call is the very context you’re trying to compress. This creates a fascinating engineering trade-off: compact too early and you lose useful information; compact too late and the window overflows; and the cost of compaction itself can’t be ignored. Claude Code’s answer is a multi-layer cascade: avoid compaction if you can, do it cheaply if you must, and only call the LLM as a last resort.
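The "avoid, cheap, last resort" cascade can be sketched as a threshold ladder. The thresholds and the function name here are hypothetical, chosen only to show the shape of the trade-off:

```python
def compaction_action(used_tokens: int, window: int = 200_000,
                      soft: float = 0.70, hard: float = 0.90) -> str:
    """Hypothetical cascade: do nothing while there is headroom, try
    cheap non-LLM trimming (e.g. dropping stale tool outputs) past a
    soft threshold, and pay for an LLM summary only near overflow."""
    usage = used_tokens / window
    if usage < soft:
        return "no-op"
    if usage < hard:
        return "cheap-trim"        # mechanical truncation, no LLM call
    return "llm-summarize"         # last resort: a paid summarization call

print(compaction_action(120_000))
print(compaction_action(185_000))
```

The key property of any such ladder is that the expensive rung charges input tokens proportional to the very context being compressed, which is why the cheap rungs exist at all.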

The security challenge of AI-executed Bash commands isn’t “should we trust the model” — it’s “how do we make sure a command actually means what it looks like.”

Claude Code lets AI execute Bash commands directly. Not through a structured interface like MCP; it literally gives the model a shell. MCP’s approach wraps tools into JSON schemas, which is safe enough, but you can’t realistically write adapters for thousands of CLI tools. The capability ceiling is obvious. A raw shell can do anything; the tradeoff is that the security problem shifts from “controlling interface permissions” to “figuring out what a command actually does.” In the previous post, I covered how YOLO Classifier uses AI to review AI, but the Classifier works with the full command string and makes a semantic judgment: is this operation dangerous? Before that judgment even happens, there’s a deeper question that needs answering: does this command actually mean what it appears to mean?
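A minimal illustration of that deeper question: before any semantic "is this dangerous?" review, a command string first has to parse as what it appears to be. This sketch (hypothetical function, not Claude Code's actual checker) rejects strings containing substitutions or chaining that could hide a second meaning:

```python
import shlex

SUSPECT_MARKERS = ("$(", "`", "${", ";", "&&", "||", "|")

def looks_like_single_command(cmd: str) -> bool:
    """Hypothetical pre-check: verify the string tokenizes as one plain
    command, with no substitutions or chaining that could smuggle in a
    second operation behind an innocent-looking first one."""
    if any(marker in cmd for marker in SUSPECT_MARKERS):
        return False
    try:
        shlex.split(cmd)             # must at least tokenize cleanly
    except ValueError:               # e.g. an unterminated quote
        return False
    return True

print(looks_like_single_command("ls -la src/"))            # harmless listing
print(looks_like_single_command("ls $(rm -rf /tmp/x)"))    # hidden deletion
```

The point is ordering: only after a command is structurally honest does it make sense to ask a classifier whether its one, known operation is dangerous.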

Claude Code has an auto mode that executes operations without confirmation. But “auto” doesn’t mean “unreviewed” — there’s a classifier watching every action.

The Auto Mode Paradox

One of the most annoying things about using Claude Code is the permission popups. Every Bash command, every file write requires a confirmation click. Power users turn on auto mode, letting Claude execute everything autonomously without asking.

This creates an obvious problem: what if the model decides to rm -rf /, push to the production branch, or write a backdoor into .bashrc?

Claude Code has no vector database and no embedding index, yet it can pinpoint the exact file you need in a million-line codebase. Behind this is a retrieval architecture completely different from traditional RAG.

This Isn’t the RAG You Know

If you’ve used RAG before, the pipeline should be familiar: build an offline index, user asks a question, vector-search for Top-K chunks, inject into prompt, generate an answer. A straight line, one pass, done.

Claude Code doesn’t work like that at all. It has no offline index. The model itself drives the retrieval process.
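The building block of model-driven retrieval is not a vector search but a grep-like scan the agent can call repeatedly, reading results and refining its pattern each round. A sketch of such a tool (the function name and scope are illustrative, not Claude Code's internals):

```python
import re
from pathlib import Path

def agentic_grep(root: str, pattern: str, max_hits: int = 5):
    """Sketch of the scan tool a model-driven loop would call: no offline
    index, just on-demand regex search over source files, returning
    (path, line number, line) hits for the model to inspect."""
    hits = []
    rx = re.compile(pattern)
    for path in sorted(Path(root).rglob("*.py")):
        for lineno, line in enumerate(
                path.read_text(errors="ignore").splitlines(), 1):
            if rx.search(line):
                hits.append((str(path), lineno, line.strip()))
                if len(hits) >= max_hits:
                    return hits
    return hits
```

The loop around this call is the model itself: it looks at the hits, decides the pattern was too broad or too narrow, and searches again, which is exactly what an offline index can never do.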

Nassim Taleb proposed a thought experiment in Fooled by Randomness: given infinite monkeys typing on infinite typewriters, one of them will eventually produce the complete text of the Iliad.

The more I think about it, the more I believe this story’s endgame is today’s large language models.

Developers who’ve used Claude Code probably share this experience: even in an ultra-long conversation where dozens of files have been modified, it seems to always “remember” what it did before. Even more remarkably, if you told it “I prefer bun over npm” in a previous session, it automatically follows that preference next time.

Behind this is a sophisticated memory management system. Let’s tear apart Claude Code’s memory mechanism layer by layer.