Three Evolutions of Agent Engineering
Last June, Shopify CEO Tobi Lutke posted that he preferred "context engineering" over "prompt engineering." Karpathy retweeted with a +1. Simon Willison wrote a blog post saying the term might actually stick. Phil Schmid published a full definition. Half the AI community switched terminology within a week.
By early 2026, Phil Schmid introduced another term: Agent Harness. It didn't generate the same buzz, but anyone building Coding Agents quietly nodded along.
Three terms, three phases of focus. But there's a problem with how the industry discusses them: many people treat them as successive rebrandings, as if context engineering were prompt engineering 2.0 and harness engineering were 3.0. That's wrong. They solve problems at entirely different levels, and the relationship between them is containment, not replacement.
Prompt: How You Talk to the Model
Prompt engineering took off in 2022-2023. The core scenario is simple: you're facing a text box, trying to get the model to produce the output you want. Chain-of-thought, few-shot examples, role-playing, format constraints — these techniques all answer the same question: how do you accurately translate your intent into text the model can understand?
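The techniques above can be sketched in a single template. This is an illustrative example, not any particular product's prompt; the sentiment-classification task and the example reviews are invented:

```python
def build_sentiment_prompt(review: str) -> str:
    # Role-playing, a format constraint, and two few-shot examples,
    # all packed into one static string: the core moves of prompt engineering.
    return (
        "You are a strict sentiment classifier.\n"
        "Answer with exactly one word: positive or negative.\n\n"
        'Review: "The battery died in an hour."\n'
        "Sentiment: negative\n\n"
        'Review: "Best purchase I\'ve made all year."\n'
        "Sentiment: positive\n\n"
        f'Review: "{review}"\n'
        "Sentiment:"
    )
```

Note that the function takes one input and returns one string: the entire engineering effort lives inside a static piece of text, which is exactly the single-exchange assumption discussed below.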
This is a language problem. You're communicating with an "alien" whose native language is token sequences while yours is natural human language. Prompt engineering is learning the art of that translation. Do it well and the model gets you; do it poorly and you get hallucinations and drift.
The key limitation: prompt engineering assumes a single exchange between you and the model. You pack everything into one request, the model returns one result, interaction over. Even in multi-turn scenarios, optimizing each turn's prompt is still fundamentally about polishing that static text. Karpathy put it well — most people's intuitive understanding of the term is "the art of typing things into a chatbot," and unfortunately that intuition isn't far off.
Context: What the Model Sees
In mid-2025, Tobi Lutke defined it as "the art of providing all the context for the task to be plausibly solvable by the LLM." Karpathy's addition was more precise: context engineering is the delicate art and science of "filling the context window with just the right information for the next step," involving task descriptions, few-shot examples, RAG retrieval, multimodal data, tool definitions, state and history, compression — getting all of this right is highly non-trivial.
Phil Schmid later distilled the concept in a blog post: designing and building dynamic systems that provide the right information and tools, in the right format, at the right time, to give the LLM everything it needs to accomplish a task.
Note two key words: dynamic, system. Context engineering is no longer a string optimization problem — it's a systems engineering problem. You're not writing a prompt; you're writing a program whose output is the prompt. That program needs to decide whether this particular call should include the user's calendar data, retrieve relevant documents, attach a summary of previous conversations, or provide specific tools. Every decision differs because every task differs.
Phil Schmid gave an intuitive example: given the same email saying "let's chat tomorrow," an agent with poor context can only reply "what time works for you?" An agent with good context can directly say "tomorrow's fully booked, how about Thursday morning? Sent you an invite." The second response is better not because the prompt was craftier, but because the system had already pulled calendar, contacts, and email history before calling the model. Prompt engineering cares about what you said to the model; context engineering cares about what the model saw.
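The "program whose output is the prompt" idea can be sketched around that scheduling example. Everything here is a stand-in: `fetch_calendar`, `fetch_contact`, and `fetch_thread` are hypothetical data sources stubbed with canned values, not a real API:

```python
from datetime import date

# Stubbed data sources; a real system would hit calendar, CRM, and mail APIs.
def fetch_calendar(day: date) -> list[str]:
    return ["09:00 standup", "10:00-17:00 offsite (fully booked)"]

def fetch_contact(sender: str) -> str:
    return "Jim, colleague, prefers mornings"

def fetch_thread(thread_id: str) -> str:
    return "Jim: hey, let's chat tomorrow"

def assemble_context(thread_id: str, sender: str, day: date) -> str:
    # The program whose output is the prompt: each section is pulled in
    # because this task needs it; a different task assembles differently.
    sections = [
        "## Calendar for " + day.isoformat() + "\n" + "\n".join(fetch_calendar(day)),
        "## Sender\n" + fetch_contact(sender),
        "## Thread\n" + fetch_thread(thread_id),
        "## Task\nDraft a reply proposing a concrete alternative time.",
    ]
    return "\n\n".join(sections)
```

By the time the model is called, the "tomorrow's fully booked" answer is already sitting in its context; no amount of prompt craft substitutes for that retrieval step.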
Phil Schmid put it bluntly in his context engineering blog post: most of the time when agents are unreliable, the root cause isn't model capability — it's that the right context wasn't delivered to the model. This runs counter to many people's intuition. Everyone thinks they need to wait for a stronger model, but often a task that Opus 4.6 can't handle becomes solvable by Sonnet 4 once you fill in the context.
Harness: Who Manages the Agent's Lifetime
In early 2026, Phil Schmid published "The importance of Agent Harness in 2026." He used a computer analogy: the model is the CPU, the context window is RAM, the Agent Harness is the operating system, and the Agent is the application running on that OS.
This analogy highlights a dimension that context engineering discussions overlook: time. The core scenario in context engineering is "how to prepare context before the next LLM call," but the reality of agents is dozens or even hundreds of consecutive LLM calls spanning minutes to days. At that scale, context engineering is just the memory management module inside the operating system. You still need process scheduling, file systems, drivers, and interrupt handling.
What the harness manages includes but isn't limited to: how context is compressed and reclaimed in long sessions (memory management), how sub-agents are assigned tasks and their contexts isolated (process management), how to retry and degrade when tool calls fail (exception handling), how to detect and correct when the model's attention starts drifting (watchdog), how to restore state when the user leaves for an hour and comes back (hibernate/wake), and how to record the decision trace at each step for subsequent training (logging).
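A toy loop can make three of those responsibilities concrete: compaction (memory management), tool retry with surfaced failures (exception handling), and a crude repetition check (watchdog). This is a sketch under heavy assumptions; `call_model`, `run_tool`, and `summarize` are hypothetical callables the caller supplies, and real harnesses are far more involved:

```python
def run_agent(task, call_model, run_tool, summarize,
              max_steps=50, ctx_limit=8000):
    history = [f"TASK: {task}"]
    last = None
    for step in range(max_steps):
        # Memory management: compact once the transcript outgrows the budget.
        if sum(len(h) for h in history) > ctx_limit:
            history = [summarize(history)] + history[-3:]

        output = call_model("\n".join(history))

        # Watchdog: two identical outputs in a row suggest drift or a loop.
        if output == last:
            history.append("SYSTEM: you are repeating yourself; try another approach")
            last = None
            continue
        last = output

        if output.startswith("DONE"):
            return output

        # Exception handling: retry the tool once, then surface the failure
        # to the model instead of crashing the whole session.
        try:
            result = run_tool(output)
        except Exception:
            try:
                result = run_tool(output)
            except Exception as e:
                result = f"TOOL FAILED: {e}"
        history.append(f"STEP {step}: {output}\nRESULT: {result}")
    return "GAVE UP"
```

Notice that none of this logic is visible in any single prompt or context; it only exists across calls, which is exactly the dimension the harness owns.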
Take Claude Code as a concrete example. Its context compression uses a four-layer cascade: zero-cost time-based clearing, server-side cache editing that preserves the cache prefix, background notes as free summaries, and only then a full LLM summary as a last resort. This layered logic isn't something context engineering can describe — it's a complete runtime resource management strategy, echoing the design philosophy of multi-level caching plus virtual memory in operating systems.
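The cascade idea, cheapest layer first and a paid LLM summary only as a last resort, can be sketched as follows. To be clear, this is not Claude Code's actual implementation: the message schema, thresholds, and `llm_summarize` hook are all invented for illustration, and the sketch assumes a reasonably long history:

```python
def compact(history: list[dict], budget: int, llm_summarize) -> list[dict]:
    def size(h):
        return sum(len(m["text"]) for m in h)

    # Layer 1: zero-cost clearing of entries already marked stale.
    history = [m for m in history if not m.get("expired")]
    if size(history) <= budget:
        return history

    # Layer 2: drop bulky tool outputs, but never touch the opening
    # messages, so the cached server-side prefix stays valid.
    prefix, rest = history[:2], history[2:]
    history = prefix + [m for m in rest if m.get("role") != "tool_output"]
    if size(history) <= budget:
        return history

    # Layer 3: swap old turns for background notes written along the way
    # (free summaries that cost nothing extra at compaction time).
    notes = [m for m in history[2:-5] if m.get("role") == "note"]
    history = history[:2] + notes + history[-5:]
    if size(history) <= budget:
        return history

    # Layer 4: last resort, a paid LLM summary of everything but the tail.
    summary = {"role": "note", "text": llm_summarize(history[:-5])}
    return history[:2] + [summary] + history[-5:]
```

The design choice mirrors multi-level caching: each layer is tried only when the cheaper one above it fails to free enough room.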
Phil Schmid also offered an insight in his article: the harness's value isn't just making the current agent run better. The execution trace data it produces is itself training material for the next generation of models. When the model starts ignoring instructions at step 100, the harness records it, that data flows back into the training pipeline, and the next model version won't start drifting until step 200. This feedback loop is entirely absent from pure context engineering discussions.
Three Layers Nested, Not Replacing
To clarify the relationship between all three:
+------------------------------------------------+
| Harness Engineering |
| Lifecycle, compaction, sub-agents, |
| hooks, state recovery, drift detection |
| |
| +-------------------------------------------+ |
| | Context Engineering | |
| | RAG, memory, tools, history, | |
| | dynamic assembly per LLM call | |
| | | |
| | +--------------------------------------+ | |
| | | Prompt Engineering | | |
| | | Instructions, format, tone, | | |
| | | chain-of-thought, few-shot | | |
| | +--------------------------------------+ | |
| +-------------------------------------------+ |
+------------------------------------------------+
Prompt engineering is the innermost layer, solving "how to talk to the model." Context engineering is the middle layer, solving "what the model sees before responding." Harness engineering is the outer shell, solving "resource scheduling and state management across the agent's entire lifecycle."
Their decision frequencies differ too. Prompt engineering decisions happen at system design time — write once, rarely change. Context engineering decisions happen before every LLM call — potentially different each time. Harness engineering decisions span the entire session lifecycle — behavior at step 1 and step 100 may be completely different.
What Happens When You Mix Them Up
Confusing the layers leads to solving problems at the wrong level.
An agent repeatedly fails on a particular tool call. If you think it's a context problem, you'll check the tool definition format and whether the context carries enough relevant information. But the real issue might live at the harness layer: no retry mechanism, no degradation strategy, so the model gets stuck in an infinite loop. The reverse is equally true: when the model can't understand your instructions and keeps going off track, tweaking the harness layer's compression strategy and sub-agent orchestration won't help. The problem is in the innermost layer: the prompt isn't clear enough.
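The harness-layer fix for the first case is mechanical: retry the flaky tool with backoff, then degrade to a cheaper fallback rather than letting the agent spin. A minimal sketch, where `tool` and `fallback` are hypothetical callables supplied by the harness:

```python
import time

def call_with_degradation(tool, fallback, args, retries=3, base_delay=0.1):
    # Retry the flaky tool with exponential backoff between attempts.
    for attempt in range(retries):
        try:
            return tool(*args)
        except Exception:
            time.sleep(base_delay * (2 ** attempt))
    # Degradation: a weaker answer from the fallback beats an agent
    # stuck retrying the same broken call forever.
    return fallback(*args)
```

No prompt change and no extra context can express this policy; it has to live in the code that wraps the model.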
Phil Schmid cited Rich Sutton's Bitter Lesson, arguing that the harness must stay lightweight because each new generation of models renders the previous generation's complex control flows redundant. He's right, but only about the harness layer. Prompt engineering techniques do depreciate with model upgrades: last year's chain-of-thought hints aren't needed this year. Context engineering is relatively stable, because "models need to see relevant information to make good decisions" doesn't go away with better models. Harness engineering's core value is also shifting: the larger the context window and the longer the model stays coherent, the less governance the harness needs to do. But as long as windows are finite and models drift, the harness has a reason to exist.
When an agent starts repeating itself in the back half of a long session, adding "don't repeat" to the prompt won't help — the problem is in the harness layer's compression strategy. Knowing which layer the problem lives in is the first step to knowing where to push.