Foundations of Harness Engineering
Define harness engineering, its scope, and why system design matters more than prompt wording for long-running agents.
Key takeaways
- Harness engineering is system design around what an agent can see, how it judges progress, where it must stop, and how its work is verified.
- The bottleneck shifts with work length: prompt for one response, context for multi-file work, harness for multi-hour work, and platform for team use.
- Most long-running failures come from thin repo context, vague done criteria, weak verification, stale docs, and undefined approval points, not weak models.
- In 2026 the harness became infrastructure for approval, isolation, tool access, audit logs, and provider setup via SDKs, MCP, sandboxes, hooks, and plugins.
- Stronger models simplify harness details but do not remove the need for work environments, evaluation criteria, approval structures, and operating loops.
Harness engineering is less about telling an agent what to do and more about designing what it can see, how it judges progress, where it must stop, and how its work is verified.
The Bottleneck Moves as Work Gets Longer
| Work type | Main bottleneck | Needed system |
|---|---|---|
| One response | Expression | Prompt |
| Several files | Information exposure | Context |
| Multi-hour work | Planning, verification, approval | Harness |
| Repeated team use | Distribution, standards, operations | Platform |
Why Prompting Is Not Enough
Prompt quality matters a lot for short work. But as tasks grow and repositories become more complex, failures increasingly come from elsewhere:
- The repository does not contain enough current context.
- The definition of done is vague, so the agent declares victory early.
- Verification is weak, so the output only looks correct.
- Old docs and old rules pollute context.
- Approval points are not defined, so risky and safe work are treated alike.
The issue is often not that the model is weak. The issue is that the workbench around the model is poorly designed.
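One way to keep an agent from declaring victory early is to write the definition of done as explicit, checkable criteria. The following is a minimal sketch under stated assumptions: the `Criterion` structure, the criterion names, and the evidence keys are all hypothetical and do not come from any particular harness.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Criterion:
    """One explicit, checkable condition for 'done' (hypothetical structure)."""
    name: str
    check: Callable[[dict], bool]


def unmet_criteria(criteria: list[Criterion], evidence: dict) -> list[str]:
    """Return the names of criteria that the collected evidence does not satisfy."""
    return [c.name for c in criteria if not c.check(evidence)]


# Hypothetical criteria; a real harness would derive these from the task plan.
CRITERIA = [
    Criterion("tests pass", lambda e: e.get("tests_failed") == 0),
    Criterion("lint clean", lambda e: e.get("lint_errors") == 0),
    Criterion("docs updated", lambda e: e.get("docs_touched", False)),
]

# Made-up evidence, as a verification step might report it.
evidence = {"tests_failed": 0, "lint_errors": 2, "docs_touched": True}
print(unmet_criteria(CRITERIA, evidence))  # ['lint clean']
```

Missing evidence counts as unmet in this sketch, so the task is only done when every check has actually been run and has passed.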
What Changed
Recent coding agents can run longer, use browsers and shells, inspect logs, and interact with deployment systems. That makes the question less about "how do I get one good answer?" and more about "how do I keep a multi-hour task aligned and auditable?"
OpenAI's 2026-02-11 harness engineering article emphasized agent-readable repositories, short AGENTS.md files,
structured docs/, browser and observability access, and documentation gardening. Anthropic's 2026-03-24 harness
article focused on when planner and evaluator scaffolding is load-bearing and when stronger models let teams remove
scaffolding they no longer need.
Anthropic's 2026-03-25 auto mode article and 2026-04-08 Managed Agents article extend that view. Permission prompts become a policy and classification problem, not just a UI prompt. Sessions, harnesses, and sandboxes become separate failure, security, and scaling boundaries. The key question is not "how autonomous can the agent be?" It is what can be auto-approved, what needs a classifier, and which credentials must never be reachable from generated code.
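The three-way split described above (auto-approve, send to a classifier, never reachable) can be sketched as a small policy function. This is an illustration only, not Anthropic's implementation: the `Decision` enum, the allowlist, and the credential markers are hypothetical, and string matching like this is not a security boundary on its own.

```python
import shlex
from enum import Enum


class Decision(Enum):
    AUTO_APPROVE = "auto_approve"
    NEEDS_CLASSIFIER = "needs_classifier"
    DENY = "deny"


# Hypothetical policy tables, not taken from any vendor's product.
SAFE_COMMANDS = {("git", "status"), ("git", "diff"), ("ls",), ("pytest",)}
CREDENTIAL_MARKERS = (".env", "id_rsa", ".aws/credentials")
SHELL_METACHARACTERS = set(";|&><`$")


def decide(command: str) -> Decision:
    """Classify a shell command requested by the agent."""
    # Credential paths are checked first so no allowlisted prefix can unlock them.
    if any(marker in command for marker in CREDENTIAL_MARKERS):
        return Decision.DENY
    # Chained or redirected commands are never auto-approved.
    if SHELL_METACHARACTERS & set(command):
        return Decision.NEEDS_CLASSIFIER
    try:
        tokens = tuple(shlex.split(command))
    except ValueError:  # unbalanced quotes: let a slower check decide
        return Decision.NEEDS_CLASSIFIER
    if any(tokens[: len(safe)] == safe for safe in SAFE_COMMANDS):
        return Decision.AUTO_APPROVE
    # Everything else goes to a risk classifier or a human.
    return Decision.NEEDS_CLASSIFIER


for cmd in ["git status", "ls && rm -rf build", "cat .env"]:
    print(cmd, "->", decide(cmd).value)
```

The ordering is the design choice worth noting: the deny rule runs before the allowlist, so widening the allowlist later cannot accidentally expose credentials.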
In April and May 2026, the same ideas also moved into product primitives. OpenAI Agents SDK exposed model-native
harnesses, native sandbox execution, MCP, skills, AGENTS.md, shell, apply_patch, memory, and compaction patterns.
Codex added remote connections, hooks, programmatic access tokens, Secure MCP Tunnel, and the OpenAI Developers plugin.
The harness is becoming infrastructure for approval, isolation, tool access, audit logs, and provider setup.
Core Observation
Stronger models can simplify the details of a harness. They do not remove the need for work environments, evaluation criteria, approval structures, and operating loops.
2026 Source Questions
| Date | Source | Question this book uses |
|---|---|---|
| 2026-02-11 | OpenAI Harness Engineering | How should repositories and docs become agent-readable? |
| 2026-02-26 | Toss harness article | How do personal habits become a team execution system? |
| 2026-03-24 | Anthropic harness article | How do we know which scaffolding is load-bearing? |
| 2026-03-25 | Anthropic Claude Code auto mode | How do we reduce approval fatigue while classifying risky behavior? |
| 2026-04-08 | Anthropic Managed Agents | How should session, harness, and sandbox be separated? |
| 2026-04-15 | OpenAI Agents SDK update | How should model-native harnesses and sandbox execution be treated as standard primitives? |
| 2026-05-05 | Anthropic financial agents | How do domain templates package skills, connectors, subagents, and approval flows? |
| 2026-05-06 | OpenAI API changelog | How do TypeScript sandbox agents and open-source harnesses widen team options? |
| 2026-05-07 | OpenAI Developers plugin for Codex | Can provider setup and API troubleshooting become a plugin surface? |
| 2026-05-14 | Codex remote / hooks update | How do approvals, redirection, and validation persist across devices and environments? |
| 2026-05-19 | Secure MCP Tunnel | How do private MCP servers connect without public internet exposure? |
| 2026-05-23 read baseline | gstack, revfactory/harness | How do opinionated workflows and generated team architectures transfer to a local team? |
What a Harness Produces
A good harness leaves concrete artifacts in the repository and runtime.
- A short and current AGENTS.md or equivalent entry document.
- Version-controlled architecture and domain rules.
- Plans, release gates, test routines, and QA checklists.
- Slash commands, skills, agent definitions, and plugins.
- Workflows showing who approves what and when.
- Sandbox, MCP, hooks, remote connections, and permission classifier policy.
- Provider plugins, domain templates, and connectors for repeated setup.
- A maintenance loop that removes stale rules.
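The maintenance loop in the last item can start as a script that flags documents nobody has touched recently. The sketch below assumes file modification time is a good enough staleness signal and that docs are Markdown files. Both are simplifications, and the 90-day threshold is an arbitrary illustrative value; a real loop might read version-control history instead.

```python
import os
import tempfile
import time
from pathlib import Path
from typing import Optional

SECONDS_PER_DAY = 86_400


def stale_docs(root: Path, max_age_days: float, now: Optional[float] = None) -> list[Path]:
    """Return Markdown files under root whose mtime is older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    return sorted(p for p in root.rglob("*.md") if p.stat().st_mtime < cutoff)


# Demo on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    docs = Path(tmp)
    (docs / "fresh.md").write_text("updated today")
    (docs / "old.md").write_text("written long ago")
    a_year_ago = time.time() - 365 * SECONDS_PER_DAY
    os.utime(docs / "old.md", (a_year_ago, a_year_ago))  # backdate the file
    print([p.name for p in stale_docs(docs, max_age_days=90)])  # ['old.md']
```

The output is a review list rather than an automatic deletion: a person or a follow-up agent task still decides whether each flagged rule is obsolete or merely stable.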
Symptoms of a Team Without a Harness
| Symptom | Missing piece |
|---|---|
| Outcomes vary widely across people | Shared entry docs, criteria, workflows |
| Reviews catch the same issues late | Evaluator, QA, browser loop |
| Many docs, little usage | Short table-of-contents doc, executable single source of truth (SSOT) |
| New models cause quality swings | Evaluation baselines and rollback criteria |
| Only the expert gets good results | Team commands, skills, and templates |
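The "evaluation baselines and rollback criteria" row can be made concrete with a small comparison. The sketch below assumes each model is scored on the same named metrics, where higher is better. The metric names, the scores, and the 0.02 tolerance are made up for illustration.

```python
def regressed_metrics(
    baseline: dict[str, float],
    candidate: dict[str, float],
    max_drop: float = 0.02,
) -> list[str]:
    """Return metrics where the candidate falls below baseline by more than max_drop.

    A metric missing from the candidate run counts as a regression, so an
    incomplete evaluation cannot pass by omission.
    """
    regressions = []
    for metric, base_score in baseline.items():
        score = candidate.get(metric)
        if score is None or base_score - score > max_drop:
            regressions.append(metric)
    return regressions


# Made-up scores for the current model and a newly released one.
baseline = {"task_pass_rate": 0.82, "review_acceptance": 0.74}
candidate = {"task_pass_rate": 0.85, "review_acceptance": 0.69}

failed = regressed_metrics(baseline, candidate)
print("roll back" if failed else "promote", failed)  # roll back ['review_acceptance']
```

With a fixed baseline and an explicit tolerance, adopting a new model becomes a repeatable decision instead of an impression formed from a few sessions.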
Antipatterns
- Putting every rule into one giant file.
- Relying on implicit "you know what I mean" context.
- Managing docs and workflows separately.
- Generating more output without evaluation criteria.
- Copying an external harness without domain rules.
Sentence to Remember
Harness engineering is not a secret prompt technique. It is the system design around the places where models are most likely to fail.
Harness Engineering
A practical guide to harness design, evaluation, and operations based on OpenAI, Anthropic, Toss, gstack, revfactory, Agents SDK, and Managed Agents patterns