Foundations of Harness Engineering

Define harness engineering, its scope, and why system design matters more than prompt wording for long-running agents.

Key takeaways

Harness engineering is system design around what an agent can see, how it judges progress, where it must stop, and how its work is verified.
The bottleneck shifts with work length: prompt for one response, context for multi-file work, harness for multi-hour work, and platform for team use.
Most long-running failures come from thin repo context, vague done criteria, weak verification, stale docs, and undefined approval points, not weak models.
In 2026 the harness became infrastructure for approval, isolation, tool access, audit logs, and provider setup via SDKs, MCP, sandboxes, hooks, and plugins.
Stronger models simplify harness details but do not remove the need for work environments, evaluation criteria, approval structures, and operating loops.

Harness engineering is less about telling an agent what to do and more about designing what it can see, how it judges progress, where it must stop, and how its work is verified.

The Bottleneck Moves as Work Gets Longer

Work type	Main bottleneck	Needed system
One response	Expression	Prompt
Several files	Information exposure	Context
Multi-hour work	Planning, verification, approval	Harness
Repeated team use	Distribution, standards, operations	Platform

Why Prompting Is Not Enough

Prompt quality can matter a lot for short work. But as tasks grow and repositories become more complex, failures come from other places:

The repository does not contain enough current context.
The definition of done is vague, so the agent declares victory early.
Verification is weak, so the output only looks correct.
Old docs and old rules pollute context.
Approval points are not defined, so risky and safe work are treated alike.

The issue is often not that the model is weak. The issue is that the workbench around the model is poorly designed.
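One way to keep an agent from declaring victory early is to state the definition of done as named checks that are run rather than asserted. The following is a minimal sketch under stated assumptions: `DoneCheck`, `evaluate_done`, and the two example checks are hypothetical names for illustration and do not come from any of the harnesses discussed here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class DoneCheck:
    """One verifiable condition; `run` returns True when the condition holds."""
    name: str
    run: Callable[[], bool]


def evaluate_done(checks: list[DoneCheck]) -> tuple[bool, list[str]]:
    """Run every check and return (all_passed, names_of_failed_checks)."""
    failed = [check.name for check in checks if not check.run()]
    return (not failed, failed)


# Hypothetical checks. In a real harness each callable would invoke a test
# suite, a linter, or a docs freshness check instead of returning a constant.
checks = [
    DoneCheck("unit tests pass", lambda: True),
    DoneCheck("docs updated", lambda: False),
]

done, failed = evaluate_done(checks)
print(done, failed)  # False ['docs updated']
```

The design choice is that the agent reports the list of failed checks instead of a free-form "done" message, so an incomplete task stays visibly incomplete.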

What Changed

Recent coding agents can run longer, use browsers and shells, inspect logs, and interact with deployment systems. That makes the question less about "how do I get one good answer?" and more about "how do I keep a multi-hour task aligned and auditable?"

OpenAI's 2026-02-11 harness engineering article emphasized agent-readable repositories, short AGENTS.md files, structured docs/, browser and observability access, and documentation gardening. Anthropic's 2026-03-24 harness article emphasized when planner and evaluator scaffolding is load-bearing and when stronger models let teams remove unneeded scaffolding.

Anthropic's 2026-03-25 auto mode article and 2026-04-08 Managed Agents article extend that view. Permission prompts become a policy and classification problem, not just a UI prompt. Sessions, harnesses, and sandboxes become separate failure, security, and scaling boundaries. The key question is not "how autonomous can the agent be?" It is what can be auto-approved, what needs a classifier, and which credentials must never be reachable from generated code.
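The three-way split described above (auto-approve, send to a classifier, never reachable) can be sketched as a small policy function. This is an illustrative assumption, not the mechanism used by any cited product: `Decision`, `decide`, the command allowlist, and the environment variable names are all hypothetical.

```python
from enum import Enum


class Decision(Enum):
    AUTO_APPROVE = "auto_approve"
    NEEDS_CLASSIFIER = "needs_classifier"
    DENY = "deny"


# Hypothetical policy tables; a real team would derive these from its own risk review.
READ_ONLY_COMMANDS = {"ls", "cat", "git status", "git diff"}
FORBIDDEN_ENV_VARS = {"AWS_SECRET_ACCESS_KEY", "PROD_DB_PASSWORD"}


def decide(command: str, env_vars_requested: set[str]) -> Decision:
    """Classify one proposed action into an approval tier."""
    # Credentials on the forbidden list are never reachable, whatever the command.
    if env_vars_requested & FORBIDDEN_ENV_VARS:
        return Decision.DENY
    # Only exact matches against the allowlist run without a prompt.
    if command.strip() in READ_ONLY_COMMANDS:
        return Decision.AUTO_APPROVE
    # Everything else is routed to a classifier or a human reviewer.
    return Decision.NEEDS_CLASSIFIER


print(decide("git status", set()))                    # Decision.AUTO_APPROVE
print(decide("cat", {"AWS_SECRET_ACCESS_KEY"}))       # Decision.DENY
print(decide("rm -rf build", set()))                  # Decision.NEEDS_CLASSIFIER
```

The credential check runs first on purpose: an otherwise safe command must still be refused if it asks for a secret that generated code should never see. The exact-match allowlist is deliberately strict, so unknown commands default to review rather than approval.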

In April and May 2026, the same ideas also moved into product primitives. OpenAI Agents SDK exposed model-native harnesses, native sandbox execution, MCP, skills, AGENTS.md, shell, apply_patch, memory, and compaction patterns. Codex added remote connections, hooks, programmatic access tokens, Secure MCP Tunnel, and the OpenAI Developers plugin. The harness is becoming infrastructure for approval, isolation, tool access, audit logs, and provider setup.

Core Observation

Stronger models can simplify the details of a harness. They do not remove the need for work environments, evaluation criteria, approval structures, and operating loops.

2026 Source Questions

Date	Source	Question this book uses
2026-02-11	OpenAI `Harness Engineering`	How should repositories and docs become agent-readable?
2026-02-26	Toss harness article	How do personal habits become a team execution system?
2026-03-24	Anthropic harness article	How do we know which scaffolding is load-bearing?
2026-03-25	Anthropic Claude Code auto mode	How do we reduce approval fatigue while classifying risky behavior?
2026-04-08	Anthropic Managed Agents	How should session, harness, and sandbox be separated?
2026-04-15	OpenAI Agents SDK update	How should model-native harnesses and sandbox execution be treated as standard primitives?
2026-05-05	Anthropic financial agents	How do domain templates package skills, connectors, subagents, and approval flows?
2026-05-06	OpenAI API changelog	How do TypeScript sandbox agents and open-source harnesses widen team options?
2026-05-07	OpenAI Developers plugin for Codex	Can provider setup and API troubleshooting become a plugin surface?
2026-05-14	Codex remote / hooks update	How do approvals, redirection, and validation persist across devices and environments?
2026-05-19	Secure MCP Tunnel	How do private MCP servers connect without public internet exposure?
2026-05-23 read baseline	`gstack`, `revfactory/harness`	How do opinionated workflows and generated team architectures transfer to a local team?

What a Harness Produces

A good harness leaves concrete artifacts in the repository and runtime.

A short and current AGENTS.md or equivalent entry document.
Version-controlled architecture and domain rules.
Plans, release gates, test routines, and QA checklists.
Slash commands, skills, agent definitions, and plugins.
Workflows showing who approves what and when.
Sandbox, MCP, hooks, remote connections, and permission classifier policy.
Provider plugins, domain templates, and connectors for repeated setup.
A maintenance loop that removes stale rules.
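The maintenance loop in the last item can be as simple as flagging rule documents that nobody has reviewed recently. The sketch below assumes each document carries a last-reviewed date; `find_stale_rules`, the 90-day threshold, and the file paths are hypothetical choices for illustration.

```python
from datetime import date, timedelta


def find_stale_rules(
    last_reviewed: dict[str, date],
    today: date,
    max_age_days: int = 90,
) -> list[str]:
    """Return, in sorted order, the paths whose last review is older than the cutoff."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(path for path, reviewed in last_reviewed.items() if reviewed < cutoff)


# Hypothetical review dates; a real loop might read them from front matter or git history.
reviews = {
    "AGENTS.md": date(2026, 5, 1),
    "docs/architecture.md": date(2026, 3, 10),
    "docs/legacy-rules.md": date(2025, 11, 1),
}

print(find_stale_rules(reviews, today=date(2026, 5, 23)))  # ['docs/legacy-rules.md']
```

Running a check like this on a schedule gives the team a short list of documents to re-confirm or delete, so that old rules do not keep polluting the agent's context.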

Symptoms of a Team Without a Harness

Symptom	Missing piece
Outcomes vary widely across people	Shared entry docs, criteria, workflows
Reviews catch the same issues late	Evaluator, QA, browser loop
Many docs, little usage	Short table-of-contents doc, executable single source of truth (SSOT)
New models cause quality swings	Evaluation baselines and rollback criteria
Only the expert gets good results	Team commands, skills, and templates
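For the "new models cause quality swings" row, an evaluation baseline plus a rollback criterion can be expressed in a few lines. This is a sketch under assumptions: `pass_rate`, `should_roll_back`, the 0.02 regression tolerance, and the sample results are hypothetical, and a real evaluation suite would be larger and task-specific.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of evaluation tasks that passed."""
    if not results:
        raise ValueError("evaluation produced no results")
    return sum(results) / len(results)


def should_roll_back(
    baseline: list[bool],
    candidate: list[bool],
    max_regression: float = 0.02,
) -> bool:
    """True when the candidate's pass rate drops below baseline by more than the tolerance."""
    return pass_rate(baseline) - pass_rate(candidate) > max_regression


# Hypothetical results from the same ten evaluation tasks on two model versions.
baseline_results = [True] * 9 + [False]       # 9 of 10 tasks pass
candidate_results = [True] * 7 + [False] * 3  # 7 of 10 tasks pass

print(should_roll_back(baseline_results, candidate_results))  # True
```

Fixing the tolerance before a model upgrade turns "quality feels worse" into a decision the team can make the same way every time.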

Antipatterns

Putting every rule into one giant file.
Relying on implicit "you know what I mean" context.
Managing docs and workflows separately.
Generating more output without evaluation criteria.
Copying an external harness without domain rules.

Sentence to Remember

Harness engineering is not a secret prompt technique. It is the system design around the places where models are most likely to fail.

