Engineering Mechanics
Explain why harnesses are engineering systems across input, state, tools, evaluation, approval, sandboxing, classifiers, and cleanup.
Key takeaways
- A harness is engineering because it handles eight concerns: input, externalized state, tool permissions, evaluation, human handoff, cleanup, harness/compute separation, and auto-approval policy.
- It changes failure modes, reducing outcome variance, bounding blast radius, and making work reproducible from files and loops rather than personal taste.
- A task contract compresses the starting input shared by planner, builder, and evaluator, and external state files enable handoff and post-failure reconstruction.
- Approval is a control system: an allowlist, project-local edits, classifier gates, trust boundaries, deny-and-continue, and hard human gates.
- Managed Agents separates session (append-only log), harness, and sandbox, with credentials behind a vault or MCP proxy, not inside generated code.
Harness engineering can sound abstract because "create a better environment" hides the actual system design work.
As engineering, a harness handles eight things.
- What input starts the task.
- How work state is externalized.
- Which tools are opened under which permission boundary.
- Where quality is judged.
- When the system hands work to a human.
- How decayed rules get removed.
- How the harness is separated from execution compute.
- How auto-approval and auto-denial are delegated to policy or classifiers.
Why It Counts as Engineering
A harness changes the system's failure modes.
| What it changes | Why this is engineering |
|---|---|
| Variance | Same model, less outcome spread |
| Blast radius | Bad execution is bounded |
| Observability | Failures are traceable through browser, logs, and tests |
| Reproducibility | Work depends on files, commands, docs, and loops, not personal taste |
| Operating cost | Fewer repeated failures, missed reviews, and stale docs |
Core View
A harness does not ask the model to be smarter. It designs the system so model mistakes are less damaging.
1. Design the Input Layer
Good harnesses start differently.
OpenAI emphasizes AGENTS.md and structured docs/; Anthropic uses a sprint contract.
Both compress the starting input.
task_contract:
  goal: "What must be finished"
  non_goals:
    - "What this turn must not do"
  constraints:
    - "Rules that must not be broken"
  files_to_read:
    - "AGENTS.md"
    - "docs/architecture.md"
  validation:
    - "lint"
    - "browser QA"
  escalation:
    - "Ask for human approval if schema changes appear"

This is contract data shared by planner, builder, and evaluator.
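Because the contract is data, a harness can refuse to start a task whose contract is incomplete. The following is a minimal sketch, assuming the contract has already been parsed into a Python dict; the function name `missing_contract_fields` and the rule that every field must be non-empty are illustrative assumptions, not part of either vendor's tooling.

```python
# Keys taken from the task_contract example; treating all of them as
# required is an assumption made for this sketch.
REQUIRED_KEYS = (
    "goal", "non_goals", "constraints",
    "files_to_read", "validation", "escalation",
)


def missing_contract_fields(contract: dict) -> list[str]:
    """Return the required keys that are absent or empty in the contract."""
    return [key for key in REQUIRED_KEYS if not contract.get(key)]


contract = {
    "goal": "What must be finished",
    "non_goals": ["What this turn must not do"],
    "constraints": ["Rules that must not be broken"],
    "files_to_read": ["AGENTS.md", "docs/architecture.md"],
    "validation": ["lint", "browser QA"],
    "escalation": ["Ask for human approval if schema changes appear"],
}

# A contract that names a goal but nothing else should not start a task.
incomplete = {"goal": "Fix login bug", "validation": []}
```

Running the same check before the planner, builder, and evaluator each start is one way to ensure all three work from the same input.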
2. Externalize State
Long agent work needs external state more than memory.
External state allows handoff, evaluation against artifacts, and post-failure reconstruction.
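As a rough illustration of handoff through files, the sketch below keeps progress in a JSON file that a fresh session can reload. The file name `progress.json` and the fields `done`, `next`, and `evidence` are hypothetical choices for this example, not a prescribed format.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical state shape used when no state file exists yet.
EMPTY_STATE = {"done": [], "next": None, "evidence": []}


def save_state(path: Path, state: dict) -> None:
    """Write to a temporary file first, then rename it over the target."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state, indent=2))
    tmp.replace(path)


def load_state(path: Path) -> dict:
    """Load saved state, or start from an empty state if none exists."""
    if path.exists():
        return json.loads(path.read_text())
    return json.loads(json.dumps(EMPTY_STATE))  # fresh copy of the defaults


with tempfile.TemporaryDirectory() as workdir:
    state_file = Path(workdir) / "progress.json"
    state = load_state(state_file)
    state["done"].append("add login form")
    state["next"] = "wire up validation"
    state["evidence"].append("tests/test_login.py passed")
    save_state(state_file, state)

    # A new session, with no memory of the first, reads the same file.
    resumed = load_state(state_file)
```

The resumed session knows what was finished, what comes next, and which artifact supports the claim, without relying on the previous context window.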
3. Tool Access Changes Quality
OpenAI treats browser, logs, and metrics as part of the harness because text-only agents can inspect intent but not runtime behavior.
| Tool | Without it | With it |
|---|---|---|
| Browser | UI breakage is missed | Interaction and state transitions are verified |
| Logs | Root cause is guessed | Failure is reconstructed by time and event |
| Tests | Small edits cause regressions | Repeated failures are caught |
| Metrics | Production quality decay is noticed late | Results connect to operating data |
4. Separate Evaluation Loops
Anthropic's main lesson is not "use many roles." It is "separate bias": the agent that builds should not be the one that judges the result.
Evaluation becomes engineering when:
- builder and evaluator have different inputs;
- the evaluator asks what was proven, not what was built;
- QA checks actual execution instead of code explanation.
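The three conditions above can be sketched as an evaluator that receives only the required checks and the recorded results of running them, never the builder's own summary. The `Evidence` type and the `evaluate` function are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    check: str    # name of a validation step, e.g. "lint"
    passed: bool  # result of actually running that step


def evaluate(required_checks: list[str], evidence: list[Evidence]) -> dict:
    """Accept only if every required check has passing evidence."""
    proven = {item.check for item in evidence if item.passed}
    unproven = [check for check in required_checks if check not in proven]
    return {"accepted": not unproven, "unproven": unproven}


# The builder's explanation exists, but it is deliberately not an input.
builder_summary = "Implemented the form and everything works."

verdict = evaluate(["lint", "browser QA"], [Evidence("lint", True)])
```

Here the work is rejected because browser QA was never run, regardless of how confident the builder's summary sounds.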
5. Design Approval Boundaries
A harness is also a control system.
approval_policy:
  auto:
    - "docs edits"
    - "local refactors"
  review_required:
    - "user-facing UI changes"
    - "test changes"
  human_gate:
    - "production deploy"
    - "database migration"
    - "permission, billing, or policy changes"

This defines the blast radius the system can tolerate.
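One way to make such a policy executable is a plain lookup from change kind to tier. The sketch below mirrors the policy above as a Python dict; sending unknown change kinds to the strictest tier is an assumption of this example, chosen so that gaps in the policy fail closed.

```python
# Mirrors the approval_policy example; the tier names come from that policy.
APPROVAL_POLICY = {
    "auto": ["docs edits", "local refactors"],
    "review_required": ["user-facing UI changes", "test changes"],
    "human_gate": [
        "production deploy",
        "database migration",
        "permission, billing, or policy changes",
    ],
}


def approval_tier(change_kind: str) -> str:
    """Return the tier for a change kind.

    Assumption for this sketch: anything the policy does not list is
    treated as requiring a human gate.
    """
    for tier, kinds in APPROVAL_POLICY.items():
        if change_kind in kinds:
            return tier
    return "human_gate"
```

For example, `approval_tier("docs edits")` returns `"auto"`, while a change kind the policy has never seen is routed to a human.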
6. Treat Auto Approval as Policy
More permission prompts do not automatically make a system safer. Anthropic's auto mode work shows that approval fatigue turns safety into a classifier and policy design problem.
| Layer | Design question |
|---|---|
| Safe allowlist | Which actions are almost always safe, such as read-only exploration? |
| Project-local edits | Which repo-local edits are safe because they are version controlled? |
| Classifier gate | Which shell, web, external-tool, subagent, or out-of-repo actions require classification? |
| Trust boundary | Which GitHub orgs, buckets, APIs, or domains count as internal infrastructure? |
| Deny and continue | After denial, should the agent stop, find a safer route, or escalate? |
| Hard human gate | Which production, destructive, or security changes must never be auto-classified? |
The classifier reduces approval fatigue. It does not remove human review for high-risk work.
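The layers in the table can be read as an ordered decision procedure. The following is a simplified sketch under stated assumptions: the action kinds, the trusted host name, and `toy_classifier` are all hypothetical stand-ins, and a real classifier would be a model or policy engine rather than a string check.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Action:
    kind: str              # e.g. "read", "edit", "shell", "network", "deploy"
    target: str            # path, host, or command
    in_repo: bool = False  # True for version-controlled, repo-local targets


# Hypothetical sets for illustration only.
HARD_GATE_KINDS = {"deploy", "delete_data", "change_permissions"}
SAFE_KINDS = {"read", "search"}
TRUSTED_HOSTS = {"git.internal.example"}


def decide(action: Action, classifier: Callable[[Action], bool]) -> str:
    """Return 'allow', 'deny_and_continue', or 'ask_human'."""
    if action.kind in HARD_GATE_KINDS:
        return "ask_human"  # hard human gate: never auto-classified
    if action.kind in SAFE_KINDS:
        return "allow"      # safe allowlist
    if action.kind == "edit" and action.in_repo:
        return "allow"      # project-local edit under version control
    if action.kind == "network" and action.target in TRUSTED_HOSTS:
        return "allow"      # inside the trust boundary
    # Everything else passes through the classifier gate.
    return "allow" if classifier(action) else "deny_and_continue"


def toy_classifier(action: Action) -> bool:
    """Stand-in classifier: rejects commands containing 'rm -rf'."""
    return "rm -rf" not in action.target
```

The hard gate is checked first so that no later rule, including the classifier, can approve a high-risk action. A denial returns `deny_and_continue` so the agent can look for a safer route instead of halting.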
7. Separate Harness and Execution Compute
OpenAI Agents SDK and Anthropic Managed Agents both point toward a control-plane/data-plane split. The harness controls the loop; sandbox compute executes generated code and tool calls.
| Execution piece | Harness question |
|---|---|
| Shell | Which commands are allowed and when is approval required? |
| Filesystem / apply_patch | Which paths can be edited and how is patch scope constrained? |
| Skills | Which knowledge bundles are loaded only when needed? |
| Memory / Compaction | Where does long-task state live and when is it compressed? |
| Manifest / mounted data | How are input data, output paths, and dependencies made predictable? |
| MCP / tunnel | Which internal tools are opened through which network boundary? |
Anthropic Managed Agents describes the same issue as separating session, harness, and sandbox.
The session is an append-only event log, not the model context window. The harness can wake up from the session log,
and external hands are exposed as execute(name, input) -> string. Credentials should live behind a vault or MCP
proxy, not inside generated sandbox code.
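To make the separation concrete, the sketch below models the three pieces in a few lines: an append-only session log, a harness step that records every call, and tools reached only through an `execute(name, input) -> str` interface. Only that signature comes from the description above; the `Session` class, the `TOOLS` registry, and the `step` and `resume` helpers are hypothetical and do not represent the Managed Agents API.

```python
from typing import Callable


class Session:
    """Append-only event log: events are added, never edited or removed."""

    def __init__(self) -> None:
        self._events: list[dict] = []

    def append(self, event: dict) -> None:
        self._events.append(dict(event))  # copy so callers cannot mutate the log

    def events(self) -> tuple[dict, ...]:
        return tuple(self._events)


# Hypothetical tool registry standing in for sandbox "hands".
TOOLS: dict[str, Callable[[str], str]] = {
    "echo": lambda text: text,
    "upper": lambda text: text.upper(),
}


def execute(name: str, input: str) -> str:
    """Run one tool by name and return its result as a string.

    The parameter is named `input` to match the signature in the text.
    No credentials are passed in; in this model they would stay behind
    a vault or proxy outside the sandbox.
    """
    return TOOLS[name](input)


def step(session: Session, name: str, input: str) -> str:
    """Harness side: log the call, run it, then log the result."""
    session.append({"type": "tool_call", "name": name, "input": input})
    result = execute(name, input)
    session.append({"type": "tool_result", "name": name, "output": result})
    return result


def resume(session: Session) -> list[str]:
    """A restarted harness rebuilds what happened from the log alone."""
    return [e["output"] for e in session.events() if e["type"] == "tool_result"]


session = Session()
step(session, "echo", "hello")
step(session, "upper", "hello")
```

Because the log, not the harness process, holds the history, the harness can be restarted and still recover every tool result by replaying the session.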
8. Build in Garbage Collection
Harnesses decay over time.
| Drift | Symptom |
|---|---|
| Doc drift | AGENTS.md no longer matches the codebase |
| Loop bloat | Unused reviewer steps remain |
| Approval bypass | Human gates are ignored under pressure |
| Tool aging | Browser or log scripts break silently |
A harness without cleanup becomes ritual.
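Some drift can be detected mechanically. As one example, the sketch below flags file paths that a guidance document mentions but that no longer exist in the repository. It assumes paths are written in backticks; the function name `stale_references` and the sample file names are hypothetical.

```python
import re
import tempfile
from pathlib import Path

# Assumption: file paths in the doc are wrapped in backticks, e.g. `docs/x.md`.
PATH_PATTERN = re.compile(r"`([\w./-]+\.\w+)`")


def stale_references(doc_text: str, repo_root: Path) -> list[str]:
    """Return backticked file paths in the doc that no longer exist."""
    referenced = PATH_PATTERN.findall(doc_text)
    return [path for path in referenced if not (repo_root / path).exists()]


with tempfile.TemporaryDirectory() as root:
    repo = Path(root)
    (repo / "docs").mkdir()
    (repo / "docs" / "architecture.md").write_text("# Architecture\n")

    # The doc still points at a file that has since been removed.
    agents_md = "Read `docs/architecture.md` first, then `docs/deploy.md`."
    stale = stale_references(agents_md, repo)
```

Run on a schedule or as a PR check, a script like this turns doc drift from a silent symptom into a visible failure.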
Minimum Harness Architecture
| Layer | Artifact | Automation | Failure prevented |
|---|---|---|---|
| Input | AGENTS.md, invariants | file search, task contract | Wrong start, missing rules |
| Execution | plan, diff, command | editor, shell, workflow | Scope drift, unsupported implementation |
| Verification | test, browser QA, logs | runners, browser, observability | "Looks correct" failures |
| Record | QA report, updates | template, PR check | Repeated failure, weak handoff |