Engineering Mechanics

Explain why harnesses are engineering systems across input, state, tools, evaluation, approval, sandboxing, classifiers, and cleanup.

Key takeaways

A harness is engineering because it handles eight concerns: input, externalized state, tool permissions, evaluation, human handoff, cleanup, harness/compute separation, and auto-approval policy.
It changes failure modes, reducing outcome variance, bounding blast radius, and making work reproducible from files and loops rather than personal taste.
A task contract compresses the starting input shared by planner, builder, and evaluator, and external state files enable handoff and post-failure reconstruction.
Approval is a control system: an allowlist, project-local edits, classifier gates, trust boundaries, deny-and-continue, and hard human gates.
Managed Agents separates session (append-only log), harness, and sandbox, with credentials behind a vault or MCP proxy, not inside generated code.

Harness engineering can sound abstract because "create a better environment" hides the actual system design work.

As engineering, a harness handles eight things.

What input starts the task.
How work state is externalized.
Which tools are opened under which permission boundary.
Where quality is judged.
When the system hands work to a human.
How decayed rules get removed.
How the harness is separated from execution compute.
How auto-approval and auto-denial are delegated to policy or classifiers.

Why It Counts as Engineering

A harness changes the system's failure modes.

What it changes	Why this is engineering
Variance	Same model, less outcome spread
Blast radius	Bad execution is bounded
Observability	Failures are traceable through browser, logs, and tests
Reproducibility	Work depends on files, commands, docs, and loops, not personal taste
Operating cost	Fewer repeated failures, missed reviews, and stale docs

Core View

A harness does not ask the model to be smarter. It designs the system so model mistakes are less damaging.

1. Design the Input Layer

Good harnesses start from structured input, not a bare prompt. OpenAI emphasizes AGENTS.md and structured docs/; Anthropic uses a sprint contract. Both compress the starting input.

task_contract:
  goal: "What must be finished"
  non_goals:
    - "What this turn must not do"
  constraints:
    - "Rules that must not be broken"
  files_to_read:
    - "AGENTS.md"
    - "docs/architecture.md"
  validation:
    - "lint"
    - "browser QA"
  escalation:
    - "Ask for human approval if schema changes appear"

This is contract data shared by planner, builder, and evaluator.
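One way to make the contract checkable is to load it into a typed structure before any role starts. The following is a minimal Python sketch, assuming the YAML has already been parsed into a dictionary. `TaskContract`, `load_contract`, and the choice of `goal` and `validation` as required fields are hypothetical; none of this is part of OpenAI or Anthropic tooling.

```python
from dataclasses import dataclass

# Assumption for this sketch: a contract is unusable without these fields.
REQUIRED_FIELDS = ("goal", "validation")


@dataclass(frozen=True)
class TaskContract:
    """Immutable contract so planner, builder, and evaluator read the same data."""

    goal: str
    validation: tuple[str, ...]
    non_goals: tuple[str, ...] = ()
    constraints: tuple[str, ...] = ()
    files_to_read: tuple[str, ...] = ()
    escalation: tuple[str, ...] = ()


def load_contract(raw: dict) -> TaskContract:
    """Build a contract from parsed data, rejecting incomplete input."""
    missing = [name for name in REQUIRED_FIELDS if not raw.get(name)]
    if missing:
        raise ValueError(f"task contract is missing: {', '.join(missing)}")
    return TaskContract(
        goal=raw["goal"],
        validation=tuple(raw["validation"]),
        non_goals=tuple(raw.get("non_goals", ())),
        constraints=tuple(raw.get("constraints", ())),
        files_to_read=tuple(raw.get("files_to_read", ())),
        escalation=tuple(raw.get("escalation", ())),
    )
```

Freezing the dataclass is a design choice: no role can quietly rewrite the goal or drop a validation step mid-task.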

2. Externalize State

Long-running agent work depends on external state more than on in-context memory.

task-contract.yaml
plan.md
review-notes.md
qa-report.md
release-checklist.md
architecture.md
runbooks.md
invariants.md

External state allows handoff, evaluation against artifacts, and post-failure reconstruction.
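A harness can enforce this by refusing to hand off when state files are absent. The sketch below is one possible check in Python. The file names come from the list above, but which of them are mandatory is an assumption made for illustration, and `missing_artifacts` is a hypothetical helper.

```python
from pathlib import Path

# Assumed mandatory per-task artifacts; adjust to the project's own rules.
REQUIRED_ARTIFACTS = ("task-contract.yaml", "plan.md", "qa-report.md")


def missing_artifacts(workdir: Path, required=REQUIRED_ARTIFACTS) -> list[str]:
    """Return required state files that are absent or empty in workdir."""
    missing = []
    for name in required:
        path = workdir / name
        # An empty file carries no state, so treat it the same as a missing one.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```

An empty result means another agent or a human could pick up the task from the files alone.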

3. Tool Access Changes Quality

OpenAI treats browser, logs, and metrics as part of the harness because text-only agents can inspect intent but not runtime behavior.

Tool	Without it	With it
Browser	UI breakage is missed	Interaction and state transitions are verified
Logs	Root cause is guessed	Failure is reconstructed by time and event
Tests	Small edits cause regressions	Repeated failures are caught
Metrics	Production quality decay is noticed late	Results connect to operating data

4. Separate Evaluation Loops

Anthropic's main lesson is not "use many roles." It is to separate bias: the role that builds should not be the role that judges.

Evaluation becomes engineering when:

builder and evaluator have different inputs;
the evaluator asks what was proven, not what was built;
QA checks actual execution instead of code explanation.
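The three conditions above can be sketched as an evaluator that only accepts recorded evidence. This is an illustrative Python sketch, not Anthropic's implementation; the `Evidence` type and `unproven_checks` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    check: str    # name of a validation step, e.g. "lint"
    passed: bool  # result of actually running the check
    source: str   # where the result came from, e.g. a command or report file


def unproven_checks(required: list[str], evidence: list[Evidence]) -> list[str]:
    """Return required checks that have no passing evidence.

    The builder's own summary is deliberately not a parameter: the
    evaluator sees only what was executed and recorded.
    """
    proven = {e.check for e in evidence if e.passed}
    return [check for check in required if check not in proven]
```

For example, with required checks `["lint", "browser QA"]` and passing evidence for `lint` only, the function returns `["browser QA"]`, regardless of what the builder claims.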

5. Design Approval Boundaries

A harness is also a control system.

approval_policy:
  auto:
    - "docs edits"
    - "local refactors"
  review_required:
    - "user-facing UI changes"
    - "test changes"
  human_gate:
    - "production deploy"
    - "database migration"
    - "permission, billing, or policy changes"

This defines the blast radius the system can tolerate.
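A policy like this is only useful if unknown actions fail closed. The Python sketch below shows one way to encode it. The action category keys are hypothetical labels derived from the example policy, and `decide` is an illustrative helper rather than any vendor's API.

```python
from enum import Enum


class Decision(Enum):
    AUTO = "auto"
    REVIEW_REQUIRED = "review_required"
    HUMAN_GATE = "human_gate"


# Hypothetical category keys mirroring the approval_policy example.
APPROVAL_POLICY = {
    "docs_edit": Decision.AUTO,
    "local_refactor": Decision.AUTO,
    "ui_change": Decision.REVIEW_REQUIRED,
    "test_change": Decision.REVIEW_REQUIRED,
    "production_deploy": Decision.HUMAN_GATE,
    "database_migration": Decision.HUMAN_GATE,
    "permission_change": Decision.HUMAN_GATE,
}


def decide(action_category: str) -> Decision:
    """Look up the gate for an action; unlisted categories go to a human."""
    return APPROVAL_POLICY.get(action_category, Decision.HUMAN_GATE)
```

Defaulting to `HUMAN_GATE` means a new kind of action has to be added to the policy on purpose before it can run unattended.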

6. Treat Auto Approval as Policy

More permission prompts do not automatically make a system safer. Anthropic's auto mode work shows that approval fatigue turns safety into a classifier and policy design problem.

Layer	Design question
Safe allowlist	Which actions are almost always safe, such as read-only exploration?
Project-local edits	Which repo-local edits are safe because they are version controlled?
Classifier gate	Which shell, web, external-tool, subagent, or out-of-repo actions require classification?
Trust boundary	Which GitHub orgs, buckets, APIs, or domains count as internal infrastructure?
Deny and continue	After denial, should the agent stop, find a safer route, or escalate?
Hard human gate	Which production, destructive, or security changes must never be auto-classified?

The classifier reduces approval fatigue. It does not remove human review for high-risk work.
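The layers in the table can be read as an ordered routing function. The following Python sketch is an assumption-heavy illustration: the `Action` fields, the `SAFE_KINDS` and `HARD_GATE_KINDS` sets, and the `route` function are all hypothetical, and the classifier is passed in as a plain callable rather than a real model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Action:
    kind: str                 # e.g. "read", "edit", "shell", "deploy"
    in_repo: bool             # does the action stay inside the project checkout?
    destructive: bool = False


# Assumed categories for this sketch, not a list published by Anthropic.
SAFE_KINDS = {"read", "search"}
HARD_GATE_KINDS = {"deploy", "migration"}


def route(action: Action, classifier: Callable[[Action], bool]) -> str:
    """Return "allow", "deny", or "human" by walking the layers in order."""
    # Hard human gate comes first and is never delegated to the classifier.
    if action.kind in HARD_GATE_KINDS or action.destructive:
        return "human"
    # Safe allowlist: read-only exploration.
    if action.kind in SAFE_KINDS:
        return "allow"
    # Project-local edits: recoverable through version control.
    if action.kind == "edit" and action.in_repo:
        return "allow"
    # Classifier gate: everything else is judged case by case.
    return "allow" if classifier(action) else "deny"
```

A `"deny"` result is returned to the agent rather than ending the run, which leaves room for the deny-and-continue behavior: the agent can look for a safer route or escalate.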

7. Separate Harness and Execution Compute

OpenAI Agents SDK and Anthropic Managed Agents both point toward a control-plane/data-plane split. The harness controls the loop; sandbox compute executes generated code and tool calls.

Execution piece	Harness question
Shell	Which commands are allowed and when is approval required?
Filesystem / `apply_patch`	Which paths can be edited and how is patch scope constrained?
Skills	Which knowledge bundles are loaded only when needed?
Memory / Compaction	Where does long-task state live and when is it compressed?
Manifest / mounted data	How are input data, output paths, and dependencies made predictable?
MCP / tunnel	Which internal tools are opened through which network boundary?

Anthropic Managed Agents describes the same issue as separating session, harness, and sandbox. The session is an append-only event log, not the model context window. The harness can wake up from the session log, and external hands are exposed as execute(name, input) -> string. Credentials should live behind a vault or MCP proxy, not inside generated sandbox code.
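The session and tool interface described above can be approximated in a few lines. This Python sketch is not the Managed Agents implementation: the JSON-lines file format, the `Session` class, and the extra `tools` and `session` parameters on `execute` are assumptions made so the example is self-contained.

```python
import json
from pathlib import Path
from typing import Callable

# Hypothetical tool type: each "hand" takes a dict and returns a string.
Tool = Callable[[dict], str]


class Session:
    """Append-only event log stored as JSON lines on disk."""

    def __init__(self, path: Path):
        self.path = path

    def append(self, event: dict) -> None:
        # Opening in "a" mode only ever adds to the end of the log.
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")

    def events(self) -> list[dict]:
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]


def execute(tools: dict[str, Tool], session: Session, name: str, tool_input: dict) -> str:
    """Dispatch one tool call and record both the call and its result."""
    session.append({"type": "tool_call", "name": name, "input": tool_input})
    if name in tools:
        result = tools[name](tool_input)
    else:
        result = f"error: unknown tool {name}"
    session.append({"type": "tool_result", "name": name, "output": result})
    return result
```

Because the log lives outside the harness process, a fresh harness can construct a new `Session` on the same path and replay `events()` to resume. Note that no credential appears in the events or in the tool signature; in this sketch a tool would obtain secrets from its own side of the boundary.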

8. Build in Garbage Collection

Harnesses decay over time.

Drift	Symptom
Doc drift	`AGENTS.md` no longer matches the codebase
Loop bloat	Unused reviewer steps remain
Approval bypass	Human gates are ignored under pressure
Tool aging	Browser or log scripts break silently

A harness without cleanup becomes ritual.
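Some drift can be detected mechanically. As one example, the Python sketch below flags file paths mentioned in a document such as `AGENTS.md` that no longer exist in the repository. It rests on a loud assumption: paths are written in backticks and contain either a slash or a file extension. `stale_references` is a hypothetical helper, and the heuristic will misfire on tokens like version numbers.

```python
import re
from pathlib import Path

# Assumption: referenced paths appear as single backticked tokens.
BACKTICKED = re.compile(r"`([^`\s]+)`")


def looks_like_path(token: str) -> bool:
    """Heuristic: a slash or a file extension suggests a path."""
    return "/" in token or Path(token).suffix != ""


def stale_references(doc: Path, repo_root: Path) -> list[str]:
    """Return backticked paths in doc that do not exist under repo_root."""
    text = doc.read_text(encoding="utf-8")
    referenced = sorted({t for t in BACKTICKED.findall(text) if looks_like_path(t)})
    return [p for p in referenced if not (repo_root / p).exists()]
```

Run on a schedule or in a PR check, a non-empty result is a prompt to update or delete the rule that mentions the missing file.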

Minimum Harness Architecture

Layer	Artifact	Automation	Failure prevented
Input	`AGENTS.md`, invariants	file search, task contract	Wrong start, missing rules
Execution	plan, diff, command	editor, shell, workflow	Scope drift, unsupported implementation
Verification	test, browser QA, logs	runners, browser, observability	"Looks correct" failures
Record	QA report, updates	template, PR check	Repeated failure, weak handoff
