The Five Elements of a Harness

This page explains the five design axes behind most practical harnesses: environment, roles, criteria, loops, and maintenance.

Key takeaways

Most harnesses reduce to five design axes: environment, roles, criteria, loops, and maintenance.
Environment is about making necessary material easy to find (short entry doc, versioned rules, verification tools), and roles matter less in count than in their contracts.
Criteria should favor machine-checkable gates over abstract quality language, and loops define where the system returns after failure and when humans intervene.
Maintenance treats decay as inevitable, needing update logs, stale-doc checks, and cleanup of unused skills and broken commands.
The best starting point is not to maximize all five but to remove one major failure mode from each.

Harness designs differ in vocabulary, but most of them reduce to the same five design axes.

1. Environment

Agents reason most reliably inside repositories and connected tools. The goal is not to show everything. The goal is to make the necessary material easy to find.

Good environments have:

a short entry document that points to deeper docs;
versioned architecture and domain rules;
access to verification tools such as browser, logs, metrics, and tests;
team rules inside the repo, not only in chat or meetings.

Key questions:

Where should the agent start?
Which document is current truth?
Which tool verifies the result?
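
One way to make these three questions concrete is to keep the answers in a small, checkable manifest. The sketch below is illustrative only: `EnvironmentManifest`, `missing_paths`, and the file paths in the usage note are hypothetical names, not part of any particular harness.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class EnvironmentManifest:
    """Hypothetical record of the answers to the three key questions."""
    entry_doc: str                     # where the agent should start
    sources_of_truth: dict[str, str]   # topic -> document that is current truth
    verifiers: dict[str, str]          # kind of result -> tool that verifies it


def missing_paths(manifest: EnvironmentManifest, repo_root: Path) -> list[str]:
    """Return the documents named in the manifest that do not exist in the repo."""
    paths = [manifest.entry_doc, *manifest.sources_of_truth.values()]
    return [p for p in paths if not (repo_root / p).is_file()]
```

A manifest such as `EnvironmentManifest("AGENTS.md", {"architecture": "docs/architecture.md"}, {"ui": "browser"})` can then be checked in CI, so a renamed or deleted document is noticed before an agent is pointed at it.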

2. Roles

Long tasks degrade when one agent plans, builds, reviews, and QA-checks everything. Harnesses often separate roles.

Role	Responsibility
Planner	Define scope, decomposition, and done conditions
Builder / Generator	Implement
Reviewer / Evaluator	Check requirements, quality, and omissions
QA / Browser Agent	Verify real UI and behavior
Release / Ops	Tests, deployment, rollback, monitoring

The number of roles matters less than the contract between them.
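
A contract between roles can be as small as a typed handoff that is validated before the next role starts. This is a minimal sketch under assumed names: `Handoff`, `validate_handoff`, and the rule that a builder must attach evidence are illustrative choices, not a standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Handoff:
    """What one role passes to the next (hypothetical shape)."""
    from_role: str
    to_role: str
    scope: str
    done_conditions: tuple[str, ...]
    evidence: tuple[str, ...] = ()   # e.g. test output or screenshots, by reference


def validate_handoff(handoff: Handoff) -> list[str]:
    """Return contract violations; an empty list means the handoff is acceptable."""
    problems = []
    if not handoff.scope.strip():
        problems.append("scope is empty")
    if not handoff.done_conditions:
        problems.append("no done conditions")
    # Assumed rule: whatever the builder hands over must come with evidence.
    if handoff.from_role == "builder" and not handoff.evidence:
        problems.append("builder handoff carries no evidence")
    return problems
```

With a check like this, adding or merging roles changes only who fills in the handoff, not what the next role is entitled to expect.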

3. Criteria

Agents overestimate completion when done criteria are vague. A good harness makes completion explicit.

Required tests, lint, typecheck, and build.
Specific UX scenarios reproduced in browser.
Security, schema, or performance constraints.
Human approval zones.
Docs, release notes, or runbooks updated.

Practical Rule

Increase machine-checkable criteria before adding abstract quality language.
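
As a rough illustration of machine-checkable criteria, each gate can be a command whose exit status decides pass or fail. The names `Gate`, `run_gates`, and `is_done` are assumptions for this sketch; the commands themselves would be the project's own test, lint, typecheck, and build commands.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    name: str
    command: list[str]   # argv list for the project's real check command


def run_gates(gates: list[Gate]) -> dict[str, bool]:
    """Run every gate and record whether it exited with status 0."""
    results = {}
    for gate in gates:
        completed = subprocess.run(gate.command, capture_output=True, text=True)
        results[gate.name] = completed.returncode == 0
    return results


def is_done(results: dict[str, bool]) -> bool:
    """Done means at least one gate ran and every gate passed."""
    return bool(results) and all(results.values())


if __name__ == "__main__":
    # Placeholder gate that always succeeds; replace with real commands.
    demo = [Gate("placeholder", [sys.executable, "-c", "raise SystemExit(0)"])]
    print(is_done(run_gates(demo)))
```

Treating an empty result set as not done is a deliberate choice here: a harness with no gates configured should not report completion by default.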

4. Loops

A harness is not a one-shot generation system. It is a loop: plan -> implement -> evaluate -> fix -> re-verify.

The loop must answer:

Where does the system return after failure?
Who interprets the evaluation result?
When does automation continue, and when does a human intervene?

Anthropic's planner/generator/evaluator framing clarifies this loop. OpenAI's browser/log/metric access shortens it.
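
The loop and its three questions can be sketched as a bounded retry with an explicit escalation point. This is an illustrative sketch, not any vendor's implementation: `run_loop`, `LoopResult`, and the `implement` and `evaluate` callables are hypothetical stand-ins for the builder and evaluator roles.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LoopResult:
    status: str          # "passed" or "needs_human"
    attempts: int
    last_feedback: str


def run_loop(
    implement: Callable[[str], str],
    evaluate: Callable[[str], tuple[bool, str]],
    plan: str,
    max_attempts: int = 3,
) -> LoopResult:
    """plan -> implement -> evaluate -> fix -> re-verify, with a human escape hatch."""
    instructions = plan
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        artifact = implement(instructions)
        passed, feedback = evaluate(artifact)
        if passed:
            return LoopResult("passed", attempt, feedback)
        # After a failure the system returns to implementation,
        # carrying the evaluator's feedback alongside the original plan.
        instructions = f"{plan}\nFix: {feedback}"
    # Automation stops here; a human decides what happens next.
    return LoopResult("needs_human", max_attempts, feedback)
```

In this sketch the failure path returns to implementation rather than to planning, the evaluator's feedback is interpreted by the next implementation attempt, and a human takes over once `max_attempts` is exhausted. A real harness might answer each of those differently.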

5. Maintenance

Harnesses decay.

Old docs remain.
Unused commands and rules pile up.
New domain requirements never make it into the rules or docs.
Quality gates no longer match the codebase.

So a harness needs operations:

update logs;
stale-doc checks;
unused skill, broken command, and dead-link cleanup;
approval and test-gate review.
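
Some of these operations are easy to automate. As one example, the sketch below looks for dead relative links in Markdown docs. It assumes the docs use `[text](path)` links and live under a single root; `find_dead_links` is a hypothetical helper, and external URLs are deliberately skipped rather than fetched.

```python
import re
from pathlib import Path

# Matches the target of an inline Markdown link or image: [text](target)
LINK_PATTERN = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def find_dead_links(docs_root: Path) -> list[tuple[str, str]]:
    """Return (document, target) pairs for relative links whose target is missing."""
    dead = []
    for doc in sorted(docs_root.rglob("*.md")):
        for target in LINK_PATTERN.findall(doc.read_text(encoding="utf-8")):
            if target.startswith(("http://", "https://", "mailto:", "#")):
                continue  # external links and in-page anchors are out of scope
            path_part = target.split("#", 1)[0]  # drop any "#section" suffix
            if not (doc.parent / path_part).exists():
                dead.append((doc.relative_to(docs_root).as_posix(), target))
    return dead
```

Run on a schedule or in CI, a check like this turns "old docs remain" from something noticed by accident into a list that someone can work through.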

The Five Elements in One Table

Element	Question	Typical artifacts
Environment	What can the agent see?	AGENTS.md, docs, schemas, tool connections
Roles	Who is responsible for what?	Planner, reviewer, QA definitions, slash commands
Criteria	What must pass?	Test matrix, gates, quality checklist
Loops	How does the system improve?	Review loop, QA loop, HITL flow
Maintenance	How does it stay current?	Updates, doc gardening, cleanup cadence

Source Emphasis

Source	Environment	Roles	Criteria	Loops	Maintenance
OpenAI	Very strong	Medium	Medium	Medium	Very strong
Anthropic	Medium	Very strong	Very strong	Very strong	Medium
Toss	Strong	Medium	Strong	Strong	Strong
gstack	Strong	Very strong	Strong	Very strong	Medium
revfactory/harness	Strong	Strong	Medium	Strong	Medium

Minimum vs Mature Harness

Stage	Characteristics
Minimum	Short entry doc, a few required checks, basic approval policy
Practical	Role separation, browser/log verification, update log, checklist
Mature	Domain rules, evaluation automation, drift management, release integration

The best starting point is not to maximize all five elements. It is to remove one major failure mode from each.

