The Five Elements of a Harness
This section explains the five design axes behind most practical harnesses: environment, roles, criteria, loops, and maintenance.
Key takeaways
- Most harnesses reduce to five design axes: environment, roles, criteria, loops, and maintenance.
- Environment means making necessary material easy to find (short entry doc, versioned rules, verification tools). For roles, the contract between them matters more than their number.
- Criteria should favor machine-checkable gates over abstract quality language, and loops define where the system returns after failure and when humans intervene.
- Maintenance treats decay as inevitable, needing update logs, stale-doc checks, and cleanup of unused skills and broken commands.
- The best starting point is not to maximize all five but to remove one major failure mode from each.
Harnesses use different vocabulary, but most of them reduce to the same five design axes.
1. Environment
Agents reason most reliably about material they can reach: the repository and its connected tools. The goal is not to show everything. The goal is to make the necessary material easy to find.
Good environments have:
- a short entry document that points to deeper docs;
- versioned architecture and domain rules;
- access to verification tools such as browser, logs, metrics, and tests;
- team rules inside the repo, not only in chat or meetings.
Key questions:
- Where should the agent start?
- Which document is current truth?
- Which tool verifies the result?
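The first two questions can be partly automated. The sketch below is one possible check, not a standard tool: it assumes the entry document is a Markdown file named `AGENTS.md` at the repository root and that a 100-line limit is a reasonable meaning of "short". The function name `check_entry_doc` and both defaults are hypothetical.

```python
import re
from pathlib import Path

# Captures the target of a Markdown link such as [Architecture](docs/architecture.md).
# Pure in-page anchors like (#section) are not matched.
LINK_PATTERN = re.compile(r"\[[^\]]*\]\(([^)#\s]+)[^)]*\)")


def check_entry_doc(repo_root, entry_name="AGENTS.md", max_lines=100):
    """Return a list of problems found in the entry document (empty if none)."""
    root = Path(repo_root)
    entry = root / entry_name
    if not entry.is_file():
        return [f"missing entry document: {entry_name}"]

    text = entry.read_text(encoding="utf-8")
    problems = []

    # A long entry document stops being an entry point; the limit is an assumption.
    line_count = len(text.splitlines())
    if line_count > max_lines:
        problems.append(f"entry document has {line_count} lines (limit {max_lines})")

    # Every relative link should point at a file or directory that exists in the repo.
    for target in LINK_PATTERN.findall(text):
        if "://" in target or target.startswith("mailto:"):
            continue  # external links are out of scope for this sketch
        if not (root / target).exists():
            problems.append(f"broken link: {target}")
    return problems
```

Running this in CI keeps the agent's starting point from silently drifting away from the docs it points to.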
2. Roles
Quality on long tasks degrades when a single agent plans, builds, reviews, and runs QA on everything, so harnesses often separate these roles.
| Role | Responsibility |
|---|---|
| Planner | Define scope, decomposition, and done conditions |
| Builder / Generator | Implement |
| Reviewer / Evaluator | Check requirements, quality, and omissions |
| QA / Browser Agent | Verify real UI and behavior |
| Release / Ops | Tests, deployment, rollback, monitoring |
The number of roles matters less than the contract between them.
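One way to make such a contract concrete is to type the handoff between roles. The following is a minimal sketch under assumed names (`PlanHandoff`, `ReviewResult`, `validate_plan`, `review` are all hypothetical): the planner must hand over scope, tasks, and done conditions, and the reviewer reports exactly which done conditions lack evidence.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class PlanHandoff:
    """What the planner promises to hand to the builder and reviewer."""
    scope: str
    tasks: tuple[str, ...]
    done_conditions: tuple[str, ...]


@dataclass(frozen=True)
class ReviewResult:
    """What the reviewer promises to hand back."""
    passed: bool
    unmet_conditions: tuple[str, ...] = ()


def validate_plan(plan: PlanHandoff) -> list[str]:
    """Reject handoffs that leave the next role guessing."""
    problems = []
    if not plan.scope.strip():
        problems.append("scope is empty")
    if not plan.tasks:
        problems.append("no tasks listed")
    if not plan.done_conditions:
        problems.append("no done conditions listed")
    return problems


def review(plan: PlanHandoff, evidence: dict[str, bool]) -> ReviewResult:
    """A done condition counts as met only if evidence explicitly says so."""
    unmet = tuple(c for c in plan.done_conditions if not evidence.get(c, False))
    return ReviewResult(passed=not unmet, unmet_conditions=unmet)
```

The same shape works whether the roles are separate agents, separate prompts, or a human and an agent; what matters is that a missing done condition is rejected at the handoff rather than discovered at review.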
3. Criteria
Agents overestimate completion when done criteria are vague. A good harness makes completion explicit.
- Required tests, lint, typecheck, and build.
- Specific UX scenarios reproduced in browser.
- Security, schema, or performance constraints.
- Human approval zones.
- Docs, release notes, or runbooks updated.
Practical Rule
Increase machine-checkable criteria before adding abstract quality language.
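As an illustration of a machine-checkable gate, the sketch below treats each criterion as a command whose exit code decides pass or fail. The names `Gate`, `run_gates`, and `is_done` are hypothetical, and the actual commands depend on the repository's toolchain.

```python
from __future__ import annotations

import subprocess
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    name: str
    command: list[str]  # argv list, e.g. the project's test or lint command


def run_gates(gates: list[Gate]) -> dict[str, bool]:
    """Run every gate and record whether it exited with status 0."""
    results = {}
    for gate in gates:
        try:
            completed = subprocess.run(gate.command, capture_output=True, text=True)
            results[gate.name] = completed.returncode == 0
        except FileNotFoundError:
            # A gate whose tool is not installed counts as failed, not skipped.
            results[gate.name] = False
    return results


def is_done(results: dict[str, bool]) -> bool:
    """Done means at least one gate ran and all of them passed."""
    return bool(results) and all(results.values())
```

In a Python project the gates might be `pytest -q`, `ruff check .`, and `mypy .`, but that choice is an assumption about the codebase. The design point is that "done" is computed from exit codes rather than asserted by the agent.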
4. Loops
A harness is not a one-shot generation system. It is a loop: plan -> implement -> evaluate -> fix -> re-verify.
The loop must answer:
- Where does the system return after failure?
- Who interprets the evaluation result?
- When does automation continue, and when does a human intervene?
Anthropic's planner/generator/evaluator framing clarifies this loop. OpenAI's browser/log/metric access shortens it.
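The three questions map onto a small control loop. This is a sketch with hypothetical names (`Evaluation`, `run_loop`) and an assumed retry budget of three attempts, not the implementation used by any of the sources above: failure returns to the implement step with the evaluator's feedback, the evaluator interprets the result, and a human is pulled in when the evaluator asks for one or the budget runs out.

```python
from dataclasses import dataclass


@dataclass
class Evaluation:
    passed: bool
    feedback: str = ""
    needs_human: bool = False  # the evaluator can request escalation directly


def run_loop(plan, implement, evaluate, max_attempts=3):
    """Run plan -> implement -> evaluate -> fix -> re-verify.

    Returns (status, attempts_used), where status is "done" or "escalated".
    """
    task = plan()
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        # After a failure the system returns here, carrying the feedback forward.
        artifact = implement(task, feedback)
        result = evaluate(task, artifact)
        if result.passed:
            return ("done", attempt)
        if result.needs_human:
            return ("escalated", attempt)
        feedback = result.feedback
    # Retry budget exhausted: stop automating and hand over to a human.
    return ("escalated", max_attempts)
```

`plan`, `implement`, and `evaluate` are plain callables here, so each can be backed by a different role from the table in section 2.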
5. Maintenance
Harnesses decay.
- Old docs remain.
- Unused commands and rules pile up.
- New domain requirements are not reflected in the docs and rules.
- Quality gates no longer match the codebase.
So a harness needs operations:
- update logs;
- stale-doc checks;
- unused skill, broken command, and dead-link cleanup;
- approval and test-gate review.
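A stale-doc check can start very simply. The sketch below flags Markdown files that have not been modified within a chosen window. It assumes file modification time is an acceptable proxy for freshness (commit history would be more reliable in a fresh checkout) and that 90 days is a sensible default; `find_stale_docs` is a hypothetical name.

```python
import time
from pathlib import Path

SECONDS_PER_DAY = 86_400


def find_stale_docs(docs_root, max_age_days=90, now=None):
    """Return relative paths of .md files last modified more than max_age_days ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    stale = []
    for path in sorted(Path(docs_root).rglob("*.md")):
        # st_mtime is the file's last modification time in seconds since the epoch.
        if path.stat().st_mtime < cutoff:
            stale.append(path.relative_to(docs_root).as_posix())
    return stale
```

A flagged file is a prompt for review, not proof that the doc is wrong; stable documents can legitimately go untouched for a long time.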
The Five Elements in One Table
| Element | Question | Typical artifacts |
|---|---|---|
| Environment | What can the agent see? | AGENTS.md, docs, schemas, tool connections |
| Roles | Who is responsible for what? | Planner, reviewer, QA definitions, slash commands |
| Criteria | What must pass? | Test matrix, gates, quality checklist |
| Loops | How does the system improve? | Review loop, QA loop, HITL flow |
| Maintenance | How does it stay current? | Updates, doc gardening, cleanup cadence |
Source Emphasis
| Source | Environment | Roles | Criteria | Loops | Maintenance |
|---|---|---|---|---|---|
| OpenAI | Very strong | Medium | Medium | Medium | Very strong |
| Anthropic | Medium | Very strong | Very strong | Very strong | Medium |
| Toss | Strong | Medium | Strong | Strong | Strong |
| gstack | Strong | Very strong | Strong | Very strong | Medium |
| revfactory/harness | Strong | Strong | Medium | Strong | Medium |
Minimum vs Mature Harness
| Stage | Characteristics |
|---|---|
| Minimum | Short entry doc, a few required checks, basic approval policy |
| Practical | Role separation, browser/log verification, update log, checklist |
| Mature | Domain rules, evaluation automation, drift management, release integration |
The best starting point is not to maximize all five elements. It is to remove one major failure mode from each.