Team Harness Design Checklist

A practical checklist teams can use to design repository, approval, evaluation, browser, log, release, and runtime loops.

Key takeaways

This is a scored checklist across eight areas: repo/docs, roles, evaluation, tools, HITL, sandbox boundaries, workflow distribution, and operations.
Score each area 0 (in people's heads), 1 (partial/inconsistent), or 2 (shared, reusable routine) for a total out of 16.
Totals map to stages: 0-5 personal habit, 6-11 early practical, 12-16 mature, each with a concrete next action.
A minimum harness package ships AGENTS.md, docs/, a task contract, hooks, and a provider-setup plugin around a read -> implement -> verify -> update loop.
Suspect overdesign when rule files grow unread, reviewers rarely catch new failures, or tiny tasks always require the full loop.

Harness design is less about philosophy and more about removing the bottlenecks agents actually hit.

Quick Scoring

Score each area from 0 to 2.

Score	Meaning
0	Mostly absent or only in people's heads
1	Partially documented or automated, but inconsistent
2	Shared team routine and reusable

1. Repository and Docs

What is the first document the agent should read?
Is AGENTS.md or an equivalent entry doc short and current?
Are architecture, domain rules, and release criteria in the repo?
Are important rules only in chat, Notion, or meetings?
Are owners and links clear?

2. Roles and Boundaries

Which roles are actually needed: planner, builder, reviewer, QA?
Which failures require role separation?
Are role inputs and outputs clear?
Does the reviewer find concrete failure modes instead of giving generic approval?

3. Evaluation and Done Criteria

What does "done" mean?
Are lint, typecheck, build, tests, and browser QA required where relevant?
Are domain gates defined, such as payment correctness or migration safety?
Are criteria automated or only written in a prompt?
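
One way to move done criteria out of a prompt and into automation is a small gate script. The sketch below is a minimal illustration, not a prescribed tool: the function name `run_done_gate` and the shape of the `checks` argument are assumptions, and the actual commands (lint, typecheck, build, tests) depend on the team's own stack.

```python
import subprocess
import sys
from typing import List, Sequence, Tuple


def run_done_gate(checks: Sequence[Tuple[str, List[str]]]) -> Tuple[bool, List[str]]:
    """Run each named check; the task is "done" only if every check exits with 0.

    `checks` is a hypothetical structure: (label, command argv) pairs.
    """
    failed: List[str] = []
    for name, command in checks:
        try:
            result = subprocess.run(command, capture_output=True, text=True)
        except FileNotFoundError:
            # A missing tool counts as a failed check rather than a silent pass.
            failed.append(name)
            continue
        if result.returncode != 0:
            failed.append(name)
    return (not failed, failed)


if __name__ == "__main__":
    # Placeholder commands; replace with the project's real lint/test commands.
    done, failed_checks = run_done_gate([
        ("lint", [sys.executable, "-c", "raise SystemExit(0)"]),
        ("tests", [sys.executable, "-c", "raise SystemExit(0)"]),
    ])
    print("done" if done else f"not done, failed: {failed_checks}")
```

Domain gates such as payment correctness or migration safety could be added as further entries in the same list, so the definition of done lives in one place.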

4. Tool Access

Can the agent access needed logs, metrics, traces, and browsers?
Is repository context enough, or does the task require external systems?
Are UI products being modified without browser verification?

5. HITL and Approval

Which tasks can proceed automatically?
Which destructive, deployment, or security-sensitive changes require approval?
Are approval criteria consistent across people?
What must be re-verified after approval?
If auto approval is used, are trust boundaries, block rules, and allow exceptions configured?
Does denial lead to a safer route instead of a workaround?
Are high-risk infra, payment, and security changes hard human gates?
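
The questions above can be made concrete as an explicit policy object. The following is a rough sketch under stated assumptions: the class `ApprovalPolicy`, the dotted action names, and the precedence order (hard human gates, then allow exceptions, then block rules, then a human by default) are all hypothetical choices, not something this checklist prescribes.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase
from typing import List


@dataclass
class ApprovalPolicy:
    # All pattern lists are illustrative; action names like "payment.refund" are made up.
    hard_gates: List[str] = field(default_factory=list)   # always require a human
    allow: List[str] = field(default_factory=list)        # allow exceptions, auto-approved
    deny: List[str] = field(default_factory=list)         # block rules

    def decide(self, action: str) -> str:
        """Return "human", "auto", or "deny" for an action name."""
        if any(fnmatchcase(action, p) for p in self.hard_gates):
            return "human"
        if any(fnmatchcase(action, p) for p in self.allow):
            return "auto"
        if any(fnmatchcase(action, p) for p in self.deny):
            return "deny"
        # Anything the policy does not recognize falls back to a human decision.
        return "human"


policy = ApprovalPolicy(
    hard_gates=["infra.*", "payment.*", "security.*"],
    allow=["docs.*", "shell.ls"],
    deny=["shell.*"],
)
```

With this example policy, `policy.decide("payment.refund")` routes to a human, `policy.decide("docs.edit")` proceeds automatically, and `policy.decide("shell.rm")` is denied. Writing the rules down this way also makes it easier to check that approval criteria are consistent across people.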

6. Sandbox and Integration Boundaries

Which tools are open by default: shell, filesystem, apply_patch, browser, MCP?
Are sensitive data and production credentials separated from generated code?
Is private or on-prem MCP connected without public exposure?
Do hooks automate validation, logging, and memory creation?
Are humans reachable for long-running remote approval?
Are session log, harness, and sandbox separated?
Do subagent handoffs need separate checks?

7. Workflow Distribution

Are good team practices distributed as commands, skills, templates, hooks, or plugins?
Are expert routines still passed by word of mouth?
Can new team members reach similar baseline quality?
Can provider setup, API keys, and troubleshooting become plugins?
Do domain templates include skills, connectors, subagents, and approval flows?

8. Operations and Cleanup

Is there a stale-doc routine?
Are unused commands and skills removed regularly?
Are scaffolding steps reduced when models improve?
Does an updates log explain why the harness changed?
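
A stale-doc routine can start as a very small script. The sketch below assumes docs are Markdown files under one directory and uses file modification time as a rough staleness signal; the function name `stale_docs` and the 90-day default are illustrative assumptions. Modification times are reset by fresh checkouts, so a team might prefer last-commit dates from version control instead.

```python
import time
from pathlib import Path
from typing import List, Optional


def stale_docs(docs_dir: Path, max_age_days: int = 90,
               now: Optional[float] = None) -> List[Path]:
    """Return Markdown files under docs_dir not modified within max_age_days."""
    current = time.time() if now is None else now
    cutoff = current - max_age_days * 86400  # 86400 seconds per day
    return sorted(
        path for path in docs_dir.rglob("*.md")
        if path.is_file() and path.stat().st_mtime < cutoff
    )
```

Running this on a cadence and assigning each reported file to an owner is one way to reach the "Cadence and owner" level in the score sheet.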

Score Sheet

Area	0 points	1 point	2 points
Entry docs	None	Long or stale	Short, current, linked
Role split	Builder-only	Reviewer exists	Planner/reviewer/QA separated when needed
Done criteria	Human judgment	Partial checklist	Automation plus clear approval boundary
Tool access	No logs/browser	Partial access	Browser, logs, tests connected
Approval boundary	Personal judgment	Some approvals	Human gate, classifier, deny rules clear
Execution isolation	No boundary	Some sandbox/MCP	Session, sandbox, MCP, hooks, approvals policy
Workflow rollout	Personal routine	Some commands/templates	Commands, skills, templates, plugins adopted
Updates/cleanup	None	Occasional cleanup	Cadence and owner

Total Score

Total out of 16	Interpretation	Next action
0-5	Personal habit stage	Entry doc, required checks, updates
6-11	Early practical stage	Release gate, browser QA, domain layer, sandbox boundary
12-16	Mature stage	Telemetry, garbage collection, generated harness review
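
The scoring arithmetic is simple enough to automate. The sketch below encodes the eight areas from the score sheet and the three bands from the table above; the function name `score_harness` and the dictionary input format are assumptions made for illustration.

```python
from typing import Dict, Tuple

# Area names as they appear in the score sheet.
AREAS = (
    "Entry docs", "Role split", "Done criteria", "Tool access",
    "Approval boundary", "Execution isolation", "Workflow rollout",
    "Updates/cleanup",
)

# (lowest total, highest total, interpretation, next action) from the Total Score table.
STAGES = (
    (0, 5, "Personal habit stage", "Entry doc, required checks, updates"),
    (6, 11, "Early practical stage",
     "Release gate, browser QA, domain layer, sandbox boundary"),
    (12, 16, "Mature stage",
     "Telemetry, garbage collection, generated harness review"),
)


def score_harness(scores: Dict[str, int]) -> Tuple[int, str, str]:
    """Sum the eight area scores (each 0, 1, or 2) and map the total to a stage."""
    missing = [area for area in AREAS if area not in scores]
    if missing:
        raise ValueError(f"missing areas: {missing}")
    for area in AREAS:
        if scores[area] not in (0, 1, 2):
            raise ValueError(f"{area}: score must be 0, 1, or 2")
    total = sum(scores[area] for area in AREAS)
    for low, high, stage, next_action in STAGES:
        if low <= total <= high:
            return total, stage, next_action
    raise AssertionError("total outside 0-16 despite validation")
```

For example, a team that scores 1 in every area totals 8, which lands in the early practical stage.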

Minimum Harness Package

AGENTS.md
architecture.md
invariants.md
task-contract.yaml
qa-report.md
updates.md
hooks.json
provider-setup.md

Required loop:

read AGENTS/docs -> implement -> verify -> update log
Add browser QA for UI work.
Separate human gates for risky changes.
Settle the MCP allowlist and sandbox permissions before using external tools.
Store trust boundaries and deny rules when auto approval is enabled.
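
A quick presence check can confirm that a repository carries the minimum package. This is a sketch only: the checklist does not say where each file lives, so the code assumes a file counts as present if its name appears anywhere under the repository root, and the function name `missing_harness_files` is hypothetical.

```python
from pathlib import Path
from typing import List

# File names from the minimum harness package list.
REQUIRED_FILES = (
    "AGENTS.md", "architecture.md", "invariants.md", "task-contract.yaml",
    "qa-report.md", "updates.md", "hooks.json", "provider-setup.md",
)


def missing_harness_files(repo_root: Path) -> List[str]:
    """Return required file names not found anywhere under repo_root."""
    # Assumption: location is not enforced, only the file name is matched.
    present = {path.name for path in repo_root.rglob("*") if path.is_file()}
    return [name for name in REQUIRED_FILES if name not in present]
```

An empty result only shows the files exist; whether they are short, current, and linked still needs the review described in the checklist.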

Overdesign Check

Suspect overdesign if two or more are true:

rule files keep growing and nobody reads them;
reviewer steps rarely catch new failures;
tiny tasks always require the full loop;
there are many commands but no one knows which to use;
approval exists but criteria are empty.
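
The "two or more" rule above can be written as a tiny check for use in a periodic harness review. The signal keys below are made-up short names for the five conditions in the list, and `suspect_overdesign` is a hypothetical helper.

```python
from typing import Iterable

# Hypothetical short keys for the five conditions listed above.
OVERDESIGN_SIGNALS = {
    "unread_rule_files": "rule files keep growing and nobody reads them",
    "idle_reviewer": "reviewer steps rarely catch new failures",
    "full_loop_for_tiny_tasks": "tiny tasks always require the full loop",
    "unclear_commands": "there are many commands but no one knows which to use",
    "empty_approval_criteria": "approval exists but criteria are empty",
}


def suspect_overdesign(observed: Iterable[str]) -> bool:
    """Return True when two or more known overdesign signals are observed."""
    signals = set(observed)
    unknown = signals - OVERDESIGN_SIGNALS.keys()
    if unknown:
        raise ValueError(f"unknown signals: {sorted(unknown)}")
    return len(signals) >= 2
```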

