Team Harness Design Checklist
A practical checklist teams can use to design repository, approval, evaluation, browser, log, release, and runtime loops.
Key takeaways
- This is a scored checklist across eight areas: repo/docs, roles, evaluation, tools, HITL, sandbox boundaries, workflow distribution, and operations.
- Score each area 0 (in people's heads), 1 (partial/inconsistent), or 2 (shared, reusable routine) for a total out of 16.
- Totals map to stages: 0-5 personal habit, 6-11 early practical, 12-16 mature, each with a concrete next action.
- A minimum harness package ships AGENTS.md, docs/, a task contract, hooks, and a provider-setup plugin around a read -> implement -> verify -> update loop.
- Suspect overdesign when rule files grow unread, reviewers rarely catch new failures, or tiny tasks always require the full loop.
Harness design is less about philosophy and more about removing the bottlenecks agents actually hit.
Quick Scoring
Score each area from 0 to 2.
| Score | Meaning |
|---|---|
| 0 | Mostly absent or only in people's heads |
| 1 | Partially documented or automated, but inconsistent |
| 2 | Shared team routine and reusable |
1. Repository and Docs
- What is the first document the agent should read?
- Is AGENTS.md or an equivalent entry doc short and current?
- Are architecture, domain rules, and release criteria in the repo?
- Are important rules only in chat, Notion, or meetings?
- Are owners and links clear?
2. Roles and Boundaries
- Which roles are actually needed: planner, builder, reviewer, QA?
- Which failure requires role separation?
- Are role inputs and outputs clear?
- Does the reviewer find concrete failure modes instead of giving generic approval?
3. Evaluation and Done Criteria
- What does "done" mean?
- Are lint, typecheck, build, tests, and browser QA required where relevant?
- Are domain gates defined, such as payment correctness or migration safety?
- Are criteria automated or only written in a prompt?
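One way to move done criteria out of a prompt and into automation is a small gate runner that executes each check as a command and reports "done" only when every check exits with status 0. The sketch below is illustrative only: the gate names and the commands a team would pass in are assumptions, not something this checklist prescribes.

```python
import subprocess
from typing import Dict, Sequence


def run_gates(gates: Dict[str, Sequence[str]]) -> Dict[str, bool]:
    """Run each gate command and record whether it exited with status 0."""
    results: Dict[str, bool] = {}
    for name, command in gates.items():
        completed = subprocess.run(list(command), capture_output=True, text=True)
        results[name] = completed.returncode == 0
    return results


def is_done(results: Dict[str, bool]) -> bool:
    """A task is done only if at least one gate ran and all gates passed."""
    return bool(results) and all(results.values())
```

A team would call `run_gates` with its own lint, typecheck, build, and test commands. Domain gates such as payment correctness or migration safety can be added as further entries once they exist as runnable checks.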
4. Tool Access
- Can the agent access needed logs, metrics, traces, and browsers?
- Is repository context enough, or does the task require external systems?
- Are UI products being modified without browser verification?
5. HITL and Approval
- Which tasks can proceed automatically?
- Which destructive, deployment, or security-sensitive changes require approval?
- Are approval criteria consistent across people?
- What must be re-verified after approval?
- If auto approval is used, are trust boundaries, block rules, and allow exceptions configured?
- Does denial lead to a safer route instead of a workaround?
- Are high-risk infra, payment, and security changes hard human gates?
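The approval questions above can be made consistent across people by encoding them as an ordered rule set: deny rules first, hard human gates next, and automatic approval only for what remains. The following is a minimal sketch under assumed names; the tag vocabulary (including the `production-credentials` deny tag) and the `Change` type are hypothetical and would need to match how a team actually labels work.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class Decision(Enum):
    AUTO = "auto"
    HUMAN_GATE = "human_gate"
    DENY = "deny"


@dataclass(frozen=True)
class Change:
    summary: str
    tags: FrozenSet[str]


# Hypothetical tag sets; adjust to the team's own labels.
DENY_TAGS = frozenset({"production-credentials"})
HARD_GATE_TAGS = frozenset({"infra", "payment", "security"})
APPROVAL_TAGS = frozenset({"destructive", "deployment"})


def route(change: Change) -> Decision:
    """Apply deny rules first, then human gates, then allow automatically."""
    if change.tags & DENY_TAGS:
        return Decision.DENY
    if change.tags & (HARD_GATE_TAGS | APPROVAL_TAGS):
        return Decision.HUMAN_GATE
    return Decision.AUTO
```

Keeping the order explicit means a change tagged with both a deny rule and a harmless label is still blocked, which is the behavior a trust boundary usually needs.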
6. Sandbox and Integration Boundaries
- Which tools are open by default: shell, filesystem, apply_patch, browser, MCP?
- Are sensitive data and production credentials separated from generated code?
- Is private or on-prem MCP connected without public exposure?
- Do hooks automate validation, logging, and memory creation?
- Are humans reachable for long-running remote approval?
- Are session log, harness, and sandbox separated?
- Do subagent handoffs need separate checks?
7. Workflow Distribution
- Are good team practices distributed as commands, skills, templates, hooks, or plugins?
- Are expert routines still passed by word of mouth?
- Can new team members reach similar baseline quality?
- Can provider setup, API keys, and troubleshooting become plugins?
- Do domain templates include skills, connectors, subagents, and approval flows?
8. Operations and Cleanup
- Is there a stale-doc routine?
- Are unused commands and skills removed regularly?
- Are scaffolding steps reduced when models improve?
- Does an updates log explain why the harness changed?
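A stale-doc routine can start as a script that lists Markdown files untouched for longer than some threshold. The sketch below assumes file modification time is a usable signal and uses a 90-day default; both are assumptions. In a Git checkout, modification times reflect the checkout rather than the last edit, so commit dates would be a more reliable source there.

```python
import time
from pathlib import Path
from typing import List, Optional

SECONDS_PER_DAY = 86400


def stale_docs(root: Path, max_age_days: int = 90,
               now: Optional[float] = None) -> List[Path]:
    """Return Markdown files under root last modified more than max_age_days ago."""
    current = time.time() if now is None else now
    cutoff = current - max_age_days * SECONDS_PER_DAY
    return sorted(
        path for path in root.rglob("*.md")
        if path.is_file() and path.stat().st_mtime < cutoff
    )
```

The output is a review list for the doc owner, not a deletion list: a stable invariants file may legitimately go unchanged for months.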
Score Sheet
| Area | 0 points | 1 point | 2 points |
|---|---|---|---|
| Entry docs | None | Long or stale | Short, current, linked |
| Role split | Builder-only | Reviewer exists | Planner/reviewer/QA separated when needed |
| Done criteria | Human judgment | Partial checklist | Automation plus clear approval boundary |
| Tool access | No logs/browser | Partial access | Browser, logs, tests connected |
| Approval boundary | Personal judgment | Some approvals | Human gate, classifier, deny rules clear |
| Execution isolation | No boundary | Some sandbox/MCP | Session, sandbox, MCP, hooks, approvals policy |
| Workflow rollout | Personal routine | Some commands/templates | Commands, skills, templates, plugins adopted |
| Updates/cleanup | None | Occasional cleanup | Cadence and owner |
Total Score
| Total out of 16 | Interpretation | Next action |
|---|---|---|
| 0-5 | Personal habit stage | Entry doc, required checks, updates |
| 6-11 | Early practical stage | Release gate, browser QA, domain layer, sandbox boundary |
| 12-16 | Mature stage | Telemetry, garbage collection, generated harness review |
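The scoring arithmetic is simple enough to script, which helps when several teams fill in the sheet. This sketch sums the eight area scores and looks up the stage and next action from the table above; the area keys and the function name are illustrative choices, not part of the checklist.

```python
from typing import Dict, Tuple

# One key per row of the score sheet (key names are illustrative).
AREAS = (
    "entry_docs", "role_split", "done_criteria", "tool_access",
    "approval_boundary", "execution_isolation", "workflow_rollout",
    "updates_cleanup",
)

# (lowest total, highest total, interpretation, next action)
STAGES = (
    (0, 5, "Personal habit stage", "Entry doc, required checks, updates"),
    (6, 11, "Early practical stage",
     "Release gate, browser QA, domain layer, sandbox boundary"),
    (12, 16, "Mature stage",
     "Telemetry, garbage collection, generated harness review"),
)


def interpret(scores: Dict[str, int]) -> Tuple[int, str, str]:
    """Validate the eight area scores and return (total, stage, next action)."""
    if set(scores) != set(AREAS):
        raise ValueError("scores must cover exactly the eight areas")
    for area, value in scores.items():
        if value not in (0, 1, 2):
            raise ValueError(f"{area} must be scored 0, 1, or 2")
    total = sum(scores.values())
    for low, high, stage, action in STAGES:
        if low <= total <= high:
            return total, stage, action
    raise AssertionError("unreachable: total is always within 0-16")
```

For example, a team scoring 1 in every area totals 8 and lands in the early practical stage.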
Minimum Harness Package
AGENTS.md
architecture.md
invariants.md
task-contract.yaml
qa-report.md
updates.md
hooks.json
provider-setup.md
Required loop:
read AGENTS/docs -> implement -> verify -> update log
- Add browser QA for UI work.
- Separate human gates for risky changes.
- Settle the MCP allowlist and sandbox permissions before using external tools.
- Store trust boundaries and deny rules when auto approval is enabled.
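Whether a repository actually ships the minimum package can be checked mechanically. The sketch below reports which of the listed files are missing; it assumes a file counts as present if it exists anywhere under the repository root (for instance under docs/), since the list above does not fix exact locations.

```python
from pathlib import Path
from typing import List

# File names taken from the minimum harness package list.
REQUIRED_FILES = (
    "AGENTS.md", "architecture.md", "invariants.md", "task-contract.yaml",
    "qa-report.md", "updates.md", "hooks.json", "provider-setup.md",
)


def missing_harness_files(root: Path) -> List[str]:
    """Return the required file names that do not appear anywhere under root."""
    found = {path.name for path in root.rglob("*") if path.is_file()}
    return [name for name in REQUIRED_FILES if name not in found]
```

An empty result only shows that the files exist. Whether they are short, current, and linked still needs the human review described in the checklist.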
Overdesign Check
Suspect overdesign if two or more are true:
- rule files keep growing and nobody reads them;
- reviewer steps rarely catch new failures;
- tiny tasks always require the full loop;
- there are many commands but no one knows which to use;
- approval exists but criteria are empty.
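The two-or-more rule above can be expressed as a small predicate, which is handy in a periodic harness review. The signal keys below are illustrative shorthand for the five bullets, not names used elsewhere in this checklist.

```python
from typing import Iterable

# Shorthand keys for the five overdesign signals (names are illustrative).
OVERDESIGN_SIGNALS = frozenset({
    "unread_rule_files",
    "reviewer_rarely_catches_failures",
    "full_loop_for_tiny_tasks",
    "too_many_unknown_commands",
    "approval_without_criteria",
})


def suspect_overdesign(observed: Iterable[str]) -> bool:
    """Return True when two or more distinct known signals are observed."""
    seen = set(observed)
    unknown = seen - OVERDESIGN_SIGNALS
    if unknown:
        raise ValueError(f"unknown signals: {sorted(unknown)}")
    return len(seen) >= 2
```

A single signal is treated as noise; two distinct signals are the threshold the checklist sets for a closer look.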