Harness Engineering
A practical guide to harness design, evaluation, and operations based on OpenAI, Anthropic, Toss, gstack, revfactory, Agents SDK, and Managed Agents patterns
Recently Updated Chapters
Analyze Anthropic's long-running harness through planner/evaluator, Managed Agents, and auto approval patterns.
Read gstack as an opinionated multi-host workflow harness with specialists, power tools, QA, checkpointing, and release gates.
Analyze OpenAI's harness view through repo-readability, observability, sandboxing, runtime surface, and cleanup.
Read revfactory/harness as a meta-harness for generating domain-specific team architectures, agents, and skills.
Compare OpenAI, Anthropic, Toss, gstack, and revfactory/harness by input, state, verification, and rollout.
Harness engineering is not about writing a better prompt. It is the practice of designing the work environment that lets agents take on longer, larger, and riskier tasks.
Teams using the same model and the same IDE can see very different outcomes because their context structure, verification loop, approval boundary, documentation quality, and tool access differ.
Core Thesis
Generic harnesses are useful starting points, but performance depends on how explicitly your team externalizes domain rules, operating criteria, and runtime boundaries.
English Edition
This edition translates and localizes the Korean handbook for platform, AgentOps, and developer productivity teams standardizing AI coding agents across repositories.
Source Map
| Source | Question this handbook takes from it | Summary |
|---|---|---|
| OpenAI Harness Engineering | Where does the agent work? | The repository, docs, browser, logs, and cleanup loop are part of the harness |
| OpenAI Agents SDK / Codex updates | Which harness primitives are becoming product surfaces? | MCP, skills, AGENTS.md, sandbox, shell, apply_patch, hooks, and plugins are becoming common infrastructure |
| Anthropic harness / Managed Agents / auto mode | How is an agent verified, isolated, and approved? | Separate planner/evaluator, session, harness, sandbox, and permission classifier boundaries |
| Toss harness article | How does a harness roll out to a team? | Personal habits must become an executable single source of truth (SSOT) and a team workflow |
| gstack | How does a harness become a workflow across many agent hosts? | Think -> Plan -> Build -> Review -> Test -> Ship -> Reflect becomes a command surface |
| revfactory/harness | How can harness design become repeatable? | Domain analysis generates agent teams, skills, and validation loops |
Where Harness Begins
A prompt improves a single response.
- It clarifies the goal.
- It constrains the output format.
- It improves one model call.
Context improves the material the model can use.
- Which files to read.
- Which docs to trust.
- Which local rules to prioritize.
A harness improves the whole work system.
- Planning, implementation, review, QA, approval, and release.
- Browser, logs, tests, and repository docs.
- A loop to recover when the first attempt fails.
A platform distributes harnesses across teams.
- Shared skills, commands, templates, and plugins.
- Domain rule layers.
- Update logs, metrics, and garbage collection.
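The harness layer's "loop to recover when the first attempt fails" can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: `run_with_recovery`, `HarnessRun`, `Attempt`, and the `fake_build` / `fake_verify` stand-ins are hypothetical names introduced here, and a real harness would replace them with an agent call and an actual test or QA runner.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Attempt:
    """One pass through the build-and-verify cycle."""
    number: int
    output: str
    passed: bool
    feedback: str


@dataclass
class HarnessRun:
    """State the harness keeps across attempts (hypothetical structure)."""
    task: str
    attempts: List[Attempt] = field(default_factory=list)

    @property
    def succeeded(self) -> bool:
        # The run succeeded only if the most recent attempt passed verification.
        return bool(self.attempts) and self.attempts[-1].passed


def run_with_recovery(
    task: str,
    build: Callable[[str, str], str],
    verify: Callable[[str], Tuple[bool, str]],
    max_attempts: int = 3,
) -> HarnessRun:
    """Call build, verify the result, and feed failures back into the next attempt."""
    run = HarnessRun(task=task)
    feedback = ""
    for number in range(1, max_attempts + 1):
        output = build(task, feedback)
        passed, feedback = verify(output)
        run.attempts.append(Attempt(number, output, passed, feedback))
        if passed:
            break
    return run


# Stand-ins (assumptions) for an agent call and a test runner.
def fake_build(task: str, feedback: str) -> str:
    return "patch-v2" if feedback else "patch-v1"


def fake_verify(output: str) -> Tuple[bool, str]:
    if output == "patch-v1":
        return False, "unit test failed"
    return True, ""


run = run_with_recovery("fix login bug", fake_build, fake_verify)
print(run.succeeded, len(run.attempts))  # prints "True 2"
```

The point of the sketch is that verification output becomes input to the next attempt, and that every attempt is recorded as state. A prompt alone cannot do either.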
Questions This Book Answers
- How is harness engineering different from prompt engineering and context engineering?
- What elements make a harness effective?
- Why are inputs, state, verification, and permissions engineering concerns?
- What do OpenAI, Anthropic, Toss, gstack, and revfactory each emphasize?
- How do Agents SDK, Managed Agents, sandboxing, MCP, skills, hooks, and plugins change team design?
- Why should teams converge toward their own harness instead of copying someone else's?
- In what order should a team design, roll out, and operate a harness?
Who This Is For
| Reader | What you get |
|---|---|
| AI coding agent adoption lead | A way to turn personal tricks into a team system |
| Codex, Claude Code, Cursor, or agentic IDE user | A view beyond tool usage into work-environment design |
| AgentOps or platform engineer | An operating frame for evaluation, approval, observability, and docs |
| Team defining internal AI standards | A method for executable SSOT and common workflows |
Five-Minute Diagnostic
| Current pain | Start here |
|---|---|
| Same model, very different team outcomes | foundations -> engineering-mechanics |
| Lots of prompts, weak repeatability | five-elements -> engineering-mechanics |
| Review and QA catch problems too late | evaluation-loops -> case-anthropic |
| Copied an external harness and it does not fit | case-studies -> make-it-yours |
| Docs, approvals, and browser checks are disconnected | case-openai -> checklist |
Maturity Map
Recommended Paths
| Goal | Reading path |
|---|---|
| Understand the concept quickly | foundations -> engineering-mechanics -> five-elements |
| Compare external examples | case-studies -> case-openai -> case-anthropic |
| Apply it to a frontend team | domain-playbooks -> scenario-frontend-team |
| Build platform or monorepo rules | domain-playbooks -> scenario-platform-team |
| Manage payments or settlement risk | domain-playbooks -> scenario-payments-team |
| Operate AI product evaluation and rollout | domain-playbooks -> scenario-ai-product-team |
| Roll out to a team | case-toss -> team-rollout |
| Study workflow and release gates | case-gstack -> operations |
| Study meta-harness generation | case-revfactory -> make-it-yours |
Contents
Ch1. Foundations
Define harness engineering and separate it from prompt and context engineering.
Ch2. Repo-Readable Systems
Make AGENTS.md, docs, observability, and executable SSOT part of the work environment.
Ch3. The Five Elements
Environment, roles, criteria, loops, and maintenance.
Ch4. Engineering Mechanics
Treat inputs, state, tools, evaluation, approval, sandboxing, and cleanup as system design.
Ch5. Evaluation Loops
Decide when planner, builder, evaluator, and QA should be separated.
Ch6. Case Comparison
Compare OpenAI, Anthropic, Toss, gstack, and revfactory on the same axes.
Ch7. OpenAI
Repo-readable systems, Agents SDK harnesses, sandboxing, observability, and cleanup.
Ch8. Anthropic
Load-bearing scaffolding, managed runtime, and auto approval classifier boundaries.
Ch9. Toss
Frictionless harnesses, executable SSOT, and domain human-in-the-loop (HITL) review.
Ch10. gstack
A strongly opinionated workflow across many AI coding agent hosts.
Ch11. revfactory/harness
Harness generation as a team architecture process.
Ch12. Domain Playbooks
Translate harness design into frontend, platform, payments, and AI product contexts.
Appendix. Verification Report
Source, structure, and build verification baseline.
Appendix. Updates
Track interpretation changes and source evidence.
Related Handbooks
Recommended Cross-Reads
- /en/books/llmops-agentops: production operations for AI systems
- /en/books/codex-advanced, /en/books/claude-code-advanced: tool-specific implementation practices
- /ko/books/agent-orchestration-patterns: multi-agent design patterns (Korean)