Name: Harness Engineering
Author: reopt

A practical guide to harness design, evaluation, and operations based on OpenAI, Anthropic, Toss, gstack, revfactory, Agents SDK, and Managed Agents patterns

Harness engineering is not about writing a better prompt. It is the practice of designing the work environment that lets agents take on longer, larger, and riskier tasks.

Teams using the same model and the same IDE can see very different outcomes because their context structure, verification loop, approval boundary, documentation quality, and tool access differ.

Core Thesis

Generic harnesses are useful starting points, but performance depends on how explicitly your team externalizes domain rules, operating criteria, and runtime boundaries.

English Edition

This edition translates and localizes the Korean handbook for platform, AgentOps, and developer productivity teams standardizing AI coding agents across repositories.

Source Map

Source	Question this handbook takes from it	Summary
OpenAI `Harness Engineering`	Where does the agent work?	The repository, docs, browser, logs, and cleanup loop are part of the harness
OpenAI Agents SDK / Codex updates	Which harness primitives are becoming product surfaces?	MCP, skills, AGENTS.md, sandbox, shell, apply_patch, hooks, and plugins are becoming common infrastructure
Anthropic harness / Managed Agents / auto mode	How is an agent verified, isolated, and approved?	Separate planner/evaluator, session, harness, sandbox, and permission classifier boundaries
Toss harness article	How does a harness roll out to a team?	Personal habits must become an executable single source of truth (SSOT) and a shared workflow
gstack	How does a harness become a workflow across many agent hosts?	Think -> Plan -> Build -> Review -> Test -> Ship -> Reflect becomes a command surface
revfactory/harness	How can harness design become repeatable?	Domain analysis generates agent teams, skills, and validation loops

Where Harness Begins

A prompt improves a single response.

It clarifies the goal.
It constrains the output format.
It improves one model call.

Context improves the material the model can use.

Which files to read.
Which docs to trust.
Which local rules to prioritize.

A harness improves the whole work system.

Planning, implementation, review, QA, approval, and release.
Browser, logs, tests, and repository docs.
A loop to recover when the first attempt fails.
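
The recovery loop is where a harness differs most visibly from a single prompt. The sketch below is only an illustration, not a pattern taken from any of the cited sources: `run_with_recovery`, `fake_agent`, and `fake_verify` are hypothetical names, the agent is a stand-in function rather than a real model call, and the verifier is a single hard-coded check.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Attempt:
    number: int
    output: str
    passed: bool
    feedback: str


@dataclass
class HarnessResult:
    succeeded: bool
    attempts: List[Attempt] = field(default_factory=list)


def run_with_recovery(
    task: str,
    agent: Callable[[str, str], str],
    verify: Callable[[str], Tuple[bool, str]],
    max_attempts: int = 3,
) -> HarnessResult:
    """Call the agent, verify its output, and retry with feedback on failure."""
    result = HarnessResult(succeeded=False)
    feedback = ""
    for number in range(1, max_attempts + 1):
        output = agent(task, feedback)
        passed, feedback = verify(output)
        result.attempts.append(Attempt(number, output, passed, feedback))
        if passed:
            result.succeeded = True
            break
    return result


def fake_agent(task: str, feedback: str) -> str:
    """Hypothetical stand-in for a model call: wrong first, fixed after feedback."""
    if feedback:
        return "def add(a, b): return a + b"
    return "def add(a, b): return a - b"


def fake_verify(output: str) -> Tuple[bool, str]:
    """Hypothetical verifier: run the generated code against one known case."""
    namespace: dict = {}
    exec(output, namespace)
    ok = namespace["add"](2, 3) == 5
    return (ok, "" if ok else "add(2, 3) should equal 5")


result = run_with_recovery("write add(a, b)", fake_agent, fake_verify)
print(result.succeeded, len(result.attempts))
```

The loop, the attempt log, and the verifier live outside the model call, so they can be reviewed and tightened like any other code. In a real harness the verifier would more likely be a test suite, a browser check, or a log query.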

A platform distributes harnesses across teams.

Shared skills, commands, templates, and plugins.
Domain rule layers.
Update logs, metrics, and garbage collection.
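
One way to picture "domain rule layers" is a platform-wide default that each domain team overrides only where it must. The following is a minimal sketch under that assumption. The rule keys (`approval`, `required_checks`), their values, and the `merge_layers` helper are hypothetical and do not come from any of the tools named in this handbook.

```python
from typing import Any, Dict


def merge_layers(*layers: Dict[str, Any]) -> Dict[str, Any]:
    """Merge rule layers left to right; later layers override earlier ones.

    Nested dicts are merged key by key. Any other value (including lists)
    is replaced wholesale by the later layer.
    """
    merged: Dict[str, Any] = {}
    for layer in layers:
        for key, value in layer.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                merged[key] = merge_layers(merged[key], value)
            else:
                merged[key] = value
    return merged


# Hypothetical platform-wide defaults shared by every team.
platform_defaults = {
    "approval": {"shell": "ask", "network": "deny"},
    "required_checks": ["lint", "unit-tests"],
}

# Hypothetical domain layer for a payments team: stricter shell policy
# and one extra check before release.
payments_layer = {
    "approval": {"shell": "deny"},
    "required_checks": ["lint", "unit-tests", "ledger-reconciliation"],
}

effective = merge_layers(platform_defaults, payments_layer)
print(effective)
```

Keeping the layers as separate data makes the domain team's delta explicit: a reviewer sees only what the payments layer changes, while the platform defaults stay in one place.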

Questions This Book Answers

How is harness engineering different from prompt engineering and context engineering?
What elements make a harness effective?
Why are inputs, state, verification, and permissions engineering concerns?
What do OpenAI, Anthropic, Toss, gstack, and revfactory each emphasize?
How do Agents SDK, Managed Agents, sandboxing, MCP, skills, hooks, and plugins change team design?
Why should teams converge toward their own harness instead of copying someone else's?
In what order should a team design, roll out, and operate a harness?

Who This Is For

Reader	What you get
AI coding agent adoption lead	A way to turn personal tricks into a team system
Codex, Claude Code, Cursor, or agentic IDE user	A view beyond tool usage into work-environment design
AgentOps or platform engineer	An operating frame for evaluation, approval, observability, and docs
Team defining internal AI standards	A method for executable SSOT and common workflows

Five-Minute Diagnostic

Current pain	Start here
Same model, very different team outcomes	`foundations` -> `engineering-mechanics`
Lots of prompts, weak repeatability	`five-elements` -> `engineering-mechanics`
Review and QA catch problems too late	`evaluation-loops` -> `case-anthropic`
Copied an external harness and it does not fit	`case-studies` -> `make-it-yours`
Docs, approvals, and browser checks are disconnected	`case-openai` -> `checklist`

Recommended Paths

Goal	Reading path
Understand the concept quickly	`foundations` -> `engineering-mechanics` -> `five-elements`
Compare external examples	`case-studies` -> `case-openai` -> `case-anthropic`
Apply it to a frontend team	`domain-playbooks` -> `scenario-frontend-team`
Build platform or monorepo rules	`domain-playbooks` -> `scenario-platform-team`
Manage payments or settlement risk	`domain-playbooks` -> `scenario-payments-team`
Operate AI product evaluation and rollout	`domain-playbooks` -> `scenario-ai-product-team`
Roll out to a team	`case-toss` -> `team-rollout`
Study workflow and release gates	`case-gstack` -> `operations`
Study meta-harness generation	`case-revfactory` -> `make-it-yours`

Ch1. Foundations

Define harness engineering and separate it from prompt and context engineering.

Ch2. Repo-Readable Systems

Make AGENTS.md, docs, observability, and executable SSOT part of the work environment.

Ch3. The Five Elements

Environment, roles, criteria, loops, and maintenance.

Ch4. Engineering Mechanics

Treat inputs, state, tools, evaluation, approval, sandboxing, and cleanup as system design.

Ch5. Evaluation Loops

Decide when planner, builder, evaluator, and QA should be separated.

Ch6. Case Comparison

Compare OpenAI, Anthropic, Toss, gstack, and revfactory on the same axes.

Ch7. OpenAI

Repo-readable systems, Agents SDK harnesses, sandboxing, observability, and cleanup.

Ch8. Anthropic

Load-bearing scaffolding, managed runtime, and auto-approval classifier boundaries.

Ch9. Toss

Frictionless harnesses, executable SSOT, and domain human-in-the-loop (HITL).

Ch10. gstack

A strongly opinionated workflow across many AI coding agent hosts.

Ch11. revfactory/harness

Harness generation as a team architecture process.

Ch12. Domain Playbooks

Translate harness design into frontend, platform, payments, and AI product contexts.

Appendix. Verification Report

Source, structure, and build verification baseline.

Appendix. Updates

Track interpretation changes and source evidence.

Recommended Cross-Reads

/en/books/llmops-agentops: production operations for AI systems
/en/books/codex-advanced, /en/books/claude-code-advanced: tool-specific implementation practices
/ko/books/agent-orchestration-patterns: multi-agent design patterns (Korean)
