External Case Comparison
Compare OpenAI, Anthropic, Toss, gstack, and revfactory/harness by input, state, verification, and rollout.
Key takeaways
- The useful question across cases is not "who is right?" but which distinct problem each example solves.
- OpenAI tackles search cost and doc entropy (knowledge architecture); Anthropic tackles self-evaluation bias and runtime coupling (control-loop engineering).
- Toss solves team distribution and reproducibility; gstack solves parallel work without chaos; revfactory/harness addresses the repeated work of designing harnesses.
- The comparison table aligns each case along core input, externalized state, verification interface, and approval/rollout.
- Shared conclusion: a better work environment beats a better single prompt, and long work needs external state, evaluation loops, and operations.
When studying harness engineering, the question is not "who is right?" The useful question is: which problem is each example solving?
OpenAI
Repo-readable systems, observability, runtime surface, and cleanup.
Anthropic
Planner/evaluator, managed runtime, permission classifier, and handoff.
Toss
Executable SSOT, domain layers, and frictionless team rollout.
gstack
Sprint, command surface, QA, and release gate.
revfactory/harness
Domain-first harness generation and team architecture.
Comparison Table
| Case | Core input | Externalized state | Verification interface | Approval / rollout | Strongest message |
|---|---|---|---|---|---|
| OpenAI | AGENTS.md, docs/, MCP, skills, sandbox | Docs, code, observability, workspace manifest | Browser, logs, metrics, hooks | Cleanup, remote approval, Secure MCP, plugins | Repo + runtime surface is the harness |
| Anthropic | Task contract, permission policy | Durable session log, planner/builder/evaluator handoff | Evaluator, QA, permission classifier | Retry budget, handoff, managed runtime | Separate load-bearing scaffolding and runtime boundaries |
| Toss | Global/domain/local rules | Workflow and SSOT | Executable docs and procedures | Domain HITL | Push harnesses into executable team systems |
| gstack | Sprint phase, command, host adapter | Phase artifacts, checkpoint, learning | Review, test, ship, browser/device QA | Team mode, auto-update, release gate | Run it like a software factory |
| revfactory/harness | Domain analysis | Agent/skill files, team architecture | Validation and testing, A/B pilot | Generated harness refinement | A harness can generate a harness |
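The four axes in the table can also be treated as a small record type, which makes it easier to line up a new case against the existing ones. The following is a minimal sketch, not anything published by the projects above: the `HarnessCase` class, its field names, and the `compare` helper are hypothetical names introduced here, and only two rows are transcribed from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HarnessCase:
    """One row of the comparison table (hypothetical structure for illustration)."""
    name: str
    core_input: tuple[str, ...]
    externalized_state: tuple[str, ...]
    verification: tuple[str, ...]
    rollout: tuple[str, ...]


# Two rows transcribed from the table above; the others follow the same shape.
CASES = (
    HarnessCase(
        name="Anthropic",
        core_input=("Task contract", "Permission policy"),
        externalized_state=("Durable session log", "Planner/builder/evaluator handoff"),
        verification=("Evaluator", "QA", "Permission classifier"),
        rollout=("Retry budget", "Handoff", "Managed runtime"),
    ),
    HarnessCase(
        name="Toss",
        core_input=("Global/domain/local rules",),
        externalized_state=("Workflow and SSOT",),
        verification=("Executable docs and procedures",),
        rollout=("Domain HITL",),
    ),
)


def compare(cases, axis):
    """Return {case name: entries} for a single axis, e.g. 'verification'."""
    return {case.name: getattr(case, axis) for case in cases}


if __name__ == "__main__":
    for name, entries in compare(CASES, "verification").items():
        print(f"{name}: {', '.join(entries)}")
```

Reading one axis at a time keeps the comparison about which problem each case solves, rather than ranking the cases against each other.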
Which Technical Problem Is Being Solved?
| Case | Problem | Technical reading |
|---|---|---|
| OpenAI | Search cost and documentation entropy | Knowledge architecture |
| Anthropic | Long-running self-evaluation bias and runtime coupling | Control-loop and runtime-boundary engineering |
| Toss | Team distribution and reproducibility | Workflow distribution |
| gstack | Parallel work without chaos | Production pipeline design |
| revfactory/harness | Repeated harness design work | Meta-architecture generation |
Updates as of 2026-05-23
| Case | Latest addition |
|---|---|
| OpenAI | Agents SDK model-native harness, sandbox execution, TypeScript sandbox agents, Secure MCP Tunnel, Codex remote/hooks, Developers plugin |
| Anthropic | Claude Code auto mode prompt-injection probe and transcript classifier, Managed Agents session-harness-sandbox split, finance agent templates |
| gstack | 23 specialists, 8 power tools, 10 AI coding agent hosts, team mode auto-update, iOS live-device QA, checkpoint and learning flows |
| revfactory/harness | v1.2.0 L3 Meta-Factory / Team-Architecture Factory, marketplace install, Harness 100, author-measured A/B results with caveat |
Detailed Interpretation
Recommended Order
| Need | Read first |
|---|---|
| Improve repo and docs | OpenAI |
| Design evaluation loops and retry budgets | Anthropic |
| Roll out team workflows | Toss |
| Build opinionated sprint pipelines | gstack |
| Generate domain-specific harnesses | revfactory/harness |
Shared Conclusion
- A better work environment matters more than a better single prompt.
- Longer work requires external state and evaluation loops.
- Team adoption requires executable workflows, commands, and approvals.
- Generic templates are only starting points; performance gains come from domain-specific harnesses.
- Harnesses must be operated and cleaned up.
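The second point, that longer work needs external state and an evaluation loop, can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the implementation of any case above: `run_with_evaluation`, the JSON state file layout, and the toy `toy_build` / `toy_evaluate` functions are all hypothetical. The idea it shows is that the builder and the evaluator are separate functions, every attempt is written to disk so a later session can resume, and a retry budget bounds the loop.

```python
import json
import tempfile
from pathlib import Path


def run_with_evaluation(task, build, evaluate, state_path, retry_budget=3):
    """Run build/evaluate rounds, persisting every attempt to state_path."""
    state_path = Path(state_path)
    # Resume from external state if an earlier session left some behind.
    if state_path.exists():
        state = json.loads(state_path.read_text())
    else:
        state = {"task": task, "attempts": []}

    while len(state["attempts"]) < retry_budget:
        attempts = state["attempts"]
        if attempts and attempts[-1]["passed"]:
            break  # A previous attempt already passed evaluation.
        feedback = attempts[-1]["feedback"] if attempts else None
        output = build(task, feedback)
        # The evaluator is a separate function, so the builder never grades itself.
        passed, new_feedback = evaluate(task, output)
        attempts.append({"output": output, "passed": passed, "feedback": new_feedback})
        state_path.write_text(json.dumps(state, indent=2))
    return state


def toy_build(task, feedback):
    """Toy builder: improves its draft only after receiving feedback."""
    return "draft with tests" if feedback else "draft"


def toy_evaluate(task, output):
    """Toy evaluator: passes only output that mentions tests."""
    if "tests" in output:
        return True, None
    return False, "add tests"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        result = run_with_evaluation(
            "write parser", toy_build, toy_evaluate, Path(tmp) / "state.json"
        )
        print(len(result["attempts"]), result["attempts"][-1]["passed"])
```

In a real harness the builder and evaluator would be agent calls and the state file would more likely be a session log or a set of phase artifacts; the budget and the persisted attempts are the parts that carry over.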
References
- OpenAI, "Harness Engineering", 2026-02-11 https://openai.com/ko-KR/index/harness-engineering/
- OpenAI, "The next evolution of the Agents SDK", 2026-04-15 https://openai.com/index/the-next-evolution-of-the-agents-sdk/
- OpenAI, "Work with Codex from anywhere", 2026-05-14 https://openai.com/index/work-with-codex-from-anywhere/
- OpenAI API Changelog https://developers.openai.com/api/docs/changelog
- OpenAI Developers plugin for Codex https://developers.openai.com/learn/developers-codex-plugin
- Anthropic harness design https://www.anthropic.com/engineering/harness-design-long-running-apps
- Anthropic Claude Code auto mode https://www.anthropic.com/engineering/claude-code-auto-mode
- Anthropic Managed Agents https://www.anthropic.com/engineering/managed-agents
- Anthropic financial agents https://www.anthropic.com/news/finance-agents
- Toss harness article https://toss.tech/article/harness-for-team-productivity
- gstack README https://github.com/garrytan/gstack
- revfactory/harness README https://github.com/revfactory/harness