Evaluation Loop Design
Use Anthropic and gstack patterns to decide when planner, builder, evaluator, and QA should be separated.
Key takeaways
- Builder-only loops are fast but underestimate scope, declare done too early, and miss what an independent reviewer would catch.
- Choose the loop by work type: builder-only for drafts, planner+builder+reviewer for refactors, add browser QA for UI, and HITL for deploy or migration.
- Anthropic's point is not "always add planner and evaluator" but to measure where a role actually catches failures the builder routinely misses.
- gstack treats Test and Ship as separate stages within its Think → Plan → Build → Review → Test → Ship → Reflect flow.
- Browser QA verifies behavior rather than intention, so it is often load-bearing for UI products, and a sprint contract tells reviewers what to judge.
Harness quality often comes as much from the verification loop as from the document structure. The loop patterns described by Anthropic and gstack are especially useful references here.
Why Builder-Only Loops Fail
When one agent plans, implements, reviews, and QA-checks everything, it can move fast but fail in predictable ways.
- Scope is underestimated.
- Requirements are declared satisfied too early.
- Browser or runtime behavior is not checked deeply enough.
- The agent misses what an independent reviewer would catch.
Anthropic frames this as the question: which scaffolding is load-bearing?
Basic Loop
Loop Options
Builder only
Pros:
- Fastest path.
- Low process overhead.
Cons:
- Strong self-evaluation bias.
- More omissions on risky work.
Use for:
- local experiments;
- one-off scripts;
- very small changes.
Planner + Builder
Pros:
- Reduces scope underestimation.
- Fixes done criteria before implementation.
Cons:
- Quality judgment can still be optimistic.
Use for:
- medium feature work;
- refactors;
- exploration before structural changes.
Planner + Builder + Evaluator/QA
Pros:
- Separates requirements from runtime verification.
- Connects browser, tests, and release gates.
Cons:
- Adds cost and latency.
- Slows work if role contracts are vague.
Use for:
- user-facing features;
- large changes;
- deployment risk.
What to Learn from Anthropic
Anthropic's useful message is not "always use planner and evaluator." It is to measure where they are load-bearing.
- Planners reduce underspecified scope.
- Evaluators inspect from a different perspective than generators.
- Stronger models may reduce some scaffolding.
- Whether scaffolding can be removed must be proven experimentally.
Core Interpretation
The role itself is not the point. The point is whether the role catches failures the builder routinely misses.
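One way to make "load-bearing" measurable is to log which role first caught each failure and see how much of what the builder missed a given role actually catches. The sketch below is only an illustration: the `Failure` record, the role names, and the 20% threshold are assumptions, not something Anthropic or gstack prescribes.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Failure:
    """One defect found during a sprint (hypothetical record shape)."""
    description: str
    caught_by: str  # role that first caught it, e.g. "builder" or "evaluator"


def catch_share(failures: list[Failure]) -> dict[str, float]:
    """Return the fraction of all failures first caught by each role."""
    counts = Counter(f.caught_by for f in failures)
    total = sum(counts.values())
    return {role: n / total for role, n in counts.items()} if total else {}


def is_load_bearing(failures: list[Failure], role: str, threshold: float = 0.2) -> bool:
    """Assumed rule of thumb: a role is load-bearing if it catches at least
    `threshold` of the failures the builder did not catch itself."""
    missed_by_builder = [f for f in failures if f.caught_by != "builder"]
    if not missed_by_builder:
        return False
    caught = sum(1 for f in missed_by_builder if f.caught_by == role)
    return caught / len(missed_by_builder) >= threshold


log = [
    Failure("off-by-one in pagination", "builder"),
    Failure("missing empty-state handling", "evaluator"),
    Failure("requirement skipped", "evaluator"),
    Failure("wrong rollback step", "human"),
]
print(catch_share(log))
print(is_load_bearing(log, "evaluator"))
```

Running the same measurement with and without a role over several sprints is what turns "can this scaffolding be removed?" into an experiment rather than a guess.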
What to Learn from gstack
gstack turns the loop into a practical sprint flow.
| Stage | Meaning |
|---|---|
| Think | Define problem and hypothesis |
| Plan | Decide scope and approach |
| Build | Implement |
| Review | Check code, design, and security |
| Test | Verify with browser, tests, and devices |
| Ship | Release and update docs |
| Reflect | Capture learning |
The important part is that Test and Ship are kept as separate stages.
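The stage table can be sketched as a small state machine in which Ship is reachable only after Test has passed. This is a hypothetical illustration, not gstack's own implementation; the `Stage` enum and `next_stage` function are names invented here.

```python
from enum import Enum


class Stage(Enum):
    THINK = 1
    PLAN = 2
    BUILD = 3
    REVIEW = 4
    TEST = 5
    SHIP = 6
    REFLECT = 7


def next_stage(current: Stage, test_passed: bool = False) -> Stage:
    """Advance one stage; refuse to enter SHIP without a passing TEST."""
    if current is Stage.REFLECT:
        raise ValueError("REFLECT is the final stage")
    if current is Stage.TEST and not test_passed:
        raise RuntimeError("cannot ship: TEST has not passed")
    return Stage(current.value + 1)


print(next_stage(Stage.BUILD))                   # Stage.REVIEW
print(next_stage(Stage.TEST, test_passed=True))  # Stage.SHIP
```

Keeping the gate explicit means a "done" claim from the build or review stage cannot silently turn into a release.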
Which Loop to Use
| Work type | Recommended loop | Why |
|---|---|---|
| Drafts, notes, internal research | Builder only | Low failure cost |
| Docs cleanup, refactor, medium feature | Planner + Builder + Reviewer | Scope and consistency matter |
| UI or user-facing work | Planner + Builder + Reviewer + Browser QA | "Implemented" and "works" differ |
| Deploy, migration, permissions, security | Planner + Builder + Evaluator + HITL | Approval and rollback are mandatory |
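The table above can be encoded as a lookup so that a harness picks the loop from the work type instead of leaving it to the builder. The work-type keys and role names below are shorthand invented for this sketch, and falling back to the strictest loop for unknown work types is an assumption, not part of the table.

```python
# Roles per work type, mirroring the rows of the table above.
LOOPS: dict[str, tuple[str, ...]] = {
    "draft": ("builder",),
    "refactor": ("planner", "builder", "reviewer"),
    "ui": ("planner", "builder", "reviewer", "browser_qa"),
    "deploy": ("planner", "builder", "evaluator", "hitl"),
}


def choose_loop(work_type: str) -> tuple[str, ...]:
    """Return the roles for a work type.

    Unknown work types fall back to the strictest loop (an assumption made
    here so that unclassified work is never under-verified).
    """
    return LOOPS.get(work_type, LOOPS["deploy"])


print(choose_loop("ui"))
print(choose_loop("unclassified"))
```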
Start with a Sprint Contract
task_contract:
  goal: "What must be finished"
  non_goals:
    - "What not to touch"
  constraints:
    - "Rules that must not break"
  validation:
    - "Required checks"
  escalation:
    - "Conditions that stop automation"

Without this contract, reviewers and evaluators do not know what to judge.
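The contract fields could be mirrored in code so that a harness refuses to start a sprint whose contract gives reviewers nothing to judge. This is a minimal sketch: the `TaskContract` class and `missing_fields` method are hypothetical names, and treating every section as required is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class TaskContract:
    """In-memory form of the task_contract fields (hypothetical class)."""
    goal: str
    non_goals: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)
    validation: list[str] = field(default_factory=list)
    escalation: list[str] = field(default_factory=list)

    def missing_fields(self) -> list[str]:
        """Return the names of sections that are empty."""
        missing = []
        if not self.goal.strip():
            missing.append("goal")
        for name in ("non_goals", "constraints", "validation", "escalation"):
            if not getattr(self, name):
                missing.append(name)
        return missing


contract = TaskContract(
    goal="Migrate settings page to the new form component",
    validation=["unit tests pass", "browser QA on settings page"],
)
print(contract.missing_fields())
```

An empty `validation` or `escalation` list is the case worth blocking on: it means the reviewer has no required checks and the automation has no stop condition.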
Why Browser QA Often Unlocks Quality
Text-based self-checks usually verify intention. Browser QA verifies behavior.
- Is the interaction actually working?
- Does the layout break?
- Are loading, error, and edge states handled?
- Is something visibly wrong even if the code looks right?
For UI products, browser QA is often load-bearing.
Questions Before Adding an Evaluator
- Does the builder repeat the same failure type?
- Are problems found only right before release?
- Does the agent say "done" before browser/log/test evidence exists?
- Do reviewers frequently catch requirement omissions?
- Is human approval the last real safety mechanism?
If two or more are true, split evaluator or QA work out into a separate role.
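The checklist and its two-or-more rule can be written down directly. The question keys below are shorthand invented for this sketch; unanswered questions are counted as "no", which is an assumption.

```python
# One key per checklist question, in the same order as the list above.
QUESTIONS = (
    "builder_repeats_failure_type",
    "problems_found_right_before_release",
    "done_claimed_without_evidence",
    "reviewers_catch_requirement_omissions",
    "human_approval_is_last_safety_net",
)


def should_separate_evaluator(answers: dict[str, bool]) -> bool:
    """Return True when two or more checklist answers are yes."""
    yes_count = sum(bool(answers.get(q, False)) for q in QUESTIONS)
    return yes_count >= 2


print(should_separate_evaluator({
    "builder_repeats_failure_type": True,
    "done_claimed_without_evidence": True,
}))
```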
Conclusion
Good harnesses do not love complexity. They keep only the verification stages that actually support quality.