Evaluation Loop Design

Use Anthropic and gstack patterns to decide when planner, builder, evaluator, and QA should be separated.

Key takeaways

Builder-only loops are fast but underestimate scope, declare done too early, and miss what an independent reviewer would catch.
Choose the loop by work type: builder-only for drafts, planner+builder+reviewer for refactors, add browser QA for UI, and add human-in-the-loop (HITL) approval for deploy or migration.
Anthropic's point is not "always add a planner and an evaluator" but "measure where a role actually catches failures the builder routinely misses."
gstack treats Test and Ship as separate stages in its Think, Plan, Build, Review, Test, Ship, Reflect flow.
Browser QA verifies behavior rather than intention, so it is often load-bearing for UI products, and a sprint contract tells reviewers what to judge.

Harness quality often comes from the verification loop as much as from the document structure. The Anthropic and gstack patterns are useful references for designing that loop.

Why Builder-Only Loops Fail

When one agent plans, implements, reviews, and QA-checks everything, it can move fast but fail in predictable ways.

Scope is underestimated.
Requirements are declared satisfied too early.
Browser or runtime behavior is not checked deeply enough.
The agent misses what an independent reviewer would catch.

Anthropic frames this as the question: which scaffolding is load-bearing?

Basic Loop

Loop Options

Option 1: builder only

Pros:

Fastest path.
Low process overhead.

Cons:

Strong self-evaluation bias.
More omissions on risky work.

Use for:

local experiments;
one-off scripts;
very small changes.

Option 2: planner separated from builder

Pros:

Reduces scope underestimation.
Fixes done criteria before implementation.

Cons:

Quality judgment can still be optimistic.

Use for:

medium feature work;
refactors;
exploration before structural changes.

Option 3: evaluator and QA separated as well

Pros:

Separates requirements from runtime verification.
Connects browser, tests, and release gates.

Cons:

Adds cost and latency.
Slows work if role contracts are vague.

Use for:

user-facing features;
large changes;
deployment risk.

What to Learn from Anthropic

Anthropic's useful message is not "always use planner and evaluator." It is "measure where they are load-bearing."

Planners reduce underspecified scope.
Evaluators inspect from a different perspective than generators.
Stronger models may reduce some scaffolding.
Whether scaffolding can be removed must be proven experimentally.

Core Interpretation

The role itself is not the point. The point is whether the role catches failures the builder routinely misses.
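
One way to make "load-bearing" measurable is to log, per task, whether a defect got past the builder and whether the separate evaluator caught it. The sketch below is only an illustration under assumptions: `TaskRecord`, its fields, and the catch-rate metric are hypothetical names chosen here, not something Anthropic prescribes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaskRecord:
    """One finished task. All field names are hypothetical."""
    task_id: str
    builder_missed_defect: bool  # a defect survived the builder's own self-check
    evaluator_caught_it: bool    # the separate evaluator flagged that defect


def evaluator_catch_rate(records: list[TaskRecord]) -> Optional[float]:
    """Share of builder-missed defects that the evaluator caught.

    Returns None when the builder missed nothing, because the rate is
    undefined without any missed defects to catch.
    """
    missed = [r for r in records if r.builder_missed_defect]
    if not missed:
        return None
    caught = sum(1 for r in missed if r.evaluator_caught_it)
    return caught / len(missed)


# Made-up records, for illustration only.
records = [
    TaskRecord("t1", builder_missed_defect=True, evaluator_caught_it=True),
    TaskRecord("t2", builder_missed_defect=True, evaluator_caught_it=False),
    TaskRecord("t3", builder_missed_defect=False, evaluator_caught_it=False),
    TaskRecord("t4", builder_missed_defect=True, evaluator_caught_it=True),
]
rate = evaluator_catch_rate(records)
print(rate)  # 2 of the 3 missed defects were caught
```

A rate near zero over a meaningful sample would suggest the role is not load-bearing for that kind of work; a high rate suggests removing it needs evidence first.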

What to Learn from gstack

gstack turns the loop into a practical sprint flow.

Stage	Meaning
Think	Define problem and hypothesis
Plan	Decide scope and approach
Build	Implement
Review	Check code, design, and security
Test	Verify with browser, tests, and devices
Ship	Release and update docs
Reflect	Capture learning

The important part is that test and ship are respected as separate stages.
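
The stage order in the table can be sketched as a small state machine that refuses to reach Ship until Test has passed. This is a minimal sketch, not gstack's implementation: the `Stage` and `Sprint` classes and the rule that a passing test result gates Ship are assumptions made for illustration.

```python
from enum import IntEnum


class Stage(IntEnum):
    THINK = 1
    PLAN = 2
    BUILD = 3
    REVIEW = 4
    TEST = 5
    SHIP = 6
    REFLECT = 7


class Sprint:
    """Tracks the current stage and refuses to skip the Test gate (hypothetical)."""

    def __init__(self) -> None:
        self.stage = Stage.THINK
        self.test_passed = False

    def record_test_result(self, passed: bool) -> None:
        if self.stage is not Stage.TEST:
            raise RuntimeError("test results can only be recorded during TEST")
        self.test_passed = passed

    def advance(self) -> Stage:
        if self.stage is Stage.REFLECT:
            raise RuntimeError("sprint already finished")
        # Ship is its own stage: it is reachable only after Test has passed.
        if self.stage is Stage.TEST and not self.test_passed:
            raise RuntimeError("cannot ship: TEST has not passed")
        self.stage = Stage(self.stage + 1)
        return self.stage


sprint = Sprint()
for _ in range(4):  # THINK -> PLAN -> BUILD -> REVIEW -> TEST
    sprint.advance()

try:
    sprint.advance()  # no test result recorded yet
except RuntimeError as err:
    print(err)

sprint.record_test_result(True)
print(sprint.advance().name)
```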

Which Loop to Use

Work type	Recommended loop	Why
Drafts, notes, internal research	Builder only	Low failure cost
Docs cleanup, refactor, medium feature	Planner + Builder + Reviewer	Scope and consistency matter
UI or user-facing work	Planner + Builder + Reviewer + Browser QA	"Implemented" and "works" differ
Deploy, migration, permissions, security	Planner + Builder + Evaluator + HITL	Approval and rollback are mandatory
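
The table above can be encoded as a plain lookup so that a harness picks roles from the work type rather than from habit. The enum members and role strings below are hypothetical identifiers chosen for this sketch; only the mapping itself comes from the table.

```python
from enum import Enum


class WorkType(Enum):
    DRAFT = "drafts, notes, internal research"
    MEDIUM_CHANGE = "docs cleanup, refactor, medium feature"
    USER_FACING = "UI or user-facing work"
    HIGH_RISK = "deploy, migration, permissions, security"


# Role names are illustrative labels, not an established API.
LOOP_FOR_WORK_TYPE: dict[WorkType, tuple[str, ...]] = {
    WorkType.DRAFT: ("builder",),
    WorkType.MEDIUM_CHANGE: ("planner", "builder", "reviewer"),
    WorkType.USER_FACING: ("planner", "builder", "reviewer", "browser_qa"),
    WorkType.HIGH_RISK: ("planner", "builder", "evaluator", "human_in_the_loop"),
}


def recommended_loop(work_type: WorkType) -> tuple[str, ...]:
    """Return the roles the table recommends for this kind of work."""
    return LOOP_FOR_WORK_TYPE[work_type]


print(recommended_loop(WorkType.USER_FACING))
```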

Start with a Sprint Contract

task_contract:
  goal: "What must be finished"
  non_goals:
    - "What not to touch"
  constraints:
    - "Rules that must not break"
  validation:
    - "Required checks"
  escalation:
    - "Conditions that stop automation"

Without this contract, reviewers and evaluators do not know what to judge.
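
A harness could refuse to start a sprint while the contract has empty sections, since those are the sections a reviewer would have nothing to judge against. The following is a rough sketch under assumptions: the `TaskContract` class mirrors the field names of the template above, while `missing_sections` and the example values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TaskContract:
    """In-memory form of the task_contract template (hypothetical class)."""
    goal: str
    non_goals: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)
    validation: list[str] = field(default_factory=list)
    escalation: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Names of sections that are still empty."""
        missing = []
        if not self.goal.strip():
            missing.append("goal")
        for name in ("non_goals", "constraints", "validation", "escalation"):
            if not getattr(self, name):
                missing.append(name)
        return missing


# Example values are invented for illustration.
contract = TaskContract(
    goal="Paginate the orders list endpoint",
    non_goals=["Do not change the orders schema"],
    constraints=["Existing API clients keep working"],
    validation=["Unit tests pass", "Page 2 checked in the browser"],
)
print(contract.missing_sections())  # ['escalation']
```

Here the contract says what to build and how to check it, but not when automation must stop, so the check reports `escalation` before any work begins.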

Why Browser QA Often Unlocks Quality

Text-based self-checks usually verify intention. Browser QA verifies behavior.

Is the interaction actually working?
Does the layout break?
Are loading, error, and edge states handled?
Is something visibly wrong even if the code looks right?

For UI products, browser QA is often load-bearing.
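
The difference between intention and behavior can be made explicit by recording, for each of the four questions above, whether it was actually observed in a browser. The sketch below does not drive a browser; it only models the bookkeeping. `QaObservation`, `qa_verdict`, the check names, and the evidence file names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# One entry per question in the list above (names are illustrative).
QA_CHECKS = ("interaction", "layout", "states", "visual")


@dataclass(frozen=True)
class QaObservation:
    check: str               # one of QA_CHECKS
    passed: bool
    evidence: Optional[str]  # e.g. a screenshot or trace file; None = not observed


def qa_verdict(observations: list[QaObservation]) -> dict[str, str]:
    """Classify each check as 'pass', 'fail', or 'unverified'."""
    verdict = {check: "unverified" for check in QA_CHECKS}
    for obs in observations:
        if obs.check not in verdict:
            raise ValueError(f"unknown check: {obs.check}")
        if obs.evidence is None:
            continue  # a claim without observed evidence stays unverified
        verdict[obs.check] = "pass" if obs.passed else "fail"
    return verdict


observations = [
    QaObservation("interaction", True, "click-trace.zip"),
    QaObservation("layout", False, "mobile-viewport.png"),
    QaObservation("states", True, None),  # asserted by the builder, never observed
]
result = qa_verdict(observations)
print(result)
```

In this example `states` stays unverified even though the builder reported it as passing, which is the gap that text-based self-checks tend to leave open.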

Questions Before Adding an Evaluator

Does the builder repeat the same failure type?
Are problems found only right before release?
Does the agent say "done" before browser/log/test evidence exists?
Do reviewers frequently catch requirement omissions?
Is human approval the last real safety mechanism?

If two or more are true, split evaluator or QA work out into a separate role.
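
The two-or-more rule can be written down as a small check. This is a sketch only: the question keys and the function name are hypothetical, and the default threshold of 2 is taken from the rule above.

```python
# Keys paraphrase the five questions above; the identifiers are illustrative.
EVALUATOR_QUESTIONS = (
    "builder_repeats_same_failure_type",
    "problems_found_only_before_release",
    "done_claimed_before_evidence",
    "reviewers_catch_requirement_omissions",
    "human_approval_is_last_safety_mechanism",
)


def should_separate_evaluator(answers: dict[str, bool], threshold: int = 2) -> bool:
    """Return True when at least `threshold` questions are answered yes.

    Unanswered questions count as "no"; unknown keys are rejected so a
    typo cannot silently lower the count.
    """
    unknown = set(answers) - set(EVALUATOR_QUESTIONS)
    if unknown:
        raise KeyError(f"unknown questions: {sorted(unknown)}")
    yes_count = sum(1 for q in EVALUATOR_QUESTIONS if answers.get(q, False))
    return yes_count >= threshold


answers = {
    "done_claimed_before_evidence": True,
    "reviewers_catch_requirement_omissions": True,
    "builder_repeats_same_failure_type": False,
}
print(should_separate_evaluator(answers))  # True: two questions answered yes
```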

Conclusion

Good harnesses do not love complexity. They keep only the verification stages that actually support quality.

