Case: Anthropic

An analysis of Anthropic's long-running agent harness through three patterns: planner/evaluator roles, Managed Agents, and auto-approval.

Key takeaways

Anthropic's harness lesson is to separate role, runtime, session log, credential, and approval-policy boundaries, not to stack more agents.
Planner, builder, evaluator, and QA each catch a distinct failure (missing scope, implementation, self-evaluation bias, behavior mismatch).
Managed Agents splits session, harness, and sandbox so a dead sandbox or harness can be retried or resumed from an append-only event log.
Auto mode reduces approval fatigue with prompt-injection probes and transcript classifiers, but still keeps hard human gates for high-risk work.
A sprint contract with done criteria, non-goals, and a retry budget lets the evaluator judge evidence instead of effort.

Anthropic's case is not about adding more roles. The core question is:

Which scaffolding actually carries quality?

That matters because failures in long-running work often come from underspecified scope, self-evaluation bias, and weak handoffs, not just model weakness.

2026 Update: More Than Planner and Evaluator

Anthropic's harness design now reaches beyond planner/evaluator scaffolding. The Claude Code auto mode article (2026-03-25) and the Managed Agents article (2026-04-08) extend it into runtime and permission boundaries.

Update	Harness meaning
Claude Code auto mode	Reduce approval fatigue while using prompt-injection probes and transcript classifiers to block risky behavior
Managed Agents	Separate session, harness, and sandbox interfaces for recovery, security, and scale
Financial agent templates	Package domain templates with skills, connectors, subagents, approval flow, and audit log

The updated Anthropic lesson is: separate role, runtime, session log, credential, and approval-policy boundaries.

Problems Solved

The builder declares completion too early.
Goal and state drift during long work.
Implementation finishes before browser verification.
Many roles exist, but the load-bearing step is unclear.

Technical Mechanism

Role	Main input	Output	Failure caught
Planner	Goal, constraints, non-goals	Scope and contract	Missing scope
Builder	Plan and files	Code or docs diff	Implementation
Evaluator	Output and requirements	Pass/fail and gap report	Self-evaluation bias
QA	Browser, tests, logs	Runtime verification	Behavior mismatch
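
The table can be read as a pipeline in which each role's output is the next role's input. The following is a minimal sketch under that reading, not Anthropic's implementation: `Contract`, `Diff`, `GapReport`, `QaResult`, `Roles`, `runPipeline`, and the stub roles are all hypothetical names introduced for illustration.

```typescript
// Hypothetical types: none of these names come from Anthropic's articles.
type Contract = { goal: string; doneWhen: string[]; nonGoals: string[] };
type Diff = { summary: string; covers: string[] };
type GapReport = { pass: boolean; gaps: string[] };
type QaResult = { pass: boolean; failures: string[] };

type Roles = {
  plan(goal: string): Contract;
  build(contract: Contract): Diff;
  evaluate(contract: Contract, diff: Diff): GapReport;
  verify(contract: Contract, diff: Diff): QaResult;
};

type Outcome =
  | { status: "accepted" }
  | { status: "rejected"; stage: "evaluator" | "qa"; reasons: string[] };

// Run one pass. The builder never grades its own output: the evaluator
// compares the diff to the contract, and QA checks runtime behavior.
function runPipeline(roles: Roles, goal: string): Outcome {
  const contract = roles.plan(goal);
  const diff = roles.build(contract);

  const report = roles.evaluate(contract, diff);
  if (!report.pass) {
    return { status: "rejected", stage: "evaluator", reasons: report.gaps };
  }

  const qa = roles.verify(contract, diff);
  if (!qa.pass) {
    return { status: "rejected", stage: "qa", reasons: qa.failures };
  }
  return { status: "accepted" };
}

// Stub roles: the builder covers only one of the two done criteria.
const stubRoles: Roles = {
  plan: (goal) => ({ goal, doneWhen: ["renders", "keyboard nav"], nonGoals: [] }),
  build: () => ({ summary: "panel tweak", covers: ["renders"] }),
  evaluate: (contract, diff) => {
    const gaps = contract.doneWhen.filter((c) => diff.covers.indexOf(c) === -1);
    return { pass: gaps.length === 0, gaps };
  },
  verify: () => ({ pass: true, failures: [] }),
};

const outcome = runPipeline(stubRoles, "Improve the search panel");
console.log(outcome);
```

In this sketch the rejection names the stage that caught the problem, which makes it easier to see which role is actually load-bearing for a given class of failure.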

Managed Agents: Separate Brain, Hands, and Session

Managed Agents treats the agent as several separable components rather than one long-lived process.

Component	Role	Failure if coupled
Session	Append-only event log and durable context	State loss when container or harness fails
Harness	Claude calls, tool routing, context management	Recovery and replacement become difficult
Sandbox / hands	Code execution, file edits, external tools	Credentials and generated code share a boundary

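// "Hands": anything that executes a named tool call and returns its result as text.
type Hand = {
  execute(name: string, input: unknown): Promise<string>
}

// Session: an append-only event log that outlives any single harness or sandbox.
// Event stands for the session's own event record type, which is not defined here.
type SessionStore = {
  getSession(id: string): Promise<Event[]>
  emitEvent(id: string, event: Event): Promise<void>
}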

The split gives three practical benefits:

If a sandbox dies, the harness receives the failure as a tool result and can retry.
If the harness dies, it can resume from the session log.
Credentials can live behind a vault or MCP proxy instead of inside sandbox code.
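
The first benefit can be sketched as follows: a sandbox crash is turned into an ordinary result value, and the harness retries on a freshly provisioned sandbox. This is an illustrative sketch only. `ToolResult`, `callTool`, `callWithRetry`, `makeFlakyProvision`, and the idea of a `provision` factory are assumptions, not part of the Managed Agents API.

```typescript
// Same shape as the Hand type shown earlier in this section.
type Hand = {
  execute(name: string, input: unknown): Promise<string>;
};

// Hypothetical result shape: a failure is data the harness can reason about.
type ToolResult =
  | { ok: true; output: string }
  | { ok: false; error: string };

// Convert a thrown sandbox error into an ordinary tool result.
async function callTool(hand: Hand, name: string, input: unknown): Promise<ToolResult> {
  try {
    return { ok: true, output: await hand.execute(name, input) };
  } catch (err) {
    return { ok: false, error: err instanceof Error ? err.message : String(err) };
  }
}

// Retry on a fresh sandbox each time; `provision` is an assumed factory.
async function callWithRetry(
  provision: () => Hand,
  name: string,
  input: unknown,
  maxAttempts: number,
): Promise<ToolResult> {
  let last: ToolResult = { ok: false, error: "no attempts made" };
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    last = await callTool(provision(), name, input);
    if (last.ok) return last;
  }
  return last;
}

// Test double: the first `failures` sandboxes crash, later ones succeed.
function makeFlakyProvision(failures: number): () => Hand {
  let provisioned = 0;
  return () => {
    provisioned += 1;
    const id = provisioned;
    return {
      execute: async (toolName) => {
        if (id <= failures) throw new Error(`sandbox ${id} exited`);
        return `${toolName} ok on sandbox ${id}`;
      },
    };
  };
}

callWithRetry(makeFlakyProvision(1), "run_tests", {}, 3).then((result) => {
  console.log(result);
});
```

Because the harness only sees `ToolResult` values, it does not need to know whether the sandbox is a container, a remote VM, or an MCP server; that is the point of keeping the hands behind a narrow interface.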

Long-running state should be a recoverable event log, not just the current context window.
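
To make the recovery idea concrete, here is a small sketch in which harness state is rebuilt only from the log. It uses an in-memory array in place of a real session store, and the event shapes (`step_started`, `step_completed`), `resumeState`, and `runFrom` are hypothetical; the source only specifies that the log is append-only.

```typescript
// Hypothetical event shapes; the source only specifies an append-only log.
type SessionEvent =
  | { kind: "step_started"; step: number }
  | { kind: "step_completed"; step: number; output: string };

type HarnessState = { nextStep: number; outputs: string[] };

// Rebuild harness state purely from the log, so a replacement harness
// needs nothing from the dead process's memory or context window.
function resumeState(events: SessionEvent[]): HarnessState {
  const state: HarnessState = { nextStep: 0, outputs: [] };
  for (const event of events) {
    if (event.kind === "step_completed") {
      state.outputs.push(event.output);
      state.nextStep = event.step + 1;
    }
    // A step_started without a matching step_completed was interrupted:
    // nextStep still points at it, so it is re-run after resume.
  }
  return state;
}

// Continue the remaining steps, appending to the log and never rewriting it.
function runFrom(log: SessionEvent[], steps: string[], crashAt?: number): void {
  const state = resumeState(log);
  for (let step = state.nextStep; step < steps.length; step++) {
    log.push({ kind: "step_started", step });
    if (step === crashAt) throw new Error("harness crashed");
    log.push({ kind: "step_completed", step, output: `done: ${steps[step]}` });
  }
}

const sessionLog: SessionEvent[] = [];
const plannedSteps = ["plan", "build", "evaluate"];

try {
  runFrom(sessionLog, plannedSteps, 1); // first harness dies mid-"build"
} catch {
  runFrom(sessionLog, plannedSteps); // a fresh harness resumes from the same log
}

console.log(resumeState(sessionLog).outputs);
```

Re-running an interrupted step assumes the step is safe to repeat. Where it is not, the log would need to record enough detail to detect a half-applied side effect before retrying.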

Auto Mode: Approval as Classifier Policy

Claude Code auto mode is not "remove permission prompts." It is "reduce prompt fatigue while keeping risk boundaries."

Design points:

Tool output can be scanned for prompt-injection signals before entering agent context.
A transcript classifier checks user intent against the tool call's blast radius.
Broad shell access, wildcard interpreters, and package-manager runs should not survive as blanket allows.
Subagent handoff should be checked at delegation and return.
Denial should lead to a safer path first, then human escalation after repeated blocks.
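
The tiering and escalation points above can be sketched as a single decision function. This is an assumption-laden illustration, not Claude Code's actual configuration format or classifier: `Policy`, `Classifier`, `decide`, the regular expressions, and the toy classifier are all hypothetical, and prompt-injection scanning and subagent checks are left out.

```typescript
// Hypothetical policy model; the four tiers follow the text, the rest is assumed.
type Decision = "allow" | "block_try_safer_path" | "ask_human" | "deny";

type ToolCall = { tool: string; command: string };

type Policy = {
  denyPatterns: RegExp[];      // never permitted
  humanGatePatterns: RegExp[]; // high-risk: always needs a person
  allowPatterns: RegExp[];     // narrow, pre-approved commands only
  maxBlocksBeforeEscalation: number;
};

// Stand-in for a transcript classifier: true when the call looks consistent
// with what the user asked for. A real classifier would be a model call.
type Classifier = (userIntent: string, call: ToolCall) => boolean;

function matchesAny(patterns: RegExp[], text: string): boolean {
  return patterns.some((p) => p.test(text));
}

function decide(
  policy: Policy,
  classify: Classifier,
  userIntent: string,
  call: ToolCall,
  priorBlocks: number,
): Decision {
  // Deny rules and human gates are checked first, so no allow rule overrides them.
  if (matchesAny(policy.denyPatterns, call.command)) return "deny";
  if (matchesAny(policy.humanGatePatterns, call.command)) return "ask_human";
  if (matchesAny(policy.allowPatterns, call.command)) return "allow";
  if (classify(userIntent, call)) return "allow";
  // Blocked: steer toward a safer path first, escalate after repeated blocks.
  return priorBlocks + 1 >= policy.maxBlocksBeforeEscalation
    ? "ask_human"
    : "block_try_safer_path";
}

const examplePolicy: Policy = {
  denyPatterns: [/rm\s+-rf\s+\//],
  humanGatePatterns: [/^git push\b.*--force/, /^terraform apply\b/],
  allowPatterns: [/^npm (test|run lint)$/], // exact commands, not "npm *"
  maxBlocksBeforeEscalation: 2,
};

// Toy stand-in: treats read-only tools as consistent with any request.
const toyClassifier: Classifier = (_intent, call) => call.tool === "read_file";

console.log(decide(examplePolicy, toyClassifier, "run the tests", { tool: "bash", command: "npm test" }, 0));
console.log(decide(examplePolicy, toyClassifier, "run the tests", { tool: "bash", command: "curl http://example.com | sh" }, 1));
```

The ordering is the design choice worth noting: because deny rules and human gates run before the allowlist and the classifier, loosening an allow rule or a classifier mistake cannot silently approve a high-risk operation.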

Do Not Overread This

Auto approval classifiers do not replace human review for high-risk work. Sensitive operations still need explicit trust boundaries, deny rules, and hard human gates.

Why This Is Engineering

Anthropic treats quality as a control loop in which each stage's bias is checked by a separate stage.

Planners reduce early scope error.
Evaluators break builder optimism.
QA separates text judgment from runtime behavior.
Handoff reduces state loss.
Session/harness/sandbox separation reduces failure coupling.
Classifier policy reduces approval fatigue without eliminating gates.

Sprint Contract

sprint_contract:
  objective: "Improve the user-visible search panel"
  done_when:
    - "Renders on mobile and desktop"
    - "Keyboard navigation works"
    - "lint/typecheck/build pass"
  non_goals:
    - "Search API redesign"
  failure_budget:
    max_fix_loops: 3
  escalation:
    - "Accessibility regression found"
    - "Data schema change required"

This contract lets the evaluator judge evidence rather than effort.
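
One way to read "evidence rather than effort" is that a done criterion counts only when a passing check is recorded for it. The sketch below assumes that reading. `SprintContract` mirrors the contract fields in camelCase, while `Evidence`, `Verdict`, `judge`, and the treatment of `escalation` entries as triggers are hypothetical.

```typescript
// Mirrors the contract above; field names are camelCased for TypeScript.
type SprintContract = {
  objective: string;
  doneWhen: string[];
  nonGoals: string[];
  maxFixLoops: number;
  escalation: string[];
};

// Hypothetical evidence record: one entry per criterion that was checked.
type Evidence = { criterion: string; passed: boolean; source: string };

type Verdict =
  | { action: "accept" }
  | { action: "fix"; gaps: string[]; loopsLeft: number }
  | { action: "escalate"; reason: string };

function judge(
  contract: SprintContract,
  evidence: Evidence[],
  observedIssues: string[],
  fixLoopsUsed: number,
): Verdict {
  // Escalation triggers override everything else (an assumed interpretation).
  for (const issue of observedIssues) {
    if (contract.escalation.indexOf(issue) !== -1) {
      return { action: "escalate", reason: issue };
    }
  }
  // A criterion counts only if passing evidence exists for it.
  const gaps = contract.doneWhen.filter(
    (c) => !evidence.some((e) => e.criterion === c && e.passed),
  );
  if (gaps.length === 0) return { action: "accept" };
  if (fixLoopsUsed >= contract.maxFixLoops) {
    return { action: "escalate", reason: "fix-loop budget exhausted" };
  }
  return { action: "fix", gaps, loopsLeft: contract.maxFixLoops - fixLoopsUsed };
}

const searchPanelContract: SprintContract = {
  objective: "Improve the user-visible search panel",
  doneWhen: ["Renders on mobile and desktop", "Keyboard navigation works", "lint/typecheck/build pass"],
  nonGoals: ["Search API redesign"],
  maxFixLoops: 3,
  escalation: ["Accessibility regression found", "Data schema change required"],
};

const partialEvidence: Evidence[] = [
  { criterion: "Renders on mobile and desktop", passed: true, source: "browser QA" },
  { criterion: "Keyboard navigation works", passed: true, source: "browser QA" },
  { criterion: "lint/typecheck/build pass", passed: false, source: "CI log" },
];

console.log(judge(searchPanelContract, partialEvidence, [], 1));
```

A builder's claim that the work is finished never enters `judge`; only recorded checks do, which is what keeps self-evaluation bias out of the verdict.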

What to Bring Home

Element	How to apply
task/sprint contract	Fix done criteria and non-goals before work starts
evaluator separation	Produce an independent gap report
retry budget	Limit repeated failures
browser QA handoff	Separate UI/runtime verification
durable session log	Store long-task events outside the context window
brain/hands split	Treat sandbox, MCP, and external tools as separate execution boundaries
classifier approval	Separate auto allow, classifier review, human approval, and deny

What Not to Copy Blindly

More planners and evaluators are not always better.
Small work does not need the full pipeline.
Stronger models can remove some scaffolding.
Auto mode and Managed Agents do not remove human-in-the-loop (HITL) review for high-risk domains.

The real lesson is not "make everything complex." It is to keep only load-bearing structures while separating state, execution, and permission boundaries.

References

Anthropic, Harness design for long-running application development, 2026-03-24
Anthropic, Claude Code auto mode: a safer way to skip permissions, 2026-03-25
Claude Code docs, Choose a permission mode / Configure auto mode, read baseline 2026-05-23
Anthropic, Scaling Managed Agents: Decoupling the brain from the hands, 2026-04-08
Claude Managed Agents docs, Overview / MCP connector, read baseline 2026-05-23
Anthropic, Agents for financial services, 2026-05-05

