Scenario: AI Product Team

Design a harness in which eval sets, safety policy, online telemetry, and staged model rollout keep nondeterministic behavior under control.

Key takeaways

AI product harnesses resemble code harnesses but must also control model nondeterminism and drift.
Load-bearing elements are an eval set, a versioned prompt/policy spec, canary or shadow rollout, and online telemetry.
The recommended loop runs from a prompt/policy change through offline eval, safety checks, shadow/canary rollout, and online telemetry, rolling back if the rollout is unhealthy.
A rollout plan defines offline must-pass thresholds, a small canary traffic percentage, and explicit rollback triggers such as a success drop of more than 10%.
First 30 days: build an eval-set.jsonl covering 20 to 50 core user flows, version prompt and policy changes, and define a 5 percent canary with rollback thresholds.

AI product harnesses resemble code harnesses, but they must also handle model nondeterminism and drift.

Problem Structure

The same change can produce different model behavior from one run to the next.
Prompt or policy changes can degrade quality without obvious code diffs.
A change that passes offline evals can still fail on live traffic.
Safety policy and cost budget can drift away from implementation.
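
Because a prompt or policy edit leaves no code diff, one possible mitigation is to fingerprint the spec text and record that fingerprint with every eval run and rollout. The sketch below is only an illustration: `spec_fingerprint` is a hypothetical helper, and the file names are borrowed from the artifact list in this document.

```python
import hashlib


def spec_fingerprint(documents: dict[str, str]) -> str:
    """Return a short, stable hash over named spec documents.

    Keys are file names (for example "prompt-spec.md"); values are contents.
    This helper is hypothetical and not part of any existing tool.
    """
    digest = hashlib.sha256()
    # Sort by name so the fingerprint does not depend on dict ordering.
    for name in sorted(documents):
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")
        digest.update(documents[name].encode("utf-8"))
        digest.update(b"\0")
    return digest.hexdigest()[:12]


before = spec_fingerprint({
    "prompt-spec.md": "Answer briefly.",
    "safety-policy.md": "Refuse requests for credentials.",
})
after = spec_fingerprint({
    "prompt-spec.md": "Answer briefly and cite sources.",
    "safety-policy.md": "Refuse requests for credentials.",
})
# A one-line prompt edit yields a different fingerprint.
print(before, after, before != after)
```

Storing the fingerprint next to each eval report makes it possible to tell which spec version produced which score, even when the application code did not change.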

Load-Bearing Elements

Element	Why it matters
Eval set	Prevents quality judgment by feel
Prompt / policy spec	Makes changes comparable
Canary / shadow rollout	Limits online blast radius
Telemetry	Tracks quality, cost, and failure patterns
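
To make the telemetry row concrete, here is a minimal sketch of summarizing per-task events into quality, cost, and failure metrics. The `TaskEvent` fields are assumptions made for illustration; the metric names mirror the watch list in the rollout plan example in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskEvent:
    """One completed task attempt (hypothetical telemetry record)."""
    succeeded: bool
    tool_calls: int
    tool_failures: int
    cost_usd: float


def summarize(events: list[TaskEvent]) -> dict[str, float]:
    """Aggregate events into the three watched metrics."""
    successes = sum(e.succeeded for e in events)
    tool_calls = sum(e.tool_calls for e in events)
    tool_failures = sum(e.tool_failures for e in events)
    total_cost = sum(e.cost_usd for e in events)
    return {
        "completion_success": successes / len(events) if events else 0.0,
        "tool_failure_rate": tool_failures / tool_calls if tool_calls else 0.0,
        # Failed tasks still cost money, so total spend is divided by successes only.
        "cost_per_successful_task": (
            total_cost / successes if successes else float("inf")
        ),
    }


events = [
    TaskEvent(succeeded=True, tool_calls=3, tool_failures=0, cost_usd=0.25),
    TaskEvent(succeeded=True, tool_calls=2, tool_failures=0, cost_usd=0.25),
    TaskEvent(succeeded=True, tool_calls=1, tool_failures=0, cost_usd=0.25),
    TaskEvent(succeeded=False, tool_calls=4, tool_failures=1, cost_usd=0.75),
]
print(summarize(events))
```

Dividing total cost by successful tasks, rather than by all tasks, means a model that fails cheaply does not look artificially efficient.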

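Recommended Loop

Start with a prompt or policy change, run the offline eval and safety checks, then move to shadow or canary rollout while watching online telemetry. Roll back if the rollout is unhealthy.

Artifact Structure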

prompt-spec.md
safety-policy.md
tool-permissions.md
eval-set.jsonl
rubric.md
baseline-report.md
rollout-plan.yaml
online-observations.md
rollback-thresholds.yaml
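
The schema of eval-set.jsonl is not specified here, so the following sketch assumes one JSON object per line with hypothetical fields `id`, `input`, and `must_contain`. It also runs each case several times, since a single sample from a nondeterministic model can mislead. The `make_flaky_model` function is a stand-in for a real model call.

```python
import json
import random
from typing import Callable

# Hypothetical records; the real eval-set.jsonl schema is an assumption.
EVAL_SET_JSONL = """\
{"id": "refund-01", "input": "I want a refund", "must_contain": "refund"}
{"id": "greeting-01", "input": "hello", "must_contain": "hello"}
"""


def load_eval_set(text: str) -> list[dict]:
    """Parse JSON Lines: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def pass_rate(case: dict, model: Callable[[str], str], trials: int = 5) -> float:
    """Run one case several times and return the fraction of passing runs."""
    passes = sum(
        case["must_contain"] in model(case["input"]).lower()
        for _ in range(trials)
    )
    return passes / trials


def make_flaky_model(seed: int, failure_rate: float) -> Callable[[str], str]:
    """Build a stand-in model that echoes the prompt but sometimes refuses."""
    rng = random.Random(seed)

    def model(prompt: str) -> str:
        if rng.random() < failure_rate:
            return "Sorry, I cannot help."
        return prompt

    return model


cases = load_eval_set(EVAL_SET_JSONL)
model = make_flaky_model(seed=0, failure_rate=0.2)
for case in cases:
    print(case["id"], pass_rate(case, model, trials=10))
```

A substring check is the simplest possible grader; in practice the criteria in rubric.md would likely call for something richer.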

Rollout Plan Example

model_change:
  offline_must_pass:
    - "task success >= baseline"
    - "safety violation <= baseline"
  online_canary:
    traffic_percent: 5
    watch:
      - "completion success"
      - "tool failure rate"
      - "cost per successful task"
  rollback_if:
    - "success drops more than 10%"
    - "safety incidents increase"
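
The conditions in the plan are written as prose strings, so something still has to evaluate them. The sketch below shows one way to do that under stated assumptions: "drops more than 10%" is read as a relative drop against baseline, safety is tracked as a rate so that samples of different sizes are comparable, and the `Metrics` type and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    task_success: float           # fraction of tasks completed, 0.0 to 1.0
    safety_violation_rate: float  # violations per task, 0.0 to 1.0


def offline_gate_passes(candidate: Metrics, baseline: Metrics) -> bool:
    """offline_must_pass: success >= baseline and safety violation <= baseline."""
    return (
        candidate.task_success >= baseline.task_success
        and candidate.safety_violation_rate <= baseline.safety_violation_rate
    )


def rollback_reasons(
    canary: Metrics, baseline: Metrics, max_relative_drop: float = 0.10
) -> list[str]:
    """Return the triggered rollback_if conditions; an empty list means healthy."""
    reasons = []
    # Assumption: the 10% threshold is relative to the baseline success rate.
    if canary.task_success < baseline.task_success * (1 - max_relative_drop):
        reasons.append("success drops more than 10%")
    if canary.safety_violation_rate > baseline.safety_violation_rate:
        reasons.append("safety incidents increase")
    return reasons


baseline = Metrics(task_success=0.80, safety_violation_rate=0.01)
canary = Metrics(task_success=0.70, safety_violation_rate=0.01)
print(offline_gate_passes(canary, baseline))
print(rollback_reasons(canary, baseline))
```

Whether the threshold is relative or absolute changes when rollback fires, so it is worth stating the choice explicitly in rollback-thresholds.yaml.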

Why This Is Engineering

An AI product harness is a control system for nondeterministic behavior. It versions prompt and policy specs, anchors quality in offline evals, watches online telemetry, and uses canary or shadow rollout to reduce blast radius.

First 30 Days

Build a small eval-set.jsonl with 20 to 50 core user flows.
Version prompt, spec, and policy changes.
Define a 5 percent canary and rollback thresholds before full rollout.
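
A 5 percent canary needs some rule for deciding which requests see the new version. One common approach, sketched here as an assumption rather than a prescribed method, is to hash a stable user identifier into a bucket so that each user consistently sees the same version. The `in_canary` function and the `salt` parameter are hypothetical.

```python
import hashlib


def in_canary(user_id: str, traffic_percent: float, salt: str = "rollout-1") -> bool:
    """Deterministically assign a user to the canary group.

    The same user always gets the same answer for a given salt, so their
    experience does not flip between model versions across requests.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    # Map the first 8 bytes to a bucket in [0, 10000) for 0.01% granularity.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < traffic_percent * 100


users = [f"user-{i}" for i in range(10_000)]
share = sum(in_canary(u, traffic_percent=5) for u in users) / len(users)
print(f"canary share: {share:.3f}")
```

Changing the salt for each rollout reshuffles the assignment, which avoids exposing the same 5 percent of users to every experimental change.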

