Scenario: AI Product Team
Design a harness where eval sets, safety policy, online telemetry, and model rollout control nondeterministic behavior.
Key takeaways
- AI product harnesses resemble code harnesses but must also control model nondeterminism and drift.
- Load-bearing elements are an eval set, a versioned prompt/policy spec, canary or shadow rollout, and online telemetry.
- The recommended loop runs prompt/policy change to offline eval to safety checks to shadow/canary to online telemetry, rolling back if unhealthy.
- A rollout plan defines offline must-pass thresholds, a small canary traffic percentage, and explicit rollback triggers like a >10% success drop.
- First 30 days: build a 20-50 flow eval-set.jsonl, version prompt and policy changes, and define a 5 percent canary with rollback thresholds.
AI product harnesses resemble code harnesses, but they must also handle model nondeterminism and drift.
Problem Structure
- The same change may produce different model behavior.
- Prompt or policy changes can degrade quality without obvious code diffs.
- Offline success may fail online.
- Safety policy and cost budget can drift away from implementation.
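Because the same change may produce different model behavior, a single pass or fail on an eval case says little. One possible way to handle this, sketched below, is to run each case several times and track a pass rate. The names `pass_rate` and `flaky_model_case` are hypothetical, and the random stand-in replaces what would be a real model call plus grader.

```python
import random
from typing import Callable


def pass_rate(run_case: Callable[[], bool], trials: int = 10) -> float:
    """Run one eval case several times and return the fraction of passes."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    passes = sum(1 for _ in range(trials) if run_case())
    return passes / trials


def flaky_model_case(rng: random.Random, success_prob: float = 0.8) -> bool:
    """Hypothetical stand-in for a nondeterministic model call plus grader."""
    return rng.random() < success_prob


rng = random.Random(0)  # seeded so the sketch is repeatable
rate = pass_rate(lambda: flaky_model_case(rng), trials=20)
print(f"pass rate over 20 trials: {rate:.2f}")
```

Comparing pass rates between the baseline and a candidate, rather than single outcomes, makes a quality regression easier to tell apart from ordinary run-to-run noise.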
Load-Bearing Elements
| Element | Why it matters |
|---|---|
| Eval set | Prevents quality judgment by feel |
| Prompt / policy spec | Makes changes comparable |
| Canary / shadow rollout | Limits online blast radius |
| Telemetry | Tracks quality, cost, and failure patterns |
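Recommended Loop
Prompt/policy change -> offline eval -> safety checks -> shadow/canary -> online telemetry -> roll back if unhealthy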
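As a rough sketch, the loop can be modeled as an ordered list of gates that stops at the first unhealthy one. The `Stage` class, the `run_loop` function, and the distinction between "reject" (failed before any traffic) and "rollback" (failed with live traffic) are illustrative assumptions, not part of any specific tool.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Stage:
    name: str
    check: Callable[[], bool]  # True means the stage looks healthy
    online: bool = False       # True once real traffic is involved


def run_loop(stages: List[Stage]) -> Tuple[str, Optional[str]]:
    """Run stages in order and stop at the first unhealthy one."""
    for stage in stages:
        if not stage.check():
            # Offline failures block the change; online failures undo it.
            decision = "rollback" if stage.online else "reject"
            return decision, stage.name
    return "promote", None


# Hypothetical checks; real ones would call the eval runner and telemetry.
stages = [
    Stage("offline eval", lambda: True),
    Stage("safety checks", lambda: True),
    Stage("shadow/canary", lambda: True, online=True),
    Stage("online telemetry", lambda: False, online=True),
]
print(run_loop(stages))
```

Keeping each gate as a plain boolean check makes it straightforward to add a stage or tighten one without touching the others.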
Artifact Structure
prompt-spec.md
safety-policy.md
tool-permissions.md
eval-set.jsonl
rubric.md
baseline-report.md
rollout-plan.yaml
online-observations.md
rollback-thresholds.yaml
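The source does not specify a schema for eval-set.jsonl. The sketch below assumes one JSON object per line with hypothetical `id`, `input`, and `expected` fields, and shows how such a file might be validated before it is used as a quality gate.

```python
import json
from typing import Dict, Iterable, List

# Hypothetical schema: these field names are assumptions for illustration.
REQUIRED_FIELDS = ("id", "input", "expected")


def parse_eval_set(lines: Iterable[str]) -> List[Dict]:
    """Parse JSONL eval cases, rejecting missing fields and duplicate ids."""
    cases: List[Dict] = []
    seen_ids = set()
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        case = json.loads(line)
        missing = [f for f in REQUIRED_FIELDS if f not in case]
        if missing:
            raise ValueError(f"line {line_no}: missing fields {missing}")
        if case["id"] in seen_ids:
            raise ValueError(f"line {line_no}: duplicate id {case['id']!r}")
        seen_ids.add(case["id"])
        cases.append(case)
    return cases


sample = [
    '{"id": "refund-001", "input": "I want a refund", "expected": "refund_flow"}',
    '{"id": "login-001", "input": "I cannot log in", "expected": "reset_flow"}',
]
print(len(parse_eval_set(sample)), "cases loaded")
```

In practice the lines would come from an open file handle, for example `parse_eval_set(open("eval-set.jsonl", encoding="utf-8"))`.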
Rollout Plan Example
model_change:
  offline_must_pass:
    - "task success >= baseline"
    - "safety violation <= baseline"
  online_canary:
    traffic_percent: 5
    watch:
      - "completion success"
      - "tool failure rate"
      - "cost per successful task"
  rollback_if:
    - "success drops more than 10%"
    - "safety incidents increase"
Why This Is Engineering
An AI product harness is a control system for nondeterministic behavior. It versions prompt and policy specs, anchors quality in offline evals, watches online telemetry, and uses canary or shadow rollout to limit blast radius.
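The rollback triggers from the rollout plan example ("success drops more than 10%", "safety incidents increase") could be checked roughly as follows. The sketch assumes the 10% is a relative drop against the baseline success rate and that safety incidents are compared as per-request rates. The plan does not say either way, so both are assumptions, as are the `Metrics` and `rollback_reasons` names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Metrics:
    success_rate: float          # fraction of tasks completed, 0.0 to 1.0
    safety_incident_rate: float  # incidents per request


def rollback_reasons(
    baseline: Metrics, canary: Metrics, max_relative_drop: float = 0.10
) -> List[str]:
    """Return the triggers that fired; an empty list means keep going."""
    reasons: List[str] = []
    if baseline.success_rate > 0:
        # Assumption: "drops more than 10%" is measured relative to baseline.
        drop = (baseline.success_rate - canary.success_rate) / baseline.success_rate
        if drop > max_relative_drop:
            reasons.append("success drops more than 10%")
    if canary.safety_incident_rate > baseline.safety_incident_rate:
        reasons.append("safety incidents increase")
    return reasons


baseline = Metrics(success_rate=0.90, safety_incident_rate=0.001)
canary = Metrics(success_rate=0.78, safety_incident_rate=0.001)
print(rollback_reasons(baseline, canary))
```

With small canary samples, a raw rate comparison can fire on noise, so a real check would likely also require a minimum request count before evaluating the triggers.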
First 30 Days
- Build a small eval-set.jsonl with 20 to 50 core user flows.
- Version prompt, spec, and policy changes.
- Define a 5 percent canary and rollback thresholds before full rollout.
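One common way to implement a 5 percent canary is deterministic hash bucketing, so that a given user always lands on the same side of the split. This is a minimal sketch, not a prescribed method: the `in_canary` function, the `salt` parameter, and the use of a user ID as the bucketing key are all assumptions.

```python
import hashlib


def in_canary(user_id: str, traffic_percent: float = 5.0,
              salt: str = "rollout-v1") -> bool:
    """Deterministically assign a user to the canary group."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to one of 10,000 buckets (0..9999).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < traffic_percent * 100


users = [f"user-{i}" for i in range(10_000)]
share = sum(in_canary(u) for u in users) / len(users)
print(f"canary share: {share:.3f}")
```

Changing the salt for each rollout reshuffles which users fall into the canary, which avoids exposing the same group to every experimental change.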