Ch7. Experiment Operations
Repeat prompt, model, and workflow experiments quickly and safely
Key takeaways
- The goal of experimentation is faster decisions, not more experiments, so define business meaning before statistical significance.
- Score variants with a decision formula (ΔQuality − λ_cost·ΔCost − λ_risk·ΔRisk) and tune the λ weights to your growth, margin, and safety strategy.
- Triangulate evidence across offline eval runs, production trace samples, segment metrics, and cost/latency budgets.
- Keep experiments interpretable: limit concurrency, run high-risk features only in pre-approved windows, and record MCP server, tool scope, and routing per variant.
- Store failed experiments and trace-derived regression cases as reusable knowledge in the eval set.
Good experimentation is not about running more experiments. It is about faster decisions.
Experiment design should define business meaning before statistical significance.
Experiment Units
| Target | Examples | Risk |
|---|---|---|
| Prompt | Instruction structure, constraints | Quality variance |
| Model | Provider/version change | Cost and quality shift together |
| Workflow | Tool-call order, approval condition | Safety impact |
Decision Formula
- Tune
λvalues to match your organization’s growth, margin, and safety strategy. - Block PII exposure, privilege escalation, and unapproved side effects before scoring.
Evidence Units
| Evidence | Use |
|---|---|
| Offline eval run | Basic regression check for prompt/model changes |
| Production trace sample | Verify actual tool/handoff/approval paths |
| Segment metric | Detect regressions that appear only in specific tenants, languages, or channels |
| Cost/latency budget | Decide whether a quality gain is economically valid |
Experiment Discipline
- Limit concurrent experiments so results remain interpretable.
- Run high-risk features only inside pre-approved experiment windows.
- Store failed experiments as knowledge assets to avoid repeated attempts.
- Record MCP server, tool scope, and model routing policy per variant in a registry.
- Promote trace-derived regression cases into the eval set after the experiment ends.
Experiment Review Template
# Experiment Review
- Hypothesis:
- Variant A/B:
- Target Metric:
- Result (Quality/Cost/Risk):
- Decision: Rollout / Iterate / RollbackBaseline and Sources
| Item | Baseline Date | Recheck By | Primary Source |
|---|---|---|---|
| Trace-derived regression | 2026-05-17 | 2026-06-16 | https://developers.openai.com/api/docs/guides/agent-evals |
| MCP/tool scope experiment control | 2026-05-17 | 2026-06-16 | https://owasp.org/www-project-mcp-top-10/ |