Ch7. Experiment Operations

Key takeaways

The goal of experimentation is faster decisions, not more experiments, so define business meaning before statistical significance.
Score variants with a decision formula (ΔQuality − λ_cost·ΔCost − λ_risk·ΔRisk) and tune the λ weights to your growth, margin, and safety strategy.
Triangulate evidence across offline eval runs, production trace samples, segment metrics, and cost/latency budgets.
Keep experiments interpretable: limit concurrency, run high-risk features only in pre-approved windows, and record MCP server, tool scope, and routing per variant.
Store failed experiments and trace-derived regression cases as reusable knowledge in the eval set.

Good experimentation is not about running more experiments. It is about faster decisions.
Experiment design should define business meaning before statistical significance.

Experiment Units

Target	Examples	Risk
Prompt	Instruction structure, constraints	Quality variance
Model	Provider/version change	Cost and quality shift together
Workflow	Tool-call order, approval condition	Safety impact

Decision Formula

\text{Decision Score} = \Delta Quality - \lambda_{cost}\Delta Cost - \lambda_{risk}\Delta Risk

Tune λ values to match your organization’s growth, margin, and safety strategy.
Block PII exposure, privilege escalation, and unapproved side effects before scoring.

Evidence Units

Evidence	Use
Offline eval run	Basic regression check for prompt/model changes
Production trace sample	Verify actual tool/handoff/approval paths
Segment metric	Detect regressions that appear only in specific tenants, languages, or channels
Cost/latency budget	Decide whether a quality gain is economically valid

Experiment Discipline

Limit concurrent experiments so results remain interpretable.
Run high-risk features only inside pre-approved experiment windows.
Store failed experiments as knowledge assets to avoid repeated attempts.
Record MCP server, tool scope, and model routing policy per variant in a registry.
Promote trace-derived regression cases into the eval set after the experiment ends.

Experiment Review Template

# Experiment Review

- Hypothesis:
- Variant A/B:
- Target Metric:
- Result (Quality/Cost/Risk):
- Decision: Rollout / Iterate / Rollback

Baseline and Sources

Item	Baseline Date	Recheck By	Primary Source
Trace-derived regression	2026-05-17	2026-06-16	https://developers.openai.com/api/docs/guides/agent-evals
MCP/tool scope experiment control	2026-05-17	2026-06-16	https://owasp.org/www-project-mcp-top-10/

Ch7. Experiment Operations

Repeat prompt, model, and workflow experiments quickly and safely

Key takeaways

The goal of experimentation is faster decisions, not more experiments, so define business meaning before statistical significance.
Score variants with a decision formula (ΔQuality − λ_cost·ΔCost − λ_risk·ΔRisk) and tune the λ weights to your growth, margin, and safety strategy.
Triangulate evidence across offline eval runs, production trace samples, segment metrics, and cost/latency budgets.
Keep experiments interpretable: limit concurrency, run high-risk features only in pre-approved windows, and record MCP server, tool scope, and routing per variant.
Store failed experiments and trace-derived regression cases as reusable knowledge in the eval set.

Good experimentation is not about running more experiments. It is about faster decisions.
Experiment design should define business meaning before statistical significance.

Experiment Units

Target	Examples	Risk
Prompt	Instruction structure, constraints	Quality variance
Model	Provider/version change	Cost and quality shift together
Workflow	Tool-call order, approval condition	Safety impact

Decision Formula

\text{Decision Score} = \Delta Quality - \lambda_{cost}\Delta Cost - \lambda_{risk}\Delta Risk

Tune λ values to match your organization’s growth, margin, and safety strategy.
Block PII exposure, privilege escalation, and unapproved side effects before scoring.

Evidence Units

Evidence	Use
Offline eval run	Basic regression check for prompt/model changes
Production trace sample	Verify actual tool/handoff/approval paths
Segment metric	Detect regressions that appear only in specific tenants, languages, or channels
Cost/latency budget	Decide whether a quality gain is economically valid

Experiment Discipline

Limit concurrent experiments so results remain interpretable.
Run high-risk features only inside pre-approved experiment windows.
Store failed experiments as knowledge assets to avoid repeated attempts.
Record MCP server, tool scope, and model routing policy per variant in a registry.
Promote trace-derived regression cases into the eval set after the experiment ends.

Experiment Review Template

# Experiment Review

- Hypothesis:
- Variant A/B:
- Target Metric:
- Result (Quality/Cost/Risk):
- Decision: Rollout / Iterate / Rollback

Baseline and Sources

Item	Baseline Date	Recheck By	Primary Source
Trace-derived regression	2026-05-17	2026-06-16	https://developers.openai.com/api/docs/guides/agent-evals
MCP/tool scope experiment control	2026-05-17	2026-06-16	https://owasp.org/www-project-mcp-top-10/

Experiment Units

Decision Formula

Evidence Units

Experiment Discipline

Experiment Review Template

Baseline and Sources

On This Page

Ch7. Experiment Operations

Experiment Units

Decision Formula

Evidence Units

Experiment Discipline

Experiment Review Template

Baseline and Sources

On This Page