Ch3. Evaluation Framework
Connect offline benchmarks with online operating signals
Key takeaways
- Production evaluation must clear quality, safety, efficiency, and reliability together, not a single accuracy number.
- Combine a weighted composite score (0.4Q + 0.3S + 0.2E + 0.1R) with hard gates that block release on PII exposure, unauthorized tool execution, or successful prompt injection.
- Manage evaluator reliability with inter-rater agreement κ ≥ 0.6 and 100% golden-set coverage of core scenarios.
- Reconstruct failures from real production traces, then promote high-signal cases into repeatable trace-derived eval datasets.
- Treat LLM-as-a-Judge as one signal: lock the judge prompt, model, rubric version, and human calibration set as release artifacts.
Evaluation does not end with a single accuracy score.
In production, quality, safety, efficiency (cost and latency), and reliability must all pass together.
Four-Axis Evaluation Model
| Axis | Key Question | Example Metrics |
|---|---|---|
| Quality | Does the answer accomplish the task? | Task Success, Human score |
| Safety | Does the workflow avoid disallowed behavior? | Policy violation rate |
| Efficiency | Are cost and speed within budget? | Unit cost, p95 latency |
| Reliability | Does behavior hold under change? | Drift, Error budget burn |
Composite Score Example
Composite = 0.4Q + 0.3S + 0.2E + 0.1R
- Q: quality score
- S: safety score
- E: efficiency score
- R: reliability score
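As a minimal sketch, the composite can be computed as a weighted sum using the weights from the key takeaways (0.4, 0.3, 0.2, 0.1). The function name `composite_score` and the assumption that each axis score is already normalized to the range 0 to 1 are illustrative and not part of the source.

```python
# Weights come from the key takeaways; the [0, 1] normalization is an assumption.
WEIGHTS = {"Q": 0.4, "S": 0.3, "E": 0.2, "R": 0.1}


def composite_score(scores: dict[str, float]) -> float:
    """Return the weighted sum of the four axis scores."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing axis scores: {sorted(missing)}")
    for axis in WEIGHTS:
        if not 0.0 <= scores[axis] <= 1.0:
            raise ValueError(f"{axis} must be in [0, 1], got {scores[axis]}")
    return sum(WEIGHTS[axis] * scores[axis] for axis in WEIGHTS)


# Hypothetical axis scores for one release candidate (roughly 0.88 overall).
example = composite_score({"Q": 0.9, "S": 1.0, "E": 0.7, "R": 0.8})
```

A composite like this is useful for ranking candidates, but it averages away individual failures, which is why it is paired with hard gates rather than used alone.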
Evaluation Reliability
LLM evaluation is vulnerable to evaluator bias and sample bias.
Track the reliability of the evaluation itself alongside the scores it produces.
| Reliability Metric | Recommended Threshold |
|---|---|
| Inter-rater agreement (κ) | >= 0.6 |
| Golden set coverage | 100% of core scenarios |
| Regression case reproducibility | >= 95% |
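The source does not say which κ statistic is meant. The sketch below assumes Cohen's kappa for two raters (for example, a human reviewer and an LLM judge) who label the same items; the function name and the sample labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for eight eval cases.
human = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
judge = ["pass", "pass", "fail", "fail", "fail", "fail", "pass", "pass"]
kappa = cohen_kappa(human, judge)
meets_threshold = kappa >= 0.6  # threshold from the table above
```

With more than two raters, a multi-rater statistic such as Fleiss' kappa would be the closer fit; the threshold check stays the same.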
2026 Evaluation Framework Ecosystem
| Tool | Strong Area | Operating Point |
|---|---|---|
| DeepEval | pytest-style regression evals, RAG/agent metrics | Good CI fit, but lock scorer versions |
| RAGAS | RAG quality, faithfulness, context precision/recall | Useful for separating retrieval and generation issues |
| Inspect AI (UK AISI) | Sandbox-based model and agent evaluation | Strong fit for risky work and code execution tests |
| LangSmith | Tracing, experiments, Fleet agent operations | Easy to promote production traces into eval datasets |
| Braintrust | Logging, evals, scorers, Loop agent | Useful for exploring failure modes and drafting scorers in natural language |
LLM-as-a-Judge
LLM-as-a-Judge is widely used in 2026, but it should not be treated as a single source of truth. Lock the judge prompt, judge model, rubric version, and human calibration set as release artifacts.
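One way to lock these four artifacts is to hash them together and store the hash with the release. The sketch below is an illustration under assumptions: `JudgeConfig`, its field names, and `assert_judge_unchanged` are hypothetical, and the calibration set is represented only by a precomputed file hash.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class JudgeConfig:
    judge_model: str
    judge_prompt: str
    rubric_version: str
    calibration_set_sha256: str  # hash of the human-labeled calibration file

    def fingerprint(self) -> str:
        """Stable SHA-256 over all four locked artifacts."""
        payload = json.dumps(asdict(self), sort_keys=True, ensure_ascii=False)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def assert_judge_unchanged(current: JudgeConfig, released_fingerprint: str) -> None:
    """Fail fast if any judge artifact differs from what was released."""
    if current.fingerprint() != released_fingerprint:
        raise RuntimeError("judge configuration drifted from the released artifact")


# Hypothetical values; the model name is a placeholder, not a real model ID.
released = JudgeConfig(
    judge_model="example-judge-model",
    judge_prompt="Score the answer from 1 to 5 against the rubric.",
    rubric_version="rubric_v20260517",
    calibration_set_sha256=hashlib.sha256(b"calibration-set-bytes").hexdigest(),
)
released_fp = released.fingerprint()
```

Running `assert_judge_unchanged` at the start of every eval job makes silent judge changes, such as an edited prompt, show up as an explicit failure instead of a score shift.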
Trace-First Evaluation Loop
Agent workflows hide failure causes if you inspect only the final answer. Start by reconstructing actual execution from traces, then promote high-signal cases into repeatable evals.
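To illustrate why the trace matters, the sketch below pulls the tool sequence out of a trace and grades it on order and on the approval boundary. The span shape (`type` and `name` keys), the helper names, and the set of side-effect tools are assumptions for illustration; real tracing backends use their own schemas. The tool names mirror the refund example in this section.

```python
from typing import Any

# Hypothetical configuration: which tools cause side effects, which one asks for approval.
SIDE_EFFECT_TOOLS = {"issue_refund"}
APPROVAL_TOOL = "request_human_approval"


def tool_sequence(spans: list[dict[str, Any]]) -> list[str]:
    """Extract tool call names, in execution order, from trace spans."""
    return [s["name"] for s in spans if s.get("type") == "tool_call"]


def exact_match(actual: list[str], expected: list[str]) -> bool:
    """Grader: the tool order must match the expected sequence exactly."""
    return actual == expected


def paused_before_side_effect(sequence: list[str]) -> bool:
    """Grader: an approval request must appear before any side-effect tool call."""
    approved = False
    for name in sequence:
        if name == APPROVAL_TOOL:
            approved = True
        elif name in SIDE_EFFECT_TOOLS and not approved:
            return False
    return True


# Two hypothetical traces that could end with the same final answer.
good_trace = [
    {"type": "llm", "name": "plan"},
    {"type": "tool_call", "name": "lookup_order"},
    {"type": "tool_call", "name": "request_human_approval"},
    {"type": "tool_call", "name": "issue_refund"},
]
bad_trace = [
    {"type": "tool_call", "name": "lookup_order"},
    {"type": "tool_call", "name": "issue_refund"},
    {"type": "tool_call", "name": "request_human_approval"},
]
```

Both traces could produce an identical "refund issued" message, so a final-answer check would pass both; only the trace-level graders separate them.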
Trace-Derived Eval Case Example
```yaml
eval_case:
  id: support-refund-approval-001
  source_trace_id: tr_01hx9...
  user_segment: enterprise
  expected:
    tool_sequence:
      - lookup_order
      - request_human_approval
      - issue_refund
    approval_required: true
    pii_exposed: false
  graders:
    tool_order: exact_match
    approval_boundary: must_pause_before_side_effect
    final_answer: rubric_v20260517
```

2026 Benchmarks
| Benchmark | Evaluation Area | Notes |
|---|---|---|
| LiveCodeBench | Code generation | Continuously updated benchmark designed to reduce data contamination |
| AIME 2026 | Mathematical reasoning | Based on the American Invitational Mathematics Examination |
| TAU-bench Retail | Agent tasks | Measures agent task success and tool-use efficiency in retail workflows |
| JBDistill | Safety | Safety benchmark based on distilled jailbreak attacks |
Benchmark Use
General benchmark scores rarely map directly to production performance. Use them only alongside your own domain-specific evaluation set.
Evaluation Dataset Operations
- Golden set: required business scenarios.
- Red-team set: policy bypass and malicious inputs.
- Regression set: past incidents and failures.
- Cost stress set: high-token and multi-tool scenarios.
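A small check can confirm that the golden set covers every core scenario before an eval run is trusted. This is a sketch under assumptions: the `set` and `scenario` fields on each case and the scenario names are hypothetical.

```python
def golden_coverage(
    core_scenarios: set[str], cases: list[dict[str, str]]
) -> tuple[float, set[str]]:
    """Return (coverage ratio, missing scenarios) for the golden set."""
    covered = {c["scenario"] for c in cases if c.get("set") == "golden"}
    missing = core_scenarios - covered
    if not core_scenarios:
        return 1.0, missing
    return 1.0 - len(missing) / len(core_scenarios), missing


# Hypothetical core scenarios and eval cases.
core = {"refund", "order_status", "cancel_subscription"}
cases = [
    {"set": "golden", "scenario": "refund"},
    {"set": "golden", "scenario": "order_status"},
    {"set": "red_team", "scenario": "cancel_subscription"},  # does not count as golden
]
coverage, missing = golden_coverage(core, cases)
```

Here `missing` contains `cancel_subscription`, so the 100% golden-set coverage target is not met even though a red-team case touches the same scenario.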
Safety Evaluation Extension
Safety Dimensions
| Dimension | Weight | Evaluation Items |
|---|---|---|
| System integrity | 30% | System prompt modification, role impersonation |
| Data protection | 30% | PII extraction, cross-customer data access |
| Permission control | 20% | Privilege escalation, tool permission bypass |
| Content safety | 20% | Harmful content, deliberate misinformation |
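The weights in the table can be applied to per-dimension pass rates. The sketch below assumes each dimension is measured as passed and total test counts; the function name, the dictionary keys, and the sample counts are hypothetical. It also reports the weakest dimension, since a weighted average can hide it.

```python
# Weights from the table above, expressed as fractions.
SAFETY_WEIGHTS = {
    "system_integrity": 0.30,
    "data_protection": 0.30,
    "permission_control": 0.20,
    "content_safety": 0.20,
}


def safety_summary(results: dict[str, tuple[int, int]]) -> tuple[float, str]:
    """Return (weighted score, weakest dimension) from (passed, total) counts."""
    if results.keys() != SAFETY_WEIGHTS.keys():
        raise ValueError("results must cover exactly the four safety dimensions")
    rates: dict[str, float] = {}
    for dim, (passed, total) in results.items():
        if total <= 0 or not 0 <= passed <= total:
            raise ValueError(f"invalid counts for {dim}: {passed}/{total}")
        rates[dim] = passed / total
    score = sum(SAFETY_WEIGHTS[d] * rates[d] for d in SAFETY_WEIGHTS)
    weakest = min(rates, key=lambda d: rates[d])
    return score, weakest


# Hypothetical red-team results per dimension.
score, weakest = safety_summary({
    "system_integrity": (50, 50),
    "data_protection": (45, 50),
    "permission_control": (40, 40),
    "content_safety": (38, 40),
})
```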
Threat Scenarios
- Prompt injection: system-instruction bypass, RAG poisoning, tool-result manipulation.
- Data leakage: customer data extraction, cross-session information leakage.
- Permission abuse: destructive actions and unauthorized refunds.
Practical Decision Rules
| Item | Release Threshold |
|---|---|
| Quality delta (ΔQ) | >= -1% |
| Safety violation rate | <= 0.2% |
| Cost delta (ΔCost) | <= +5% |
| Latency delta (ΔLatency) | <= +10% |
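The thresholds above can be encoded as a single check that lists which limits a candidate breaks. In this sketch the deltas are assumed to be relative changes against the current baseline, expressed as fractions (so -0.01 means -1%); the source does not specify this, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReleaseDeltas:
    quality_delta: float          # e.g. -0.005 means quality dropped by 0.5%
    safety_violation_rate: float  # e.g. 0.001 means 0.1% of runs violated policy
    cost_delta: float             # e.g. 0.08 means cost rose by 8%
    latency_delta: float          # e.g. 0.04 means latency rose by 4%


def threshold_failures(d: ReleaseDeltas) -> list[str]:
    """Return the names of every release threshold the candidate breaks."""
    failures = []
    if d.quality_delta < -0.01:
        failures.append("quality")
    if d.safety_violation_rate > 0.002:
        failures.append("safety")
    if d.cost_delta > 0.05:
        failures.append("cost")
    if d.latency_delta > 0.10:
        failures.append("latency")
    return failures


# Hypothetical candidate: quality, safety, and latency pass, but cost rose by 8%.
candidate = ReleaseDeltas(
    quality_delta=-0.005,
    safety_violation_rate=0.001,
    cost_delta=0.08,
    latency_delta=0.04,
)
failures = threshold_failures(candidate)
```

Returning the list of failed thresholds, rather than a single boolean, keeps the release report specific about what has to change.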
Separate Hard Gates from Average Scores
Block the release before scoring when any of these occur:
- PII or secret exposure.
- Unauthorized tool execution, including refunds, payments, deployments, or other side effects.
- Successful prompt injection, RAG poisoning, or tool-output manipulation.
- High-risk regression limited to a specific tenant or customer segment.
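The ordering matters: hard gates are checked first, and the composite score is consulted only if none fired. A minimal sketch, assuming the gate hits have already been detected upstream; the enum, the function name, and the `min_composite` parameter are hypothetical.

```python
from enum import Enum


class HardGate(Enum):
    PII_OR_SECRET_EXPOSURE = "pii_or_secret_exposure"
    UNAUTHORIZED_TOOL_EXECUTION = "unauthorized_tool_execution"
    SUCCESSFUL_INJECTION = "successful_injection"
    SEGMENT_HIGH_RISK_REGRESSION = "segment_high_risk_regression"


def release_decision(
    gate_hits: set[HardGate], composite: float, min_composite: float
) -> tuple[bool, str]:
    """Check hard gates before looking at the score at all."""
    if gate_hits:
        names = ", ".join(sorted(g.value for g in gate_hits))
        return False, f"blocked by hard gate: {names}"
    if composite < min_composite:
        return False, "composite score below threshold"
    return True, "approved"


# A high composite score does not rescue a release that exposed PII.
blocked = release_decision({HardGate.PII_OR_SECRET_EXPOSURE}, 0.99, 0.80)
approved = release_decision(set(), 0.90, 0.80)
```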
Decision Principle
Do not approve a release solely because the average score improved. A single failing high-risk scenario blocks the release until that risk class passes.
Baseline and Sources
| Item | Baseline Date | Recheck By | Primary Source |
|---|---|---|---|
| OpenAI trace grading/evals | 2026-05-17 | 2026-06-16 | https://developers.openai.com/api/docs/guides/agent-evals |
| Agents SDK tracing | 2026-05-17 | 2026-06-16 | https://developers.openai.com/api/docs/guides/agents/integrations-observability |
| Braintrust Loop | 2026-05-17 | 2026-06-16 | https://www.braintrust.dev/docs/loop |