Ch3. Evaluation Framework

Key takeaways

Production evaluation must clear quality, safety, efficiency, and reliability together, not a single accuracy number.
Combine a weighted composite score (0.4Q + 0.3S + 0.2E + 0.1R) with hard gates that block release on PII exposure, unauthorized tool execution, or successful prompt injection.
Manage evaluator reliability with inter-rater agreement κ ≥ 0.6 and 100% golden-set coverage of core scenarios.
Reconstruct failures from real production traces, then promote high-signal cases into repeatable trace-derived eval datasets.
Treat LLM-as-a-Judge as one signal: lock the judge prompt, model, rubric version, and human calibration set as release artifacts.

Evaluation does not end with a single accuracy score.
In production, quality, safety, cost, and latency must all pass together.

Four-Axis Evaluation Model

Axis	Key Question	Example Metrics
Quality	Does the answer accomplish the task?	Task Success, Human score
Safety	Does the workflow avoid disallowed behavior?	Policy violation rate
Efficiency	Are cost and speed within budget?	Unit cost, p95 latency
Reliability	Does behavior hold under change?	Drift, Error budget burn

Composite Score Example

\text{Composite Score} = 0.4Q + 0.3S + 0.2E + 0.1R

Q: quality score
S: safety score
E: efficiency score
R: reliability score

Evaluation Reliability

LLM evaluation is vulnerable to evaluator bias and sample bias.
Manage evaluation reliability alongside scores.

\text{Inter-rater Agreement} = \kappa

Reliability Metric	Recommended Threshold
Inter-rater agreement (κ)	>= 0.6
Golden set coverage	100% of core scenarios
Regression case reproducibility	>= 95%

2026 Evaluation Framework Ecosystem

Tool	Strong Area	Operating Point
DeepEval	pytest-style regression evals, RAG/agent metrics	Good CI fit, but lock scorer versions
RAGAS	RAG quality, faithfulness, context precision/recall	Useful for separating retrieval and generation issues
Inspect AI (UK AISI)	Sandbox-based model and agent evaluation	Strong fit for risky work and code execution tests
LangSmith	Trace, experiment, Fleet agent operations	Easy to promote production traces into eval datasets
Braintrust	Logging, evals, scorers, Loop agent	Useful for exploring failure modes and drafting scorers in natural language

LLM-as-a-Judge

LLM-as-a-Judge is widely used in 2026, but it should not be treated as a single source of truth. Lock the judge prompt, judge model, rubric version, and human calibration set as release artifacts.

Trace-First Evaluation Loop

Agent workflows hide failure causes if you inspect only the final answer. Start by reconstructing actual execution from traces, then promote high-signal cases into repeatable evals.

Collect representative production traces, including model calls, tool calls, handoffs, guardrails, and approvals.

Classify failed traces into grader criteria and regression dataset candidates.

Re-evaluate prompt, model, and routing changes against the same trace-derived dataset.

Apply the same grader to online samples after release to detect drift.

Trace-Derived Eval Case Example

eval_case:
  id: support-refund-approval-001
  source_trace_id: tr_01hx9...
  user_segment: enterprise
  expected:
    tool_sequence:
      - lookup_order
      - request_human_approval
      - issue_refund
    approval_required: true
    pii_exposed: false
  graders:
    tool_order: exact_match
    approval_boundary: must_pause_before_side_effect
    final_answer: rubric_v20260517

2026 Benchmarks

Benchmark	Evaluation Area	Notes
LiveCodeBench	Code generation	Continuously updated benchmark designed to reduce data contamination
AIME 2026	Mathematical reasoning	Based on the American Invitational Mathematics Examination
TAU-bench Retail	Agent tasks	Measures agent task success and tool-use efficiency in retail workflows
JBDistill	Safety	Safety benchmark based on distilled jailbreak attacks

Benchmark Use

General benchmark scores rarely map directly to production performance. Use them only alongside your own domain-specific evaluation set.

Evaluation Dataset Operations

Golden set: required business scenarios.
Red-team set: policy bypass and malicious inputs.
Regression set: past incidents and failures.
Cost stress set: high-token and multi-tool scenarios.

Safety Evaluation Extension

Safety Dimensions

Dimension	Weight	Evaluation Items
System integrity	30%	System prompt modification, role impersonation
Data protection	30%	PII extraction, cross-customer data access
Permission control	20%	Privilege escalation, tool permission bypass
Content safety	20%	Harmful content, deliberate misinformation

Threat Scenarios

Prompt injection: system-instruction bypass, RAG poisoning, tool-result manipulation.
Data leakage: customer data extraction, cross-session information leakage.
Permission abuse: destructive actions and unauthorized refunds.

Practical Decision Rules

Item	Release Threshold
Quality delta (ΔQ)	>= -1%
Safety violation rate	<= 0.2%
Cost delta (ΔCost)	<= +5%
Latency delta (ΔLatency)	<= +10%

Separate Hard Gates from Average Scores

Block the release before scoring when any of these occur:

PII or secret exposure.
Unauthorized tool execution, including refunds, payments, deployments, or other side effects.
Successful prompt injection, RAG poisoning, or tool-output manipulation.
High-risk regression limited to a specific tenant or customer segment.

Decision Principle

Do not approve a release solely because the average score improved. A single high-risk scenario can block the release until that risk class passes.

Baseline and Sources

Item	Baseline Date	Recheck By	Primary Source
OpenAI trace grading/evals	2026-05-17	2026-06-16	https://developers.openai.com/api/docs/guides/agent-evals
Agents SDK tracing	2026-05-17	2026-06-16	https://developers.openai.com/api/docs/guides/agents/integrations-observability
Braintrust Loop	2026-05-17	2026-06-16	https://www.braintrust.dev/docs/loop

Key takeaways

Production evaluation must clear quality, safety, efficiency, and reliability together, not a single accuracy number.
Combine a weighted composite score (0.4Q + 0.3S + 0.2E + 0.1R) with hard gates that block release on PII exposure, unauthorized tool execution, or successful prompt injection.
Manage evaluator reliability with inter-rater agreement κ ≥ 0.6 and 100% golden-set coverage of core scenarios.
Reconstruct failures from real production traces, then promote high-signal cases into repeatable trace-derived eval datasets.
Treat LLM-as-a-Judge as one signal: lock the judge prompt, model, rubric version, and human calibration set as release artifacts.

Evaluation does not end with a single accuracy score.
In production, quality, safety, cost, and latency must all pass together.

Four-Axis Evaluation Model

Axis	Key Question	Example Metrics
Quality	Does the answer accomplish the task?	Task Success, Human score
Safety	Does the workflow avoid disallowed behavior?	Policy violation rate
Efficiency	Are cost and speed within budget?	Unit cost, p95 latency
Reliability	Does behavior hold under change?	Drift, Error budget burn

Composite Score Example

\text{Composite Score} = 0.4Q + 0.3S + 0.2E + 0.1R

Q: quality score
S: safety score
E: efficiency score
R: reliability score

Evaluation Reliability

LLM evaluation is vulnerable to evaluator bias and sample bias.
Manage evaluation reliability alongside scores.

\text{Inter-rater Agreement} = \kappa

Reliability Metric	Recommended Threshold
Inter-rater agreement (κ)	>= 0.6
Golden set coverage	100% of core scenarios
Regression case reproducibility	>= 95%

2026 Evaluation Framework Ecosystem

Tool	Strong Area	Operating Point
DeepEval	pytest-style regression evals, RAG/agent metrics	Good CI fit, but lock scorer versions
RAGAS	RAG quality, faithfulness, context precision/recall	Useful for separating retrieval and generation issues
Inspect AI (UK AISI)	Sandbox-based model and agent evaluation	Strong fit for risky work and code execution tests
LangSmith	Trace, experiment, Fleet agent operations	Easy to promote production traces into eval datasets
Braintrust	Logging, evals, scorers, Loop agent	Useful for exploring failure modes and drafting scorers in natural language

LLM-as-a-Judge

LLM-as-a-Judge is widely used in 2026, but it should not be treated as a single source of truth. Lock the judge prompt, judge model, rubric version, and human calibration set as release artifacts.

Trace-First Evaluation Loop

Agent workflows hide failure causes if you inspect only the final answer. Start by reconstructing actual execution from traces, then promote high-signal cases into repeatable evals.

Collect representative production traces, including model calls, tool calls, handoffs, guardrails, and approvals.

Classify failed traces into grader criteria and regression dataset candidates.

Re-evaluate prompt, model, and routing changes against the same trace-derived dataset.

Apply the same grader to online samples after release to detect drift.

Trace-Derived Eval Case Example

eval_case:
  id: support-refund-approval-001
  source_trace_id: tr_01hx9...
  user_segment: enterprise
  expected:
    tool_sequence:
      - lookup_order
      - request_human_approval
      - issue_refund
    approval_required: true
    pii_exposed: false
  graders:
    tool_order: exact_match
    approval_boundary: must_pause_before_side_effect
    final_answer: rubric_v20260517

2026 Benchmarks

Benchmark	Evaluation Area	Notes
LiveCodeBench	Code generation	Continuously updated benchmark designed to reduce data contamination
AIME 2026	Mathematical reasoning	Based on the American Invitational Mathematics Examination
TAU-bench Retail	Agent tasks	Measures agent task success and tool-use efficiency in retail workflows
JBDistill	Safety	Safety benchmark based on distilled jailbreak attacks

Benchmark Use

General benchmark scores rarely map directly to production performance. Use them only alongside your own domain-specific evaluation set.

Evaluation Dataset Operations

Golden set: required business scenarios.
Red-team set: policy bypass and malicious inputs.
Regression set: past incidents and failures.
Cost stress set: high-token and multi-tool scenarios.

Safety Evaluation Extension

Safety Dimensions

Dimension	Weight	Evaluation Items
System integrity	30%	System prompt modification, role impersonation
Data protection	30%	PII extraction, cross-customer data access
Permission control	20%	Privilege escalation, tool permission bypass
Content safety	20%	Harmful content, deliberate misinformation

Threat Scenarios

Prompt injection: system-instruction bypass, RAG poisoning, tool-result manipulation.
Data leakage: customer data extraction, cross-session information leakage.
Permission abuse: destructive actions and unauthorized refunds.

Practical Decision Rules

Item	Release Threshold
Quality delta (ΔQ)	>= -1%
Safety violation rate	<= 0.2%
Cost delta (ΔCost)	<= +5%
Latency delta (ΔLatency)	<= +10%

Separate Hard Gates from Average Scores

Block the release before scoring when any of these occur:

PII or secret exposure.
Unauthorized tool execution, including refunds, payments, deployments, or other side effects.
Successful prompt injection, RAG poisoning, or tool-output manipulation.
High-risk regression limited to a specific tenant or customer segment.

Decision Principle

Do not approve a release solely because the average score improved. A single high-risk scenario can block the release until that risk class passes.

Baseline and Sources

Item	Baseline Date	Recheck By	Primary Source
OpenAI trace grading/evals	2026-05-17	2026-06-16	https://developers.openai.com/api/docs/guides/agent-evals
Agents SDK tracing	2026-05-17	2026-06-16	https://developers.openai.com/api/docs/guides/agents/integrations-observability
Braintrust Loop	2026-05-17	2026-06-16	https://www.braintrust.dev/docs/loop

Four-Axis Evaluation Model

Composite Score Example

Evaluation Reliability

2026 Evaluation Framework Ecosystem

Trace-First Evaluation Loop

Trace-Derived Eval Case Example

2026 Benchmarks

Evaluation Dataset Operations

Safety Evaluation Extension

Safety Dimensions

Threat Scenarios

Practical Decision Rules

Separate Hard Gates from Average Scores

Baseline and Sources

On This Page

Ch3. Evaluation Framework

Four-Axis Evaluation Model

Composite Score Example

Evaluation Reliability

2026 Evaluation Framework Ecosystem

Trace-First Evaluation Loop

Trace-Derived Eval Case Example

2026 Benchmarks

Evaluation Dataset Operations

Safety Evaluation Extension

Safety Dimensions

Threat Scenarios

Practical Decision Rules

Separate Hard Gates from Average Scores

Baseline and Sources

On This Page