Ch5. Observability and SLOs

Collect model, tool, and policy execution as traceable signals and operate them through SLOs

Key takeaways

The real observability standard is time to reconstruct causality, not log volume.
Capture required signals across request, model, agent, tool, policy, and trace fields so any run is reproducible from its trace_id.
Define SLIs and SLOs with explicit targets (availability 99.9%, quality ≥ 95%, p95 latency ≤ 4s, policy violation ≤ 0.2%) and track error-budget burn.
Align internal traces with OpenTelemetry GenAI conventions and OWASP AOS (Instrumentable, Traceable, Inspectable), keeping an adapter layer since both are still evolving.
Operate a trace-to-eval loop: debug traces, grade them with a rubric, promote repeated failures into datasets, and re-grade online samples to detect drift.

If you cannot quickly reconstruct why a problem happened, your observability is insufficient.
The LLMOps standard is time to reconstruct causality, not log volume.

Required Signals

Area	Required Fields
Request	request_id, tenant_id, user_segment, intent, session_id
Model	model_id, prompt_version, input_tokens, output_tokens, cached_input_tokens, reasoning_mode
Agent	agent_id, run_id, handoff_from/to, state_version
Tool	tool_name, tool_call_id, mcp_server, latency_ms, status, side_effect
Policy	policy_pack, guardrail_name, decision, violation_type, approval_id
Trace	trace_id, span_id, parent_span_id, eval_score, dataset_version

Minimal Trace Schema Example

{
  "trace_id": "tr_01hx9...",
  "run_id": "run_20260517_001",
  "tenant_id": "acme-enterprise",
  "model": {
    "provider": "openai",
    "model_id": "gpt-5.4-mini",
    "prompt_version": "p-20260517.2",
    "input_tokens": 3840,
    "cached_input_tokens": 2560,
    "output_tokens": 620
  },
  "tool_calls": [
    {
      "tool_call_id": "tool_01",
      "mcp_server": "github-readonly-prod",
      "side_effect": false,
      "status": "ok"
    }
  ],
  "approval": {
    "approval_id": "appr_01",
    "decision": "approved"
  }
}
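A schema is only useful if incomplete records are caught before they reach storage. The following is a minimal sketch of such a check, assuming the nesting used in the example above (top-level identifiers, a `model` object, and a `tool_calls` list). The function name `missing_fields`, the `REQUIRED_*` constants, and the placeholder trace ID are hypothetical and not part of any standard.

```python
import json

# Required fields taken from the Required Signals table; where each field lives
# (top level, "model", or "tool_calls") is an assumption based on the example.
REQUIRED_TOP_LEVEL = ("trace_id", "run_id", "tenant_id")
REQUIRED_MODEL = ("model_id", "prompt_version", "input_tokens", "output_tokens")
REQUIRED_TOOL_CALL = ("tool_call_id", "mcp_server", "side_effect", "status")


def missing_fields(record: dict) -> list[str]:
    """Return dotted paths of required fields that are absent from a trace record."""
    missing = [key for key in REQUIRED_TOP_LEVEL if key not in record]
    model = record.get("model", {})
    missing += [f"model.{key}" for key in REQUIRED_MODEL if key not in model]
    for index, call in enumerate(record.get("tool_calls", [])):
        missing += [
            f"tool_calls[{index}].{key}"
            for key in REQUIRED_TOOL_CALL
            if key not in call
        ]
    return missing


# Placeholder record; "tr_example" is a made-up ID, and the tool call
# deliberately omits side_effect to show what a failed check looks like.
example = json.loads("""
{
  "trace_id": "tr_example",
  "run_id": "run_20260517_001",
  "tenant_id": "acme-enterprise",
  "model": {"model_id": "gpt-5.4-mini", "prompt_version": "p-20260517.2",
            "input_tokens": 3840, "output_tokens": 620},
  "tool_calls": [{"tool_call_id": "tool_01", "mcp_server": "github-readonly-prod",
                  "status": "ok"}]
}
""")

print(missing_fields(example))  # ['tool_calls[0].side_effect']
```

Running a check like this at ingestion time keeps the "reproducible from its trace_id" goal honest: a record that passes has at least the fields needed to start reconstructing the run.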

SLI/SLO Definitions

\text{Availability SLI} = \frac{\text{Successful Requests}}{\text{Total Requests}}

\text{Quality SLI} = \frac{\text{Successful Tasks}}{\text{Evaluated Tasks}}

SLO Item	Example Target
Availability SLO	>= 99.9%
Quality SLO	>= 95%
p95 Latency SLO	<= 4 seconds
Policy Violation SLO	<= 0.2%

\text{Error Budget} = (1 - \text{SLO}) \times \text{Total Requests}
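As a worked example of the formulas above, the sketch below computes the availability SLI, the error budget, and a burn rate from request counts. The traffic numbers are invented for illustration. Burn rate is taken here as the observed error rate divided by the allowed error rate (1 - SLO), a common SRE definition that this chapter does not spell out, so treat it as an assumption.

```python
def availability_sli(successful: int, total: int) -> float:
    """Availability SLI = successful requests / total requests."""
    return successful / total


def error_budget(slo: float, total: int) -> float:
    """Error budget = (1 - SLO) x total requests, counted in failed requests."""
    return (1 - slo) * total


def burn_rate(failed: int, total: int, slo: float) -> float:
    """Observed error rate relative to the error rate the SLO allows (assumed definition)."""
    return (failed / total) / (1 - slo)


SLO = 0.999  # availability target from the table above

# Invented traffic: a full SLO period and the most recent hour.
period_total, period_failed = 1_000_000, 400
hour_total, hour_failed = 20_000, 50

sli = availability_sli(period_total - period_failed, period_total)
budget = error_budget(SLO, period_total)
consumed = period_failed / budget
hourly_burn = burn_rate(hour_failed, hour_total, SLO)

print(f"SLI {sli:.2%}, budget {budget:.0f} failed requests, consumed {consumed:.0%}")
print(f"last-hour burn rate {hourly_burn:.1f}x")
```

With these numbers the period is within its SLO (400 of 1,000 budgeted failures used), yet the last hour burns budget at 2.5 times the sustainable rate. That gap is why burn rate, not just the current SLI, belongs on the dashboard.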

Dashboard Priorities

SLO status and burn rate
Failure distribution by model and prompt
Policy block and approval ratios
Top-cost tenants and features
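The views listed above can be derived from per-run summaries with simple aggregation. The sketch below is one way to do it, assuming each run is flattened into a record with the fields shown. `cost_usd` and the sample values are hypothetical, and a real dashboard would compute the same aggregates in its query layer.

```python
from collections import Counter

# Hypothetical per-run summaries. Field names follow the Required Signals
# table where possible; cost_usd is an assumed extra field.
runs = [
    {"tenant_id": "t1", "model_id": "model-a", "prompt_version": "p1",
     "status": "ok", "decision": "allow", "cost_usd": 0.02},
    {"tenant_id": "t1", "model_id": "model-a", "prompt_version": "p2",
     "status": "error", "decision": "allow", "cost_usd": 0.05},
    {"tenant_id": "t2", "model_id": "model-b", "prompt_version": "p1",
     "status": "ok", "decision": "block", "cost_usd": 0.01},
    {"tenant_id": "t2", "model_id": "model-a", "prompt_version": "p2",
     "status": "error", "decision": "allow", "cost_usd": 0.04},
]

# Failure distribution by (model, prompt version).
failures_by_model_prompt = Counter(
    (run["model_id"], run["prompt_version"])
    for run in runs
    if run["status"] != "ok"
)

# Share of runs that a policy blocked.
policy_block_ratio = sum(run["decision"] == "block" for run in runs) / len(runs)

# Total spend per tenant, highest first.
cost_by_tenant: Counter = Counter()
for run in runs:
    cost_by_tenant[run["tenant_id"]] += run["cost_usd"]

print(failures_by_model_prompt.most_common())
print(f"policy block ratio: {policy_block_ratio:.0%}")
print(cost_by_tenant.most_common(1)[0][0])
```

In this sample both failures land on the same model and prompt version, which is the kind of concentration the failure-distribution view is meant to surface.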

Tracing Practice

Require request-level distributed traces (trace_id).
Separate model-call, tool-call, and policy-decision spans.
Use 100% sampling for high-risk paths and adaptive sampling for normal paths.
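The sampling rule above can be sketched as two small functions: one picks a rate per path, the other makes a deterministic keep-or-drop decision per trace. This is a minimal sketch, not a drop-in sampler for any tracing SDK. The "adaptive" policy here (scale the rate down so kept traces stay near a target volume) is one assumed interpretation, and `target_per_min` and `floor` are hypothetical tuning parameters.

```python
import hashlib


def sample_rate_for(
    high_risk: bool,
    recent_traces_per_min: float,
    target_per_min: float = 600.0,
    floor: float = 0.01,
) -> float:
    """Return the fraction of traces to keep for a path."""
    if high_risk:
        return 1.0  # high-risk paths are always fully sampled
    if recent_traces_per_min <= target_per_min:
        return 1.0  # low traffic: keeping everything is affordable
    # Scale down so kept volume stays near the target, but never below the floor.
    return max(floor, target_per_min / recent_traces_per_min)


def should_sample(trace_id: str, rate: float) -> bool:
    """Deterministic decision: the same trace_id always gets the same answer,
    so every service handling a request keeps or drops the whole trace."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


normal_rate = sample_rate_for(high_risk=False, recent_traces_per_min=6000)
risky_rate = sample_rate_for(high_risk=True, recent_traces_per_min=6000)
print(normal_rate, risky_rate)
print(should_sample("tr_example", risky_rate))
```

Hashing the trace ID rather than drawing a random number keeps the decision consistent across model-call, tool-call, and policy-decision spans, so a sampled trace is never missing one of its parts.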

OWASP Agent Observability Standard (AOS)

OWASP AOS is an open project for standardizing observability in agent systems. As of May 2026, it is safest to treat it as work in progress. It is organized around three axes:

Axis	Requirement	Implementation Standard
Instrumentable	Expose agent and tool calls as instrumentable units	Native MCP + A2A instrumentation
Traceable	Trace the full request/response path	OCSF plus OTel integration
Inspectable	Make agent components auditable	AI BOM based on CycloneDX, SWID, SPDX

AOS Adoption

OWASP AOS complements OpenTelemetry GenAI Semantic Conventions. OTel focuses on runtime tracing, while AOS focuses on agent-level audit and security observability. Since AOS is still moving quickly, version your internal trace schema and keep an adapter layer for breaking changes.
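One way to keep an adapter layer is to stamp every record with a schema version and upgrade old records through a chain of small migration functions. The sketch below assumes a hypothetical breaking change (renaming `token_in`/`token_out` to `input_tokens`/`output_tokens`). The version numbers, field renames, and function names are all illustrative and do not come from AOS or OTel.

```python
from typing import Callable


def v1_to_v2(record: dict) -> dict:
    """Hypothetical migration: rename token fields and bump the version."""
    upgraded = dict(record)  # shallow copy so the caller's record is untouched
    model = dict(upgraded.get("model", {}))
    if "token_in" in model:
        model["input_tokens"] = model.pop("token_in")
    if "token_out" in model:
        model["output_tokens"] = model.pop("token_out")
    upgraded["model"] = model
    upgraded["schema_version"] = 2
    return upgraded


# Each entry upgrades a record from that version to the next one.
MIGRATIONS: dict[int, Callable[[dict], dict]] = {1: v1_to_v2}
CURRENT_VERSION = 2


def upgrade(record: dict) -> dict:
    """Apply migrations until the record reaches CURRENT_VERSION.
    Records without a schema_version are treated as version 1."""
    version = record.get("schema_version", 1)
    while version < CURRENT_VERSION:
        record = MIGRATIONS[version](record)
        version = record["schema_version"]
    return record


old = {"trace_id": "tr_example", "model": {"token_in": 3840, "token_out": 620}}
print(upgrade(old))
```

When an upstream standard changes, only a new migration and the exporter need to change. Stored traces and downstream graders keep reading the current internal version.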

2026 Observability Tooling

Tool	Version	Notes
Langfuse	v4.0.0	MIT open source, LLM-as-a-Judge/experiments/playground open, OTel native
LangSmith Fleet	Current	Rebranded from Agent Builder, subagent state cards, LangSmith Fetch CLI, unified cost view, experiment baseline pinning
Arize Phoenix	v13.0.3	CLI support for Claude Code/Cursor integration, LDAP auth, open source
Braintrust Loop AI	Current	Natural-language scorer generation, Java/Go/Ruby/C# SDKs, OTel native, SOC 2 Type II

OpenTelemetry GenAI Semantic Conventions

OpenTelemetry GenAI semantic conventions are in Development status as of May 2026.

Item	Current Status
Events (input/output)	GenAI input/output events defined
Metrics (tokens/latency)	GenAI operation metrics defined
Model spans	Technology-specific conventions for OpenAI, Anthropic, AWS Bedrock, Azure AI Inference, and others
Agent spans	GenAI agent/framework spans included
MCP spans	MCP semantic conventions included
OWASP AOS relationship	AOS Traceable axis references OTel GenAI conventions
Vendor adoption	OTel links are expanding across Langfuse, Phoenix, Braintrust, NeMo Guardrails, and others

OTel Adoption

Mapping OTel-compatible fields into an internal canonical schema makes vendor migration and cross-service tracing easier. Because the GenAI conventions are still in Development, record with each emitted trace whether OTEL_SEMCONV_STABILITY_OPT_IN is set and which convention version the fields follow.
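The mapping can be kept as a plain lookup table between internal field names and OTel attribute names. The sketch below assumes the `gen_ai.*` attribute names shown, which reflect the GenAI conventions as understood at the time of writing and may change while the conventions are in Development, so verify them against the version you pin. The `app.semconv.*` bookkeeping attributes and the function name are hypothetical.

```python
import os

# Internal field -> OTel GenAI attribute. The right-hand names are assumptions
# based on the GenAI semantic conventions; check them against your pinned version.
OTEL_ATTRIBUTE_MAP = {
    "provider": "gen_ai.provider.name",
    "model_id": "gen_ai.request.model",
    "input_tokens": "gen_ai.usage.input_tokens",
    "output_tokens": "gen_ai.usage.output_tokens",
}


def to_otel_attributes(model: dict, semconv_version: str) -> dict:
    """Translate the internal "model" object into a flat attribute dict."""
    attributes = {
        otel_name: model[field]
        for field, otel_name in OTEL_ATTRIBUTE_MAP.items()
        if field in model
    }
    # Hypothetical bookkeeping attributes: which convention version was emitted
    # and whether the stability opt-in variable was set in this process.
    attributes["app.semconv.version"] = semconv_version
    attributes["app.semconv.opt_in"] = os.environ.get(
        "OTEL_SEMCONV_STABILITY_OPT_IN", ""
    )
    return attributes


model = {
    "provider": "openai",
    "model_id": "gpt-5.4-mini",
    "prompt_version": "p-20260517.2",  # no mapping: stays internal only
    "input_tokens": 3840,
    "output_tokens": 620,
}
print(to_otel_attributes(model, semconv_version="pinned-version-placeholder"))
```

Fields without an OTel equivalent, such as `prompt_version` here, stay in the internal schema. Keeping the map in one place means a renamed attribute is a one-line change.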

Trace to Eval Operating Loop

Step	Output
Debug trace	Reconstruct which model/tool/handoff/approval ran in a single run
Trace grading	Score tool choice, handoff timing, and guardrail activation with a structured rubric
Dataset promotion	Store repeated failure traces as regression/eval datasets
Online sampling	Apply the same grader to sampled traces after release to detect drift
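The loop above can be sketched end to end with a rule-based grader. The rubric checks below are deliberately simple stand-ins: the fields `allowed_tools` and `handoffs`, the handoff limit, the promotion threshold, and the drift tolerance are all hypothetical choices, and a production grader might use an LLM judge or human review instead.

```python
def grade_trace(trace: dict) -> dict[str, bool]:
    """Score one trace against a three-item rubric (pass/fail per criterion)."""
    calls = trace.get("tool_calls", [])
    used_tools = {call["tool_name"] for call in calls}
    return {
        # Every tool used must be on the run's allow-list (assumed field).
        "tool_choice": used_tools <= set(trace.get("allowed_tools", [])),
        # At most two handoffs per run (arbitrary illustrative limit).
        "handoff_timing": len(trace.get("handoffs", [])) <= 2,
        # Every side-effecting call must carry an approval_id.
        "guardrail_activation": all(
            bool(call.get("approval_id")) for call in calls if call.get("side_effect")
        ),
    }


def promote_repeated_failures(traces: list[dict], min_repeats: int = 2) -> dict[str, list[str]]:
    """Group failing trace IDs by criterion; keep groups that repeat."""
    failing: dict[str, list[str]] = {}
    for trace in traces:
        for criterion, passed in grade_trace(trace).items():
            if not passed:
                failing.setdefault(criterion, []).append(trace["trace_id"])
    return {c: ids for c, ids in failing.items() if len(ids) >= min_repeats}


def pass_rate(traces: list[dict], criterion: str) -> float:
    """Fraction of traces that pass one rubric criterion."""
    return sum(grade_trace(t)[criterion] for t in traces) / len(traces)


def has_drifted(baseline: float, online: float, tolerance: float = 0.05) -> bool:
    """Flag drift when the online pass rate falls below baseline by more than tolerance."""
    return baseline - online > tolerance


allowed = ["search", "create_issue"]
traces = [
    {"trace_id": "tr_a", "allowed_tools": allowed,
     "tool_calls": [{"tool_name": "create_issue", "side_effect": True}]},
    {"trace_id": "tr_b", "allowed_tools": allowed,
     "tool_calls": [{"tool_name": "search", "side_effect": False}]},
    {"trace_id": "tr_c", "allowed_tools": allowed,
     "tool_calls": [{"tool_name": "delete_repo", "side_effect": True}]},
]

print(promote_repeated_failures(traces))
print(has_drifted(baseline=0.95, online=pass_rate(traces, "guardrail_activation")))
```

Here two traces fail the guardrail check and are promoted as a regression dataset, while the single tool-choice failure is not yet a pattern. Reusing `grade_trace` for online samples keeps pre-release and post-release scores comparable.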

Operating Standard

Mean latency hides tail behavior and is rarely useful for operations. Default to p95/p99 latency, top-tenant segments, and policy-failure views.
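A small worked example shows why the mean misleads. The latencies below are invented: 90 fast requests and 10 slow ones. The percentile function uses the nearest-rank method, which is one of several common definitions and is chosen here only for simplicity.

```python
import math
import statistics


def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Invented sample: 90% of requests take 1.2 s, 10% take 9.0 s.
latencies_s = [1.2] * 90 + [9.0] * 10

mean_s = statistics.mean(latencies_s)
p95_s = percentile(latencies_s, 95)
p99_s = percentile(latencies_s, 99)

print(f"mean {mean_s:.2f}s, p95 {p95_s:.1f}s, p99 {p99_s:.1f}s")
```

The mean comes out just under 2 seconds, comfortably inside a 4-second target, while p95 and p99 are both 9 seconds. One request in ten is more than twice as slow as the target allows, and only the percentile view shows it.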

Baseline and Sources

Item	Baseline Date	Recheck By	Primary Source
OTel GenAI semantic conventions	2026-05-17	2026-06-16	https://opentelemetry.io/docs/specs/semconv/gen-ai/
OWASP AOS	2026-05-17	2026-06-16	https://aos.owasp.org/aos/
Agents SDK tracing	2026-05-17	2026-06-16	https://developers.openai.com/api/docs/guides/agents/integrations-observability
