Ch5. Observability and SLOs
Collect model, tool, and policy execution as traceable signals and operate them through SLOs
Key takeaways
- The real observability standard is time to reconstruct causality, not log volume.
- Capture required signals across request, model, agent, tool, policy, and trace fields so any run is reproducible from its trace_id.
- Define SLIs and SLOs with explicit targets (availability 99.9%, quality ≥ 95%, p95 latency ≤ 4s, policy violation ≤ 0.2%) and track error-budget burn.
- Align internal traces with OpenTelemetry GenAI conventions and OWASP AOS (Instrumentable, Traceable, Inspectable), keeping an adapter layer since both are still evolving.
- Operate a trace-to-eval loop: debug traces, grade them with a rubric, promote repeated failures into datasets, and re-grade online samples to detect drift.
If you cannot quickly reconstruct why a problem happened, observability is not sufficient.
The LLMOps standard is not log volume. It is time to reconstruct causality.
Required Signals
| Area | Required Fields |
|---|---|
| Request | request_id, tenant_id, user_segment, intent, session_id |
| Model | model_id, prompt_version, token_in/out, cached_tokens, reasoning_mode |
| Agent | agent_id, run_id, handoff_from/to, state_version |
| Tool | tool_name, tool_call_id, mcp_server, latency_ms, status, side_effect |
| Policy | policy_pack, guardrail_name, decision, violation_type, approval_id |
| Trace | trace_id, span_id, parent_span_id, eval_score, dataset_version |
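One way to keep the table above enforceable is to check each emitted record against it before export. The sketch below is a minimal illustration, not a prescribed implementation: it assumes records are nested by area (a hypothetical layout), and it expands the shorthand `token_in/out` and `handoff_from/to` into two field names each.

```python
# Required fields per area, copied from the Required Signals table.
# Assumption: "token_in/out" and "handoff_from/to" each expand to two names.
REQUIRED_FIELDS = {
    "request": ["request_id", "tenant_id", "user_segment", "intent", "session_id"],
    "model": ["model_id", "prompt_version", "token_in", "token_out",
              "cached_tokens", "reasoning_mode"],
    "agent": ["agent_id", "run_id", "handoff_from", "handoff_to", "state_version"],
    "tool": ["tool_name", "tool_call_id", "mcp_server", "latency_ms",
             "status", "side_effect"],
    "policy": ["policy_pack", "guardrail_name", "decision",
               "violation_type", "approval_id"],
    "trace": ["trace_id", "span_id", "parent_span_id", "eval_score",
              "dataset_version"],
}


def missing_fields(record: dict) -> dict:
    """Return {area: [missing field names]} for every area with gaps.

    A field counts as present when its key exists, even if the value is
    None (for example, parent_span_id on a root span).
    """
    gaps = {}
    for area, fields in REQUIRED_FIELDS.items():
        section = record.get(area, {})
        absent = [name for name in fields if name not in section]
        if absent:
            gaps[area] = absent
    return gaps
```

An empty result means the record carries every required key. A non-empty result can be logged or counted as a signal-completeness metric rather than used to drop the record.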
Minimal Trace Schema Example
{
  "trace_id": "tr_01hx9...",
  "run_id": "run_20260517_001",
  "tenant_id": "acme-enterprise",
  "model": {
    "provider": "openai",
    "model_id": "gpt-5.4-mini",
    "prompt_version": "p-20260517.2",
    "input_tokens": 3840,
    "cached_input_tokens": 2560,
    "output_tokens": 620
  },
  "tool_calls": [
    {
      "tool_call_id": "tool_01",
      "mcp_server": "github-readonly-prod",
      "side_effect": false,
      "status": "ok"
    }
  ],
  "approval": {
    "approval_id": "appr_01",
    "decision": "approved"
  }
}
SLI/SLO Definitions
| SLO Item | Example Target |
|---|---|
| Availability SLO | 99.9% |
| Quality SLO | >= 95% |
| p95 Latency SLO | <= 4 seconds |
| Policy Violation SLO | <= 0.2% |
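Each target above implies an error budget: the fraction of events that may be "bad" before the SLO is missed. Burn rate is the observed bad fraction divided by that allowed fraction. The sketch below uses the example targets from the table; the SLO keys and the function name are illustrative assumptions.

```python
# Allowed fraction of "bad" events per SLO, derived from the example targets.
ERROR_BUDGET = {
    "availability": 0.001,      # 99.9% of requests succeed
    "quality": 0.05,            # >= 95% of graded responses pass
    "latency_p95_4s": 0.05,     # p95 <= 4 s means at most 5% of requests exceed 4 s
    "policy_violation": 0.002,  # <= 0.2% of requests violate policy
}


def burn_rate(slo: str, bad_events: int, total_events: int) -> float:
    """Observed bad fraction divided by the allowed bad fraction.

    1.0 means the budget is being spent exactly as fast as the SLO allows;
    values above 1.0 exhaust the budget before the SLO window ends.
    """
    if total_events <= 0:
        return 0.0  # no traffic, nothing burned
    return (bad_events / total_events) / ERROR_BUDGET[slo]
```

For example, 30 failed requests out of 10,000 against the availability target gives a burn rate of about 3: at that pace the budget for the window is gone in roughly a third of the window.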
Dashboard Priorities
- SLO status and burn rate
- Failure distribution by model and prompt
- Policy block and approval ratios
- Top-cost tenants and features
Tracing Practice
- Require request-level distributed traces (trace_id).
- Separate model-call, tool-call, and policy-decision spans.
- Use 100% sampling for high-risk paths and adaptive sampling for normal paths.
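The sampling rule in the list above can be sketched as a deterministic decision keyed on `trace_id`, so every service that sees the same trace makes the same choice. "Adaptive" is interpreted here as lowering the rate when traffic grows, which is one possible reading; the function names and the `high_risk` flag are assumptions for illustration.

```python
import hashlib


def should_sample(trace_id: str, high_risk: bool, normal_rate: float) -> bool:
    """Keep every high-risk trace; keep a stable fraction of normal traces."""
    if high_risk:
        return True
    # Hash the trace_id so the decision is identical on every service.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < normal_rate * 2**64


def adaptive_rate(traces_per_second: float, target_sampled_per_second: float) -> float:
    """Lower the rate as traffic grows so sampled volume stays near the target."""
    if traces_per_second <= 0:
        return 1.0
    return min(1.0, target_sampled_per_second / traces_per_second)
```

A caller would typically recompute `adaptive_rate` on a short interval and pass the result as `normal_rate`, while the high-risk flag comes from the policy or tool metadata (for example, a tool call with `side_effect` set).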
OWASP Agent Observability Standard (AOS)
OWASP AOS is an open project for standardizing observability in agent systems. As of May 2026, it is safest to treat it as work in progress. It is organized around three axes:
| Axis | Requirement | Implementation Standard |
|---|---|---|
| Instrumentable | Expose agent and tool calls as instrumentable units | Native MCP + A2A instrumentation |
| Traceable | Trace the full request/response path | OCSF plus OTel integration |
| Inspectable | Make agent components auditable | AI BOM based on CycloneDX, SWID, SPDX |
AOS Adoption
OWASP AOS complements OpenTelemetry GenAI Semantic Conventions. OTel focuses on runtime tracing, while AOS focuses on agent-level audit and security observability. Since AOS is still moving quickly, version your internal trace schema and keep an adapter layer for breaking changes.
2026 Observability Tooling
| Tool | Version | Notes |
|---|---|---|
| Langfuse | v4.0.0 | MIT open source, LLM-as-a-Judge/experiments/playground open, OTel native |
| LangSmith Fleet | Current | Rebranded from Agent Builder, subagent state cards, LangSmith Fetch CLI, unified cost view, experiment baseline pinning |
| Arize Phoenix | v13.0.3 | CLI support for Claude Code/Cursor integration, LDAP auth, open source |
| Braintrust Loop AI | Current | Natural-language scorer generation, Java/Go/Ruby/C# SDKs, OTel native, SOC 2 Type II |
OpenTelemetry GenAI Semantic Conventions
OpenTelemetry GenAI semantic conventions are in Development status as of May 2026.
| Item | Current Status |
|---|---|
| Events (input/output) | GenAI input/output events defined |
| Metrics (tokens/latency) | GenAI operation metrics defined |
| Model spans | Technology-specific conventions for OpenAI, Anthropic, AWS Bedrock, Azure AI Inference, and others |
| Agent spans | GenAI agent/framework spans included |
| MCP spans | MCP semantic conventions included |
| OWASP AOS relationship | AOS Traceable axis references OTel GenAI conventions |
| Vendor adoption | OTel links are expanding across Langfuse, Phoenix, Braintrust, NeMo Guardrails, and others |
OTel Adoption
Mapping OTel-compatible fields into an internal canonical schema makes vendor migration and cross-service tracing
easier. When emitting the latest GenAI conventions, record whether OTEL_SEMCONV_STABILITY_OPT_IN is set and which
convention version the emitted fields follow.
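A thin adapter is one way to keep the internal schema canonical while still emitting OTel-style attributes. The sketch below maps the model fields from the trace schema example to GenAI attribute names as published in the conventions at the time of writing; because the conventions are still in Development status, the names should be verified against the spec before use. `SCHEMA_VERSION` and the `internal.*` attribute keys are hypothetical.

```python
import os
from typing import Optional

SCHEMA_VERSION = "internal-trace/1"  # hypothetical internal schema version label

# Internal model field -> OTel GenAI attribute name.
# Verify these names against the current spec; they may change.
OTEL_MODEL_ATTRS = {
    "model_id": "gen_ai.request.model",
    "input_tokens": "gen_ai.usage.input_tokens",
    "output_tokens": "gen_ai.usage.output_tokens",
}


def to_otel_attributes(trace: dict, semconv_opt_in: Optional[str] = None) -> dict:
    """Translate the internal model section into OTel-style span attributes."""
    if semconv_opt_in is None:
        # Record the opt-in setting that the emitting process actually ran with.
        semconv_opt_in = os.environ.get("OTEL_SEMCONV_STABILITY_OPT_IN", "")
    model = trace.get("model", {})
    attrs = {
        otel_name: model[internal_name]
        for internal_name, otel_name in OTEL_MODEL_ATTRS.items()
        if internal_name in model
    }
    # Hypothetical bookkeeping attributes so a reader can tell which mapping ran.
    attrs["internal.schema_version"] = SCHEMA_VERSION
    attrs["internal.semconv_opt_in"] = semconv_opt_in
    return attrs
```

Fields without a mapping (such as `cached_input_tokens` here) are left out rather than guessed, so a convention change only requires editing the mapping table and bumping the schema version.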
Trace to Eval Operating Loop
| Step | Output |
|---|---|
| Debug trace | Reconstruct which model/tool/handoff/approval ran in a single run |
| Trace grading | Score tool choice, handoff timing, and guardrail activation with a structured rubric |
| Dataset promotion | Store repeated failure traces as regression/eval datasets |
| Online sampling | Apply the same grader to sampled traces after release to detect drift |
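The grading and promotion steps above can be sketched with a small rule-based grader. The two rubric items below are simplified stand-ins chosen because they can be computed from the trace schema example; a real rubric for tool choice, handoff timing, and guardrail activation would need richer checks or a model-based judge. All function names and the `min_repeats` threshold are assumptions.

```python
from collections import defaultdict


def grade_trace(trace: dict) -> dict:
    """Score one trace against a (hypothetical) structured rubric."""
    tool_calls = trace.get("tool_calls", [])
    decision = (trace.get("approval") or {}).get("decision")
    return {
        # Every tool call finished with status "ok".
        "tools_succeeded": all(c.get("status") == "ok" for c in tool_calls),
        # Any tool call with a side effect ran under an approved decision.
        "side_effects_approved": all(
            not c.get("side_effect") or decision == "approved"
            for c in tool_calls
        ),
    }


def failures_to_promote(traces: list, min_repeats: int = 3) -> dict:
    """Group failing trace_ids by rubric item; keep items that repeat."""
    failed = defaultdict(list)
    for trace in traces:
        for item, passed in grade_trace(trace).items():
            if not passed:
                failed[item].append(trace["trace_id"])
    return {item: ids for item, ids in failed.items() if len(ids) >= min_repeats}
```

Running the same `grade_trace` on sampled production traces after a release, and comparing pass rates per rubric item against the pre-release baseline, gives a simple drift signal.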
Operating Standard
Mean latency is often not useful for operations because it hides tail behavior: a few very slow requests barely move the average. Default to p95/p99, top-tenant segments, and policy-failure views.
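A small worked example shows why the mean can mislead. The sketch below uses a nearest-rank percentile (one of several common definitions, chosen here for simplicity) on made-up latencies; the numbers are illustrative, not measurements.

```python
import math


def percentile(values: list, pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Illustrative data: 18 fast requests and 2 slow ones (seconds).
latencies = [1.0] * 18 + [12.0] * 2

mean_latency = sum(latencies) / len(latencies)  # 2.1 s, looks healthy
p95_latency = percentile(latencies, 95)         # 12.0 s, breaches a 4 s target
```

With these samples the mean sits well under the 4-second example target while the p95 is three times over it, which is the case a mean-based dashboard would miss.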
Baseline and Sources
| Item | Baseline Date | Recheck By | Primary Source |
|---|---|---|---|
| OTel GenAI semantic conventions | 2026-05-17 | 2026-06-16 | https://opentelemetry.io/docs/specs/semconv/gen-ai/ |
| OWASP AOS | 2026-05-17 | 2026-06-16 | https://aos.owasp.org/aos/ |
| Agents SDK tracing | 2026-05-17 | 2026-06-16 | https://developers.openai.com/api/docs/guides/agents/integrations-observability |