Ch8. Incident Management Runbook
Operate a unified standard for quality regressions, cost spikes, and policy bypass incidents
Key takeaways
- LLM incidents often look normal while failing in substance, so classify them by type: quality regression, cost spike, policy bypass, MCP/Skill compromise, A2A abuse, and voice/realtime degradation.
- Follow a fixed containment order: narrow blast radius, stop side-effect paths (payments, deployments, file writes), revoke tokens and connections, enable fallback, then preserve trace and approval evidence.
- Map incidents to unified SEV levels: policy bypass is SEV-1/2 and involves Security and Compliance; quality regressions and cost spikes are SEV-2/3.
- Use AI agentic operations (e.g., PagerDuty) to connect observability signals, runbooks, approval boundaries, and escalation, but require human approval for SEV-1 incidents.
- Postmortems should remove the system conditions that allowed failure rather than assign personal blame.
LLM service incidents often look normal while failing in substance.
Classify incidents by quality, cost, and policy, then standardize immediate actions and recurrence controls.
Incident Types
| Type | Detection Signal | Immediate Action |
|---|---|---|
| Quality regression | Task Success drops, judge score declines | Roll back prompt/model/tool policy |
| Cost spike | Unit cost rises, cache hit rate drops | Route to lighter models, limit tool calls, switch to batch |
| Policy bypass | Violation responses increase, guardrail bypass detected | Switch to approval mode, hotfix policy pack |
| MCP/Skill compromise | Shadow server, unapproved scope, abnormal egress | Disable server, revoke token, isolate sandbox |
| A2A abuse | Webhook SSRF, pre-auth resource exposure | Block peer, stop push notifications |
| Voice/realtime degradation | First-audio latency rises, interruption loops | Text fallback, session recreation, low-latency model switch |
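The table above can be read as a lookup from detection signals to incident type and immediate actions. The following is a minimal sketch of that lookup, not a production classifier. The action lists are copied from the table; the signal keys (such as `task_success_drop`) and the function names are hypothetical labels chosen for illustration and would need to match whatever your observability stack actually emits.

```python
# Immediate actions per incident type, copied from the incident type table.
IMMEDIATE_ACTIONS = {
    "quality_regression": ["roll back prompt/model/tool policy"],
    "cost_spike": ["route to lighter models", "limit tool calls", "switch to batch"],
    "policy_bypass": ["switch to approval mode", "hotfix policy pack"],
    "mcp_skill_compromise": ["disable server", "revoke token", "isolate sandbox"],
    "a2a_abuse": ["block peer", "stop push notifications"],
    "voice_realtime_degradation": [
        "text fallback", "session recreation", "low-latency model switch",
    ],
}

# Hypothetical signal names (assumption) mapped to the table's incident types.
SIGNAL_TO_TYPE = {
    "task_success_drop": "quality_regression",
    "judge_score_decline": "quality_regression",
    "unit_cost_rise": "cost_spike",
    "cache_hit_rate_drop": "cost_spike",
    "violation_responses_increase": "policy_bypass",
    "guardrail_bypass": "policy_bypass",
    "shadow_server": "mcp_skill_compromise",
    "unapproved_scope": "mcp_skill_compromise",
    "abnormal_egress": "mcp_skill_compromise",
    "webhook_ssrf": "a2a_abuse",
    "pre_auth_resource_exposure": "a2a_abuse",
    "first_audio_latency_rise": "voice_realtime_degradation",
    "interruption_loop": "voice_realtime_degradation",
}


def classify(signals):
    """Return the incident types matched by the signals, in first-seen order."""
    matched = []
    for signal in signals:
        incident_type = SIGNAL_TO_TYPE.get(signal)
        if incident_type is not None and incident_type not in matched:
            matched.append(incident_type)
    return matched


def immediate_actions(signals):
    """Map each matched incident type to its immediate actions."""
    return {t: IMMEDIATE_ACTIONS[t] for t in classify(signals)}
```

Unknown signals are ignored rather than raising, on the assumption that an unmapped alert should go to a human for triage instead of blocking the lookup.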
Response Flow
Postmortem Fields
- Detection delay cause (MTTD)
- Containment delay cause
- Manual steps that can be automated
- Controls to prevent the same incident class
- Related trace_id, approval_id, MCP server ID, skill version
- Customer impact scope and notification needs
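One way to keep these fields from being skipped is to capture them in a structured record and check completeness before the postmortem is closed. The sketch below is illustrative only: the class name, field names, and the choice of which fields are mandatory (`REQUIRED`) are assumptions, not part of any standard. Related IDs are treated as optional because not every incident involves an MCP server or a skill.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Postmortem:
    detection_delay_cause: str = ""            # why MTTD was what it was
    containment_delay_cause: str = ""
    automatable_manual_steps: List[str] = field(default_factory=list)
    recurrence_controls: List[str] = field(default_factory=list)
    trace_ids: List[str] = field(default_factory=list)
    approval_ids: List[str] = field(default_factory=list)
    mcp_server_ids: List[str] = field(default_factory=list)
    skill_versions: List[str] = field(default_factory=list)
    customer_impact_scope: str = ""
    notification_needed: Optional[bool] = None  # None means "not yet decided"


# Assumption: these fields must be filled before the postmortem can be closed.
REQUIRED = (
    "detection_delay_cause",
    "containment_delay_cause",
    "recurrence_controls",
    "customer_impact_scope",
    "notification_needed",
)


def missing_fields(pm):
    """Return the names of required fields that are still empty or undecided."""
    missing = []
    for name in REQUIRED:
        value = getattr(pm, name)
        if value is None or value == "" or value == []:
            missing.append(name)
    return missing
```

Note that `notification_needed=False` counts as filled in: deciding that no customer notification is needed is still a decision.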
Link to Unified Incident Classification
Manage LLMOps incidents under the same severity scheme as security incidents:
- Quality regression: classify as SEV-2/3 and escalate to ML Platform Lead.
- Cost spike: classify as SEV-2/3 and respond jointly with Finance and Platform.
- Policy bypass: classify as SEV-1/2 and involve Security and Compliance immediately.
If a data security incident occurs, switch to the SEV-1 security response process immediately.
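The mapping above can be expressed as a small triage function. This is a sketch under stated assumptions: the SEV ranges and escalation targets come from the list above, while the `high_impact` flag used to pick a level within a range, the `process` labels, and the function name are hypothetical.

```python
# SEV ranges and escalation targets as listed in this section.
SEV_RULES = {
    "quality_regression": {
        "sev_range": ("SEV-2", "SEV-3"),
        "escalate_to": ["ML Platform Lead"],
    },
    "cost_spike": {
        "sev_range": ("SEV-2", "SEV-3"),
        "escalate_to": ["Finance", "Platform"],
    },
    "policy_bypass": {
        "sev_range": ("SEV-1", "SEV-2"),
        "escalate_to": ["Security", "Compliance"],
    },
}


def triage(incident_type, high_impact=False, data_security_incident=False):
    """Pick a SEV level and escalation targets for an incident.

    high_impact (assumption) selects the more severe end of the range.
    A data security incident always switches to the SEV-1 security process.
    """
    rule = SEV_RULES.get(incident_type)
    if data_security_incident:
        return {
            "sev": "SEV-1",
            "process": "sev1_security_response",
            "escalate_to": list(rule["escalate_to"]) if rule else [],
        }
    if rule is None:
        raise ValueError(f"no SEV rule for incident type: {incident_type}")
    more_severe, less_severe = rule["sev_range"]
    return {
        "sev": more_severe if high_impact else less_severe,
        "process": "unified_incident",
        "escalate_to": list(rule["escalate_to"]),
    }
```

The data security check runs before the type lookup so that an unrecognized incident type can never prevent the switch to the SEV-1 process.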
PagerDuty AI Agentic Operations
In 2026, PagerDuty is expanding its AI integration ecosystem across LLMOps, agent governance, and agentic cloud operations. Operationally, this should not be interpreted as "AI automatically fixes everything." The important shift is connecting observability signals, runbooks, approval boundaries, and escalation into one incident loop.
| Capability | Description |
|---|---|
| Agentic detection | AI agents detect abnormal patterns and classify incident type |
| Automated recovery | Run predefined isolation and recovery runbooks |
| Escalation AI | Analyze severity and impact scope, then assign the right response team |
| Postmortem generation | Draft incident timelines and root-cause summaries |
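To make the "one incident loop" idea concrete, the sketch below builds a trigger event for the PagerDuty Events API v2 from an LLMOps detection signal. It only constructs the JSON body and does not send it. The top-level fields (`routing_key`, `event_action`, `dedup_key`, `payload`) follow the Events API v2 event format, which should be verified against the current PagerDuty reference before use. The SEV-to-severity mapping, the `custom_details` keys, the dedup key scheme, and the example values are all assumptions made for illustration.

```python
import json

# Events API v2 endpoint; the request itself is out of scope for this sketch.
EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"

# Assumed mapping from this runbook's SEV levels to Events API severities.
SEV_TO_PD_SEVERITY = {"SEV-1": "critical", "SEV-2": "error", "SEV-3": "warning"}


def build_trigger_event(routing_key, incident_type, sev, source, trace_id):
    """Build the JSON-serializable body for a trigger event."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        # Assumption: one open alert per incident type and trace.
        "dedup_key": f"{incident_type}:{trace_id}",
        "payload": {
            "summary": f"[{sev}] {incident_type} detected on {source}",
            "source": source,
            "severity": SEV_TO_PD_SEVERITY[sev],
            "custom_details": {
                "incident_type": incident_type,
                "trace_id": trace_id,
                # Carries the approval boundary along with the alert.
                "requires_human_approval": sev == "SEV-1",
            },
        },
    }


# Hypothetical values; the routing key is a placeholder, not a real key.
event = build_trigger_event(
    "YOUR_INTEGRATION_KEY", "policy_bypass", "SEV-1", "llm-gateway-prod", "trace-123"
)
body = json.dumps(event)
```

Putting `requires_human_approval` in the event keeps the approval boundary attached to the alert, so any downstream automation can check it before running a recovery runbook.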
Containment Order
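The key takeaways give the containment order as: narrow the blast radius, stop side-effect paths (payments, deployments, file writes), revoke tokens and connections, enable fallback, then preserve trace and approval evidence. A minimal sketch of a runner that enforces this order follows. The step names, the `handlers` dictionary, and the decision to continue after a failed step are assumptions for illustration, not a prescribed implementation.

```python
# Fixed containment order, as stated in the key takeaways.
CONTAINMENT_ORDER = [
    "narrow_blast_radius",
    "stop_side_effect_paths",        # payments, deployments, file writes
    "revoke_tokens_and_connections",
    "enable_fallback",
    "preserve_evidence",             # trace and approval evidence
]


def run_containment(handlers):
    """Run containment handlers in the fixed order and return a step log.

    handlers maps a step name to a zero-argument callable (hypothetical).
    A failed or missing step is recorded and does not stop later steps,
    so evidence preservation still runs (assumption of this sketch).
    """
    log = []
    for step in CONTAINMENT_ORDER:
        handler = handlers.get(step)
        if handler is None:
            log.append((step, "skipped: no handler"))
            continue
        try:
            handler()
            log.append((step, "ok"))
        except Exception as exc:
            log.append((step, f"failed: {exc}"))
    return log
```

The caller cannot reorder steps because the order lives in one constant rather than in the handlers. The returned log can be attached to the incident record as part of the preserved evidence.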
Credential Rotation and A2A Blocking Example
containment_playbook:
  trigger: mcp_or_a2a_compromise
  steps:
    - disable_mcp_server: github-readonly-prod
    - revoke_token_audience: mcp://github-readonly-prod
    - block_a2a_peer:
        agent_card_url: https://partner.example.com/.well-known/agent-card.json
        reason: webhook_ssrf_attempt
    - rotate_webhook_secret: a2a_push_notifications
    - set_runtime_mode: read_only
    - preserve_evidence:
        - trace_id
        - approval_id
        - mcp_server_logs
        - webhook_request_headers

Automated Recovery Standard
Apply automated recovery first to incidents with limited blast radius, such as single-tenant issues or lightweight model fallback. SEV-1 incidents should require human approval.
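The rule above can be sketched as a gate that runs before any recovery runbook. The SEV-1 approval requirement and the two limited-scope examples come from the paragraph above. The scope labels, return values, and the fallback to manual response for wider incidents are assumptions for illustration.

```python
# Scopes the text names as having limited blast radius (labels are assumed).
LIMITED_BLAST_RADIUS = {"single_tenant", "lightweight_model_fallback"}


def recovery_decision(sev, scope, human_approved=False):
    """Decide whether an automated recovery runbook may run.

    Returns one of:
      "await_human_approval": SEV-1 without approval, never auto-run
      "run_with_approval":    SEV-1 after a human has approved
      "auto_recover":         limited blast radius, safe to automate first
      "manual_response":      wider scope, left to responders (assumption)
    """
    if sev == "SEV-1":
        return "run_with_approval" if human_approved else "await_human_approval"
    if scope in LIMITED_BLAST_RADIUS:
        return "auto_recover"
    return "manual_response"
```

The SEV-1 check comes first on purpose: even a single-tenant SEV-1 incident waits for a human, so the limited-scope shortcut can never bypass the approval boundary.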
Principle
Postmortems should focus on removing the system conditions that allowed failure, not assigning personal blame.
Baseline and Sources
| Item | Baseline Date | Recheck By | Primary Source |
|---|---|---|---|
| OWASP MCP incident risks | 2026-05-17 | 2026-06-16 | https://owasp.org/www-project-mcp-top-10/ |
| OWASP Agentic Skills incident risks | 2026-05-17 | 2026-06-16 | https://owasp.org/www-project-agentic-skills-top-10/ |
| PagerDuty AI operations ecosystem | 2026-05-17 | 2026-06-16 | https://www.pagerduty.com/newsroom/pagerduty-expands-ai-ecosystem-to-supercharge-ai-agents/ |
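Each source in the table carries a recheck deadline 30 days after its baseline date. A small check like the following could flag rows that are overdue. The dates and item names are copied from the table; the function name and the idea of running it on a schedule are assumptions.

```python
from datetime import date

# (item, baseline date, recheck-by date), copied from the table above.
SOURCES = [
    ("OWASP MCP incident risks", date(2026, 5, 17), date(2026, 6, 16)),
    ("OWASP Agentic Skills incident risks", date(2026, 5, 17), date(2026, 6, 16)),
    ("PagerDuty AI operations ecosystem", date(2026, 5, 17), date(2026, 6, 16)),
]


def due_for_recheck(today):
    """Return the items whose recheck-by date has already passed."""
    return [name for name, _baseline, recheck_by in SOURCES if today > recheck_by]
```

A source is treated as overdue only after its recheck-by date, so a check run on the deadline day itself reports nothing.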