Ch8. Incident Management Runbook

Apply one unified standard to quality regressions, cost spikes, and policy bypass incidents

Key takeaways

LLM incidents often look normal while failing in substance, so classify them by type: quality regression, cost spike, policy bypass, MCP/Skill compromise, A2A abuse, and voice/realtime degradation.
Follow a fixed containment order: narrow blast radius, stop side-effect paths (payments, deployments, file writes), revoke tokens and connections, enable fallback, then preserve trace and approval evidence.
Map incidents to unified SEV levels: policy bypass is SEV-1/2 and involves Security and Compliance; quality regressions and cost spikes are SEV-2/3.
Use AI agentic operations (e.g. PagerDuty) to connect observability signals, runbooks, approval boundaries, and escalation, but require human approval for SEV-1.
Postmortems should remove the system conditions that allowed failure rather than assign personal blame.

LLM service incidents often look normal while failing in substance.
Classify incidents by quality, cost, and policy, then standardize immediate actions and recurrence controls.

Incident Types

Type	Detection Signal	Immediate Action
Quality regression	Task Success drops, judge score declines	Roll back prompt/model/tool policy
Cost spike	Unit cost rises, cache hit rate drops	Route to lighter models, limit tool calls, switch to batch
Policy bypass	Violation responses increase, guardrail bypass	Switch to approval mode, hotfix policy pack
MCP/Skill compromise	Shadow server, unapproved scope, abnormal egress	Disable server, revoke token, isolate sandbox
A2A abuse	Webhook SSRF, pre-auth resource exposure	Block peer, stop push notifications
Voice/realtime degradation	First-audio latency rises, interruption loops	Text fallback, session recreation, low-latency model switch
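
The table above can be kept as data so that triage tooling and the runbook stay in sync. The following is a minimal sketch, not a prescribed implementation: the signal and action keys (such as `task_success_drop`) are hypothetical labels invented here for the table's entries, and matching on signal overlap is an assumed heuristic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IncidentType:
    name: str
    signals: tuple[str, ...]
    immediate_actions: tuple[str, ...]


# Rows transcribed from the Incident Types table.
# All string keys are hypothetical labels, not names from any real tool.
INCIDENT_TYPES = (
    IncidentType("quality_regression",
                 ("task_success_drop", "judge_score_decline"),
                 ("rollback_prompt_model_or_tool_policy",)),
    IncidentType("cost_spike",
                 ("unit_cost_rise", "cache_hit_rate_drop"),
                 ("route_to_lighter_model", "limit_tool_calls", "switch_to_batch")),
    IncidentType("policy_bypass",
                 ("violation_responses_increase", "guardrail_bypass"),
                 ("switch_to_approval_mode", "hotfix_policy_pack")),
    IncidentType("mcp_skill_compromise",
                 ("shadow_server", "unapproved_scope", "abnormal_egress"),
                 ("disable_server", "revoke_token", "isolate_sandbox")),
    IncidentType("a2a_abuse",
                 ("webhook_ssrf", "pre_auth_resource_exposure"),
                 ("block_peer", "stop_push_notifications")),
    IncidentType("voice_realtime_degradation",
                 ("first_audio_latency", "interruption_loops"),
                 ("text_fallback", "recreate_session", "switch_low_latency_model")),
)


def classify(observed_signals: set[str]) -> list[IncidentType]:
    """Return every type with at least one matching signal, most matches first."""
    scored = [(len(observed_signals & set(t.signals)), t) for t in INCIDENT_TYPES]
    matches = [(n, t) for n, t in scored if n > 0]
    # sort() is stable, so ties keep the table's row order.
    matches.sort(key=lambda pair: pair[0], reverse=True)
    return [t for _, t in matches]
```

An incident can match more than one row, so the sketch returns a ranked list rather than a single type and leaves the final call to the responder.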

Postmortem Fields

Detection delay cause (MTTD)
Containment delay cause
Manual steps that can be automated
Controls to prevent the same incident class
Related trace_id, approval_id, MCP server ID, skill version
Customer impact scope and notification needs
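
One way to keep these fields from being skipped is to capture them in a typed record and check which ones are still empty before the postmortem is closed. This is a sketch under assumptions: the `Postmortem` class, its field names, and the `unfilled_fields` helper are hypothetical and only mirror the list above.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class Postmortem:
    # Field names are hypothetical; each one mirrors an item in the list above.
    detection_delay_cause: str = ""
    containment_delay_cause: str = ""
    automatable_manual_steps: list[str] = field(default_factory=list)
    recurrence_controls: list[str] = field(default_factory=list)
    trace_ids: list[str] = field(default_factory=list)
    approval_ids: list[str] = field(default_factory=list)
    mcp_server_ids: list[str] = field(default_factory=list)
    skill_versions: list[str] = field(default_factory=list)
    customer_impact_scope: str = ""
    notification_needed: Optional[bool] = None  # None means "not yet decided"


def unfilled_fields(pm: Postmortem) -> list[str]:
    """Names of fields still at their empty default ('', [] or None)."""
    return [f.name for f in fields(pm) if getattr(pm, f.name) in ("", [], None)]
```

Not every incident involves an MCP server or a skill, so a reviewer would likely treat some unfilled ID fields as acceptable; the helper only reports them.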

Link to Unified Incident Classification

Manage LLMOps incidents together with security incidents:

Quality regression: classify as SEV-2/3 and escalate to ML Platform Lead.
Cost spike: classify as SEV-2/3 and respond jointly with Finance and Platform.
Policy bypass: classify as SEV-1/2 and involve Security and Compliance immediately.

If a data security incident occurs, switch to the SEV-1 security response process immediately.
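
The mapping above, including the data security override, can be sketched as a small lookup. The `Routing` type and `route` function are hypothetical names, and representing "SEV-2/3" as a `(most severe, least severe)` pair is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Routing:
    sev_range: tuple[int, int]      # (most severe, least severe), e.g. (2, 3) for SEV-2/3
    responders: tuple[str, ...]


# Transcribed from the classification list above.
ROUTING = {
    "quality_regression": Routing((2, 3), ("ML Platform Lead",)),
    "cost_spike": Routing((2, 3), ("Finance", "Platform")),
    "policy_bypass": Routing((1, 2), ("Security", "Compliance")),
}


def route(incident_type: str, data_security_incident: bool = False) -> Routing:
    """Look up severity range and responders for an incident type."""
    if data_security_incident:
        # A data security incident overrides the type-based mapping.
        # The responder label is a placeholder for the SEV-1 security process.
        return Routing((1, 1), ("SEV-1 security response process",))
    return ROUTING[incident_type]
```

For example, `route("cost_spike")` yields SEV-2/3 with Finance and Platform, while the same call with `data_security_incident=True` yields SEV-1 regardless of type.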

PagerDuty AI Agentic Operations

In 2026, PagerDuty is expanding its AI integration ecosystem across LLMOps, agent governance, and agentic cloud operations. Operationally, this should not be interpreted as "AI automatically fixes everything." The important shift is connecting observability signals, runbooks, approval boundaries, and escalation into one incident loop.

Capability	Description
Agentic detection	AI agents detect abnormal patterns and classify incident type
Automated recovery	Run predefined isolation and recovery runbooks
Escalation AI	Analyze severity and impact scope, then assign the right response team
Postmortem generation	Draft incident timelines and root-cause summaries

Containment Order

1. Narrow impact scope: tenant, channel, model version, prompt version, MCP server, skill version.
2. Stop side-effect paths: payments, refunds, deployments, outbound messages, file writes, shell/code execution.
3. Revoke tokens and connections: MCP/A2A credentials, webhook secrets, long-lived API keys.
4. Enable fallback: previous version, restricted response, text-only, human review, read-only mode.
5. Preserve traces and approval evidence, then create postmortem and regression evals.
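
Because the order matters, tooling can enforce it rather than rely on responders remembering it. The sketch below assumes each step is supplied as a callable; the step names and the `run_containment` function are hypothetical.

```python
from typing import Callable

# The five containment steps, in the fixed order given above.
CONTAINMENT_ORDER = (
    "narrow_scope",
    "stop_side_effects",
    "revoke_credentials",
    "enable_fallback",
    "preserve_evidence",
)


def run_containment(handlers: dict[str, Callable[[], None]]) -> list[str]:
    """Run one handler per step in the fixed order; return the steps that ran.

    Raises KeyError before running anything if a step has no handler,
    so a partially defined playbook never starts.
    """
    missing = [step for step in CONTAINMENT_ORDER if step not in handlers]
    if missing:
        raise KeyError(f"no handler for steps: {missing}")
    completed = []
    for step in CONTAINMENT_ORDER:
        handlers[step]()
        completed.append(step)
    return completed
```

The order comes from `CONTAINMENT_ORDER`, not from the order in which handlers were registered, so a caller cannot accidentally enable fallback before side-effect paths are stopped.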

Credential Rotation and A2A Blocking Example

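# Illustrative playbook for a suspected MCP server or A2A peer compromise.
containment_playbook:
  trigger: mcp_or_a2a_compromise
  steps:
    # Cut off the affected MCP server and invalidate tokens issued for it.
    - disable_mcp_server: github-readonly-prod
    - revoke_token_audience: mcp://github-readonly-prod
    # Block the offending A2A peer and rotate the push-notification webhook secret.
    - block_a2a_peer:
        agent_card_url: https://partner.example.com/.well-known/agent-card.json
        reason: webhook_ssrf_attempt
    - rotate_webhook_secret: a2a_push_notifications
    # Fall back to read-only mode.
    - set_runtime_mode: read_only
    # Keep the evidence needed for the postmortem.
    - preserve_evidence:
        - trace_id
        - approval_id
        - mcp_server_logs
        - webhook_request_headers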

Automated Recovery Standard

Apply automated recovery first to incidents with limited blast radius, such as single-tenant issues or lightweight model fallback. SEV-1 incidents should require human approval.
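
This standard can be expressed as a gate that runs before any automated runbook. The following is a sketch under stated assumptions: the `Incident` fields, the `LIMITED_BLAST_ACTIONS` allowlist, and the choice to defer to a human for anything outside the limited-blast-radius cases are all illustrative and go slightly beyond what the standard itself states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    sev: int                # 1 is the most severe level
    tenants_affected: int
    recovery_action: str


# Hypothetical allowlist of recovery actions considered low risk on their own.
LIMITED_BLAST_ACTIONS = {"lightweight_model_fallback"}


def can_auto_recover(incident: Incident) -> bool:
    """True if the recovery may run without human approval."""
    if incident.sev == 1:
        return False  # SEV-1 always requires human approval
    single_tenant = incident.tenants_affected == 1
    low_risk_action = incident.recovery_action in LIMITED_BLAST_ACTIONS
    # Assumption: anything outside these two cases is deferred to a human.
    return single_tenant or low_risk_action
```

Checking SEV-1 first means no combination of tenant count and action can bypass the approval requirement.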

Principle

Postmortems should focus on removing the system conditions that allowed failure, not assigning personal blame.

Baseline and Sources

Item	Baseline Date	Recheck By	Primary Source
OWASP MCP incident risks	2026-05-17	2026-06-16	https://owasp.org/www-project-mcp-top-10/
OWASP Agentic Skills incident risks	2026-05-17	2026-06-16	https://owasp.org/www-project-agentic-skills-top-10/
PagerDuty AI operations ecosystem	2026-05-17	2026-06-16	https://www.pagerduty.com/newsroom/pagerduty-expands-ai-ecosystem-to-supercharge-ai-agents/
