Runbooks and Operations
Design runbooks that separate diagnosis, recommendation, action, approval, and rollback.
Key takeaways
- Runbooks must be structured enough for an agent to diagnose and recommend safely while keeping risky actions behind approval boundaries.
- Three modes separate agent authority: Diagnose (read-only) and Recommend are default-on, while Act runs only when approved.
- Every action needs a numeric, observable condition, an explicit mode, a command, a verification check, and a rollback/recovery path.
- Map MCP tools to modes and approval: read-only tools auto-run, while destructive tools like `rollback_release` or `rotate_secret` require human or change approval.
- MCP annotations are hints, not enforcement; enforce permissions with scopes, server-side auth, approval policy, and audit logs.
Runbooks make operational response repeatable. In an agentic setting, they must be structured enough for an agent to diagnose and recommend safely, while keeping risky actions behind approval boundaries.
Three Modes
| Mode | Agent authority | Example | Default |
|---|---|---|---|
| Diagnose | read-only | inspect logs, metrics, status | yes |
| Recommend | propose action | suggest rollback or scale-up | yes |
| Act | execute limited action | purge cache, toggle safe flag | only when approved |
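The mode table above can be read as a small authorization rule. The following is a minimal sketch, assuming a hypothetical `Mode` enum and `is_permitted` helper (neither name comes from an existing library), of how Diagnose and Recommend stay on by default while Act depends on an explicit approval flag:

```python
from enum import Enum


class Mode(Enum):
    DIAGNOSE = "diagnose"    # read-only inspection
    RECOMMEND = "recommend"  # propose an action, never execute it
    ACT = "act"              # execute a limited action


# Modes that are on by default, per the table above.
DEFAULT_ON = frozenset({Mode.DIAGNOSE, Mode.RECOMMEND})


def is_permitted(mode: Mode, approved: bool = False) -> bool:
    """Return True if the agent may operate in `mode`."""
    if mode in DEFAULT_ON:
        return True
    # Act (and any mode added later) runs only with explicit approval.
    return approved
```

Treating anything outside the default-on set as approval-gated means a newly added mode fails closed rather than open.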
Runbook Flow
Action Format
### Action: restart API worker
- **Condition**: p95 latency > 3s for 5 minutes and error rate < 1%
- **Mode**: Act allowed
- **Command**: `kubectl rollout restart deployment/api-worker`
- **Verify**: p95 latency < 1s within 5 minutes
- **Rollback/recovery**: `kubectl rollout undo deployment/api-worker`
- **Escalate**: call L1 if not recovered within 10 minutes
MCP Tool Boundary
| Tool | Mode | Annotation hint | Approval |
|---|---|---|---|
| `get_metrics` | Diagnose | readOnlyHint: true | auto |
| `list_deployments` | Diagnose | readOnlyHint: true | auto |
| `purge_cache` | Act | idempotentHint: true | policy-dependent |
| `rollback_release` | Act | destructiveHint: true | human approval |
| `rotate_secret` | Act | destructiveHint: true | change approval |
MCP annotations are hints, not enforcement. Enforce permissions with scopes, server-side auth, approval policy, and audit logs.
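One way to make "hints, not enforcement" concrete is to keep the approval decision in a server-side policy table and never consult client-supplied annotations. The sketch below is illustrative only: `APPROVAL_POLICY`, `Gate`, and the idea of modelling scopes as a set of tool names are assumptions, not part of MCP or any specific server.

```python
from dataclasses import dataclass, field

# Hypothetical server-side policy mirroring the tool table above.
# "policy" stands in for the table's "policy-dependent" entry.
APPROVAL_POLICY = {
    "get_metrics": "auto",
    "list_deployments": "auto",
    "purge_cache": "policy",
    "rollback_release": "human",
    "rotate_secret": "change",
}


@dataclass
class Gate:
    audit_log: list = field(default_factory=list)

    def authorize(self, tool: str, scopes: set, approvals: set) -> bool:
        """Decide from server-side policy only; annotation hints are ignored."""
        required = APPROVAL_POLICY.get(tool)
        if required is None:
            allowed = False  # unknown tools are denied by default
        elif tool not in scopes:
            allowed = False  # caller's token is not scoped to this tool
        elif required == "auto":
            allowed = True
        else:
            allowed = required in approvals  # e.g. "human" or "change"
        # Every decision is recorded, whether allowed or denied.
        self.audit_log.append({"tool": tool, "required": required, "allowed": allowed})
        return allowed
```

For example, `Gate().authorize("rollback_release", {"rollback_release"}, set())` is denied until `"human"` appears in the approvals set, regardless of what the tool's annotations claim.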
Template
---
title: "Runbook: {service} - {incident}"
severity: "sev2"
services: ["{service}"]
owner: "{team}"
last_tested: "YYYY-MM-DD"
allowed_modes: ["diagnose", "recommend"]
---
## Trigger
- Alert:
- Condition:
## Diagnosis
| Step | Signal | Command/tool | Healthy threshold |
|---|---|---|---|
## Actions
### Action 1
- Condition:
- Mode:
- Command:
- Verify:
- Rollback/recovery:
- Approval required:
## Escalation
| Condition | Owner | Channel | SLA |
|---|---|---|---|
Checklist
| Item | Check |
|---|---|
| trigger is numeric and observable | [ ] |
| diagnosis is read-only | [ ] |
| actions have modes | [ ] |
| destructive actions require approval | [ ] |
| each action has verification and recovery | [ ] |
| escalation and SLA are explicit | [ ] |
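Parts of this checklist can be automated. The following is a rough sketch, assuming each action has already been parsed into a dictionary; the field names, the `destructive` and `approval_required` keys, and the `lint_action` helper are all hypothetical, and the numeric check is only a crude heuristic.

```python
import re

REQUIRED_FIELDS = ("condition", "mode", "command", "verify", "rollback")
# Crude heuristic: a comparison operator followed by a digit, e.g. "> 3s".
NUMERIC = re.compile(r"[<>]=?\s*\d")


def lint_action(action: dict) -> list:
    """Return a list of checklist violations for one action (empty if clean)."""
    problems = [f"missing {name}" for name in REQUIRED_FIELDS if not action.get(name)]
    condition = action.get("condition", "")
    if condition and not NUMERIC.search(condition):
        problems.append("condition is not numeric")
    if action.get("destructive") and not action.get("approval_required"):
        problems.append("destructive action lacks approval")
    return problems


# The "restart API worker" action from the Action Format section.
restart = {
    "condition": "p95 latency > 3s for 5 minutes and error rate < 1%",
    "mode": "act",
    "command": "kubectl rollout restart deployment/api-worker",
    "verify": "p95 latency < 1s within 5 minutes",
    "rollback": "kubectl rollout undo deployment/api-worker",
}
print(lint_action(restart))  # prints []
```

A lint like this only covers the mechanical items; whether a diagnosis step is truly read-only or an SLA is realistic still needs human review.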