Runbooks and Operations

Design runbooks that separate diagnosis, recommendation, action, approval, and rollback.

Key takeaways

Runbooks must be structured enough for an agent to diagnose and recommend safely while keeping risky actions behind approval boundaries.
Three modes separate agent authority: Diagnose (read-only) and Recommend are default-on, while Act runs only when approved.
Every action needs a numeric, observable condition, an explicit mode, a command, a verification check, and a rollback/recovery path.
Map MCP tools to modes and approval: read-only tools auto-run, while destructive tools like `rollback_release` or `rotate_secret` require human or change approval.
MCP annotations are hints, not enforcement; enforce permissions with scopes, server-side auth, approval policy, and audit logs.

Runbooks make operational response repeatable. In an agentic setting, they must be structured enough for an agent to diagnose and recommend safely, while keeping risky actions behind approval boundaries.

Three Modes

| Mode | Agent authority | Example | Default |
|---|---|---|---|
| Diagnose | read-only | inspect logs, metrics, status | yes |
| Recommend | propose action | suggest rollback or scale-up | yes |
| Act | execute limited action | purge cache, toggle safe flag | only when approved |
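The mode table can be read as a small gate in front of the agent. The sketch below is one possible encoding, not a prescribed implementation; the names `Mode`, `DEFAULT_ON`, and `is_permitted` are illustrative assumptions.

```python
from enum import Enum


class Mode(Enum):
    DIAGNOSE = "diagnose"
    RECOMMEND = "recommend"
    ACT = "act"


# Modes that are on by default, per the table above.
DEFAULT_ON = frozenset({Mode.DIAGNOSE, Mode.RECOMMEND})


def is_permitted(mode: Mode, approved: bool = False) -> bool:
    """Return True if the agent may operate in `mode`.

    Diagnose and Recommend are always permitted; Act needs explicit approval.
    """
    if mode in DEFAULT_ON:
        return True
    return mode is Mode.ACT and approved
```

Defaulting `approved` to `False` means that a caller who forgets to pass an approval gets the safe answer.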

Runbook Flow

Action Format

### Action: restart API worker

- **Condition**: p95 latency > 3s for 5 minutes and error rate < 1%
- **Mode**: Act allowed
- **Command**: `kubectl rollout restart deployment/api-worker`
- **Verify**: p95 latency < 1s within 5 minutes
- **Rollback/recovery**: `kubectl rollout undo deployment/api-worker`
- **Escalate**: call L1 if not recovered within 10 minutes
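To show why numeric conditions matter, the example action above can be evaluated mechanically. This is a minimal sketch that assumes one metrics sample per minute. The `Sample` shape and the function names are hypothetical, and "within 5 minutes" is read here as "any of the first five post-action samples".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One per-minute metrics sample (hypothetical shape)."""
    p95_latency_s: float
    error_rate: float  # fraction, so 0.01 means 1%


def condition_met(window: list[Sample]) -> bool:
    """Condition: p95 latency > 3s for 5 minutes and error rate < 1%."""
    if len(window) < 5:
        return False  # not enough data to claim "for 5 minutes"
    return all(s.p95_latency_s > 3.0 and s.error_rate < 0.01 for s in window[-5:])


def verified(post_action: list[Sample]) -> bool:
    """Verify: p95 latency < 1s within 5 minutes of the action."""
    return any(s.p95_latency_s < 1.0 for s in post_action[:5])


def next_step(post_action: list[Sample]) -> str:
    """Decide what to do after the command has run."""
    if verified(post_action):
        return "done"
    if len(post_action) >= 10:  # not recovered within 10 minutes
        return "escalate"
    return "wait"
```

A condition such as "latency is high" could not be written as `condition_met` without guessing a threshold, which is the practical argument for numeric triggers.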

MCP Tool Boundary

| Tool | Mode | Annotation hint | Approval |
|---|---|---|---|
| `get_metrics` | Diagnose | `readOnlyHint: true` | auto |
| `list_deployments` | Diagnose | `readOnlyHint: true` | auto |
| `purge_cache` | Act | `idempotentHint: true` | policy-dependent |
| `rollback_release` | Act | `destructiveHint: true` | human approval |
| `rotate_secret` | Act | `destructiveHint: true` | change approval |

MCP annotations are hints, not enforcement. Enforce permissions with scopes, server-side auth, approval policy, and audit logs.
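One way to treat annotations as hints only is to make the authorization decision from a server-side policy table and never from the annotation values. The sketch below assumes a hypothetical scope format (`tool:<name>`) and hypothetical approval labels (`policy`, `human`, `change`). It is not part of MCP itself.

```python
from dataclasses import dataclass, field

# Server-side policy keyed by tool name; mirrors the table above.
POLICY = {
    "get_metrics": "auto",
    "list_deployments": "auto",
    "purge_cache": "policy",
    "rollback_release": "human",
    "rotate_secret": "change",
}

AUDIT_LOG: list[dict] = []


@dataclass
class Request:
    tool: str
    scopes: set[str]
    approvals: set[str] = field(default_factory=set)
    # Hints supplied with the tool; deliberately ignored by authorize().
    annotations: dict = field(default_factory=dict)


def authorize(req: Request) -> bool:
    """Allow a call only if policy, scope, and approval all agree."""
    required = POLICY.get(req.tool)  # unknown tools have no policy and are denied
    allowed = (
        required is not None
        and f"tool:{req.tool}" in req.scopes
        and (required == "auto" or required in req.approvals)
    )
    # Every decision is recorded, whether allowed or denied.
    AUDIT_LOG.append({"tool": req.tool, "required": required, "allowed": allowed})
    return allowed
```

With this shape, a `rollback_release` request that claims `readOnlyHint: true` is still denied until a human approval is attached, because the annotation never enters the decision.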

Template

---
title: "Runbook: {service} - {incident}"
severity: "sev2"
services: ["{service}"]
owner: "{team}"
last_tested: "YYYY-MM-DD"
allowed_modes: ["diagnose", "recommend"]
---

## Trigger
- Alert:
- Condition:

## Diagnosis
| Step | Signal | Command/tool | Healthy threshold |
|---|---|---|---|

## Actions
### Action 1
- Condition:
- Mode:
- Command:
- Verify:
- Rollback/recovery:
- Approval required:

## Escalation
| Condition | Owner | Channel | SLA |
|---|---|---|---|
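The front matter in the template can be checked before a runbook is handed to an agent. The following is a rough sketch using only the standard library. It assumes every front matter value is JSON-compatible, as in the template, and that a 90-day freshness window applies to `last_tested`. That window is an illustrative assumption, not something the template specifies.

```python
import json
from datetime import date, timedelta

KNOWN_MODES = {"diagnose", "recommend", "act"}
REQUIRED_KEYS = ("title", "severity", "services", "owner", "last_tested", "allowed_modes")


def parse_front_matter(text: str) -> dict:
    """Parse `key: value` lines between the two `---` markers."""
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("front matter must start with ---")
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return meta
        # Split on the first colon only, so colons inside values survive.
        key, _, value = line.partition(":")
        meta[key.strip()] = json.loads(value.strip())
    raise ValueError("front matter is not closed")


def front_matter_problems(meta: dict, today: date, max_age_days: int = 90) -> list[str]:
    """Return a list of human-readable problems; empty means it passed."""
    problems = [f"missing {key}" for key in REQUIRED_KEYS if key not in meta]
    unknown = set(meta.get("allowed_modes", [])) - KNOWN_MODES
    if unknown:
        problems.append(f"unknown modes: {sorted(unknown)}")
    try:
        tested = date.fromisoformat(meta.get("last_tested", ""))
    except (TypeError, ValueError):
        problems.append("last_tested is not a valid date")
    else:
        if today - tested > timedelta(days=max_age_days):
            problems.append("last_tested is stale")
    return problems
```

An unfilled template fails this check, since the placeholder `YYYY-MM-DD` is not a valid date.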

Checklist

| Item | Check |
|---|---|
| trigger is numeric and observable | [ ] |
| diagnosis is read-only | [ ] |
| actions have modes | [ ] |
| destructive actions require approval | [ ] |
| each action has verification and recovery | [ ] |
| escalation and SLA are explicit | [ ] |
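Several checklist items can be linted automatically. The sketch below covers only the per-action items and assumes each action has already been parsed into a dictionary with the hypothetical keys shown. The "numeric" check is a crude heuristic that only looks for a digit in the condition.

```python
import re

REQUIRED_FIELDS = ("condition", "mode", "command", "verify", "rollback")


def lint_action(action: dict) -> list[str]:
    """Return checklist failures for one parsed action; empty means it passed."""
    failures = []
    for name in REQUIRED_FIELDS:
        if not str(action.get(name, "")).strip():
            failures.append(f"missing {name}")
    condition = str(action.get("condition", ""))
    # Heuristic: a numeric, observable condition should mention a number.
    if condition.strip() and not re.search(r"\d", condition):
        failures.append("condition is not numeric")
    if action.get("destructive") and not action.get("approval_required"):
        failures.append("destructive action lacks approval")
    return failures
```

A check like this could run in CI on every runbook change. The remaining items, such as whether diagnosis steps are truly read-only, still need human review.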
