Monitoring and Incident
Operate logs, metrics, traces, alerts, runbooks, and post-incident learning.
Key takeaways
- Observability is organized into six layers: logs (what happened), metrics (how much, how often, and how fast), traces (where time or failure occurred), alerts (what needs attention now), dashboards (current health), and runbooks (what to do next).
- The incident flow runs through seven stages: detect, triage, contain, resolve, communicate, postmortem, and prevent.
- A runbook captures symptom, scope, checks, mitigation (rollback, flag, rate limit, workaround), escalation, and follow-up.
- Postmortems prioritize system improvement over individual blame and record the timeline, impact, root causes, and concrete follow-up tasks with owners.
- Both monitoring and incident response need ownership and rehearsal before production pressure arrives.
Monitoring tells the team whether the system is healthy. Incident practice tells the team what to do when it is not. Both need ownership and rehearsal before production pressure arrives.
Observability Layers
| Layer | Question answered |
|---|---|
| Logs | What happened? |
| Metrics | How much, how often, and how fast? |
| Traces | Where did time or failure occur? |
| Alerts | What needs attention now? |
| Dashboards | What is the current health picture? |
| Runbooks | What should the responder do next? |
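To make the first three layers concrete, here is a minimal sketch in Python using only the standard library. The in-memory lists and counters stand in for a real log pipeline, metrics backend, and tracer; the names `log_event`, `span`, `handle_request`, and `error_rate` are hypothetical and chosen only for illustration.

```python
import json
import time
from collections import Counter
from contextlib import contextmanager

# Hypothetical in-memory stand-ins for real backends.
log_lines = []        # logs: what happened
counters = Counter()  # metrics: how much and how often
spans = []            # traces: where time or failure occurred


def log_event(event, **fields):
    """Append one structured (JSON) log line."""
    log_lines.append(json.dumps({"event": event, **fields}, sort_keys=True))


@contextmanager
def span(name):
    """Record how long a named step took and whether it failed."""
    start = time.perf_counter()
    ok = True
    try:
        yield
    except Exception:
        ok = False
        raise
    finally:
        spans.append(
            {"name": name, "ok": ok, "seconds": time.perf_counter() - start}
        )


def handle_request(route, fail=False):
    """Toy request handler that emits a metric, a span, and a log line."""
    counters[f"requests:{route}"] += 1
    try:
        with span(f"handle {route}"):
            if fail:
                raise RuntimeError("dependency timeout")
        log_event("request_ok", route=route)
    except RuntimeError as exc:
        counters[f"errors:{route}"] += 1
        log_event("request_failed", route=route, error=str(exc))


def error_rate(route):
    """Fraction of requests on a route that failed."""
    total = counters[f"requests:{route}"]
    return counters[f"errors:{route}"] / total if total else 0.0
```

An alert would typically be a threshold on a value such as `error_rate(route)`, and a dashboard would plot the same counters over time. Both are left out here to keep the sketch short.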
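Incident Flow
An incident moves through seven stages: detect, triage, contain, resolve, communicate, postmortem, and prevent.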
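One way to make the stage order explicit is a small state machine. The sketch below is an assumption-laden illustration, not a prescribed tool: the `Stage` and `Incident` names are hypothetical, and the strictly linear order is a simplification, since in practice stages such as communication often overlap with others.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Incident stages in the order the flow lists them."""
    DETECT = 1
    TRIAGE = 2
    CONTAIN = 3
    RESOLVE = 4
    COMMUNICATE = 5
    POSTMORTEM = 6
    PREVENT = 7


class Incident:
    """Tracks which stage an incident is in and the stages it has passed."""

    def __init__(self, title):
        self.title = title
        self.stage = Stage.DETECT
        self.history = [Stage.DETECT]

    def advance(self):
        """Move to the next stage; refuse to go past the final one."""
        if self.stage == Stage.PREVENT:
            raise ValueError("incident is already at the final stage")
        self.stage = Stage(self.stage + 1)
        self.history.append(self.stage)
        return self.stage
```

Keeping a `history` list makes it easy to check afterwards that no stage was skipped, which is the kind of evidence a postmortem timeline relies on.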
Runbook Template
| Section | Contents |
|---|---|
| Symptom | What alert or customer issue appears |
| Scope | Affected app, route, service, or customer segment |
| Checks | Logs, dashboard, recent deployments, dependencies |
| Mitigation | Rollback, feature flag, rate limit, manual workaround |
| Escalation | Owner, platform contact, business stakeholder |
| Follow-up | Tests, alerts, documentation, architecture fix |
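The template above maps naturally onto a small data structure, which lets a team check that a runbook is complete before it is needed. The following is a minimal sketch; the `Runbook` class and `missing_sections` method are hypothetical names, and the sample values in the usage note are invented for illustration.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Runbook:
    """One field per section of the runbook template."""
    symptom: str = ""
    scope: str = ""
    checks: list = field(default_factory=list)
    mitigation: list = field(default_factory=list)
    escalation: list = field(default_factory=list)
    follow_up: list = field(default_factory=list)

    def missing_sections(self):
        """Return the names of sections that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]
```

For example, `Runbook(symptom="5xx alert on checkout", scope="checkout service", checks=["dashboard"], mitigation=["rollback"]).missing_sections()` reports that the escalation and follow-up sections have not been filled in yet.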
Postmortem Rules
- Focus on system improvement, not individual blame.
- Record timeline, impact, root causes, and contributing factors.
- Create concrete follow-up tasks with owners.
- Update runbooks and alerts when detection was weak.
- Review whether deployment or review gates should change.
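Some of these rules can be checked mechanically, in particular that the record is complete and that every follow-up task has an owner. The sketch below assumes a simple in-code representation; `FollowUp`, `Postmortem`, and `problems` are hypothetical names, not part of any particular tool.

```python
from dataclasses import dataclass, field


@dataclass
class FollowUp:
    """A concrete follow-up task and the person or team that owns it."""
    task: str
    owner: str = ""


@dataclass
class Postmortem:
    """Fields mirror the items a postmortem is expected to record."""
    timeline: list = field(default_factory=list)
    impact: str = ""
    root_causes: list = field(default_factory=list)
    contributing_factors: list = field(default_factory=list)
    follow_ups: list = field(default_factory=list)

    def problems(self):
        """List rule violations; an empty list means the checks passed."""
        issues = []
        if not self.timeline:
            issues.append("timeline is empty")
        if not self.impact:
            issues.append("impact is missing")
        if not self.root_causes:
            issues.append("no root causes recorded")
        if not self.follow_ups:
            issues.append("no follow-up tasks")
        for item in self.follow_ups:
            if not item.owner.strip():
                issues.append(f"follow-up has no owner: {item.task}")
        return issues
```

A check like this cannot judge whether a postmortem is blameless or whether the root causes are right. It only catches the omissions that are easy to miss under time pressure.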