LLMOps and AgentOps in Production
A production operating system for turning experimental AI features into reliable services
Recently Updated Chapters
Connect offline benchmarks with online operating signals
Repeat prompt, model, and workflow experiments quickly and safely
Operate a unified standard for quality regressions, cost spikes, and policy bypass incidents
Collect model, tool, and policy execution as traceable signals and operate them through SLOs
Design real-time policy enforcement, blocking, fallback, and human approval loops
Making an AI feature work is not the same as making it operable.
In production, model quality must be managed alongside release control, SLOs, cost stability, and incident response.
This handbook treats LLMOps and AgentOps as one operating system rather than separate disciplines.
Core Goal
Build an operating foundation where quality, cost, and security remain stable even as teams repeatedly change models, prompts, tools, and agent workflows.
English Edition
This English edition was selected because AgentOps, MCP/A2A, trace-first evaluation, and AI cost governance are high-interest topics for international platform, SRE, and AI infrastructure teams.
May 2026 Update
- A2A v1.0.0 (latest release) and MCP 2025-11-25 security requirements: OAuth 2.1, audience binding, and token passthrough prohibition (Ch1)
- Trace-first evaluation, agent workflow trace grading, and production trace to dataset/eval loops (Ch3, Ch5)
- Human review, resumable approval state, hosted/private MCP trust boundaries, and Agentic Skills supply-chain security (Ch4)
- OpenTelemetry GenAI Development status and OWASP AOS work-in-progress status clarified (Ch5)
- GPT-5.5/GPT-5.4/GPT-5.4 mini, Claude 4.7/4.6/4.5, and DeepSeek V4 pricing baseline refreshed (Ch6)
- Incident handling expanded for MCP/skill compromise, A2A webhook abuse, and automated recovery approval boundaries (Ch8)
Core Operating Formulas
Operating Maturity Model
| Level | State | Characteristics | Promotion Criteria |
|---|---|---|---|
| L1 Prototype | Demo-driven | Manual prompts and ad hoc operations | Standardized logs |
| L2 Controlled | Basic operations | Versioning and release control introduced | Offline evaluation system |
| L3 Reliable | Reliable operations | SLOs, guardrails, and fallback automation | Joint cost/quality optimization |
| L4 Adaptive | Supervised adaptation | Drift detection, policy tuning, automated recovery | Change evidence and approval logs retained |
Go-Live Gates
| Gate | Example Pass Criteria |
|---|---|
| Quality gate | Core task success rate >= 95% |
| Safety gate | Policy violation rate <= 0.2% |
| Performance gate | p95 latency within budget |
| Cost gate | Unit cost no more than 5% over budget |
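The gates above can be expressed as a single automated release check. The following is a minimal sketch, not the handbook's implementation: the names `ReleaseMetrics`, `Budgets`, `evaluate_gates`, and `ready_for_go_live` are hypothetical, the thresholds are the example values from the table, and the cost gate is assumed to mean "at most 5% over the unit-cost budget".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReleaseMetrics:
    """Measured values for a release candidate (hypothetical field names)."""
    task_success_rate: float      # fraction of core tasks that succeeded, 0..1
    policy_violation_rate: float  # fraction of requests that violated policy, 0..1
    p95_latency_ms: float         # observed p95 latency in milliseconds
    unit_cost_usd: float          # observed cost per request (or per task) in USD


@dataclass(frozen=True)
class Budgets:
    """Team-defined budgets; the table does not specify concrete values."""
    p95_latency_ms: float
    unit_cost_usd: float


def evaluate_gates(metrics: ReleaseMetrics, budgets: Budgets) -> dict[str, bool]:
    """Return pass/fail per gate, using the example thresholds from the table."""
    return {
        "quality": metrics.task_success_rate >= 0.95,
        "safety": metrics.policy_violation_rate <= 0.002,
        "performance": metrics.p95_latency_ms <= budgets.p95_latency_ms,
        # Assumption: "budget +5%" is read as a 5% tolerance above the budget.
        "cost": metrics.unit_cost_usd <= budgets.unit_cost_usd * 1.05,
    }


def ready_for_go_live(metrics: ReleaseMetrics, budgets: Budgets) -> bool:
    """A release goes live only when every gate passes."""
    return all(evaluate_gates(metrics, budgets).values())


if __name__ == "__main__":
    # Illustrative numbers only; they are not taken from the handbook.
    budgets = Budgets(p95_latency_ms=2000.0, unit_cost_usd=0.01)
    candidate = ReleaseMetrics(
        task_success_rate=0.96,
        policy_violation_rate=0.001,
        p95_latency_ms=1800.0,
        unit_cost_usd=0.0104,
    )
    for gate, passed in evaluate_gates(candidate, budgets).items():
        print(f"{gate}: {'pass' if passed else 'fail'}")
    print("go live:", ready_for_go_live(candidate, budgets))
```

Returning per-gate results rather than a single boolean keeps the failing gate visible in release logs, which fits the "change evidence and approval logs" criterion in the maturity model.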
Operating Loop
Contents
Ch1. System Architecture
Separate the control plane, data plane, and agent runtime boundaries.
Ch2. Versioning and Release
Ship prompt, model, tool, and policy changes with release discipline.
Ch3. Evaluation Framework
Connect offline evaluation, online signals, and trace-derived regression tests.
Ch4. Online Guardrails
Enforce policy, blocking, fallback, and human approval loops.
Ch5. Observability and SLOs
Observe traces, tokens, latency, quality, policies, and approvals together.
Ch6. Cost and Latency
Manage unit-cost budgets and p95 latency budgets at the same time.