Ch4. Online Guardrails
Design real-time policy enforcement, blocking, fallback, and human approval loops
Key takeaways
- Treat online guardrails as control systems that limit business risk, layered as input, intent, tool, and output guards plus escalation.
- Design quality, cost, and safety fallbacks, and never fail open: degrade to restricted mode and require idempotency keys for high-risk tools.
- Use resumable human approval: pause the run, persist state and arguments, then resume the same run state after approval to keep one audit trail.
- Pick guard implementations by latency budget, from regex (< 1ms) to LLM-based intent guards (200-500ms), and optimize with parallel execution, caching, early exit, and streaming filters.
- Extend guardrails to the MCP and Skill supply chain using OWASP MCP Top 10 and Agentic Skills Top 10 controls like allowlists, token audience validation, and code signing.
Online guardrails are not a supplement to model quality. They are control systems that limit business risk.
Guardrail Layers
| Layer | Blocks | Action |
|---|---|---|
| Input guard | Banned terms or attack patterns | Reject or ask for a new input |
| Intent guard | Disallowed intent | Restricted response |
| Tool guard | High-risk action | Approval wait |
| Output guard | PII or secret exposure | Automatic masking |
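The layers in the table can be sketched as small, independent functions that each return a decision. This is a minimal illustration, not a reference implementation: the attack pattern, the high-risk tool names, and the email regex are hypothetical placeholders, and the intent guard is omitted because it would normally call a model.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical patterns and rules, chosen only for illustration.
ATTACK_PATTERNS = [re.compile(r"ignore (all )?previous instructions", re.I)]
HIGH_RISK_TOOLS = {"issue_refund", "run_code"}
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


@dataclass
class Decision:
    action: str                  # "allow", "reject", or "approval_wait"
    layer: Optional[str] = None  # which guard produced a non-allow decision


def input_guard(text: str) -> Decision:
    """Reject input that matches a known attack pattern."""
    if any(p.search(text) for p in ATTACK_PATTERNS):
        return Decision("reject", "input")
    return Decision("allow")


def tool_guard(tool_name: str) -> Decision:
    """Hold high-risk tool calls for approval instead of executing them."""
    if tool_name in HIGH_RISK_TOOLS:
        return Decision("approval_wait", "tool")
    return Decision("allow")


def output_guard(text: str) -> str:
    """Mask email addresses before the response leaves the system."""
    return EMAIL.sub("[REDACTED_EMAIL]", text)
```

Keeping each guard a pure function makes it straightforward to test the layers separately and to record which layer produced a block.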
Fallback Design
- If a high-capability model fails, switch to a conservative prompt and a stable model.
- If quality regression is detected, roll back to the previous version.
- In over-budget windows, compress prompts and route to lighter models.
- Put cost ceilings on high-cost tenants and degrade them to restricted mode when needed.
- If policy judgment is uncertain, block automatic execution and move to human approval.
- Require idempotency keys for high-risk tools.
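One way to read the fallback rules above is as an ordered decision function. The sketch below assumes a hypothetical `Signals` record and route names; the priority order (uncertain policy first, then cost ceiling, regression, model failure, budget) is an assumption for illustration, not something the rules prescribe.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Hypothetical per-request signals; field names are illustrative."""
    primary_model_failed: bool = False
    quality_regression: bool = False
    over_budget: bool = False
    tenant_over_cost_ceiling: bool = False
    policy_confidence: float = 1.0  # 0.0 (unsure) to 1.0 (certain)


def choose_fallback(s: Signals, min_confidence: float = 0.8) -> str:
    """Pick a route; earlier checks win when several signals fire."""
    if s.policy_confidence < min_confidence:
        return "human_approval"            # uncertain policy judgment
    if s.tenant_over_cost_ceiling:
        return "restricted_mode"           # cost ceiling reached
    if s.quality_regression:
        return "rollback_previous_version"
    if s.primary_model_failed:
        return "stable_model_conservative_prompt"
    if s.over_budget:
        return "light_model_compressed_prompt"
    return "primary_model"
```

For example, `choose_fallback(Signals(over_budget=True))` routes to the lighter model, while any request with low policy confidence goes to human approval regardless of the other signals.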
Gate Failure Handling
- Do not fail open. Policy failure must not permit unlimited execution.
- Degrade to restricted mode when guardrails fail.
- Require idempotency keys for high-risk actions.
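The two rules above, failing closed and requiring idempotency keys, can be illustrated together. In this sketch the names `guarded_mode`, `idempotency_key`, and `ToolExecutor` are hypothetical, and the in-memory result cache stands in for whatever durable store a real system would use.

```python
import hashlib
import json
from typing import Any, Callable, Dict


def guarded_mode(policy_check: Callable[[str], bool], request: str) -> str:
    """Return "normal" only when the policy check runs and passes.

    If the check itself raises (timeout, outage), degrade to "restricted"
    rather than allowing the request through.
    """
    try:
        return "normal" if policy_check(request) else "blocked"
    except Exception:
        return "restricted"


def idempotency_key(trace_id: str, tool_name: str, arguments: Dict[str, Any]) -> str:
    """Derive a stable key from the trace, tool, and canonicalized arguments."""
    payload = json.dumps(
        {"trace_id": trace_id, "tool": tool_name, "args": arguments},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ToolExecutor:
    """Runs a tool at most once per idempotency key (in-memory for the sketch)."""

    def __init__(self) -> None:
        self._results: Dict[str, Any] = {}

    def execute(self, key: str, tool: Callable[..., Any], **arguments: Any) -> Any:
        if key in self._results:        # retry or replay: return stored result
            return self._results[key]
        result = tool(**arguments)
        self._results[key] = result
        return result
```

Because the key is derived from the trace and arguments, a retry after a timeout or a resumed run returns the stored result instead of repeating the side effect.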
Human Review and Resumable Approval
Even when the model decides that a high-risk tool call is needed, do not execute it immediately. Pause the run, store the reason and arguments, then resume the same run state after approval or rejection.
| Stage | Evidence to Store |
|---|---|
| Approval request | tool name, arguments, risk score, requester, trace_id |
| Pending review | serialized state, approval_id, SLA, reviewer group |
| Approval/rejection | reviewer, decision, edited arguments, reason |
| Resumption | resumed trace_id, final tool result, downstream action |
If review may take more than a few minutes, persist the state and resume from it rather than restarting from a new user turn. Resuming preserves the same audit trail and idempotency key.
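The pause, persist, and resume flow can be sketched with a small store that keeps the evidence fields from the table. This is a minimal sketch under stated assumptions: `ApprovalStore` and its method names are hypothetical, a dict of JSON strings stands in for durable storage, and the run state is assumed to be JSON-serializable.

```python
import json
import uuid
from typing import Any, Callable, Dict, Optional


class ApprovalStore:
    """In-memory stand-in for a durable store (a database in practice)."""

    def __init__(self) -> None:
        self._rows: Dict[str, str] = {}  # approval_id -> JSON record

    def request(self, trace_id: str, tool_name: str, arguments: Dict[str, Any],
                risk_score: float, requester: str, state: Dict[str, Any]) -> str:
        """Pause point: persist everything needed to resume later."""
        approval_id = str(uuid.uuid4())
        record = {
            "approval_id": approval_id, "trace_id": trace_id,
            "tool_name": tool_name, "arguments": arguments,
            "risk_score": risk_score, "requester": requester,
            "serialized_state": state, "status": "pending",
        }
        self._rows[approval_id] = json.dumps(record)
        return approval_id

    def decide(self, approval_id: str, reviewer: str, decision: str, reason: str,
               edited_arguments: Optional[Dict[str, Any]] = None) -> None:
        """Record the reviewer's decision ("approved" or "rejected")."""
        record = json.loads(self._rows[approval_id])
        if record["status"] != "pending":
            raise ValueError("approval already decided")
        record.update(status=decision, reviewer=reviewer, reason=reason)
        if edited_arguments is not None:
            record["arguments"] = edited_arguments
        self._rows[approval_id] = json.dumps(record)

    def resume(self, approval_id: str, tool: Callable[..., Any]) -> Dict[str, Any]:
        """Continue the same trace; execute only if the request was approved."""
        record = json.loads(self._rows[approval_id])
        if record["status"] != "approved":
            return {"trace_id": record["trace_id"], "executed": False,
                    "status": record["status"]}
        result = tool(**record["arguments"])
        return {"trace_id": record["trace_id"], "executed": True, "result": result}
```

Note that `resume` reuses the original `trace_id` and runs the tool with the reviewer's edited arguments when present, so the approval and the execution stay in one audit trail.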
Approval Queue SLA Example
approval_queue:
  refund_over_limit:
    reviewer_group: finance-ops
    sla_minutes: 15
    auto_expire_minutes: 60
    default_on_expiry: reject
    evidence:
      - trace_id
      - approval_id
      - tool_arguments
      - risk_score
  code_execution:
    reviewer_group: platform-security
    sla_minutes: 5
    auto_expire_minutes: 20
    default_on_expiry: reject

Implementation Complexity
Options by Layer
| Guardrail | Implementation | Complexity | Latency Impact |
|---|---|---|---|
| Input Guard | Regex-based | Low | < 1ms |
| Input Guard | ML classifier | Medium | 10-30ms |
| Intent Guard | LLM-based | High | 200-500ms |
| Tool Guard | Static rules | Low | < 1ms |
| Tool Guard | Dynamic risk score | Medium | 5-10ms |
| Output Guard | PII regex | Low | < 5ms |
| Output Guard | NER model | Medium | 20-50ms |
Latency Optimization
- Parallel execution: run independent guards concurrently.
- Caching: reuse repeated pattern results.
- Early exit: skip downstream guards when an input guard blocks the request.
- Streaming filter: filter responses while generation is in progress.
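Three of these optimizations (caching, early exit, and parallel execution) can be combined in a few lines of `asyncio`. The guards below are simulated: the banned pattern and the intent rule are hypothetical, and `asyncio.sleep` stands in for model latency with shorter delays than the table lists.

```python
import asyncio
import re
from functools import lru_cache

BANNED = re.compile(r"drop\s+table", re.I)  # hypothetical banned pattern


@lru_cache(maxsize=4096)
def regex_guard(text: str) -> bool:
    """Cheap check; cached so repeated inputs skip the regex entirely."""
    return BANNED.search(text) is None


async def ml_classifier_guard(text: str) -> bool:
    await asyncio.sleep(0.02)  # simulated classifier latency
    return True


async def intent_guard(text: str) -> bool:
    await asyncio.sleep(0.05)  # simulated LLM latency (shortened for the sketch)
    return "wire all funds" not in text.lower()


async def check(text: str) -> bool:
    """Return True when every guard passes."""
    if not regex_guard(text):  # early exit: skip the slower guards
        return False
    # Independent guards run concurrently, so the wait is roughly the
    # slowest guard rather than the sum of all guards.
    results = await asyncio.gather(ml_classifier_guard(text), intent_guard(text))
    return all(results)
```

Calling `asyncio.run(check("hello"))` runs both slow guards concurrently, while an input matching the banned pattern returns before either of them starts.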
Escalation Patterns
- Synchronous approval: higher security, higher latency; use for financial transactions.
- Asynchronous approval: lower latency, medium security; use for bulk processing.
- Conditional auto-approval: minimum latency; use when risk scoring is reliable.
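The choice between the three patterns can be expressed as a small routing function. The inputs and the `auto_threshold` value here are assumptions for illustration; a real system would calibrate the threshold against reviewed outcomes.

```python
def choose_escalation(risk_score: float, is_financial: bool,
                      scorer_is_reliable: bool,
                      auto_threshold: float = 0.2) -> str:
    """Pick an escalation pattern for one pending action."""
    if is_financial:
        return "synchronous"    # block until a reviewer decides
    if scorer_is_reliable and risk_score < auto_threshold:
        return "auto_approve"   # low risk and the score can be trusted
    return "asynchronous"       # queue for review without blocking the run
```

Financial actions always take the synchronous path in this sketch, even when the risk score is low, which matches the guidance to reserve auto-approval for cases where risk scoring is reliable.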
MCP and Skill Supply-Chain Guardrails
In AgentOps, the attack surface often expands through runtime extensions, not just the model. Use OWASP MCP Top 10 and Agentic Skills Top 10 controls as defaults.
| Risk | Default Control |
|---|---|
| Shadow MCP server | Manage server allowlist, owner, purpose, and scope in a registry |
| Tool poisoning | Do not trust tool output; validate provenance and content type |
| Token mismanagement | Validate audience, prohibit token passthrough, use short-lived tokens |
| Skill compromise | Require verified publisher, code signing, version pinning, permission manifest |
| Unexpected code execution | Use containers/sandboxes, limit filesystem/network egress, store execution logs |
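Two of the controls in the table, the server allowlist and token audience validation, can be sketched as a single authorization check. The registry entry, URL, and scope names below are hypothetical, and the token claims are assumed to have already passed signature verification in a JWT library before reaching this function.

```python
import time
from dataclasses import dataclass
from typing import Any, Dict, FrozenSet, Optional


@dataclass(frozen=True)
class RegisteredServer:
    url: str
    owner: str
    purpose: str
    scopes: FrozenSet[str]


# Hypothetical registry; in practice this lives in a managed service.
REGISTRY = {
    "https://mcp.internal.example/billing": RegisteredServer(
        url="https://mcp.internal.example/billing",
        owner="finance-platform",
        purpose="invoice lookup",
        scopes=frozenset({"invoices:read"}),
    ),
}


def authorize_call(server_url: str, claims: Dict[str, Any], required_scope: str,
                   now: Optional[float] = None) -> bool:
    """Allow a call only to a registered server with a token minted for it."""
    now = time.time() if now is None else now
    server = REGISTRY.get(server_url)
    if server is None:                      # shadow MCP server: not allowlisted
        return False
    if required_scope not in server.scopes:
        return False
    aud = claims.get("aud")                 # "aud" may be a string or a list
    audiences = [aud] if isinstance(aud, str) else list(aud or [])
    if server_url not in audiences:         # token issued for another audience
        return False
    exp = claims.get("exp")
    if exp is None or exp <= now:           # require unexpired, short-lived tokens
        return False
    return True
```

Rejecting tokens whose audience does not match the target server is what prevents token passthrough: a token issued for one service cannot be replayed against another.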
2026 Guardrail Tooling
| Tool | Version/Status | Notes |
|---|---|---|
| NeMo Guardrails | v0.20.0 | NVIDIA, Colang modeling, parallel rails, native OpenTelemetry |
| Guardrails AI | v0.9.1 | Open-source Python, Guardrails Hub validators |
| Lakera Guard (Check Point) | Acquisition complete | Acquired by Check Point in September 2025, integrated into the Infinity platform, sub-50ms latency |
| OpenAI Agents SDK | Current | input/output/tool guardrails, human review, resumable state |
| Anthropic/Claude guardrails | Current | jailbreak, prompt leak, character consistency, streaming refusal guidance |
2026 Shift
NeMo Guardrails moved toward OpenTelemetry integration, so LLM calls, rail execution, and token usage can be unified in standard observability pipelines. Lakera Guard evolved into enterprise AI security after the Check Point acquisition.
Baseline and Sources
| Item | Baseline Date | Recheck By | Primary Source |
|---|---|---|---|
| OpenAI guardrails/human review | 2026-05-17 | 2026-06-16 | https://developers.openai.com/api/docs/guides/agents/guardrails-approvals |
| OWASP MCP Top 10 | 2026-05-17 | 2026-06-16 | https://owasp.org/www-project-mcp-top-10/ |
| OWASP Agentic Skills Top 10 | 2026-05-17 | 2026-06-16 | https://owasp.org/www-project-agentic-skills-top-10/ |
| Claude guardrails/refusal handling | 2026-05-17 | 2026-06-16 | https://docs.claude.com/en/docs/test-and-evaluate/strengthen-guardrails/handle-streaming-refusals |