Ch4. Online Guardrails

Design real-time policy enforcement, blocking, fallback, and human approval loops

Key takeaways

Treat online guardrails as control systems that limit business risk, layered as input, intent, tool, and output guards plus escalation.
Design quality, cost, and safety fallbacks, and never fail open: degrade to restricted mode and require idempotency keys for high-risk tools.
Use resumable human approval: pause the run, persist state and arguments, then resume the same run state after approval to keep one audit trail.
Pick guard implementations by latency budget, from regex (<1ms) to LLM-based intent guards (200-500ms), and optimize with parallel execution, caching, early exit, and streaming filters.
Extend guardrails to the MCP and Skill supply chain using OWASP MCP Top 10 and Agentic Skills Top 10 controls like allowlists, token audience validation, and code signing.

Online guardrails are not a supplement to model quality. They are control systems that limit business risk.

Guardrail Layers

Layer	Blocks	Action
Input guard	Banned terms or attack patterns	Reject or ask for a new input
Intent guard	Disallowed intent	Restricted response
Tool guard	High-risk action	Approval wait
Output guard	PII or secret exposure	Automatic masking
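
As a rough illustration of the table above, the four layers can be sketched as small functions that each return a decision. The patterns, intent labels, and tool names below are hypothetical placeholders, not taken from any particular product; a real deployment would load them from policy configuration.

```python
import re
from dataclasses import dataclass

# Hypothetical policy data, for illustration only.
ATTACK_PATTERNS = [re.compile(r"ignore (all )?previous instructions", re.I)]
DISALLOWED_INTENTS = {"credential_harvest"}
HIGH_RISK_TOOLS = {"issue_refund", "run_code"}
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


@dataclass
class Decision:
    layer: str
    action: str  # "allow", "reject", "restricted", "approval_wait", or "mask"
    payload: str = ""


def input_guard(text: str) -> Decision:
    # Reject when a known attack pattern appears in the user input.
    if any(p.search(text) for p in ATTACK_PATTERNS):
        return Decision("input", "reject")
    return Decision("input", "allow", text)


def intent_guard(intent: str) -> Decision:
    # The intent label is assumed to come from an upstream classifier.
    if intent in DISALLOWED_INTENTS:
        return Decision("intent", "restricted")
    return Decision("intent", "allow")


def tool_guard(tool_name: str) -> Decision:
    # High-risk tools wait for approval instead of executing.
    if tool_name in HIGH_RISK_TOOLS:
        return Decision("tool", "approval_wait")
    return Decision("tool", "allow")


def output_guard(text: str) -> Decision:
    # Mask email addresses as one simple example of PII.
    masked = EMAIL.sub("[REDACTED_EMAIL]", text)
    return Decision("output", "mask" if masked != text else "allow", masked)
```

Each guard returns a decision rather than raising, so the caller can map the action column of the table onto its own control flow.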

Fallback Design

Quality fallback: if a high-capability model fails, switch to a conservative prompt and a stable model; if a quality regression is detected, roll back to the previous version.
Cost fallback: in over-budget windows, compress prompts and route to lighter models; put cost ceilings on high-cost tenants and degrade them to restricted mode when needed.
Safety fallback: if the policy judgment is uncertain, block automatic execution and move to human approval; require idempotency keys for high-risk tools.
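
The fallback rules above can be read as a mapping from runtime signals to actions. The following is a minimal sketch under that reading; the signal names, the action strings, and the safety-first ordering are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    # Hypothetical runtime signals; real systems derive these from monitoring.
    primary_failed: bool = False
    quality_regressed: bool = False
    over_budget: bool = False
    tenant_over_ceiling: bool = False
    policy_uncertain: bool = False


def choose_fallbacks(s: Signals) -> list[str]:
    """Return the fallback actions to apply: safety, then quality, then cost."""
    actions = []
    if s.policy_uncertain:
        actions.append("block_auto_execution_and_request_human_approval")
    if s.primary_failed:
        actions.append("switch_to_conservative_prompt_and_stable_model")
    if s.quality_regressed:
        actions.append("roll_back_to_previous_version")
    if s.over_budget:
        actions.append("compress_prompt_and_route_to_lighter_model")
    if s.tenant_over_ceiling:
        actions.append("degrade_tenant_to_restricted_mode")
    return actions
```

Returning a list rather than a single action lets several fallbacks apply at once, for example human approval for the pending tool call plus a lighter model for the rest of the turn.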

Gate Failure Handling

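Do not fail open. A failure of the policy gate itself, such as a timeout or crash, must not permit unrestricted execution.
Degrade to restricted mode when guardrails fail.
Require idempotency keys for high-risk actions, so a retry after a gate failure cannot repeat the action.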
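
One way to express these rules in code is a wrapper that treats any guard error as a reason to restrict, paired with a deterministic idempotency key. This is a sketch under assumptions: the `guard` callable, the `"restricted"` mode name, and the in-memory result store are all hypothetical stand-ins.

```python
import hashlib
import json


def run_guard_fail_closed(guard, request: dict) -> str:
    """Return "allow", "block", or "restricted"; guard errors never allow."""
    try:
        return "allow" if guard(request) else "block"
    except Exception:
        # Fail closed: a broken guard must not widen what the agent may do.
        return "restricted"


def idempotency_key(trace_id: str, tool_name: str, arguments: dict) -> str:
    # Canonical JSON so argument ordering does not change the key.
    canonical = json.dumps(arguments, sort_keys=True, separators=(",", ":"))
    raw = f"{trace_id}|{tool_name}|{canonical}".encode()
    return hashlib.sha256(raw).hexdigest()


_results: dict[str, str] = {}  # stand-in for a durable result store


def execute_once(key: str, action) -> str:
    """Run the action only for a new key; replays return the stored result."""
    if key not in _results:
        _results[key] = action()
    return _results[key]
```

With this shape, a retry that follows a gate timeout reuses the same key and gets the stored result back instead of issuing the action a second time.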

Human Review and Resumable Approval

Even when the model decides that a high-risk tool call is needed, do not execute it immediately. Pause the run, store the reason and arguments, then resume the same run state after approval or rejection.

Stage	Evidence to Store
Approval request	tool name, arguments, risk score, requester, trace_id
Pending review	serialized state, approval_id, SLA, reviewer group
Approval/rejection	reviewer, decision, edited arguments, reason
Resumption	resumed trace_id, final tool result, downstream action

If review may take more than a few minutes, persist the run state rather than holding it in memory. Do not restart from a new user turn: resuming the same run preserves the audit trail and the idempotency key.
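
The pause, persist, and resume flow can be sketched as follows. This is an illustrative outline only, not the API of any SDK: the `STORE` dict stands in for a durable database, and the function and field names are assumptions that mirror the evidence table above.

```python
import json
import uuid
from dataclasses import asdict, dataclass


@dataclass
class PendingApproval:
    approval_id: str
    trace_id: str
    tool_name: str
    arguments: dict
    risk_score: float
    requester: str
    serialized_state: str
    decision: str = "pending"
    reviewer: str = ""
    reason: str = ""


STORE: dict[str, str] = {}  # stand-in for a durable store


def pause_for_approval(trace_id, tool_name, arguments, risk_score,
                       requester, run_state: dict) -> str:
    """Persist the run state and the approval request; return approval_id."""
    approval_id = str(uuid.uuid4())
    record = PendingApproval(approval_id, trace_id, tool_name, arguments,
                             risk_score, requester, json.dumps(run_state))
    STORE[approval_id] = json.dumps(asdict(record))
    return approval_id


def resume(approval_id, reviewer, decision, reason,
           edited_arguments=None) -> dict:
    """Record the review outcome and rebuild the same run state."""
    record = PendingApproval(**json.loads(STORE[approval_id]))
    record.reviewer, record.decision, record.reason = reviewer, decision, reason
    if edited_arguments is not None:
        record.arguments = edited_arguments
    STORE[approval_id] = json.dumps(asdict(record))
    state = json.loads(record.serialized_state)
    state["pending_tool"] = {
        "name": record.tool_name,
        "arguments": record.arguments,
        "approved": decision == "approve",
    }
    # The original trace_id is returned so the audit trail stays continuous.
    return {"trace_id": record.trace_id, "state": state}
```

A rejection goes through the same `resume` path with `approved` set to false, so the run can explain the refusal instead of silently dropping the request.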

Approval Queue SLA Example

approval_queue:
  refund_over_limit:
    reviewer_group: finance-ops
    sla_minutes: 15
    auto_expire_minutes: 60
    default_on_expiry: reject
    evidence:
      - trace_id
      - approval_id
      - tool_arguments
      - risk_score
  code_execution:
    reviewer_group: platform-security
    sla_minutes: 5
    auto_expire_minutes: 20
    default_on_expiry: reject
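
To show how a worker might interpret this configuration, here is a small sketch that classifies a pending request by age. The dict mirrors the YAML values above so the example needs no YAML parser; the status names `pending` and `sla_breached` are assumptions, not part of the configuration.

```python
from datetime import datetime, timedelta, timezone

# Mirrors the YAML above; a real worker would load the file instead.
QUEUE = {
    "refund_over_limit": {"sla_minutes": 15, "auto_expire_minutes": 60,
                          "default_on_expiry": "reject"},
    "code_execution": {"sla_minutes": 5, "auto_expire_minutes": 20,
                       "default_on_expiry": "reject"},
}


def approval_status(queue_name: str, requested_at: datetime,
                    now: datetime) -> str:
    """Classify a pending approval by how long it has waited."""
    cfg = QUEUE[queue_name]
    age = now - requested_at
    if age >= timedelta(minutes=cfg["auto_expire_minutes"]):
        # Expired requests take the configured default, here "reject".
        return cfg["default_on_expiry"]
    if age > timedelta(minutes=cfg["sla_minutes"]):
        return "sla_breached"  # still reviewable, but should page someone
    return "pending"


t0 = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
print(approval_status("code_execution", t0, t0 + timedelta(minutes=10)))
```

Because `default_on_expiry` is `reject` for both queues, an unanswered request never turns into an implicit approval.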

Implementation Complexity

Options by Layer

Guardrail	Implementation	Complexity	Latency Impact
Input Guard	Regex-based	Low	< 1ms
Input Guard	ML classifier	Medium	10-30ms
Intent Guard	LLM-based	High	200-500ms
Tool Guard	Static rules	Low	< 1ms
Tool Guard	Dynamic risk score	Medium	5-10ms
Output Guard	PII regex	Low	< 5ms
Output Guard	NER model	Medium	20-50ms
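
The latency column can be turned into a quick budget check. The sketch below takes the upper end of each range in the table as a worst case and assumes the chosen guards run one after another, which gives a pessimistic bound; the function names are illustrative.

```python
# Upper end of each latency range from the table, in milliseconds.
WORST_CASE_MS = {
    ("input", "regex"): 1,
    ("input", "ml_classifier"): 30,
    ("intent", "llm"): 500,
    ("tool", "static_rules"): 1,
    ("tool", "dynamic_risk_score"): 10,
    ("output", "pii_regex"): 5,
    ("output", "ner_model"): 50,
}


def sequential_worst_case_ms(chosen: list[tuple[str, str]]) -> int:
    """Sum worst-case latencies, assuming the guards run sequentially."""
    return sum(WORST_CASE_MS[c] for c in chosen)


def fits_budget(chosen: list[tuple[str, str]], budget_ms: int) -> bool:
    return sequential_worst_case_ms(chosen) <= budget_ms


light = [("input", "regex"), ("tool", "static_rules"), ("output", "pii_regex")]
heavy = [("input", "ml_classifier"), ("intent", "llm"),
         ("tool", "dynamic_risk_score"), ("output", "ner_model")]
print(sequential_worst_case_ms(light))  # 1 + 1 + 5 = 7
print(sequential_worst_case_ms(heavy))  # 30 + 500 + 10 + 50 = 590
```

The LLM-based intent guard dominates the heavy configuration, which is why it is the first candidate for running in parallel or applying only to selected traffic.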

Latency Optimization

Parallel execution: run independent guards concurrently.
Caching: reuse repeated pattern results.
Early exit: skip downstream guards when an input guard blocks the request.
Streaming filter: filter responses while generation is in progress.
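
The four techniques above can be sketched together with the standard library. The guards here are stand-ins: the `asyncio.sleep` calls simulate model latency, and the banned pattern and phrases are hypothetical.

```python
import asyncio
import re
from functools import lru_cache

BANNED = re.compile(r"drop\s+table", re.I)  # hypothetical attack pattern


@lru_cache(maxsize=4096)
def regex_guard(text: str) -> bool:
    """Caching: repeated inputs reuse the earlier verdict."""
    return BANNED.search(text) is None


async def classifier_guard(text: str) -> bool:
    await asyncio.sleep(0.02)  # stand-in for an ML classifier call
    return "password dump" not in text.lower()


async def intent_guard(text: str) -> bool:
    await asyncio.sleep(0.05)  # stand-in for a slower LLM-based check
    return True


async def check_request(text: str) -> bool:
    # Early exit: the cheap guard runs first and can skip everything else.
    if not regex_guard(text):
        return False
    # Parallel execution: independent guards run concurrently.
    results = await asyncio.gather(classifier_guard(text), intent_guard(text))
    return all(results)


def stream_filter(chunks, banned: str):
    """Streaming filter: hold back a short tail so a banned phrase that is
    split across two chunks is still caught before it is emitted."""
    hold = len(banned) - 1
    buf = ""
    for chunk in chunks:
        buf += chunk
        idx = buf.find(banned)
        if idx != -1:
            yield buf[:idx] + "[BLOCKED]"
            return
        if len(buf) > hold:
            yield buf[:len(buf) - hold]
            buf = buf[len(buf) - hold:]
    yield buf
```

With concurrent execution, the two slow guards cost roughly the latency of the slower one rather than their sum. The streaming filter trades a few characters of output delay for the ability to stop mid-generation.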

Escalation Patterns

Synchronous approval: higher security, higher latency; use for financial transactions.
Asynchronous approval: lower latency, medium security; use for bulk processing.
Conditional auto-approval: minimum latency; use when risk scoring is reliable.
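
A routing function makes the trade-off between these three patterns explicit. The category names, the `scorer_reliable` flag, and the 0.2 threshold below are hypothetical; the only rules taken from the list above are that financial actions use synchronous approval, bulk work uses asynchronous approval, and auto-approval depends on trusting the risk score.

```python
def choose_escalation(tool_category: str, risk_score: float,
                      scorer_reliable: bool,
                      auto_threshold: float = 0.2) -> str:
    """Pick an escalation pattern for one tool call."""
    if tool_category == "financial":
        # Highest security: block the run until a reviewer decides.
        return "synchronous_approval"
    if scorer_reliable and risk_score < auto_threshold:
        # Minimum latency, only when the risk score can be trusted.
        return "conditional_auto_approval"
    if tool_category == "bulk":
        # Queue for review without blocking the rest of the batch.
        return "asynchronous_approval"
    # Unknown categories default to the strictest pattern.
    return "synchronous_approval"
```

Defaulting unknown categories to synchronous approval keeps the router consistent with the no-fail-open rule.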

MCP and Skill Supply-Chain Guardrails

In AgentOps, the attack surface often expands through runtime extensions, not just the model. Use OWASP MCP Top 10 and Agentic Skills Top 10 controls as defaults.

Risk	Default Control
Shadow MCP server	Manage server allowlist, owner, purpose, and scope in a registry
Tool poisoning	Do not trust tool output; validate provenance and content type
Token mismanagement	Validate audience, prohibit token passthrough, use short-lived tokens
Skill compromise	Require verified publisher, code signing, version pinning, permission manifest
Unexpected code execution	Use containers/sandboxes, limit filesystem/network egress, store execution logs
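
Three of these controls (server allowlist, token audience and lifetime checks, version pinning) can be sketched as plain checks. This is an illustration under assumptions: the registry contents, the URL, and the 900-second lifetime ceiling are hypothetical, and the token claims are assumed to come from a token whose signature has already been verified elsewhere. Digest pinning is shown as a simpler stand-in and does not replace code signing.

```python
import hashlib
import hmac
import time
from typing import Optional

MAX_TOKEN_TTL_S = 900  # hypothetical ceiling for "short-lived"

# Hypothetical registry: allowlisted servers with owner, purpose, and scope.
MCP_REGISTRY = {
    "https://mcp.internal.example/billing": {
        "owner": "finance-ops",
        "purpose": "refund lookups",
        "scopes": {"refund:read"},
    },
}


def authorize_mcp_call(server_url: str, scope: str, claims: dict,
                       now: Optional[float] = None) -> tuple[bool, str]:
    """Check allowlist, scope, token audience, expiry, and lifetime."""
    now = time.time() if now is None else now
    entry = MCP_REGISTRY.get(server_url)
    if entry is None:
        return False, "server not in allowlist"
    if scope not in entry["scopes"]:
        return False, "scope not registered"
    aud = claims.get("aud")
    audiences = aud if isinstance(aud, list) else [aud]
    if server_url not in audiences:
        # A token minted for another service must not be passed through.
        return False, "token audience mismatch"
    if claims.get("exp", 0) <= now:
        return False, "token expired"
    if claims.get("exp", 0) - claims.get("iat", 0) > MAX_TOKEN_TTL_S:
        return False, "token lifetime too long"
    return True, "ok"


def verify_pinned_digest(bundle: bytes, pinned_sha256: str) -> bool:
    """Version pinning by content digest, compared in constant time."""
    actual = hashlib.sha256(bundle).hexdigest()
    return hmac.compare_digest(actual, pinned_sha256)
```

Every failure path returns a reason string, which gives the execution log something concrete to record when a call is refused.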

2026 Guardrail Tooling

Tool	Version/Status	Notes
NeMo Guardrails	v0.20.0	NVIDIA, Colang modeling, parallel rails, native OpenTelemetry
Guardrails AI	v0.9.1	Open-source Python, Guardrails Hub validators
Lakera Guard (now Check Point)	Acquisition complete	Acquired by Check Point in September 2025, integrated into the Infinity platform, sub-50ms latency
OpenAI Agents SDK	Current	input/output/tool guardrails, human review, resumable state
Anthropic/Claude guardrails	Current	jailbreak, prompt leak, character consistency, streaming refusal guidance

2026 Shift

NeMo Guardrails moved toward OpenTelemetry integration, so LLM calls, rail execution, and token usage can be unified in standard observability pipelines. Lakera Guard evolved into enterprise AI security after the Check Point acquisition.

Baseline and Sources

Item	Baseline Date	Recheck By	Primary Source
OpenAI guardrails/human review	2026-05-17	2026-06-16	https://developers.openai.com/api/docs/guides/agents/guardrails-approvals
OWASP MCP Top 10	2026-05-17	2026-06-16	https://owasp.org/www-project-mcp-top-10/
OWASP Agentic Skills Top 10	2026-05-17	2026-06-16	https://owasp.org/www-project-agentic-skills-top-10/
Claude guardrails/refusal handling	2026-05-17	2026-06-16	https://docs.claude.com/en/docs/test-and-evaluate/strengthen-guardrails/handle-streaming-refusals
