Ch2. Versioning and Release
Release prompts, models, tools, and policies as traceable artifacts
Key takeaways
- The release unit for an AI service bundles prompts, models, tools, and policy packs into one traceable artifact, each with its own version rule.
- A release artifact must record not only what changed but which permissions, eval set, and observability schema validated it.
- Gate releases on both quality and security thresholds (quality delta ≥ -1%, cost ≤ +5%, PII exposure == 0, injection-test pass == 100%) before canary rollout at 5-10%.
- Run a security checklist covering policy tests, PII scans, injection tests, permission diffs, approval-resumption tests, and trace compatibility.
- Roll back immediately on PII exposure, policy bypass over 0.5%, privilege escalation, injection success, or pre-auth A2A resource leakage, even if quality improved.
The release unit for an AI service is not just code.
To control regressions, manage prompts, models, tools, and policy packs together as one release artifact.
Version Components
| Component | Recommended Version Rule |
|---|---|
| Prompt | p-YYYYMMDD.N |
| Model | Provider model ID plus internal compatibility level |
| Tool Schema | SemVer (major.minor.patch) |
| Policy Pack | Hash version with approval history |
| MCP Server | Allowlist ID plus scope plus server version |
| Eval Set | Dataset hash plus grader version |
| Trace Schema | OTel/AOS field version |
The release artifact must show not only what changed, but also which permissions, evaluation set, and observability schema were used to validate it.
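The version rules in the table can be checked mechanically before an artifact is assembled. The following is a minimal sketch, not a definitive implementation: the function names, the regular expressions, and the `sha256:<digest>+<grader>` identifier format are assumptions made for illustration, and the SemVer check handles only a bare `major.minor.patch` string with no prefix.

```python
import hashlib
import re
from datetime import date

# Hypothetical patterns derived from the version rules in the table above.
PROMPT_VERSION = re.compile(r"^p-(\d{4})(\d{2})(\d{2})\.(\d+)$")
SEMVER = re.compile(r"^(\d+)\.(\d+)\.(\d+)$")


def is_valid_prompt_version(value: str) -> bool:
    """Accept p-YYYYMMDD.N only when YYYYMMDD is a real calendar date."""
    match = PROMPT_VERSION.match(value)
    if match is None:
        return False
    year, month, day, _revision = match.groups()
    try:
        date(int(year), int(month), int(day))
    except ValueError:
        return False
    return True


def parse_semver(value: str) -> tuple:
    """Split a bare major.minor.patch string into integers."""
    match = SEMVER.match(value)
    if match is None:
        raise ValueError(f"not a SemVer string: {value!r}")
    return tuple(int(part) for part in match.groups())


def is_breaking_tool_change(old: str, new: str) -> bool:
    """Assumption: a major-version bump marks a breaking schema change."""
    return parse_semver(new)[0] > parse_semver(old)[0]


def eval_set_id(dataset_bytes: bytes, grader_version: str) -> str:
    """Combine a dataset hash with the grader version (illustrative format)."""
    digest = hashlib.sha256(dataset_bytes).hexdigest()
    return f"sha256:{digest}+{grader_version}"
```

Tying the eval-set identifier to both the dataset bytes and the grader version means a change to either one yields a new identifier, so two releases validated under different graders cannot be mistaken for comparable runs.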
Release Pipeline
1. Pass offline evaluation criteria: quality, safety, and cost.
2. Pass security checks: policy tests, PII scan, and prompt-injection tests.
3. Start canary traffic at 5-10%.
4. Monitor burn rate for errors, latency, cost, and security violations.
5. Expand gradually if healthy: 25% to 50% to 100%.
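The canary and expansion steps above can be sketched as a deterministic traffic splitter plus a stage-advance rule. This is an illustrative sketch under stated assumptions: `STAGES`, `in_rollout`, and `next_stage` are hypothetical names, the canary starts at 5% (the low end of the 5-10% range), and dropping to 0% on an unhealthy signal is one possible rollback policy rather than something the pipeline prescribes.

```python
import hashlib

# Rollout percentages: canary first, then the gradual expansion steps.
STAGES = [5, 25, 50, 100]


def in_rollout(request_id: str, percent: int) -> bool:
    """Deterministically decide whether a request goes to the new release."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return bucket < percent


def next_stage(current: int, healthy: bool) -> int:
    """Advance one stage when healthy; return 0 (full rollback) otherwise."""
    if not healthy:
        return 0
    index = STAGES.index(current)
    return STAGES[min(index + 1, len(STAGES) - 1)]
```

Hashing the request ID instead of sampling at random keeps a given request ID on the same side of the split across retries, and any ID that is in the 5% bucket stays in the rollout as the percentage grows.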
Release Gate Example
release_gate:
  # Quality and performance
  quality_delta: '>= -1.0%'
  cost_delta: '<= +5%'
  p95_latency_delta: '<= +10%'
  # Security
  safety_violation_rate: '<= 0.2%'
  pii_exposure_count: '== 0'
  prompt_injection_test_pass_rate: '== 100%'
  privilege_escalation_attempts: '== 0'
  security_policy_test_coverage: '>= 95%'
  # Agent/tool controls
  unapproved_tool_scope_changes: '== 0'
  mcp_server_allowlist_diff: 'reviewed'
  trace_schema_compatible: true
  approval_resume_test_passed: true

Release Artifact Manifest Example
release_artifact:
  app_version: ai-support-2026.05.17-1
  prompt_version: p-20260517.2
  model_policy: routing-20260517
  tool_schema_version: tools-1.8.0
  mcp_allowlist_hash: sha256:8d4f...
  skill_manifest_hash: sha256:52ac...
  eval_dataset_hash: sha256:93ab...
  grader_version: judge-20260517.1
  trace_schema: genai-otel-1.41.0+aos-adapter-0.3
  approvals:
    owner: platform-ai
    security: approved
    compliance: approved

Security Verification Checklist
- Policy tests: system-instruction bypass, tool access control, output filtering.
- PII scan: sensitive data in prompt templates, RAG documents, and system messages.
- Injection tests: direct prompt bypass, RAG poisoning, and tool-result manipulation.
- Permission diff review: added or expanded scopes for MCP servers, skills, and function tools.
- Approval resumption test: a run paused for human review resumes from the same state after approval or rejection.
- Trace compatibility: new trace fields remain compatible with dashboards, eval graders, and incident runbooks.
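The permission diff review in the checklist above amounts to a set difference between the scopes granted before and after the release. A minimal sketch follows, assuming scopes are stored as a mapping from server, skill, or tool ID to a set of scope strings; the function name and the IDs in the usage example are hypothetical.

```python
from typing import Dict, Set


def added_scopes(
    old: Dict[str, Set[str]], new: Dict[str, Set[str]]
) -> Dict[str, Set[str]]:
    """Return scopes present in `new` but absent from `old`, keyed by ID.

    An ID that appears only in `new` is reported with all of its scopes,
    because a newly added server or skill is itself an expansion.
    """
    diff: Dict[str, Set[str]] = {}
    for name, scopes in new.items():
        extra = scopes - old.get(name, set())
        if extra:
            diff[name] = extra
    return diff


# Hypothetical allowlists before and after a release candidate.
before = {"crm-mcp": {"read:tickets"}}
after = {
    "crm-mcp": {"read:tickets", "write:tickets"},
    "fs-skill": {"read:files"},
}
pending_review = added_scopes(before, after)
```

Here `pending_review` holds the `write:tickets` expansion on `crm-mcp` and the whole scope set of the new `fs-skill` entry. Removed scopes are deliberately left out, since the checklist asks reviewers to look at added or expanded scopes.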
Security Rollback Conditions
- Roll back immediately when PII exposure is detected.
- Roll back when the policy-bypass rate exceeds 0.5%.
- Roll back when privilege escalation attempts are detected.
- Roll back when a prompt-injection success case is found.
- Roll back when a new MCP server or skill performs unauthorized network, file, or system access.
- Roll back when an A2A peer leaks internal resource existence before authentication.
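The rollback conditions above can be collected into a single decision function that reports every triggered reason. This is a sketch under assumptions: the `SecuritySignals` fields are hypothetical names for counters a monitoring system might expose, and the policy-bypass figure is treated as a fraction of observed requests.

```python
from dataclasses import dataclass
from typing import List

POLICY_BYPASS_LIMIT = 0.005  # 0.5%, expressed as a fraction


@dataclass
class SecuritySignals:
    """Hypothetical counters observed during a canary or full rollout."""
    pii_exposures: int = 0
    policy_bypass_rate: float = 0.0
    privilege_escalation_attempts: int = 0
    injection_successes: int = 0
    unauthorized_access_events: int = 0  # new MCP server or skill misbehaving
    pre_auth_resource_leaks: int = 0     # A2A peer leaking before auth


def rollback_reasons(signals: SecuritySignals) -> List[str]:
    """Return every triggered rollback condition; empty means keep serving."""
    reasons = []
    if signals.pii_exposures > 0:
        reasons.append("PII exposure detected")
    if signals.policy_bypass_rate > POLICY_BYPASS_LIMIT:
        reasons.append("policy bypass rate above 0.5%")
    if signals.privilege_escalation_attempts > 0:
        reasons.append("privilege escalation attempt detected")
    if signals.injection_successes > 0:
        reasons.append("prompt-injection success found")
    if signals.unauthorized_access_events > 0:
        reasons.append("unauthorized network, file, or system access")
    if signals.pre_auth_resource_leaks > 0:
        reasons.append("A2A resource leak before authentication")
    return reasons
```

Quality metrics are absent from the function's inputs on purpose, so an improvement in quality cannot offset a security signal.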
Security First
Do not proceed with release if any security criterion fails, even when performance or quality improves. Security is a non-negotiable gate.
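To make this rule concrete, the threshold expressions from the release gate example can be evaluated by a small checker that fails the release if any single rule fails. The sketch below is an assumption-laden illustration: `passes` and `evaluate` are hypothetical helpers, measured values are given in the same unit as the rule (percentage points or raw counts), and only the `>=`, `<=`, and `==` comparison forms are handled.

```python
import operator
import re
from typing import Dict, List, Tuple

OPS = {">=": operator.ge, "<=": operator.le, "==": operator.eq}
# Matches rules such as '>= -1.0%', '<= +5%', or '== 0'.
RULE = re.compile(r"^(>=|<=|==)\s*([+-]?\d+(?:\.\d+)?)%?$")


def passes(rule: str, measured: float) -> bool:
    """Compare a measured value against one threshold expression."""
    match = RULE.match(rule.strip())
    if match is None:
        raise ValueError(f"unsupported rule: {rule!r}")
    op, threshold = match.groups()
    return OPS[op](measured, float(threshold))


def evaluate(
    gate: Dict[str, str], metrics: Dict[str, float]
) -> Tuple[bool, List[str]]:
    """Return (release_allowed, failed_keys); a missing metric fails closed."""
    failures = [
        key
        for key, rule in gate.items()
        if key not in metrics or not passes(rule, metrics[key])
    ]
    return (not failures, failures)


gate = {
    "quality_delta": ">= -1.0%",
    "cost_delta": "<= +5%",
    "pii_exposure_count": "== 0",
}
# Quality improved by 3 points, yet one PII exposure blocks the release.
allowed, failed = evaluate(
    gate, {"quality_delta": 3.0, "cost_delta": 2.0, "pii_exposure_count": 1}
)
```

In this example `allowed` is `False` and `failed` names only `pii_exposure_count`. Because every rule must pass, no weighting scheme exists through which a quality gain could buy back a security failure.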
Practice Principle
Model upgrades often carry more regression risk than expected. Prefer parallel operation and staged rollout over an immediate full switch.
Baseline and Sources
| Item | Baseline Date | Recheck By | Primary Source |
|---|---|---|---|
| Human review/resumable state | 2026-05-17 | 2026-06-16 | https://developers.openai.com/api/docs/guides/agents/guardrails-approvals |
| OTel GenAI trace schema | 2026-05-17 | 2026-06-16 | https://opentelemetry.io/docs/specs/semconv/gen-ai/ |
| MCP/Skill scope control | 2026-05-17 | 2026-06-16 | https://owasp.org/www-project-mcp-top-10/ |