Ch6. Cost and Latency Optimization
Manage model spend, response time, caching, routing, fallback, and quality checks together so AI services stay reliable within unit-cost targets.
Key takeaways
- Use this chapter as a first-pass operating checklist before changing systems, data, permissions, or customer-facing workflows.
- Validate platform-specific details against current official docs or internal policy before rollout.
For AI services, cost and latency are two sides of the same operating problem.
Increasing model capability can raise cost; reducing cost can destabilize quality.
Budget Model
Optimization Priority
| Priority | Lever | Expected Effect |
|---|---|---|
| 1 | Improve cache hit rate | Improve cost and latency together |
| 2 | Optimize prompt length | Reduce token cost |
| 3 | Model routing | Match cost to task complexity |
| 4 | Async tool calls | Improve p95 latency |
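The async tool call lever in the table can be illustrated with a minimal sketch: independent tool calls run concurrently, each under its own timeout, so the slowest call bounds the wait instead of the sum of all calls. The `call_tool` coroutine is a hypothetical stand-in for a real tool client, and the delays are invented for illustration.

```python
import asyncio


async def call_tool(name: str, delay_s: float) -> str:
    # Hypothetical stand-in for a real tool call; sleeps to simulate I/O.
    await asyncio.sleep(delay_s)
    return f"{name}:ok"


async def call_with_timeout(name: str, delay_s: float, timeout_s: float) -> str:
    # Bound each tool call individually so one slow tool cannot stall the rest.
    try:
        return await asyncio.wait_for(call_tool(name, delay_s), timeout=timeout_s)
    except asyncio.TimeoutError:
        return f"{name}:timeout"


async def run_tools(tools: dict[str, float], tool_timeout_s: float) -> list[str]:
    # gather() runs the calls concurrently and returns results in input order.
    calls = [call_with_timeout(n, d, tool_timeout_s) for n, d in tools.items()]
    return await asyncio.gather(*calls)
```

Calling `asyncio.run(run_tools({"search": 0.01, "slow": 0.5}, 0.1))` returns one result per tool, with the slow tool reported as a timeout rather than delaying the response.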
Pareto Operating View
Optimize for the balance of cost and latency, not a single metric.
- Growth stage: raise α and prioritize quality.
- Profitability stage: raise β and tighten cost controls.
- Strict SLA stage: raise γ and prioritize latency.
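The chapter does not spell out how α, β, and γ enter the budget model, so the following is only a sketch under one assumed form: a linear score in which α weights quality, β penalizes cost, and γ penalizes latency. The `Candidate` type, the normalization to a 0..1 range, and the sample weights are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    quality: float  # assumed normalized to 0..1, higher is better
    cost: float     # assumed normalized to 0..1, lower is better
    latency: float  # assumed normalized to 0..1, lower is better


def operating_score(c: Candidate, alpha: float, beta: float, gamma: float) -> float:
    # Assumed linear form: reward quality, penalize cost and latency.
    return alpha * c.quality - beta * c.cost - gamma * c.latency


def pick(candidates: list[Candidate], alpha: float, beta: float, gamma: float) -> Candidate:
    # Choose the configuration with the highest score for the current stage.
    return max(candidates, key=lambda c: operating_score(c, alpha, beta, gamma))
```

Under this assumed form, raising α favors the higher-quality configuration, while raising β or γ shifts the choice toward the cheaper or faster one.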
Policy Example
routing_policy:
  - if: complexity <= 2
    model: 'cost_optimized'
  - if: complexity >= 4
    model: 'quality_optimized'
timeout_policy:
  tool_timeout_ms: 2500
  global_timeout_ms: 7000

2026 Model Pricing Baseline
The table below uses official pricing pages checked on 2026-05-17. Model prices change often, so store the baseline date with every release gate and budget calculation.
| Model | Input (/1M) | Cached Input (/1M) | Output (/1M) | Notes |
|---|---|---|---|---|
| GPT-5.5 | $5.00 | $0.50 | $30.00 | OpenAI flagship |
| GPT-5.4 | $2.50 | $0.25 | $15.00 | Coding and professional work |
| GPT-5.4 mini | $0.75 | $0.075 | $4.50 | Lightweight coding/computer-use/subagent |
| Claude Opus 4.7/4.6/4.5 | $5.00 | $0.50 | $25.00 | Consider tokenizer effects for Opus 4.7 |
| Claude Sonnet 4.6/4.5 | $3.00 | $0.30 | $15.00 | Sonnet 4 is deprecated |
| Claude Haiku 4.5 | $1.00 | $0.10 | $5.00 | Fast lightweight model |
| DeepSeek V4 Flash | $0.14 cache miss | $0.0028 | $0.28 | 1M context, thinking/non-thinking |
| DeepSeek V4 Pro | $0.435 cache miss | $0.003625 | $0.87 | 75% discount requires recheck by 2026-05-31 15:59 UTC |
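As a worked example of reading the table, the sketch below computes the model cost of a single request from per-million-token prices. It assumes the reported input token count includes the cached tokens, which is not true of every provider's usage report, so check the convention before reusing it. The dictionary keys are illustrative labels, not API model identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    input_per_m: float         # USD per 1M uncached input tokens
    cached_input_per_m: float  # USD per 1M cached input tokens
    output_per_m: float        # USD per 1M output tokens


# Values copied from the pricing table (baseline date 2026-05-17).
PRICES = {
    "gpt-5.4-mini": Price(0.75, 0.075, 4.50),
    "claude-haiku-4.5": Price(1.00, 0.10, 5.00),
}


def request_cost_usd(price: Price, input_tokens: int,
                     cached_input_tokens: int, output_tokens: int) -> float:
    # Assumption: input_tokens is the total, of which cached_input_tokens were cache hits.
    if cached_input_tokens > input_tokens:
        raise ValueError("cached_input_tokens cannot exceed input_tokens")
    uncached = input_tokens - cached_input_tokens
    total = (uncached * price.input_per_m
             + cached_input_tokens * price.cached_input_per_m
             + output_tokens * price.output_per_m)
    return total / 1_000_000
```

For 3,840 input tokens (2,560 of them cached) and 620 output tokens, the GPT-5.4 mini row works out to about $0.0039, and most of that is output cost.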
Cost Strategy
The more important shift is not lower prices alone. Cached input, batch processing, data residency, priority/flex handling, and tool runtime costs are increasingly billed separately, so model routing by itself is not enough: manage cache hit rate and tool-call volume alongside it.
2026 Cost Optimization Levers
Prompt Caching
| Provider | Method | Cached Read Cost | Effect |
|---|---|---|---|
| OpenAI | Automatic 1,024+ token prefix caching, prompt_cache_key, 24h retention on some models | Up to 90% lower than standard input | Can improve latency by up to 80% |
| Anthropic | Automatic/explicit cache breakpoints, 5-minute/1-hour write, cache hit/refresh | 10% of base input | Tool/system/message hierarchy changes can invalidate cache |
| DeepSeek | Context caching | Model-specific cache-hit price | V4 Flash cache hit is $0.0028/M |
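To see why cache hit rate matters so much, the blended input price can be sketched as a weighted average of the base and cached prices. This sketch assumes hit rate is measured as the fraction of input tokens served from cache, and it ignores any cache-write surcharge a provider may apply.

```python
def effective_input_price(base_per_m: float, cached_per_m: float, hit_rate: float) -> float:
    # Blend base and cached prices by the token-weighted cache hit rate.
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return (1.0 - hit_rate) * base_per_m + hit_rate * cached_per_m


def input_savings_fraction(base_per_m: float, cached_per_m: float, hit_rate: float) -> float:
    # Fraction of input spend avoided relative to paying the base price for everything.
    return 1.0 - effective_input_price(base_per_m, cached_per_m, hit_rate) / base_per_m
```

With a $2.50 base price and a $0.25 cached price, a 60% hit rate gives an effective input price of $1.15 per 1M tokens, a 54% reduction in input spend.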
Batch API
Anthropic and OpenAI both provide 50% batch discounts. Use batch for evaluation, classification, embeddings, and large replays that do not need immediate responses.
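A rough sketch of the budgeting effect: jobs that do not need an immediate response are priced at the batch rate, and the rest stay at the synchronous rate. The `Job` type and the sample figures are hypothetical; the 50% discount is the figure stated above.

```python
from dataclasses import dataclass

BATCH_DISCOUNT = 0.50  # discount stated in the text; recheck against current pricing


@dataclass(frozen=True)
class Job:
    name: str
    sync_cost_usd: float            # estimated cost at synchronous prices
    needs_immediate_response: bool


def planned_cost_usd(jobs: list[Job], discount: float = BATCH_DISCOUNT) -> float:
    # Send every job that can wait through batch; keep interactive jobs synchronous.
    total = 0.0
    for job in jobs:
        if job.needs_immediate_response:
            total += job.sync_cost_usd
        else:
            total += job.sync_cost_usd * (1.0 - discount)
    return total
```

For example, a $10 interactive workload plus a $40 evaluation replay costs $30 under this plan instead of $50.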
Tool/Runtime Costs
| Cost Item | Management Standard |
|---|---|
| Web search | Track call count and result tokens separately |
| Code/container execution | Track container session time, file preload, stdout/stderr size |
| MCP/tool schema | Track tool definition tokens and schema cache hit rate |
| Voice/realtime | Separate audio tokens, first-audio latency, and interruption retry cost |
Cost Ledger Example
cost_ledger:
  run_id: run_20260517_001
  model:
    input_tokens: 3840
    cached_input_tokens: 2560
    output_tokens: 620
    unit_cost_usd: 0.00418
  tools:
    web_search_calls: 1
    container_minutes: 3
    mcp_schema_tokens: 1800
  business:
    tenant_id: acme-enterprise
    task_success: true
    cost_per_successful_task_usd: 0.014

Model Routing Services
| Service | Method |
|---|---|
| Martian | Real-time model routing per prompt |
| Not Diamond | Prompt transformation plus model selection |
| Unify AI | Quality/cost/speed optimized routing |
| OpenRouter | Multi-provider marketplace with caching support |
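For teams that route in-process rather than through a hosted service, the `routing_policy` from the Policy Example can be read as a first-match rule list. The sketch below is one possible reading, not the API of any service in the table. The policy leaves complexity 3 unspecified, so the `default_model` value here is a hypothetical placeholder.

```python
from typing import Callable, NamedTuple


class Rule(NamedTuple):
    matches: Callable[[int], bool]
    model: str


# Mirrors the two rules in the Policy Example.
RULES = [
    Rule(lambda c: c <= 2, "cost_optimized"),
    Rule(lambda c: c >= 4, "quality_optimized"),
]


def route(complexity: int, default_model: str = "balanced") -> str:
    # Return the model of the first matching rule.
    for rule in RULES:
        if rule.matches(complexity):
            return rule.model
    # No rule matched (complexity 3 in the source policy): use the hypothetical default.
    return default_model
```

Making the fallthrough explicit is a deliberate choice: a request that matches no rule should land on a known tier rather than fail at request time.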
Executive KPIs
- Gross Margin with AI Cost
- p95 Latency by Top Revenue Flows
- Cost per Successful Task
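Cost per Successful Task can be computed from per-run ledger entries. The sketch below assumes one common definition: total spend across all runs, failed ones included, divided by the number of successful runs. The `Run` type and the sample figures are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Run:
    model_cost_usd: float
    tool_cost_usd: float
    task_success: bool


def cost_per_successful_task(runs: list[Run]) -> Optional[float]:
    # Failed runs still cost money, so they stay in the numerator.
    total = sum(r.model_cost_usd + r.tool_cost_usd for r in runs)
    successes = sum(1 for r in runs if r.task_success)
    if successes == 0:
        return None  # undefined when nothing succeeded
    return total / successes
```

Because failed runs count toward spend but not toward successes, this metric rises when quality drops, even if per-request cost is flat.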
Practice Tip
Cost reduction often comes first from reducing unnecessary output tokens, repeated tool calls, and cache misses, not from switching models. Recheck pricing before release and make operating decisions using cost per successful task.
Baseline and Sources
| Item | Baseline Date | Recheck By | Primary Source |
|---|---|---|---|
| OpenAI model/tool pricing | 2026-05-17 | 2026-06-16 | https://openai.com/api/pricing/ |
| Claude model/tool pricing | 2026-05-17 | 2026-06-16 | https://platform.claude.com/docs/en/about-claude/pricing |
| DeepSeek V4 pricing | 2026-05-17 | 2026-05-31 | https://api-docs.deepseek.com/quick_start/pricing/ |
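The recheck dates in the table can back a simple release gate. The sketch below flags any pricing baseline whose recheck date has passed. It works at whole-day granularity, which is an assumption that drops the time-of-day cutoff noted for DeepSeek V4 Pro, and the dictionary keys are illustrative labels.

```python
from datetime import date

# (baseline date, recheck-by date) copied from the table above.
BASELINES = {
    "openai": (date(2026, 5, 17), date(2026, 6, 16)),
    "claude": (date(2026, 5, 17), date(2026, 6, 16)),
    "deepseek": (date(2026, 5, 17), date(2026, 5, 31)),
}


def stale_baselines(today: date) -> list[str]:
    # A baseline is stale once today is strictly after its recheck-by date.
    return sorted(name for name, (_, recheck_by) in BASELINES.items()
                  if today > recheck_by)
```

A release script could call `stale_baselines(date.today())` and block the release when the list is not empty.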