Cost and Latency | LLMOps AgentOps

Manage model spend, response time, caching, routing, fallback, and quality checks together so AI services stay reliable within unit-cost targets.

Key takeaways

Manage model spend, response time, caching, routing, fallback, and quality checks together so AI services stay reliable within unit-cost targets.
Use this chapter as a first-pass operating checklist before changing systems, data, permissions, or customer-facing workflows.
Validate platform-specific details against current official docs or internal policy before rollout.

For AI services, cost and latency are two sides of the same operating problem.
Increasing model capability can raise cost; reducing cost can destabilize quality.

Budget Model

\text{Monthly AI Cost} = \text{Requests} \times \text{Unit Cost}

\text{Latency Budget} = T_{retrieve} + T_{infer} + T_{tool} + T_{post}

Optimization Priority

Priority	Lever	Expected Effect
1	Improve cache hit rate	Improve cost and latency together
2	Optimize prompt length	Reduce token cost
3	Model routing	Match cost to task complexity
4	Async tool calls	Improve p95 latency

Pareto Operating View

Optimize for the balance of cost and latency, not a single metric.

\text{Utility} = \alpha \cdot \text{Quality} - \beta \cdot \text{Cost} - \gamma \cdot \text{Latency}

Growth stage: raise α and prioritize quality.
Profitability stage: raise β and tighten cost controls.
Strict SLA stage: raise γ and prioritize latency.

Policy Example

routing_policy:
  - if: complexity <= 2
    model: 'cost_optimized'
  - if: complexity >= 4
    model: 'quality_optimized'

timeout_policy:
  tool_timeout_ms: 2500
  global_timeout_ms: 7000

The more important shift is not just lower prices. Cached input, batch processing, data residency, priority/flex handling, and tool runtime costs are increasingly billed separately. Model routing alone is not enough; manage cache hit rate and tool-call volume together.

2026 Cost Optimization Levers

Prompt Caching

Provider	Method	Cached Read Cost	Effect
OpenAI	Automatic 1,024+ token prefix caching, `prompt_cache_key`, 24h retention on some models	Up to 90% lower than standard input	Can improve latency by up to 80%
Anthropic	Automatic/explicit cache breakpoints, 5-minute/1-hour write, cache hit/refresh	10% of base input	Tool/system/message hierarchy changes can invalidate cache
DeepSeek	Context caching	Model-specific cache-hit price	V4 Flash cache hit is $0.0028/M

Batch API

Anthropic and OpenAI both provide 50% batch discounts. Use batch for evaluation, classification, embeddings, and large replays that do not need immediate responses.

Tool/Runtime Costs

Cost Item	Management Standard
Web search	Track call count and result tokens separately
Code/container execution	Track container session time, file preload, stdout/stderr size
MCP/tool schema	Track tool definition tokens and schema cache hit rate
Voice/realtime	Separate audio tokens, first-audio latency, and interruption retry cost

Cost Ledger Example

cost_ledger:
  run_id: run_20260517_001
  model:
    input_tokens: 3840
    cached_input_tokens: 2560
    output_tokens: 620
    unit_cost_usd: 0.00418
  tools:
    web_search_calls: 1
    container_minutes: 3
    mcp_schema_tokens: 1800
  business:
    tenant_id: acme-enterprise
    task_success: true
    cost_per_successful_task_usd: 0.014

Model Routing Services

Service	Method
Martian	Real-time model routing per prompt
Not Diamond	Prompt transformation plus model selection
Unify AI	Quality/cost/speed optimized routing
OpenRouter	Multi-provider marketplace with caching support

Executive KPIs

Gross Margin with AI Cost
p95 Latency by Top Revenue Flows
Cost per Successful Task

Practice Tip

Cost reduction often comes first from reducing unnecessary output tokens, repeated tool calls, and cache misses, not from switching models. Recheck pricing before release and make operating decisions using cost per successful task.

Baseline and Sources

Item	Baseline Date	Recheck By	Primary Source
OpenAI model/tool pricing	2026-05-17	2026-06-16	https://openai.com/api/pricing/
Claude model/tool pricing	2026-05-17	2026-06-16	https://platform.claude.com/docs/en/about-claude/pricing
DeepSeek V4 pricing	2026-05-17	2026-05-31	https://api-docs.deepseek.com/quick_start/pricing/

Key takeaways

Manage model spend, response time, caching, routing, fallback, and quality checks together so AI services stay reliable within unit-cost targets.
Use this chapter as a first-pass operating checklist before changing systems, data, permissions, or customer-facing workflows.
Validate platform-specific details against current official docs or internal policy before rollout.

For AI services, cost and latency are two sides of the same operating problem.
Increasing model capability can raise cost; reducing cost can destabilize quality.

Budget Model

\text{Monthly AI Cost} = \text{Requests} \times \text{Unit Cost}

\text{Latency Budget} = T_{retrieve} + T_{infer} + T_{tool} + T_{post}

Optimization Priority

Priority	Lever	Expected Effect
1	Improve cache hit rate	Improve cost and latency together
2	Optimize prompt length	Reduce token cost
3	Model routing	Match cost to task complexity
4	Async tool calls	Improve p95 latency

Pareto Operating View

Optimize for the balance of cost and latency, not a single metric.

\text{Utility} = \alpha \cdot \text{Quality} - \beta \cdot \text{Cost} - \gamma \cdot \text{Latency}

Growth stage: raise α and prioritize quality.
Profitability stage: raise β and tighten cost controls.
Strict SLA stage: raise γ and prioritize latency.

Policy Example

routing_policy:
  - if: complexity <= 2
    model: 'cost_optimized'
  - if: complexity >= 4
    model: 'quality_optimized'

timeout_policy:
  tool_timeout_ms: 2500
  global_timeout_ms: 7000

2026 Model Pricing Baseline

The table below uses official pricing pages checked on 2026-05-17. Model prices change often, so store the baseline date with every release gate and budget calculation.

Model	Input (/1M)	Cached Input (/1M)	Output (/1M)	Notes
GPT-5.5	$5.00	$0.50	$30.00	OpenAI flagship
GPT-5.4	$2.50	$0.25	$15.00	Coding and professional work
GPT-5.4 mini	$0.75	$0.075	$4.50	Lightweight coding/computer-use/subagent
Claude Opus 4.7/4.6/4.5	$5.00	$0.50	$25.00	Consider tokenizer effects for Opus 4.7
Claude Sonnet 4.6/4.5	$3.00	$0.30	$15.00	Sonnet 4 is deprecated
Claude Haiku 4.5	$1.00	$0.10	$5.00	Fast lightweight model
DeepSeek V4 Flash	$0.14 cache miss	$0.0028	$0.28	1M context, thinking/non-thinking
DeepSeek V4 Pro	$0.435 cache miss	$0.003625	$0.87	75% discount requires recheck by 2026-05-31 15:59 UTC

Cost Strategy

2026 Cost Optimization Levers

Prompt Caching

Provider	Method	Cached Read Cost	Effect
OpenAI	Automatic 1,024+ token prefix caching, `prompt_cache_key`, 24h retention on some models	Up to 90% lower than standard input	Can improve latency by up to 80%
Anthropic	Automatic/explicit cache breakpoints, 5-minute/1-hour write, cache hit/refresh	10% of base input	Tool/system/message hierarchy changes can invalidate cache
DeepSeek	Context caching	Model-specific cache-hit price	V4 Flash cache hit is $0.0028/M

Batch API

Anthropic and OpenAI both provide 50% batch discounts. Use batch for evaluation, classification, embeddings, and large replays that do not need immediate responses.

Tool/Runtime Costs

Cost Item	Management Standard
Web search	Track call count and result tokens separately
Code/container execution	Track container session time, file preload, stdout/stderr size
MCP/tool schema	Track tool definition tokens and schema cache hit rate
Voice/realtime	Separate audio tokens, first-audio latency, and interruption retry cost

Cost Ledger Example

cost_ledger:
  run_id: run_20260517_001
  model:
    input_tokens: 3840
    cached_input_tokens: 2560
    output_tokens: 620
    unit_cost_usd: 0.00418
  tools:
    web_search_calls: 1
    container_minutes: 3
    mcp_schema_tokens: 1800
  business:
    tenant_id: acme-enterprise
    task_success: true
    cost_per_successful_task_usd: 0.014

Model Routing Services

Service	Method
Martian	Real-time model routing per prompt
Not Diamond	Prompt transformation plus model selection
Unify AI	Quality/cost/speed optimized routing
OpenRouter	Multi-provider marketplace with caching support

Executive KPIs

Gross Margin with AI Cost
p95 Latency by Top Revenue Flows
Cost per Successful Task

Practice Tip

Baseline and Sources

Item	Baseline Date	Recheck By	Primary Source
OpenAI model/tool pricing	2026-05-17	2026-06-16	https://openai.com/api/pricing/
Claude model/tool pricing	2026-05-17	2026-06-16	https://platform.claude.com/docs/en/about-claude/pricing
DeepSeek V4 pricing	2026-05-17	2026-05-31	https://api-docs.deepseek.com/quick_start/pricing/

Ch6. Cost and Latency Optimization

Budget Model

Optimization Priority

Pareto Operating View

Policy Example

2026 Model Pricing Baseline

2026 Cost Optimization Levers

Prompt Caching

Batch API

Tool/Runtime Costs

Cost Ledger Example

Model Routing Services

Executive KPIs

Baseline and Sources

On This Page

Ch6. Cost and Latency Optimization

Budget Model

Optimization Priority

Pareto Operating View

Policy Example

2026 Model Pricing Baseline

2026 Cost Optimization Levers

Prompt Caching

Batch API

Tool/Runtime Costs

Cost Ledger Example

Model Routing Services

Executive KPIs

Baseline and Sources

On This Page