Verification Archive

Historical verification records and correction history for LLMOps and AgentOps in Production

Key takeaways

This page is a historical archive of past verification rounds; use the current Verification Report as the operating baseline.
The 2nd verification (2026-03-13) confirmed tool versions such as Langfuse v4.0.0, Arize Phoenix v13.0.3, DeepEval v3.8.9, and the Lakera-to-Check Point acquisition.
Several 3rd-verification (2026-03-26) items were later superseded in the 4th verification (2026-05-17), including GPT-5.4 nano pricing, DeepSeek V3.2 pricing, and A2A versioning.
The generalized "LLM API price decline" claim was removed and replaced with provider-specific pricing, caching, and batch conditions.

Archive Notice

This page contains historical verification records. Use Verification Report as the current operating baseline.

2nd Verification (2026-03-13)

Tool and Framework Version Checks

Item	Verification	Result
Langfuse v4.0.0	2026-03-10 release, MIT open source verified	Pass
Arize Phoenix v13.0.3	2026-02-14 release, CLI v0.1.0+ verified	Pass
DeepEval v3.8.9	2026-03-05 release, 13K+ GitHub stars verified	Pass
RAGAS v0.4.3	2026-01-13 release, PyPI verified	Pass
Inspect AI v0.3.186	2026-03-03 release, UK AISI verified	Pass
NeMo Guardrails v0.20.0	OTel migration verified	Pass
MCP spec 2025-11-25	Reworked around authorization/security requirements during 2026-05-17 verification	Expanded
A2A v0.3.0	Corrected to latest v1.0.0 during 2026-05-17 verification	Superseded

Cost Optimization Data Checks

Item	Verification	Result
Anthropic prompt caching	Reworked around model-specific caching multipliers during 2026-05-17 verification	Expanded
OpenAI prompt caching	Reworked around model-specific cached input pricing during 2026-05-17 verification	Expanded
Lakera to Check Point acquisition	2025-09 acquisition complete, approximately $300M	Pass

2nd Verification External Sources

Source	Checked Area	Status
Langfuse Changelog	v4.0.0 release	200
Arize Phoenix GitHub Releases	v13.0.3 release	200
DeepEval GitHub	v3.8.9, evaluation metrics	200
OpenTelemetry GenAI Docs	Semantic Conventions experimental (2nd verification baseline)	200
Anthropic API Docs (Prompt Caching)	Caching pricing policy	200
Check Point acquisition release	Lakera Guard acquisition	200

3rd Verification (2026-03-26)

2026-05-17 Correction

Model pricing, DeepSeek model naming, A2A versioning, and some vendor links from the 3rd verification were replaced with current baselines during the 4th verification. This section remains as history only.

New Content Checks

Item	Verification	Result
OWASP AOS	Three axes verified: Instrumentable, Traceable, Inspectable	Pass
LangSmith Fleet rebrand	Agent Builder to LangSmith Fleet and four new capabilities verified	Pass
Braintrust Loop AI	Natural-language scorer generation, four SDK additions, OTel native support verified	Pass
GPT-5.4 nano pricing	Corrected to GPT-5.4 mini pricing table during 2026-05-17 verification	Superseded
DeepSeek V3.2 pricing	Replaced with DeepSeek V4 Flash/Pro pricing during 2026-05-17 verification	Superseded
Anthropic 1M surcharge removal	Reworked into official Claude 4.x model pricing during 2026-05-17 verification	Superseded
LLM API price decline claim	Removed generalized decline-rate language and replaced with provider-specific pricing, caching, and batch conditions	Corrected
LiveCodeBench/AIME 2026	Benchmark existence and usage verified	Pass
TAU-bench Retail/JBDistill	Agent and safety benchmarks verified	Pass
PagerDuty AI agentic operations	Agentic cloud operations model and automated recovery capability verified	Pass

3rd Verification External Sources

Source	Checked Area	Status
OWASP official project	Agent Observability Standard	200
LangChain blog	LangSmith Fleet rebrand announcement	200
Braintrust docs	Loop AI, new SDKs	200
Anthropic pricing page	Claude 4.x model pricing	Checked
OpenAI pricing page	GPT-5.4 mini/GPT-5.4/GPT-5.5 pricing	Checked
DeepSeek API Docs	V4 Flash/Pro pricing	Checked
PagerDuty blog	Agentic Cloud Operations	200
