Verification Archive
Historical verification records and correction history for LLMOps and AgentOps in Production
Key takeaways
- This page is a historical archive of past verification rounds; use the current Verification Report as the operating baseline.
- The 2nd verification (2026-03-13) confirmed tool versions, including Langfuse v4.0.0, Arize Phoenix v13.0.3, and DeepEval v3.8.9, as well as Check Point's acquisition of Lakera.
- Several 3rd-verification (2026-03-26) items were later superseded in the 4th verification, including GPT-5.4 nano pricing, DeepSeek V3.2 pricing, and A2A versioning.
- The generalized "LLM API price decline" claim was removed and replaced with provider-specific pricing, caching, and batch conditions.
Archive Notice
This page contains historical verification records. Use the Verification Report as the current operating baseline.
2nd Verification (2026-03-13)
Tool and Framework Version Checks
| Item | Verification | Result |
|---|---|---|
| Langfuse v4.0.0 | 2026-03-10 release, MIT open source verified | Pass |
| Arize Phoenix v13.0.3 | 2026-02-14 release, CLI v0.1.0+ verified | Pass |
| DeepEval v3.8.9 | 2026-03-05 release, 13K+ GitHub stars verified | Pass |
| RAGAS v0.4.3 | 2026-01-13 release, PyPI verified | Pass |
| Inspect AI v0.3.186 | 2026-03-03 release, UK AISI verified | Pass |
| NeMo Guardrails v0.20.0 | OTel migration verified | Pass |
| MCP spec 2025-11-25 | Reworked around authorization/security requirements during 2026-05-17 verification | Expanded |
| A2A v0.3.0 | Corrected to latest v1.0.0 during 2026-05-17 verification | Superseded |
Cost Optimization Data Checks
| Item | Verification | Result |
|---|---|---|
| Anthropic prompt caching | Reworked around model-specific caching multipliers during 2026-05-17 verification | Expanded |
| OpenAI prompt caching | Reworked around model-specific cached input pricing during 2026-05-17 verification | Expanded |
| Lakera to Check Point acquisition | 2025-09 acquisition complete, approximately $300M | Pass |
2nd Verification External Sources
| Source | Checked Area | Status |
|---|---|---|
| Langfuse Changelog | v4.0.0 release | 200 |
| Arize Phoenix GitHub Releases | v13.0.3 release | 200 |
| DeepEval GitHub | v3.8.9, evaluation metrics | 200 |
| OpenTelemetry GenAI Docs | Semantic Conventions at experimental status (2nd verification baseline) | 200 |
| Anthropic API Docs (Prompt Caching) | Caching pricing policy | 200 |
| Check Point acquisition release | Lakera Guard acquisition | 200 |
3rd Verification (2026-03-26)
2026-05-17 Correction
Model pricing, DeepSeek model naming, A2A versioning, and some vendor links from the 3rd verification were replaced with current baselines during the 4th verification (2026-05-17). This section is retained as history only.
New Content Checks
| Item | Verification | Result |
|---|---|---|
| OWASP AOS | Three axes verified: Instrumentable, Traceable, Inspectable | Pass |
| LangSmith Fleet rebrand | Agent Builder to LangSmith Fleet and four new capabilities verified | Pass |
| Braintrust Loop AI | Natural-language scorer generation, four SDK additions, OTel native support verified | Pass |
| GPT-5.4 nano pricing | Corrected to GPT-5.4 mini pricing table during 2026-05-17 verification | Superseded |
| DeepSeek V3.2 pricing | Replaced with DeepSeek V4 Flash/Pro pricing during 2026-05-17 verification | Superseded |
| Anthropic 1M surcharge removal | Reworked into official Claude 4.x model pricing during 2026-05-17 verification | Superseded |
| LLM API price decline claim | Removed generalized decline-rate language and replaced with provider-specific pricing, caching, and batch conditions | Corrected |
| LiveCodeBench/AIME 2026 | Benchmark existence and usage verified | Pass |
| TAU-bench Retail/JBDistill | Agent and safety benchmarks verified | Pass |
| PagerDuty AI agentic operations | Agentic cloud operations model and automated recovery capability verified | Pass |
3rd Verification External Sources
| Source | Checked Area | Status |
|---|---|---|
| OWASP official project | Agent Observability Standard | 200 |
| LangChain blog | LangSmith Fleet rebrand announcement | 200 |
| Braintrust docs | Loop AI, new SDKs | 200 |
| Anthropic pricing page | Claude 4.x model pricing | Checked |
| OpenAI pricing page | GPT-5.4 mini/GPT-5.4/GPT-5.5 pricing | Checked |
| DeepSeek API Docs | V4 Flash/Pro pricing | Checked |
| PagerDuty blog | Agentic Cloud Operations | 200 |
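The Status columns in the external source tables above can be reproduced with a small link check. The following is a minimal sketch, assuming that `200` in those tables records an HTTP response code; the archive itself does not state how the status was collected. The function names `http_status` and `status_rows` and the placeholder URLs are hypothetical and used only for illustration.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, List, Tuple

# (source name, checked area, URL) for one row of a source table.
Source = Tuple[str, str, str]


def http_status(url: str, timeout: float = 10.0) -> str:
    """Return the HTTP status code of a HEAD request as a string.

    Assumption: a plain HEAD request is enough to confirm the page exists.
    Some sites reject HEAD, so a real check may need to fall back to GET.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return str(response.status)
    except urllib.error.HTTPError as err:
        # The server answered, but with an error code such as 404.
        return str(err.code)
    except urllib.error.URLError:
        # No HTTP answer at all (DNS failure, refused connection, timeout).
        return "unreachable"


def status_rows(
    sources: Iterable[Source],
    fetch: Callable[[str], object] = http_status,
) -> List[str]:
    """Format each source as a table row: | Source | Checked Area | Status |."""
    return [f"| {name} | {area} | {fetch(url)} |" for name, area, url in sources]


if __name__ == "__main__":
    # Offline demo: a stub fetcher stands in for the network call.
    # The URLs are placeholders, not the real sources checked in this archive.
    stub = {"https://example.invalid/changelog": 200}
    demo = [("Example Changelog", "v1.0.0 release", "https://example.invalid/changelog")]
    for row in status_rows(demo, fetch=stub.get):
        print(row)
```

Passing the fetcher as a parameter keeps the row formatting testable without network access. Rows marked `Checked` in the tables would still need a manual review of the page content, since a status code alone does not confirm pricing figures.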