Observability


Published on May 11, 2026
Hierarchical Clustering of Agent Traces for Discovering Unknown Failure Modes
Clio AI-Agents Observability Agent-Traces Anthropic Distributional Clustering Hierarchical-Clustering k-Means Privacy Telemetry OpenTelemetry
Anthropic's Clio is a privacy-preserving pipeline: it extracts facets from each conversation with Haiku, embeds them with sentence-transformers, clusters the embeddings bottom-up with k-means into a ~10/100/1000 three-level hierarchy, labels each cluster with Sonnet, and enforces minimum unique-account thresholds at every step. The whole 100K-conversation run costs $48.81 and recovers a known taxonomy at 94% accuracy, versus 5% for random guessing. The architecture lifts almost unchanged to agent traces, which is exactly what Distributional has been doing: traces become the unit of analysis, facets become tool-call sequences and failure fingerprints, and clusters surface the lazy-tool-call hallucinations and resource-conservation regressions that pre-defined evals never thought to look for. This post walks through Clio's pipeline stage by stage, maps each stage onto the agent-trace setting, and pins down what the 'analytics' layer above telemetry and monitoring actually buys you.
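As a rough illustration of the clustering and privacy-threshold stages described above, here is a minimal sketch in plain Python. It is not Clio's or Distributional's implementation: the 2-D "embeddings", the account IDs, the function names (`kmeans`, `clusters_meeting_threshold`, `build_hierarchy`) and the two-level depth are all assumptions made for the example, and the LLM steps (facet extraction and cluster labelling) are left out entirely.

```python
import math
from collections import defaultdict

def kmeans(points, k, iters=20):
    """Lloyd's k-means with deterministic farthest-point initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from its nearest existing centroid.
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )

    def nearest(p):
        return min(range(k), key=lambda j: math.dist(p, centroids[j]))

    for _ in range(iters):
        assign = [nearest(p) for p in points]
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return [nearest(p) for p in points], centroids

def clusters_meeting_threshold(assign, account_ids, min_accounts):
    """Return the cluster ids backed by at least `min_accounts` unique accounts."""
    accounts = defaultdict(set)
    for cluster, account in zip(assign, account_ids):
        accounts[cluster].add(account)
    return {c for c, accts in accounts.items() if len(accts) >= min_accounts}

def build_hierarchy(embeddings, account_ids, k_leaf, k_top, min_accounts):
    """Cluster traces into leaves, drop under-threshold leaves, then cluster
    the surviving leaf centroids into parents (bottom-up, two levels here)."""
    leaf_assign, leaf_centroids = kmeans(embeddings, k_leaf)
    kept = sorted(clusters_meeting_threshold(leaf_assign, account_ids, min_accounts))
    top_assign, _ = kmeans([leaf_centroids[c] for c in kept], k_top)
    return leaf_assign, dict(zip(kept, top_assign))

# Toy data (hypothetical): one 2-D vector and one account id per trace.
embeddings = [
    (0, 0), (0, 1), (1, 0),        # group A
    (0, 4), (1, 4), (0, 5),        # group B, close to A
    (20, 20), (21, 20),            # group C, a single account only
    (20, 0), (21, 0), (20, 1),     # group D
]
account_ids = ["a1", "a2", "a3", "a1", "a2", "a4", "a5", "a5", "a2", "a3", "a6"]

leaf_assign, parent = build_hierarchy(
    embeddings, account_ids, k_leaf=4, k_top=2, min_accounts=2
)
for trace, leaf in enumerate(leaf_assign):
    print(trace, leaf, parent.get(leaf, "suppressed"))
```

In this toy run the leaf cluster that only one account contributes to is dropped before the parent level is built, so it never reaches the stage where a label would be written. A real pipeline would presumably use high-dimensional sentence embeddings, a library k-means, and a third level on top.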
Published on December 21, 2025
Automatic Debugging and Failure Detection in AI Agent Systems
AI-Agents LLM Debugging Observability Reliability
A survey of DoVer and related work on failure attribution, intervention-based debugging, and observability tooling for LLM agent systems.
Published on December 10, 2025
Why You Don’t Need AI Agent Evaluations
AI-Agents Evaluation Observability LLMs Startups
A satirical look at why skipping AI agent evaluations makes perfect sense if you don't value maintainability, customers, or long-term sanity.
