Hierarchical Clustering of Agent Traces for Discovering Unknown Failure Modes
Three things landed in late 2024 / early 2025 that, taken together, define what an analytics layer for AI systems should look like:
- Anthropic's Clio paper (Tamkin et al., 2024, arXiv:2412.13678) — a working, deployed pipeline that takes a million Claude.ai conversations and surfaces aggregate usage patterns and abuse vectors without anyone reading the raw text, by chaining facet extraction with hierarchical clustering and LLM cluster labelling, governed by minimum-account-size privacy thresholds at every step.
- The shift in production observability from known-known monitoring to unknown-unknown analytics, articulated cleanly by Scott Clark at Distributional in his recent TWIML episode: telemetry → monitoring → analytics is "Maslow's hierarchy of observability", and the top layer is unsupervised — it tells you what to look for, not how much of what-you-already-look-for there is.
- A growing pile of agent traces in everyone's S3 buckets — multi-turn, multi-tool, recursive, sub-agented — that neither of the layers below can read meaningfully at scale.
The thesis of this post: Clio's pipeline is the right starting recipe for the analytics layer over agent traces, and the architecture lifts almost unchanged from conversations to trajectories — provided you redefine the unit, swap the facets, and keep the privacy thresholds. Distributional has been doing this in production for paying enterprise customers, and Clark cites Clio by name as the topic-modelling backbone for that work, which is the strongest validation the recipe is going to get from anyone other than Anthropic themselves.
I'll walk Clio's pipeline stage by stage as Anthropic designed it, then map each stage onto agent traces — what changes, what stays, what the new failure modes are, and which numbers from the paper are the load-bearing ones for arguing this is economically viable on a real production trace volume.
The observability hierarchy and why the top floor is empty
Clark's framing is worth pinning down before anything else, because it makes precise what Clio is and what monitoring vendors like Datadog, Braintrust, Phoenix, or Langfuse are not:
- Telemetry. Logs. OpenTelemetry spans. The GenAI semantic convention. "You need to be able to see what's happening." Even just `aws s3 cp` of trace JSON blobs to a bucket where you can `head` and `tail` them is telemetry — adequate for debugging, inadequate for everything else.
- Monitoring. Pre-specified time-series of already-known signals. P95 latency. Token count per request. Tool-call count distribution. Profanity rate. Refusal rate. Whatever you bolted onto your dashboard last Thursday because the on-call engineer got paged. Each individual signal is something a human pre-decided to track.
- Analytics. Finding what you didn't already know to look for. Sub-distributions inside the long tail. Patterns of behavior that cluster together across a small percentage of traces and whose existence is itself the insight, not the magnitude.
Clark's example is the canonical one for agentic systems: you tell the model in the system prompt to "be efficient with tool calls" because your token bill spiked. Monitoring shows token cost is down 20%, refusal rate is steady, your evals are green. Analytics surfaces a 5% sub-cluster of sessions where the agent silently hallucinates tool calls — it claims in the reasoning step that it called get_stock_price, the chain-of-thought looks plausible, the final answer is fluent, but the actual tool was never invoked. The price came down because the model started lying about its work. No monitoring signal you had set up before this regression existed would have caught it, because nobody thought to define "fraction of claimed tool calls that match an actual tool call" until it was too late.
The reason analytics is empty in most production stacks is structural, not because nobody wants it. The data is unstructured — nested JSON traces with arbitrary tool inputs and outputs, intermediate reasoning text, sub-agent fan-out. Classical analytics tools (Mixpanel, Statsig, even pandas-in-a-notebook) want columnar event streams; they don't have anything to say about a 40-tool-call trajectory in which the third tool returned a 429 and the agent decided to retry six times before silently substituting a memorised answer. Clark is blunt about this: "the LLM problems require these LLM solutions." Topic modelling on raw trace text, by itself, is not the solution — Clio shows what the full solution looks like.
Clio: what each stage actually does
Clio (Claude insights and observations) was deployed against approximately one million Claude.ai Free and Pro conversations as the headline analysis in the paper. It produced a three-level hierarchy of roughly ten top-level categories → ~100 mid-level → ~1,000 leaf clusters, surfaced cluster-level concern scores correlated at Pearson r = 0.71 with Anthropic's existing trust-and-safety classifiers, and identified abuse patterns (SEO keyword-generation spam networks, API resale accounts, translation-based safety-policy evasion, automated explicit-content pipelines) that pre-existing classifier-based moderation had missed. The 100,000-conversation pilot run cost $48.81 end-to-end, or roughly $0.0005 per conversation — and that number is what makes the rest of this post possible to argue.
The pipeline has five stages. Each is worth walking through in its own right before the agent-trace mapping.
Stage 1 — Per-conversation facet extraction
A facet is a single attribute of a conversation, extracted by an LLM (or computed directly, for things like turn count). Clio extracts a fixed schema of facets per conversation. The paper names these explicitly:
- `Summary` — a sanitised paraphrase of what the conversation was about, used as the primary input to the embedding stage downstream. The model is instructed to omit private information at this point.
- `Topic` — high-level subject matter.
- `Request/Task` — phrased in the prompt as "What task is the AI assistant in the conversation asked to perform?"
- `Language` — natural language of the conversation.
- `Number of conversational turns` — directly computed, not LLM-extracted.
- `Concerning behavior` — a 1–5 score from "least to most likely to exhibit concerning behavior", LLM-judged.
The model assigned to facet extraction is Claude 3 Haiku. This is the design choice doing most of the economic heavy lifting: the cheapest available model handles the per-conversation step where volume is highest. Sonnet only enters the pipeline at the cluster-labelling stage, where it sees aggregated cluster summaries — at most a few thousand of them — rather than the raw N=1,000,000 conversation stream.
The crucial property of the facet schema is that it decouples the clustering axis from the data. The same pipeline can be run along any single facet (cluster by Task, cluster by Concern, cluster by Language × Task) — you don't re-embed the raw conversations every time. Once the per-conversation facet vector is extracted and stored, exploration is essentially free.
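For concreteness, here's roughly what the Stage 1 step looks like as code — a minimal sketch assuming the Anthropic Python SDK, with a paraphrased prompt (the paper's exact prompts live in its appendices) and a placeholder turn-count heuristic:

```python
# Minimal Stage 1 sketch, assuming the Anthropic Python SDK. The prompt is a
# paraphrase of the paper's facet-extraction instructions, not the exact text.
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

FACET_PROMPT = """Read the conversation below and answer in bare JSON with keys:
- "summary": one sentence on what the conversation was about, omitting any
  private or identifying information
- "topic": high-level subject matter
- "task": what task the AI assistant is asked to perform
- "language": the conversation's natural language
- "concerning_score": integer 1-5, least to most likely to exhibit
  concerning behavior

Conversation:
{conversation}"""

def extract_facets(conversation_text: str) -> dict:
    """One cheap-model call per conversation — the pipeline's high-volume stage."""
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # cheapest tier, per Table 3
        max_tokens=512,
        messages=[{"role": "user",
                   "content": FACET_PROMPT.format(conversation=conversation_text)}],
    )
    facets = json.loads(response.content[0].text)  # assumes bare-JSON compliance
    # Directly computed facets never touch the model.
    facets["turn_count"] = conversation_text.count("Human:")  # placeholder heuristic
    return facets
```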
Stage 2 — Embedding and base-level k-means
Each facet (or facet combination) is embedded with all-mpnet-base-v2, a sentence-transformers model originally from Reimers & Gurevych and trained on >1B sentence pairs. The choice of an off-the-shelf sentence-transformer here rather than a frontier embedding model is also deliberate — the goal at this stage is geometry that supports clustering, not retrieval-grade nearest-neighbour search, and all-mpnet-base-v2 is fast, cheap, and well-understood.
The embeddings feed k-means with k "adjusted based on the size of the dataset" — the paper notes k "can be quite large, including many thousands of clusters", which is consistent with the final ~1,000 leaf clusters in the three-level hierarchy. This is the deliberately lossy stage: the base-level clustering is meant to be over-segmented so that downstream hierarchy construction has room to merge.
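Stage 2 glues together in a few lines. A sketch, assuming the Stage 1 facet strings are already in hand; the k value is illustrative:

```python
# Minimal Stage 2 sketch: embed one facet's strings, then over-segmented
# k-means. The embedding model matches the paper; k is a tunable.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import MiniBatchKMeans

def base_level_clusters(facet_strings: list[str], k: int) -> np.ndarray:
    model = SentenceTransformer("all-mpnet-base-v2")
    embeddings = model.encode(facet_strings, normalize_embeddings=True)
    # Deliberately large k: the hierarchy stage merges, so over-segmentation
    # at the base level is a feature, not a bug.
    km = MiniBatchKMeans(n_clusters=k, random_state=0, n_init="auto")
    return km.fit_predict(embeddings)

# usage: labels = base_level_clusters(stage1_summaries, k=1000)
```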
Stage 3 — LLM cluster labelling
For each base-level cluster, Claude 3.5 Sonnet reads a sample of conversation summaries belonging to that cluster — not the raw conversations, only the already-sanitised summaries from Stage 1 — and produces:
- A descriptive title (a few words).
- A summary (a sentence or two describing what conversations in this cluster have in common).
The labelling prompt explicitly instructs the model to not include private information in the cluster description, even when the summaries it's reading might gesture at something specific. This is the second of four privacy layers — Stage 1 was the first.
The labelling task is a strict generalisation of summarisation: "given these N already-summarised conversation snippets, produce a title-and-description for what they collectively are about." It's well-suited to a mid-tier model because it's exactly the kind of pattern abstraction frontier-class LLMs are now reliable at, and because the per-cluster volume is small (Sonnet sees a few thousand cluster-labelling calls across the whole run, not millions).
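A sketch of the labelling call, with the prompt wording paraphrased rather than lifted from the paper:

```python
# Sketch of Stage 3: label one base cluster from a sample of its Stage 1
# summaries. Sample size and prompt phrasing are assumptions.
import random
import anthropic

client = anthropic.Anthropic()

def label_cluster(cluster_summaries: list[str], sample_size: int = 50) -> str:
    sample = random.sample(cluster_summaries,
                           min(sample_size, len(cluster_summaries)))
    prompt = (
        "Here are summaries of conversations that were clustered together:\n\n"
        + "\n".join(f"- {s}" for s in sample)
        + "\n\nWrite a short descriptive title and a one-to-two sentence summary "
          "of what these conversations have in common. Do NOT include any "
          "private or identifying information."
    )
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # mid-tier model; call volume is small
        max_tokens=300,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text
```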
Stage 4 — Bottom-up hierarchical clustering
This is where the architecture earns its keep. The paper describes "a bottom-up hierarchy of clusters with three levels", combining k-means and prompting. The flow:
- Embed the cluster labels themselves (titles + descriptions from Stage 3) using the same `all-mpnet-base-v2`.
- Run k-means at a coarser k on that embedding space, producing parent clusters.
- For each parent cluster, give Sonnet the child cluster labels and ask it to generate a label for the merged group — same prompt template as Stage 3, but now operating on cluster-level rather than conversation-level inputs.
- Repeat once more to get the top level.
Net result: ~10 top-level categories, ~100 mid-level, ~1,000 leaf, in a navigable tree. The shape of the hierarchy is emergent — the paper does not hand-design a taxonomy and force conversations into it.
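The loop itself is short. A sketch reusing `base_level_clusters` and `label_cluster` from the earlier sketches:

```python
# Sketch of Stage 4: one hierarchy level per call. Assumes the
# base_level_clusters() and label_cluster() sketches from above.
from collections import defaultdict

def build_level(child_labels: list[str], coarser_k: int) -> list[str]:
    """Embed child cluster labels, re-cluster coarser, re-label each parent."""
    assignments = base_level_clusters(child_labels, k=coarser_k)
    children_of = defaultdict(list)
    for label, parent in zip(child_labels, assignments):
        children_of[parent].append(label)
    # Same labelling prompt as Stage 3, applied to cluster labels rather
    # than conversation summaries.
    return [label_cluster(children) for children in children_of.values()]

# usage, given leaf_labels from Stage 3:
#   mid_labels = build_level(leaf_labels, coarser_k=100)  # ~100 mid-level
#   top_labels = build_level(mid_labels, coarser_k=10)    # ~10 top-level
```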
There are two things to notice about this stage that don't always come across on a first read:
- The hierarchy is bottom-up, not top-down. A top-down approach (start from a fixed taxonomy, place each conversation in the matching node) is what every prior moderation pipeline at scale has done, and it's exactly what locks you into your existing assumptions. Bottom-up means the taxonomy is discovered; it can contain a node for "users asking the model to translate explicit content into a non-English language as a jailbreak vector" before any analyst at Anthropic had ever heard of that pattern.
- The merging step is LLM-driven, not purely embedding-distance-driven. Sonnet looks at a cluster of cluster labels and writes a new label that abstracts over them, rather than picking the medoid. This is what makes the final taxonomy readable — leaf names like "Helping create a wedding speech" sit cleanly under mid-level names like "Personal life event assistance" under top-level "Personal & lifestyle assistance". Without LLM-driven naming the hierarchy would be a forest of "Cluster 472 / Sub-cluster 13 / Sub-sub-cluster 8" labels and the analyst utility would collapse.
Stage 5 — Privacy auditing and interactive UI
After the hierarchy is built, a final model pass reads cluster summaries and removes all clusters with any private information — the fourth explicit privacy layer in the paper, on top of (a) sanitised-summary extraction in Stage 1, (b) "do not include private info" prompting in Stages 3–4, and (c) minimum-cluster-size thresholds.
The minimum-cluster-size threshold is structural and worth pinning down: "Clusters are only retained if they exceed minimum size requirements for both unique accounts and conversations." A cluster of 200 conversations from one chatty user doesn't survive; a cluster of 200 conversations across 200 users does. The paper enforces this over both axes to defeat the obvious failure mode where a single high-volume actor distorts the analytic surface. Exact thresholds are in Appendix G.1.
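Mechanically, the rule is a two-axis group-by. A sketch assuming per-conversation rows in a pandas DataFrame; the thresholds shown are placeholders, not the paper's Appendix G.1 values:

```python
# Sketch of the two-axis minimum-size rule. Threshold values are placeholders.
import pandas as pd

MIN_UNIQUE_ACCOUNTS = 25   # placeholder, not the paper's number
MIN_CONVERSATIONS = 50     # placeholder, not the paper's number

def retained_clusters(df: pd.DataFrame) -> set[int]:
    """df has one row per conversation, with cluster_id and account_id columns."""
    stats = df.groupby("cluster_id").agg(
        n_conversations=("account_id", "size"),
        n_accounts=("account_id", "nunique"),
    )
    keep = stats[(stats.n_accounts >= MIN_UNIQUE_ACCOUNTS)
                 & (stats.n_conversations >= MIN_CONVERSATIONS)]
    return set(keep.index)
```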
The visualisation is a 2D map plus tree explorer; for the purposes of this post the UI doesn't matter, but the fact that there's a UI does — analytics is interactive by nature, in a way that monitoring is not.
Why each stage uses the model it uses
Putting Table 3 from the paper in one place, because the model assignments are the single most copy-able design choice:
| Stage | Model | Why |
|---|---|---|
| Facet extraction | Claude 3 Haiku | Per-conversation volume is enormous; this step has to be the cheapest the pipeline has. |
| Cluster labelling | Claude 3.5 Sonnet | Operates on already-sanitised summaries, ≤ a few thousand calls; abstraction quality matters more than cost. |
| Hierarchy generation | Claude 3.5 Sonnet | Same model, even smaller input volume (cluster labels only). |
| Concern scoring (safety) | Claude 3.5 Sonnet | Optional auxiliary pass for the abuse-detection use case. |
| Embedding | all-mpnet-base-v2 | Off-the-shelf sentence-transformer; cheap, fast, well-calibrated for short-text geometry. |
The 100,000-conversation cost decomposition (also Table 3):
- Facet extraction (Haiku): 130M input + 10M output tokens → $45.00
- Cluster labelling (3.5 Sonnet): 1M input + 50K output → $3.75
- Hierarchy generation (3.5 Sonnet): 18K input + 600 output → $0.06
- Total: $48.81 for 100K conversations.
More than 90 cents of every dollar goes to the Haiku per-conversation stage; the Sonnet stages are rounding error. This is the load-bearing economic claim: the per-unit cost of analytics is ~$0.0005, which means you can afford to run this on everything in a production agent's trace store, not on a sample.
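The table's totals reproduce directly from its token counts and the published 2024 per-million-token prices (Haiku 3 at $0.25 in / $1.25 out, Sonnet 3.5 at $3 in / $15 out):

```python
# Reproducing Table 3's totals from its token counts and published 2024 prices.
def stage_cost(mtok_in: float, mtok_out: float, p_in: float, p_out: float) -> float:
    return mtok_in * p_in + mtok_out * p_out  # token counts in millions

facet = stage_cost(130, 10, 0.25, 1.25)             # $45.00
labelling = stage_cost(1, 0.05, 3.00, 15.00)        # $3.75
hierarchy = stage_cost(0.018, 0.0006, 3.00, 15.00)  # ~$0.06
print(f"${facet + labelling + hierarchy:.2f}")      # $48.81
```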
Validation: synthetic reconstruction
The clean validation experiment in the paper is the one worth stealing wholesale for any application of this pipeline: synthesise 19,476 chat transcripts from a known hierarchy of categories, run them through Clio, ask whether Clio recovers the ground-truth hierarchy. The answer: 94% accuracy versus 5% for random guessing. Figure 4 in the paper.
The reason this validation matters more than benchmarking on real conversations is that for real conversations there is no ground truth — you've discovered some clusters, but you don't know what you failed to discover. With synthetic data you generate the targets, and the recovery rate quantifies the pipeline's blind spots. The 5% random-guessing baseline is the right comparison because category-recovery from raw text is an information-retrieval task with a non-trivial floor; 94% is well into the regime where the pipeline is doing real work.
This experimental pattern — generate-test-traces-with-known-failure-modes, run analytics, measure recovery — is directly portable to agent traces and is the validation discipline that any production analytics layer over traces needs to import from Clio rather than skip.
Mapping the pipeline onto agent traces: what changes
A conversation in Clio's setup is a few-turn exchange between user and assistant in natural language. An agent trace is structurally a richer object:
- Multi-turn, often with tens of turns rather than a handful.
- Multi-tool — every turn can include one or more tool calls with their own inputs, outputs, errors.
- Recursive — sub-agent fan-out, parent-child span relationships.
- Heterogeneous content — natural-language reasoning, code, JSON payloads, retry loops, branching decisions.
- Often standardised on OpenTelemetry with the GenAI semantic convention, so the JSON has a known schema.
The good news is that every stage of Clio applies, with substitutions. The substitutions are where the design work is.
Unit of analysis: trace, not conversation
The first design choice is what to call a unit. The candidates are:
- The whole user session (everything in one logical conversation, across multiple agent invocations). Closest analogue to Clio's conversation. Right unit when the question is "what are users trying to do."
- A single agent trajectory (one top-level agent invocation, from the user prompt through to the final answer, including all sub-agent calls and tool calls). Right unit when the question is "how does the agent behave on a task."
- A single span (one tool call, or one sub-agent invocation). Too fine — clusters of individual spans are dominated by tool identity and miss the trajectory shape, which is where most of the interesting failure modes live.
Distributional's choice — implicit in Clark's framing — is the trajectory: clusters represent fingerprints of agent behaviour, not user intent. The user-intent question is still served by clustering at the session level, but that's a smaller fraction of the analytics value when the agent is the artefact you're trying to improve.
For the rest of this post the unit is "trace" = a single top-level agent trajectory.
Facet schema: replace Clio's conversation facets with trace facets
The Clio facet menu was: Summary, Topic, Request/Task, Language, Turn count, Concerning behavior. For agent traces, that menu is incomplete in two directions — there's structured behavioural signal (tool sequences, error patterns) that doesn't exist in a chat conversation, and there's outcome signal (task completion, latency, cost) that's specific to agentic systems.
A reasonable trace-facet schema:
```yaml
facets:
  # Direct (no LLM) — computed from the OpenTelemetry trace itself
  - turn_count
  - tool_call_count
  - distinct_tools_invoked
  - max_recursion_depth
  - total_latency_ms
  - total_tokens
  - tool_error_count
  - retry_count
  # LLM-extracted via Haiku (or equivalent) — one pass per trace
  - task_summary            # what the user asked the agent to do
  - tool_call_sequence      # ordered, deduplicated string of tool names
  - failure_mode            # null if successful; else a short description
  - reasoning_pattern       # e.g. "linear plan-execute", "ReAct loop with revision"
  - claimed_vs_actual_tools # the lazy-tool-call check from Distributional
  - resource_pattern        # "aggressively cached", "fan-out heavy", etc.
  - completion_status       # full, partial, abandoned, hallucinated-completion
  - concerning_behavior     # 1-5 score, parallel to Clio's
```
A few of these deserve specific attention because they're the facets where the analytics surface is most likely to pay back its cost:
- `task_summary` is the closest analogue to Clio's `Summary` and serves the same role — it's the primary clustering axis for "what kind of work is the agent doing", and embedding-then-k-means on this facet recovers an agent-task taxonomy directly.
- `tool_call_sequence` is novel to the agent setting and is probably the highest-signal facet for behaviour-clustering. A trace where the agent did `[search, read, search, prune, search, read, answer]` lives in a different cluster from one that did `[search, answer]` even when the task summary is identical, and that cluster boundary is exactly where the interesting questions ("is the agent over-searching on this class of task?") live.
- `failure_mode` is the facet where the unknown-unknowns argument is strongest. A pre-defined eval can't catch a failure mode it didn't anticipate; clustering on natural-language failure descriptions from a Haiku pass can.
- `claimed_vs_actual_tools` is the Distributional example operationalised — a small per-trace check ("did the chain-of-thought claim a tool call that did not appear in the actual span tree?") that becomes a clustering axis revealing the silent-hallucination sub-population; a sketch of that check follows this list.
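The claimed-vs-actual check is small enough to sketch in full. The span-field names below are illustrative rather than a fixed OTel schema, and the regex claim-detector is a stand-in for what would more plausibly be an LLM pass:

```python
# Sketch of the claimed_vs_actual_tools facet: compare tool names the
# reasoning text claims against tools actually present in the span tree.
import re

def claimed_vs_actual(reasoning_text: str, spans: list[dict],
                      known_tools: set[str]) -> list[str]:
    """Return tool names claimed in reasoning but never actually invoked."""
    actual = {s["tool_name"] for s in spans if s.get("kind") == "tool_call"}
    # Naive claim detector: a known tool name preceded by a verb like
    # "called"/"using". A production version would use a cheap LLM pass.
    claimed = {
        t for t in known_tools
        if re.search(rf"(call(ed|ing)?|using|invoked?)\s+`?{re.escape(t)}`?",
                     reasoning_text)
    }
    return sorted(claimed - actual)
```

A non-empty return value on a trace is the raw signal; the fraction of traces with a non-empty return is the one-line monitor this facet eventually becomes once the pattern is confirmed.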
The same decoupling property that Clio relies on holds: once these facets are extracted and stored, you can cluster along any subset without re-touching the raw traces. That's the analytics version of a star schema.
Embedding and clustering stay the same
all-mpnet-base-v2 works on traces too. There's a temptation to reach for a fancier embedding model — Voyage, Cohere Embed v4, OpenAI text-embedding-3-large — but the Clio paper's choice is correct: the geometry needs to support k-means and human-interpretable clustering, not retrieval ranking, and a 768-dim sentence-transformer at zero marginal cost is the right tool for the job.
The one wrinkle is that some facets are not natural-language (numerical, categorical, ordered sequences). The standard fix is to render them as natural language before embedding — "7 tool calls; 1 retry; 0 errors; sequence: search-read-search-answer" is what gets embedded, not the raw vector. The embedding model just reads a short templated string, which sentence-transformers handle fine (sketched below). Concatenating with the `task_summary` string in the same embedding gives a useful joint geometry: similar tasks executed with similar tool patterns end up nearby.
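A sketch of that rendering step, using the facet names from the schema above:

```python
# Render structured facets into the templated string that actually gets
# embedded. Field names follow the trace-facet schema sketched earlier.
def facet_text(facets: dict) -> str:
    rendered = (
        f"{facets['tool_call_count']} tool calls; "
        f"{facets['retry_count']} retries; "
        f"{facets['tool_error_count']} errors; "
        f"sequence: {'-'.join(facets['tool_call_sequence'])}"
    )
    # Concatenate with the task summary for a joint task-by-behaviour geometry.
    return f"{facets['task_summary']} | {rendered}"
```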
For very large trace volumes (hundreds of millions per month is now realistic at enterprise scale), the base-level k clusters should be in the tens of thousands rather than thousands. This is just a scaling factor; nothing else changes.
Hierarchy generation stays bottom-up
The bottom-up logic transfers directly. After base-level k-means on, say, 50,000 leaf clusters, you run Sonnet (or whatever mid-tier model is current) to label each one, embed the labels, re-cluster at coarser granularity, label again, repeat. A three-level hierarchy of ~10 top-level → ~300 mid-level → ~50,000 leaf clusters is a reasonable starting target for a heavy trace volume.
The names you get out are different in character from Clio's. Where Clio produces leaf labels like "Helping users plan trips to Spain", a trace-Clio produces leaf labels like:
- "Multi-tool research traces where the agent re-runs the same search query three times after a 429 instead of caching the result"
- "Sub-agent delegation traces in which the parent agent claims a tool call in its reasoning that was never made by either the parent or any child"
- "File-editing traces where the agent reaches the
applied changesstep but the final tool call returns an error that is silently swallowed"
These are not eval failures with a human-pre-defined label. They are emergent named patterns, and the act of naming them — done by Sonnet at the hierarchy-merge step — is what makes them actionable downstream (you can now write a guardrail or a unit test against the named pattern).
Privacy thresholds become tenant thresholds
Clio's defence-in-depth privacy stack:
- Sanitised summaries at extraction time.
- Minimum cluster size over both unique accounts AND unique conversations.
- "No private info" prompt instruction at cluster labelling.
- Final auditor pass that deletes any cluster containing private info.
For agent traces in a B2B setting, the relevant privacy boundary is usually the tenant (your enterprise customer), not the individual user — your contract with Acme Corp says you won't reveal patterns specific to Acme to other customers, or in some configurations to your own product team without permission. The mapping is direct:
- Per-trace facet extraction is instructed to summarise the behavioural pattern, not the customer-specific data (no proper nouns, no PII, no proprietary numbers).
- Minimum cluster size is enforced over both unique tenants AND unique traces — a cluster has to contain traces from N distinct customers before it surfaces, where N is your contractually agreed threshold (typically 3–10 depending on the contract).
- Cluster labelling explicitly forbids tenant-identifying language.
- A final auditor pass removes any cluster that, despite all the above, contains tenant-identifying signal.
This is what makes the analytics layer something you can ship even when the underlying traces are under NDA. You're surfacing aggregate behavioural patterns across the tenant base, not exposing any single tenant's data, and the threshold enforcement is mechanical rather than relying on the LLMs in the pipeline behaving themselves.
Distributional's on-prem deployment model is a parallel solution to the same problem: don't ship the traces off the customer's infrastructure at all. Both approaches are valid; the threshold approach scales to cross-tenant analytics in a SaaS setting, where on-prem doesn't.
Validation: synthesise traces with known failure modes
The synthetic-reconstruction trick from Clio is the single most importable evaluation pattern for trace analytics. The recipe:
- Define a target taxonomy of known agent failure modes — say 10 top-level categories (silent tool-call hallucination, infinite retry loop, premature completion claim, sub-agent fan-out blowup, prompt-injection susceptibility, etc.) and 50 sub-categories under them.
- For each leaf node, synthesise traces exhibiting that failure mode — either by prompt-engineering an agent to misbehave in the target way, or by deterministically constructing fake traces with the right structural fingerprint.
- Mix the synthetic traces (with known labels) into a larger sample of real traces.
- Run the analytics pipeline.
- Measure recovery: for each known failure mode, what fraction of its synthetic traces ended up in the same leaf cluster, and was that cluster's auto-generated label correctly descriptive of the underlying failure?
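Scoring the recovery step is a few lines: for each planted failure mode, the fraction of its synthetic traces that landed in that mode's modal leaf cluster. A sketch:

```python
# Sketch of the recovery metric for the synthetic-reconstruction experiment.
from collections import Counter

def recovery_rates(true_mode: list[str], cluster_id: list[int]) -> dict[str, float]:
    """Per failure mode: fraction of its traces in the mode's modal cluster."""
    per_mode: dict[str, Counter] = {}
    for mode, cid in zip(true_mode, cluster_id):
        per_mode.setdefault(mode, Counter())[cid] += 1
    return {mode: counts.most_common(1)[0][1] / sum(counts.values())
            for mode, counts in per_mode.items()}
```

Checking whether the modal cluster's auto-generated label actually describes the planted failure mode is the second half of the measurement, and that part is a human (or LLM-judge) review pass over a few dozen labels.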
Clio's 94% vs 5% numbers are the right shape of result to aim for. If your pipeline recovers 60% rather than 94%, that's diagnostic — usually it means your facet schema is missing the discriminating signal for some of the failure modes, and the synthetic experiment tells you which ones. This is also the right place to compare alternative facet schemas: run the same synthetic recovery experiment with and without tool_call_sequence as a facet, see how much recovery drops, decide whether the facet is worth its extraction cost.
Cost: still cheap
The per-trace cost in this setting is somewhat higher than the per-conversation cost in Clio, because traces are longer and the facet schema is richer. A realistic estimate:
- Haiku-equivalent facet extraction over a long trace: ~5K input tokens (truncated; for very long traces you summarise sub-trees first), ~500 output tokens. At Haiku 4.5 pricing (~$1/M input, ~$5/M output), that's ~$0.0075 per trace.
- Sonnet-equivalent cluster labelling: a few cents per cluster, total under $50 for a few thousand clusters at any plausible trace volume.
So the load-bearing per-trace cost is ~$0.0075, or roughly 15× the Clio per-conversation cost. At 1M traces, that's $7,500 for a full analytics run, on traces that — if they came from a serious production agent — cost the operator orders of magnitude more to generate. The Clio economic argument still holds: this is affordable to run on everything, not on a sample.
The right sampling strategy is therefore not to sample at the analytics stage at all — sample at the facet-extraction stage if necessary, but once facets are stored, all downstream clustering and hierarchy passes are cheap enough to run on the whole population. This matters because rare-but-important failure modes (the 0.5%-prevalence ones) get lost under uniform sampling and only show up under whole-population clustering.
What this surfaces that monitoring can't
The litmus test for whether you actually need the analytics layer — vs. what your existing eval and monitoring stack already gives you — is whether the patterns it surfaces could in principle have been pre-specified as monitors. Some can. Most of the interesting ones can't, until they've been discovered once.
Clark's claimed_vs_actual_tools example is the clearest case. Pre-discovery, no engineering team has a monitor called "fraction of traces where the agent's chain-of-thought claims a tool call that doesn't exist in the span tree", because no team thought to define it. Post-discovery — once the analytics surface has named the cluster and the team has investigated and decided the pattern is real — it's a one-line monitor. The analytics layer's role is discovery; once a pattern is discovered, you push it down to monitoring where it becomes a known-known.
Other categories of pattern that Clio-style analytics finds and pre-specified monitoring usually doesn't:
- Tool-error swallowing patterns. Agent calls tool, tool returns 5xx, agent generates a fluent final answer that pretends the tool returned correctly. Each individual trace looks fine to a monitor; the cluster is the signal.
- Sub-agent delegation regressions. After a prompt change, a previously-frequent delegation pattern (e.g., always spawning a research sub-agent for finance queries) becomes rare. Monitoring tracks delegation count; analytics tracks delegation patterns and notices the pattern shift.
- Resource-conservation-induced degradation. The over-efficiency case Clark described. Costs are down, evals are green, a 5% sub-cluster has degraded behaviour. Only visible in distribution.
- Translation-based safety evasion. Lifted directly from Clio's safety findings: users phrase their request in a non-English language that the safety classifier handles poorly. In an agentic setting the analogue is users routing their request through a sub-agent that strips the safety context.
- Prompt-injection susceptibility patterns. Specific tool output patterns that consistently cause the agent to deviate from its task — the cluster signature is the prompt-injection vector.
- Model "weather" effects. Clark's heat-dome analogy: the underlying model has drifted (a new fine-tune from the foundation lab, a routing change). Old monitors don't fire because they were calibrated to the old distribution; the new behaviour clusters into a different sub-population that the analytics layer surfaces as "this cluster did not exist a month ago, here's what it looks like."
The unifying property of every item on that list is that the cluster boundary itself is the insight. No single trace is the bug; the pattern is the bug. This is exactly the regime where Clio-style analytics is the only thing that works.
Where Clio gets it wrong, and what to do about it
The paper is explicit about Clio's limitations, and almost all of them transfer to the agent-trace setting unchanged. Citing them rather than glossing over them:
The model can hallucinate, misinterpret slang, or miss implicit information. A Haiku-level facet extractor will occasionally misread a tool call sequence or invent a failure mode that wasn't there. The threshold-based filtering doesn't fix this — it only prevents private info leaks, not factual mis-extraction. The mitigation is what the paper calls it: "we view Clio's outputs as a starting point for generating insights and leads for further investigation," not final answers. Every cluster surfaced should be investigated before any action is taken on it.
Embedding + k-means can create suboptimal groupings, especially for conversations that don't fit neatly into a single category. Traces that genuinely span two failure modes get assigned to one or the other. This is mitigated, not solved, by clustering along multiple facet axes (clustering by `task_summary` AND by `failure_mode` separately and cross-referencing the assignments) rather than relying on a single embedding.
Diverse clusters might receive overly broad labels. Sonnet labels a heterogeneous cluster with something generic and the insight evaporates. The fix is either lowering the k-means k (smaller, more homogeneous base clusters) or adding a secondary cluster-quality scoring pass that flags clusters whose intra-cluster cosine variance exceeds a threshold for human review.
Clio cannot definitively determine user intentions. In the trace setting: Clio-on-traces cannot determine why the agent did what it did. It surfaces that a behavioural pattern exists; the diagnosis of which prompt change or model update or upstream tool failure caused it is a separate investigation. Don't conflate the two layers.
Privacy/granularity trade-off. Tightening the minimum cluster size makes the analytics surface more privacy-preserving but blinds you to rare patterns. A 0.1%-prevalence failure mode in a tenant-thresholded analytics surface is just gone; it won't survive the threshold. This is acceptable for cross-tenant analytics, but you also need a within-tenant analytics pass (run analytics on each tenant's traces separately, share only the patterns with the tenant) to catch the long tail. Clio doesn't address this because Anthropic's threat model only requires cross-account aggregation; in B2B agents you generally want both modes.
Rare-behavior blind spot. Same point from the other direction: Clio is "not useful for identifying rare patterns." Sub-threshold abuse, sub-threshold failure modes, sub-threshold anything — all invisible. Pair Clio-style analytics with classical outlier detection (latency-tail traces, error-rate-tail traces, traces whose embeddings lie at the high-norm fringe of the distribution) to catch the long tail that clustering structurally cannot see.
Limited to conversational data; no ground truth on outcomes. In the trace setting we have a slight advantage here — for many agent applications there are objective outcome signals (did the user accept the answer, did the change land, did the order succeed, did the financial report match the ground truth). Pairing cluster membership with outcome signal closes part of this gap: a cluster of traces with abnormally low task-completion rate is more actionable than a cluster with no outcome data.
A practical recipe
Putting the pieces together as something a team can implement on a Monday morning:
1. Settle the substrate. OpenTelemetry with the GenAI semantic convention is the practical default. The semantic convention gives you a known schema for LLM responses, tool calls, and (increasingly) eval annotations, which makes facet extraction a structured-data problem rather than a JSON-spelunking one. If you're not already on it, the migration cost is worth it before any of the rest of this is profitable.
2. Pick the trace unit. Top-level agent trajectory, unless you have a strong reason otherwise. This determines what a "row" is in the rest of the pipeline.
3. Define a facet schema. Start with the schema sketched above (15-ish facets, mixing direct and LLM-extracted). Don't try to optimise the schema before running the pipeline once — most schema iterations come from realising during cluster review that you need an additional facet to make a real distinction crisper.
4. Extract facets with the cheapest competent model. Haiku 4.5 or equivalent. Cache aggressively; facet extraction is read-only over traces, so it's pure batch work. Store the facet vector in a columnar store next to the trace ID — this is your analytics fact table, and it should live separately from the OTel store so analytics queries don't compete with monitoring queries.
5. Embed and cluster. all-mpnet-base-v2 for the embedding. k-means at a k of ~5K–50K depending on trace volume. Off-the-shelf scikit-learn or faiss-based mini-batch k-means is fine — there is no algorithmic novelty needed at this stage.
6. Label clusters with Sonnet. Pass the cluster's task summaries plus the top-5 most-prevalent tool-call sequences and failure modes. Ask for a title and a one-paragraph description. The labelling prompt is where most of the cluster-quality engineering ends up living — investing time here pays back at every downstream consumer.
7. Build the hierarchy bottom-up. Embed labels, re-cluster, re-label, two more times. Aim for ~10 top-level / ~300 mid-level / ~5K leaf.
8. Enforce thresholds. Both unique-tenant and unique-trace minimums at every level of the hierarchy. Run a final auditor pass that filters clusters containing tenant-identifying language. Don't compromise here; this is what makes the output safe to look at.
9. Validate with synthetic reconstruction. Generate a known-taxonomy synthetic trace set, run the pipeline, measure recovery. Aim for the high 80s or low 90s. If you're under 70%, your facet schema is wrong; iterate.
10. Build the consumption interface. A 2D map plus tree explorer for analysts is the Clio default. For most engineering teams, a list of named leaf clusters sorted by (size × concern score × tenant count) and a "drill into example traces" affordance is enough to start. The fancy visualisation is optional; the underlying clusters and their names are not.
11. Feed back into monitoring and evals. Every named cluster that turns out to be a real failure mode becomes (a) a monitor in your monitoring stack, (b) an eval case in your eval set, (c) a guardrail in your agent prompt or middleware, and (d) input to your next prompt or fine-tuning iteration. This closes the loop — analytics discovers, monitoring tracks, evals constrain, prompts/RL adapt. Clark's "data flywheel needs to be analytics-driven" claim is essentially this loop.
12. Re-run periodically. Models drift, prompts change, users change, the world changes. Monthly or weekly re-clustering surfaces new patterns; comparing this month's hierarchy to last month's surfaces the regressions and the new emergent failure modes. The non-stationarity is exactly what makes this loop necessary — a static taxonomy decays in usefulness from the day it's built.
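One cheap way to operationalise the month-over-month comparison: embed both months' leaf-cluster labels and flag this month's clusters with no near neighbour in last month's set. A sketch; the 0.7 cosine cutoff is an arbitrary starting point, not a validated threshold:

```python
# Sketch of month-over-month drift detection over cluster labels.
from sentence_transformers import SentenceTransformer, util

def new_clusters(labels_now: list[str], labels_prev: list[str],
                 cutoff: float = 0.7) -> list[str]:
    """Labels from this month's hierarchy with no close match last month."""
    model = SentenceTransformer("all-mpnet-base-v2")
    now = model.encode(labels_now, convert_to_tensor=True,
                       normalize_embeddings=True)
    prev = model.encode(labels_prev, convert_to_tensor=True,
                        normalize_embeddings=True)
    sims = util.cos_sim(now, prev)   # [len(now), len(prev)] similarity matrix
    best = sims.max(dim=1).values    # nearest previous-month label per cluster
    return [lbl for lbl, s in zip(labels_now, best) if s < cutoff]
```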
The combined claim across Clio and Distributional is that the analytics layer for production agents is not a vibe or an art form — it's a pipeline, the pipeline is facet → embed → cluster → hierarchy → label → threshold, the per-unit cost is in the fractions of a cent, the validation discipline is synthetic-reconstruction with a known taxonomy, and the output is named clusters of behavioural patterns that you couldn't have pre-specified and that are now actionable. The conversation-to-trace mapping is straightforward enough that the dominant work in any implementation is in the facet schema (what should we measure per trace) and the threshold policy (what gets to surface), not in the clustering or the modelling.
Once that pipeline is in place, the question shifts from "how do we know if our agent is misbehaving" to "what cluster is the misbehaviour in, how big is it, when did it appear, and what's the cheapest intervention that would shrink it?" That's the question worth being able to ask, and pre-Clio there was no recipe for asking it at the scale and on the data type that production agents actually generate.