Automatic Debugging and Failure Detection in AI Agent Systems
Large Language Model (LLM) based AI agents can autonomously plan multi-step solutions and use tools, but they often fail in unpredictable ways. A single small mistake in an agent’s reasoning or action can cascade into a complete task failure, much like building a house on a cracked foundation. Debugging these autonomous agents is crucial – especially as they take real actions in the world – yet it is extremely challenging to pinpoint and fix the root causes of failures. In this post, we dig into recent research on automatically debugging AI agents and detecting failures, analyze a new framework called DoVer and related work, and survey emerging platforms (Langfuse, LangWatch, Arize Phoenix) that help monitor and troubleshoot agent behavior. We will cover why agents fail, how failures can be attributed to specific steps or agents, and new techniques to automatically identify and remediate those failures.
Understanding Why AI Agents Fail
Complex Failure Modes
Modern LLM-driven agents have sophisticated architectures (planning modules, memory, tool-use, reflection, etc.) that unfortunately create many potential failure points. Recent studies have cataloged common failure modes in multi-agent LLM systems. For example, the MAST framework (Cemri et al., 2025) categorizes typical errors into issues with task interpretation, planning, tool or environment interaction, and result verification. In other words, an agent might misunderstand the user’s request, devise a flawed plan, use tools incorrectly (e.g. calling an API with wrong parameters), or fail to double-check its answers – all of which can lead to failure. These errors can accumulate over a long chain of actions, since agent solutions often involve multiple steps of reasoning and tool use.
Inadequacy of Simple Pass/Fail Metrics
Traditional end-to-end evaluation (just marking a whole task as success or failure) is too coarse to diagnose what went wrong. Research has proposed more granular evaluation methods: for instance, defining requirement graphs or task utility measures that break a complex task into sub-requirements and track where progress stalled. One such framework, TRAIL (Deshpande et al., 2025), creates turn-level execution traces and a fine-grained taxonomy of errors (in reasoning, planning, execution), revealing that even very capable long-context models struggle to debug failures in their own traces. In other words, simply giving a final pass/fail score hides where the agent got stuck – was it at an early planning step, or a late execution step? – so new methods focus on analyzing the full trajectory of the agent’s decisions.
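To make milestone-style evaluation concrete, here is a minimal Python sketch of scoring an agent trace against sub-requirements rather than a single pass/fail flag. This is not the TRAIL or requirement-graph implementation; the Milestone class, the predicates, and the travel-task milestones are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Milestone:
    """One sub-requirement of the overall task (e.g. 'found a flight')."""
    name: str
    satisfied_by: Callable[[str], bool]  # predicate evaluated over the execution trace

def progress_score(trace: str, milestones: List[Milestone]) -> float:
    """Fraction of sub-requirements satisfied by a trace, instead of a binary pass/fail."""
    return sum(m.satisfied_by(trace) for m in milestones) / len(milestones)

# Hypothetical milestones for a travel-planning task.
milestones = [
    Milestone("found_flight", lambda t: "flight_id=" in t),
    Milestone("found_hotel", lambda t: "hotel_id=" in t),
    Milestone("built_itinerary", lambda t: "itinerary:" in t),
]

trace = "... flight_id=UA123 ... hotel search failed ..."
print(progress_score(trace, milestones))  # ~0.33: the agent stalled after finding a flight
```

A score like this points to where progress stopped, which is exactly the information a final pass/fail label throws away.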
Failure Cascades
A key challenge is that agent errors often cascade. Because each step’s output can influence the next step, an early mistake can throw off all subsequent steps. For example, if an AI travel agent forgets to record a flight it found, all later steps (finding hotels, making an itinerary) might be invalid due to the missing flight info. This “single point of failure” problem is noted by Zhu et al. (2025) – LLM agents’ sophisticated architectures make them vulnerable to cascading failures, where a single root error propagates through later decisions. This means debugging must find that root cause early in the chain.
Multi-Agent Complexity
In multi-agent systems (where several specialized agents collaborate), failures become even harder to dissect. Not only can errors accumulate, but agents might mislead each other. A recent study, “Why do multi-agent LLM systems fail?” (Cemri et al., 2025), observed frequent Inter-Agent Misalignment: one agent’s ambiguous or incorrect instruction can cause another agent to err. For example, an orchestrator agent might tell a tool-using agent to click a non-existent button. The sub-agent, rather than stopping, might attempt a wrong action, compounding the mistake. In such a case, which agent actually “caused” the failure? Both contributed to the error, so attributing blame to just one is inappropriate. This shows that failures can stem from poor coordination between agents, not just an isolated bad step.

Moreover, agents often attempt tasks in multiple trials – e.g. if the first plan fails, they re-plan and try a different approach. This leads to execution logs with several distinct attempts (trials) in one session. Different trials might fail for different reasons, making it ambiguous which single step overall was the decisive failure. In fact, human annotators often disagree on the exact failure step. In the Who&When benchmark, even after discussions, some failure cases had no clear single error step (i.e. “uncertain” ground truth). All these factors make manual failure analysis difficult and highlight the need for automated, systematic debugging.
From Failure Attribution to Automated Debugging
Given the challenges above, a line of research has emerged to automatically attribute failures in agent execution logs – essentially, to answer “which step (and which agent) caused the task to fail?” This is known as failure attribution. A recent benchmark called Who&When (Zhang et al., 2025) was introduced to study this problem. Who&When provides extensive failure logs from many multi-agent systems, with fine-grained annotations of which agent and step contained the decisive error. The authors of Who&When evaluated several automated methods and found the task extremely challenging: the best method could correctly identify the failing agent only ~53.5% of the time, and pinpoint the exact failing step only 14.2% of the time. In fact, some naive methods performed below random chance for step identification. Even state-of-the-art LLMs with advanced reasoning (like GPT-4 or specialized agent models) struggled to achieve practical accuracy in this attribution task. Another report showed GPT-4 (with a straightforward prompt) got below 10% accuracy on failure step attribution in Who&When. Clearly, reliably diagnosing agent failures from logs is hard.
Why is identifying the “failure step” so difficult? As discussed, often multiple things go wrong. A log may show many symptoms of failure, but identifying the earliest root cause requires understanding hypotheticals: if that step had gone right, would the agent have eventually succeeded? The formal definition of a decisive failure step is exactly that – a step such that fixing it would allow success. But without actually fixing and re-running, any attribution we make is just a hypothesis. The DoVer paper aptly notes that prior attribution methods infer a failing step from the log alone, which “remains an untested hypothesis unless validated by execution”. In other words, the agent’s log might suggest step X was the culprit – but to be sure, one would need to go back, change step X to a correct action, and see if the whole task then succeeds. This insight has led researchers to propose intervention-based debugging: don’t just guess the failing step – actually try to fix it and observe what happens.
The DoVer Approach: Do, Then Verify
A recent framework called DoVer (“Do-then-Verify”) embodies this intervention-driven debugging philosophy. The idea, introduced by Fourney et al. in 2025 (in the context of multi-agent systems), is to treat any failure attribution as a hypothesis and explicitly test it. DoVer’s approach can be summarized as follows: hypothesize which step or agent caused the failure, intervene by making a minimal change at that point, and then re-run the agent to see if the outcome improves. This turns debugging into an experimental loop. By observing the new run, DoVer can verify or refute the hypothesis about the failure’s cause. Importantly, this method sidesteps the uncertainty in human labels – instead of relying on noisy ground-truth annotations of the failure step, it directly tests what the “decisive error” is by attempting a fix.
How exactly does DoVer work? The system implements a four-stage pipeline (a simplified code sketch follows the four stages below):
1. Trial Segmentation: First, the full execution log of the agent is segmented into distinct trials. Each trial is essentially one plan-execute cycle (e.g. an agent’s attempt from a new plan through subsequent tool calls until either it re-plans or ends). Many agent systems use iterative ReAct loops that re-plan when stuck. DoVer automatically detects those re-plan points (using prompting on the log) to break the session into trials. This way, the debugging can focus on one attempt at a time, which simplifies reasoning about causality. It also means if the agent made multiple attempts, DoVer can examine each for potential fixes.
2. Failure Hypothesis Generation: For each trial, the system (using an LLM or some heuristic) generates a candidate hypothesis about which step (and which agent, in a multi-agent context) was the root cause of failure. For example, it might say “In trial 2, the WebSurfer agent clicking the wrong year at step 54 is the failure.” Along with the identified agent and step, DoVer can produce a natural-language rationale explaining why that step is suspected. This builds on prior log analysis techniques – essentially using the execution trace to propose what went wrong.
3. Intervention Generation: Next, DoVer synthesizes a minimal intervention to test that hypothesis. An intervention could be editing the content of a specific step (e.g. replacing the agent’s incorrect action or answer with a correct one), or adjusting the instruction given by an orchestrator agent, or even tweaking a tool invocation. The key is to change as little as possible – just enough to fix the suspected error – so that if the hypothesis is right, the agent should now succeed (or at least get further). DoVer’s design recognizes two broad categories of interventions: (a) those on the coordinator’s plan or its instructions to sub-agents, and (b) those on a sub-agent’s own behavior or capabilities. This distinction is important in multi-agent settings: if the failure was due to a bad plan, you fix the plan; if it was due to a sub-agent’s limitation, you might need to improve that sub-agent’s tool usage or knowledge.
4. Intervention Execution & Evaluation: Finally, DoVer replays the agent’s task using the intervention. It rolls back to the start of the trial (or the step) in question, applies the edit, and then lets the agent proceed normally from there. The outcome of this new execution is compared to the original. If the agent now achieves the goal (or at least makes additional progress towards it), that’s evidence the hypothesis was correct – the edited step was indeed the decisive failing point. DoVer doesn’t require a complete success to learn something; it also measures intermediate progress via defined milestones or utility scores. An improvement in this score after intervention indicates partial recovery. By scoring the difference, DoVer can rank hypotheses by how much they improve the outcome. In practice, multiple hypotheses can be tested (even in parallel) for a single failure, since a complex failure might be fixable in several ways. The end result is a validated identification of the failure’s cause and a proven fix, rather than just a guess.
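Putting the four stages together, here is a minimal, illustrative Python sketch of a do-then-verify loop. It is not DoVer’s actual implementation: the segmentation heuristic, the hypothesis generator, and the re-execution step are all stubbed stand-ins for what would, in a real system, be an LLM-driven analyzer plus a replay harness for the agent framework in use.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Step:
    agent: str
    content: str

@dataclass
class Trial:
    steps: List[Step]

@dataclass
class Hypothesis:
    step_idx: int
    agent: str
    rationale: str
    proposed_edit: str  # minimal replacement content for the suspected step

def segment_into_trials(steps: List[Step]) -> List[Trial]:
    """Stage 1 (stub): split the log into trials at re-plan points (here: steps starting 'plan:')."""
    trials, current = [], []
    for step in steps:
        if step.content.startswith("plan:") and current:
            trials.append(Trial(current))
            current = []
        current.append(step)
    if current:
        trials.append(Trial(current))
    return trials

def propose_hypotheses(trial: Trial) -> List[Hypothesis]:
    """Stage 2 (stub): propose candidate root-cause steps; a real system would prompt an LLM."""
    return [Hypothesis(i, s.agent, "suspected wrong action", "corrected action")
            for i, s in enumerate(trial.steps) if "error" in s.content]

def rerun_with_edit(trial: Trial, hyp: Hypothesis) -> float:
    """Stages 3-4 (stub): apply the minimal edit, replay the agent from that step,
    and return a progress/utility score for the new run."""
    patched = Trial(list(trial.steps))
    patched.steps[hyp.step_idx] = Step(hyp.agent, hyp.proposed_edit)
    return 1.0  # stub: pretend the fix leads to full task success

def do_then_verify(steps: List[Step], baseline_score: float = 0.0) -> List[Dict]:
    """Test each failure hypothesis by intervention and rank hypotheses by improvement."""
    results = []
    for trial in segment_into_trials(steps):
        for hyp in propose_hypotheses(trial):
            gain = rerun_with_edit(trial, hyp) - baseline_score
            results.append({"hypothesis": hyp, "progress_gain": gain})
    return sorted(results, key=lambda r: r["progress_gain"], reverse=True)
```

The key design choice the sketch mirrors is that a hypothesis is scored by how much the intervention improves the outcome, so partial recovery still counts as evidence even when the patched run does not fully succeed.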
This do-and-verify loop is an exciting development because it moves beyond passive log analysis. It’s akin to how a software engineer debugs code: if you think a certain function call is causing a crash, you might stub it out or modify it and run the program again to see if the crash goes away. DoVer brings that practice into AI agent trajectories. In their experiments, the DoVer authors found this approach can recover a significant portion of failure cases that static analysis would miss, and it provides explicit evidence for which hypothesis is correct. By focusing on minimal edits within the original trajectory, DoVer also keeps the debugging targeted – it’s not trying to retrain the agent from scratch, just patch the specific run at fault.
Advances in Failure Attribution and Debugging
The DoVer framework is part of a broader wave of research in 2024–2025 aimed at making AI agents more reliable through better failure analysis and self-debugging. We highlight some notable advances and how they relate:
- Reasoning-Based “Judge” Agents: Systems such as RAFFLES (Zhu et al., 2025) use a reasoning-driven LLM to judge execution traces and identify the most likely failing step.
- Counterfactual and Abductive Analysis: West et al. (2025) proposed Abduct, Act, Predict, generating counterfactual trajectories to test which step prevents failure.
- Causal Inference Techniques: Ma et al. (2025) apply formal causal inference to agent logs to identify critical causal steps.
- Spectrum-Based Fault Localization: Ge et al. (2025) adapt software debugging spectrum analysis to multi-agent systems.
- Hierarchical and Graph-Based Attribution: GraphTracer (Zhang et al., 2025) traces failures through graph-structured agent states.
Human-in-the-Loop Debugging Tools and Frameworks
Tools such as AGDebugger and LangGraph enable time-travel debugging, checkpointing, and step-level replay. These systems show that minimal interventions can often recover agent performance, but they rely heavily on human expertise and do not scale.
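To show what checkpointing and step-level replay look like mechanically, here is a framework-agnostic Python sketch. It is not AGDebugger’s or LangGraph’s API; the CheckpointingRunner class and its methods are purely illustrative. The idea is that the runtime snapshots agent state after every step so a developer can rewind to a checkpoint, edit it, and resume the run.

```python
import copy
from typing import Any, Callable, Dict, List

class CheckpointingRunner:
    """Runs an agent step function and snapshots state after every step,
    enabling rewind-edit-resume ('time travel') debugging."""

    def __init__(self, step_fn: Callable[[Dict[str, Any]], Dict[str, Any]]):
        self.step_fn = step_fn
        self.checkpoints: List[Dict[str, Any]] = []

    def run(self, state: Dict[str, Any], n_steps: int) -> Dict[str, Any]:
        for _ in range(n_steps):
            state = self.step_fn(state)
            self.checkpoints.append(copy.deepcopy(state))  # snapshot after each step
        return state

    def resume_from(self, step_index: int, edit: Dict[str, Any], n_steps: int) -> Dict[str, Any]:
        """Rewind to a saved checkpoint, apply a human-authored edit, and replay from there."""
        state = copy.deepcopy(self.checkpoints[step_index])
        state.update(edit)
        return self.run(state, n_steps)
```

The human-in-the-loop tools wrap exactly this loop in a UI; the bottleneck is that a person still has to decide which checkpoint to rewind to and what edit to make.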
Observability and Monitoring Platforms
Langfuse
Langfuse provides end-to-end tracing, cost analysis, latency tracking, and evaluation hooks for LLM agents, making intermediate failures visible.
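As a minimal illustration of this kind of instrumentation, the sketch below assumes Langfuse’s decorator-based Python SDK (the `@observe` decorator; import paths vary between SDK versions) and API keys supplied via environment variables. The `search_flights` and `plan_trip` functions are hypothetical stand-ins for an agent’s tool calls.

```python
# Requires `pip install langfuse` and the LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY
# environment variables. In older SDK versions the decorator is imported from
# `langfuse.decorators` rather than the top-level package.
from langfuse import observe

@observe()  # records this call as an observation within the current trace
def search_flights(query: str) -> str:
    # ... call an LLM or a flight-search tool here (hypothetical agent step) ...
    return "flight_id=UA123"

@observe()  # the outer call becomes the parent trace; nested calls appear as child spans
def plan_trip(request: str) -> str:
    flight = search_flights(request)
    return f"itinerary: {flight}"

plan_trip("Seattle to Tokyo in May")
```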
LangWatch
LangWatch focuses on agent simulation, regression testing, and pre-production failure detection.
Arize Phoenix
Phoenix offers open-source LLM observability with distributed tracing, framework-agnostic instrumentation, and span-level inspection.
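The sketch below shows one way such tracing is typically wired up. It assumes the `phoenix.otel.register` helper and the OpenInference instrumentor for the OpenAI client, whose exact names and signatures may differ across Phoenix versions, so treat it as a workflow outline rather than an exact recipe.

```python
# Requires `pip install arize-phoenix openinference-instrumentation-openai openai`.
import phoenix as px
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

px.launch_app()  # local Phoenix UI for inspecting traces and spans
tracer_provider = register(project_name="agent-debugging")  # OpenTelemetry setup
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# From here on, OpenAI calls made by the agent are captured as spans
# (prompt, completion, latency, token counts) and can be inspected span-by-span
# in the Phoenix UI.
```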
Conclusion and Outlook
Automatic debugging and failure detection are becoming essential as AI agents move into real-world, high-stakes environments. Research frameworks such as DoVer demonstrate that failure attribution must be verified through intervention and re-execution, not merely inferred from logs, while observability platforms provide the infrastructure to deploy these ideas in production.
The future of scalable AI agents lies in systems that do not merely act, but can detect, explain, and repair their own failures.
References
- DoVer: https://arxiv.org/abs/2512.06749
- Who&When: https://arxiv.org/abs/2503.13657
- TRAIL: https://arxiv.org/abs/2505.08638
- Langfuse: https://langfuse.com
- LangWatch: https://langwatch.ai
- Arize Phoenix: https://docs.arize.com/phoenix