Debugging an AI agent is different from debugging traditional software. The code doesn’t usually break — the behaviour does. You can’t just read the stack trace. You need a systematic approach that accepts probabilistic inputs, non-linear execution, and failure modes that only show up in aggregate.
This is the playbook I use, refined across several agent products and documented in detail in the agent optimization experiment series.
The six categories of agent failure
Almost every agent bug I’ve diagnosed falls into one of six categories. Having the vocabulary speeds up triage.
- Hallucinated tool calls — the agent claims a tool returned data it never did, or calls a tool that doesn’t exist.
- Looping — the agent repeats the same step indefinitely (same URL, same subtool, same query variation).
- Wrong target / intent — the agent misidentifies what it’s being asked to do, often due to ambiguous user input or lost context.
- Context truncation — the agent’s context window fills up, and the step that loses relevant context fails silently.
- Retrieval miss — the retrieval step returns irrelevant or stale documents, and the generation step faithfully answers the wrong question.
- Prompt regressions — a well-intentioned prompt change fixes one failure mode and introduces another. This is the category you only see by comparing before and after.
When you start a debugging session, try to put the failure in one of these buckets before going deeper. It changes which questions you ask next.
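To make bucketing mechanical rather than ad hoc, the six categories can be encoded directly. This is a minimal sketch — the enum names and the helper are mine, not from any particular tool — showing how a handful of hand-labelled traces turns into a frequency-ranked distribution:

```python
from enum import Enum
from collections import Counter

class FailureCategory(Enum):
    """The six buckets described above."""
    HALLUCINATED_TOOL_CALL = "hallucinated_tool_call"
    LOOPING = "looping"
    WRONG_TARGET = "wrong_target"
    CONTEXT_TRUNCATION = "context_truncation"
    RETRIEVAL_MISS = "retrieval_miss"
    PROMPT_REGRESSION = "prompt_regression"

def bucket_distribution(labels: list[FailureCategory]) -> list[tuple[FailureCategory, int]]:
    """Return categories sorted by frequency, most common first."""
    return Counter(labels).most_common()
```

Labelling 10-20 failed traces and calling `bucket_distribution` gives you the "5 loops, 4 wrong target" view in one line.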
The five-step loop
Debugging agents reliably means running this loop:
1. Reproduce
Save everything about the failing run: the exact input, any retrieved context, the full trace with all span metadata, model IDs, and config. Non-determinism makes reproduction harder than in classical software — the same input can produce different outputs.
If the failure rate at a fixed input is below about 20%, treat it as prompt fragility rather than a deterministic bug. You’re not chasing a single broken state; you’re chasing a distribution.
For anything else, pin seeds where you can and record enough to re-run the step that failed.
2. Cluster
Zooming in on a single failing trace is often misleading; one trace rarely tells you the whole story. Two moves you can make immediately:
- Semantic clustering on inputs and outputs. Group runs by intent and outcome. A cluster of 100 failures with the same root cause is easier to diagnose than one failure with unknown scope.
- Failure-mode bucketing against the six categories above. Look at 10-20 failed traces and categorise them. You’ll usually see a clear distribution — e.g. “5 are loops, 4 are wrong target, 1 is retrieval miss.”
This tells you which category is the highest-leverage place to intervene. Working on the top failure mode first is almost always the right move.
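Semantic clustering doesn't need heavy tooling to get started. Assuming you already have an embedding per trace (from whatever embedding model you use — that part is out of scope here), a greedy threshold pass is enough to group runs by intent and outcome; the function and threshold below are a sketch, not a recommendation:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def greedy_cluster(embeddings: list[list[float]], threshold: float = 0.85) -> list[int]:
    """Assign each trace to the first cluster it is similar enough to,
    else start a new cluster. The first member of each cluster acts as
    its (fixed) centroid -- crude, but fine for a first triage pass."""
    centroids: list[list[float]] = []
    labels: list[int] = []
    for emb in embeddings:
        for i, c in enumerate(centroids):
            if cosine(emb, c) >= threshold:
                labels.append(i)
                break
        else:
            centroids.append(emb)
            labels.append(len(centroids) - 1)
    return labels
```

A cluster of 100 traces sharing one label is exactly the "same root cause, known scope" artefact you want before diagnosing anything.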
In experiment #3 of the Wiki Racer series, ten-trace qualitative review turned up three distinct failure categories and completely changed what I prioritised.
3. Locate the broken step
Here’s where most people go wrong. The step that produced the bad output is often not the step that caused the failure.
An agent has upstream and downstream steps. Consider:
- A RAG agent returns a hallucinated answer. The generation step is technically correct given its input. But the retrieval step upstream fetched irrelevant docs.
- A planner-executor agent fails the task. The executor is correct given the plan. But the planner misidentified the goal.
- A tool-calling agent picks the wrong tool. The tool selection step is correct given the context. But the summarisation step before it truncated the key information.
The fix: run intermediate evals on each step, not just a final output eval. Correlate intermediate eval scores with final eval scores. The step whose score correlates most strongly with failure is the step that actually broke.
This is the idea behind eval correlation heatmaps — they surface which upstream step is “poisoning” downstream outputs.
4. Hypothesize
Once you know which step is failing and which category the failure is in, form a hypothesis. A good hypothesis is:
- Specific — “the retrieval step is returning docs about the wrong person when names are ambiguous,” not “retrieval is bad.”
- Testable — you can describe the experiment that would confirm or refute it.
- Narrow — one change at a time. If you change three things and the failure rate moves, you can’t tell which change did the work.
Resist the urge to just tweak the prompt. Tweaking without a hypothesis turns debugging into wishful thinking; you end up with a Frankenstein prompt that nobody understands and that regresses unpredictably later.
5. Test the fix
This is the step people skip. The trap is measuring the fix against a fresh random sample instead of the same failing traces you started with.
Why it matters: agent runs are probabilistic, and failure rates move with input mix. If you fixed nothing and re-ran on a new sample, the failure rate would change just from variance. The only way to know whether a fix is real is to run the fixed version against the same failing traces and measure the delta.
In practice:
- Keep a “regression set” of known-bad traces. It’s the most valuable asset in an agent codebase after the prompt itself.
- After each fix, replay against the regression set. Measure how many previously-failing traces now pass.
- Track the regression set over time. If its size stops shrinking, the bug category has changed.
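The replay itself is simple enough to sketch. Here `run` is a hypothetical callable wrapping your fixed agent, returning True when a previously-failing input now passes:

```python
from typing import Callable

def replay_regression_set(run: Callable[[str], bool],
                          regression_set: list[str]) -> dict:
    """Replay known-bad inputs against the current agent version and
    report the delta -- the only measurement that distinguishes a real
    fix from sampling variance on a fresh random sample."""
    results = {inp: run(inp) for inp in regression_set}
    fixed = sum(results.values())
    return {
        "total": len(regression_set),
        "now_passing": fixed,
        "still_failing": len(regression_set) - fixed,
    }
```

Tracking `still_failing` across fixes gives you the shrinking-set signal: when it plateaus, the dominant bug category has moved.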
When to automate this loop
The steps above are doable manually at small scale — hundreds of traces per week. Beyond that, the loop breaks on human bandwidth:
- Clustering thousands of traces by hand isn’t feasible.
- Intermediate evals at scale require programmatic eval pipelines.
- Keeping regression sets fresh across a fast-moving product is continuous work.
This is the moment teams reach for agent-analytics tooling. A tool that runs opinionated playbooks — clustering, eval correlation, regression detection — on your production traces automates the parts of this loop that don’t scale with team size.
Disclosure: I build one (TwoTail), so I’m biased. Other options in the category are listed in best agent observability tools 2026.
Best for
- New agent product, sporadic failures: manual loop is enough. Focus on clustering 50-100 failed traces by hand to find the dominant failure mode.
- Production agent with weekly failure reviews: invest in intermediate evals and a regression set. Semi-automated loop.
- High-volume production agent with multiple engineers: automate the clustering and intermediate-eval steps. Manual inspection is for the tail, not the bulk.
The common mistake across all three is confusing single-trace inspection with diagnosis. A trace tells you what happened; a cluster tells you what’s going wrong.