Debugging an AI agent is different from debugging traditional software. The code doesn’t usually break — the behaviour does. You can’t just read the stack trace. You need a systematic approach that accepts probabilistic inputs, non-linear execution, and failure modes that only show up in aggregate.
This is the playbook I use, refined across several agent products and documented in detail in the agent optimization experiment series.
The six categories of agent failure
Almost every agent bug I’ve diagnosed falls into one of six categories. Having the vocabulary speeds up triage.
- Hallucinated tool calls — the agent claims a tool returned data it never did, or calls a tool that doesn’t exist.
- Looping — the agent repeats the same step indefinitely (same URL, same subtool, same query variation).
- Wrong target / intent — the agent misidentifies what it’s being asked to do, often due to ambiguous user input or lost context.
- Context truncation — the agent’s context window fills up, and the step that loses relevant context fails silently.
- Retrieval miss — the retrieval step returns irrelevant or stale documents, and the generation step faithfully answers the wrong question.
- Prompt regressions — a well-intentioned prompt change fixes one failure mode and introduces another. This is the category you only see by comparing before and after.
When you start a debugging session, try to put the failure in one of these buckets before going deeper. It changes which questions you ask next.
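To make bucketing mechanical rather than ad hoc, the six categories can be encoded directly. This is a minimal sketch — the enum names and the helper are mine, not from any particular tool — showing how a handful of hand-labelled traces turns into a frequency-ranked distribution:

```python
from enum import Enum
from collections import Counter

class FailureCategory(Enum):
    """The six buckets described above."""
    HALLUCINATED_TOOL_CALL = "hallucinated_tool_call"
    LOOPING = "looping"
    WRONG_TARGET = "wrong_target"
    CONTEXT_TRUNCATION = "context_truncation"
    RETRIEVAL_MISS = "retrieval_miss"
    PROMPT_REGRESSION = "prompt_regression"

def bucket_distribution(labels: list[FailureCategory]) -> list[tuple[FailureCategory, int]]:
    """Return categories sorted by frequency, most common first."""
    return Counter(labels).most_common()
```

Labelling 10-20 failed traces and calling `bucket_distribution` gives you the "5 loops, 4 wrong target" view in one line.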
The five-step loop
Debugging agents reliably means running this loop:
1. Reproduce
Save everything about the failing run: the exact input, any retrieved context, the full trace with all span metadata, model IDs, and config. Non-determinism makes reproduction harder than in classical software — the same input can produce different outputs.
If the failure rate at a fixed input is below about 20%, treat it as prompt fragility rather than a deterministic bug. You’re not chasing a single broken state; you’re chasing a distribution.
For anything else, pin seeds where you can and record enough to re-run the step that failed.
2. Cluster
Zooming in on a single failing trace is often misleading; one trace rarely tells you the whole story. Two moves you can make immediately:
- Semantic clustering on inputs and outputs. Group runs by intent and outcome. A cluster of 100 failures with the same root cause is easier to diagnose than one failure with unknown scope.
- Failure-mode bucketing against the six categories above. Look at 10-20 failed traces and categorise them. You’ll usually see a clear distribution — e.g. “5 are loops, 4 are wrong target, 1 is retrieval miss.”
This tells you which category is the highest-leverage place to intervene. Working on the top failure mode first is almost always the right move.
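Semantic clustering doesn't need heavy tooling to get started. Assuming you already have an embedding per trace (from whatever embedding model you use — that part is out of scope here), a greedy threshold pass is enough to group runs by intent and outcome; the function and threshold below are a sketch, not a recommendation:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def greedy_cluster(embeddings: list[list[float]], threshold: float = 0.85) -> list[int]:
    """Assign each trace to the first cluster it is similar enough to,
    else start a new cluster. The first member of each cluster acts as
    its (fixed) centroid -- crude, but fine for a first triage pass."""
    centroids: list[list[float]] = []
    labels: list[int] = []
    for emb in embeddings:
        for i, c in enumerate(centroids):
            if cosine(emb, c) >= threshold:
                labels.append(i)
                break
        else:
            centroids.append(emb)
            labels.append(len(centroids) - 1)
    return labels
```

A cluster of 100 traces sharing one label is exactly the "same root cause, known scope" artefact you want before diagnosing anything.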
In experiment #3 of the Wiki Racer series, ten-trace qualitative review turned up three distinct failure categories and completely changed what I prioritised.
3. Locate the broken step
Here’s where most people go wrong. The step that produced the bad output is often not the step that caused the failure.
An agent has upstream and downstream steps. Consider:
- A RAG agent returns a hallucinated answer. The generation step is technically correct given its input. But the retrieval step upstream fetched irrelevant docs.
- A planner-executor agent fails the task. The executor is correct given the plan. But the planner misidentified the goal.
- A tool-calling agent picks the wrong tool. The tool selection step is correct given the context. But the summarisation step before it truncated the key information.
The fix: run intermediate evals on each step, not just a final output eval. Correlate intermediate eval scores with final eval scores. The step whose score correlates most strongly with failure is the step that actually broke.
This is the idea behind eval correlation heatmaps — they surface which upstream step is “poisoning” downstream outputs.
4. Hypothesize
Once you know which step is failing and which category the failure is in, form a hypothesis. A good hypothesis is:
- Specific — “the retrieval step is returning docs about the wrong person when names are ambiguous,” not “retrieval is bad.”
- Testable — you can describe the experiment that would confirm or refute it.
- Narrow — one change at a time. If you change three things and the failure rate moves, you can’t tell which change did the work.
Resist the urge to just tweak the prompt. Tweaking without a hypothesis turns debugging into wishful thinking; you end up with a Frankenstein prompt that nobody understands and that regresses unpredictably later.
5. Test the fix
This is the step people skip. The trap is measuring the fix against a fresh random sample instead of the same failing traces you started with.
Why it matters: agent runs are probabilistic, and failure rates move with input mix. If you fixed nothing and re-ran on a new sample, the failure rate would change just from variance. The only way to know whether a fix is real is to run the fixed version against the same failing traces and measure the delta.
In practice:
- Keep a “regression set” of known-bad traces. It’s the most valuable asset in an agent codebase after the prompt itself.
- After each fix, replay against the regression set. Measure how many previously-failing traces now pass.
- Track the regression set over time. If its size stops shrinking, the bug category has changed.
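The replay itself is simple enough to sketch. Here `run` is a hypothetical callable wrapping your fixed agent, returning True when a previously-failing input now passes:

```python
from typing import Callable

def replay_regression_set(run: Callable[[str], bool],
                          regression_set: list[str]) -> dict:
    """Replay known-bad inputs against the current agent version and
    report the delta -- the only measurement that distinguishes a real
    fix from sampling variance on a fresh random sample."""
    results = {inp: run(inp) for inp in regression_set}
    fixed = sum(results.values())
    return {
        "total": len(regression_set),
        "now_passing": fixed,
        "still_failing": len(regression_set) - fixed,
    }
```

Tracking `still_failing` across fixes gives you the shrinking-set signal: when it plateaus, the dominant bug category has moved.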
When to automate this loop
The steps above are doable manually at small scale — hundreds of traces per week. Beyond that, the loop breaks on human bandwidth:
- Clustering thousands of traces by hand isn’t feasible.
- Intermediate evals at scale require programmatic eval pipelines.
- Keeping regression sets fresh across a fast-moving product is continuous work.
This is the moment teams reach for agent-analytics tooling. A tool that runs opinionated playbooks — clustering, eval correlation, regression detection — on your production traces automates the parts of this loop that don’t scale with team size.
Disclosure: I build one (TwoTail), so I’m biased. Other options in the category are listed in best agent observability tools 2026.
Best for
- New agent product, sporadic failures: manual loop is enough. Focus on clustering 50-100 failed traces by hand to find the dominant failure mode.
- Production agent with weekly failure reviews: invest in intermediate evals and a regression set. Semi-automated loop.
- High-volume production agent with multiple engineers: automate the clustering and intermediate-eval steps. Manual inspection is for the tail, not the bulk.
The common mistake across all three is confusing single-trace inspection with diagnosis. A trace tells you what happened; a cluster tells you what’s going wrong.