Debugging an AI agent is different from debugging traditional software. The code doesn’t usually break — the behaviour does. You can’t just read the stack trace. You need a systematic approach that accepts probabilistic inputs, non-linear execution, and failure modes that only show up in aggregate.

This is the playbook I use, refined across several agent products and documented in detail in the agent optimization experiment series.

The six categories of agent failure

Almost every agent bug I’ve diagnosed falls into one of six categories. Having the vocabulary speeds up triage.

  1. Hallucinated tool calls — the agent claims a tool returned data it never did, or calls a tool that doesn’t exist.
  2. Looping — the agent repeats the same step indefinitely (same URL, same subtool, same query variation).
  3. Wrong target / intent — the agent misidentifies what it’s being asked to do, often due to ambiguous user input or lost context.
  4. Context truncation — the agent’s context window fills up, and the step that loses relevant context fails silently.
  5. Retrieval miss — the retrieval step returns irrelevant or stale documents, and the generation step faithfully answers the wrong question.
  6. Prompt regressions — a well-intentioned prompt change fixes one failure mode and introduces another. This is the category you only see by comparing before and after.

When you start a debugging session, try to put the failure in one of these buckets before going deeper. It changes which questions you ask next.
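To make the bucketing concrete, here is a minimal triage sketch. The category names come straight from the list above; the trace shape (`tool_calls`, `available_tools`, `context_tokens`, `context_limit`) and the specific heuristics are my assumptions, not a real tracing schema — cheap signals to apply before manual review.

```python
from enum import Enum
from typing import Optional

class FailureCategory(Enum):
    HALLUCINATED_TOOL_CALL = "hallucinated_tool_call"
    LOOPING = "looping"
    WRONG_INTENT = "wrong_intent"
    CONTEXT_TRUNCATION = "context_truncation"
    RETRIEVAL_MISS = "retrieval_miss"
    PROMPT_REGRESSION = "prompt_regression"

def triage(trace: dict) -> Optional[FailureCategory]:
    """Cheap heuristics to bucket a failing trace before deeper review."""
    tool_calls = trace.get("tool_calls", [])
    known_tools = set(trace.get("available_tools", []))
    # A call to a tool that doesn't exist is a hallucinated tool call.
    if any(call["name"] not in known_tools for call in tool_calls):
        return FailureCategory.HALLUCINATED_TOOL_CALL
    # The same call repeated several times in a row suggests looping.
    names = [call["name"] for call in tool_calls]
    if len(names) >= 4 and len(set(names[-4:])) == 1:
        return FailureCategory.LOOPING
    # A context at or over the window limit points at truncation.
    if trace.get("context_tokens", 0) >= trace.get("context_limit", float("inf")):
        return FailureCategory.CONTEXT_TRUNCATION
    return None  # no cheap signal fired; needs manual review
```

Wrong-intent, retrieval-miss, and prompt-regression failures rarely have a cheap structural signal, so this sketch sends them to manual review rather than guessing.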

The five-step loop

Debugging agents reliably means running this loop:

1. Reproduce

Save everything about the failing run: the exact input, any retrieved context, the full trace with all span metadata, model IDs, and config. Non-determinism makes reproduction harder than in classical software — the same input can produce different outputs.

If the failure rate at a fixed input is below about 20%, treat it as prompt fragility rather than a deterministic bug. You’re not chasing a single broken state; you’re chasing a distribution.
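Since you're chasing a distribution rather than a single broken state, the first measurement is the failure rate itself. A minimal sketch, where `run_agent` and `check` are assumed hooks (one agent run on an input, and a pass/fail judgment on its output):

```python
def failure_rate(run_agent, user_input, check, n=20):
    """Estimate how often a fixed input fails across n repeated runs.

    run_agent(user_input) executes one run; check(output) returns True
    when the output is acceptable. Both are hypothetical hooks.
    """
    failures = sum(1 for _ in range(n) if not check(run_agent(user_input)))
    return failures / n
```

A rate under roughly 0.2 on the same input is the prompt-fragility signal described above; a rate near 1.0 means you likely have a reproducible, deterministic bug.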

For anything else, pin seeds where you can and record enough to re-run the step that failed.
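Saving "everything about the failing run" is worth automating so you never lose a reproduction case. A sketch, assuming the run is already collected into one dict (input, retrieved context, trace spans, model IDs, config, seed):

```python
import json
import pathlib
import time

def snapshot_failing_run(run: dict, out_dir="failing_runs"):
    """Persist a failing run to disk so the broken step can be re-run later.

    `run` is assumed to carry the exact input, retrieved context, full
    trace with span metadata, model IDs, sampling config, and seed.
    """
    path = pathlib.Path(out_dir)
    path.mkdir(parents=True, exist_ok=True)
    # Millisecond timestamp keeps filenames unique without coordination.
    fname = path / f"run_{int(time.time() * 1000)}.json"
    fname.write_text(json.dumps(run, indent=2, default=str))
    return fname
```

`default=str` is a deliberate choice: it degrades non-serializable span objects to strings rather than losing the whole snapshot.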

2. Cluster

Zooming into a single failing trace is often misleading; one trace rarely tells you the whole story. Two moves you can make immediately:

  1. Pull a sample of failing traces (ten is often enough) and review them qualitatively.
  2. Tag each trace with its failure category and count how often each category appears.

This tells you which category is the highest-leverage fix. Working on the top failure mode first is almost always the right move.

In experiment #3 of the Wiki Racer series, ten-trace qualitative review turned up three distinct failure categories and completely changed what I prioritised.
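Once traces are tagged, the counting step is trivial. A sketch, where `categorize` is an assumed hook returning a label per trace (heuristic triage or a manual tag from review):

```python
from collections import Counter

def cluster_failures(traces, categorize):
    """Count failing traces per failure category, most common first.

    `categorize` maps a trace to a category label; it is a
    hypothetical hook, not part of any real tracing API.
    """
    counts = Counter(categorize(t) for t in traces)
    return counts.most_common()
```

The head of the returned list is the highest-leverage fix.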

3. Locate the broken step

Here’s where most debugging goes wrong. The step that produced the bad output is often not the step that caused the failure.

An agent pipeline has upstream and downstream steps, and errors propagate. Consider a retrieval step that returns stale documents: the generation step faithfully answers the wrong question, produces the bad final output, and looks like the culprit even though it did its job perfectly.

The fix: run intermediate evals on each step, not just a final output eval. Correlate intermediate eval scores with final eval scores. The step whose score correlates most strongly with failure is the step that actually broke.

This is the idea behind eval correlation heatmaps — they surface which upstream step is “poisoning” downstream outputs.

4. Hypothesize

Once you know which step is failing and which category the failure is in, form a hypothesis. A good hypothesis is specific, testable, and names a mechanism: for example, “the retrieval step returns stale documents whenever the query contains an ambiguous entity name.”

Resist the urge to just tweak the prompt. Tweaking without a hypothesis turns debugging into wishful thinking; you end up with a Frankenstein prompt that nobody understands and that regresses unpredictably later.

5. Test the fix

This is the step people skip. The trap is measuring the fix against a fresh random sample instead of the same failing traces you started with.

Why it matters: agent runs are probabilistic, and failure rates move with input mix. If you fixed nothing and re-ran on a new sample, the failure rate would change just from variance. The only way to know whether a fix is real is to run the fixed version against the same failing traces and measure the delta.

In practice: re-run the fixed version against the saved failing traces, confirm the failure rate on those traces drops, and then run a fresh sample to check the fix hasn’t introduced a regression elsewhere (category 6 above).
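A sketch of measuring a fix against the same failing traces, with `run_agent_fixed` and `check` as assumed hooks (the fixed agent and a pass/fail judgment):

```python
def fix_delta(failing_runs, run_agent_fixed, check):
    """Fraction of previously-failing inputs that now pass.

    Re-running on the *same* saved failing traces makes the number
    immune to input-mix variance; a fresh random sample would not be.
    (run_agent_fixed and check are hypothetical hooks.)
    """
    fixed = sum(
        1 for run in failing_runs if check(run_agent_fixed(run["input"]))
    )
    return fixed / len(failing_runs)
```

A delta near 0.0 means the fix did nothing; pair a high delta with a fresh-sample run to catch regressions the fix introduced elsewhere.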

When to automate this loop

The steps above are doable manually at small scale, on the order of hundreds of traces per week. Beyond that, the loop breaks on human bandwidth: nobody has time to cluster thousands of traces by hand, compute eval correlations in a spreadsheet, or re-check every prompt change for regressions.

This is the moment teams reach for agent-analytics tooling. A tool that runs opinionated playbooks — clustering, eval correlation, regression detection — on your production traces automates the parts of this loop that don’t scale with team size.

Disclosure: I build one (TwoTail), so I’m biased. Other options in the category are listed in best agent observability tools 2026.

Whichever tool you pick, the common mistake is confusing single-trace inspection with diagnosis. A trace tells you what happened; a cluster tells you what’s going wrong.