I love(d) Funnel Charts.

They were the ultimate “Divide and Conquer” tool. They allowed us to take a messy, complex user journey and slice it into neat, manageable steps. Step 1: Sign Up. Step 2: Onboarding. Step 3: Purchase.

If the graph dipped at Step 2, you knew exactly where to focus your engineering time. You didn’t need to fix the whole product; you just needed to fix the onboarding. You could slice by segments to find the problematic one, and even dive into individual user journeys.

But the sad thing is this: funnels don’t really fit most agents.

The Problem: Agents Don’t Walk in Straight Lines

The premise of a funnel is linearity. Every successful user walks the same path, and anyone who deviates is a “drop-off.”

Agents are fundamentally non-linear. They loop. They retry. They take shortcuts. One run might take 3 steps; the next might take 30. If you try to force an Agent into a linear funnel, you end up with a mess of “Other” paths and confusing data that hides the reality of the behaviour.

We can’t just “fix” the funnel. We need to replace the jobs it used to do with visualisations native to this new, messier reality.

Here are the four charts I like best for each job.


1. The One-by-One View: Trace Waterfalls

Replaces: Session Replay / Step Drill-down

A Trace Waterfall in twotail.ai

When a funnel showed a drop-off, you’d usually click in to see why. In the old world, you might watch a session recording. In the agent world, we use the Trace Waterfall.

This is a flame graph for reasoning. It visualizes the hierarchy of the agent’s thought process: spans inside of spans, tool calls inside of thoughts. It’s incredibly dense and detailed. It shows you exactly where the latency is coming from (is it the LLM generation? Or the vector search?) and where the logic branched.
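The span hierarchy can be sketched with a few lines of code. This is a toy rendering, not a real tracing SDK: the span names, timings, and the `Span` class are all invented for illustration.

```python
# Minimal sketch: render an ASCII trace waterfall from nested spans.
# Span names and timings are invented; real traces come from a tracing SDK.
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    start: float  # seconds since trace start
    end: float
    children: list = field(default_factory=list)

    @property
    def duration(self) -> float:
        return self.end - self.start

def render(span: Span, depth: int = 0, scale: int = 10) -> list[str]:
    # Indent by depth in the hierarchy; offset and size the bar by time.
    pad = " " * int(span.start * scale)
    bar = "#" * max(1, int(span.duration * scale))
    lines = [f"{'  ' * depth}{span.name:<16}{pad}{bar} {span.duration:.1f}s"]
    for child in span.children:
        lines.extend(render(child, depth + 1, scale))
    return lines

trace = Span("agent_run", 0.0, 4.2, [
    Span("plan (LLM)", 0.0, 1.1),
    Span("vector_search", 1.1, 1.4),
    Span("generate (LLM)", 1.4, 4.2),
])
for line in render(trace):
    print(line)
```

Even in this toy, the shape of the chart answers the latency question at a glance: the two wide bars are LLM calls, the narrow one is the vector search.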

The Critique: “But a Waterfall is N=1. I can’t look at 10,000 waterfalls.” The Counter: You don’t have to. The density of text data in a waterfall is perfect for an AI Analyst. We are approaching a point where we can have an agent read 10,000 waterfalls overnight and summarize the failure patterns for you. Which brings us to the next chart.

2. The Aggregate View: Semantic Clustering

Replaces: Funnel Slicing and Dicing

A trace clustering table

If we can’t organize runs by “Step 1 vs Step 2,” how do we aggregate them? We organize them by Intent and Outcome.

Instead of a funnel, we need Semantic Clustering tables. By clustering the inputs and outputs of every run, we can visualize distinct islands of user behavior.

This replaces our slicing and dicing work. It tells you where your volume is coming from, which “types” of requests are failing, and for whom.
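The mechanics can be sketched in miniature. A real pipeline would use LLM embeddings and a proper clustering library; here a toy bag-of-words similarity keeps the example self-contained, and the runs and outcomes are invented.

```python
# Sketch: cluster runs by intent (via a stand-in "embedding"), then
# cross-tab each cluster against outcomes to see which intents fail.
# All inputs and outcomes below are invented examples.
from collections import Counter

runs = [
    ("refund my order please",      "fail"),
    ("cancel my order",             "fail"),
    ("refund for broken item",      "fail"),
    ("summarize this document",     "pass"),
    ("summarize the meeting notes", "pass"),
    ("write a summary of this pdf", "pass"),
]

def embed(text: str) -> Counter:
    # Stand-in for a real embedding model: word counts.
    return Counter(text.split())

def similarity(a: Counter, b: Counter) -> int:
    # Dot product over shared words; real systems use cosine similarity.
    return sum(a[w] * b[w] for w in a)

clusters = []  # each: {"centroid": Counter, "members": [run indices]}
for i, (text, _) in enumerate(runs):
    vec = embed(text)
    match = next((c for c in clusters
                  if similarity(vec, c["centroid"]) > 0), None)
    if match is None:
        match = {"centroid": Counter(), "members": []}
        clusters.append(match)
    match["centroid"] += vec
    match["members"].append(i)

# The "which types of requests are failing?" table.
for c in clusters:
    print(dict(Counter(runs[i][1] for i in c["members"])))
```

On this toy data the refund/cancel requests cluster together and all fail, while the summarization requests cluster together and all pass: exactly the kind of island the table surfaces.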

3. The Business View: The Pareto Front

Replaces: Conversion Rate vs. CAC

Funnels were often used to check conversion, which in turn fed the CAC:LTV ratio. In the agent world, the key trade-offs (Quality vs. Cost vs. Latency) happen inside every trace.

We are constantly making choices: should we use GPT-4o (smart but expensive) or Llama-3-70b (fast and cheap)? Should we do 3 retries or 0?

The Pareto Front visualizes this trade-off. By plotting your runs on a graph where X is “Cost per Run” and Y is “Eval Score,” you can find the efficient frontier. You might discover that switching to a cheaper model drops your Eval score by only 1%, but cuts your cost by 50%. A funnel obfuscates that efficiency gain; a Pareto chart makes it obvious.
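Finding the frontier is a simple dominance check. The configs and numbers below are invented for illustration:

```python
# Sketch: find the Pareto-efficient configs over (cost per run, eval score).
# Config names and numbers are invented.
configs = {
    "gpt-big, 3 retries":     (0.120, 0.92),
    "gpt-big, 0 retries":     (0.050, 0.91),
    "small-model, 3 retries": (0.020, 0.84),
    "small-model, 0 retries": (0.008, 0.70),
    "mid-model, 0 retries":   (0.030, 0.80),
}

def pareto_front(points: dict) -> dict:
    # A config is dominated if another is at least as cheap AND at least
    # as good, and strictly better on one axis. Keep the undominated set.
    front = {}
    for name, (cost, score) in points.items():
        dominated = any(
            c <= cost and s >= score and (c < cost or s > score)
            for other, (c, s) in points.items() if other != name
        )
        if not dominated:
            front[name] = (cost, score)
    return front

for name, (cost, score) in sorted(pareto_front(configs).items(),
                                  key=lambda kv: kv[1][0]):
    print(f"{name:<24} ${cost:.3f}/run  score {score:.2f}")
```

In this invented data, "mid-model, 0 retries" falls off the frontier (a cheaper config scores higher), and dropping retries on the big model cuts cost by more than half for a one-point score drop: the kind of efficiency gain the chart makes obvious.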

4. The Causal View: Eval Heatmaps

Replaces: Drop-off Analysis

This is the holy grail. The single most valuable thing a funnel did was imply causality. “They dropped off at Step 2, therefore Step 2 is broken.”

In an agent, sequence ≠ causality. An agent might fail at the final step (generating an answer) not because the generation model is bad, but because the retrieval step (Step 1) fetched irrelevant documents. The failure was “poisoned” upstream.

To solve this, we need Correlation Heatmaps. We run “Intermediate Evals” on every step (e.g., Retrieval Precision, Plan Quality) and correlate them with the “Final Eval” (e.g., User Satisfaction).

A heatmap might reveal a bright red correlation between “Poor Retrieval” and “Poor Final Answer,” even if the retrieval step itself didn’t throw an error. This is your Root Cause Detector. It tells you which lever to pull to actually fix the outcome.
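The numbers behind such a heatmap cell are just correlations between per-step scores and the final score. A sketch, with invented eval data chosen so that retrieval quality drives the final answer while plan quality barely matters:

```python
# Sketch: correlate intermediate eval scores with the final eval score.
# Each row is one run: (retrieval_precision, plan_quality, final_score).
# All numbers are invented for illustration.
import math

runs = [
    (0.90, 0.8, 0.95),
    (0.20, 0.7, 0.30),
    (0.80, 0.6, 0.85),
    (0.30, 0.8, 0.40),
    (0.95, 0.7, 0.90),
    (0.10, 0.6, 0.20),
]

def pearson(xs: list, ys: list) -> float:
    # Pearson correlation coefficient, computed from scratch.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

final = [r[2] for r in runs]
for name, col in [("retrieval_precision", 0), ("plan_quality", 1)]:
    r = pearson([run[col] for run in runs], final)
    print(f"{name:>20} vs final: r = {r:+.2f}")
```

Here retrieval precision correlates almost perfectly with the final score while plan quality is weakly correlated: the "bright red cell" pointing at retrieval as the root cause, even though no step threw an error.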

P.S. We probably should have done this more in product analytics too: churn at a later step was often caused by flaws earlier in the journey!


The Toolkit Has Changed

The “Divide and Conquer” philosophy of the funnel is still valid. We still need to break big problems into small ones.

But we can no longer rely on the comforting illusion of a linear path. We have to get comfortable with the messiness of the loop.


I write this blog because I’m interested in agent analytics, and also because I want you to try my product, TwoTail!