I love(d) Funnel Charts.

They were the ultimate “Divide and Conquer” tool. They allowed us to take a messy, complex user journey and slice it into neat, manageable steps. Step 1: Sign Up. Step 2: Onboarding. Step 3: Purchase.

If the graph dipped at Step 2, you knew exactly where to focus your engineering time. You didn’t need to fix the whole product; you just needed to fix the onboarding. You could slice by segments to find the problematic one, and even dive into individual user journeys.

But the sad thing is this: funnels don’t really fit most agents.

The Problem: Agents Don’t Walk in Straight Lines

The premise of a funnel is linearity. Every successful user walks the same path, and anyone who deviates is a “drop-off.”

Agents are fundamentally non-linear. They loop. They retry. They take shortcuts. One run might take 3 steps; the next might take 30. If you try to force an Agent into a linear funnel, you end up with a mess of “Other” paths and confusing data that hides the reality of the behaviour.

We can’t just “fix” the funnel. We need to replace the jobs it used to do with visualisations native to this new, messier reality.

Here are the four charts I like best for each job.


1. The One-by-One View: Trace Waterfalls

Replaces: Session Replay / Step Drill-down

A Trace Waterfall in twotail.ai

When a funnel showed a drop-off, you’d usually click in to see why. In the old world, you might watch a session recording. In the agent world, we use the Trace Waterfall.

This is a flame graph for reasoning. It visualizes the hierarchy of the agent’s thought process: spans inside of spans, tool calls inside of thoughts. It’s incredibly dense and detailed. It shows you exactly where the latency is coming from (is it the LLM generation? Or the vector search?) and where the logic branched.
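The span hierarchy can be sketched with a few lines of code. This is a toy rendering, not a real tracing SDK: the span names, timings, and the `Span` class are all invented for illustration.

```python
# Minimal sketch: render an ASCII trace waterfall from nested spans.
# Span names and timings are invented; real traces come from a tracing SDK.
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    start: float  # seconds since trace start
    end: float
    children: list = field(default_factory=list)

    @property
    def duration(self) -> float:
        return self.end - self.start

def render(span: Span, depth: int = 0, scale: int = 10) -> list[str]:
    # Indent by depth in the hierarchy; offset and size the bar by time.
    pad = " " * int(span.start * scale)
    bar = "#" * max(1, int(span.duration * scale))
    lines = [f"{'  ' * depth}{span.name:<16}{pad}{bar} {span.duration:.1f}s"]
    for child in span.children:
        lines.extend(render(child, depth + 1, scale))
    return lines

trace = Span("agent_run", 0.0, 4.2, [
    Span("plan (LLM)", 0.0, 1.1),
    Span("vector_search", 1.1, 1.4),
    Span("generate (LLM)", 1.4, 4.2),
])
for line in render(trace):
    print(line)
```

Even in this toy, the shape of the chart answers the latency question at a glance: the two wide bars are LLM calls, the narrow one is the vector search.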

The Critique: “But a Waterfall is N=1. I can’t look at 10,000 waterfalls.” The Counter: You don’t have to. The density of text data in a waterfall is perfect for an AI Analyst. We are approaching a point where we can have an agent read 10,000 waterfalls overnight and summarize the failure patterns for you. Which brings us to the next chart.

2. The Aggregate View: Semantic Clustering

Replaces: Funnel Slicing and Dicing

A trace clustering table

If we can’t organize runs by “Step 1 vs Step 2,” how do we aggregate them? We organize them by Intent and Outcome.

Instead of a funnel, we need Semantic Clustering tables. By clustering the inputs and outputs of every run, we can visualize distinct islands of user behavior.

This replaces our slicing and dicing work. It tells you where your volume is coming from, which “types” of requests are failing, and for whom.
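The mechanics can be sketched in miniature. A real pipeline would use LLM embeddings and a proper clustering library; here a toy bag-of-words similarity keeps the example self-contained, and the runs and outcomes are invented.

```python
# Sketch: cluster runs by intent (via a stand-in "embedding"), then
# cross-tab each cluster against outcomes to see which intents fail.
# All inputs and outcomes below are invented examples.
from collections import Counter

runs = [
    ("refund my order please",      "fail"),
    ("cancel my order",             "fail"),
    ("refund for broken item",      "fail"),
    ("summarize this document",     "pass"),
    ("summarize the meeting notes", "pass"),
    ("write a summary of this pdf", "pass"),
]

def embed(text: str) -> Counter:
    # Stand-in for a real embedding model: word counts.
    return Counter(text.split())

def similarity(a: Counter, b: Counter) -> int:
    # Dot product over shared words; real systems use cosine similarity.
    return sum(a[w] * b[w] for w in a)

clusters = []  # each: {"centroid": Counter, "members": [run indices]}
for i, (text, _) in enumerate(runs):
    vec = embed(text)
    match = next((c for c in clusters
                  if similarity(vec, c["centroid"]) > 0), None)
    if match is None:
        match = {"centroid": Counter(), "members": []}
        clusters.append(match)
    match["centroid"] += vec
    match["members"].append(i)

# The "which types of requests are failing?" table.
for c in clusters:
    print(dict(Counter(runs[i][1] for i in c["members"])))
```

On this toy data the refund/cancel requests cluster together and all fail, while the summarization requests cluster together and all pass: exactly the kind of island the table surfaces.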

3. The Business View: The Pareto Front

Replaces: Conversion Rate vs. CAC

Funnels were often used to check conversion, which in turn fed the CAC:LTV ratio. In the agent world, the key trade-offs (Quality vs. Cost vs. Latency) happen inside every trace.

We are constantly making choices: should we use GPT-4o (smart but expensive) or Llama-3-70b (fast and cheap)? Should we do 3 retries or 0?

The Pareto Front visualizes this trade-off. By plotting your runs on a graph where X is “Cost per Run” and Y is “Eval Score,” you can find the efficient frontier. You might discover that switching to a cheaper model drops your Eval score by only 1%, but cuts your cost by 50%. A funnel obfuscates that efficiency gain; a Pareto chart makes it obvious.
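Finding the frontier is a simple dominance check. The configs and numbers below are invented for illustration:

```python
# Sketch: find the Pareto-efficient configs over (cost per run, eval score).
# Config names and numbers are invented.
configs = {
    "gpt-big, 3 retries":     (0.120, 0.92),
    "gpt-big, 0 retries":     (0.050, 0.91),
    "small-model, 3 retries": (0.020, 0.84),
    "small-model, 0 retries": (0.008, 0.70),
    "mid-model, 0 retries":   (0.030, 0.80),
}

def pareto_front(points: dict) -> dict:
    # A config is dominated if another is at least as cheap AND at least
    # as good, and strictly better on one axis. Keep the undominated set.
    front = {}
    for name, (cost, score) in points.items():
        dominated = any(
            c <= cost and s >= score and (c < cost or s > score)
            for other, (c, s) in points.items() if other != name
        )
        if not dominated:
            front[name] = (cost, score)
    return front

for name, (cost, score) in sorted(pareto_front(configs).items(),
                                  key=lambda kv: kv[1][0]):
    print(f"{name:<24} ${cost:.3f}/run  score {score:.2f}")
```

In this invented data, "mid-model, 0 retries" falls off the frontier (a cheaper config scores higher), and dropping retries on the big model cuts cost by more than half for a one-point score drop: the kind of efficiency gain the chart makes obvious.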

4. The Causal View: Eval Heatmaps

Replaces: Drop-off Analysis

This is the holy grail. The single most valuable thing a funnel did was imply causality. “They dropped off at Step 2, therefore Step 2 is broken.”

In an agent, sequence ≠ causality. An agent might fail at the final step (generating an answer) not because the generation model is bad, but because the retrieval step (Step 1) fetched irrelevant documents. The failure was “poisoned” upstream.

To solve this, we need Correlation Heatmaps. We run “Intermediate Evals” on every step (e.g., Retrieval Precision, Plan Quality) and correlate them with the “Final Eval” (e.g., User Satisfaction).

A heatmap might reveal a bright red correlation between “Poor Retrieval” and “Poor Final Answer,” even if the retrieval step itself didn’t throw an error. This is your Root Cause Detector. It tells you which lever to pull to actually fix the outcome.
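The numbers behind such a heatmap cell are just correlations between per-step scores and the final score. A sketch, with invented eval data chosen so that retrieval quality drives the final answer while plan quality barely matters:

```python
# Sketch: correlate intermediate eval scores with the final eval score.
# Each row is one run: (retrieval_precision, plan_quality, final_score).
# All numbers are invented for illustration.
import math

runs = [
    (0.90, 0.8, 0.95),
    (0.20, 0.7, 0.30),
    (0.80, 0.6, 0.85),
    (0.30, 0.8, 0.40),
    (0.95, 0.7, 0.90),
    (0.10, 0.6, 0.20),
]

def pearson(xs: list, ys: list) -> float:
    # Pearson correlation coefficient, computed from scratch.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

final = [r[2] for r in runs]
for name, col in [("retrieval_precision", 0), ("plan_quality", 1)]:
    r = pearson([run[col] for run in runs], final)
    print(f"{name:>20} vs final: r = {r:+.2f}")
```

Here retrieval precision correlates almost perfectly with the final score while plan quality is weakly correlated: the "bright red cell" pointing at retrieval as the root cause, even though no step threw an error.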

P.S. We probably should have done this more in product analytics too: churn at a later step was often caused by flaws earlier in the journey!


The Toolkit Has Changed

The “Divide and Conquer” philosophy of the funnel is still valid. We still need to break big problems into small ones.

But we can no longer rely on the comforting illusion of a linear path. We have to get comfortable with the messiness of the loop.


I write this blog because I’m interested in agent analytics, and also because I want you to try my product, TwoTail!