For 15 years, SaaS founders sailed happily on the Pirate metrics ship.
AARRR (Acquisition, Activation, Retention, Referral, Revenue) was the perfect framework for the Web 2.0 era. It gave us a logical way to organize our dashboards because software was linear. A human landed on a page, clicked a button, and either converted or churned.
But as I build TwoTail.AI, I’m realizing that Pirates don’t make sense for Agents.
Agents don’t have linear funnels. They have loops. They don’t just “convert”; they reason. You can have high Retention (the agent keeps running) but zero Efficacy (it’s stuck in a hallucination loop). And you always pay a price (tokens).
If we want to measure the new paradigm, we need a new framework.
I call it the Dolphin Metrics (EEEE!).
Here is the dashboard structure I’m using to analyze Agents.
1. Engagement (Who?)
This is the only category that survives from the old world. Before we care how smart the agent is, we need to know if it’s being used.
- The Metrics: Session counts, Active Users (DAU/MAU), and Invocation Rate.
- The Question: Is the agent actually being called, or is it gathering dust?
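As a sketch of how these roll up, here is a minimal way to compute DAU and invocation rate from a session log. The log shape (`user_id`, date, invocation count) is a hypothetical schema for illustration, not TwoTail.AI's actual data model.

```python
from datetime import date

# Hypothetical session log: (user_id, day, invocations that day).
sessions = [
    ("u1", date(2024, 5, 1), 3),
    ("u2", date(2024, 5, 1), 1),
    ("u1", date(2024, 5, 2), 5),
]

def daily_active_users(sessions, day):
    """Count distinct users who invoked the agent on a given day."""
    return len({user for user, d, _ in sessions if d == day})

def invocation_rate(sessions, day):
    """Average invocations per active user on a given day."""
    counts = [n for _, d, n in sessions if d == day]
    return sum(counts) / len(counts) if counts else 0.0
```

The same pair of functions answers both questions at once: zero DAU means the agent is gathering dust; a DAU with a near-zero invocation rate means users tried it once and stopped.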
2. Execution (How?)
This is where traditional analytics breaks. We need to open the “Black Box” of the agent’s logic. We aren’t measuring clicks anymore; we are measuring the Trace.
- The Metrics: Latency per step, Tool usage frequency, ReAct loop depth (how many thought steps did it take?), and Error rates.
- The Question: Where are the bottlenecks? Did the agent take the scenic route to get to an answer? We are looking for “Reasoning Quality”—analyzing the path, not just the destination.
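To make "measuring the Trace" concrete, here is a minimal sketch over an assumed trace format: a list of steps, each tagged as a reasoning step or a tool call, with per-step latency. The field names are illustrative, not a standard.

```python
from collections import Counter

# Hypothetical agent trace: each step is a "thought" (reasoning)
# or a "tool" call, with latency in milliseconds.
trace = [
    {"kind": "thought", "latency_ms": 420},
    {"kind": "tool", "tool": "web_search", "latency_ms": 900},
    {"kind": "thought", "latency_ms": 380},
    {"kind": "tool", "tool": "web_search", "latency_ms": 850},
    {"kind": "thought", "latency_ms": 300},
]

def loop_depth(trace):
    """ReAct loop depth: how many thought steps the agent took."""
    return sum(1 for step in trace if step["kind"] == "thought")

def tool_frequency(trace):
    """How often each tool was called in this trace."""
    return Counter(step["tool"] for step in trace if step["kind"] == "tool")

def total_latency_ms(trace):
    """End-to-end latency: the sum of every step's latency."""
    return sum(step["latency_ms"] for step in trace)
```

Aggregating these per-trace numbers across sessions is what surfaces the "scenic routes": a rising average loop depth or a tool called twice per trace when once should do.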
3. Efficacy (What?)
Old software was binary: it worked or it crashed. Agents are probabilistic: they can mostly work, partially work, or confidently fail. “Efficacy” is the measurement of quality.
- The Metrics: Eval scores (e.g., “Hallucination Rate,” “Answer Relevance”), User Acceptance Rate (Did the user copy the code?), and CSAT.
- The Question: Did the agent actually solve the problem? This is arguably the most important metric, and the hardest to automate.
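The part that can be automated looks something like the sketch below: aggregating per-response eval records into an acceptance rate and a mean relevance score. The record shape (an LLM-judge relevance score plus a boolean "did the user accept the output") is an assumption for illustration.

```python
# Hypothetical eval records: one per agent response, with a judge's
# relevance score (0-1) and whether the user accepted the output
# (e.g. copied the code it produced).
evals = [
    {"relevance": 0.92, "accepted": True},
    {"relevance": 0.35, "accepted": False},
    {"relevance": 0.81, "accepted": True},
]

def acceptance_rate(evals):
    """Share of responses the user actually accepted."""
    return sum(e["accepted"] for e in evals) / len(evals)

def mean_relevance(evals):
    """Average judge-assigned relevance across responses."""
    return sum(e["relevance"] for e in evals) / len(evals)
```

The hard part isn't the arithmetic, it's trusting the inputs: acceptance is a noisy proxy, and judge scores need their own evals.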
4. Economics (How much?)
In SaaS, scaling a database was cheap. In the Agent world, intelligence costs money. Every “thought” burns tokens. A highly effective agent that costs $5.00 per query to run is a failed product.
- The Metrics: Cost per Session, Tokens per Task, Model Efficiency (GPT-4 vs Llama-3 routing).
- The Question: Is the value generated worth the compute cost?
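The cost math itself is simple once you log tokens per model call. A minimal sketch, with made-up model names and per-million-token prices (real prices vary by provider and change often):

```python
# Hypothetical per-1M-token prices; real provider pricing differs.
PRICE_PER_M = {
    "big-model":   {"input": 5.00, "output": 15.00},
    "small-model": {"input": 0.20, "output": 0.60},
}

def call_cost(model, input_tokens, output_tokens):
    """Dollar cost of a single model call."""
    p = PRICE_PER_M[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

def session_cost(calls):
    """Cost per Session: sum every model call in the session."""
    return sum(call_cost(m, i, o) for m, i, o in calls)

# One session that routed a hard step to the big model and a
# cheap step to the small one.
cost = session_cost([("big-model", 1_000, 500), ("small-model", 2_000, 1_000)])
```

Run this over every session and the routing question answers itself: if most calls land on the big model for tasks the small one handles, Cost per Session is where it shows up first.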
The New Dashboard
The next time you sit down to build an analytics view for your AI product, stop trying to force it into a Funnel.
Think like a Dolphin: Engagement shows you the users, Execution optimizes the workflow, Efficacy proves the value, Economics makes the business add up.
I’m Timothy Daniell, founder of TwoTail.AI, an analytics tool built for analyzing AI Agents. If you’re working on an agent, I’d love to talk to you. You can reach me here: https://www.linkedin.com/in/timothydaniell/