Here's how the features compose into a workflow. Each case below starts with raw traces and ends with a shipped change.
Case 01 · Marketing platform · ad-copy generation agent
A marketing platform ran a dozen evals against its ad-copy agent but had no idea which ones tracked real quality. TwoTail found the two that predicted whether a human editor approved the copy, and showed the rest were noise.
The team captured their terminology once: the quality dimensions they cared about (brand voice, factual accuracy, length) and the real business outcome, editor-approved, so every analysis spoke their language.
TwoTail grouped thousands of free-text eval reasons into themes on its own, surfacing the recurring failure modes: off-brand tone, unsupported claims, and bloated copy.
One question built the correlation matrix: which evals actually move with editor approval, and which just add noise to the scorecard.
Two evals predicted the outcome; the rest were dropped from the scorecard. The two survivors became the scoring set every future sandbox experiment runs against.
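As a rough illustration of that step, the sketch below correlates each eval's scores with the binary editor-approved outcome and keeps only the evals above a cutoff. It is not TwoTail's implementation: the eval names (`eval_a`, `eval_b`, `eval_c`), the scores, the approval labels, and the 0.5 cutoff are all hypothetical, and plain Pearson correlation is assumed as the measure.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no signal
    return cov / sqrt(var_x * var_y)

# Hypothetical data: 1 = editor approved the copy, 0 = rejected.
approved = [1, 1, 0, 1, 0, 0, 1, 0]

# Hypothetical eval scores for the same eight runs.
eval_scores = {
    "eval_a": [0.9, 0.8, 0.3, 0.7, 0.2, 0.4, 0.9, 0.1],
    "eval_b": [0.8, 0.9, 0.4, 0.9, 0.3, 0.2, 0.7, 0.3],
    "eval_c": [0.5, 0.4, 0.5, 0.6, 0.5, 0.4, 0.6, 0.5],
}

# Correlate every eval with the outcome (one column of the full matrix).
correlations = {name: pearson(scores, approved) for name, scores in eval_scores.items()}

THRESHOLD = 0.5  # assumed cutoff, chosen only for this example
kept = sorted(name for name, r in correlations.items() if abs(r) >= THRESHOLD)

for name, r in sorted(correlations.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: r = {r:+.2f}")
print("scoring set:", kept)
```

With real data the cutoff deserves more care than a fixed number: a small sample can produce a high correlation by chance, so checking the result on a held-out set of runs is a reasonable next step.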
From 12 evals down to the 2 that predict editor approval. Every experiment is now scored on what actually correlates with quality, not on metrics that were quietly measuring nothing.
Case 02 · Customer-support chat agent
A support agent's overall success rate looked healthy. Segmenting by what users actually asked for revealed one intent dragging the average down, and a retrieval gap underneath it.
Semantic clustering grouped raw conversations into the intents users actually arrived with: FAQ, how-to, billing, refund, and ambiguous. No manual tagging required.
The aggregate number hid the problem. Broken out by intent, one segment stood far below the rest.
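The breakdown itself is simple arithmetic. The sketch below computes an overall success rate and a per-intent rate from a list of labeled runs; the run records and their counts are hypothetical and exist only to show how a healthy-looking average can hide one weak segment.

```python
from collections import defaultdict

# Hypothetical runs as (intent, succeeded) pairs; counts are illustrative only.
runs = (
    [("faq", True)] * 4
    + [("how-to", True)] * 3 + [("how-to", False)]
    + [("billing", True)] * 3 + [("billing", False)]
    + [("refund", True)] + [("refund", False)] * 3
)

def success_rate_by_intent(records):
    """Return {intent: success rate} for a list of (intent, succeeded) pairs."""
    counts = defaultdict(lambda: [0, 0])  # intent -> [successes, total]
    for intent, succeeded in records:
        counts[intent][0] += int(succeeded)
        counts[intent][1] += 1
    return {intent: wins / total for intent, (wins, total) in counts.items()}

overall = sum(ok for _, ok in runs) / len(runs)
rates = success_rate_by_intent(runs)
weakest = min(rates, key=rates.get)

print(f"overall: {overall:.0%}")
for intent, rate in sorted(rates.items(), key=lambda kv: kv[1]):
    print(f"{intent}: {rate:.0%}")
print("weakest intent:", weakest)
```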
Pulling the refund-intent failures and clustering them against the successes exposed the structural difference: failing runs retrieved irrelevant documents; successful runs didn't.
The diagnosis came with a direction: add query rewriting for refund-shaped questions and a dedicated refund-lookup skill, then prove it in the sandbox before shipping.
The one intent dragging the average down, and the retrieval gap causing it, found in an afternoon, not a quarter of dashboard-staring.
Case 03 · Research-assistant startup · multi-step web-research agent
A research-assistant startup's agent calls an LLM at every step to choose its next source. They wanted to cut spend on that per-step call without dropping task completion, so they tested it offline before touching production.
Grouping spend by span and model pointed straight at the culprit: the per-step source-selection LLM call, run once for every hop, dominated the bill.
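A minimal sketch of that grouping, assuming each trace span is available as a record with a span name, a model name, and a cost. The field names (`span`, `model`, `cost_usd`), the span names, and the dollar amounts are hypothetical, not TwoTail's schema or the startup's data.

```python
from collections import defaultdict

# Hypothetical span records; field names and costs are illustrative only.
spans = [
    {"span": "source_selection", "model": "flagship", "cost_usd": 0.004},
    {"span": "source_selection", "model": "flagship", "cost_usd": 0.005},
    {"span": "source_selection", "model": "flagship", "cost_usd": 0.004},
    {"span": "summarize", "model": "lite", "cost_usd": 0.002},
    {"span": "final_answer", "model": "flagship", "cost_usd": 0.003},
]

def spend_by_span_and_model(records):
    """Sum cost per (span, model) pair, largest spender first."""
    totals = defaultdict(float)
    for record in records:
        totals[(record["span"], record["model"])] += record["cost_usd"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

ranked = spend_by_span_and_model(spans)
total = sum(cost for _, cost in ranked)

for (span, model), cost in ranked:
    print(f"{span:18s} {model:9s} ${cost:.3f}  {cost / total:.0%} of spend")
```

Because the step runs once per hop, its share of spend grows with the number of hops in a task, which is why it surfaces at the top of a grouping like this one.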
The sandbox replayed the real source-selection inputs against a cheaper model on the same axis, scored by the relevance eval already attached to every step. Nothing touched production.
The results put cost and completion rate side by side. The cheaper model held quality within a few points at a fraction of the spend.
| variant | completion | cost / run |
|---|---|---|
| baseline · flagship | 62% | $0.024 |
| lite model | 59% | $0.014 |
A follow-up run confirmed the lever held on routine queries, not just the expensive outliers, so the saving was real rather than a sampling artifact.
−42% cost on the source-selection step, completion within 3 points of baseline, shipped with the receipts to back the call.
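The headline numbers can be reproduced from the results table, assuming its cost / run column is the figure behind the cost claim. The variable names below are illustrative.

```python
# Figures taken from the results table above.
baseline = {"completion": 0.62, "cost_per_run": 0.024}  # baseline, flagship model
lite = {"completion": 0.59, "cost_per_run": 0.014}      # lite model

# Relative cost change, in percent (negative means cheaper).
cost_change_pct = (
    (lite["cost_per_run"] - baseline["cost_per_run"]) / baseline["cost_per_run"] * 100
)

# Completion gap, in percentage points.
completion_gap_points = (baseline["completion"] - lite["completion"]) * 100

print(f"cost change: {cost_change_pct:.1f}%")
print(f"completion gap: {completion_gap_points:.1f} points")
```

The cost change comes out at about −41.7%, which rounds to the −42% quoted, and the completion gap is 3 percentage points.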
Setup in 10 minutes. First insights within a week.