Case Studies · TwoTail

Case 01 · Marketing platform · ad-copy generation agent

Figuring out which evals actually matter.

A marketing platform ran a dozen evals against its ad-copy agent but had no idea which ones tracked real quality. TwoTail found the two that predicted whether a human editor approved the copy, and showed the rest were noise.

12 → 2 evals that actually predict editor approval

Vocabulary · Quick Question · Autonomy · Charts

Define what "good" actually means

The team captured their terminology once: the quality dimensions they cared about (brand voice, factual accuracy, length) and the real business outcome (editor-approved), so every analysis spoke their language.

feature: Vocabulary

vocabulary · captured once

brand_voice: on-brand tone & register

factual_accuracy: claims supported by the brief

length: within the channel's limit

editor_approved: outcome · shipped by a human editor

Cluster the eval rationales

TwoTail automatically grouped 4,182 free-text eval reasons into themes, surfacing the recurring failure modes: off-brand tone, unsupported claims, and bloated copy.

feature: Autonomy · clustering

failure-mode clusters · 4,182 eval reasons

Off-brand tone

Unsupported claims

Bloated copy

Correlate every eval against the outcome

One question built the correlation matrix: which evals actually move with editor approval, and which just add noise to the scorecard.

feature: Quick Question

eval correlation vs. editor-approved

brand_voice: 0.71

factual_accuracy: 0.63

keyword_coverage: 0.24

output_length: 0.04
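A correlation like the ones above can be sketched in a few lines. This is a minimal illustration, not TwoTail's implementation: it assumes each run is a record holding numeric eval scores plus a 0/1 `editor_approved` flag, and it uses plain Pearson correlation (which, against a binary outcome, is the point-biserial coefficient). The function names and the toy rows are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation; with a 0/1 outcome this is the point-biserial r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column cannot move with the outcome
    return cov / sqrt(var_x * var_y)

def correlate_evals(rows, eval_names, outcome="editor_approved"):
    """Correlate every named eval score against the binary outcome."""
    outcomes = [float(r[outcome]) for r in rows]
    return {name: pearson([r[name] for r in rows], outcomes) for name in eval_names}

# Hypothetical toy rows, not data from the case study.
rows = [
    {"brand_voice": 0.9, "output_length": 0.5, "editor_approved": 1},
    {"brand_voice": 0.8, "output_length": 0.2, "editor_approved": 1},
    {"brand_voice": 0.3, "output_length": 0.5, "editor_approved": 0},
    {"brand_voice": 0.2, "output_length": 0.2, "editor_approved": 0},
]
scores = correlate_evals(rows, ["brand_voice", "output_length"])
for name, r in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {r:.2f}")
```

In the toy data, `brand_voice` tracks approval closely while `output_length` does not move with it at all, which is the same signal-versus-noise split the scorecard step relies on.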

Keep the signal, drop the noise

Two evals predicted the outcome; the rest were dropped from the scorecard. The two survivors became the scoring set every future sandbox experiment runs against.

feature: Sandbox scoring

eval scorecard

✓ brand_voice

✓ factual_accuracy

✕ keyword_coverage

✕ output_length

+ 8 more dropped

outcome

From 12 evals down to the 2 that predict editor approval. Every experiment is now scored on what actually correlates with quality, not on metrics that were quietly measuring nothing.

Case 02 · Customer-support chat agent

Segment intent, measure success, find the root cause.

A support agent's overall success rate looked healthy. Segmenting by what users actually asked for revealed one intent dragging the average down, and a retrieval gap underneath it.

38% vs 82% resolution gap on one intent, found in an afternoon

Quick Question · Charts · Autonomy

Segment conversations by intent

Semantic clustering grouped raw conversations into the intents users actually arrived with: FAQ, how-to, billing, refund, and ambiguous. No manual tagging was required.

feature: Autonomy · clustering

conversation share by intent · 9,400 chats

faq: 31%

how-to: 24%

billing: 18%

refund: 14%

ambiguous: 13%

Measure success rate per intent

The aggregate number hid the problem. Broken out by intent, one segment stood far below the rest.

feature: Quick Question · Charts

resolution rate by intent

faq: 88%

how-to: 84%

billing: 79%

refund: 38%

other: 72%
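The per-intent breakdown is a simple group-by. The sketch below assumes each conversation already carries an `intent` label and a boolean `resolved` flag; both field names and the toy records are hypothetical, chosen only to show how a healthy-looking aggregate can hide one weak segment.

```python
from collections import defaultdict

def resolution_rate_by_intent(conversations):
    """Return {intent: share of conversations with resolved == True}."""
    totals = defaultdict(int)
    resolved = defaultdict(int)
    for c in conversations:
        totals[c["intent"]] += 1
        resolved[c["intent"]] += int(c["resolved"])
    return {intent: resolved[intent] / totals[intent] for intent in totals}

def overall_rate(conversations):
    """Aggregate resolution rate across all intents."""
    return sum(int(c["resolved"]) for c in conversations) / len(conversations)

# Hypothetical toy data: a large healthy segment and a small failing one.
conversations = (
    [{"intent": "faq", "resolved": True}] * 9
    + [{"intent": "faq", "resolved": False}] * 1
    + [{"intent": "refund", "resolved": True}] * 2
    + [{"intent": "refund", "resolved": False}] * 3
)

rates = resolution_rate_by_intent(conversations)
print(f"overall: {overall_rate(conversations):.0%}")
for intent, rate in sorted(rates.items(), key=lambda kv: kv[1]):
    print(f"{intent}: {rate:.0%}")
```

Sorting ascending puts the weakest intent first, so the segment worth investigating is the first line of the per-intent output.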

Cluster the failures, compare to the wins

Pulling the refund-intent failures and clustering them against the successes exposed the structural difference: failing runs retrieved irrelevant documents; successful runs did not.

feature: Quick Question

low retrieval precision (< 0.4)

68% of refund failures

12% of refund successes
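The comparison above reduces to one number per group: the share of runs whose retrieval precision falls under the 0.4 cutoff. A minimal sketch follows, assuming each run records the document IDs it retrieved and a set of IDs judged relevant; how relevance labels are obtained is not stated in the case study, so the `retrieved` and `relevant` fields are hypothetical.

```python
def retrieval_precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved documents that are relevant."""
    if not retrieved_ids:
        return 0.0  # nothing retrieved counts as zero precision here
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in retrieved_ids if doc_id in relevant)
    return hits / len(retrieved_ids)

def low_precision_share(runs, threshold=0.4):
    """Share of runs whose retrieval precision is below the threshold."""
    if not runs:
        return 0.0
    low = sum(
        1 for r in runs
        if retrieval_precision(r["retrieved"], r["relevant"]) < threshold
    )
    return low / len(runs)

# Hypothetical toy runs, split by whether the conversation was resolved.
failures = [
    {"retrieved": ["a", "b", "c", "d"], "relevant": ["a"]},  # precision 0.25
    {"retrieved": ["a", "b"], "relevant": ["a", "b"]},       # precision 1.0
]
successes = [
    {"retrieved": ["a", "b"], "relevant": ["a"]},            # precision 0.5
]

failure_share = low_precision_share(failures)
success_share = low_precision_share(successes)
print(f"low precision among failures: {failure_share:.0%}")
print(f"low precision among successes: {success_share:.0%}")
```

A large gap between the two shares is what points to retrieval, rather than generation, as the place to intervene.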

Hand back a change worth testing

The diagnosis came with a direction: add query rewriting for refund-shaped questions and a dedicated refund-lookup skill, then prove it in the sandbox before shipping.

feature: Sandbox

change: Add query rewriting for refund-shaped questions and a dedicated refund-lookup skill. Validate in the sandbox before it touches prod.

outcome

The one intent dragging the average down, and the retrieval gap causing it, found in an afternoon, not a quarter of dashboard-staring.

Case 03 · Research-assistant startup · multi-step web-research agent

Experimenting to balance cost and quality.

A research-assistant startup's agent calls an LLM at every step to choose its next source. They wanted to cut spend on that per-step call without dropping task completion, so they tested it offline before touching production.

−42% cost on the per-step call, completion within 3 points

Quick Question · Sandbox · Experiments

Find the cost driver

Grouping spend by span and model pointed straight at the culprit: the per-step source-selection LLM call, run once for every hop, dominated the bill.

feature: Quick Question

spend by span · last 7 days

source-selection: 71%

synthesis: 19%

other: 10%
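Grouping spend by span and model can be sketched as below. The record shape (`span`, `model`, `cost_usd`) and the toy values are assumptions for illustration, not the trace schema from the case study.

```python
from collections import defaultdict

def spend_share(spans):
    """Return each (span, model) pair's share of total spend, largest first."""
    totals = defaultdict(float)
    for s in spans:
        totals[(s["span"], s["model"])] += s["cost_usd"]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return {key: cost / grand_total for key, cost in ranked}

# Hypothetical toy spans: one call per hop for source selection.
spans = [
    {"span": "source-selection", "model": "flagship", "cost_usd": 0.024},
    {"span": "source-selection", "model": "flagship", "cost_usd": 0.024},
    {"span": "synthesis", "model": "flagship", "cost_usd": 0.012},
    {"span": "other", "model": "lite", "cost_usd": 0.004},
]

shares = spend_share(spans)
top_key = next(iter(shares))  # dicts keep insertion order, so this is the biggest
for (span, model), share in shares.items():
    print(f"{span} ({model}): {share:.0%}")
```

Because the call runs once per hop, its count scales with task length, which is why it can dominate even when the per-call price looks small.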

Test a cheaper model offline

The sandbox replayed the real source-selection inputs against a cheaper model, varying only the model axis, and scored each output with the relevance eval already attached to every step. Nothing touched production.

feature: Sandbox

sandbox run · axis: model

base: flagship, $0.024 / run

variant: lite model, $0.014 / run

replaying 200 real source-selection inputs

Compare the trade-off

The results put cost and completion rate side by side. The cheaper model held completion within 3 points of the baseline at roughly 42% lower cost per run.

feature: Experiments

source-selection variants · relevance-scored

variant	completion	cost / run
baseline · flagship	62%	$0.024
lite model	59%	$0.014
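The headline figures follow directly from the two rows in the table. The short check below uses only those numbers; the variable and function names are illustrative.

```python
def relative_change(base, variant):
    """Change of variant relative to base, as a fraction of base."""
    return (variant - base) / base

# Figures taken from the table above.
baseline = {"completion": 0.62, "cost_per_run": 0.024}
lite = {"completion": 0.59, "cost_per_run": 0.014}

cost_change = relative_change(baseline["cost_per_run"], lite["cost_per_run"])
completion_delta_points = (lite["completion"] - baseline["completion"]) * 100

print(f"cost change: {cost_change:+.0%}")                          # -42%
print(f"completion delta: {completion_delta_points:+.0f} points")  # -3 points
```

Note the two different units: cost is compared as a relative change, while completion is compared in percentage points, which is how the summary states them.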

Validate on a random sample

A follow-up run confirmed the lever held on routine queries, not just the expensive outliers, so the saving was real rather than a sampling artifact.

feature: Experiments

random sample · 200 runs

lite model completion: 61%

baseline completion: 63%

outcome

−42% cost on the source-selection step, completion within 3 points of baseline, shipped with the receipts to back the call.

What would TwoTail find in your traces?