Real Problems, Solved End to End

Each of these starts with raw traces and ends with a shipped change. Here's how the features compose into a workflow.

Case 01 · Marketing platform · ad-copy generation agent

Figuring out which evals actually matter.

A marketing platform ran a dozen evals against its ad-copy agent but had no idea which ones tracked real quality. TwoTail found the two that predicted whether a human editor approved the copy, and showed the rest were noise.

12 → 2 evals that actually predict editor approval
Step 1 · Define what "good" actually means

The team captured their terminology once: the quality dimensions they cared about (brand voice, factual accuracy, length) and the real business outcome, editor-approved, so every analysis spoke their language.

Feature: Vocabulary
vocabulary · captured once
brand_voice: on-brand tone & register
factual_accuracy: claims supported by the brief
length: within the channel's limit
editor_approved: outcome · shipped by a human editor
Step 2 · Cluster the eval rationales

TwoTail grouped thousands of free-text eval reasons into themes on its own, surfacing the recurring failure modes: off-brand tone, unsupported claims, and bloated copy.

failure-mode clusters · 4,182 eval reasons
Off-brand tone
Unsupported claims
Bloated copy
Step 3 · Correlate every eval against the outcome

One question built the correlation matrix: which evals actually move with editor approval, and which just add noise to the scorecard.

eval correlation vs. editor-approved
brand_voice: 0.71
factual_accuracy: 0.63
keyword_coverage: 0.24
output_length: 0.04
Step 4 · Keep the signal, drop the noise

Two evals predicted the outcome; the rest were dropped from the scorecard. The two survivors became the scoring set every future sandbox experiment runs against.

eval scorecard
brand_voice: kept
factual_accuracy: kept
keyword_coverage: dropped
output_length: dropped
+ 8 more dropped
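
One simple way to turn correlations into a keep/drop decision is a cutoff on the absolute value. The sketch below applies that idea to the four correlations reported in this case study. The case study does not state the rule that was actually used, so the 0.5 cutoff and the function name are assumptions for illustration.

```python
# Correlations with editor_approved, as reported in this case study.
correlations = {
    "brand_voice": 0.71,
    "factual_accuracy": 0.63,
    "keyword_coverage": 0.24,
    "output_length": 0.04,
}

MIN_ABS_R = 0.5  # assumed cutoff, chosen only for illustration


def split_scorecard(correlations, min_abs_r=MIN_ABS_R):
    """Split eval names into (kept, dropped) by absolute correlation."""
    kept = [name for name, r in correlations.items() if abs(r) >= min_abs_r]
    dropped = [name for name in correlations if name not in kept]
    return kept, dropped


kept, dropped = split_scorecard(correlations)
print("kept:", kept)
print("dropped:", dropped)
```

Any cutoff between 0.24 and 0.63 gives the same split on these four evals, so the result is not sensitive to the exact threshold here.
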
Outcome: From 12 evals down to the 2 that predict editor approval. Every experiment is now scored on what actually correlates with quality, not on metrics that were quietly measuring nothing.

Case 02 · Customer-support chat agent

Segment intent, measure success, find the root cause.

A support agent's overall success rate looked healthy. Segmenting by what users actually asked for revealed one intent dragging the average down, and a retrieval gap underneath it.

38% vs 82% resolution gap on one intent, found in an afternoon
Step 1 · Segment conversations by intent

Semantic clustering grouped raw conversations into the intents users actually arrived with: FAQ, how-to, billing, refund, and ambiguous. No manual tagging was required.

conversation share by intent · 9,400 chats
faq: 31%
how-to: 24%
billing: 18%
refund: 14%
ambiguous: 13%
Step 2 · Measure success rate per intent

The aggregate number hid the problem. Broken out by intent, one segment stood far below the rest.

resolution rate by intent
faq: 88%
how-to: 84%
billing: 79%
refund: 38%
other: 72%
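
To see how an aggregate can hide one weak segment, the sketch below blends the per-intent resolution rates using the conversation shares from the earlier panel. It assumes the "other" rate corresponds to the "ambiguous" share, which the page does not confirm, and the function name is hypothetical.

```python
# Share of conversations and resolution rate per intent, from the two panels.
# Assumption: the "other" rate belongs to the "ambiguous" share.
segments = {
    "faq": {"share": 0.31, "resolution": 0.88},
    "how-to": {"share": 0.24, "resolution": 0.84},
    "billing": {"share": 0.18, "resolution": 0.79},
    "refund": {"share": 0.14, "resolution": 0.38},
    "ambiguous": {"share": 0.13, "resolution": 0.72},
}


def aggregate_rate(segments, exclude=()):
    """Share-weighted resolution rate over the segments not excluded."""
    kept = {k: v for k, v in segments.items() if k not in exclude}
    total_share = sum(v["share"] for v in kept.values())
    resolved = sum(v["share"] * v["resolution"] for v in kept.values())
    return resolved / total_share


overall = aggregate_rate(segments)
without_refund = aggregate_rate(segments, exclude=("refund",))
print(f"overall: {overall:.1%}")
print(f"excluding refund: {without_refund:.1%}")
```

Under these assumptions the blended rate comes out at roughly 76%, a number that looks unremarkable until refund is broken out on its own.
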
Step 3 · Cluster the failures, compare to the wins

Pulling the refund-intent failures and clustering them against the successes exposed the structural difference: failing runs retrieved irrelevant documents; the successful ones didn't.

low retrieval precision (< 0.4)
68% of refund failures
12% of refund successes
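
The comparison in this panel can be sketched as follows: compute retrieval precision per run (relevant documents retrieved divided by documents retrieved), then measure what share of failures and of successes fall below the 0.4 threshold. The `Run` structure, the document ids, and the relevance labels are hypothetical; how TwoTail derives precision is not described here.

```python
from dataclasses import dataclass


@dataclass
class Run:
    resolved: bool        # did the conversation end resolved?
    retrieved: list[str]  # document ids returned by retrieval
    relevant: set[str]    # document ids judged relevant (hypothetical labels)


def precision(run: Run) -> float:
    """Fraction of retrieved documents that are relevant."""
    if not run.retrieved:
        return 0.0
    hits = sum(1 for doc in run.retrieved if doc in run.relevant)
    return hits / len(run.retrieved)


def low_precision_share(runs, threshold=0.4):
    """Share of runs whose retrieval precision is below the threshold."""
    if not runs:
        return 0.0
    return sum(1 for r in runs if precision(r) < threshold) / len(runs)


# Toy refund-intent runs, for illustration only.
runs = [
    Run(False, ["a", "b", "c", "d"], {"a"}),      # precision 0.25
    Run(False, ["a", "b", "c"], set()),           # precision 0.0
    Run(False, ["a", "b"], {"a", "b"}),           # precision 1.0
    Run(True, ["a", "b"], {"a"}),                 # precision 0.5
    Run(True, ["a", "b", "c"], {"a", "b", "c"}),  # precision 1.0
]

fail_share = low_precision_share([r for r in runs if not r.resolved])
success_share = low_precision_share([r for r in runs if r.resolved])
print(f"low precision among failures: {fail_share:.0%}")
print(f"low precision among successes: {success_share:.0%}")
```
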
Step 4 · Hand back a change worth testing

The diagnosis came with a direction: add query rewriting for refund-shaped questions and a dedicated refund-lookup skill, then prove it in the sandbox before shipping.

Feature: Sandbox
Change: Add query rewriting for refund-shaped questions and a dedicated refund-lookup skill. Validate in the sandbox before it touches prod.
Outcome: The one intent dragging the average down, and the retrieval gap behind it, were found in an afternoon rather than a quarter of dashboard-staring.

Case 03 · Research-assistant startup · multi-step web-research agent

Experimenting to balance cost and quality.

A research-assistant startup's agent calls an LLM at every step to choose its next source. They wanted to cut spend on that per-step call without dropping task completion, so they tested it offline before touching production.

−42% cost on the per-step call, completion within 3 points
Step 1 · Find the cost driver

Grouping spend by span and model pointed straight at the culprit: the per-step source-selection LLM call, run once for every hop, dominated the bill.

spend by span · last 7 days
source-selection: 71%
synthesis: 19%
other: 10%
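
Grouping spend by span and model amounts to a sum per (span, model) pair followed by a share of the total. A minimal sketch, assuming span records with hypothetical `span`, `model`, and `cost_usd` fields and made-up costs:

```python
from collections import defaultdict

# Hypothetical span records; field names and costs are illustrative only.
spans = [
    {"span": "source-selection", "model": "flagship", "cost_usd": 0.008},
    {"span": "source-selection", "model": "flagship", "cost_usd": 0.008},
    {"span": "source-selection", "model": "flagship", "cost_usd": 0.008},
    {"span": "synthesis", "model": "flagship", "cost_usd": 0.004},
    {"span": "other", "model": "lite", "cost_usd": 0.002},
]


def spend_share(spans, keys=("span", "model")):
    """Return {group: share of total spend}, largest group first."""
    totals = defaultdict(float)
    for record in spans:
        totals[tuple(record[k] for k in keys)] += record["cost_usd"]
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda kv: -kv[1])
    return {group: cost / grand_total for group, cost in ranked}


shares = spend_share(spans)
for group, share in shares.items():
    print(f"{' / '.join(group)}: {share:.0%}")
```

Because the per-step call runs once per hop, it contributes many small records, which is why summing by span exposes it even when no single call looks expensive.
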
Step 2 · Test a cheaper model offline

The sandbox replayed the real source-selection inputs against a cheaper model, varying only the model axis, and scored both variants with the relevance eval already attached to every step. Nothing touched production.

Feature: Sandbox
sandbox run · axis: model
base: flagship, $0.024 / run
variant: lite model, $0.014 / run
replaying 200 real source-selection inputs
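
The shape of an offline replay can be sketched as a loop that feeds the same recorded inputs to each variant and scores every output with the same eval. Everything below is a stand-in: the stub functions replace real model calls, the relevance labels are invented, and none of the names reflect TwoTail's actual sandbox API. The scores it prints say nothing about real model quality.

```python
# Hypothetical recorded inputs: candidate sources with a relevance label each.
recorded_inputs = [
    {"candidates": {"paper": 0.9, "forum": 0.3}},
    {"candidates": {"blog": 0.4, "docs": 0.8}},
    {"candidates": {"wiki": 0.7, "news": 0.6}},
]


# Stand-ins for the two models; a real run would call the LLMs instead.
def flagship_stub(step):
    return max(step["candidates"], key=step["candidates"].get)


def lite_stub(step):
    return sorted(step["candidates"])[0]  # naive pick: first name alphabetically


def relevance_eval(step, chosen):
    """Score a choice by its (hypothetical) relevance label."""
    return step["candidates"][chosen]


def replay(inputs, variants, score):
    """Run every variant over the same inputs; only the model changes."""
    report = {}
    for name, (model, cost_per_run) in variants.items():
        scores = [score(step, model(step)) for step in inputs]
        report[name] = {
            "mean_relevance": sum(scores) / len(scores),
            "cost_per_run": cost_per_run,
        }
    return report


variants = {
    "base": (flagship_stub, 0.024),   # per-run costs as listed in the panel
    "variant": (lite_stub, 0.014),
}
report = replay(recorded_inputs, variants, relevance_eval)
for name, row in report.items():
    print(name, row)
```

Holding the inputs and the eval fixed is what makes the comparison fair: any difference in score is attributable to the one axis that changed.
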
Step 3 · Compare the trade-off

The results put cost and completion rate side by side. The cheaper model held completion within 3 points at about 42% lower cost per run.

source-selection variants · relevance-scored
variant              completion   cost / run
baseline · flagship  62%          $0.024
lite model           59%          $0.014
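
The headline figures follow directly from this table. A short worked calculation, using only the numbers shown (the dictionary layout and function name are illustrative):

```python
# Completion and cost per run for each variant, taken from the table.
variants = {
    "baseline": {"completion": 0.62, "cost_per_run": 0.024},
    "lite": {"completion": 0.59, "cost_per_run": 0.014},
}


def compare(base, variant):
    """Return (relative cost change, completion change in percentage points)."""
    cost_change = (variant["cost_per_run"] - base["cost_per_run"]) / base["cost_per_run"]
    completion_delta_pts = (variant["completion"] - base["completion"]) * 100
    return cost_change, completion_delta_pts


cost_change, delta_pts = compare(variants["baseline"], variants["lite"])
print(f"cost change: {cost_change:+.0%}")
print(f"completion change: {delta_pts:+.0f} points")
```

The cost change is (0.014 − 0.024) / 0.024, about −42%, and the completion change is 59 − 62, or −3 points.
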
Step 4 · Validate on a random sample

A follow-up run confirmed the lever held on routine queries, not just the expensive outliers, so the saving was real rather than a sampling artifact.

random sample · 200 runs
lite model completion: 61%
baseline completion: 63%
Outcome: −42% cost on the source-selection step, completion within 3 points of baseline, shipped with the receipts to back the call.

What would TwoTail find in your traces?

Setup in 10 minutes. First insights within a week.