
From Raw Traces to Shipped Improvements

TwoTail sits on top of your agent observability. Across three stages, it reads your traces, finds what's wrong, and proves the change before you ship it.

01 / Observability

Every run, fully reconstructed.

Point your OpenTelemetry exporter at TwoTail and every agent run becomes a structured, queryable trace. No SDK, no code changes.

Traces

See exactly what your agent did.

Every span, in order, with timing, tokens, cost, inputs, and outputs. Expand any step to see the prompt that ran and the response that came back, all the way down to the leaf LLM call.

  • Waterfall view reconstructs the full call tree: agent runs, tool calls, LLM calls, and evals.
  • Per-span cost & tokens attributed to the exact leaf call that spent them.
  • Inputs & outputs on every span: the exact prompt that ran and the response that came back.
trace · 0e1f…a42
spans 7 · duration 2.34s · tokens 1,284 · cost $0.0042
run · agent.run · 2.34s
tool · tool.search · 340ms
llm · llm.gpt-4o · 2.10s
tool · tool.fetch_doc · 120ms
llm · eval.judge · 680ms
span llm.gpt-4o · status ok · tokens 1,180 · cost $0.0039
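
To make the attribution rule concrete, here is a minimal Python sketch of a span tree where tokens and cost sit on the leaf call that spent them and are rolled up on demand. The `Span` class, its field names, and the flat nesting under `agent.run` are illustrative assumptions, not TwoTail's schema. The `eval.judge` figures are also assumed, picked so the totals match the trace header.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One node in a trace. Field names are illustrative, not TwoTail's schema."""
    name: str
    kind: str                 # "run", "tool", or "llm"
    duration_ms: int
    tokens: int = 0           # tokens spent by this span itself
    cost_usd: float = 0.0     # cost spent by this span itself
    children: list["Span"] = field(default_factory=list)


def rollup(span: Span) -> tuple[int, float]:
    """Return (tokens, cost) for a span plus everything beneath it."""
    tokens, cost = span.tokens, span.cost_usd
    for child in span.children:
        child_tokens, child_cost = rollup(child)
        tokens += child_tokens
        cost += child_cost
    return tokens, cost


trace = Span("agent.run", "run", 2340, children=[
    Span("tool.search", "tool", 340),
    Span("llm.gpt-4o", "llm", 2100, tokens=1180, cost_usd=0.0039),
    Span("tool.fetch_doc", "tool", 120),
    Span("eval.judge", "llm", 680, tokens=104, cost_usd=0.0003),  # assumed values
])

total_tokens, total_cost = rollup(trace)
print(total_tokens, round(total_cost, 4))
```

Because only leaf calls carry their own spend, a parent's total is always the sum of what sits beneath it, so nothing gets counted twice.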
Connect

Connect once. No SDK.

Point your existing OpenTelemetry exporter at TwoTail and you're done. If your framework already emits OTel spans, there's nothing to install and no agent code to change.

  • OTel-native (OTLP). Send the spans you already produce by setting two environment variables.
  • Works with LangChain, LlamaIndex, CrewAI, the Vercel AI SDK, or custom spans.
  • Live the moment you point the exporter at TwoTail, with no waiting on a backfill.
.env
OTEL_EXPORTER_OTLP_ENDPOINT=https://in.twotail.ai
OTEL_EXPORTER_OTLP_HEADERS=x-twotail-key=sk_live_…
receiving spans · 14.2k today
LangChain · LlamaIndex · CrewAI · Vercel AI SDK · OpenAI Agents · custom
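
If you want to sanity-check the two variables before starting your agent, the sketch below shows how an OTLP exporter typically reads them: the headers variable is a comma-separated list of `key=value` pairs, and for OTLP over HTTP the exporter appends `/v1/traces` to the base endpoint. The helper names are hypothetical, the key is a placeholder, and whether TwoTail ingests over HTTP or gRPC is not stated on this page, so treat the URL step as an assumption.

```python
def parse_otlp_headers(raw: str) -> dict[str, str]:
    """Parse the comma-separated key=value list used by OTEL_EXPORTER_OTLP_HEADERS."""
    headers = {}
    for pair in raw.split(","):
        if not pair.strip():
            continue
        # Split on the first "=" only, so values may themselves contain "=".
        key, _, value = pair.partition("=")
        headers[key.strip()] = value.strip()
    return headers


def traces_url(endpoint: str) -> str:
    """Build the per-signal URL an OTLP/HTTP exporter derives from the base endpoint."""
    return endpoint.rstrip("/") + "/v1/traces"


env = {
    "OTEL_EXPORTER_OTLP_ENDPOINT": "https://in.twotail.ai",
    "OTEL_EXPORTER_OTLP_HEADERS": "x-twotail-key=<your-key>",  # placeholder, not a real key
}

print(traces_url(env["OTEL_EXPORTER_OTLP_ENDPOINT"]))
print(parse_otlp_headers(env["OTEL_EXPORTER_OTLP_HEADERS"]))
```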

02 / Analytics

Ask anything. It investigates the rest.

Ask questions in plain English, teach it your product, and let it investigate on its own. The analysis comes back as a diagnosis, not another dashboard.

Autonomy

An analyst that never clocks out.

TwoTail investigates on its own, working through research-backed playbooks that experienced data teams use: failure clustering, latency decomposition, and cost attribution. It queues the work, runs it, and sends you a diagnosis when something's worth your attention.

  • Battle-tested playbooks grounded in analytics research, run automatically with no query to write.
  • Always-on across thousands of runs, catching regressions, drift, and weak segments before you do.
  • Hands you a diagnosis with evidence, not another wall of charts to read.
/autonomy · live
done · 2 min ago: Analyze eval trends across releases
running · embedding 14k traces: Semantic clustering of low-performing traces
queued · in 30s: Test prompt improvement in the sandbox
playbooks: failure clustering · latency decomposition · cost attribution
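
For a feel of what one of these playbooks does, here is a minimal sketch of one simple reading of latency decomposition: sum span time by kind and report each kind's share. The durations are copied from the example trace earlier on this page. The function name and the grouping by kind are assumptions, and TwoTail's own playbook may work differently.

```python
from collections import defaultdict

# (kind, name, duration in ms), copied from the example trace on this page
SPANS = [
    ("tool", "tool.search", 340),
    ("llm", "llm.gpt-4o", 2100),
    ("tool", "tool.fetch_doc", 120),
    ("llm", "eval.judge", 680),
]


def latency_share_by_kind(spans):
    """Sum span durations per kind and return each kind's share of the summed time."""
    totals = defaultdict(int)
    for kind, _name, duration_ms in spans:
        totals[kind] += duration_ms
    grand_total = sum(totals.values())
    return {kind: ms / grand_total for kind, ms in totals.items()}


shares = latency_share_by_kind(SPANS)
for kind, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{kind}: {share:.0%}")
```

Shares are taken over summed span time rather than wall-clock time, because spans in a run can overlap.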
Vocabulary

Teach it your language.

Tell TwoTail your strategy, your terminology, and what success means. Every answer comes back in your terms, not generic LLM-speak, and the data structure is auto-generated from your own traces.

  • Strategy & terminology define your goals, KPIs, and domain concepts once.
  • User segments like power_user become reusable across every analysis.
  • Data structure refreshes itself from your live span hierarchy.
vocabulary
tabs: Strategy · Terminology · Data Structure
power_user: ≥ 3 sessions / week, > 5 messages each
resolved: ticket closed without human escalation
success: eval.resolution_judge ≥ 0.7 on the final turn
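
Definitions like these are really small predicates over your data. A minimal sketch, assuming `power_user` means at least three sessions in the week that each have more than five messages, and that `success` looks only at the last turn's judge score. The function names and input shapes are hypothetical.

```python
def is_power_user(messages_per_session: list[int]) -> bool:
    """messages_per_session holds the message count of each session in one week."""
    qualifying = [count for count in messages_per_session if count > 5]
    return len(qualifying) >= 3


def is_success(turn_scores: list[float], threshold: float = 0.7) -> bool:
    """turn_scores holds the eval.resolution_judge score of each turn, in order."""
    return bool(turn_scores) and turn_scores[-1] >= threshold


print(is_power_user([6, 7, 9]))   # three sessions, each over five messages
print(is_success([0.2, 0.8]))     # final turn clears the 0.7 threshold
```

Defining a term once this way is what lets the same segment be reused across every analysis without being redefined per query.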
Quick Question

Ask in plain English.

Type a question. TwoTail writes the SQL, runs it against your traces, picks the right chart, and hands back a headline you can act on. No query language, no dashboard building.

  • Question to insight in one step: SQL, data, visualization, and a written takeaway.
  • Every answer is backed by the rows it ran on, so you can verify it.
analyze
Which intents have the lowest resolution rate?
SELECT intent, AVG(resolved) FROM runs GROUP BY intent · 1,240 rows

Answer

Refund requests resolve 38% of the time, far below the 82% average.

faq · 88%
how-to · 81%
billing · 76%
refund · 38%
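
You can reproduce the shape of this query yourself. The sketch below runs the same `SELECT` against an in-memory SQLite table with a handful of toy rows. The `runs` schema, the rows, and the added `ORDER BY` are assumptions for illustration, not TwoTail's storage or your data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (intent TEXT, resolved INTEGER)")

# Toy rows: 1 = resolved, 0 = not resolved
toy_rows = [
    ("faq", 1), ("faq", 1), ("faq", 1), ("faq", 0),
    ("refund", 1), ("refund", 0), ("refund", 0), ("refund", 0),
]
conn.executemany("INSERT INTO runs VALUES (?, ?)", toy_rows)

# Same aggregate as the query above, sorted so the weakest intent comes first
query = """
    SELECT intent, AVG(resolved) AS resolution_rate
    FROM runs
    GROUP BY intent
    ORDER BY resolution_rate
"""
result = conn.execute(query).fetchall()

for intent, rate in result:
    print(f"{intent}: {rate:.0%}")
```

Since `resolved` is stored as 0 or 1, `AVG(resolved)` is the resolution rate for each intent.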
Charts

The right chart, automatically.

Bar, line, box plot, radar, clustering, single-metric, A/B. TwoTail auto-detects the right visualization for the question, and you can pin any chart to a dashboard to watch it over time.

  • Auto-detected chart types, or override the choice yourself.
  • Dashboards collect the charts that matter so you can share them with your team.
dashboard
success rate / week
runs by intent
cost / model
latency p50 / p95

03 / Optimization

Test it offline, then prove it in production.

A finding is only half the job. TwoTail curates the test set, replays your real traces against variants offline, then measures the winner on live runs, so every change you ship is grounded in evidence.

Datasets

The ground truth behind every test.

Group traces into datasets you can label by hand, use to calibrate your LLM judges against real human verdicts, and pin as the test set for sandbox experiments. One curated set, trusted everywhere.

  • Human annotation to build ground truth, one labeled trace at a time.
  • Judge calibration: align your LLM-as-judge evals to the human labels so the scores hold up.
  • Curated test sets the sandbox replays, so every experiment runs on the same trusted inputs.
dataset · refund-intent
refund-intent · golden set · 120 traces · human-labeled
trace 0e1f…a42 · resolved
trace 7a2c…9d1 · escalated
trace 9b41…c08 · resolved
judge ↔ human agreement: 0.91
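
This page does not say how the agreement score is computed, so the sketch below shows two common candidates: raw agreement (the fraction of traces where judge and human match) and Cohen's kappa, which corrects for agreement expected by chance. The labels are toy data and the function names are hypothetical.

```python
from collections import Counter


def agreement_rate(human: list[str], judge: list[str]) -> float:
    """Fraction of traces where the judge's label equals the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(h == j for h, j in zip(human, judge))
    return matches / len(human)


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = len(human)
    observed = agreement_rate(human, judge)
    human_counts, judge_counts = Counter(human), Counter(judge)
    expected = sum(
        human_counts[label] * judge_counts[label] for label in human_counts
    ) / (n * n)
    if expected == 1.0:  # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy labels, not real TwoTail data
human = ["resolved", "escalated", "resolved", "resolved"]
judge = ["resolved", "escalated", "escalated", "resolved"]

print(agreement_rate(human, judge), cohens_kappa(human, judge))
```

Kappa is the stricter of the two when one label dominates, since a judge that always answers "resolved" can score high raw agreement by chance.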
Sandbox

Experiment against real traces, offline.

Pick a task from production, define your variants, choose which evals to score, and run it against a sample of real inputs. Nothing touches production, so there is no risk and no waiting for new data.

  • Replays real inputs sampled randomly, by eval score, or by specific span.
  • Test a prompt change or a model swap against the same inputs and evals.
  • Returns a written recommendation you can paste straight into a PR.
new sandbox run
1 Type · 2 Task · 3 Variants · 4 Evals · 5 Inputs · 6 Review

Define your variants

base: billing-support prompt
variant A: + 3 few-shot examples
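
The sampling step is easy to picture in code. A minimal sketch of two of the strategies mentioned above, assuming "by eval score" means replaying the lowest-scoring traces first. The trace records, the `score` field, and the function name are hypothetical.

```python
import random


def sample_inputs(traces, n, strategy="random", seed=0):
    """Pick up to n traces to replay, either at random or lowest eval score first."""
    if strategy == "random":
        # A seeded generator keeps the sample identical for every variant
        return random.Random(seed).sample(traces, min(n, len(traces)))
    if strategy == "lowest_score":
        return sorted(traces, key=lambda trace: trace["score"])[:n]
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical trace records
traces = [
    {"id": "t1", "score": 0.9},
    {"id": "t2", "score": 0.3},
    {"id": "t3", "score": 0.7},
    {"id": "t4", "score": 0.1},
    {"id": "t5", "score": 0.5},
]

worst_two = sample_inputs(traces, 2, strategy="lowest_score")
print([trace["id"] for trace in worst_two])
```

Fixing the seed matters more than the strategy: every variant has to be scored on the same inputs for the comparison to mean anything.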
Experiments

Measure your changes in production.

Roll a change out to a fraction of your live runs and let TwoTail measure it against the control on your real metrics, with statistical significance, so you keep what genuinely moves the needle and roll back what doesn't.

  • Live rollout to a fraction of runs, control vs variant, on the success metric you care about.
  • Significance-tested with 95% confidence intervals, so you ship on signal, not noise.
production A/B · live
P-value: 0.003
Result (α=0.05): Significant
resolution rate: 71% baseline vs 84% + few-shot
ship: + few-shot won in production: +13pp resolution (71% to 84%) at p=0.003. Roll out.
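
This page does not state which significance test TwoTail runs. One common choice for comparing two success rates is a two-proportion z-test, sketched below with a 95% confidence interval on the difference. The sample sizes (200 runs per arm) are hypothetical, so the p-value it prints will not match the 0.003 shown above.

```python
import math


def two_proportion_test(success_a, n_a, success_b, n_b, z_crit=1.96):
    """Return (difference, two-sided p-value, 95% CI) for rate_b minus rate_a."""
    rate_a = success_a / n_a
    rate_b = success_b / n_b
    diff = rate_b - rate_a

    # Pooled standard error for the test statistic under "no difference"
    pooled = (success_a + success_b) / (n_a + n_b)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se_pooled
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail

    # Unpooled standard error for the confidence interval on the difference
    se_unpooled = math.sqrt(
        rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b
    )
    ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)
    return diff, p_value, ci


# Hypothetical counts: 142/200 = 71% baseline, 168/200 = 84% variant
diff, p_value, ci = two_proportion_test(142, 200, 168, 200)
print(f"diff={diff:+.2f}  p={p_value:.4f}  95% CI=({ci[0]:+.3f}, {ci[1]:+.3f})")
```

A result counts as significant at α=0.05 when the p-value is below 0.05, which for this test generally coincides with the 95% interval excluding zero.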

See it on your own traces.

Setup in 10 minutes. First insights within a week.