Features · TwoTail

01 / Observability

Every run, fully reconstructed.

Point your OpenTelemetry exporter at TwoTail and every agent run becomes a structured, queryable trace. No SDK, no code changes.

Traces

See exactly what your agent did.

Every span, in order, with timing, tokens, cost, inputs, and outputs. Expand any step to see the prompt that ran and the response that came back, all the way down to the leaf LLM call.

Waterfall view reconstructs the full call tree: agent runs, tool calls, LLM calls, and evals.
Per-span cost & tokens attributed to the exact leaf call that spent them.
Inputs & outputs on every span: the exact prompt that ran and the response that came back.

trace · 0e1f…a42

spans 7 · duration 2.34s · tokens 1,284 · cost $0.0042

run · agent.run · 2.34s
tool · tool.search · 340ms
llm · llm.gpt-4o · 2.10s
tool · tool.fetch_doc · 120ms
llm · eval.judge · 680ms

span llm.gpt-4o · status ok · tokens 1,180 · cost $0.0039
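One way to picture per-span attribution is as a roll-up over the call tree: each leaf carries the tokens and cost it spent, and every parent reports the sum of its descendants. The sketch below is only an illustration under that assumption. The `Span` class and its field names are hypothetical, not TwoTail's schema, and the split of tokens and cost outside `llm.gpt-4o` is made up so the totals match the sample trace.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical span record; field names are illustrative only."""
    name: str
    tokens: int = 0          # tokens spent by this span itself
    cost_usd: float = 0.0    # cost spent by this span itself
    children: list["Span"] = field(default_factory=list)


def rollup(span: Span) -> tuple[int, float]:
    """Return (tokens, cost) for a span including all of its descendants."""
    tokens, cost = span.tokens, span.cost_usd
    for child in span.children:
        child_tokens, child_cost = rollup(child)
        tokens += child_tokens
        cost += child_cost
    return tokens, cost


# Span names come from the sample trace; the eval.judge figures are assumed.
run = Span("agent.run", children=[
    Span("tool.search"),
    Span("llm.gpt-4o", tokens=1180, cost_usd=0.0039),
    Span("tool.fetch_doc"),
    Span("eval.judge", tokens=104, cost_usd=0.0003),
])

total_tokens, total_cost = rollup(run)
print(total_tokens, round(total_cost, 4))
```

Because only leaf calls carry their own spend, the parent totals never double count.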

Connect

Connect once. No SDK.

Point your existing OpenTelemetry exporter at TwoTail and you're done. If your framework already emits OTel spans, there's nothing to install and no agent code to change.

OTel-native (OTLP). Send the spans you already produce by setting two environment variables.
Works with LangChain, LlamaIndex, CrewAI, the Vercel AI SDK, or custom spans.
Live the moment you point the exporter at TwoTail, with no waiting on a backfill.

.env

OTEL_EXPORTER_OTLP_ENDPOINT=https://in.twotail.ai
OTEL_EXPORTER_OTLP_HEADERS=x-twotail-key=sk_live_…
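As a rough illustration of how an OpenTelemetry exporter interprets these two variables, the sketch below parses the header list and derives the traces URL. It assumes TwoTail accepts OTLP over HTTP, where the `/v1/traces` signal path is appended to the base endpoint; the page itself only says OTLP. The helpers `parse_otlp_headers` and `traces_url` are hypothetical names written for this example, not part of any OpenTelemetry package, and the key stays elided as in the `.env` sample.

```python
import os

# The two variables from the .env example. The key is left elided, as in the source.
os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = "https://in.twotail.ai"
os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = "x-twotail-key=sk_live_…"


def parse_otlp_headers(raw: str) -> dict[str, str]:
    """Parse the comma-separated key=value list used by OTEL_EXPORTER_OTLP_HEADERS."""
    headers = {}
    for pair in raw.split(","):
        if "=" in pair:
            key, value = pair.split("=", 1)
            headers[key.strip()] = value.strip()
    return headers


def traces_url(endpoint: str) -> str:
    """Assumes OTLP/HTTP: the base endpoint gets /v1/traces appended."""
    return endpoint.rstrip("/") + "/v1/traces"


headers = parse_otlp_headers(os.environ["OTEL_EXPORTER_OTLP_HEADERS"])
url = traces_url(os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"])
print(url, sorted(headers))
```

In practice the framework's own OTel exporter reads these variables for you; nothing like this needs to live in agent code.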

receiving spans · 14.2k today

LangChain · LlamaIndex · CrewAI · Vercel AI SDK · OpenAI Agents · custom

Evals

Evals that map to outcomes.

Bring your own evals or set them up with TwoTail. It can even discover new ones on its own, then continuously check which evals actually track the business outcomes you care about, so you can focus on optimizing what matters.

Bring your own, or build them here. Send the evals you already run, or configure them with TwoTail.
Auto-discovered. The autonomous analyst proposes new evals from patterns in your traces.
Mapped to outcomes. TwoTail tracks which evals correlate with your business metrics and drops the noise.

evals ↔ resolution

resolution_judge · yours · ✓
tone_match · auto · ✓
citation_valid · auto · ✓
response_length · yours · ✕
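To make "mapped to outcomes" concrete, here is a minimal sketch that scores each eval by how strongly it correlates with a binary outcome and keeps only those above a cutoff. How TwoTail actually measures this is not described on this page; Pearson correlation is used here as a stand-in, the 0.3 threshold is arbitrary, and the per-run data is invented for the example.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Plain Pearson correlation coefficient between two equal-length series."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented per-run data: 1 = the run ended in a resolution, 0 = it did not.
resolved = [1, 1, 0, 1, 0, 0, 1, 0]
eval_scores = {
    "resolution_judge": [0.9, 0.8, 0.3, 0.7, 0.2, 0.4, 0.85, 0.1],
    "response_length": [120, 340, 150, 310, 330, 140, 160, 300],
}

THRESHOLD = 0.3  # arbitrary cutoff for this sketch
correlations = {name: pearson(scores, resolved) for name, scores in eval_scores.items()}
kept = [name for name, r in correlations.items() if abs(r) >= THRESHOLD]
print(kept)
```

With this data the judge score tracks resolution closely while response length does not, so only the former survives the cutoff.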

02 / Analytics

Ask anything. It investigates the rest.

Ask questions in plain English, teach it your product, and let it investigate on its own. The analysis comes back as a diagnosis, not another dashboard.

Autonomy

An analyst that never clocks out.

TwoTail investigates on its own, working through research-backed, battle-tested playbooks the best data teams use: failure clustering, latency decomposition, cost attribution. It queues the work, runs it, and sends you a diagnosis when something's worth your attention.

Battle-tested playbooks grounded in analytics research, run automatically with no query to write.
Always-on across thousands of runs, catching regressions, drift, and weak segments before you do.
Hands you a diagnosis with evidence, not another wall of charts to read.

/autonomy · live

done · 2 min ago
Analyze eval trends across releases

running · embedding 14k traces
Semantic clustering of low-performing traces

queued · in 30s
Test prompt improvement in the sandbox

playbooks: failure clustering · latency decomposition · cost attribution

Vocabulary

Teach it your language.

Tell TwoTail your strategy, your terminology, and what success means. Every answer comes back in your terms, not generic LLM-speak, and the data structure is auto-generated from your own traces.

Strategy & terminology. Define your goals, KPIs, and domain concepts once.
User segments like power_user become reusable across every analysis.
Data structure refreshes itself from your live span hierarchy.

vocabulary

Strategy Terminology Data Structure

power_user: ≥ 3 sessions / week, > 5 messages each

resolved: ticket closed without human escalation

success: eval.resolution_judge ≥ 0.7 on the final turn
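The three sample definitions read naturally as small predicates. The sketch below is one hypothetical encoding; the function names and input shapes are assumptions made for illustration, and `power_user` is interpreted as "at least three sessions in the week that each have more than five messages".

```python
def is_power_user(messages_per_session: list[int]) -> bool:
    """power_user: >= 3 sessions / week, > 5 messages each.

    Input is one week of sessions, given as the message count of each session.
    """
    qualifying = [count for count in messages_per_session if count > 5]
    return len(qualifying) >= 3


def is_resolved(ticket_closed: bool, escalated_to_human: bool) -> bool:
    """resolved: ticket closed without human escalation."""
    return ticket_closed and not escalated_to_human


def is_success(judge_scores_by_turn: list[float]) -> bool:
    """success: eval.resolution_judge >= 0.7 on the final turn."""
    return bool(judge_scores_by_turn) and judge_scores_by_turn[-1] >= 0.7


print(is_power_user([6, 8, 12]), is_resolved(True, False), is_success([0.4, 0.75]))
```

Defining terms once like this is what lets a segment such as `power_user` be reused across analyses without restating the rule.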

Quick Question

Ask in plain English.

Type a question. TwoTail writes the SQL, runs it against your traces, picks the right chart, and hands back a headline you can act on. No query language, no dashboard building.

Question to insight in one step: SQL, data, visualization, and a written takeaway.
Every answer is backed by the rows it ran on, so you can verify it.

analyze

Which intents have the lowest resolution rate?

SELECT intent, AVG(resolved) FROM runs GROUP BY intent · 1,240 rows

Answer

Refund requests resolve 38% of the time, far below the 82% average.

faq · 88%
how-to · 81%
billing · 76%
refund · 38%
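The generated query in the sample can be tried locally. This sketch runs the same `SELECT` against an in-memory SQLite table; the `runs` schema (an `intent` text column and a 0/1 `resolved` column) is inferred from the query, and the handful of rows is invented, so the rates will not match the figures shown above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: one row per run, resolved stored as 0 or 1.
conn.execute("CREATE TABLE runs (intent TEXT, resolved INTEGER)")

# Invented sample rows for illustration only.
rows = [
    ("faq", 1), ("faq", 1), ("faq", 1), ("faq", 0),
    ("billing", 1), ("billing", 1), ("billing", 0),
    ("refund", 1), ("refund", 0), ("refund", 0), ("refund", 0), ("refund", 0),
]
conn.executemany("INSERT INTO runs VALUES (?, ?)", rows)

# The query from the sample: AVG over a 0/1 column is the resolution rate.
query = "SELECT intent, AVG(resolved) FROM runs GROUP BY intent"
rates = dict(conn.execute(query).fetchall())

lowest = min(rates, key=rates.get)
print(lowest, rates[lowest])
conn.close()
```

Averaging a 0/1 column is what turns a per-run flag into a per-intent rate, which is why the query needs no explicit division.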

Charts

The right chart, automatically.

Bar, line, box plot, radar, clustering, single-metric, A/B. TwoTail auto-detects the right visualization for the question, and you can pin any chart to a dashboard to watch it over time.

Auto-detected chart types, or override the choice yourself.
Dashboards collect the charts that matter so you can share them with your team.

dashboard

success rate / week

runs by intent

cost / model

latency p50 / p95

03 / Optimization

Test it offline, then prove it in production.

A finding is only half the job. TwoTail curates the test set, replays your real traces against variants offline, then measures the winner on live runs, so every change you ship is grounded in evidence.

Datasets

The ground truth behind every test.

Group traces into datasets you can label by hand, use to calibrate your LLM judges against real human verdicts, and pin as the test set for sandbox experiments. One curated set, trusted everywhere.

Human annotation to build ground truth, one labeled trace at a time.
Judge calibration: align your LLM-as-judge evals to the human labels so the scores hold up.
Curated test sets the sandbox replays, so every experiment runs on the same trusted inputs.

dataset · refund-intent

refund-intent · golden set · 120 traces · human-labeled

trace 0e1f…a42 · resolved
trace 7a2c…9d1 · escalated
trace 9b41…c08 · resolved

judge ↔ human agreement · 0.91
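A judge-versus-human agreement score can be computed in more than one way, and this page does not say which one is behind the number above. The sketch below shows two common choices, raw agreement and Cohen's kappa (which corrects for agreement expected by chance), on ten invented labels.

```python
from collections import Counter


def raw_agreement(human: list[str], judge: list[str]) -> float:
    """Fraction of traces where the judge label equals the human label."""
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Agreement corrected for what two independent labelers would match by chance."""
    n = len(human)
    observed = raw_agreement(human, judge)
    human_counts, judge_counts = Counter(human), Counter(judge)
    labels = set(human) | set(judge)
    expected = sum((human_counts[l] / n) * (judge_counts[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)


# Invented labels for ten traces; the judge disagrees with the human on one.
human = ["resolved", "escalated", "resolved", "resolved", "escalated",
         "resolved", "escalated", "resolved", "resolved", "escalated"]
judge = ["resolved", "escalated", "resolved", "escalated", "escalated",
         "resolved", "escalated", "resolved", "resolved", "escalated"]

agreement = raw_agreement(human, judge)
kappa = cohens_kappa(human, judge)
print(round(agreement, 2), round(kappa, 2))
```

Kappa comes out lower than raw agreement whenever some matches would be expected by chance, which matters most when one label dominates the dataset.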

Sandbox

Experiment against real traces, offline.

Pick a task from production, define your variants, choose which evals to score, and run it against a sample of real inputs. Nothing touches production, so there is no risk and no waiting for new data.

Replays real inputs sampled randomly, by eval score, or by specific span.
Test a prompt change or a model swap side by side, scored on the same evals.
Returns a written recommendation you can paste straight into a PR.

new sandbox run

1 Type › 2 Task › 3 Variants › 4 Evals › 5 Inputs › 6 Review

Define your variants

base · billing-support prompt

variant A · + 3 few-shot examples

Experiments

Measure your changes in production.

Roll a change out to a fraction of your live runs and let TwoTail measure it against the control on your real metrics, with statistical significance, so you keep what genuinely moves the needle and roll back what doesn't.

Live rollout to a fraction of runs, control vs variant, on the success metric you care about.
Significance-tested with 95% confidence intervals, so you ship on signal, not noise.

production A/B · live

P-Value · 0.003

Result (α=0.05) · Significant

ship · + few-shot won in production: +13pp resolution (71% to 84%) at p=0.003. Roll out.
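For readers who want to see what a significance check on a control-versus-variant rollout can look like, here is a minimal sketch using a two-proportion z-test with a 95% confidence interval on the difference. The test TwoTail actually runs is not specified on this page. The 71% and 84% rates come from the sample result, but the sample sizes (200 runs per arm) are assumed, so the p-value printed here will not match the one shown above.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_test(success_a: int, n_a: int, success_b: int, n_b: int,
                        alpha: float = 0.05):
    """Return (difference, two-sided p-value, confidence interval) for B minus A."""
    p_a, p_b = success_a / n_a, success_b / n_b
    diff = p_b - p_a

    # Pooled standard error for the hypothesis test (H0: both rates are equal).
    pooled = (success_a + success_b) / (n_a + n_b)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Unpooled standard error for the confidence interval on the difference.
    se_diff = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se_diff, diff + z_crit * se_diff)
    return diff, p_value, ci


# Assumed arm sizes: control 142/200 resolved (71%), variant 168/200 (84%).
diff, p_value, ci = two_proportion_test(142, 200, 168, 200)
significant = p_value < 0.05
print(round(diff, 2), significant, ci[0] > 0)
```

A result counts as a win here only when the p-value is below α and the whole interval sits above zero, which is the "signal, not noise" condition in code form.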

From Raw Traces to Shipped Improvements