Our last round of optimizing the wiki racer agent eliminated most of the bugs; the majority of remaining failures are races that simply lose the game by hitting the 10-step limit without reaching the target.

This meant it was time to dig into these failures, and try to understand what separates a losing race from a winning one. Is it related to the start and target page? Are races lost because of decisions at the start or the end? And what can we do to change failed races to successes, without changing successes into failures?

Evals

Because Wikipedia is so broad, races and their paths can take many forms. So I decided to implement three evals for the agent, each capturing the idea of “semantic similarity distance” between the current page and the target:

The category overlap eval does what it says on the tin: measure how much the categories that two pages belong to overlap with each other.
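A minimal sketch of that overlap measure, assuming each page’s categories have already been fetched into a set (the Jaccard index is one natural choice; the example category names are made up):

```python
def category_overlap(cats_a: set[str], cats_b: set[str]) -> float:
    """Jaccard overlap of two pages' category sets: 0 = disjoint, 1 = identical."""
    if not cats_a or not cats_b:
        return 0.0
    return len(cats_a & cats_b) / len(cats_a | cats_b)

# Hypothetical category sets for two pages:
paris = {"Capitals in Europe", "Cities in France"}
lyon = {"Cities in France", "Prefectures in France"}
print(category_overlap(paris, lyon))  # 1 shared category of 3 total -> ~0.333
```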

The embedding similarity eval embeds the page titles in a vector space, and calculates closeness based on that.
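Closeness here is typically cosine similarity between the two title vectors. A sketch, with the embedding model abstracted behind an `embed` callable (a stand-in for whatever model you use, not a specific API):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def embedding_similarity(embed, current_title: str, target_title: str) -> float:
    # `embed` maps a string to a vector; any sentence-embedding model works here.
    return cosine_similarity(embed(current_title), embed(target_title))
```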

The LLM proximity eval uses model-as-a-judge to evaluate the similarity of the content of the two pages.
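A model-as-judge eval boils down to a scoring prompt plus a parse of the reply. A sketch, with the chat-completion call abstracted behind a hypothetical `llm` callable (prompt wording and the 0–10 scale are illustrative, not the exact prompt used):

```python
JUDGE_PROMPT = """\
On a scale of 0 to 10, how semantically close is the Wikipedia page
"{current}" to the target page "{target}"? 10 means the pages cover the
same topic; 0 means they are unrelated. Answer with a single integer."""

def llm_proximity(llm, current_title: str, target_title: str) -> float:
    # `llm` is a stand-in for whatever chat-completion call you use;
    # it takes a prompt string and returns the model's text reply.
    reply = llm(JUDGE_PROMPT.format(current=current_title, target=target_title))
    return int(reply.strip()) / 10  # normalise to [0, 1]
```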

Looking at the chart, which shows each eval measured as the race progresses, LLM proximity best matches the shape we’d expect: as we click links, we get closer to the target.

Analysis

Let’s then compare LLM proximity for failed vs successful races:

A few interesting patterns:

This provokes an interesting question: do failed races never get close to the target?

It turns out that’s not true: they do get very close, just more slowly, and they often approach the target and then move away again when they fail to find it. Both effects drag the average down.
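Both behaviours are cheap to check mechanically over the logged proximity series. A sketch, assuming each race is stored as a list of per-step LLM-proximity scores in [0, 1] (the thresholds are arbitrary):

```python
def near_miss(proximities: list[float], close: float = 0.8) -> bool:
    """Did the race ever get within `close` of the target?"""
    return max(proximities, default=0.0) >= close

def oscillated(proximities: list[float], close: float = 0.8, drop: float = 0.2) -> bool:
    """Got close to the target, then moved away again by at least `drop`."""
    best_so_far = 0.0
    for p in proximities:
        if best_so_far >= close and best_so_far - p >= drop:
            return True
        best_so_far = max(best_so_far, p)
    return False

# A failed race that approaches the target, then backs off:
print(oscillated([0.2, 0.5, 0.9, 0.6, 0.7]))  # True
```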

At this point, it makes sense to develop some intuition about “failed paths”, i.e. which pages the agent visits when it fails.

I looked at a sample of 10 traces. On close inspection, I categorised them as follows:

Deep Analysis with TwoTail

Before working on the fixes, I wanted a second opinion, so I used the “deep analysis” feature of TwoTail. It’s currently in beta, so no screenshot here, but the conclusion was similar: failed races oscillate. It also came up with some appealing recommendations:

I implemented all of this. Let’s see what happens!

The Results

The Efficacy of the Proximity Prompting Strategy

The early results for the new strategy show a modest uplift:

Around a 5-percentage-point improvement in success rate over the previous strategy (65% of races won instead of 60%), but that’s not statistically significant at this volume. Since it’s winning, I’ll leave it running as the new baseline, but it doesn’t look like a game-changer.

Measuring Target Knowledge

After a few days of data with our new eval, a clear pattern is emerging:

The agent correctly identifies the target page only around 80% of the time. And predictably, the race success rate is much higher when the target is correctly identified. A 75% success rate looks achievable if we nail target analysis - maybe a little lower once you account for the likely correlation between targets that are hard to analyse and targets that are hard to navigate to.
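The arithmetic behind that estimate is just the law of total probability over whether the target was identified. A sketch with hypothetical per-condition win rates chosen to be consistent with the numbers above (the 25% win-rate-when-wrong is an assumption, not a measured figure):

```python
def overall_success(p_correct: float, win_if_correct: float, win_if_wrong: float) -> float:
    """Overall win rate, decomposed over target identification outcomes."""
    return p_correct * win_if_correct + (1 - p_correct) * win_if_wrong

# Hypothetical: ~80% correct identification, ~75% win rate when correct,
# ~25% when wrong.
now = overall_success(0.80, 0.75, 0.25)    # roughly the current baseline
fixed = overall_success(1.00, 0.75, 0.25)  # if target analysis were perfect
print(round(now, 2), round(fixed, 2))
```

Under those assumptions, perfect target analysis lifts the overall rate to whatever the win-rate-when-correct is, which is why nailing target analysis caps out around 75% here.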

Next Steps

I’m keen to see if there’s a way to improve target accuracy. Alongside that, I’m going to dig into the failed races again, and possibly try some more radical prompting strategies.