Our last round of optimizing the wiki racer agent eliminated most of the bugs; the majority of remaining failures are races that simply lose the game by hitting the 10-step limit without reaching the target.

This meant it was time to dig into these failures, and try to understand what separates a losing race from a winning one. Is it related to the start and target page? Are races lost because of decisions at the start or the end? And what can we do to change failed races to successes, without changing successes into failures?

Evals

Because Wikipedia is so broad, races and their paths can take many forms. So I decided to implement three evals for the agent, each capturing the idea of “semantic similarity distance” between the current page and the target:

The category overlap eval does what it says on the tin: measure how much the categories that two pages belong to overlap with each other.
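A minimal sketch of that overlap measure, assuming each page’s categories have already been fetched into a set (the Jaccard index is one natural choice; the example category names are made up):

```python
def category_overlap(cats_a: set[str], cats_b: set[str]) -> float:
    """Jaccard overlap of two pages' category sets: 0 = disjoint, 1 = identical."""
    if not cats_a or not cats_b:
        return 0.0
    return len(cats_a & cats_b) / len(cats_a | cats_b)

# Hypothetical category sets for two pages:
paris = {"Capitals in Europe", "Cities in France"}
lyon = {"Cities in France", "Prefectures in France"}
print(category_overlap(paris, lyon))  # 1 shared category of 3 total -> ~0.333
```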

The embedding similarity eval embeds the page titles in a vector space, and calculates closeness based on that.
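Closeness here is typically cosine similarity between the two title vectors. A sketch, with the embedding model abstracted behind an `embed` callable (a stand-in for whatever model you use, not a specific API):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def embedding_similarity(embed, current_title: str, target_title: str) -> float:
    # `embed` maps a string to a vector; any sentence-embedding model works here.
    return cosine_similarity(embed(current_title), embed(target_title))
```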

The LLM proximity eval uses model-as-a-judge to evaluate the similarity of the content of the two pages.
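A model-as-judge eval boils down to a scoring prompt plus a parse of the reply. A sketch, with the chat-completion call abstracted behind a hypothetical `llm` callable (prompt wording and the 0–10 scale are illustrative, not the exact prompt used):

```python
JUDGE_PROMPT = """\
On a scale of 0 to 10, how semantically close is the Wikipedia page
"{current}" to the target page "{target}"? 10 means the pages cover the
same topic; 0 means they are unrelated. Answer with a single integer."""

def llm_proximity(llm, current_title: str, target_title: str) -> float:
    # `llm` is a stand-in for whatever chat-completion call you use;
    # it takes a prompt string and returns the model's text reply.
    reply = llm(JUDGE_PROMPT.format(current=current_title, target=target_title))
    return int(reply.strip()) / 10  # normalise to [0, 1]
```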

Looking at the chart, which shows each eval measured as the race progresses, LLM proximity best matches the shape we’d expect: as we click links, we get closer to the target.

Analysis

Let’s then compare LLM proximity for failed vs successful races:

A few interesting patterns:

This provokes an interesting question: do failed races never get close to the target?

It turns out that’s not true: they do get very close, just more slowly, and they often approach the target and then move away again when they fail to find it. Both effects drag the average down.
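Both behaviours are cheap to check mechanically over the logged proximity series. A sketch, assuming each race is stored as a list of per-step LLM-proximity scores in [0, 1] (the thresholds are arbitrary):

```python
def near_miss(proximities: list[float], close: float = 0.8) -> bool:
    """Did the race ever get within `close` of the target?"""
    return max(proximities, default=0.0) >= close

def oscillated(proximities: list[float], close: float = 0.8, drop: float = 0.2) -> bool:
    """Got close to the target, then moved away again by at least `drop`."""
    best_so_far = 0.0
    for p in proximities:
        if best_so_far >= close and best_so_far - p >= drop:
            return True
        best_so_far = max(best_so_far, p)
    return False

# A failed race that approaches the target, then backs off:
print(oscillated([0.2, 0.5, 0.9, 0.6, 0.7]))  # True
```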

At this point, it makes sense to develop some intuition about “failed paths”, i.e. which pages the agent visits when it fails.

I looked at a sample of 10 traces. On close inspection, I categorised them as follows:

Deep Analysis with TwoTail

Before working on the fixes, I wanted a second opinion, so I used the “deep analysis” feature of TwoTail. It’s currently in beta, so no screenshot here, but the conclusion was similar: failed races oscillate. It also came up with some appealing recommendations:

I implemented all of this. Let’s see what happens!

The Results

The Efficacy of the Proximity Prompting Strategy

The early results for the new strategy show a modest uplift:

Around a 5-percentage-point improvement in success rate over the previous strategy (65% of races won instead of 60%), but that’s not statistically significant at this volume. Since it’s winning, I’ll leave it running as the new baseline, but it doesn’t look like a game-changer.

Measuring Target Knowledge

After a few days of data with our new eval, a clear pattern is emerging:

The agent correctly identifies the target page only around 80% of the time. And predictably, the race success rate is much higher when the target is correctly identified. A 75% success rate looks achievable if we nail target analysis - maybe a little lower once you account for the likely correlation between targets that are hard to analyse and targets that are hard to navigate to.
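The arithmetic behind that estimate is just the law of total probability over whether the target was identified. A sketch with hypothetical per-condition win rates chosen to be consistent with the numbers above (the 25% win-rate-when-wrong is an assumption, not a measured figure):

```python
def overall_success(p_correct: float, win_if_correct: float, win_if_wrong: float) -> float:
    """Overall win rate, decomposed over target identification outcomes."""
    return p_correct * win_if_correct + (1 - p_correct) * win_if_wrong

# Hypothetical: ~80% correct identification, ~75% win rate when correct,
# ~25% when wrong.
now = overall_success(0.80, 0.75, 0.25)    # roughly the current baseline
fixed = overall_success(1.00, 0.75, 0.25)  # if target analysis were perfect
print(round(now, 2), round(fixed, 2))
```

Under those assumptions, perfect target analysis lifts the overall rate to whatever the win-rate-when-correct is, which is why nailing target analysis caps out around 75% here.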

Next Steps

I’m keen to see if there’s a way to improve target accuracy. Alongside that, I’m going to dig into the failed races again, and possibly try some more radical prompting strategies.