Our last round of optimizing the wiki racer agent eliminated most of the bugs, so the majority of failed races now fail simply by losing the game (reaching the 10-step limit).
This meant it was time to dig into these failures, and try to understand what separates a losing race from a winning one. Is it related to the start and target page? Are races lost because of decisions at the start or the end? And what can we do to change failed races to successes, without changing successes into failures?
Evals
Because Wikipedia is so broad, races and their paths can take many forms. So I decided to implement three evals for the agent, each capturing the idea of “semantic similarity distance” between the current page and the target:
- The category overlap eval does what it says on the tin: it measures how much the categories that two pages belong to overlap with each other.
- The embedding similarity eval embeds the page titles in a vector space, and calculates closeness based on that.
- The LLM proximity eval uses model-as-a-judge to evaluate the similarity of the content of the two pages.
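To make the first two evals concrete, here is a minimal sketch of how they could be scored. The post doesn't specify the exact formulas, so the Jaccard overlap and cosine similarity used here are assumptions, and the function names are hypothetical. The embedding vectors would come from whichever embedding model you use.

```python
import math

def category_overlap(cats_a, cats_b):
    """Jaccard overlap of two category collections: |A & B| / |A | B|.

    Assumption: the post does not say which overlap measure is used.
    """
    a, b = set(cats_a), set(cats_b)
    if not a and not b:
        return 0.0  # no categories on either page: treat as no overlap
    return len(a & b) / len(a | b)

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_u * norm_v)
```

Both functions return higher values for more similar pages, so they can be plotted on the same "closeness" axis as the LLM proximity score.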
Looking at the chart, where we see each eval measured as the race progresses, it looks like LLM proximity best matches the shape we’d expect - as we click links, we get closer to the target.
Analysis
Let’s then compare LLM proximity for failed vs successful races:
A few interesting patterns:
- step 0: the proximity at the start, which depends only on the randomly drawn start and target pages, clearly affects the likely outcome of the race
- step 1-3: it looks like failed races “escape” the starting point with slightly slower velocity than successes
- step 4+: failed races plateau at a lower level. This is almost guaranteed by survivorship bias in the success group
This provokes an interesting question: do failed races never get close to the target?
It turns out that’s not true: they get very close. But they likely get there more slowly, and often they get close and then move away again when they don’t find the target. Both effects bring the average down.
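One way to check this per race, rather than on averages, is to summarise each trace by its peak proximity and whether it drifted away after getting close. This is a sketch only: it assumes proximity scores on a 0 to 1 scale, and the `close_threshold` value and the function name are hypothetical.

```python
def summarise_trace(proximities, close_threshold=0.8):
    """Summarise one race from its per-step proximity scores.

    Assumptions: scores lie in [0, 1] and 0.8 counts as "close".
    """
    peak = max(proximities)
    peak_step = proximities.index(peak)  # first step at which the peak occurs
    # First step at which the race was "close", if it ever was.
    first_close = next(
        (i for i, p in enumerate(proximities) if p >= close_threshold), None
    )
    got_close = first_close is not None
    # "Retreated" means: got close, then dropped below the threshold again.
    retreated = got_close and any(
        p < close_threshold for p in proximities[first_close + 1:]
    )
    return {
        "peak": peak,
        "peak_step": peak_step,
        "got_close": got_close,
        "retreated": retreated,
    }
```

Counting how many failed races have `got_close` and `retreated` both true gives a direct measure of the "got close, then moved away" pattern.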
At this point, it makes sense to develop some intuition about “failed paths” i.e. which pages the agent goes to when it fails.
I looked at a sample of 10 traces. On close inspection, I categorised them as follows:
- 1 had a hidden bug: the correct page was found, but it wasn’t recognised as the target because of a special character (an accent) in the page name → I’ll fix this!
- 4 misunderstood the target page: ambiguity in the naming meant the (strong) description model mistook the target for something else → I’ll measure how often this happens with a new eval
- 5 got close and oscillated around, 3 of which had a “slow start” → Not sure yet how to address this!
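For the accent bug in the first category, the post doesn't say exactly what went wrong. One common cause is that the same title arrives in different forms (percent-encoded, with underscores, or with the accent stored as a separate combining character). A minimal sketch of a comparison that normalises those forms first, using hypothetical helper names:

```python
import unicodedata
from urllib.parse import unquote

def normalise_title(title):
    """Bring a page title into one canonical form before comparing.

    Assumption: mismatches come from encoding differences, not from
    genuinely different pages.
    """
    title = unquote(title)                    # "Caf%C3%A9" becomes "Café"
    title = title.replace("_", " ").strip()   # URL-style underscores to spaces
    # NFC merges a base letter plus combining accent into one code point.
    return unicodedata.normalize("NFC", title)

def same_page(a, b):
    """True if two titles refer to the same page after normalisation."""
    return normalise_title(a) == normalise_title(b)
```

Note that this deliberately does not strip accents or change case, since titles that differ only in those ways can be different pages.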
Deep Analysis with TwoTail
Before working on the fixes, I wanted to get a second opinion, so I used the “deep analysis” feature of Twotail. Currently this is in beta, so no screenshot here, but the conclusion was similar: failed races oscillate. It also came up with some appealing recommendations:
- feed into the agent decision step how many steps it has left
- feed into the agent decision step the current proximity to the target, and adapt the link picking strategy as you get closer
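Both recommendations amount to adding a few lines of context to the prompt for the decision step. The sketch below shows one way to do that. The thresholds, the wording of the hints, and the function names are all hypothetical, and proximity is assumed to be on a 0 to 1 scale; only the 10-step limit comes from the race rules.

```python
MAX_STEPS = 10  # step limit for a race

def pick_strategy(proximity, steps_left):
    """Choose a link-picking hint. Thresholds are illustrative assumptions."""
    if proximity >= 0.7:
        return "You are close: prefer specific links that name the target or its direct neighbours."
    if steps_left <= 3:
        return "Few steps remain: avoid detours and pick the link most directly related to the target."
    return "You are far away: prefer broad hub pages that connect many topics."

def build_decision_context(step, proximity, max_steps=MAX_STEPS):
    """Build the extra prompt lines for the agent's decision step."""
    steps_left = max(max_steps - step, 0)
    return "\n".join([
        f"Steps left: {steps_left} of {max_steps}",
        f"Estimated proximity to target: {proximity:.2f}",
        f"Strategy: {pick_strategy(proximity, steps_left)}",
    ])
```

The returned text would be appended to whatever prompt the agent already uses when choosing the next link.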
I implemented all of this. Let’s see what happens!
The Results
The Efficacy of the Proximity Prompting Strategy
The early results for the new strategy show a modest uplift:
An improvement of around 5 percentage points over the previous strategy’s success rate (so 65% of races win instead of 60%), but that’s not statistically significant at this volume. Since it’s ahead, I’ll leave it running as the new baseline, but it doesn’t look like a game-changer.
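To see why volume matters here, a two-proportion z-test is one standard way to compare two success rates. This is a sketch using only the standard library; the race counts in the usage note are made up for illustration, since the post doesn't state the actual sample sizes.

```python
import math

def two_proportion_z_test(wins_a, n_a, wins_b, n_b):
    """Two-sided z-test for a difference between two success rates.

    Returns (z, p_value). Uses the pooled-proportion standard error.
    """
    p_a = wins_a / n_a
    p_b = wins_b / n_b
    pooled = (wins_a + wins_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 60% vs 65% success with 200 races per strategy.
z, p = two_proportion_z_test(120, 200, 130, 200)
```

With these hypothetical counts the p-value comes out at roughly 0.3, well above the usual 0.05 cutoff. The same 60% vs 65% split at 2,000 races per strategy would fall below it, which is why the result can only be judged once more races have run.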
Measuring Target Knowledge
After a few days of data with our new eval, a clear pattern is emerging:
The agent only correctly identifies the target page around 80% of the time. And predictably, the race success rate is much higher when the target is correctly identified. 75% success rate looks achievable if we nail target analysis - maybe a little lower if you consider the likely correlation between targets that are hard to analyse and also hard to navigate to.
Next Steps
I’m keen to find a way to improve target accuracy. Alongside that, I’m going to dig into the failed races again, and possibly try some more radical prompting strategies.