This is the first in a series sharing my real-life attempts to optimize agents, demonstrating techniques and how well they work.

The Agent

A successful race!

For many of these, I’m going to use my “toy” agent: Wiki Racer. This is a simple agent I’ve built to play the Wikipedia game. Once an hour, it’s given a random Wikipedia page to start from and a target page to navigate to, using only the links within articles. The agent calls gemini-flash to decide which link to pick. The race is a success if it reaches the target page in 10 clicks or fewer.
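The race loop can be sketched in a few lines. This is a minimal, hypothetical sketch, not the actual Wiki Racer code: the names are mine, and `pick_link` stands in for the gemini-flash call.

```python
def run_race(start, target, get_links, pick_link, max_clicks=10):
    """Play one race. Success = reaching `target` in `max_clicks` clicks or fewer.

    get_links(page) -> list of link titles on that page.
    pick_link(page, target, links) -> the chosen link (the LLM call in practice).
    """
    page = start
    for _ in range(max_clicks):
        if page == target:
            return True
        links = get_links(page)
        if not links:
            return False  # dead end: no outgoing links
        page = pick_link(page, target, links)
    return page == target  # did the final click land on the target?
```

A toy run with a hand-built link graph and a greedy picker shows the shape of a race without any API calls.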

The Baseline

Chart in my agent analytics tool TwoTail.AI

The first version of my agent used a simple prompt which was generated by Claude Code when I built it. It “wins” the race around 50% of the time.

The Experiment

I decided that my first experiment would be to try to improve the prompt. There are a lot of prompt optimization techniques, but I started with a simple change: include some “Expert Strategies” on how to play the game in the prompt (these were themselves generated by asking Gemini Pro - is this some sort of meta-prompting or knowledge distillation?).

Gemini Pro advice on how to play the wiki game

The cool thing here is that these strategies were formed from many human games! You would learn some of them yourself pretty quickly by playing, but others are clever tricks that would take a while to spot. It’ll be interesting to see how much they help the agent.

For this experiment there is no traffic bias, so I can simply change the prompt and observe the results “before and after”, rather than needing an A/B test. The KPI will be race success rate, so we can use a Chi-squared test to calculate statistical significance. I will use a sample size of 100 races per variant, which can detect a 20pp uplift in success rate with 80% power at a 5% significance level - that’s a larger minimum detectable effect than I would like, but it’ll do for the first experiment (I can always ramp up the agent frequency later).
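That sample size can be sanity-checked with the standard two-proportion formula. A minimal sketch (the function name is mine), assuming a ~50% baseline success rate and a 20pp target uplift:

```python
import math

from scipy.stats import norm


def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Per-group n for a two-sided test of p1 vs p2 at the given alpha/power."""
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value for the significance level
    z_beta = norm.ppf(power)           # critical value for the desired power
    p_bar = (p1 + p2) / 2              # pooled proportion under the null
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)
```

Plugging in `p1=0.5, p2=0.7` gives a per-variant n of roughly 93, so 100 races per variant is indeed enough for a 20pp effect.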

The Results

I left the experiment on over Christmas holidays, so the new variant ended up with 500 runs. Here are the results:

A resounding failure!

The new strategy is worse than the old one; that was unexpected.

Using TwoTail, I could easily see why:

The new strategy hits a loop almost 2x as often as the original. In my agent, a loop is when it picks a link, and then on that page picks the previous link again. This usually causes the run to fail, because the agent will likely keep bouncing between those two pages.

It does this because the expert strategies encourage it to go up a level in the hierarchy, which is good at the beginning of a race but will often cause a loop towards the end.

What Next?

There are two paths I could go down from here: introduce some new instructions in the prompt to avoid the looping behaviour, or revert to the old strategy and try a different iteration. I think I’ll do the latter. Stay tuned for Experiment #2!