This is the first in a series sharing my real-life attempts to optimize agents, demonstrating techniques and how well they work.

The Agent

A successful race!

For many of these, I’m going to use my “toy” agent: Wiki Racer. This is a simple agent I’ve built to play the Wikipedia game. Once an hour, it’s given a random Wikipedia page to start from and a target page to navigate to, using only the links within articles. The agent calls gemini-flash to decide which link to pick. The race is a success if it reaches the target page in 10 clicks or fewer.
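The race loop can be sketched in a few lines. This is a minimal, hypothetical sketch, not the actual Wiki Racer code: the names are mine, and `pick_link` stands in for the gemini-flash call.

```python
def run_race(start, target, get_links, pick_link, max_clicks=10):
    """Play one race. Success = reaching `target` in `max_clicks` clicks or fewer.

    get_links(page) -> list of link titles on that page.
    pick_link(page, target, links) -> the chosen link (the LLM call in practice).
    """
    page = start
    for _ in range(max_clicks):
        if page == target:
            return True
        links = get_links(page)
        if not links:
            return False  # dead end: no outgoing links
        page = pick_link(page, target, links)
    return page == target  # did the final click land on the target?
```

A toy run with a hand-built link graph and a greedy picker shows the shape of a race without any API calls.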

The Baseline

Chart in my agent analytics tool TwoTail.AI

The first version of my agent used a simple prompt which was generated by Claude Code when I built it. It “wins” the race around 50% of the time.

The Experiment

I decided that my first experiment would be to try to improve the prompt. There are a lot of prompt optimization techniques, but I started with a simple change: include some “Expert Strategies” on how to play the game in the prompt (these were themselves generated by asking Gemini Pro - is this some sort of meta-prompting or knowledge distillation?).

Gemini Pro advice on how to play the wiki game

The cool thing here is that these strategies were formed from many human games! You would learn some of them yourself pretty quickly by playing, but others are clever tricks that would take a while to spot. It’ll be interesting to see how much they help the agent.

For this experiment there is no traffic bias, so I can simply change the prompt and observe the results “before and after”, rather than needing an A/B test. The KPI will be race success rate, so we can use a Chi-squared test to calculate statistical significance. I will use a sample size of 100 races per variant, which can detect a 20pp uplift in success rate with 80% power at a 5% significance level - that’s a larger minimum detectable effect than I would like, but it’ll do for the first experiment (I can always ramp up the agent frequency later).
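That sample size can be sanity-checked with the standard two-proportion formula. A minimal sketch (the function name is mine), assuming a ~50% baseline success rate and a 20pp target uplift:

```python
import math

from scipy.stats import norm


def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Per-group n for a two-sided test of p1 vs p2 at the given alpha/power."""
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value for the significance level
    z_beta = norm.ppf(power)           # critical value for the desired power
    p_bar = (p1 + p2) / 2              # pooled proportion under the null
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)
```

Plugging in `p1=0.5, p2=0.7` gives a per-variant n of roughly 93, so 100 races per variant is indeed enough for a 20pp effect.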

The Results

I left the experiment on over Christmas holidays, so the new variant ended up with 500 runs. Here are the results:

A resounding failure!

The new strategy is worse than the old one; that was unexpected.

Using TwoTail, I could easily see why:

The new strategy hits a loop almost 2x as often as the original. In my agent, a loop is when it picks a link, and then on that page picks the previous link again. This usually causes the run to fail, because the agent will likely keep bouncing between those two pages.

It does this because the expert strategies encourage it to go up a level in the hierarchy, which is good at the beginning of a race but will often cause a loop towards the end.

What Next?

There are two paths I could go down from here: introduce some new instructions in the prompt to avoid the looping behaviour, or revert to the old strategy and try a different iteration. I think I’ll do the latter. Stay tuned for Experiment #2!