The Experiment

My first optimization experiment didn’t go well, but the reason was revealing: loops were causing the agent to fail. This was particularly true of the variant group, but also present in the control group.

So I went back and fixed this by persisting the path of visited pages and telling the agent not to loop!
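A minimal sketch of what that fix can look like (the function and prompt wording here are my own illustration, not the actual agent's code): keep the visited path in the agent state, filter already-seen pages out of the candidates, and show the model where it has been.

```python
def filter_visited(candidate_links, path):
    """Drop links the agent has already visited in this race."""
    visited = set(path)
    return [link for link in candidate_links if link not in visited]

def format_path_for_prompt(path):
    """Surface the visited path so the model can avoid cycles."""
    return ("Pages visited so far (do NOT revisit any of these): "
            + " -> ".join(path))

path = ["Alligator", "Reptile", "Alligator farming"]
links = ["Reptile", "Florida", "Crocodile", "Alligator"]
print(filter_visited(links, path))  # ['Florida', 'Crocodile']
```

Persisting the path has a second benefit: it doubles as a log you can cluster on later when analyzing failed races.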

Digging into the failed race clusters, I spotted something else interesting: the model was hallucinating!

The model was conflating the target page's entity with the current page, hallucinating an incorrect idea of what the target page actually is. When it did that, it chose the wrong links.

To fix this, I decided to create a step at the start of the race where an LLM defines what the entity of the target page is, and that gets carried along for future steps to reference.
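Roughly, that looks like a one-time definition call before the race, whose output is injected into every subsequent step prompt. The helper names and prompt text below are hypothetical, just to show the shape:

```python
def define_target_entity(call_llm, target_title):
    """Ask an LLM once, before the race starts, what the target page is about."""
    prompt = (
        f"In two sentences, describe the Wikipedia page titled '{target_title}': "
        "what kind of entity it is and what topics it is closely related to."
    )
    return call_llm(prompt)

def step_prompt(target_title, target_description, current_page, links):
    """Every later step sees the same fixed description, so the model
    can't drift into its own idea of what the target is."""
    return (
        f"Target page: {target_title}\n"
        f"Target description: {target_description}\n"
        f"Current page: {current_page}\n"
        f"Pick the link most likely to lead toward the target: {links}"
    )
```

Because the description is computed once and carried in state, later steps can't re-hallucinate the target differently each turn.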

This raised a new problem: sometimes the Gemini Flash model doesn't actually know what the target page is! It would be cheating to let it look at the page, so I decided to use a model router.

My router uses a stronger model, Gemini Pro, to define the target, then passes that along. This expensive planning + cheap execution strategy can be applied to agents generally.
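The routing logic itself can be very small. This is a sketch of the planning/execution split (the model-name strings are placeholders, not exact API model IDs): one expensive call up front, cheap calls on every step after.

```python
def route_model(task):
    """Route the one-off planning call to the stronger model,
    and every per-step link choice to the cheaper model."""
    if task == "define_target":
        return "gemini-pro"    # stronger model: used once per race
    return "gemini-flash"      # cheaper model: used on every step

print(route_model("define_target"))  # gemini-pro
print(route_model("choose_link"))    # gemini-flash
```

Since a race takes up to 10 steps, paying Pro prices for one call and Flash prices for the other nine keeps the cost close to an all-Flash run.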

The Gemini Pro description of the target, in this case Reggie the Alligator

When I was looking into this, I spotted another hallucination: sometimes the model was imagining links that weren’t there.

This happened particularly on pages with long lists of links. The agent caps the list at 500 links, so I checked in TwoTail how often pages exceeded that.

Based on this, I increased the limit to 700 (the median), and also forced the list to include the target page if it fell beyond the first 700.
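The truncation rule can be sketched like this (the limit comes from the post; the code itself is my own illustration): cap the candidate list at 700, but always keep the target page in view if it appears anywhere in the full list.

```python
LINK_LIMIT = 700

def truncate_links(links, target):
    """Cap the link list, but never truncate away the target itself."""
    if target in links and target not in links[:LINK_LIMIT]:
        # Target was beyond the cutoff: promote it, then fill the rest.
        return [target] + links[:LINK_LIMIT - 1]
    return links[:LINK_LIMIT]

links = [f"Page {i}" for i in range(1000)]
links[850] = "Reggie the Alligator"
kept = truncate_links(links, "Reggie the Alligator")
print(len(kept), "Reggie the Alligator" in kept)  # 700 True
```

Without the forced inclusion, a race could be unwinnable on the final hop purely because the target sat past the cutoff.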

This also made me reflect on these long-list pages: they're powerful hubs, but often a shorter list that's more niche to the target would be a better choice. So I adjusted the prompt to prefer those.

Well - I went on a rampage fixing issues here! So it was time to deploy the latest agent and see how it performed.

The Results

Another significant result, and this time positive! We are now winning races 74% of the time, having started at around 50%.

What Next?

Looking at the clustering of failed races now, the earlier errors are mostly gone, and failures come down to hitting the 10-step limit without reaching the target.

That’ll be the inspiration for the next experiment: how can I get the model to choose more efficient paths?

I’ll also definitely return to the topic of cost vs quality - it’d be interesting to let Gemini Pro choose the links too, see what the performance uplift is, and decide whether the trade-off is worth it.

BTW, if you want to be able to analyze and optimize your own agent like this, get on the waitlist for TwoTail! I’m personally helping with the first analysis for every new signup.