The Experiment

My first optimization experiment didn’t go well, but the reason was revealing: loops were causing the agent to fail. This was particularly true of the variant group, but also present in the control group.

So I went back and fixed this by persisting the path of visited pages and telling the agent not to loop!
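A minimal sketch of what that fix can look like (the function and prompt wording here are my own illustration, not the actual agent's code): keep the visited path in the agent state, filter already-seen pages out of the candidates, and show the model where it has been.

```python
def filter_visited(candidate_links, path):
    """Drop links the agent has already visited in this race."""
    visited = set(path)
    return [link for link in candidate_links if link not in visited]

def format_path_for_prompt(path):
    """Surface the visited path so the model can avoid cycles."""
    return ("Pages visited so far (do NOT revisit any of these): "
            + " -> ".join(path))

path = ["Alligator", "Reptile", "Alligator farming"]
links = ["Reptile", "Florida", "Crocodile", "Alligator"]
print(filter_visited(links, path))  # ['Florida', 'Crocodile']
```

Persisting the path has a second benefit: it doubles as a log you can cluster on later when analyzing failed races.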

Digging into the failed race clusters, I spotted something else interesting: the model was hallucinating!

The model was conflating the target page's entity with the current page, hallucinating an incorrect idea of what the target page actually is. When it did that, it chose the wrong links.

To fix this, I decided to create a step at the start of the race where an LLM defines what the entity of the target page is, and that gets carried along for future steps to reference.
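Roughly, that looks like a one-time definition call before the race, whose output is injected into every subsequent step prompt. The helper names and prompt text below are hypothetical, just to show the shape:

```python
def define_target_entity(call_llm, target_title):
    """Ask an LLM once, before the race starts, what the target page is about."""
    prompt = (
        f"In two sentences, describe the Wikipedia page titled '{target_title}': "
        "what kind of entity it is and what topics it is closely related to."
    )
    return call_llm(prompt)

def step_prompt(target_title, target_description, current_page, links):
    """Every later step sees the same fixed description, so the model
    can't drift into its own idea of what the target is."""
    return (
        f"Target page: {target_title}\n"
        f"Target description: {target_description}\n"
        f"Current page: {current_page}\n"
        f"Pick the link most likely to lead toward the target: {links}"
    )
```

Because the description is computed once and carried in state, later steps can't re-hallucinate the target differently each turn.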

This raised a new problem: sometimes the Gemini Flash model doesn't actually know what the target page is! It would be cheating to let it look at the page, so I decided to use a model router.

My router uses a stronger model, Gemini Pro, to define the target, then passes that along. This expensive planning + cheap execution strategy can be applied to agents generally.
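The routing logic itself can be very small. This is a sketch of the planning/execution split (the model-name strings are placeholders, not exact API model IDs): one expensive call up front, cheap calls on every step after.

```python
def route_model(task):
    """Route the one-off planning call to the stronger model,
    and every per-step link choice to the cheaper model."""
    if task == "define_target":
        return "gemini-pro"    # stronger model: used once per race
    return "gemini-flash"      # cheaper model: used on every step

print(route_model("define_target"))  # gemini-pro
print(route_model("choose_link"))    # gemini-flash
```

Since a race takes up to 10 steps, paying Pro prices for one call and Flash prices for the other nine keeps the cost close to an all-Flash run.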

The Gemini Pro description of the target, in this case Reggie the Alligator

When I was looking into this, I spotted another hallucination: sometimes the model was imagining links that weren’t there.

This happened particularly on pages with long lists of links. The agent caps the list at 500 links, so I checked in TwoTail how often pages exceeded that.

Based on this, I increased the limit to 700 (the median), and also forced the list to include the target page if it fell beyond the first 700.
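The truncation rule can be sketched like this (the limit comes from the post; the code itself is my own illustration): cap the candidate list at 700, but always keep the target page in view if it appears anywhere in the full list.

```python
LINK_LIMIT = 700

def truncate_links(links, target):
    """Cap the link list, but never truncate away the target itself."""
    if target in links and target not in links[:LINK_LIMIT]:
        # Target was beyond the cutoff: promote it, then fill the rest.
        return [target] + links[:LINK_LIMIT - 1]
    return links[:LINK_LIMIT]

links = [f"Page {i}" for i in range(1000)]
links[850] = "Reggie the Alligator"
kept = truncate_links(links, "Reggie the Alligator")
print(len(kept), "Reggie the Alligator" in kept)  # 700 True
```

Without the forced inclusion, a race could be unwinnable on the final hop purely because the target sat past the cutoff.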

This also made me reflect on these long-list pages: they're powerful hubs, but often a shorter list that's more niche to the target would be a better choice. So I adjusted the prompt to prefer those.

Well - I went on a rampage fixing issues here! So it was time to deploy the latest agent and see how it performed.

The Results

Another significant result, and this time positive! We are now winning races 74% of the time, having started at around 50%.

What Next?

Looking at the clustering of failed races now, the earlier errors are mostly gone, and failures come down to hitting the 10-step limit without reaching the target.

That’ll be the inspiration for the next experiment: how can I get the model to choose more efficient paths?

I’ll also definitely return to the topic of cost vs quality - it’d be interesting to let Gemini Pro choose the links too, see what the performance uplift is, and decide whether the trade-off is worth it.

BTW, if you want to be able to analyze and optimize your own agent like this, get on the waitlist for TwoTail! I’m personally helping with the first analysis for every new signup.