As more teams build agents and put them into production, a new problem emerges: how do we optimize these things?

For traditional products, we know the canonical experiments: A/B tests on UX, content, and algorithms. But what’s the equivalent for agents? And where might we get the strongest ROI on an experiment?

As I see it, there are three surface areas of an agent that we can play with:

  1. Semantic: prompts and context, everything we send to the model
  2. Hyperparameter: which models we use and how we configure them
  3. Architecture: the shape of our agent system, what tools it has and what workflow it follows

Below, I dive into each of these, giving examples of experiments that we might run.


Layer 1: The Semantic Layer

The semantic layer is everything your agent sends to models, and that mainly means the prompt and the context. These can be static system prompts, or dynamic inputs that include user input or agent state.

Some classic experiments include: A/B testing system prompt phrasing, adding or removing few-shot examples, and varying how much history or retrieved context you include.

Other surfaces worth experimenting on: tool descriptions, constraint ordering, role context in multi-step workflows, history rendering (verbatim vs. summarized), input templating (raw vs. rewritten), and query reformulation before retrieval.

There’s a lot of leverage in experiments here - the exploration space of things to try is massive.
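To make this concrete, here’s a minimal sketch of a prompt A/B test. The variant copy, user IDs, and message shape are all hypothetical placeholders; the point is the deterministic bucketing, which keeps a user in the same arm across sessions so downstream metrics are easy to attribute.

```python
import hashlib

# Two candidate system prompts (hypothetical copy, for illustration only).
PROMPT_VARIANTS = {
    "control": "You are a support agent. Answer the user's question concisely.",
    "treatment": "You are a support agent. Think step by step, then answer concisely.",
}

def assign_variant(user_id: str) -> str:
    """Deterministically bucket a user into a prompt variant.

    Hashing the user ID (instead of random assignment) means the same
    user always sees the same prompt, across sessions and restarts.
    """
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 2
    return "control" if bucket == 0 else "treatment"

def build_messages(user_id: str, user_input: str) -> list[dict]:
    """Assemble the messages for one model call under the assigned variant."""
    variant = assign_variant(user_id)
    return [
        {"role": "system", "content": PROMPT_VARIANTS[variant]},
        {"role": "user", "content": user_input},
    ]
```

From here, you log the variant alongside your quality metric and compare arms like any other A/B test.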


Layer 2: The Hyperparameter Layer

Hyperparameters are the dials on the model itself - the configuration of the brain you’re asking the question to.

Classic dials include the model itself (and its version), temperature, and any reasoning-effort or thinking-budget setting. Other dials worth experimenting on: max tokens, top-p, stop sequences, and whether you use the model’s native structured output mode vs. asking for JSON in the prompt.

Once prompts are dialed in, experimenting on the hyperparameter layer is what lets you optimize the cost/latency/quality frontier.
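A hyperparameter experiment is essentially a sweep over those dials, recording cost, latency, and quality for each combination. A minimal sketch, where `call_model`, `score`, and the model names are stand-ins for your real inference client and eval harness:

```python
import itertools
import time

MODELS = ["model-small", "model-large"]   # placeholder model names
TEMPERATURES = [0.0, 0.3, 0.7]

def call_model(model: str, temperature: float, prompt: str) -> str:
    # Stub: replace with your real inference call.
    return f"answer from {model} at t={temperature}"

def score(output: str) -> float:
    # Stub: replace with your eval (LLM-as-judge, exact match, etc.).
    return float(len(output) > 0)

def sweep(prompt: str) -> list[dict]:
    """Run every (model, temperature) combination and record the trade-offs."""
    results = []
    for model, temp in itertools.product(MODELS, TEMPERATURES):
        start = time.perf_counter()
        output = call_model(model, temp, prompt)
        results.append({
            "model": model,
            "temperature": temp,
            "latency_s": time.perf_counter() - start,
            "quality": score(output),
        })
    return results
```

Plotting quality against latency or cost from a table like this is what exposes the Pareto frontier: configurations that nothing else beats on both axes at once.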


Layer 3: The Architecture Layer

Architecture is the shape of the system around the model. The graph, the routing, the memory, the guardrails.

Classic experiments include: a single agent vs. a router plus specialists, and a fixed workflow vs. an autonomous tool-use loop. Other surfaces worth experimenting on: retry logic, tool selection (which tools the agent has access to), context window strategy, and whether guardrails live inside the prompt or as a separate classifier layer.

Architecture experiments are the slowest to run and the hardest to measure - but they’re the only place where step-changes live. If you’ve tuned your prompts and your model config and you’re still hitting a ceiling, this is the layer to look at.
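As one example of an architecture experiment, here’s a sketch of A/B testing where guardrails live: folded into the system prompt, or as a separate classifier that gates the agent call. `run_agent` and `is_unsafe` are stubs standing in for your agent loop and safety classifier; the routing between variants is the part that matters.

```python
def run_agent(system_prompt: str, user_input: str) -> str:
    # Stub: replace with your real agent loop.
    return f"[agent reply to: {user_input}]"

def is_unsafe(text: str) -> bool:
    # Stub: replace with your real safety classifier.
    return "forbidden" in text.lower()

def handle_in_prompt(user_input: str) -> str:
    # Variant A: safety instructions live inside the system prompt,
    # and the model itself is trusted to refuse.
    system = "You are a helpful agent. Refuse unsafe requests."
    return run_agent(system, user_input)

def handle_classifier(user_input: str) -> str:
    # Variant B: a separate classifier gates the agent call entirely.
    if is_unsafe(user_input):
        return "I can't help with that."
    return run_agent("You are a helpful agent.", user_input)

def handle(user_input: str, experiment_arm: str) -> str:
    """Route a request to one architecture variant based on the assigned arm."""
    if experiment_arm == "a":
        return handle_in_prompt(user_input)
    return handle_classifier(user_input)
```

Variant B adds a classifier call on every request, so this experiment is precisely a quality-vs-latency-vs-cost trade at the architecture level.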


Closing

I believe in an analytical approach to figuring out which of these layers to focus on. You pick the layer that matches the symptom: if behaviour is wrong, start with semantics. If cost or latency is the problem, look at hyperparameters. If you’re hitting a ceiling that no amount of prompt tuning can break, it’s an architecture question.
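That symptom-to-layer mapping is simple enough to write down directly. A toy sketch, with an illustrative symptom taxonomy (use whatever labels your monitoring actually produces):

```python
def layer_to_investigate(symptom: str) -> str:
    """Map an observed symptom to the layer worth experimenting on first.

    Symptom names here are illustrative, not a standard taxonomy.
    """
    mapping = {
        "wrong_behaviour": "semantic",        # the agent says the wrong thing
        "too_expensive": "hyperparameter",    # cost is the bottleneck
        "too_slow": "hyperparameter",         # latency is the bottleneck
        "quality_ceiling": "architecture",    # prompt tuning has plateaued
    }
    # Default to the semantic layer: it's the cheapest place to experiment.
    return mapping.get(symptom, "semantic")
```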

This was just a taste of how you can optimize your agent. To learn more, head over to twotail.ai and book some time with me.