As more and more teams build agents and put them into production, a new problem emerges: how do we optimize these things?
For traditional products, we know the canonical experiments: A/B tests on UX, content, and algorithms. But what’s the equivalent for agents? And where might we get the strongest ROI on an experiment?
As I see it, there are three surface areas of an agent that we can play with:
- Semantic: prompts and context, everything we send to the model
- Hyperparameter: which models we use and how we configure them
- Architecture: the shape of our agent system, what tools it has and what workflow it follows
Below, I dive into each of these, giving examples of experiments that we might run.
Layer 1: The Semantic Layer
The semantic layer is everything your agent sends to models, and that mainly means the prompt and the context. These can be static system prompts, or dynamic inputs that include user input or agent state.
Some classic experiments include:
- Role / Persona: “You are a helpful assistant” vs. “You are a Socratic tutor who never gives the answer directly.” Same input, wildly different outputs.
- Few-Shot Examples: Include previous cases or responses. These might come from a human-labelled “golden set”, or be retrieved dynamically to match the situation.
- CoT Trigger Phrasing: “Think step by step” vs. “Plan, then execute.” Small wording changes in reasoning instructions can shift output quality meaningfully.
Other surfaces worth experimenting on: tool descriptions, constraint ordering, role context in multi-step workflows, history rendering (verbatim vs. summarized), input templating (raw vs. rewritten), and query reformulation before retrieval.
There’s a lot of leverage in experiments here - the exploration space of things to try is massive.
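To make this concrete, here’s a minimal sketch of a prompt A/B harness using the role/persona example from above. Everything here is an assumption for illustration: `fake_model` stands in for your real LLM client, and `asks_a_question` is a toy grader (a crude proxy for Socratic style). The point is the shape: run each variant over the same eval set and compare mean scores.

```python
# Prompt variants under test (from the role/persona example above).
PROMPT_A = "You are a helpful assistant."
PROMPT_B = "You are a Socratic tutor who never gives the answer directly."

def run_variant(system_prompt, eval_set, call_model, grade):
    """Score one prompt variant over a labelled eval set; returns mean grade."""
    scores = [grade(case, call_model(system_prompt, case["input"]))
              for case in eval_set]
    return sum(scores) / len(scores)

# Stubbed model so the harness runs without an API key; swap in a real client.
def fake_model(system_prompt, user_input):
    if "Socratic" in system_prompt:
        return "What do you already know about this?"
    return "The answer is 4."

# Toy grader: reward responses that end in a question.
def asks_a_question(case, response):
    return 1.0 if response.strip().endswith("?") else 0.0

eval_set = [{"input": "What is 2 + 2?"}, {"input": "Why is the sky blue?"}]
score_a = run_variant(PROMPT_A, eval_set, fake_model, asks_a_question)
score_b = run_variant(PROMPT_B, eval_set, fake_model, asks_a_question)
```

Swap the stubs for your real model client and a real grader (human labels or an LLM judge), and the same loop covers any semantic-layer experiment: few-shot sets, CoT phrasing, tool descriptions, and so on.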
Layer 2: The Hyperparameter Layer
Hyperparameters are the dials on the model itself - the configuration of the brain you’re asking the question to.
- Model ID: GPT-4o (smart, expensive) vs. Gemini Flash (fast, cheap). The single highest-leverage swap you can make, and often the first one worth testing.
- Temperature: Temp 0 for deterministic classification vs. Temp 0.7 for creative generation. The right setting depends entirely on the job.
- Thinking Budget: For reasoning models, how much compute you let the model spend thinking. More thinking means better answers on hard tasks, but the cost and latency add up fast.
Other dials worth experimenting on: max tokens, top-p, stop sequences, and whether you use the model’s native structured output mode vs. asking for JSON in the prompt.
Once prompts are dialed in, experimenting on the hyperparameter layer is what lets you optimize the cost/latency/quality frontier.
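One way to explore that frontier is a grid sweep over configs, keeping only the Pareto-optimal ones. This is a sketch, not real data: the model names echo the examples above, the prices are illustrative, and `evaluate` is a stub you’d replace with a run over your eval set.

```python
from itertools import product

# Illustrative $/1M input tokens; not real pricing.
MODELS = {"gpt-4o": 5.00, "gemini-flash": 0.15}
TEMPERATURES = [0.0, 0.7]

def evaluate(model_id, temperature):
    """Placeholder: run your eval set with this config, return mean quality."""
    base = 0.9 if model_id == "gpt-4o" else 0.75  # made-up numbers
    return base - 0.05 * temperature

results = [
    {"model": m, "temperature": t,
     "quality": evaluate(m, t), "cost_per_mtok": MODELS[m]}
    for m, t in product(MODELS, TEMPERATURES)
]

# Keep only Pareto-optimal configs: drop any config for which some other
# config is at least as cheap and strictly higher quality.
frontier = [r for r in results
            if not any(o["cost_per_mtok"] <= r["cost_per_mtok"]
                       and o["quality"] > r["quality"] for o in results)]
```

The surviving configs are your menu: for each price point, the best quality you’ve measured. Picking among them becomes a product decision rather than a guess.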
Layer 3: The Architecture Layer
Architecture is the shape of the system around the model. The graph, the routing, the memory, the guardrails.
- Model Routing: A single model for every call vs. a cheap model first that escalates to an expensive one when confidence is low. This is where cost optimization gets serious.
- Decomposition: One monolithic agent that does everything vs. a planner that spawns specialized sub-agents per step. Different failure modes, different debuggability.
- Memory Strategy: No memory vs. a conversation summary vs. full vector-store recall. How much your agent remembers between runs changes what it’s capable of.
Other surfaces worth experimenting on: retry logic, tool selection (which tools the agent has access to), context window strategy, and whether guardrails live inside the prompt or as a separate classifier layer.
Architecture experiments are the slowest to run and the hardest to measure - but they’re the only place where step-changes live. If you’ve tuned your prompts and your model config and you’re still hitting a ceiling, this is the layer to look at.
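As one example of an architecture experiment, here’s a sketch of the cheap-first routing pattern described above. The models are stubs and the confidence heuristic (query length) is purely illustrative; in practice you’d use logprobs, a self-reported confidence score, or a verifier model.

```python
CONFIDENCE_THRESHOLD = 0.8  # an assumed cutoff; tune it experimentally

def route(query, cheap_model, strong_model, threshold=CONFIDENCE_THRESHOLD):
    """Try the cheap model first; escalate when its confidence is low."""
    answer, confidence = cheap_model(query)
    if confidence >= threshold:
        return answer, "cheap"
    return strong_model(query), "strong"

# Stub models so the sketch runs end to end.
def cheap_model(query):
    # Toy heuristic: pretend the cheap model is confident on short queries.
    confidence = 0.9 if len(query.split()) <= 5 else 0.4
    return f"cheap answer to: {query}", confidence

def strong_model(query):
    return f"strong answer to: {query}"
```

The experiment is then the threshold itself: sweep it, and measure what fraction of traffic escalates versus how much quality you lose at each setting.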
Closing
I believe in an analytical approach to figuring out which of these layers to focus on. You pick the layer that matches the symptom: if behaviour is wrong, start with semantics. If cost or latency is the problem, look at hyperparameters. If you’re hitting a ceiling that no amount of prompt tuning can break, it’s an architecture question.
This was just a taste of how you can optimize your agent. To learn more, head over to twotail.ai and book some time with me.