As more and more teams build agents and put them into production, a new problem emerges: how do we optimize these things?
For traditional products, we know the canonical experiments: A/B tests on UX, content, and algorithms. But what’s the equivalent for agents? And where might we get the strongest ROI on an experiment?
As I see it, there are three surface areas of an agent that we can play with:
- Semantic: prompts and context, everything we send to the model
- Hyperparameter: which models we use and how we configure them
- Architecture: the shape of our agent system, what tools it has and what workflow it follows
Below, I dive into each of these, giving examples of experiments that we might run.
Layer 1: The Semantic Layer
The semantic layer is everything your agent sends to models, and that mainly means the prompt and the context. These can be static system prompts, or dynamic inputs that include user input or agent state.
Some classic experiments include:
- Role / Persona: “You are a helpful assistant” vs. “You are a Socratic tutor who never gives the answer directly.” Same input, wildly different outputs.
- Few-Shot Examples: Include previous cases or responses. These might come from a human-labelled “golden set”, or be retrieved dynamically to match the situation.
- CoT Trigger Phrasing: “Think step by step” vs. “Plan, then execute.” Small wording changes in reasoning instructions can shift output quality meaningfully.
Other surfaces worth experimenting on: tool descriptions, constraint ordering, role context in multi-step workflows, history rendering (verbatim vs. summarized), input templating (raw vs. rewritten), and query reformulation before retrieval.
There’s a lot of leverage in experiments here - the exploration space of things to try is massive.
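To make this concrete, here’s a minimal sketch of a prompt A/B harness using the role/persona example from above. Everything here is an assumption for illustration: `fake_model` stands in for your real LLM client, and `asks_a_question` is a toy grader (a crude proxy for Socratic style). The point is the shape: run each variant over the same eval set and compare mean scores.

```python
# Prompt variants under test (from the role/persona example above).
PROMPT_A = "You are a helpful assistant."
PROMPT_B = "You are a Socratic tutor who never gives the answer directly."

def run_variant(system_prompt, eval_set, call_model, grade):
    """Score one prompt variant over a labelled eval set; returns mean grade."""
    scores = [grade(case, call_model(system_prompt, case["input"]))
              for case in eval_set]
    return sum(scores) / len(scores)

# Stubbed model so the harness runs without an API key; swap in a real client.
def fake_model(system_prompt, user_input):
    if "Socratic" in system_prompt:
        return "What do you already know about this?"
    return "The answer is 4."

# Toy grader: reward responses that end in a question.
def asks_a_question(case, response):
    return 1.0 if response.strip().endswith("?") else 0.0

eval_set = [{"input": "What is 2 + 2?"}, {"input": "Why is the sky blue?"}]
score_a = run_variant(PROMPT_A, eval_set, fake_model, asks_a_question)
score_b = run_variant(PROMPT_B, eval_set, fake_model, asks_a_question)
```

Swap the stubs for your real model client and a real grader (human labels or an LLM judge), and the same loop covers any semantic-layer experiment: few-shot sets, CoT phrasing, tool descriptions, and so on.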
Layer 2: The Hyperparameter Layer
Hyperparameters are the dials on the model itself - the configuration of the brain you’re asking the question to.
- Model ID: GPT-4o (smart, expensive) vs. Gemini Flash (fast, cheap). The single highest-leverage swap you can make, and often the first one worth testing.
- Temperature: Temp 0 for deterministic classification vs. Temp 0.7 for creative generation. The right setting depends entirely on the job.
- Thinking Budget: For reasoning models, how much compute you let the model spend thinking. More thinking means better answers on hard tasks, but the cost and latency add up fast.
Other dials worth experimenting on: max tokens, top-p, stop sequences, and whether you use the model’s native structured output mode vs. asking for JSON in the prompt.
Once prompts are dialed in, experimenting on the hyperparameter layer is what lets you optimize the cost/latency/quality frontier.
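One way to explore that frontier is a grid sweep over configs, keeping only the Pareto-optimal ones. This is a sketch, not real data: the model names echo the examples above, the prices are illustrative, and `evaluate` is a stub you’d replace with a run over your eval set.

```python
from itertools import product

# Illustrative $/1M input tokens; not real pricing.
MODELS = {"gpt-4o": 5.00, "gemini-flash": 0.15}
TEMPERATURES = [0.0, 0.7]

def evaluate(model_id, temperature):
    """Placeholder: run your eval set with this config, return mean quality."""
    base = 0.9 if model_id == "gpt-4o" else 0.75  # made-up numbers
    return base - 0.05 * temperature

results = [
    {"model": m, "temperature": t,
     "quality": evaluate(m, t), "cost_per_mtok": MODELS[m]}
    for m, t in product(MODELS, TEMPERATURES)
]

# Keep only Pareto-optimal configs: drop any config for which some other
# config is at least as cheap and strictly higher quality.
frontier = [r for r in results
            if not any(o["cost_per_mtok"] <= r["cost_per_mtok"]
                       and o["quality"] > r["quality"] for o in results)]
```

The surviving configs are your menu: for each price point, the best quality you’ve measured. Picking among them becomes a product decision rather than a guess.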
Layer 3: The Architecture Layer
Architecture is the shape of the system around the model. The graph, the routing, the memory, the guardrails.
- Model Routing: A single model for every call vs. a cheap model first that escalates to an expensive one when confidence is low. This is where cost optimization gets serious.
- Decomposition: One monolithic agent that does everything vs. a planner that spawns specialized sub-agents per step. Different failure modes, different debuggability.
- Memory Strategy: No memory vs. a conversation summary vs. full vector-store recall. How much your agent remembers between runs changes what it’s capable of.
Other surfaces worth experimenting on: retry logic, tool selection (which tools the agent has access to), context window strategy, and whether guardrails live inside the prompt or as a separate classifier layer.
Architecture experiments are the slowest to run and the hardest to measure - but they’re the only place where step-changes live. If you’ve tuned your prompts and your model config and you’re still hitting a ceiling, this is the layer to look at.
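As one example of an architecture experiment, here’s a sketch of the cheap-first routing pattern described above. The models are stubs and the confidence heuristic (query length) is purely illustrative; in practice you’d use logprobs, a self-reported confidence score, or a verifier model.

```python
CONFIDENCE_THRESHOLD = 0.8  # an assumed cutoff; tune it experimentally

def route(query, cheap_model, strong_model, threshold=CONFIDENCE_THRESHOLD):
    """Try the cheap model first; escalate when its confidence is low."""
    answer, confidence = cheap_model(query)
    if confidence >= threshold:
        return answer, "cheap"
    return strong_model(query), "strong"

# Stub models so the sketch runs end to end.
def cheap_model(query):
    # Toy heuristic: pretend the cheap model is confident on short queries.
    confidence = 0.9 if len(query.split()) <= 5 else 0.4
    return f"cheap answer to: {query}", confidence

def strong_model(query):
    return f"strong answer to: {query}"
```

The experiment is then the threshold itself: sweep it, and measure what fraction of traffic escalates versus how much quality you lose at each setting.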
Closing
I believe in an analytical approach to figuring out which of these layers to focus on. You pick the layer that matches the symptom: if behaviour is wrong, start with semantics. If cost or latency is the problem, look at hyperparameters. If you’re hitting a ceiling that no amount of prompt tuning can break, it’s an architecture question.
This was just a taste of how you can optimize your agent. To learn more, head over to twotail.ai and book some time with me.