Evals Are a Design Problem, Not Just an Engineering Task

How to evaluate AI agents beyond outputs: a practical guide to agentic UX, behavior design, and reliable AI systems

May 25, 2026

Part of the series Agentic Experience Design, published every Monday.

AI agents are harder to evaluate than chatbots because they do not only generate answers. They choose tools, use context, follow policies, make decisions, recover from failure, and sometimes act on behalf of the user.

In this article, you’ll learn why evaluating only the final answer is no longer enough, why task completion can hide bad behavior, and why designers need to be involved in defining what “good AI behavior” means before a system reaches production.

The free section explains the shift from output review to behavior evaluation: why agents fail through wrong actions, loops, missed constraints, poor repair, and unsafe completion. The paid section turns that argument into a practical layer you can use with product and engineering teams: an 8-dimension Behavioral Eval Matrix, a simple way to write eval criteria, enterprise-style examples, and a designer’s lens for reviewing agent traces.

A few years ago, evaluating an AI system often meant asking a relatively simple question:

Did it give the right answer?

Not always an easy question, of course. But still a familiar one.

Was the response accurate? Was it relevant? Was it fluent? Did it follow the instruction? Did it avoid hallucinating? Did it sound acceptable in the context where it appeared?

This made sense when the system’s main job was to produce language.

A chatbot answered a question. A summarization model summarized a document. A support assistant suggested a reply. A conversational flow moved from one turn to the next.

The output was the thing we could inspect.

But agentic systems change the object of evaluation.

An agent does not only answer. It may retrieve information, choose a tool, call an API, use memory, update a record, search across systems, plan across steps, recover from failure, ask for clarification, escalate to a human, or decide not to act.

At that point, the question is no longer only whether the answer was correct.

The better question is whether the system was right to behave the way it did.

That is why evals are becoming a design problem.

Not only an engineering task.

The object of evaluation has changed

The shift is already visible across the field: evaluating an agent is no longer the same as evaluating a language model in isolation.

A language model can be tested on whether it produces a correct, fluent, relevant, or grounded answer. An agent needs to be evaluated on something broader. It may need to reason across steps, choose between tools, operate under partial information, remember relevant context, interact with external systems, and adapt when the environment changes.

This matters for design because the unit of quality is changing.

In traditional conversation design, quality often lived at the level of the turn. Did the system understand the intent? Did the response make sense? Did the fallback help the user recover? Did the handover happen at the right moment?

Those questions still matter.

But they are no longer enough.

In agentic systems, the unit of quality is not only the turn. It is the trajectory.

The system interprets a goal, chooses a path, acts, observes the result, adjusts, and continues. The user may only see the final response, but the experience has already been shaped by a series of invisible decisions. A poor decision early in that trajectory can still produce a polished final answer. That is precisely what makes agentic systems difficult to evaluate from the surface.

A system can produce a polished final message after using the wrong source. It can complete a task while skipping a required confirmation. It can answer with confidence while the evidence behind that answer is incomplete. It can retrieve the right data and still use it in the wrong way. It can even resolve the user’s request while violating a policy boundary that the user never sees.

From the outside, the answer may look acceptable.

Inside the system, the behavior may be unsafe, unreliable, expensive, or simply wrong.

That is the gap evals need to expose.
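To make that gap concrete, here is a minimal Python sketch that scores the same trace two ways: once by looking only at the final answer, and once by walking the path. Everything in it is hypothetical. The Step and Trace shapes, the action names, and the APPROVED_SOURCES and NEEDS_CONFIRMATION sets are illustrative assumptions, not a real tracing format.

```python
from dataclasses import dataclass

# Hypothetical trace format, assumed for illustration only.
@dataclass
class Step:
    action: str       # e.g. "retrieve", "confirm", "update_record"
    source: str = ""  # where retrieved data came from, if any

@dataclass
class Trace:
    steps: list
    final_answer_ok: bool

APPROVED_SOURCES = {"policy_db"}        # assumed allow-list of sources
NEEDS_CONFIRMATION = {"update_record"}  # assumed actions that require a prior confirm

def output_only_eval(trace: Trace) -> bool:
    """Looks only at the surface: was the final answer acceptable?"""
    return trace.final_answer_ok

def trajectory_eval(trace: Trace) -> list:
    """Walks the path and returns every behavioral issue found."""
    issues = []
    confirmed = False
    for step in trace.steps:
        if step.action == "confirm":
            confirmed = True
        elif step.action == "retrieve" and step.source not in APPROVED_SOURCES:
            issues.append(f"unapproved source: {step.source}")
        elif step.action in NEEDS_CONFIRMATION and not confirmed:
            issues.append(f"{step.action} without confirmation")
    if not trace.final_answer_ok:
        issues.append("final answer incorrect")
    return issues

trace = Trace(
    steps=[Step("retrieve", "old_wiki"), Step("update_record")],
    final_answer_ok=True,
)
print(output_only_eval(trace))  # True
print(trajectory_eval(trace))   # two issues, despite the acceptable answer
```

The same trace passes the output check and fails the trajectory check twice. Which sources count as approved, and which actions need confirmation, are design decisions rather than engineering defaults.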

A wrong answer is only one kind of failure

When agents are tested in interactive environments rather than static prompts, a different pattern emerges.

The interesting failures are no longer only wrong answers. They are failed actions, broken interaction contracts, loops, missed constraints, poor recovery, and decisions that looked reasonable locally but moved the system in the wrong direction.

This is important for designers because these failures are behavioral.

An invalid format is not just a formatting issue. It means the system failed to respect the contract required to continue the interaction. An invalid action is not just a technical mistake. It means the system attempted something outside the allowed action space. A repeated loop is not just inefficiency. It means the system does not know how to change strategy when its current path stops working.

In other words, the agent did not simply “answer badly.”

It failed to behave within the boundaries of the task.

This is an important shift. In an agentic system, a wrong answer is only one kind of failure. A wrong action, a repeated loop, a missed constraint, a premature decision, or a failure to stop can be just as important.
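One way to read these failures is as violations of an interaction contract that can be written down and checked. The sketch below is a rough illustration in Python, assuming a made-up contract: steps are plain dictionaries, and REQUIRED_FIELDS, ALLOWED_ACTIONS, and MAX_STEPS are invented values, not taken from any specific agent framework.

```python
# Hypothetical interaction contract, assumed for illustration.
REQUIRED_FIELDS = {"action", "args"}
ALLOWED_ACTIONS = {"search", "ask_user", "update_record", "stop"}
MAX_STEPS = 5  # assumed step budget

def contract_failures(steps: list) -> list:
    """Return (step_index, failure_label) pairs for a list of step dicts."""
    failures = []
    for i, step in enumerate(steps):
        missing = REQUIRED_FIELDS - set(step)
        if missing:
            # The step cannot even be interpreted: the format contract is broken.
            failures.append((i, "invalid_format"))
        elif step["action"] not in ALLOWED_ACTIONS:
            # Well-formed, but outside the allowed action space.
            failures.append((i, "invalid_action"))
    if len(steps) > MAX_STEPS:
        # The agent kept going past its budget: flag the first extra step.
        failures.append((MAX_STEPS, "failure_to_stop"))
    return failures

steps = [
    {"action": "search", "args": {"q": "refund policy"}},
    {"action": "delete_account", "args": {}},  # not in the action space
    {"action": "search"},                      # missing "args"
]
print(contract_failures(steps))
# [(1, 'invalid_action'), (2, 'invalid_format')]
```

None of these labels says anything about whether the final answer was right. They describe whether the agent stayed inside the boundaries of the task.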

And even completion is not enough.

A task can be completed and still be badly designed. It can be completed too slowly, with too much hidden uncertainty, with unnecessary tool calls, with weak evidence, with poor recovery, or through a path that violates policy. Task completion tells us whether the system reached the goal. It does not tell us whether it respected the path.

That is one of the central mindset shifts in Agentic Experience Design:

Task completion is necessary.

It is not sufficient.

The action space is part of the experience

In many AI product teams, designers are still brought in too late.

The model has been selected. The architecture already exists. The tools are connected. The prompts are written. A first evaluation setup may already be in place.

Then design is asked to improve the interface, the wording, the tone, or the user-facing flow.

But in an agentic system, the experience is not only what the system says.

It is also what the system is allowed to do.

The action space is part of the experience.

When a system can act, the boundaries of action become part of the design material. Whether the agent can update an account, access billing data, send an email, trigger a workflow, change a setting, or escalate to a human is not just an implementation detail. It defines the relationship between the user and the system.

A system that can only recommend behaves differently from a system that can execute. A system that must ask before acting creates a different experience from one that acts silently in the background. A system that can see internal notes, customer history, or payment information creates a different trust contract from one that only works with public documentation.

These decisions shape autonomy, risk, permission, and trust.

That is why evals cannot be designed only around output quality. They need to test whether the system behaves correctly inside the boundaries of its role.

A useful eval has to look at the conditions around the action. It is not enough to know that the agent completed the task. We need to know whether it had enough evidence to act, whether the action was allowed for that user, whether it respected the relevant policy, and whether it should have asked for confirmation before moving forward.
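Those four questions (evidence, permission, policy, confirmation) can be expressed as a simple pre-action check. The following is a sketch under stated assumptions: the ActionContext fields, the role names, the evidence threshold, and the permission table are all hypothetical placeholders that a real team would replace with its own rules.

```python
from dataclasses import dataclass

@dataclass
class ActionContext:
    action: str
    user_role: str
    evidence_count: int    # how many supporting records the agent found
    violates_policy: bool  # result of a separate, assumed policy check
    user_confirmed: bool

# All values below are illustrative assumptions.
MIN_EVIDENCE = 2
ROLE_PERMISSIONS = {
    "support_agent": {"update_address"},
    "supervisor": {"update_address", "refund"},
}
CONFIRM_FIRST = {"refund"}

def unmet_conditions(ctx: ActionContext) -> list:
    """List every condition that was not satisfied before the action."""
    unmet = []
    if ctx.evidence_count < MIN_EVIDENCE:
        unmet.append("insufficient_evidence")
    if ctx.action not in ROLE_PERMISSIONS.get(ctx.user_role, set()):
        unmet.append("not_permitted")
    if ctx.violates_policy:
        unmet.append("policy_violation")
    if ctx.action in CONFIRM_FIRST and not ctx.user_confirmed:
        unmet.append("missing_confirmation")
    return unmet

ctx = ActionContext("refund", "support_agent", evidence_count=1,
                    violates_policy=False, user_confirmed=False)
print(unmet_conditions(ctx))
# ['insufficient_evidence', 'not_permitted', 'missing_confirmation']
```

An eval built this way can fail a run even when the refund went through, because the check is about the conditions around the action, not the outcome.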

The system’s behavior is not separable from the constraints around it.

The evaluation should not be either.

Loops are failed repair strategies

One of the most revealing things about agentic systems is how often they fail by getting stuck.

The system does not always collapse dramatically. Sometimes it simply repeats. It tries the same action again. It asks the same question again. It follows the same path with slightly different wording. It receives feedback from the environment but does not know how to use that feedback to change strategy.

A loop is not just inefficiency.

A loop is a failed repair strategy.

It means the system has no useful policy for what to do when its current plan stops working. It does not know how to reinterpret the problem. It does not know how to ask for clarification. It does not know how to try a different route. It does not know how to reduce the scope. It does not know how to escalate. It does not know how to stop.

In traditional chatbot design, we knew a version of this problem as the fallback loop.

The user says something. The bot does not understand. The bot asks again. The user rephrases. The bot still does not understand. The conversation collapses.

Agentic systems create a more operational version of the same failure.

The system acts. The environment returns an error. The system tries again. The action fails. The system apologizes. The system repeats. The system tries the same path with slightly different wording.

The surface may look more intelligent.

The failure pattern is familiar.

Repair is still the difference between a system that can recover and a system that makes the user pay for its confusion.

This is why evals need to include repair behavior. Not only answer correctness. Not only task success. Not only tool accuracy. But what happens after the system gets stuck.

A system that cannot recover from failure should not be evaluated only by the moments where everything works.
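As a rough illustration, repair behavior can be evaluated by looking at what the agent did after its first failed action. The sketch below assumes a very simple, hypothetical event format (a list of action name and success pairs) and invented action names such as "escalate" and "stop". The max_repeats threshold is also an assumption.

```python
def repair_outcome(events: list, max_repeats: int = 2) -> str:
    """Classify what the agent did after its first failed action.

    events: ordered list of (action, succeeded) tuples. Hypothetical format.
    """
    first_fail = next((i for i, (_, ok) in enumerate(events) if not ok), None)
    if first_fail is None:
        return "no_failure"
    failed_action = events[first_fail][0]
    repeats = 0
    for action, ok in events[first_fail + 1:]:
        if action == "escalate":
            return "escalated"
        if action == "stop":
            return "stopped"
        if ok:
            # A retry or a different route eventually worked.
            return "recovered"
        if action == failed_action:
            repeats += 1
            if repeats >= max_repeats:
                # Same failing action again and again: a failed repair strategy.
                return "looped"
    return "unresolved"

print(repair_outcome([("call_billing_api", False),
                      ("call_billing_api", False),
                      ("call_billing_api", False)]))  # looped
print(repair_outcome([("call_billing_api", False),
                      ("ask_user", True)]))           # recovered
print(repair_outcome([("call_billing_api", False),
                      ("escalate", True)]))           # escalated
```

How many repeats count as a loop, and which outcomes are acceptable for which task, are exactly the kind of thresholds a designer can help define.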

“Good” is no longer a single metric

In agentic systems, there is no single metric for good behavior.

A system may be strong on task completion and weak on reliability. It may use tools correctly but handle uncertainty poorly. It may produce a good answer but take too long or cost too much. It may sound fluent but behave unsafely. It may work in a static test and fail in a dynamic environment. It may perform well in a demo and fail under repeated use.

This is where many AI product conversations become too narrow.

Teams often ask whether the system works. But in agentic systems, “works” needs to be unpacked. Does it work once, or reliably? Does it work in the happy path, or under noisy input? Does it work when tools return errors? Does it work when the user has limited permissions? Does it work when the right answer depends on policy, context, or risk?
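The difference between “works once” and “works reliably” becomes visible as soon as the same scenario is run several times. Here is a small Python sketch, assuming each run has already been graded pass or fail. The scenario names and the results are made up for illustration.

```python
def reliability_report(runs_by_scenario: dict) -> dict:
    """Summarize repeated pass/fail runs per scenario."""
    report = {}
    for scenario, results in runs_by_scenario.items():
        report[scenario] = {
            "pass_rate": sum(results) / len(results),
            "passed_once": any(results),       # enough for a demo
            "passed_every_run": all(results),  # closer to "reliable"
        }
    return report

# Invented results: four graded runs per scenario.
runs = {
    "happy_path": [True, True, True, True],
    "tool_returns_error": [True, False, False, True],
}
report = reliability_report(runs)
print(report["tool_returns_error"])
# {'pass_rate': 0.5, 'passed_once': True, 'passed_every_run': False}
```

A system that “works” in the sense of passed_once can still fail half the time under a tool error, which is the distinction the questions above are pointing at.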

The more agentic the system becomes, the more evaluation needs to move from a single success metric to a behavioral profile.

That profile includes visible user experience qualities, such as clarity, usefulness, latency, and interaction quality. It also includes less visible operational qualities, such as tool selection, parameter accuracy, memory behavior, robustness, cost, and compliance.
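One possible way to represent a behavioral profile is a set of per-dimension scores, each checked against its own minimum rather than averaged into one number. The sketch below is illustrative only: the dimension names are a small subset, and the scores and minimums are invented.

```python
# Assumed minimum acceptable score per dimension (0.0 to 1.0).
PROFILE_MINIMUMS = {
    "task_completion": 0.9,
    "tool_selection": 0.9,
    "robustness": 0.8,
    "compliance": 1.0,  # assumed: no tolerated violations
}

def below_minimum(scores: dict, minimums: dict = PROFILE_MINIMUMS) -> list:
    """Return the dimensions whose score falls under its own minimum."""
    return sorted(d for d, m in minimums.items() if scores.get(d, 0.0) < m)

# Invented scores for one agent version.
scores = {
    "task_completion": 0.97,
    "tool_selection": 0.95,
    "robustness": 0.90,
    "compliance": 0.80,
}
average = sum(scores.values()) / len(scores)
print(round(average, 3))      # 0.905, which looks healthy as a single number
print(below_minimum(scores))  # ['compliance']
```

The average hides the one dimension that would block a release. A profile keeps it visible.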

For designers, this is not a reason to become ML engineers.

It is a reason to become much more precise about what kind of behavior the system should be evaluated against.

Enterprise agents need enterprise evals

This becomes especially visible in enterprise contexts.

A consumer assistant can be impressive while still being inconsistent. An internal enterprise agent cannot rely on charm, fluency, or occasional success. It has to operate inside permission structures, compliance requirements, data boundaries, audit expectations, and domain-specific policies.

In an enterprise setting, the agent does not simply need to find the right answer. It needs to know whether this user is allowed to access that answer. It does not simply need to complete the task. It needs to complete it within policy. It does not simply need to be helpful. It needs to be auditable, reliable, and safe across repeated use.

A support agent, for example, may need to distinguish between information it can share directly, information it can use internally but not expose, and information that requires human review. A finance agent may need to follow approval thresholds. A medical or legal assistant may need to refuse certain requests, add disclaimers, or escalate to a qualified professional. An internal operations agent may need to respect role-based permissions before retrieving or modifying data.
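The support-agent distinction above (share directly, use internally, require human review) can be turned into a check on what a reply actually exposed. This is a hedged sketch: the field names and their tiers are hypothetical, and a real system would load them from its own data classification.

```python
# Hypothetical classification of data fields into disclosure tiers.
DISCLOSURE_TIERS = {
    "order_status": "share",
    "internal_notes": "internal_only",
    "fraud_flag": "human_review",
}

def disclosure_tier(field_name: str) -> str:
    """Unknown fields default to the most restrictive tier (an assumption)."""
    return DISCLOSURE_TIERS.get(field_name, "human_review")

def exposure_violations(fields_in_reply: list) -> list:
    """Fields that appeared in a user-facing reply but were not shareable."""
    return [f for f in fields_in_reply if disclosure_tier(f) != "share"]

print(exposure_violations(["order_status", "internal_notes"]))
# ['internal_notes']
```

The reply in this example may have fully resolved the customer’s question. It still fails the eval, because it exposed information the agent was only allowed to use internally.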

In these contexts, a task can look successful while still being unacceptable.

That is why enterprise evals are not just harder benchmarks.

They are evaluations of permission, policy, reliability, and consequence.

Evaluation is where boundaries become measurable

For a long time, many AI discussions treated guardrails as something separate from evaluation.

There is the model. There is the experience. There are the guardrails. There are the evals.

But in agentic systems, these layers are deeply connected.

A guardrail that is not evaluated is only an intention. A policy that is not tested is only a document. A boundary that is not operationalized is only a principle.

Evaluation is where boundaries become measurable.

If the system should ask before taking action, evals need to test that. If the system should escalate when confidence is low, evals need to test that. If the system should not use memory without consent, evals need to test that. If the system should not expose restricted information, evals need to test that. If the system should recover after a failed tool call, evals need to test that. If the system should stop after repeated failure, evals need to test that.

This is where design becomes concrete.

Not in the abstract statement that the system should be “trustworthy,” but in the specific behavioral conditions under which the system should answer, ask, act, retry, refuse, escalate, or stop.
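Written down, those conditions start to look like test cases: a scenario paired with the behavior the system is expected to choose. The sketch below is a minimal illustration in Python. The scenarios, the mapping to expected behaviors, and the stub agent are all assumptions made for the example, not a prescribed test suite.

```python
from enum import Enum

class Behavior(Enum):
    ANSWER = "answer"
    ASK = "ask"
    ACT = "act"
    RETRY = "retry"
    REFUSE = "refuse"
    ESCALATE = "escalate"
    STOP = "stop"

# Hypothetical scenarios and the behavior each one should trigger.
EVAL_CASES = [
    ("action requested, no confirmation yet", Behavior.ASK),
    ("confidence is low", Behavior.ESCALATE),
    ("user requests restricted information", Behavior.REFUSE),
    ("tool call failed once", Behavior.RETRY),
    ("tool call failed repeatedly", Behavior.STOP),
]

def run_cases(agent, cases) -> list:
    """Return (scenario, expected, observed) for every mismatch."""
    failures = []
    for scenario, expected in cases:
        observed = agent(scenario)
        if observed is not expected:
            failures.append((scenario, expected.value, observed.value))
    return failures

def always_acts(scenario: str) -> Behavior:
    """Stub agent that acts no matter what, standing in for a real system."""
    return Behavior.ACT

for failure in run_cases(always_acts, EVAL_CASES):
    print(failure)
```

The stub fails every case, which is the point: an agent that always completes the task would look perfect on a task-completion metric and fail every boundary test here.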

That is one of the reasons evals belong in Agentic Experience Design.

They are not only how we measure whether a system works.

They are how we make explicit what kind of behavior the system is allowed to perform.

Why designers belong in the eval loop

There is a reason evals are often treated as engineering work.

They involve datasets, metrics, test cases, automation, traces, tooling, monitoring, and infrastructure. All of that matters.

But the fact that evals are technically implemented does not mean they are only technically defined.

Before a team can measure whether behavior is good, someone has to define what good behavior means.

That definition is not purely technical.

It includes user expectations, domain risk, interaction quality, autonomy boundaries, tone under uncertainty, repair patterns, escalation thresholds, permission logic, evidence standards, policy interpretation, and trust.

This is exactly where designers should contribute.

Not by replacing engineering. Not by owning model evaluation alone. Not by turning every designer into a benchmark specialist.

But by helping define the behavioral criteria that engineering will later test.

A designer should be able to look at an agentic system and ask what the agent should never do, when it should ask before acting, what kind of uncertainty should be visible to the user, what counts as a recoverable failure, and what requires escalation.

A designer should be able to help define what safe completion looks like.

Not just successful completion.

Safe completion.

Useful completion.

Legible completion.

Completion within boundaries.

That is a different level of design work from polishing the final message.

It is also where the profession is moving.

From understanding the problem to designing the eval layer

At this point, the argument is clear.

Evals are not only how teams measure whether an agent works. They are how teams define what good behavior means before the system acts in the real world.

But knowing why evals matter is not the same as knowing how to design them.

The difficult part is translation.

How do you move from a statement like “the agent should be trustworthy” to a concrete eval criterion? How do you define what “safe completion” means? How do you decide which behaviors need to be tested before engineering turns them into metrics, traces, graders, or dashboards?

That is where the practical layer starts.

For paid subscribers: the behavioral evaluation layer

The rest of this article moves from the argument to the application.

Here is exactly what you’ll get:

Use the 8-dimension Behavioral Eval Matrix
Define what “good behavior” means before the team starts writing metrics. The matrix covers task completion, tool use, uncertainty handling, repair, escalation, safety boundaries, memory behavior, and user-visible quality.

Write eval criteria designers can contribute to
Turn vague expectations like “the agent should be safe” into specific, testable behavioral rules that product, design, engineering, and domain teams can align on.

Adapt enterprise-style examples
Work from cancellation flows, API failures, permission boundaries, and cases where the agent must not complete the task even if it technically can.

Review traces with a designer’s lens
Learn what to look for in the agent’s path: goal understanding, autonomy level, evidence use, repair, escalation, and behavioral legibility.

This is not a technical deep dive into eval infrastructure.

It is the layer between design judgment and measurable behavior.

The behavioral eval matrix

A behavioral eval matrix is not a replacement for technical evaluation.

It is a translation layer.

It helps designers, product managers, researchers, engineers, and domain experts align on what the system should be tested for before the evaluation becomes a dataset, a metric, a grader, or a dashboard.

This matters because many teams move too quickly from “we need evals” to “what metric should we use?” But a metric is only useful after the team has clarified what kind of behavior the system is supposed to produce.

If the system should answer only when evidence is strong, then the eval needs to test evidence quality. If the system should ask before taking action, the eval needs to test confirmation behavior. If the system should escalate when policy becomes ambiguous, the eval needs to test escalation thresholds. If the system should recover after tool failure, the eval needs to test repair behavior, not only final task success.

The matrix is designed to make those decisions visible before they disappear into implementation.
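As a rough sketch of what such a translation layer could look like in practice, the Python below uses only the eight dimension names listed earlier in this article. The criteria themselves, and the when/expect/never fields, are invented examples and should not be read as the contents of the actual matrix.

```python
# The eight dimension names listed earlier in the article.
DIMENSIONS = [
    "task_completion", "tool_use", "uncertainty_handling", "repair",
    "escalation", "safety_boundaries", "memory_behavior", "user_visible_quality",
]

# Invented example criteria; the when/expect/never shape is an assumption.
criteria = [
    {"dimension": "repair",
     "when": "a tool call fails",
     "expect": "try a different route or escalate",
     "never": "repeat the same call more than twice"},
    {"dimension": "safety_boundaries",
     "when": "the user asks for an action that changes account data",
     "expect": "ask for confirmation before acting",
     "never": "act silently"},
]

def coverage_gaps(criteria: list) -> list:
    """Dimensions that have no written criterion yet."""
    covered = {c["dimension"] for c in criteria}
    unknown = covered - set(DIMENSIONS)
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    return [d for d in DIMENSIONS if d not in covered]

print(coverage_gaps(criteria))
# the six dimensions the team has not yet defined behavior for
```

A gap list like this gives the team something to discuss before anyone asks which metric to use.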
