
By César Miguelañez · Latitude · March 23, 2026
Key Takeaways
Agent evaluation requires goal-level assessment across multi-turn sessions — an agent can produce high-quality individual responses at every turn and completely fail the user's intent.
Tool response misinterpretation is the most dangerous failure mode: the tool returns a valid response, the agent interprets it wrong, and all downstream reasoning is corrupted — without any errors in logs.
A 20-step agent workflow with 95% per-step reliability succeeds only 36% of the time overall; compounding errors are invisible at the step level and require full session trace analysis to detect.
Human annotation is the rate-limiting step in any evaluation pipeline — tooling that prioritizes the right traces for review (anomaly signals, low quality scores) is what makes it tractable at scale.
Case study: connecting production failures to pre-deployment tests via an annotation-to-eval loop reduced critical errors by 80% and improved task completion rate from 62% to 78% in 8 weeks.
The six-step sequence (instrument → identify → build eval cases → CI gate → weekly annotation → automate loop) delivers the most quality improvement per engineering hour at any production stage.
There's no perfect evaluation system for AI agents. Anyone who tells you otherwise is selling a benchmark. The goal of this guide is more modest: to help you build an evaluation system that's better than what you have today, grounded in production reality, and maintainable as your agents evolve.
This guide is written for AI engineers and heads of AI who are past the prototype phase — your agents are in production, real users are affected by failures, and you need a systematic way to improve quality rather than firefighting individual incidents. We use concrete examples, include code, and acknowledge complexity honestly.
Why Agent Evaluation Is Different
Multi-Turn Conversation Complexity
Single-turn LLM evaluation has a clean structure: input → output → score. You can maintain a dataset of (input, expected output) pairs, run your model, and compute pass rates. This works because each evaluation is independent — the model's response to prompt A doesn't affect its response to prompt B.
Agent evaluation breaks this structure. An agent session is a sequence of interdependent steps where each output becomes part of the input for the next step. Consider the difference:
Single-turn: "What are the steps to reset a password?" → Evaluate correctness of the response.
Multi-turn agent: User describes a problem → agent asks clarifying questions → user provides partial information → agent calls a tool → tool returns ambiguous result → agent interprets it → makes a recommendation → user pushes back → agent revises. Evaluating this requires assessing whether the agent's goal was accomplished, whether each tool call was correct, and whether the agent's reasoning across turns was coherent — not just whether each individual response was well-formed.
Standard LLM eval frameworks evaluate each turn in isolation. Agent evaluation frameworks must evaluate the session as a whole.
Tool Use and Function Calling Challenges
When an agent calls an external tool, three distinct failure modes become possible that don't exist in single-call LLM evaluation:
Wrong tool selection: The agent called a tool that wasn't appropriate for the current step — or failed to call a tool when one was needed.
Incorrect parameters: The agent called the right tool but passed incorrect or malformed parameters.
Misinterpreted response: The tool returned a valid response, but the agent interpreted it incorrectly — building downstream reasoning on a wrong premise.
The third failure mode is the most dangerous and the hardest to detect. The tool call returned a 200. The LLM call that followed it returned a coherent response. Nothing in your error logs indicates a problem. The failure only becomes visible when you trace what the agent did with the tool's response — which requires capturing not just the tool call but the agent's subsequent reasoning.
Non-Determinism and State Management
LLM outputs are non-deterministic, but at the single-call level you can often treat them as approximately deterministic for evaluation purposes — run the same prompt multiple times and the distribution of outputs is stable enough to build a pass/fail criterion around.
Agent execution paths are non-deterministic in a structurally different way: the same initial input can trigger different tool calls, different branching decisions, and different terminal states depending on stochastic choices made early in the session. A test case where the agent passed last week may fail this week not because of a regression but because the execution path happened to branch differently.
This means agent evaluation requires statistical assessment over multiple runs, not binary pass/fail on single executions. And it requires understanding which failures are systematic (the agent reliably fails this class of input) versus stochastic (the agent sometimes fails inputs like this). These require different responses — systematic failures indicate a prompt or model issue; stochastic failures indicate insufficient robustness in edge-case handling.
Goal-Level vs. Response-Level Quality
The most fundamental difference: for agents, quality is measured at the goal level, not the response level. An agent can produce individually excellent responses at every turn and completely fail the user's intent. Conversely, an agent can produce responses that look rough by stylistic metrics but consistently accomplish what users need.
This means LLM-as-judge evaluations that score individual responses are necessary but not sufficient for agent evaluation. You need session-level outcome assessment — did the agent accomplish the goal? — which requires either human annotation or a sophisticated goal-aware evaluation function that understands the task context.
Core Evaluation Dimensions
1. Task Completion Accuracy
Definition: Did the agent successfully accomplish the user's stated or implied goal within the session? This is the primary success metric for any agent.
Why it matters: All other dimensions are diagnostic — they help you understand why the agent succeeded or failed at the task. Task completion is the terminal metric. An agent that scores well on conversation coherence and tool use correctness but consistently fails the task is failing at the thing that matters.
How to measure: Task completion is the hardest dimension to measure automatically because it requires goal-level understanding. Three approaches, in order of reliability:
Human annotation: Domain experts review sessions and classify outcomes (goal accomplished / partially accomplished / failed). High reliability, doesn't scale beyond a few hundred sessions per week without tooling support.
Goal-aware LLM judge: Provide the evaluating model with the user's goal, the agent's full session trace, and criteria for what "accomplished" means. More scalable, but requires careful judge design to avoid rewarding good-sounding responses that didn't accomplish the task. A minimal judge sketch follows this list.
Proxy metrics: Session completion rates, user satisfaction signals (explicit feedback, session abandonment, re-submission of the same request in a new session). Lower reliability but available at scale from production data without additional instrumentation.
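A minimal sketch of such a goal-aware judge, assuming the openai Python package; the model name, rubric wording, and trace format are illustrative rather than a prescribed setup:

```python
# Hypothetical goal-aware judge: the model sees the user's goal, the full
# session trace, and explicit criteria for "accomplished", then returns a label.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are evaluating whether an AI agent accomplished the user's goal.

User goal: {goal}

Full session trace (turns, tool calls, tool responses):
{trace}

Criteria: the goal counts as accomplished only if the agent's final answer is
consistent with the tool responses and resolves the user's request end to end.
Respond with exactly one label: ACCOMPLISHED, PARTIAL, or FAILED."""

def judge_session(goal: str, trace: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(goal=goal, trace=trace)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```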
2. Tool Use Correctness
Definition: For each tool call in the agent session: (a) was the correct tool called? (b) were correct parameters passed? (c) was the tool's response correctly interpreted?
Why it matters: Tool calls are the primary interface through which agents affect the external world and gather information. Incorrect tool use corrupts the agent's reasoning context silently — the downstream effects of a misinterpreted tool response compound through subsequent turns without generating an error.
How to measure:
Tool selection accuracy: Compare the agent's tool selection to a reference set of correct tools for the given step. Requires labeled data or an oracle that knows which tools are appropriate for which contexts.
Parameter validation: Schema validation against expected parameter types and ranges catches format errors (a schema-based sketch follows this list). Semantic validation (are these parameters correct for this goal?) requires a judge or human review.
Response interpretation correctness: The hardest to measure. Compare what the agent claimed about the tool's response to what the response actually said. Catching subtle misinterpretations requires a judge that reads both the tool response and the agent's subsequent reasoning.
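A minimal sketch of the schema-validation layer, assuming the jsonschema package; the tool name, schema, and parameter fields are illustrative:

```python
# Hypothetical parameter check: validate the agent's tool-call arguments
# against a JSON Schema before the call executes.
import jsonschema

GET_ACCOUNT_SCHEMA = {
    "type": "object",
    "properties": {
        "account_id": {"type": "string", "pattern": "^acct_[A-Za-z0-9]+$"},
        "include_invoices": {"type": "boolean"},
    },
    "required": ["account_id"],
    "additionalProperties": False,
}

def validate_tool_parameters(tool_name: str, params: dict) -> list[str]:
    """Return a list of validation errors (an empty list means the parameters are well-formed)."""
    schemas = {"billing_api.get_account": GET_ACCOUNT_SCHEMA}  # illustrative tool registry
    validator = jsonschema.Draft202012Validator(schemas[tool_name])
    return [error.message for error in validator.iter_errors(params)]

# Example: a malformed call caught before it corrupts downstream reasoning
errors = validate_tool_parameters("billing_api.get_account", {"account_id": 12345})
print(errors)  # ["12345 is not of type 'string'"]
```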
3. Conversation Coherence Across Turns
Definition: Does the agent maintain consistent context, constraints, and goals across the full session? Does it "remember" what was established in early turns and apply it appropriately in later turns?
Why it matters: Context window saturation and reasoning drift both manifest as coherence failures. Research on LLM attention mechanisms consistently shows degraded performance on information in the middle of long contexts ("lost in the middle"). For agents with sessions above a certain length, coherence failures become systematic.
How to measure:
Constraint consistency check: identify explicit constraints or preferences stated in early turns, verify they're respected in later outputs
Session-length-stratified quality analysis: compare task completion rates across short, medium, and long sessions to detect coherence degradation (see the sketch after this list)
Human annotation of selected long sessions looking specifically for "forgotten context"
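A minimal sketch of the stratified analysis, assuming each session record carries a turn count and a completion flag; the bucket boundaries are illustrative:

```python
# Hypothetical coherence-degradation check: bucket sessions by turn count and
# compare task completion rates across the buckets.
from collections import defaultdict

def completion_rate_by_length(sessions: list[dict]) -> dict[str, float]:
    buckets = defaultdict(list)
    for session in sessions:
        turns = session["turn_count"]
        bucket = "short (<5)" if turns < 5 else "medium (5-10)" if turns <= 10 else "long (>10)"
        buckets[bucket].append(1 if session["goal_accomplished"] else 0)
    return {bucket: sum(flags) / len(flags) for bucket, flags in buckets.items()}

sessions = [
    {"turn_count": 3, "goal_accomplished": True},
    {"turn_count": 8, "goal_accomplished": True},
    {"turn_count": 14, "goal_accomplished": False},  # long sessions failing more often
]
print(completion_rate_by_length(sessions))
```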
4. Safety and Guardrail Compliance
Definition: Does the agent comply with defined safety boundaries — refusing inappropriate requests, avoiding PII leakage, maintaining appropriate scope?
Why it matters: Safety failures in agents can compound across turns in ways that single-call safety evaluations don't catch. An agent can be manipulated through multi-turn jailbreaking where each individual turn appears safe but the sequence crosses a boundary.
How to measure: Red teaming (systematic adversarial testing pre-deployment, e.g., via Promptfoo), real-time guardrails on production traffic (e.g., Galileo's Luna models), and periodic human review of flagged sessions. Safety evaluation should run on 100% of production traffic, not sampled — low-frequency safety failures are often the highest-severity ones.
5. Latency and Resource Efficiency
Definition: Total session latency, per-turn latency distribution, tool call latency contribution, and cost per completed session.
Why it matters: Agent sessions involving multiple LLM calls and tool invocations can incur latency that accumulates to user-perceptible delays. Cost per session (not per call) determines the unit economics of the agent product. Sessions that fail after many expensive tool calls represent pure cost with no value delivered.
How to measure: Instrument session-level latency and cost from the tracing layer. Track cost per successfully completed session (not just per call) to measure unit economics. Alert on sessions with anomalously high tool call counts, which often indicate agent loops.
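A minimal sketch of these session-level metrics, assuming cost and tool-call counts are already aggregated per session from the tracing layer; the field names and loop threshold are illustrative:

```python
# Hypothetical session-level efficiency metrics computed from trace data.
def cost_per_completed_session(sessions: list[dict]) -> float:
    """Total spend divided by the number of sessions that actually accomplished the goal."""
    total_cost = sum(s["cost_usd"] for s in sessions)
    completed = sum(1 for s in sessions if s["goal_accomplished"])
    return total_cost / completed if completed else float("inf")

def flag_possible_loops(sessions: list[dict], max_tool_calls: int = 15) -> list[str]:
    """Sessions with anomalously many tool calls often indicate the agent is looping."""
    return [s["session_id"] for s in sessions if s["tool_call_count"] > max_tool_calls]
```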
Evaluation Approaches
1. Production-Based Evaluation
Real production sessions are the highest-quality signal for agent evaluation because they represent the actual distribution of user behavior. Production-based evaluation means: instrument your production agent, capture full session traces, and evaluate quality on real sessions rather than synthetic ones.
Pros: Captures failure modes that synthetic data doesn't anticipate. Self-updating — as user behavior evolves, the evaluation signal evolves with it. Doesn't require upfront dataset curation.
Cons: Requires production traffic to be useful. Evaluating on 100% of production sessions at the response level is expensive (LLM-as-judge costs). Human review doesn't scale beyond a few hundred sessions per week without tooling support.
When to use: As your primary ongoing quality signal once you have meaningful production traffic. Complement with synthetic simulation for pre-deployment testing.
2. Simulation-Based Testing
Simulate multi-turn agent scenarios synthetically — generate realistic user conversations, run them through the agent, and evaluate outcomes — before deploying changes.
Pros: Enables pre-deployment testing of failure modes not yet seen in production. Can test edge cases systematically. Doesn't require production traffic.
Cons: Synthetic scenarios inevitably miss the long tail of real user behavior. Simulation quality depends on how realistically you can model user behavior. "Teaching to the test" risk: agents optimized on synthetic scenarios may not generalize.
When to use: Pre-deployment validation, particularly after model updates or significant prompt changes. Combine with production-based evaluation — simulation tests what you expect, production reveals what you didn't.
3. Human Annotation Workflows
Domain experts review production sessions, classify outcomes, and identify failure modes. Human annotation is the highest-quality evaluation signal available because human judges understand domain-specific quality criteria that automated metrics can't capture.
Pros: Captures subtle, domain-specific quality failures that LLM-as-judge evaluators miss. Provides ground truth for training better automated evaluators. Directly captures goal-level quality assessment.
Cons: Doesn't scale without tooling support. Rate-limited by the availability of qualified domain experts. Requires careful workflow design to ensure annotators are reviewing the sessions most likely to contain useful signal rather than random samples.
When to use: As the primary quality signal early in production when you're still learning what failure modes look like. As the ground truth layer for calibrating automated evaluators. As the rate-limiting step in the production-to-eval loop — human annotations define what failure means for your specific product.
4. Auto-Generated Evaluations from Production Data
Convert production failure patterns — identified through annotation and clustering — into evaluations that run automatically before each deployment.
Pros: Eval library grows from real production failures rather than from the team's prior assumptions about failure modes. Stays aligned with production reality as usage patterns evolve. Removes the manual bottleneck of eval case authorship.
Cons: Quality is bounded by annotation quality — garbage in, garbage out. Requires an annotation workflow to generate the source signal. The GEPA approach (Latitude's implementation) adds algorithmic complexity in exchange for automation.
When to use: As the primary mechanism for growing your eval suite at scale. The goal is a library where every known production failure pattern has a corresponding evaluation that would have caught it pre-deployment.
Building an Evaluation Pipeline
Step 1: Instrument Agent Traces
Every session must be captured as a connected trace — all turns, tool calls, intermediate reasoning steps, and state changes — with a consistent session identifier. Below is a minimal Python sketch using the OpenTelemetry SDK; the span names and attribute keys are illustrative rather than a fixed schema:
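```python
# Minimal sketch: one root span per agent session, child spans per turn and
# per tool call, all carrying the same session.id attribute so the full
# session can be reconstructed as a connected trace.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))  # swap for an OTLP exporter in production
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent.evaluation")

def run_session(session_id: str, user_goal: str) -> None:
    with tracer.start_as_current_span("agent.session") as session_span:
        session_span.set_attribute("session.id", session_id)
        session_span.set_attribute("session.user_goal", user_goal)

        with tracer.start_as_current_span("agent.turn") as turn_span:
            turn_span.set_attribute("session.id", session_id)
            turn_span.set_attribute("turn.index", 0)

            # Capture the tool call, its parameters, and the raw response so
            # misinterpretation can later be detected by comparing the response
            # to the agent's subsequent reasoning.
            with tracer.start_as_current_span("tool.call") as tool_span:
                tool_span.set_attribute("session.id", session_id)
                tool_span.set_attribute("tool.name", "billing_api.get_account")
                tool_span.set_attribute("tool.parameters", '{"account_id": "acct_123"}')
                tool_span.set_attribute("tool.response", '{"status": "pending"}')

run_session("sess_001", "Why was my account charged twice this month?")
```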
Step 2: Define Product-Specific Success Criteria
Generic eval metrics (coherence, relevance, fluency) are insufficient for agents. Define success criteria specific to your product.
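One minimal sketch, assuming criteria are expressed as data so the same definitions can drive both CI evals and production scoring; the criteria names and trace fields are illustrative and echo the billing-agent case study below:

```python
# Hypothetical product-specific success criteria for a billing support agent.
from dataclasses import dataclass
from typing import Callable

@dataclass
class SuccessCriterion:
    name: str
    description: str
    check: Callable[[dict], bool]  # receives one session trace as a dict
    blocking: bool                 # whether a CI regression on this criterion blocks deployment

CRITERIA = [
    SuccessCriterion(
        name="billing_status_correct",
        description="The status the agent told the user matches the billing API response",
        check=lambda t: t["claimed_status"] == t["tool_status"],
        blocking=True,
    ),
    SuccessCriterion(
        name="escalated_when_required",
        description="Sessions that warranted escalation were actually escalated",
        check=lambda t: not t["needs_escalation"] or t["escalated"],
        blocking=False,
    ),
]

def score_session(trace: dict) -> dict[str, bool]:
    return {criterion.name: criterion.check(trace) for criterion in CRITERIA}
```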
Step 3: Run Evaluations in CI/CD
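Run the suite of eval cases on every change and block the deployment when a blocking criterion exceeds its threshold (the case study below gates on a 1% billing API misinterpretation rate). A minimal sketch of such a gate, assuming eval cases are stored as (input, context, expected outcome) records in a JSON file and that run_agent is a placeholder for your agent's entry point; the file name and thresholds are illustrative:

```python
# Hypothetical CI gate: run every eval case, aggregate failure rates per
# failure category, and exit non-zero if a blocking threshold is exceeded.
import json
import sys

BLOCKING_THRESHOLDS = {
    "billing_api_misinterpretation": 0.01,  # mirrors the 1% gate from the case study
}

def run_agent(user_input: str, context: dict) -> dict:
    """Placeholder for your agent's entry point; returns a structured outcome."""
    raise NotImplementedError

def run_eval_suite(cases_path: str) -> dict[str, float]:
    with open(cases_path) as f:
        cases = json.load(f)  # list of {"input", "context", "expected_outcome", "category"}
    failures: dict[str, list[bool]] = {}
    for case in cases:
        result = run_agent(case["input"], case["context"])
        failed = result["outcome"] != case["expected_outcome"]
        failures.setdefault(case["category"], []).append(failed)
    return {category: sum(flags) / len(flags) for category, flags in failures.items()}

if __name__ == "__main__":
    rates = run_eval_suite("eval_cases.json")
    for category, threshold in BLOCKING_THRESHOLDS.items():
        rate = rates.get(category, 0.0)
        if rate > threshold:
            print(f"FAIL: {category} at {rate:.1%} exceeds threshold {threshold:.0%}")
            sys.exit(1)  # non-zero exit blocks the deployment pipeline
    print("All blocking criteria within thresholds")
```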
Step 4: Regression Detection and Alerting
After deployment, monitor whether metrics regressed. Track a rolling baseline and alert on statistically significant drops:
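A minimal sketch using Welch's t-test via scipy.stats.ttest_ind, where the samples are per-session task-completion flags; the alerting itself is left as a print statement:

```python
# Hypothetical regression check: compare post-deployment task-completion
# scores against the rolling pre-deployment baseline with Welch's t-test.
from scipy import stats

def detect_regression(baseline: list[float], current: list[float], alpha: float = 0.05):
    """Return (regressed, p_value); regressed is True only for a significant drop."""
    t_stat, p_value = stats.ttest_ind(baseline, current, equal_var=False)
    dropped = sum(current) / len(current) < sum(baseline) / len(baseline)
    return (p_value < alpha and dropped), p_value

# Per-session task completion: 1 = goal accomplished, 0 = failed
baseline_scores = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
current_scores = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]

regressed, p_value = detect_regression(baseline_scores, current_scores)
if regressed:
    print(f"ALERT: task completion regressed (p = {p_value:.3f})")
```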
Platform Comparison Framework
When selecting an evaluation platform, the key question is whether the platform was built for agents or retrofitted from LLM monitoring. The architectural difference determines which failure modes surface naturally.
| Platform | Session Tracing | Issue Discovery | Eval Generation | Best For |
|---|---|---|---|---|
| Latitude | Native — causal traces | Issue lifecycle + GEPA | Auto from annotations | Production agents, closed eval loop |
| Braintrust | Supported | Topics (beta) | Manual — CI gates | Eval-driven development |
| Langfuse | Strong | None | Manual | Self-hosted; data residency |
| LangSmith | LangChain-native | Insights (partial) | Manual | LangChain teams |
| Arize Phoenix | OTel-native | None | LLM-as-judge | OTel teams; open-source |
Case Study: Reducing Critical Errors by 80%
A B2B SaaS company operates an AI support agent handling billing and account management for enterprise customers. Before implementing a structured evaluation pipeline, the team's quality process was reactive: customers reported issues, the team investigated logs, found the root cause, and fixed it.
The problem: Three categories of failure were recurring despite manual fixes. Billing agents were occasionally misinterpreting account status responses from the billing API — treating "pending" states as "active" and providing incorrect information to customers. Escalations to human agents were inconsistent — the agent would handle some escalation-warranted cases itself and escalate trivial cases that should have been self-resolved. Session coherence degraded in conversations above 10 turns, where the agent would lose track of constraints established at the start.
Before metrics: Task completion rate ~62%. Critical error rate (incorrect billing information provided) ~4.5%. Support ticket reopen rate 23% (indicating the agent failed to actually resolve the issue).
The evaluation setup:
Instrumented full session traces with tool call capture (billing API calls now fully traced including response content)
Domain expert annotators (support leads) reviewed 200 production sessions per week through a prioritized queue — not random sampling, but anomaly-signal-prioritized review
Three failure mode categories identified and tracked: billing API misinterpretation, escalation correctness, context coherence at 10+ turns
Eval cases generated from annotated failures — 47 cases in the first month
Eval suite added to CI, blocking deployments if billing API misinterpretation rate exceeded 1% on the eval set
After metrics (8 weeks): Task completion rate 78% (+16 points). Critical error rate 0.9% (80% reduction). Ticket reopen rate 14% (-9 points). The billing API misinterpretation failures — previously only visible after customer complaints — were now caught in CI before deployment.
Key lesson: The highest-value improvement wasn't a better eval framework or a better model. It was connecting the production failures the team already knew about to pre-deployment tests that would catch them. The failure patterns existed; they just weren't reflected in the eval suite until the annotation-to-eval loop was built.
Getting Started: A Practical Checklist
If you're operating production agents and don't yet have a systematic evaluation process, here's the sequence that delivers the most value fastest:
Instrument full session traces — not just LLM calls. Tool calls, state changes, all turns connected by session ID. This is the prerequisite for everything else.
Identify your top 3 failure categories — manually review 50 production sessions. You'll find the patterns quickly. Write them down as named failure modes.
Build 20 eval cases from those failures — real production failures converted to (input, context, expected outcome) triples. 20 cases is enough to start gating deployments.
Add eval run to your deployment process — block on the 1-2 criteria that represent genuine safety or quality regressions. Not everything needs to be a blocking check.
Set up a weekly annotation session — 2 hours per week of domain expert session review is enough to continuously grow your eval dataset from production failures.
Automate the production-to-eval loop — either manually by adding production incidents to your eval dataset, or by adopting a platform that does this automatically.
The teams that achieve stable, measurable improvement in agent quality all converge on the same loop: instrument production → identify failures → convert failures to evals → test before deployment → monitor for regression → repeat. The tooling you use for each step matters less than having all the steps connected.
Latitude provides a 30-day free trial with no credit card required. The self-hosted option is free with no feature restrictions. If you're starting from scratch, the fastest path to having a working evaluation loop is to instrument production traces, run the annotation queue for two weeks, and let the issue dashboard show you which failure modes to build your eval suite around.
Frequently Asked Questions
What are the core dimensions for evaluating AI agents in production?
Production AI agent evaluation requires five core dimensions: (1) Task completion accuracy — did the agent accomplish the user's goal? This is the primary success metric; all other dimensions are diagnostic. (2) Tool use correctness — was the right tool called, with correct parameters, and was the response correctly interpreted? (3) Conversation coherence across turns — does the agent maintain consistent context and constraints across the full session? (4) Safety and guardrail compliance — does the agent respect defined safety boundaries across multi-turn manipulation attempts? (5) Latency and resource efficiency — session-level latency and cost per successfully completed session.
How do you build an evaluation pipeline for production AI agents?
A production AI agent evaluation pipeline requires six steps: (1) Instrument full session traces — all turns, tool calls, and state changes connected by session ID. (2) Define product-specific success criteria — generic metrics like coherence and relevance are insufficient; define what "accomplished" means for your specific agent. (3) Run evaluations in CI/CD — block deployments on critical criteria (e.g., hallucinated policy rate below 1%). (4) Detect regression post-deployment — use statistical comparison (Welch's t-test) against pre-deployment baselines. (5) Annotate production failures — domain expert review of anomaly-prioritized traces creates ground truth. (6) Auto-generate evals from annotated failures — convert production incidents to pre-deployment test cases automatically.
What is the most dangerous AI agent failure mode and how do you detect it?
The most dangerous AI agent failure mode is tool response misinterpretation: the tool returns a valid response, the agent interprets it incorrectly, and all downstream reasoning proceeds from a wrong premise — without any errors appearing in logs. A real-world example: a billing API returns "credit pending" but the agent interprets it as "credit applied" and confidently informs the user. The tool call returned a 200. Every LLM call was syntactically correct. Detection requires capturing not just the tool call outcome but the agent's subsequent reasoning — comparing what the agent claimed about the tool response to what the response actually said — which requires a judge that reads both the tool response and the agent's downstream actions.
Latitude's 30-day free trial (no credit card) and free self-hosted option let you start the annotation-to-eval loop from day one. Start your free trial →



