How to Evaluate LLM Agents: Practical Error Analysis

Learn a 4-step framework for LLM agent evaluation with manual trace review, binary scoring, common mistakes, and practical error analysis tips.

César Miguelañez

Shipping an LLM agent is easy compared with keeping one reliable.

Most production teams learn this the hard way. An agent answers a few demo prompts well, gains some tools, passes a growing test suite, and then fails in front of real users in ways nobody anticipated. The deeper problem is not just model quality. It is evaluation design.

The core argument in Matthew Kujava’s talk is simple and important: agent evaluation cannot be treated like a standard model benchmark. If your system plans across steps, calls tools, mutates state, and interacts with external systems, then judging only the final answer is not enough. You need to inspect behavior, not just outcomes.

For teams already running AI features in production, this is more than a methodological preference. It is a reliability practice. In this article, we’ll unpack that framework, explain why it matters, and add practical context for engineering teams trying to build evals that actually catch regressions.

Key Takeaways

  • Evaluate agents as systems, not as single responses. Review traces, tool calls, and intermediate steps - not only the final answer.

  • Start with manual error analysis before scaling automation. A team that has not looked closely at real failures will automate the wrong checks.

  • Use a simple pass/fail rubric for human review. Binary labels reduce ambiguity and speed up annotation.

  • Review a meaningful sample of real conversations. The video recommends a minimum of 50 full conversations with traces.

  • Build an error taxonomy from observed failures. Group recurring issues so your evals measure real risk, not imagined problems.

  • Treat evaluation time as engineering investment. Manual review is expensive, but production incidents are usually more expensive.

  • Be cautious with LLM-as-a-judge setups. They can be unstable, poorly calibrated, and biased toward certain answer styles.

  • Do not over-index on generic text similarity metrics. For agentic systems, they often miss the actual failure mode.

  • Assign a clear quality decision-maker. One domain expert should break labeling ties and keep standards consistent.

  • Automate after you understand the failures. Instrumentation and regression checks should follow observed patterns, not precede them.

The Real Evaluation Problem: Why Agents Are Different

Many teams still evaluate agents as if they were ordinary chat models: send in a prompt, compare the answer to an expected output, compute a score, move on.

That approach breaks down quickly in agentic systems.

An agent does more than generate text. It may:

  • decide whether to use a tool

  • choose among multiple tools

  • retrieve incomplete or noisy information

  • update memory or application state

  • perform multi-step reasoning

  • recover from earlier mistakes

  • produce a correct answer through a flawed process, or a flawed answer through a mostly correct process

This is why Kujava argues against treating agents like students taking a test. A student-style evaluation assumes the answer sheet is the main artifact. For agents, the process is part of the product.

That distinction matters in production. If an agent reaches the right answer by luck, using the wrong tool sequence or bad assumptions, it may still fail the next time inputs shift slightly. Conversely, a final-answer-only eval can mark a trace as a failure while hiding useful signal about which components are actually working.

For engineering leaders, this implies a change in mindset: agent eval is closer to systems debugging than benchmark scoring.

Why Final-Output Metrics Miss the Point

The talk strongly criticizes overreliance on standard similarity metrics such as BLEU or ROUGE. That criticism is well-founded for modern agent workflows.

These metrics were designed for narrow text comparison tasks. They can tell you whether two strings overlap. They cannot tell you whether an agent:

  • called the wrong API

  • failed to ground a claim in retrieved evidence

  • ignored user constraints

  • took an unnecessary detour

  • corrupted state

  • fabricated tool results

  • violated a business rule before landing on a plausible answer

For a production AI engineer, the practical issue is not that these metrics are mathematically bad. It is that they optimize for the wrong unit of analysis.

Even many sophisticated eval dashboards fall into a similar trap. They aggregate a pass rate or quality score while masking the shape of failures underneath. If your test suite reports 92% success, that number may feel reassuring. But if the remaining 8% includes data corruption, harmful escalation behavior, or silent retrieval misses, the average is not very useful.

A high top-line metric can coexist with severe operational risk.
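To make that concrete, here is a minimal sketch with invented numbers: 100 reviewed traces, 92 passes, and 8 failures broken down by failure tag. The tag names, the counts, and the `HIGH_SEVERITY` set are assumptions for illustration, not data from the talk.

```python
from collections import Counter

# Hypothetical labels for 100 reviewed traces: None means the trace passed,
# otherwise the value is the failure tag a reviewer assigned.
labels = [None] * 92 + [
    "data_corruption", "data_corruption",
    "silent_retrieval_miss", "silent_retrieval_miss", "silent_retrieval_miss",
    "harmful_escalation",
    "verbose_answer", "verbose_answer",
]

# Assumed severity policy: which failure tags this team treats as high-risk.
HIGH_SEVERITY = {"data_corruption", "harmful_escalation", "silent_retrieval_miss"}

pass_rate = sum(label is None for label in labels) / len(labels)
failures = Counter(label for label in labels if label is not None)
high_severity_count = sum(n for tag, n in failures.items() if tag in HIGH_SEVERITY)

print(f"pass rate: {pass_rate:.0%}")
for tag, n in failures.most_common():
    marker = "HIGH" if tag in HIGH_SEVERITY else "low"
    print(f"  {tag}: {n} ({marker})")
print(f"high-severity failures: {high_severity_count} of {sum(failures.values())}")
```

In this made-up sample, the dashboard number is 92%, yet 6 of the 8 failures fall into categories the team would consider serious. The breakdown, not the average, is what tells you where to look.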

The Four-Step Error Analysis Workflow

The most useful part of the talk is the concrete workflow for understanding agent behavior. It is deliberately manual at the start.

1. Review Real Conversations and Full Traces

The first step is to manually inspect a meaningful sample of interactions. The video suggests at least 50 full conversations, including traces and intermediate artifacts.

That means looking at:

  • prompts and user context

  • model outputs at each step

  • retrieval results

  • tool calls and arguments

  • tool return values

  • retries, loops, and dead ends

  • final response quality

This is the step many teams skip because it feels slow. But it is also the step that reveals how the system actually behaves.

In production settings, trace review often uncovers problems that never show up in synthetic test prompts, such as:

  • latent prompt conflicts

  • hidden dependency failures

  • bad retrieval ranking under long-tail queries

  • brittle tool selection logic

  • context-window overload

  • subtle UX issues where the user’s need was clear but the agent solved the wrong problem

If your team has not spent time reading traces, your eval stack is probably less mature than it looks.
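As a rough illustration of what "full trace" means in practice, the sketch below models one conversation with the artifacts listed above and flattens it into text a reviewer can read top to bottom. All class names, field names, and the `lookup_order` tool are hypothetical; your tracing layer will have its own schema.

```python
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class ToolCall:
    name: str
    arguments: dict[str, Any]
    result: Any = None
    error: Optional[str] = None  # set when the tool call failed

@dataclass
class Step:
    model_output: str
    retrieved_docs: list[str] = field(default_factory=list)
    tool_calls: list[ToolCall] = field(default_factory=list)

@dataclass
class Trace:
    conversation_id: str
    user_prompt: str
    steps: list[Step]
    final_response: str

def render_for_review(trace: Trace) -> str:
    """Flatten a trace into plain text so a reviewer can read it in order."""
    lines = [f"[{trace.conversation_id}] USER: {trace.user_prompt}"]
    for i, step in enumerate(trace.steps, start=1):
        lines.append(f"step {i} MODEL: {step.model_output}")
        for doc in step.retrieved_docs:
            lines.append(f"step {i} RETRIEVED: {doc}")
        for call in step.tool_calls:
            outcome = f"ERROR {call.error}" if call.error else f"-> {call.result!r}"
            lines.append(f"step {i} TOOL: {call.name}({call.arguments}) {outcome}")
    lines.append(f"FINAL: {trace.final_response}")
    return "\n".join(lines)

# Hypothetical example trace.
example = Trace(
    conversation_id="c-001",
    user_prompt="Where is my order 1234?",
    steps=[
        Step(
            model_output="I should look up the order.",
            tool_calls=[ToolCall("lookup_order", {"order_id": "1234"},
                                 result={"status": "shipped"})],
        )
    ],
    final_response="Your order 1234 has shipped.",
)
print(render_for_review(example))
```

The point of the flat rendering is simply to lower the cost of reading: if a reviewer has to click through nested JSON to see tool arguments, fewer traces get read.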

2. Use Open Coding to Annotate What Happened

After reviewing each conversation, write down what you observed.

This is not formal taxonomy work yet. It is lightweight annotation: what happened, was the outcome acceptable, and what sequence led there?

In qualitative research, this is often called open coding. In engineering terms, think of it as raw failure logging before abstraction.

Useful notes might include:

  • "answered correctly but skipped mandatory verification step"

  • "retrieval returned relevant doc, but model ignored it"

  • "tool argument malformed"

  • "hallucinated after empty search result"

  • "user intent ambiguous; recovery was weak"

  • "correct refusal, but explanation too vague"

At this stage, the goal is not elegance. It is fidelity. You are preserving the evidence needed to identify patterns later.
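Open coding needs very little structure. A minimal sketch, assuming a spreadsheet-friendly CSV is enough for your reviewers: one row per conversation, a pass/fail judgment, and a free-form note. The `OpenCodingNote` type, the conversation IDs, and the judgments shown are invented for illustration.

```python
import csv
import io
from dataclasses import dataclass, asdict

@dataclass
class OpenCodingNote:
    conversation_id: str
    acceptable: bool  # the reviewer's overall judgment of the outcome
    note: str         # free-form description of what happened

# Hypothetical notes, reusing the example phrasings above.
notes = [
    OpenCodingNote("c-001", False, "answered correctly but skipped mandatory verification step"),
    OpenCodingNote("c-002", False, "retrieval returned relevant doc, but model ignored it"),
    OpenCodingNote("c-003", True, "correct refusal, but explanation too vague"),
]

def notes_to_csv(rows: list[OpenCodingNote]) -> str:
    """Serialize notes to CSV text; the csv module quotes commas inside notes."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["conversation_id", "acceptable", "note"])
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buffer.getvalue()

print(notes_to_csv(notes))
```

Keeping the note as raw text is deliberate: categories come later, once the same phrases start repeating.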

3. Group Failures into a Taxonomy

Once enough traces are reviewed, recurring patterns emerge. This is when you build a taxonomy.

A good taxonomy turns messy observations into categories your team can act on. For example:

Possible failure buckets

  • Tool use errors

    • wrong tool selected

    • right tool, wrong arguments

    • tool output misunderstood

  • Retrieval failures

    • no relevant documents found

    • relevant documents found but ignored

    • stale or conflicting evidence

  • Reasoning and planning failures

    • premature conclusion

    • missed step in workflow

    • failure to recover after a bad intermediate result

  • Instruction-following failures

    • violated formatting constraint

    • ignored policy rule

    • missed user preference

  • User experience failures

    • unnecessarily verbose response

    • low-confidence answer presented as certain

    • lack of clarification when intent was ambiguous

This taxonomy is the bridge between manual review and scalable evaluation. It gives you a language for discussing failures across engineering, product, and leadership.

More importantly, it lets you create targeted regression checks based on what the system actually gets wrong.
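Once notes carry taxonomy tags, counting them is trivial, and the counts tell you which regression checks to write first. The sketch below assumes a reviewer has already assigned each note a (bucket, subtype) pair by hand; the tag names are hypothetical labels loosely following the buckets above.

```python
from collections import Counter

# Hypothetical hand-coded notes: (free-form note, (bucket, subtype)).
coded_notes = [
    ("tool argument malformed", ("tool_use", "wrong_arguments")),
    ("retrieval returned relevant doc, but model ignored it",
     ("retrieval", "relevant_docs_ignored")),
    ("hallucinated after empty search result",
     ("retrieval", "no_relevant_documents")),
    ("answered correctly but skipped mandatory verification step",
     ("reasoning_planning", "missed_step")),
    ("tool argument malformed", ("tool_use", "wrong_arguments")),
]

# Count failures per top-level bucket and per (bucket, subtype) pair.
bucket_counts = Counter(tag[0] for _, tag in coded_notes)
subtype_counts = Counter(tag for _, tag in coded_notes)

for bucket, n in bucket_counts.most_common():
    print(f"{bucket}: {n}")
    for (b, subtype), m in subtype_counts.most_common():
        if b == bucket:
            print(f"  {subtype}: {m}")
```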

4. Implement Fixes and Repeat After Major Changes

The last step is iteration.

After major changes - new tools, prompt redesigns, architecture updates, routing changes, retrieval tuning - you repeat the process. Not forever at the same intensity, but often enough to avoid drifting into false confidence.

This matters because agent systems are highly coupled. A seemingly isolated change can cause regressions elsewhere:

  • adding a new tool may alter tool selection behavior

  • changing prompts may affect refusal patterns

  • retrieval tuning may improve relevance for one domain while harming another

  • latency optimizations may reduce context available for planning

A mature eval culture treats every meaningful change as a potential redistribution of failures, not just an opportunity for improvement.
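One way to see a redistribution of failures is to compare failure-tag counts before and after a change. The following is a minimal sketch with invented tags and counts; it assumes the same number of traces was reviewed in both rounds, otherwise raw counts are not comparable.

```python
from collections import Counter

def failure_shift(before: list[str], after: list[str]) -> dict[str, int]:
    """Return the change in count per failure tag (after minus before),
    omitting tags whose count did not change."""
    b, a = Counter(before), Counter(after)
    return {tag: a[tag] - b[tag] for tag in sorted(set(b) | set(a)) if a[tag] != b[tag]}

# Hypothetical failure tags from two review rounds of equal size.
before_change = ["wrong_tool"] * 5 + ["missed_step"] * 3
after_change = ["wrong_tool"] * 1 + ["missed_step"] * 3 + ["over_refusal"] * 4

shift = failure_shift(before_change, after_change)
print(f"total failures before: {len(before_change)}, after: {len(after_change)}")
print(shift)
```

In this made-up case the total number of failures is unchanged at 8, so a top-line pass rate would show nothing, while the per-tag view shows tool selection improving and a new refusal problem appearing.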

The Uncomfortable Trade-Off: Manual Review Is Slow

The hardest truth in the talk is also the most operationally relevant: good agent evaluation takes time.

There is no escaping this. Manual trace review is expensive. It consumes engineering or domain-expert attention. It feels slower than writing another test harness or scorecard.

But the alternative is usually worse: shipping blind.

Kujava frames evaluation time as an investment, and that framing is especially useful for teams with production accountability. When an agent is customer-facing or workflow-critical, the cost comparison is not "manual review versus no cost." It is:

  • manual review now, versus

  • outages, false answers, damaged trust, support burden, and emergency fixes later

For CTOs and heads of AI, this is a governance point as much as an engineering one. If reliability matters, evaluation needs explicit capacity allocation. It cannot survive as a side activity squeezed in after feature work.

A practical way to operationalize this is to define eval time in planning:

  • reserve reviewer hours in each sprint

  • require trace review before major launches

  • assign ownership for dataset maintenance and taxonomy updates

  • include post-change regression review as part of the definition of done

The exact percentage of time is not specified as a hard rule in the video, and it will vary by system maturity. But the broader message is clear: if nobody has time to inspect behavior, nobody really knows how reliable the agent is.

Why Binary Scoring Often Beats 1–5 Ratings

One of the most actionable recommendations in the talk is to simplify human grading.

Instead of using scales like 1–5 or 0.0–1.0, use a binary label:

  • satisfactory / not satisfactory

  • pass / fail

  • thumbs up / thumbs down

This advice may sound reductive, but it solves a real annotation problem. Continuous scales invite hesitation and inconsistency. Reviewers hide uncertainty in the middle:

  • "maybe this is a 3"

  • "perhaps 0.7"

  • "it’s not great, but not terrible"

That ambiguity slows review and weakens dataset quality.

Binary scoring forces a clearer question: Would this result be acceptable in production for this use case?

For teams building eval pipelines, binary labels also make downstream automation easier:

  • simpler consensus rules

  • cleaner trend tracking

  • clearer regression thresholds

  • less reviewer calibration overhead

This does not mean every nuance disappears. You can still preserve nuance in side notes and taxonomy tags. The binary label handles the decision; the tags capture the diagnosis.

That split is often more useful than a single blended score.
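A minimal sketch of that split, and of how a binary label makes a regression threshold easy to state. The `Review` type, the tag names, and the 5-point `max_drop` default are assumptions chosen for illustration, not values from the talk.

```python
from dataclasses import dataclass, field

@dataclass
class Review:
    conversation_id: str
    passed: bool                                   # the decision
    tags: list[str] = field(default_factory=list)  # the diagnosis
    note: str = ""

def pass_rate(reviews: list[Review]) -> float:
    if not reviews:
        raise ValueError("no reviews to score")
    return sum(r.passed for r in reviews) / len(reviews)

def is_regression(baseline: list[Review], candidate: list[Review],
                  max_drop: float = 0.05) -> bool:
    """Flag a regression when the pass rate falls by more than max_drop."""
    return pass_rate(baseline) - pass_rate(candidate) > max_drop

# Hypothetical review sets: 18/20 passing before a change, 16/20 after.
baseline = [Review(f"b-{i}", passed=i >= 2) for i in range(20)]
candidate = [
    Review(f"c-{i}", passed=i >= 4, tags=[] if i >= 4 else ["wrong_arguments"])
    for i in range(20)
]

print(f"baseline {pass_rate(baseline):.0%}, candidate {pass_rate(candidate):.0%}")
print("regression:", is_regression(baseline, candidate))
```

With a 1–5 scale, the equivalent rule would need a choice of aggregation (mean, median, share above some cutoff) and a calibration story for each reviewer. With pass/fail, the threshold is a single number everyone can read.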

The Limits of LLM-as-a-Judge

The talk also cautions against depending too heavily on LLM judges, especially early in the process.

This is a timely warning. Many teams are attracted to LLM-as-a-judge because it appears scalable: write a rubric, run a model over outputs, compute scores, monitor trends.

The problem is not that LLM judges are useless. The problem is that they can be unstable and poorly calibrated.

Common failure modes include:

  • different scores across repeated runs

  • overly strict grading at the top end

  • poor discrimination between clearly wrong and partially correct outputs

  • bias toward answers that resemble the judge model’s own style

  • weak handling of domain-specific quality criteria

  • failure to reason reliably about multi-step traces

The video also mentions affinity bias: judging outputs with a model that shares tendencies with the model being evaluated can distort results. That is a real operational concern. A judge may reward familiar phrasing or reasoning patterns rather than actual usefulness.

A balanced takeaway for production teams is this:

  • Use LLM judges after you understand your failure modes.

  • Use them for narrow tasks where rubrics are explicit.

  • Validate them against human labels before trusting them.

  • Avoid using them as a substitute for early-stage manual discovery.

They are best seen as force multipliers, not primary truth sources.
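Validating a judge against human labels can start very simply. The sketch below compares binary judge verdicts with human pass/fail labels on the same traces and reports raw agreement plus Cohen's kappa, which corrects for agreement expected by chance. Kappa is our choice of statistic here, not something the talk prescribes, and the label lists are invented.

```python
def agreement_and_kappa(human: list[bool], judge: list[bool]) -> tuple[float, float]:
    """Return (observed agreement, Cohen's kappa) for two binary label lists."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    p_human = sum(human) / n   # share of traces the human passed
    p_judge = sum(judge) / n   # share of traces the judge passed
    expected = p_human * p_judge + (1 - p_human) * (1 - p_judge)
    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined.
        return observed, float("nan")
    return observed, (observed - expected) / (1 - expected)

# Hypothetical labels for 10 traces: the judge passes two traces the human failed.
human_labels = [True] * 6 + [False] * 4
judge_labels = [True] * 8 + [False] * 2

observed, kappa = agreement_and_kappa(human_labels, judge_labels)
print(f"agreement: {observed:.2f}, kappa: {kappa:.2f}")
```

Here the judge agrees with the human on 8 of 10 traces, but every disagreement is in the same direction: the judge is more lenient. Looking at which traces disagree is usually more informative than the summary number.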

Tooling: Use Enough Observability, Not Maximum Observability

Another practical point in the talk is tool choice.

There are now many observability and annotation platforms for LLM systems. They can be powerful, especially for teams that need collaboration, trace search, experiment tracking, and integrated eval workflows.

But more tooling is not always better.

If the platform is too heavy for your current stage, it can slow the team down or obscure what matters. For some teams, a lightweight custom UI may be more effective than a full platform.

That is a valuable reminder. The right question is not "What is the most advanced observability stack?" It is:

What setup lets our team quickly inspect traces, annotate failures, and learn from real usage?

For an early or mid-stage production team, the minimum useful toolset often includes:

  • trace capture for prompts, tool calls, and outputs

  • filtering by scenario or failure type

  • easy annotation workflow

  • linkage between traces and code/version changes

  • visibility into user context and system state where safe and allowed

If a simpler interface gets your reviewers to spend more time on actual analysis, it may be better than a richer platform nobody consistently uses.

The Case for a "Benevolent Dictator" in Eval

One of the more interesting recommendations is to appoint a single final decision-maker for quality judgments.

That may sound uncomfortably centralized, but the logic is strong. Multi-reviewer annotation often collapses into long debates over edge cases. If five engineers each apply slightly different quality standards, your labels become inconsistent and your metrics become noisy.

A designated domain expert can:

  • define what "good enough" means

  • resolve disagreements quickly

  • preserve consistency over time

  • keep the taxonomy aligned with product reality

This does not eliminate collaboration. It simply creates a tie-breaker and quality anchor.

For larger organizations, this role can evolve into a small evaluation council. But even then, one accountable owner should usually make final calls on rubric interpretation.

Reliability improves when standards are explicit and stable.

Three Industry Anti-Patterns to Avoid

The talk closes with a useful set of anti-patterns. Each is worth expanding because they show up constantly in real teams.

1. Chasing Generic Metrics

When teams rely on standard text overlap scores or shallow pass rates, they confuse measurement with understanding.

Generic metrics are attractive because they are easy to compute and easy to put on dashboards. But if they do not reflect true product risk, they create false confidence.

A better approach is to derive metrics from your taxonomy:

  • tool-selection accuracy

  • retrieval grounding rate

  • policy compliance pass/fail

  • successful recovery after a failed tool call

  • clarification rate for ambiguous requests

These metrics are more work to define, but they map to actual behavior.
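Two of these metrics can be sketched directly from labeled traces. The example assumes reviewers have recorded, per trace, which tool should have been used and whether the agent recovered after a tool failure; the `LabeledTrace` fields and tool names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LabeledTrace:
    expected_tool: Optional[str]  # tool the reviewer says was appropriate (None = no tool needed)
    called_tool: Optional[str]    # tool the agent actually called
    tool_failed: bool = False     # did the tool call return an error?
    recovered: bool = False       # if it failed, did the agent recover acceptably?

def tool_selection_accuracy(traces: list[LabeledTrace]) -> Optional[float]:
    """Share of tool-requiring traces where the agent picked the expected tool."""
    relevant = [t for t in traces if t.expected_tool is not None]
    if not relevant:
        return None
    return sum(t.called_tool == t.expected_tool for t in relevant) / len(relevant)

def recovery_rate(traces: list[LabeledTrace]) -> Optional[float]:
    """Share of traces with a failed tool call where the agent recovered."""
    failed = [t for t in traces if t.tool_failed]
    if not failed:
        return None
    return sum(t.recovered for t in failed) / len(failed)

# Hypothetical labeled sample.
labeled = [
    LabeledTrace("lookup_order", "lookup_order"),
    LabeledTrace("lookup_order", "search_docs"),
    LabeledTrace("issue_refund", "issue_refund", tool_failed=True, recovered=True),
    LabeledTrace("search_docs", "search_docs", tool_failed=True, recovered=False),
    LabeledTrace(None, None),
]

print("tool-selection accuracy:", tool_selection_accuracy(labeled))
print("recovery rate:", recovery_rate(labeled))
```

Each metric has its own denominator: only traces that needed a tool count toward selection accuracy, and only traces with a failed call count toward recovery. Mixing them into one score would hide both.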

2. Outsourcing Annotation Too Early

External annotation can be useful at scale, but outsourcing too soon creates distance between the builders and the failures.

If your own team does not yet understand how the model behaves, external raters will not solve that problem. They may even make it worse by introducing inconsistent standards or stripping context from traces.

Internal review builds product intuition. It teaches engineers what breaks, what matters, and what users actually experience.

A sensible progression is:

  1. internal manual review

  2. internal taxonomy development

  3. internal rubric stabilization

  4. selective external support once quality standards are clear

That order preserves learning while still allowing scale later.

3. Over-Automating Before You Understand the Data

This may be the most common mistake in AI teams that are otherwise highly competent.

The instinct is understandable: automate first-pass labeling, cluster failures with an LLM, generate dashboards, and save human effort.

But if you automate before developing grounded understanding, your automation will mirror your confusion. Broad clusters hide root causes. Vague scores mask uncertainty. Synthetic labels become a layer of noise over already opaque systems.

Automation should accelerate known workflows, not replace first-principles learning.

A More Useful Mental Model: Evaluate Observed Failures, Not Imagined Ones

One of the strongest ideas in the talk is a rejection of "eval-driven development" when that means writing increasingly elaborate checks for hypothetical failures before you have seen them in data.

This is an important nuance.

Of course teams need proactive testing for obvious invariants, policies, and safety constraints. But beyond that, many eval suites become speculative. Engineers invent dozens of tests for edge cases they assume matter, while actual production failures emerge elsewhere.

The more durable strategy is:

  1. inspect real traces

  2. identify actual failure modes

  3. build tests and guards around those failures

  4. re-run after changes

  5. update the taxonomy as behavior evolves

This is closer to incident-informed reliability engineering than to classic benchmark construction.
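As an illustration of step 3, suppose trace review surfaced the failure "answered correctly but skipped mandatory verification step". A trace-level check for it might look like the sketch below. The tool names `verify_identity` and `issue_refund` and the recorded cases are hypothetical; in a real suite you would run the check over fresh traces produced by the current agent build.

```python
def verification_precedes_action(
    tool_names: list[str],
    verify: str = "verify_identity",
    action: str = "issue_refund",
) -> bool:
    """Pass if the action tool is never called before the verification tool."""
    verified = False
    for name in tool_names:
        if name == verify:
            verified = True
        elif name == action and not verified:
            return False
    return True

# Tool sequences taken from reviewed traces (invented here), with the label
# a human reviewer gave them. They pin down what the check must catch.
regression_cases = {
    "c-001 skipped verification": (["lookup_order", "issue_refund"], False),
    "c-014 verified first": (["verify_identity", "lookup_order", "issue_refund"], True),
    "c-020 no refund requested": (["lookup_order"], True),
}

for case, (tools, expected) in regression_cases.items():
    assert verification_precedes_action(tools) == expected, case
print("all regression cases behave as labeled")
```

Note that the check looks at the tool sequence, not the final answer: the original failure produced a correct answer, so an output-only test would never have caught it.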

For teams running agents in production, that approach has a major advantage: it keeps evaluation tethered to user reality.

What This Means for Production AI Teams

If you already operate LLM features in production, the talk suggests a concrete shift in practice.

For engineers

  • instrument traces deeply enough to reconstruct decisions

  • review real sessions regularly

  • create binary labels plus structured failure tags

  • build regression suites from observed failures

  • validate any automated judge against human-reviewed samples

For tech leads

  • make evaluation a planned activity, not spare-time work

  • define quality ownership clearly

  • ensure that tooling supports trace review, not just reporting

  • resist pressure to summarize everything into one score

For CTOs and heads of AI

  • treat manual eval capacity as part of reliability budget

  • ask for failure taxonomies, not just benchmark numbers

  • require post-change analysis for significant architecture updates

  • use eval maturity as a governance signal for launch readiness

In mature teams, evaluation becomes a feedback system connecting model behavior, product risk, and engineering decisions. That is a much stronger foundation than any standalone leaderboard metric.

A Practical Starting Blueprint

If your team wants to operationalize the advice from the talk, a simple first version could look like this:

Week 1: Establish visibility

  • capture full traces for a representative slice of production traffic

  • select 50+ conversations across common and risky scenarios

  • create a lightweight review interface if needed

Week 2: Manual review

  • label each trace pass/fail

  • add free-form notes about what happened

  • identify recurring error patterns

Week 3: Build taxonomy

  • consolidate notes into 5–10 failure categories

  • define examples for each category

  • appoint one owner to resolve labeling ambiguity

Week 4: Turn findings into evals

  • create targeted regression tests for top failure classes

  • add monitoring dimensions that reflect the taxonomy

  • decide where automation is justified and where humans stay in the loop

This is not a complete reliability program, but it is a strong starting point - and notably more grounded than beginning with generic scoring pipelines.
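For the Week 1 selection step, one reasonable approach is stratified sampling, so that rare but risky scenarios are not drowned out by common ones. This is our suggestion rather than a method from the talk; the scenario names, the `per_group` size, and the conversation records are invented for illustration.

```python
import random
from collections import defaultdict
from typing import Callable

def stratified_sample(
    conversations: list[dict],
    key: Callable[[dict], str],
    per_group: int,
    seed: int = 0,
) -> list[dict]:
    """Pick up to per_group conversations from each group, reproducibly."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for conv in conversations:
        groups[key(conv)].append(conv)
    rng = random.Random(seed)  # fixed seed so the review set can be regenerated
    sample: list[dict] = []
    for name in sorted(groups):
        members = groups[name]
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample

# Hypothetical traffic: 200 conversations spread over 5 scenarios.
scenarios = ["account", "billing", "other", "refund", "shipping"]
conversations = [{"id": i, "scenario": scenarios[i % 5]} for i in range(200)]

sample = stratified_sample(conversations, key=lambda c: c["scenario"], per_group=10)
print(len(sample), "conversations selected for review")
```

Ten per scenario across five scenarios gives the 50-conversation minimum; teams with a clearly riskier scenario may prefer to oversample it.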

Conclusion

The central lesson from the video is not that automation is bad or metrics are useless. It is that agent reliability begins with direct observation.

If you want to evaluate LLM agents well, you have to inspect how they behave in the wild: the steps they take, the tools they use, the mistakes they repeat, and the conditions under which they break. From there, you build a taxonomy, create targeted tests, and automate only what you actually understand.

That is slower than chasing a single score. It is also far more likely to prevent regressions, catch real failures, and produce systems your users can trust.

For production AI teams, that trade-off is usually worth making.

Source: "LLM Evaluation in Practice: Error Analysis and Reliable Agent Testing" - deepsense, YouTube, Apr 16, 2026 - https://www.youtube.com/watch?v=VWrPtb5eWH4

Build reliable AI.

Latitude Data S.L. 2026

All rights reserved.
