Learn a 4-step framework for LLM agent evaluation with manual trace review, binary scoring, common mistakes, and practical error analysis tips.

César Miguelañez

Shipping an LLM agent is easy compared with keeping one reliable.
Most production teams learn this the hard way. An agent answers a few demo prompts well, gains some tools, passes a growing test suite, and then fails in front of real users in ways nobody anticipated. The deeper problem is not just model quality. It is evaluation design.
The core argument in Matthew Kujava’s talk is simple and important: agent evaluation cannot be treated like a standard model benchmark. If your system plans across steps, calls tools, mutates state, and interacts with external systems, then judging only the final answer is not enough. You need to inspect behavior, not just outcomes.
For teams already running AI features in production, this is more than a methodological preference. It is a reliability practice. In this article, we’ll unpack that framework, explain why it matters, and add practical context for engineering teams trying to build evals that actually catch regressions.
Key Takeaways
Evaluate agents as systems, not as single responses. Review traces, tool calls, and intermediate steps - not only the final answer.
Start with manual error analysis before scaling automation. A team that has not looked closely at real failures will automate the wrong checks.
Use a simple pass/fail rubric for human review. Binary labels reduce ambiguity and speed up annotation.
Review a meaningful sample of real conversations. The video recommends a minimum of 50 full conversations with traces.
Build an error taxonomy from observed failures. Group recurring issues so your evals measure real risk, not imagined problems.
Treat evaluation time as engineering investment. Manual review is expensive, but production incidents are usually more expensive.
Be cautious with LLM-as-a-judge setups. They can be unstable, poorly calibrated, and biased toward certain answer styles.
Do not over-index on generic text similarity metrics. For agentic systems, they often miss the actual failure mode.
Assign a clear quality decision-maker. One domain expert should break labeling ties and keep standards consistent.
Automate after you understand the failures. Instrumentation and regression checks should follow observed patterns, not precede them.
The Real Evaluation Problem: Why Agents Are Different
Many teams still evaluate agents as if they were ordinary chat models: send in a prompt, compare the answer to an expected output, compute a score, move on.
That approach breaks down quickly in agentic systems.
An agent does more than generate text. It may:
decide whether to use a tool
choose among multiple tools
retrieve incomplete or noisy information
update memory or application state
perform multi-step reasoning
recover from earlier mistakes
produce a correct answer through a flawed process, or a flawed answer through a mostly correct process
This is why Kujava argues against treating agents like students taking a test. A student-style evaluation assumes the answer sheet is the main artifact. For agents, the process is part of the product.
That distinction matters in production. If an agent reaches the right answer by luck, using the wrong tool sequence or bad assumptions, it may still fail the next time inputs shift slightly. Conversely, a final-answer-only eval can mark a trace as a failure while hiding useful signal about which components are actually working.
For engineering leaders, this implies a change in mindset: agent eval is closer to systems debugging than benchmark scoring.
Why Final-Output Metrics Miss the Point
The talk strongly criticizes overreliance on standard similarity metrics such as BLEU or ROUGE. That criticism is well-founded for modern agent workflows.
These metrics were designed for narrow text comparison tasks. They can tell you whether two strings overlap. They cannot tell you whether an agent:
called the wrong API
failed to ground a claim in retrieved evidence
ignored user constraints
took an unnecessary detour
corrupted state
fabricated tool results
violated a business rule before landing on a plausible answer
For a production AI engineer, the practical issue is not that these metrics are mathematically bad. It is that they optimize for the wrong unit of analysis.
Even many sophisticated eval dashboards fall into a similar trap. They aggregate a pass rate or quality score while masking the shape of failures underneath. If your test suite reports 92% success, that number may feel reassuring. But if the remaining 8% includes data corruption, harmful escalation behavior, or silent retrieval misses, the average tells you little about operational risk.
A high top-line metric can coexist with severe operational risk.
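To make the point concrete, here is a minimal sketch that reports a pass rate alongside a severity breakdown of the failures. The failure tags, severity levels, and counts are invented for illustration and do not come from the talk.

```python
from collections import Counter

# Hypothetical review results: (passed, failure_tag, severity).
# Tags, severities, and counts are illustrative assumptions.
results = (
    [(True, None, None)] * 92
    + [(False, "verbose_response", "low")] * 5
    + [(False, "silent_retrieval_miss", "high")] * 2
    + [(False, "data_corruption", "critical")] * 1
)

# The single number a dashboard would show.
pass_rate = sum(1 for passed, _, _ in results if passed) / len(results)

# The same data, broken down by severity instead of averaged away.
failures_by_severity = Counter(sev for passed, _, sev in results if not passed)

print(f"pass rate: {pass_rate:.0%}")
for severity in ("critical", "high", "low"):
    print(f"{severity} failures: {failures_by_severity[severity]}")
```

The pass rate alone looks healthy, while the breakdown shows that three of the eight failures are the kind that cause incidents.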
The Four-Step Error Analysis Workflow
The most useful part of the talk is the concrete workflow for understanding agent behavior. It is deliberately manual at the start.
1. Review Real Conversations and Full Traces
The first step is to manually inspect a meaningful sample of interactions. The video suggests at least 50 full conversations, including traces and intermediate artifacts.
That means looking at:
prompts and user context
model outputs at each step
retrieval results
tool calls and arguments
tool return values
retries, loops, and dead ends
final response quality
This is the step many teams skip because it feels slow. But it is also the step that reveals how the system actually behaves.
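One way to make trace review practical is to store each conversation in a structure that keeps the intermediate artifacts together. The sketch below is an assumption about what such a record could look like; the class names, field names, and the `get_order` tool are hypothetical and not tied to any tracing library.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ToolCall:
    name: str
    arguments: dict
    result: Any = None
    error: Optional[str] = None


@dataclass
class Step:
    model_output: str
    retrieved_docs: list = field(default_factory=list)
    tool_calls: list = field(default_factory=list)


@dataclass
class Trace:
    conversation_id: str
    user_prompt: str
    steps: list
    final_response: str


def render(trace: Trace) -> str:
    """Flatten a trace into plain text a reviewer can read top to bottom."""
    lines = [f"[{trace.conversation_id}] user: {trace.user_prompt}"]
    for i, step in enumerate(trace.steps, start=1):
        lines.append(f"step {i} model: {step.model_output}")
        for doc in step.retrieved_docs:
            lines.append(f"  retrieved: {doc}")
        for call in step.tool_calls:
            outcome = f"error={call.error}" if call.error else f"result={call.result!r}"
            lines.append(f"  tool {call.name}({call.arguments}) -> {outcome}")
    lines.append(f"final: {trace.final_response}")
    return "\n".join(lines)


# Hypothetical example: the tool returns nothing, yet the answer sounds confident.
trace = Trace(
    conversation_id="c-001",
    user_prompt="Where is my order?",
    steps=[
        Step(
            model_output="Look up the order status.",
            tool_calls=[ToolCall("get_order", {"order_id": "A1"}, result=[])],
        )
    ],
    final_response="Your order shipped yesterday.",
)
print(render(trace))
```

Read as plain text, the empty tool result next to the confident final answer is easy to spot, which is exactly the kind of issue a final-answer check would miss.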
In production settings, trace review often uncovers problems that never show up in synthetic test prompts, such as:
latent prompt conflicts
hidden dependency failures
bad retrieval ranking under long-tail queries
brittle tool selection logic
context-window overload
subtle UX issues where the user’s need was clear but the agent solved the wrong problem
If your team has not spent time reading traces, your eval stack is probably less mature than it looks.
2. Use Open Coding to Annotate What Happened
After reviewing each conversation, write down what you observed.
This is not formal taxonomy work yet. It is lightweight annotation: what happened, was the outcome acceptable, and what sequence led there?
In qualitative research, this is often called open coding. In engineering terms, think of it as raw failure logging before abstraction.
Useful notes might include:
"answered correctly but skipped mandatory verification step"
"retrieval returned relevant doc, but model ignored it"
"tool argument malformed"
"hallucinated after empty search result"
"user intent ambiguous; recovery was weak"
"correct refusal, but explanation too vague"
At this stage, the goal is not elegance. It is fidelity. You are preserving the evidence needed to identify patterns later.
3. Group Failures into a Taxonomy
Once enough traces are reviewed, recurring patterns emerge. This is when you build a taxonomy.
A good taxonomy turns messy observations into categories your team can act on. For example:
Possible failure buckets
Tool use errors
wrong tool selected
right tool, wrong arguments
tool output misunderstood
Retrieval failures
no relevant documents found
relevant documents found but ignored
stale or conflicting evidence
Reasoning and planning failures
premature conclusion
missed step in workflow
failure to recover after a bad intermediate result
Instruction-following failures
violated formatting constraint
ignored policy rule
missed user preference
User experience failures
unnecessarily verbose response
low-confidence answer presented as certain
lack of clarification when intent was ambiguous
This taxonomy is the bridge between manual review and scalable evaluation. It gives you a language for discussing failures across engineering, product, and leadership.
More importantly, it lets you create targeted regression checks based on what the system actually gets wrong.
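Once reviewers tag traces with taxonomy labels, counting them is straightforward. The following sketch assumes tags are written as "bucket/subtype" strings; that naming scheme and the sample annotations are illustrative choices, not something prescribed in the talk.

```python
from collections import Counter

# Hypothetical annotations: tags were assigned by a human reviewer.
annotations = [
    {"trace_id": "t1", "passed": False, "tags": ["tool_use/wrong_arguments"]},
    {"trace_id": "t2", "passed": False, "tags": ["retrieval/relevant_doc_ignored"]},
    {"trace_id": "t3", "passed": True, "tags": []},
    {"trace_id": "t4", "passed": False, "tags": ["tool_use/wrong_tool", "reasoning/missed_step"]},
    {"trace_id": "t5", "passed": False, "tags": ["retrieval/relevant_doc_ignored"]},
]


def count_by_bucket(rows):
    """Count failures per top-level bucket (the part before the slash)."""
    return Counter(tag.split("/", 1)[0] for row in rows for tag in row["tags"])


def count_by_tag(rows):
    """Count failures per full tag."""
    return Counter(tag for row in rows for tag in row["tags"])


bucket_counts = count_by_bucket(annotations)
tag_counts = count_by_tag(annotations)
print(bucket_counts.most_common())
print(tag_counts.most_common(1))
```

The most frequent tags are natural candidates for the first targeted regression checks.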
4. Implement Fixes and Repeat After Major Changes
The last step is iteration.
After major changes - new tools, prompt redesigns, architecture updates, routing changes, retrieval tuning - you repeat the process. Not forever at the same intensity, but often enough to avoid drifting into false confidence.
This matters because agent systems are highly coupled. A seemingly isolated change can cause regressions elsewhere:
adding a new tool may alter tool selection behavior
changing prompts may affect refusal patterns
retrieval tuning may improve relevance for one domain while harming another
latency optimizations may reduce context available for planning
A mature eval culture treats every meaningful change as a potential redistribution of failures, not just an opportunity for improvement.
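A simple way to see a redistribution of failures is to compare per-tag failure rates across two review rounds. This is a sketch under the assumption that each reviewed trace is represented as a list of failure tags; the tag names and data are hypothetical.

```python
from collections import Counter


def tag_rates(reviews):
    """Share of reviewed traces carrying each failure tag."""
    counts = Counter(tag for tags in reviews for tag in set(tags))
    return {tag: n / len(reviews) for tag, n in counts.items()}


def rate_shifts(before, after):
    """Per-tag change in failure rate between two review rounds."""
    rb, ra = tag_rates(before), tag_rates(after)
    return {tag: ra.get(tag, 0.0) - rb.get(tag, 0.0) for tag in sorted(set(rb) | set(ra))}


# Hypothetical review rounds of 10 traces each; an empty list means "pass".
before = [["wrong_tool"], [], [], ["wrong_tool"], [], [], [], [], [], []]
after = [[], [], ["weak_refusal"], [], ["weak_refusal"], ["weak_refusal"], [], [], [], []]

shifts = rate_shifts(before, after)
for tag, delta in shifts.items():
    print(f"{tag}: {delta:+.0%}")
```

In this made-up data the change fixed tool selection but introduced a new refusal problem, something a single pass rate would report only as a small drop.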
The Uncomfortable Trade-Off: Manual Review Is Slow
The hardest truth in the talk is also the most operationally relevant: good agent evaluation takes time.
There is no escaping this. Manual trace review is expensive. It consumes engineering or domain-expert attention. It feels slower than writing another test harness or scorecard.
But the alternative is usually worse: shipping blind.
Kujava frames evaluation time as an investment, and that framing is especially useful for teams with production accountability. When an agent is customer-facing or workflow-critical, the cost comparison is not "manual review versus no cost." It is:
manual review now
versus outages, false answers, damaged trust, support burden, and emergency fixes later
For CTOs and heads of AI, this is a governance point as much as an engineering one. If reliability matters, evaluation needs explicit capacity allocation. It cannot survive as a side activity squeezed in after feature work.
A practical way to operationalize this is to define eval time in planning:
reserve reviewer hours in each sprint
require trace review before major launches
assign ownership for dataset maintenance and taxonomy updates
include post-change regression review as part of the definition of done
The exact percentage of time is not specified as a hard rule in the video, and it will vary by system maturity. But the broader message is clear: if nobody has time to inspect behavior, nobody really knows how reliable the agent is.
Why Binary Scoring Often Beats 1–5 Ratings
One of the most actionable recommendations in the talk is to simplify human grading.
Instead of using scales like 1–5 or 0.0–1.0, use a binary label:
satisfactory / not satisfactory
pass / fail
thumbs up / thumbs down
This advice may sound reductive, but it solves a real annotation problem. Continuous scales invite hesitation and inconsistency. Reviewers hide uncertainty in the middle:
"maybe this is a 3"
"perhaps 0.7"
"it’s not great, but not terrible"
That ambiguity slows review and weakens dataset quality.
Binary scoring forces a clearer question: Would this result be acceptable in production for this use case?
For teams building eval pipelines, binary labels also make downstream automation easier:
simpler consensus rules
cleaner trend tracking
clearer regression thresholds
less reviewer calibration overhead
This does not mean every nuance disappears. You can still preserve nuance in side notes and taxonomy tags. The binary label handles the decision; the tags capture the diagnosis.
That split is often more useful than a single blended score.
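The decision/diagnosis split can be sketched as a small review record plus a regression threshold. The field names and the 5-point `max_drop` threshold are assumptions chosen for illustration, not values from the talk.

```python
from dataclasses import dataclass, field


@dataclass
class Review:
    trace_id: str
    passed: bool                               # the decision
    tags: list = field(default_factory=list)   # the diagnosis
    note: str = ""                             # free-form nuance


def pass_rate(reviews):
    return sum(r.passed for r in reviews) / len(reviews)


def regressed(baseline, candidate, max_drop=0.05):
    """Flag a regression when the pass rate drops by more than max_drop."""
    return pass_rate(baseline) - pass_rate(candidate) > max_drop


# Hypothetical review rounds before and after a change.
baseline = [
    Review("t1", True),
    Review("t2", True),
    Review("t3", True),
    Review("t4", False, ["tool_use/wrong_tool"], "picked search instead of lookup"),
]
candidate = [
    Review("t1", True),
    Review("t2", False, ["reasoning/missed_step"], "skipped verification"),
    Review("t3", True),
    Review("t4", False, ["tool_use/wrong_tool"]),
]

print(pass_rate(baseline), pass_rate(candidate), regressed(baseline, candidate))
```

Because the label is a boolean, the threshold check needs no reviewer calibration; the tags and notes remain available for anyone investigating why the rate moved.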
The Limits of LLM-as-a-Judge
The talk also cautions against depending too heavily on LLM judges, especially early in the process.
This is a timely warning. Many teams are attracted to LLM-as-a-judge because it appears scalable: write a rubric, run a model over outputs, compute scores, monitor trends.
The problem is not that LLM judges are useless. The problem is that they can be unstable and poorly calibrated.
Common failure modes include:
different scores across repeated runs
overly strict grading at the top end
poor discrimination between clearly wrong and partially correct outputs
bias toward answers that resemble the judge model’s own style
weak handling of domain-specific quality criteria
failure to reason reliably about multi-step traces
The video also mentions affinity bias: judging outputs with a model that shares tendencies with the model being evaluated can distort results. That is a real operational concern. A judge may reward familiar phrasing or reasoning patterns rather than actual usefulness.
A balanced takeaway for production teams is this:
Use LLM judges after you understand your failure modes.
Use them for narrow tasks where rubrics are explicit.
Validate them against human labels before trusting them.
Avoid using them as a substitute for early-stage manual discovery.
They are best seen as force multipliers, not primary truth sources.
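Validating a judge against human labels can start with two numbers: raw agreement and Cohen's kappa, which corrects agreement for chance. The sketch below implements both by hand for binary labels; the label lists are invented, and which kappa value counts as "good enough" is a decision for your team, not something the talk specifies.

```python
def agreement_report(human, judge):
    """Compare binary labels (True = pass) from humans and an LLM judge."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    p_h, p_j = sum(human) / n, sum(judge) / n
    # Agreement expected if both raters labeled independently at their own pass rates.
    expected = p_h * p_j + (1 - p_h) * (1 - p_j)
    kappa = 1.0 if expected == 1 else (observed - expected) / (1 - expected)
    return {
        "agreement": observed,
        "kappa": kappa,
        "judge_passed_human_failed": sum(j and not h for h, j in zip(human, judge)),
        "judge_failed_human_passed": sum(h and not j for h, j in zip(human, judge)),
    }


# Hypothetical labels for the same 10 traces.
human = [True, True, True, False, False, True, False, True, True, False]
judge = [True, True, False, False, True, True, False, True, True, True]

report = agreement_report(human, judge)
print(report)
```

The two disagreement counts matter separately: a judge that passes traces humans failed hides regressions, while one that fails traces humans passed creates noise.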
Tooling: Use Enough Observability, Not Maximum Observability
Another practical point in the talk is tool choice.
There are now many observability and annotation platforms for LLM systems. They can be powerful, especially for teams that need collaboration, trace search, experiment tracking, and integrated eval workflows.
But more tooling is not always better.
If the platform is too heavy for your current stage, it can slow the team down or obscure what matters. For some teams, a lightweight custom UI may be more effective than a full platform.
That is a valuable reminder. The right question is not "What is the most advanced observability stack?" It is:
What setup lets our team quickly inspect traces, annotate failures, and learn from real usage?
For an early or mid-stage production team, the minimum useful toolset often includes:
trace capture for prompts, tool calls, and outputs
filtering by scenario or failure type
easy annotation workflow
linkage between traces and code/version changes
visibility into user context and system state where safe and allowed
If a simpler interface gets your reviewers to spend more time on actual analysis, it may be better than a richer platform nobody consistently uses.
The Case for a "Benevolent Dictator" in Eval
One of the more interesting recommendations is to appoint a single final decision-maker for quality judgments.
That may sound uncomfortably centralized, but the logic is strong. Multi-reviewer annotation often collapses into long debates over edge cases. If five engineers each apply slightly different quality standards, your labels become inconsistent and your metrics become noisy.
A designated domain expert can:
define what "good enough" means
resolve disagreements quickly
preserve consistency over time
keep the taxonomy aligned with product reality
This does not eliminate collaboration. It simply creates a tie-breaker and quality anchor.
For larger organizations, this role can evolve into a small evaluation council. But even then, one accountable owner should usually make final calls on rubric interpretation.
Reliability improves when standards are explicit and stable.
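As a rough sketch, the tie-breaker role can be encoded directly in the labeling pipeline: reviewer votes decide when there is a majority, and the designated owner decides when there is not. The function name and the majority rule are assumptions; a team could equally let the owner override any label.

```python
from collections import Counter


def resolve_label(votes, owner_vote=None):
    """Resolve binary reviewer votes (True = pass); the owner breaks ties."""
    counts = Counter(votes)
    if counts[True] != counts[False]:
        return counts[True] > counts[False]
    if owner_vote is None:
        raise ValueError("tie between reviewers: needs the quality owner's call")
    return owner_vote


print(resolve_label([True, True, False]))           # majority decides
print(resolve_label([True, False], owner_vote=False))  # owner breaks the tie
```

Raising an error on an unresolved tie, rather than picking a default, keeps ambiguous traces from silently entering the labeled dataset.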
Three Industry Anti-Patterns to Avoid
The talk closes with a useful set of anti-patterns. Each is worth expanding because they show up constantly in real teams.
1. Chasing Generic Metrics
When teams rely on standard text overlap scores or shallow pass rates, they confuse measurement with understanding.
Generic metrics are attractive because they are easy to compute and easy to put on dashboards. But if they do not reflect true product risk, they create false confidence.
A better approach is to derive metrics from your taxonomy:
tool-selection accuracy
retrieval grounding rate
policy compliance pass/fail
successful recovery after a failed tool call
clarification rate for ambiguous requests
These metrics are more work to define, but they map to actual behavior.
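Two of these metrics can be sketched from reviewed records as follows. The record fields (`expected_tool`, `called_tool`, `tool_failed`, `recovered`) are hypothetical and assume a reviewer has marked the expected tool and whether the agent recovered.

```python
# Hypothetical reviewed records; field names are illustrative assumptions.
records = [
    {"expected_tool": "search", "called_tool": "search", "tool_failed": False, "recovered": None},
    {"expected_tool": "search", "called_tool": "calculator", "tool_failed": False, "recovered": None},
    {"expected_tool": "refund", "called_tool": "refund", "tool_failed": True, "recovered": True},
    {"expected_tool": "refund", "called_tool": "refund", "tool_failed": True, "recovered": False},
]


def tool_selection_accuracy(rows):
    """Share of traces where the agent called the tool the reviewer expected."""
    return sum(r["called_tool"] == r["expected_tool"] for r in rows) / len(rows)


def recovery_rate(rows):
    """Share of failed tool calls after which the agent still recovered."""
    failed = [r for r in rows if r["tool_failed"]]
    if not failed:
        return None  # no failed calls observed, so the rate is undefined
    return sum(bool(r["recovered"]) for r in failed) / len(failed)


print(tool_selection_accuracy(records))
print(recovery_rate(records))
```

Returning `None` when no tool call failed avoids reporting a misleading 0% or 100% for a case the sample never exercised.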
2. Outsourcing Annotation Too Early
External annotation can be useful at scale, but outsourcing too soon creates distance between the builders and the failures.
If your own team does not yet understand how the model behaves, external raters will not solve that problem. They may even make it worse by introducing inconsistent standards or stripping context from traces.
Internal review builds product intuition. It teaches engineers what breaks, what matters, and what users actually experience.
A sensible progression is:
internal manual review
internal taxonomy development
internal rubric stabilization
selective external support once quality standards are clear
That order preserves learning while still allowing scale later.
3. Over-Automating Before You Understand the Data
This may be the most common mistake in AI teams that are otherwise highly competent.
The instinct is understandable: automate first-pass labeling, cluster failures with an LLM, generate dashboards, and save human effort.
But if you automate before developing grounded understanding, your automation will mirror your confusion. Broad clusters hide root causes. Vague scores mask uncertainty. Synthetic labels become a layer of noise over already opaque systems.
Automation should accelerate known workflows, not replace first-principles learning.
A More Useful Mental Model: Evaluate Observed Failures, Not Imagined Ones
One of the strongest ideas in the talk is a rejection of "eval-driven development" when that means writing increasingly elaborate checks for hypothetical failures before you have seen them in data.
This is an important nuance.
Of course teams need proactive testing for obvious invariants, policies, and safety constraints. But beyond that, many eval suites become speculative. Engineers invent dozens of tests for edge cases they assume matter, while actual production failures emerge elsewhere.
The more durable strategy is:
inspect real traces
identify actual failure modes
build tests and guards around those failures
re-run after changes
update the taxonomy as behavior evolves
This is closer to incident-informed reliability engineering than to classic benchmark construction.
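As an illustration, an observed failure such as "answered correctly but skipped mandatory verification step" can become a targeted check on the tool-call sequence. The tool names `verify_identity` and `issue_refund` are hypothetical, and the sketch assumes the trace exposes an ordered list of tool names.

```python
def verification_precedes_action(tool_calls, verify="verify_identity", action="issue_refund"):
    """Pass if every `action` call is preceded by at least one `verify` call."""
    verified = False
    for name in tool_calls:
        if name == verify:
            verified = True
        elif name == action and not verified:
            return False
    return True


# Hypothetical tool-call sequences taken from reviewed traces.
good = ["lookup_order", "verify_identity", "issue_refund"]
skipped = ["lookup_order", "issue_refund"]
too_late = ["issue_refund", "verify_identity"]

print(verification_precedes_action(good))
print(verification_precedes_action(skipped))
print(verification_precedes_action(too_late))
```

Note that this check ignores the final answer entirely: a trace can fail it even when the response to the user looks correct.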
For teams running agents in production, that approach has a major advantage: it keeps evaluation tethered to user reality.
What This Means for Production AI Teams
If you already operate LLM features in production, the talk suggests a concrete shift in practice.
For engineers
instrument traces deeply enough to reconstruct decisions
review real sessions regularly
create binary labels plus structured failure tags
build regression suites from observed failures
validate any automated judge against human-reviewed samples
For tech leads
make evaluation a planned activity, not spare-time work
define quality ownership clearly
ensure that tooling supports trace review, not just reporting
resist pressure to summarize everything into one score
For CTOs and heads of AI
treat manual eval capacity as part of reliability budget
ask for failure taxonomies, not just benchmark numbers
require post-change analysis for significant architecture updates
use eval maturity as a governance signal for launch readiness
In mature teams, evaluation becomes a feedback system connecting model behavior, product risk, and engineering decisions. That is a much stronger foundation than any standalone leaderboard metric.
A Practical Starting Blueprint
If your team wants to operationalize the advice from the talk, a simple first version could look like this:
Week 1: Establish visibility
capture full traces for a representative slice of production traffic
select 50+ conversations across common and risky scenarios
create a lightweight review interface if needed
Week 2: Manual review
label each trace pass/fail
add free-form notes about what happened
identify recurring error patterns
Week 3: Build taxonomy
consolidate notes into 5–10 failure categories
define examples for each category
appoint one owner to resolve labeling ambiguity
Week 4: Turn findings into evals
create targeted regression tests for top failure classes
add monitoring dimensions that reflect the taxonomy
decide where automation is justified and where humans stay in the loop
This is not a complete reliability program, but it is a strong starting point - and notably more grounded than beginning with generic scoring pipelines.
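The Week 1 selection step, picking 50+ conversations across common and risky scenarios, could be sketched as a stratified sample. The scenario names, the `scenario` field, and the per-scenario quota are assumptions; how you define scenarios depends on your product.

```python
import random
from collections import defaultdict


def stratified_sample(conversations, per_scenario, seed=0):
    """Sample up to `per_scenario` conversations from each scenario."""
    rng = random.Random(seed)  # fixed seed so the review set is reproducible
    by_scenario = defaultdict(list)
    for conv in conversations:
        by_scenario[conv["scenario"]].append(conv)
    sample = []
    for scenario in sorted(by_scenario):
        group = by_scenario[scenario]
        sample.extend(rng.sample(group, min(per_scenario, len(group))))
    return sample


# Hypothetical traffic: one common scenario, one less common, one rare but risky.
scenarios = ["billing"] * 200 + ["refund"] * 40 + ["account_deletion"] * 8
conversations = [{"id": i, "scenario": s} for i, s in enumerate(scenarios)]

sample = stratified_sample(conversations, per_scenario=25)
print(len(sample))
```

Sampling per scenario rather than uniformly keeps rare, high-risk flows in the review set even when they are a small fraction of traffic.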
Conclusion
The central lesson from the video is not that automation is bad or metrics are useless. It is that agent reliability begins with direct observation.
If you want to evaluate LLM agents well, you have to inspect how they behave in the wild: the steps they take, the tools they use, the mistakes they repeat, and the conditions under which they break. From there, you build a taxonomy, create targeted tests, and automate only what you actually understand.
That is slower than chasing a single score. It is also far more likely to prevent regressions, catch real failures, and produce systems your users can trust.
For production AI teams, that trade-off is usually worth making.
Source: "LLM Evaluation in Practice: Error Analysis and Reliable Agent Testing" - deepsense, YouTube, Apr 16, 2026 - https://www.youtube.com/watch?v=VWrPtb5eWH4



