How to turn generic evaluation metrics into a useful starting point for AI Reliability.

César Miguelañez

With all the talk about AI reliability, it's common to hit a conundrum: you know quality matters, but you don't yet know which failures matter most. So you deploy a handful of broad evaluations, like toxicity, hallucination, and response length, and hope they catch the important stuff. Often you find that either you get no triggers because your prompt already passes them (models are constantly being trained to do better on these), or your product is failing in a million other ways that those default evals don't capture.
Generic evals are dogwater: drinkable, but they won't sustain your product beyond the beginning of its lifecycle. Failure modes are not universal standards; they are unique to the product. But the idea of a universal standard is not useless for AI reliability. That is the problem annotation queues propose to solve.
The problem with starting specific
Writing a precise evaluation from day one is a chicken-and-egg problem. A good eval needs examples of the failure it's trying to detect. What does a bad interaction actually look like in your system? You can't know that until you've been running in production long enough to observe it. And you can't observe it systematically without an evaluation.
Teams tend to get around this with guesswork. They write eval scripts based on what they think will go wrong, test against a handful of hand-crafted examples, ship, and never look again. Months later they discover the eval has been quietly passing traces that any human would flag. The real failure pattern looked nothing like what they imagined.
Generic evals as triage
This is where system queues come into play. Every project ships with default checks covering universal failure categories: jailbreaking, user frustration, lazy responses, forgotten context, tool call errors, and more. These aren't trying to be your final monitoring layer; system queues are a net that captures general failures and gets you started with annotation.
Think of it this way: annotation queues are the monitoring system, evaluations are the diagnosis. The system queues flag traces that might be sick. A human reviewer opens each one and decides what's actually wrong. That reviewer's annotation, a thumbs down and a short note like "agent confused product variants", is the real data. It's specific and grounded in a real interaction.
Annotations are then used to cluster similar failures together and surface them as named issues: "Agent misquotes return window," "Agent ignores uploaded documents," "Agent loops between tools without progress." From any issue, you can generate a monitoring evaluation: a script built from your annotated examples, optimized against human judgment before it touches live traffic. That's how you go from generic to specific without guessing.
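Latitude handles this clustering for you, but the mechanism is easy to picture. Here is a minimal sketch with a hypothetical `Annotation` record, using crude keyword overlap as a stand-in for the real semantic clustering a platform would do:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Annotation:
    trace_id: str
    verdict: str   # "thumbs_up" or "thumbs_down"
    note: str      # e.g. "agent confused product variants"

def cluster_annotations(annotations):
    """Group thumbs-down annotations whose notes share enough keywords.
    Keyword overlap is a toy heuristic; real clustering is semantic."""
    clusters = defaultdict(list)
    for a in annotations:
        if a.verdict != "thumbs_down":
            continue  # only failures become issues
        words = set(a.note.lower().split())
        key = None
        for existing in clusters:
            # Two shared words is enough to merge in this toy version.
            if len(words & set(existing.split())) >= 2:
                key = existing
                break
        clusters[key or a.note.lower()].append(a)
    return dict(clusters)
```

Each resulting cluster is a candidate "named issue"; its member annotations are the examples a generated eval would be built from.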

Why this works better than guessing
The annotation queue converts volume into signal. Thousands of traces your generic evals flag become a few dozen annotated examples of a specific failure. Without this bridge, you're writing evals in the dark and wasting time.
Consider a concrete example. You notice your agent sometimes gives wrong pricing, so you write an eval that checks responses against a pricing database. A reasonable approach, but in production the real failure is trickier: the agent quotes correct prices for the wrong product variant because it misreads which SKU the user is asking about. Your eval passes every time. The prices are technically correct.
With annotation queues, you'd catch this differently. The "Frustration" system queue flags traces where users correct the agent. Your reviewer annotates them: "agent confused product variants." Latitude clusters five similar annotations into an issue. You generate an eval that checks whether the agent identified the correct variant before quoting a price. That eval catches the real problem, because it was built from real examples of the real problem.
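The generated eval might look something like this sketch. The trace fields (`requested_sku`, `quoted_sku`) are hypothetical; a real eval would extract them from the conversation before comparing:

```python
def eval_variant_match(trace: dict) -> dict:
    """Pass only if the SKU the agent quoted matches the SKU the
    user asked about. Trace schema here is illustrative."""
    asked = trace.get("requested_sku")
    quoted = trace.get("quoted_sku")
    if asked is None or quoted is None:
        return {"pass": False, "reason": "could not extract SKUs"}
    if asked != quoted:
        return {"pass": False,
                "reason": f"quoted {quoted}, user asked about {asked}"}
    return {"pass": True, "reason": "variant matches"}
```

Note that this check says nothing about whether the price itself is right; it targets the specific failure your annotations surfaced, which the price-lookup eval missed.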
You couldn't have written it in advance. You didn't know the failure would look like variant confusion; you expected hallucinated features. Reality was different, as it usually is.
The alignment bonus
Every annotation you create while reviewing generic eval results also becomes calibration data for the specific evals you generate later.
When a human and an automated evaluation both score the same trace, platforms like Latitude can compute alignment metrics that tell you whether your eval agrees with human judgment. High alignment means the eval is trustworthy; low alignment means you need to add more annotations and let the optimizer run again.
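The simplest form of an alignment metric is plain agreement rate between human labels and eval verdicts. A sketch, assuming both are pass/fail booleans keyed by trace id:

```python
def alignment(human_scores: dict, eval_scores: dict):
    """Fraction of co-scored traces where the automated eval agrees
    with the human label. Returns None if there is no overlap."""
    shared = human_scores.keys() & eval_scores.keys()
    if not shared:
        return None
    agree = sum(human_scores[t] == eval_scores[t] for t in shared)
    return agree / len(shared)
```

Real platforms may use richer measures (precision/recall on failures, Cohen's kappa), but the idea is the same: the eval is only as good as its agreement with your reviewers.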
The work you did to discover the issue also validates the eval that monitors it. Nothing is wasted.
In practice
Start with the default system queues. Set sampling to 5–10% of traffic. Assign someone to review each queue for 15 minutes a day. Don't overthink annotations; just make sure the context for the failure is there. A thumbs down and "agent gave outdated pricing" is enough. What matters is that the failure mode is captured in the annotation.
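Sampling can be as simple as hashing the trace id, so the routing decision is deterministic across retries. A sketch (the function name and scheme are illustrative, not Latitude's API):

```python
import hashlib

def sample_for_review(trace_id: str, rate: float = 0.05) -> bool:
    """Deterministically route ~`rate` of traces to the annotation queue.
    Hashing the trace id keeps the decision stable for a given trace,
    unlike random.random(), which would flip between retries."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # uniform in [0, 1)
    return bucket < rate
```

Hash-based routing also makes it easy to raise the rate later: every trace sampled at 5% is still sampled at 10%, so queues grow consistently.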
After you’ve done a good number of annotations and understand some of the issues your product has, check your Issues page. There will be failure patterns you didn't anticipate, named and clustered from your team's annotations. Pick the one with the biggest impact on your customers’ experience and generate an evaluation. Watch its alignment metrics, add more annotations if alignment is low, and let the system re-optimize.
Repeat this cycle. Each iteration turns a vague worry into a precise, calibrated monitor. Over time your generic system queues matter less, not because you remove them, but because your specific evals now cover the failures that actually matter in your system.
Generic metrics are not the be-all and end-all, but treating them as annotation queues is a great starting point for designing reliable AI products.


