AI Evaluation for ML Engineers: Production-Based Eval Methodology
AI evaluation for ML engineers: production-based eval methodology, LLM-as-judge calibration with MCC, GEPA eval generation, and statistical frameworks for measuring AI quality in production.
Eval design starts with failure mode taxonomy, not metric selection. Choosing metrics before understanding what can go wrong produces evals that measure the wrong things precisely.
LLM-as-judge evals require calibration against human annotations before being deployed as quality gates. MCC is the correct alignment metric — accuracy is misleading in the class-imbalanced production data regime.
Sampling strategy is a first-class design decision: 100% for safety/critical checks, 10–30% for semantic quality, 5% for secondary dimensions. Wrong sampling rates waste compute or miss critical failures.
The offline/online split is necessary but the connection between them is the key: online failures should automatically feed the offline eval dataset via the annotation pipeline.
GEPA closes the manual bottleneck in eval generation — annotation effort compounds into an automatically growing eval library rather than requiring an engineer to author each test case.
For ML engineers, AI evaluation is both a methodology problem and a data engineering problem. The methodology determines which failures you can detect and how reliably. The data engineering determines whether the evaluation system is maintainable at scale without requiring constant manual intervention.
This guide covers both — the eval design methodology for production AI systems and the data pipeline architecture that makes it sustainable.
Failure Mode Taxonomy First
The most common mistake in eval design: choosing evaluation metrics before understanding the failure mode profile. Teams that start with "we'll use BLEU score" or "we'll use an LLM judge for quality" before analyzing what actually goes wrong in their production system end up measuring the wrong things precisely.
Start with a failure mode taxonomy. Review 50–100 production traces. Identify and name every distinct category of thing that goes wrong. Common categories for agent systems:
Tool selection errors: Agent calls wrong tool, fails to call tool when needed
Constraint violation: Agent violates an explicit constraint established earlier in the session
Goal drift: Agent pursues a subtask in a way that contradicts the original user goal
Hallucination: Agent asserts facts not present in context, memory, or retrieved data
Scope violation: Agent responds to out-of-scope requests rather than declining appropriately
Context loss: Agent forgets context from early turns in long sessions
Premature termination: Agent concludes the session before the user's goal is accomplished
Your taxonomy will differ from this. The point is to have it before choosing eval types — because the right eval type depends on the failure mode structure.
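Once named, the taxonomy is worth encoding as data so trace reviews produce structured labels rather than free-text notes. A minimal sketch, using the example categories above (the `FailureMode` enum and `tag_trace` helper are illustrative, not a prescribed schema):

```python
from enum import Enum


class FailureMode(Enum):
    """Illustrative taxonomy — replace with your own system's failure modes."""
    TOOL_SELECTION = "tool_selection"
    CONSTRAINT_VIOLATION = "constraint_violation"
    GOAL_DRIFT = "goal_drift"
    HALLUCINATION = "hallucination"
    SCOPE_VIOLATION = "scope_violation"
    CONTEXT_LOSS = "context_loss"
    PREMATURE_TERMINATION = "premature_termination"


def tag_trace(trace_id: str, modes: list[FailureMode]) -> dict:
    """Attach failure-mode tags to a reviewed production trace."""
    return {"trace_id": trace_id, "failure_modes": [m.value for m in modes]}
```

Structured tags make the 50–100-trace review queryable: you can count occurrences per category and prioritize eval coverage by frequency and severity.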
Eval Type Selection by Failure Mode
| Failure Mode | Eval Type | Why |
| --- | --- | --- |
| Tool selection (wrong tool) | Rule-based | Ground truth exists: was the correct tool called? Deterministic check. |
| Constraint violation | LLM judge | Requires tracking a constraint from an earlier turn to a later turn — semantic understanding. |
| Hallucination | LLM judge + human annotation | LLM judge for scalable detection; human annotation for calibration and high-severity cases. |
| Goal drift | LLM judge (session-level) | Requires understanding the original goal and whether the final state achieved it — session-level. |
| Context loss | LLM judge + session segmentation | Compare early-turn constraints to late-turn behavior — requires full session trace. |
LLM Judge Design and Calibration
Judge prompt design principles
An LLM judge is a prompt-engineered evaluator. The quality of the judge depends on the prompt quality. Common failure modes in judge design:
Sycophancy: The judge tends to rate responses positively if they're confident and well-written, regardless of accuracy. Mitigation: include explicit instructions not to reward confidence, and include examples of confident but incorrect responses that should fail.
Position bias: The judge prefers responses earlier or later in the context when asked to compare. For comparative evals, randomize position and take the average across both orderings.
Length bias: The judge rewards longer responses. Include explicit instructions that length is not a quality criterion and include examples of short correct responses outscoring long incorrect ones.
Vagueness in criteria: "Was the response high quality?" produces inconsistent verdicts. Replace with specific criteria: "Did the agent correctly identify the user's primary goal by turn 3?"
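The position-bias mitigation above — score both orderings and average — can be sketched as a small wrapper. Here `judge_score_fn` is a hypothetical callable returning the judge's preference (0–1) for the first-shown response:

```python
def debias_compare(judge_score_fn, response_a: str, response_b: str, prompt: str) -> float:
    """
    Mitigate position bias in pairwise judging: run the comparison in both
    orderings and average. judge_score_fn(prompt, first, second) -> float
    in [0, 1], the judge's preference for `first` over `second`.
    Returns the position-debiased preference for response_a.
    """
    pref_a_first = judge_score_fn(prompt, response_a, response_b)        # A shown first
    pref_a_second = 1.0 - judge_score_fn(prompt, response_b, response_a) # A shown second
    return (pref_a_first + pref_a_second) / 2
```

A judge that inflates whichever response appears first by a constant amount gets that bias cancelled exactly by the averaging, leaving only the ordering-independent signal.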
MCC calibration against human annotations
```python
import numpy as np
from sklearn.metrics import matthews_corrcoef


def calibrate_judge(judge_fn, annotated_dataset, thresholds=None):
    """
    Find optimal threshold and report MCC for LLM judge vs. human annotations.

    Args:
        judge_fn: callable(trace) -> float (0-1 score)
        annotated_dataset: list of {trace: dict, label: int} (1=pass, 0=fail)
        thresholds: list of thresholds to test (default: 0.3 to 0.8 in 0.05 steps)
    """
    if thresholds is None:
        thresholds = np.arange(0.3, 0.85, 0.05)
    judge_scores = [judge_fn(d["trace"]) for d in annotated_dataset]
    human_labels = [d["label"] for d in annotated_dataset]

    results = []
    for threshold in thresholds:
        judge_labels = [1 if s >= threshold else 0 for s in judge_scores]
        mcc = matthews_corrcoef(human_labels, judge_labels)

        # Confusion matrix
        tp = sum(j == 1 and h == 1 for j, h in zip(judge_labels, human_labels))
        tn = sum(j == 0 and h == 0 for j, h in zip(judge_labels, human_labels))
        fp = sum(j == 1 and h == 0 for j, h in zip(judge_labels, human_labels))
        fn = sum(j == 0 and h == 1 for j, h in zip(judge_labels, human_labels))

        results.append({
            "threshold": threshold,
            "mcc": mcc,
            "fnr": fn / (fn + tp) if (fn + tp) > 0 else 0,  # False negative rate
            "fpr": fp / (fp + tn) if (fp + tn) > 0 else 0,  # False positive rate
        })

    best = max(results, key=lambda x: x["mcc"])
    class_balance = sum(human_labels) / len(human_labels)

    return {
        "best_threshold": best["threshold"],
        "best_mcc": best["mcc"],
        "best_fnr": best["fnr"],
        "best_fpr": best["fpr"],
        "class_balance": class_balance,
        "sample_size": len(annotated_dataset),
        "deployment_recommendation": (
            "deploy as gate" if best["mcc"] >= 0.6
            else "monitor only" if best["mcc"] >= 0.4
            else "do not deploy — insufficient alignment with human judgment"
        ),
        "all_thresholds": results,
    }
```
Minimum annotation sample size
MCC estimates are unstable with small samples. Minimum sample sizes by expected class balance:
90/10 split: 200 annotations minimum; 500 for reliable estimate
95/5 split: 500 annotations minimum — consider oversampling failures if available
If you don't have enough annotations to calibrate reliably, use the judge in monitoring-only mode (no deployment blocking) until the annotation dataset is large enough.
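One way to test whether the annotation set is "large enough" is to bootstrap a confidence interval on the MCC estimate itself: if the interval is wide, the calibration is unstable and the judge should stay in monitoring-only mode. A self-contained sketch with a hand-rolled MCC (so it runs without scikit-learn); the synthetic dataset below is illustrative:

```python
import math
import random


def mcc(y_true, y_pred):
    """Matthews correlation coefficient computed directly from the confusion matrix."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def bootstrap_mcc_ci(human_labels, judge_labels, n_boot=1000, seed=0):
    """95% bootstrap CI for MCC; a wide interval means too few annotations."""
    rng = random.Random(seed)
    n = len(human_labels)
    stats = sorted(
        mcc([human_labels[i] for i in idx], [judge_labels[i] for i in idx])
        for idx in ([rng.randrange(n) for _ in range(n)] for _ in range(n_boot))
    )
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot)]


# Synthetic 90/10 dataset: 200 annotations, judge mostly agreeing with humans
human = [1] * 180 + [0] * 20
judge = [1] * 170 + [0] * 10 + [0] * 16 + [1] * 4
lo, hi = bootstrap_mcc_ci(human, judge)
print(f"MCC 95% CI: [{lo:.2f}, {hi:.2f}]")
```

At 200 annotations with a 90/10 split, the interval is typically wide enough to show why the table above treats 200 as a minimum rather than a comfortable sample size.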
The Offline/Online Architecture
Offline eval dataset management
The offline eval dataset should contain:
Production sessions that represent the failure modes in your taxonomy (typically 20–50 examples per failure mode)
Production sessions that represent nominal good performance (to catch false positive regressions)
Adversarial cases for high-severity failure modes (edge cases that should fail)
Dataset hygiene: review the dataset quarterly. Remove sessions that are no longer representative of current production patterns. Add sessions from recent production failures. Maintain roughly 70% production-sampled / 30% curated adversarial cases.
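The quarterly hygiene review is easier to enforce if the composition check is automated. A minimal sketch, assuming each dataset item carries a `"source"` field (`"production"` or `"adversarial"` — an assumed schema, not a standard one):

```python
def dataset_mix_report(dataset: list[dict]) -> dict:
    """
    Check the offline eval dataset against the target composition
    (~70% production-sampled / ~30% curated adversarial).
    Assumes each item has a "source" field — an illustrative schema.
    """
    n = len(dataset)
    prod = sum(d["source"] == "production" for d in dataset)
    prod_frac = prod / n
    return {
        "total": n,
        "production_frac": prod_frac,
        "adversarial_frac": 1 - prod_frac,
        # Flag for review if the mix drifts outside a 60-80% production band
        "rebalance_needed": not (0.6 <= prod_frac <= 0.8),
    }
```

Running this in CI alongside the eval suite turns "review the dataset quarterly" from a calendar reminder into a check that fails when the mix drifts.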
How do ML engineers design evaluations for production AI systems?
Production AI evaluation design for ML engineers requires: (1) Starting with failure mode taxonomy — categorize what can go wrong in your specific system before writing any eval. (2) Choosing eval type by failure category — rule-based for structural requirements, LLM-as-judge for semantic quality. (3) Calibrating LLM judges against human annotations using MCC before deploying them as quality gates. (4) Setting sampling rates per eval — 100% for critical safety checks, 10–30% for semantic quality evals. (5) Building a measurement loop — eval pass rates are tracked as time series, and regressions trigger annotation queue prioritization for the next cycle.
Why is MCC better than accuracy for evaluating LLM judges?
Matthews Correlation Coefficient (MCC) is better than accuracy for evaluating LLM judges because it handles class imbalance correctly. In production AI datasets, passing cases outnumber failing cases by a large margin — often 90:10 or more. In this regime, a judge that always returns "pass" achieves 90% accuracy. MCC penalizes this: a judge that always returns the same label gets MCC = 0. MCC uses all four cells of the confusion matrix (TP, TN, FP, FN) symmetrically — the only binary classification metric that does so. For production AI evaluation where false negatives have higher cost than false positives, MCC can also be decomposed to analyze FN rate separately.
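The degenerate case described above takes only a few lines to demonstrate: an always-pass judge on a 90:10 dataset scores 90% accuracy but MCC of zero.

```python
import math

human = [1] * 90 + [0] * 10   # 90:10 class balance, as in production
always_pass = [1] * 100       # degenerate judge: everything passes

accuracy = sum(t == p for t, p in zip(human, always_pass)) / len(human)

tp = sum(t == 1 and p == 1 for t, p in zip(human, always_pass))
tn = sum(t == 0 and p == 0 for t, p in zip(human, always_pass))
fp = sum(t == 0 and p == 1 for t, p in zip(human, always_pass))
fn = sum(t == 1 and p == 0 for t, p in zip(human, always_pass))
denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
mcc_value = (tp * tn - fp * fn) / denom if denom else 0.0

print(accuracy)   # 0.9 — looks excellent
print(mcc_value)  # 0.0 — the judge carries no signal
```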
What is the difference between offline and online AI evaluation?
Offline evaluation runs on a fixed dataset before deployment — your CI eval suite. It's fast, reproducible, and provides pre-deployment quality assurance, but only covers failure modes in your eval dataset. Online evaluation runs on live production traffic after deployment. It catches novel failure modes not in the offline dataset, but results arrive after deployment. The correct approach uses both: offline evals as the pre-deployment gate, online evaluation as the post-deployment monitor. The connection between the two is the annotation-to-eval loop: novel failures caught online are annotated, which generates new offline evals, extending pre-deployment protection.
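The annotation-to-eval loop can be sketched as a single append step: a novel failure caught online becomes, after annotation, an offline eval case. The `annotate` callable below stands in for the human-in-the-loop annotation queue and its return shape is an assumption for illustration:

```python
def annotation_to_eval_loop(online_failure: dict, offline_dataset: list, annotate) -> list:
    """
    Online -> offline feedback loop: a novel production failure is annotated,
    then appended to the offline eval dataset so the next pre-deployment
    run covers it. `annotate` is an assumed human-in-the-loop callable
    returning {"label": int, "failure_mode": str}.
    """
    annotation = annotate(online_failure)
    offline_dataset.append({
        "trace": online_failure,
        "label": annotation["label"],
        "failure_mode": annotation["failure_mode"],
        "source": "production",  # counts toward the production-sampled share
    })
    return offline_dataset
```

Each pass through the loop shrinks the gap between what online monitoring can catch and what offline evals can prevent.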
Latitude implements the full evaluation methodology described here — failure mode tracking, annotation queues, GEPA eval generation, MCC quality measurement, and both offline and online evaluation in a connected pipeline. Start for free →