
AI Evaluation for ML Engineers: Production-Based Eval Methodology

Production-based eval methodology for ML engineers: LLM-as-judge calibration with MCC, GEPA eval generation, and statistical frameworks for measuring AI quality in production.

By César Miguelañez · Latitude · April 9, 2026

Key Takeaways

  • Eval design starts with failure mode taxonomy, not metric selection. Choosing metrics before understanding what can go wrong produces evals that measure the wrong things precisely.

  • LLM-as-judge evals require calibration against human annotations before being deployed as quality gates. MCC is the correct alignment metric; accuracy is misleading under the class imbalance typical of production data.

  • Sampling strategy is a first-class design decision: 100% for safety/critical checks, 10–30% for semantic quality, 5% for secondary dimensions. Wrong sampling rates waste compute or miss critical failures.

  • The offline/online split is necessary, but the connection between the two is what matters: online failures should automatically feed the offline eval dataset via the annotation pipeline.

  • GEPA removes the manual bottleneck in eval generation: annotation effort compounds into an automatically growing eval library instead of requiring an engineer to author each test case.

For ML engineers, AI evaluation is both a methodology problem and a data engineering problem. The methodology determines which failures you can detect and how reliably. The data engineering determines whether the evaluation system is maintainable at scale without requiring constant manual intervention.

This guide covers both — the eval design methodology for production AI systems and the data pipeline architecture that makes it sustainable.

Failure Mode Taxonomy First

The most common mistake in eval design: choosing evaluation metrics before understanding the failure mode profile. Teams that start with "we'll use BLEU score" or "we'll use an LLM judge for quality" before analyzing what actually goes wrong in their production system end up measuring the wrong things precisely.

Start with a failure mode taxonomy. Review 50–100 production traces. Identify and name every distinct category of thing that goes wrong. Common categories for agent systems:

  • Tool selection errors: Agent calls wrong tool, fails to call tool when needed

  • Tool response misinterpretation: Correct tool called, response interpreted incorrectly

  • Constraint violation: Agent violates an explicit constraint established earlier in the session

  • Goal drift: Agent pursues a subtask in a way that contradicts the original user goal

  • Hallucination: Agent asserts facts not present in context, memory, or retrieved data

  • Scope violation: Agent responds to out-of-scope requests rather than declining appropriately

  • Context loss: Agent forgets context from early turns in long sessions

  • Premature termination: Agent concludes the session before the user's goal is accomplished

Your taxonomy will differ from this. The point is to have it before choosing eval types — because the right eval type depends on the failure mode structure.
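To make the taxonomy operational rather than a document that goes stale, tag each reviewed trace with its categories and tally the distribution. A minimal sketch, assuming reviewers attach a failure_modes list to each trace; the category names are examples, not a prescribed schema:

from collections import Counter

# Example taxonomy: replace with the categories from your own trace review
FAILURE_MODES = [
    "tool_selection_error",
    "tool_response_misinterpretation",
    "constraint_violation",
    "goal_drift",
    "hallucination",
    "scope_violation",
    "context_loss",
    "premature_termination",
]

def build_taxonomy_profile(reviewed_traces: list[dict]) -> Counter:
    """Tally failure modes across manually reviewed traces.

    Assumes each reviewed trace carries a `failure_modes` list
    assigned by the reviewer (empty for nominal traces).
    """
    counts = Counter()
    for trace in reviewed_traces:
        for mode in trace.get("failure_modes", []):
            assert mode in FAILURE_MODES, f"unknown failure mode: {mode}"
            counts[mode] += 1
    return counts

The resulting distribution tells you where eval coverage matters most and, per the table below, which eval type each category needs.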

Eval Type Selection by Failure Mode

| Failure Mode | Eval Type | Why |
| --- | --- | --- |
| Tool selection (wrong tool) | Rule-based | Ground truth exists: was the correct tool called? Deterministic check. |
| Tool call format errors | Rule-based | Schema validation. Deterministic. |
| Prohibited content patterns | Rule-based + LLM judge | Explicit patterns: rule-based. Nuanced violations: LLM judge. |
| Tool response misinterpretation | LLM judge | Requires comparing tool response to agent's subsequent claims — semantic analysis. |
| Constraint violation | LLM judge | Requires tracking constraint from earlier turn to later turn — semantic understanding. |
| Hallucination | LLM judge + human annotation | LLM judge for scalable detection; human annotation for calibration and high-severity cases. |
| Goal drift | LLM judge (session-level) | Requires understanding original goal and whether final state achieved it — session-level. |
| Context loss | LLM judge + session segmentation | Compare early-turn constraints to late-turn behavior — requires full session trace. |
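For the rule-based rows, the eval is a deterministic function over the trace. A minimal sketch of the first two rows, assuming tool calls are logged as {"name": ..., "arguments": ...} dicts and using the jsonschema package for format checks; your trace schema will differ:

from jsonschema import ValidationError, validate

def eval_tool_selection(trace: dict, expected_tool: str) -> bool:
    """Rule-based: was the expected tool the first one called?"""
    calls = trace.get("tool_calls", [])
    return bool(calls) and calls[0]["name"] == expected_tool

def eval_tool_call_format(trace: dict, schemas: dict[str, dict]) -> bool:
    """Rule-based: do all tool call arguments validate against their JSON schema?"""
    for call in trace.get("tool_calls", []):
        schema = schemas.get(call["name"])
        if schema is None:
            return False  # Call to a tool with no registered schema
        try:
            validate(instance=call["arguments"], schema=schema)
        except ValidationError:
            return False
    return True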

LLM Judge Design and Calibration

Judge prompt design principles

An LLM judge is a prompt-engineered evaluator, so its reliability is bounded by the quality of its prompt. Common failure modes in judge design, with a prompt sketch applying the mitigations after the list:

  • Sycophancy: The judge tends to rate responses positively if they're confident and well-written, regardless of accuracy. Mitigation: include explicit instructions not to reward confidence, and include examples of confident but incorrect responses that should fail.

  • Position bias: The judge prefers responses earlier or later in the context when asked to compare. For comparative evals, randomize position and take the average across both orderings.

  • Length bias: The judge rewards longer responses. Include explicit instructions that length is not a quality criterion and include examples of short correct responses outscoring long incorrect ones.

  • Vagueness in criteria: "Was the response high quality?" produces inconsistent verdicts. Replace with specific criteria: "Did the agent correctly identify the user's primary goal by turn 3?"
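A minimal sketch of a judge prompt encoding these mitigations, plus a position-debiasing helper for comparative evals. The criteria, wording, and judge_fn contract are illustrative assumptions, not a prescribed template:

# Illustrative template, intended for use with str.format(trace=...).
# The criteria are examples; replace them with criteria from your taxonomy.
JUDGE_PROMPT = """You are evaluating an AI agent's session against specific criteria.

Criteria (answer each PASS or FAIL):
1. Did the agent correctly identify the user's primary goal by turn 3?
2. Is every factual claim supported by a tool response or retrieved context?

Rules:
- Confidence and polish are NOT evidence of correctness. A confident,
  well-written response containing an unsupported claim FAILS criterion 2.
- Length is NOT a quality criterion. A short correct response outranks
  a long incorrect one.

Session trace:
{trace}

Return JSON: {{"criterion_1": "PASS"|"FAIL", "criterion_2": "PASS"|"FAIL", "rationale": "..."}}
"""

def compare_debiased(judge_fn, response_a: str, response_b: str) -> float:
    """Average over both orderings to cancel position bias.

    Assumes judge_fn(first, second) returns P(first is better) in [0, 1].
    """
    return (judge_fn(response_a, response_b)
            + (1 - judge_fn(response_b, response_a))) / 2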

MCC calibration against human annotations

import numpy as np
from sklearn.metrics import matthews_corrcoef

def calibrate_judge(judge_fn, annotated_dataset, thresholds=None):
    """
    Find optimal threshold and report MCC for LLM judge vs. human annotations.

    Args:
        judge_fn: callable(trace) -> float (0-1 score)
        annotated_dataset: list of {trace: dict, label: int} (1=pass, 0=fail)
        thresholds: list of thresholds to test (default: 0.3 to 0.8 in 0.05 steps)
    """
    if thresholds is None:
        thresholds = np.arange(0.3, 0.85, 0.05)

    judge_scores = [judge_fn(d["trace"]) for d in annotated_dataset]
    human_labels = [d["label"] for d in annotated_dataset]

    results = []
    for threshold in thresholds:
        judge_labels = [1 if s >= threshold else 0 for s in judge_scores]
        mcc = matthews_corrcoef(human_labels, judge_labels)

        # Confusion matrix
        tp = sum(j==1 and h==1 for j,h in zip(judge_labels, human_labels))
        tn = sum(j==0 and h==0 for j,h in zip(judge_labels, human_labels))
        fp = sum(j==1 and h==0 for j,h in zip(judge_labels, human_labels))
        fn = sum(j==0 and h==1 for j,h in zip(judge_labels, human_labels))

        results.append({
            "threshold": threshold,
            "mcc": mcc,
            "fnr": fn / (fn + tp) if (fn + tp) > 0 else 0,  # False negative rate
            "fpr": fp / (fp + tn) if (fp + tn) > 0 else 0,  # False positive rate
        })

    best = max(results, key=lambda x: x["mcc"])
    class_balance = sum(human_labels) / len(human_labels)

    return {
        "best_threshold": best["threshold"],
        "best_mcc": best["mcc"],
        "best_fnr": best["fnr"],
        "best_fpr": best["fpr"],
        "class_balance": class_balance,
        "sample_size": len(annotated_dataset),
        "deployment_recommendation": (
            "deploy as gate" if best["mcc"] >= 0.6
            else "monitor only" if best["mcc"] >= 0.4
            else "do not deploy — insufficient alignment with human judgment"
        ),
        "all_thresholds": results
    }
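Hypothetical usage, where my_judge wraps an LLM call returning a 0-1 score and load_annotations is a placeholder for your own loader:

# Both my_judge and load_annotations are placeholders for your own code.
annotated = [{"trace": t, "label": l} for t, l in load_annotations()]
report = calibrate_judge(my_judge, annotated)

print(f"MCC {report['best_mcc']:.2f} at threshold {report['best_threshold']:.2f} "
      f"(FNR {report['best_fnr']:.1%}, n={report['sample_size']})")
print(report["deployment_recommendation"])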

Minimum annotation sample size

MCC estimates are unstable with small samples. Minimum sample sizes by expected class balance:

  • 80/20 (pass/fail) split: 100 annotations minimum; 200 for reliable MCC estimate

  • 90/10 split: 200 annotations minimum; 500 for reliable estimate

  • 95/5 split: 500 annotations minimum — consider oversampling failures if available

If you don't have enough annotations to calibrate reliably, use the judge in monitoring-only mode (no deployment blocking) until the annotation dataset is large enough.

The Offline/Online Architecture

Offline eval dataset management

The offline eval dataset should contain:

  • Production sessions that represent the failure modes in your taxonomy (typically 20–50 examples per failure mode)

  • Production sessions that represent nominal good performance (to catch false positive regressions)

  • Adversarial cases for high-severity failure modes (edge cases that should fail)

Dataset hygiene: review the dataset quarterly. Remove sessions that are no longer representative of current production patterns. Add sessions from recent production failures. Maintain roughly 70% production-sampled / 30% curated adversarial cases.
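A quarterly composition check keeps those targets enforceable. A sketch, assuming each dataset entry records a failure_mode and a source field of either "production" or "adversarial"; the field names are assumptions:

from collections import Counter

def audit_eval_dataset(dataset: list[dict]) -> list[str]:
    """Flag deviations from the hygiene targets above."""
    if not dataset:
        return ["dataset is empty"]
    warnings = []
    by_mode = Counter(d["failure_mode"] for d in dataset)
    for mode, n in sorted(by_mode.items()):
        if not 20 <= n <= 50:
            warnings.append(f"{mode}: {n} examples (target 20-50)")
    prod_share = sum(d["source"] == "production" for d in dataset) / len(dataset)
    if not 0.6 <= prod_share <= 0.8:
        warnings.append(f"production share {prod_share:.0%} (target ~70%)")
    return warnings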

Online monitoring architecture

from dataclasses import dataclass
from collections import deque
from scipy import stats

@dataclass
class QualityBaseline:
    failure_mode: str
    scores: deque  # Rolling window of quality scores
    window_size: int = 1000

    def add_score(self, score: float):
        self.scores.append(score)
        if len(self.scores) > self.window_size:
            self.scores.popleft()

    def get_baseline_stats(self) -> dict:
        scores_list = list(self.scores)
        return {
            "mean": sum(scores_list) / len(scores_list),
            "n": len(scores_list)
        }


def detect_online_regression(
    baseline: QualityBaseline,
    recent_scores: list[float],
    alpha: float = 0.01,
    min_effect_size: float = 0.05
) -> dict:
    """
    Detect statistically significant quality regression in recent production window.
    Uses Welch's t-test with minimum effect size filter to reduce noise.
    """
    baseline_scores = list(baseline.scores)

    if len(baseline_scores) < 50 or len(recent_scores) < 50:
        return {"sufficient_data": False}

    t_stat, p_value = stats.ttest_ind(
        baseline_scores, recent_scores,
        equal_var=False
    )

    baseline_mean = sum(baseline_scores) / len(baseline_scores)
    recent_mean = sum(recent_scores) / len(recent_scores)
    effect_size = (recent_mean - baseline_mean) / baseline_mean

    regression = (
        p_value < alpha and
        effect_size < -min_effect_size  # Negative = quality drop
    )

    return {
        "sufficient_data": True,
        "regression_detected": regression,
        "baseline_mean": baseline_mean,
        "recent_mean": recent_mean,
        "effect_size_pct": effect_size * 100,
        "p_value": p_value,
        "failure_mode": baseline.failure_mode
    }
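Wiring the pieces together might look like the following; historical_scores, fetch_recent_scores, and alert are placeholders for your own trace store and alerting hook:

from collections import deque

# One baseline per failure mode; seed it with recent judge scores.
baseline = QualityBaseline(failure_mode="hallucination", scores=deque())
for score in historical_scores:  # placeholder: e.g., last week's judge scores
    baseline.add_score(score)

recent = fetch_recent_scores("hallucination", hours=1)  # placeholder
result = detect_online_regression(baseline, recent)
if result.get("regression_detected"):
    alert(  # placeholder alerting hook
        f"{result['failure_mode']}: {result['effect_size_pct']:.1f}% drop "
        f"(p={result['p_value']:.4f})"
    )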

Frequently Asked Questions

How do ML engineers design evaluations for production AI systems?

Production AI evaluation design for ML engineers requires: (1) Starting with failure mode taxonomy — categorize what can go wrong in your specific system before writing any eval. (2) Choosing eval type by failure category — rule-based for structural requirements, LLM-as-judge for semantic quality. (3) Calibrating LLM judges against human annotations using MCC before deploying them as quality gates. (4) Setting sampling rates per eval — 100% for critical safety checks, 10–30% for semantic quality evals. (5) Building a measurement loop — eval pass rates are tracked as time series, and regressions trigger annotation queue prioritization for the next cycle.

Why is MCC better than accuracy for evaluating LLM judges?

Matthews Correlation Coefficient (MCC) is better than accuracy for evaluating LLM judges because it handles class imbalance correctly. In production AI datasets, passing cases outnumber failing cases by a large margin, often 90:10 or more. In this regime, a judge that always returns "pass" achieves 90% accuracy. MCC penalizes this: a judge that always returns the same label gets MCC = 0. MCC balances all four cells of the confusion matrix (TP, TN, FP, FN), unlike precision, recall, or F1, which ignore true negatives entirely, and unlike accuracy, which is dominated by the majority class. For production AI evaluation, where false negatives typically cost more than false positives, also track the false negative rate alongside MCC, as the calibration code above does.
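For reference, MCC is computed from the four confusion matrix cells:

\[
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)\,(TP+FN)\,(TN+FP)\,(TN+FN)}}
\]

It ranges from -1 to +1, with 0 meaning no better than chance. A constant predictor zeroes one of the denominator factors; by convention (and in sklearn's matthews_corrcoef) the score is then 0.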

What is the difference between offline and online AI evaluation?

Offline evaluation runs on a fixed dataset before deployment — your CI eval suite. It's fast, reproducible, and provides pre-deployment quality assurance, but only covers failure modes in your eval dataset. Online evaluation runs on live production traffic after deployment. It catches novel failure modes not in the offline dataset, but results arrive after deployment. The correct approach uses both: offline evals as the pre-deployment gate, online evaluation as the post-deployment monitor. The connection between the two is the annotation-to-eval loop: novel failures caught online are annotated, which generates new offline evals, extending pre-deployment protection.

Latitude implements the full evaluation methodology described here — failure mode tracking, annotation queues, GEPA eval generation, MCC quality measurement, and both offline and online evaluation in a connected pipeline. Start for free →

