How Latitude AI Evaluations Work: GEPA and Production-Based Testing

APRIL 10, 2026

By Latitude · April 9, 2026

What This Article Covers

How GEPA generates evaluations from annotated production failure modes
The two eval types Latitude supports and when each is used
How eval quality is measured using MCC
Eval suite coverage — what it is and why it matters
How evals connect to issue tracking and the reliability loop
CI/CD integration and deployment gating

Most AI evaluation platforms give you tools to author tests. Latitude gives you a system that generates tests automatically from what’s actually going wrong in production — and measures whether those tests are any good.

This article explains how that system works.

The Core Principle: Evals from Production, Not from Imagination

The standard approach to AI evaluation is to author test cases before or shortly after launch — think through what might go wrong, write test cases for those scenarios, and run them before each deployment.

This works reasonably well early in a product’s lifecycle. It breaks down quickly as the product matures, because:

Production surfaces failure modes the team didn’t anticipate
The failure mode profile changes as the user base and use cases evolve
Manually maintaining test cases that reflect production reality is high-effort work that consistently loses priority

Latitude’s approach: instead of authoring tests from imagination, derive them from production. Every failure mode that’s been observed and annotated in production becomes a test that will catch it in future deployments. The eval suite grows automatically as the team annotates — which means it stays aligned with production reality without requiring manual curation.

How GEPA Works

GEPA (Generative Evaluation Pipeline Algorithm) is the mechanism that converts annotated failure modes into evaluators. The process has four stages:

Stage 1: Annotation accumulation

Domain experts review production traces through Latitude’s annotation queues. For each trace, they classify the output quality and, if it’s a failure, identify the failure mode category. This creates a labeled dataset: traces tagged as “good” or “bad” with failure mode categories for the bad ones.

The annotation queue surfaces traces most likely to contain failures — anomaly-flagged traces, high-complexity sessions, inputs that match known failure mode patterns — so annotation time is spent efficiently rather than reviewing random samples.
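The labeled dataset described above can be pictured as a list of small records. The sketch below is illustrative only: the `AnnotatedTrace` class, its field names, and the example failure mode names are assumptions made for this example, not Latitude's actual schema.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AnnotatedTrace:
    """One reviewed production trace (hypothetical schema, for illustration)."""
    trace_id: str
    verdict: str                        # "good" or "bad"
    failure_mode: Optional[str] = None  # set only when verdict == "bad"

    def __post_init__(self) -> None:
        if self.verdict not in ("good", "bad"):
            raise ValueError(f"unknown verdict: {self.verdict!r}")
        if self.verdict == "bad" and not self.failure_mode:
            raise ValueError("a 'bad' trace needs a failure mode category")
        if self.verdict == "good" and self.failure_mode:
            raise ValueError("a 'good' trace should not carry a failure mode")


def failure_mode_counts(traces: list[AnnotatedTrace]) -> Counter:
    """Count annotated failures per failure mode category."""
    return Counter(t.failure_mode for t in traces if t.verdict == "bad")


dataset = [
    AnnotatedTrace("t1", "good"),
    AnnotatedTrace("t2", "bad", "missing_required_field"),
    AnnotatedTrace("t3", "bad", "missing_required_field"),
    AnnotatedTrace("t4", "bad", "policy_hallucination"),
]
counts = failure_mode_counts(dataset)
```

Counting annotated failures per category is the piece that matters downstream, since each failure mode needs its own pool of examples.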

Stage 2: Pattern extraction

Once a failure mode has enough annotated examples (typically 10–20 well-annotated traces), GEPA analyzes the labeled dataset to extract the signal that distinguishes failing outputs from passing ones for that specific failure category.

The nature of the extracted pattern determines the eval type GEPA generates:

Deterministic structural patterns (the output always lacks a required field when it fails; the tool call always uses a wrong parameter type) → GEPA generates a rule-based eval
Semantic quality patterns (the agent interprets the user’s request in a way that diverges from their actual intent; the response states policies that weren’t in the system context) → GEPA generates an LLM-as-judge eval with a prompt designed to capture that specific quality dimension

Stage 3: Evaluator generation and validation

GEPA generates the evaluator — either a rule-based function or an LLM-as-judge prompt — and immediately validates it against the annotation dataset using Matthews Correlation Coefficient (MCC).

MCC measures the correlation between the evaluator’s verdicts and the human annotations that generated it. A high MCC (above 0.6) means the evaluator reliably identifies the same failures that humans flagged. A low MCC means the evaluator needs more annotation signal or a different approach, and Latitude surfaces it for review rather than deploying it as a blocking gate.
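As a rough illustration of this validation step, the sketch below computes MCC from paired human and evaluator verdicts using the standard confusion-matrix formula. The function name, the example verdicts, and the convention of returning 0 for a zero denominator are choices made for this example, not a description of Latitude's internals.

```python
import math


def mcc(human: list[bool], evaluator: list[bool]) -> float:
    """Matthews Correlation Coefficient for binary verdicts.

    True means "failure" in both lists. Returns 0.0 when the denominator
    is zero (a common convention for the degenerate case).
    """
    if len(human) != len(evaluator):
        raise ValueError("verdict lists must have the same length")
    pairs = list(zip(human, evaluator))
    tp = sum(h and e for h, e in pairs)              # both flag a failure
    tn = sum((not h) and (not e) for h, e in pairs)  # both pass the trace
    fp = sum((not h) and e for h, e in pairs)        # evaluator flags, human passes
    fn = sum(h and (not e) for h, e in pairs)        # evaluator misses a failure
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Eight annotated traces: three human-flagged failures, five passes.
human = [True, True, True, False, False, False, False, False]
evaluator = [True, True, False, False, False, False, False, True]

score = mcc(human, evaluator)     # (2*4 - 1*1) / 15
ready_for_blocking = score > 0.6  # threshold taken from the text above
```

Here the evaluator misses one failure and raises one false alarm, which is enough to pull the score below the 0.6 bar.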

Stage 4: Continuous refinement

GEPA doesn’t generate evaluators once and stop. As annotation volume grows, GEPA re-runs for each failure mode category, refining existing evaluators with the expanded dataset and generating new ones for newly identified failure mode categories. MCC is recalculated periodically to track whether each evaluator’s alignment with human judgment is improving or degrading.

This means the eval suite is a living artifact that reflects the team’s accumulated knowledge of production failure modes — not a static benchmark from launch day.

Eval Types in Latitude

Rule-based evals

Rule-based evals are deterministic checks: assertions, regex patterns, schema validation, or any function that takes a trace and returns a binary verdict without calling an LLM. They’re fast (milliseconds per trace), cheap (no API calls), and perfectly reproducible — the same input always produces the same verdict.

When to use rule-based evals:

Output format requirements (JSON schema, specific field presence)
Prohibited content patterns (specific phrases, content categories detectable via regex)
Tool call format validation (correct tool called, required parameters present)
Response length constraints
Any quality criterion that can be expressed as a deterministic check

Rule-based evals should run at 100% sampling rate for critical structural checks. They’re cheap enough that there’s no reason to sample them.
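A minimal sketch of what two such checks might look like, using only the Python standard library. The required field names and the prohibited phrase are hypothetical placeholders; a real rule would encode whatever the annotated failure mode actually requires.

```python
import json
import re

REQUIRED_FIELDS = ("answer", "sources")  # hypothetical output schema
PROHIBITED = re.compile(r"\bguaranteed refund\b", re.IGNORECASE)  # hypothetical phrase


def has_required_fields(output: str) -> bool:
    """Pass only if the output is a JSON object containing every required field."""
    try:
        payload = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and all(f in payload for f in REQUIRED_FIELDS)


def avoids_prohibited_phrases(output: str) -> bool:
    """Pass only if the prohibited pattern does not appear in the output."""
    return PROHIBITED.search(output) is None
```

Both functions are pure: they make no network calls and return the same verdict for the same input, which is what makes running them on every trace practical.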

LLM-as-judge evals

LLM-as-judge evals use a language model to assess quality dimensions that require semantic understanding. They’re more expensive, less deterministic, and require calibration — but they can assess things rule-based evals can’t: task completion, context coherence, response accuracy relative to retrieved context, tone alignment.

When to use LLM-as-judge evals:

Task completion assessment (did the agent accomplish the user’s goal?)
Factual accuracy (does the response correctly reflect the information in context?)
Constraint compliance (does the response respect constraints established in earlier turns?)
Tone and framing quality (any dimension that requires reading and understanding the response)

LLM-as-judge evals should be sampled (typically 10–30% of production sessions for most categories) unless the failure mode is critical enough to warrant 100% coverage. Sampling rates are configurable per evaluator in Latitude.
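The article does not say how Latitude selects sessions for sampling. One common approach, sketched here as an assumption, is to hash the session and evaluator identifiers so the decision is deterministic and each evaluator can carry its own rate. The function name and identifiers are hypothetical.

```python
import hashlib


def should_evaluate(session_id: str, evaluator_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of sessions for one evaluator."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(f"{evaluator_id}:{session_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < rate


# Roughly 20% of sessions are selected for this (hypothetical) evaluator.
selected = sum(
    should_evaluate(f"session-{i}", "tone-alignment", 0.2) for i in range(10_000)
)
```

Because the decision depends only on the two identifiers and the rate, re-processing the same session never flips it in or out of the sample.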

All LLM-as-judge evals have an associated MCC score. It must be above 0.4 before an eval is used as a deployment gate, and above 0.6 before it is used as a blocking gate. Evals with MCC below 0.4 are shown in the eval suite but don’t gate deployments: they stay in monitoring-only mode until more annotations improve their reliability.
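The thresholds above amount to a small decision rule. In the sketch below, the mode names are hypothetical labels, and treating a score of exactly 0.4 or 0.6 as falling in the lower band is an assumption, since the text only says "above" and "below".

```python
def gating_mode(mcc: float) -> str:
    """Map an evaluator's MCC to how it participates in deployment gating."""
    if mcc > 0.6:
        return "blocking_gate"
    if mcc > 0.4:
        return "deployment_gate"
    return "monitoring_only"


modes = {name: gating_mode(score) for name, score in {
    "tool_response_misinterpretation": 0.72,  # hypothetical evaluators and scores
    "tone_alignment": 0.51,
    "policy_hallucination": 0.33,
}.items()}
```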

Eval Quality: MCC and Why It Matters

An evaluator that doesn’t correlate with human judgment is worse than no evaluator — it provides false confidence and can block good deployments or miss bad ones.

Latitude measures eval quality using Matthews Correlation Coefficient (MCC), which has several properties that make it the right metric for this use case:

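Class imbalance robustness: In production AI datasets, passing cases far outnumber failing cases. MCC correctly penalizes an evaluator that always returns “pass”: on a dataset where 90% of traces pass, it scores 90% accuracy but an MCC of 0 (the value conventionally assigned when an evaluator predicts only one class).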
Symmetric treatment of both classes: MCC treats false positives (blocking good deployments) and false negatives (missing real failures) equally in the base metric. For specific evals where these costs are asymmetric, the confusion matrix decomposition shows FPR and FNR separately.
Interpretable scale: MCC ranges from -1 to +1. Above 0.6 = reliable. Between 0.4 and 0.6 = use with caution. Below 0.4 = not ready for deployment gating.
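For reference, the standard definition of MCC in terms of the confusion matrix is shown below, followed by a worked case of the class-imbalance property. The worked case treats "failure" as the positive class and assumes 100 annotated traces, 10 of them failures, scored by an evaluator that always returns "pass". These numbers are illustrative, not taken from Latitude.

```latex
\[
\mathrm{MCC} =
\frac{TP \cdot TN - FP \cdot FN}
     {\sqrt{(TP+FP)\,(TP+FN)\,(TN+FP)\,(TN+FN)}}
\]

% Always-"pass" evaluator on 100 traces with 10 real failures:
\[
TP = 0, \quad FP = 0, \quad TN = 90, \quad FN = 10
\]
\[
\text{accuracy} = \frac{TP + TN}{100} = 0.90,
\qquad
TP \cdot TN - FP \cdot FN = 0 \cdot 90 - 0 \cdot 10 = 0
\]
```

Because the evaluator never predicts a failure, the factor (TP + FP) is zero and the expression becomes 0/0. By the usual convention this is reported as MCC = 0, so the metric shows no correlation with human judgment even though accuracy reads 90%.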

MCC is displayed for every evaluator in Latitude’s eval suite dashboard and updated automatically as new annotations accumulate. When an evaluator’s MCC drops below threshold (indicating that new annotation data is revealing the evaluator was overfitted to the initial sample), Latitude surfaces it for review.

Eval Suite Coverage

Eval quality answers: “Is this evaluator reliable?” Eval suite coverage answers: “Are we protecting the right things?”

Coverage is defined as: what percentage of currently active, tracked failure modes have a corresponding evaluator with acceptable MCC? A coverage gap is a failure mode that can regress without being caught by the eval suite.

Latitude tracks coverage continuously and surfaces the gap to the team: “You have 14 active failure modes tracked in the issue dashboard. 9 have corresponding evaluators. Your coverage is 64%. The 5 uncovered failure modes are: [list with annotation volume for each].”
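The coverage figure in that example can be reproduced with a few lines. This is a sketch under assumptions: the function name, the data shapes, and the 0.4 cutoff for "acceptable MCC" are choices made for the example, since the article does not pin down the exact threshold used for coverage.

```python
def coverage(active_modes: list[str], evaluator_mcc: dict[str, float],
             min_mcc: float = 0.4) -> tuple[float, list[str]]:
    """Return (coverage fraction, uncovered failure modes).

    A failure mode counts as covered when it has an evaluator whose MCC
    is above `min_mcc` (an assumed cutoff).
    """
    if not active_modes:
        return 1.0, []  # nothing to cover
    uncovered = [m for m in active_modes
                 if evaluator_mcc.get(m, float("-inf")) <= min_mcc]
    covered = len(active_modes) - len(uncovered)
    return covered / len(active_modes), uncovered


# 14 active failure modes, 9 of which have a reliable evaluator.
active = [f"failure_mode_{i}" for i in range(1, 15)]
evaluators = {name: 0.7 for name in active[:9]}

fraction, gaps = coverage(active, evaluators)
percent = round(fraction * 100)
```

With 9 of 14 failure modes covered, the fraction is 9/14, which rounds to the 64% quoted above, and `gaps` lists the five modes that still need annotations.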

This metric doesn’t exist in any other AI evaluation platform. It turns eval suite maintenance from a vague responsibility (“we should write more tests”) into a specific, trackable metric with a clear path to improvement (annotate more examples of the uncovered failure modes).

Connection to Issue Tracking

Every evaluator in Latitude has a provenance story: it was generated from annotations of a specific tracked issue. When an eval fails in CI, Latitude shows you which issue it’s protecting — “This eval failure indicates a regression on Issue #14: Tool Response Misinterpretation in Billing Queries.”

When an issue is resolved and verified — post-deployment monitoring confirms the failure mode rate has decreased and the corresponding eval passes consistently — the issue moves to “verified” status. If the failure mode recurs in a future deployment, the issue status changes to “regressed” and the team is notified.

This connection makes the eval suite’s meaning auditable and the quality management process visible to the whole team — not just the engineers who wrote the evals.

CI/CD Integration

Latitude’s eval suite integrates into CI as a step that runs before deployment approval. The integration is available via API — any CI system that can make API calls can integrate with it.

The CI step:

Pulls the current eval dataset snapshot from Latitude
Runs the candidate model configuration against the dataset
Compares results to the baseline (previous deployment’s eval results)
Returns pass/fail with a breakdown by evaluator and failure mode category
Blocks the deployment if any high-MCC evaluator covering a critical failure mode regresses
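The final blocking decision in the steps above might look roughly like this. Everything here is a hypothetical sketch: the `EvaluatorResult` fields, the definition of "regresses" as a drop in pass rate relative to the baseline, and the 0.6 cutoff for "high-MCC" are assumptions, and no real Latitude API is called.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvaluatorResult:
    """Per-evaluator outcome of one CI eval run (hypothetical shape)."""
    name: str
    mcc: float
    critical: bool              # covers a critical failure mode
    baseline_pass_rate: float   # previous deployment
    candidate_pass_rate: float  # configuration under test


def blocking_regressions(results: list[EvaluatorResult],
                         min_mcc: float = 0.6,
                         tolerance: float = 0.0) -> list[str]:
    """Names of high-MCC, critical evaluators whose pass rate dropped."""
    return [
        r.name for r in results
        if r.critical
        and r.mcc > min_mcc
        and r.candidate_pass_rate < r.baseline_pass_rate - tolerance
    ]


results = [
    EvaluatorResult("billing_tool_misread", 0.74, True, 0.97, 0.91),
    EvaluatorResult("tone_alignment", 0.52, True, 0.90, 0.80),     # MCC too low to block
    EvaluatorResult("json_schema", 0.95, False, 1.00, 0.99),       # not critical
    EvaluatorResult("constraint_carryover", 0.81, True, 0.93, 0.95),
]
blocked_by = blocking_regressions(results)
exit_code = 1 if blocked_by else 0  # a CI step would exit with this code
```

Only the first evaluator blocks the deployment here: it is critical, its MCC is above the cutoff, and its pass rate fell against the baseline.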

Eval runs are cached — running the same model configuration against the same dataset snapshot returns the cached result, so CI runs stay fast as the eval suite grows.
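One way such caching could work, sketched as an assumption rather than a description of Latitude's implementation, is to derive a cache key from the model configuration and the dataset snapshot identifier. The function names, the in-memory dictionary, and the `run_evals` callback are all hypothetical.

```python
import hashlib
import json
from typing import Callable

_cache: dict[str, dict] = {}


def cache_key(model_config: dict, snapshot_id: str) -> str:
    """Stable key: identical config and snapshot always hash to the same value."""
    canonical = json.dumps(
        {"config": model_config, "snapshot": snapshot_id},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def run_evals_cached(model_config: dict, snapshot_id: str,
                     run_evals: Callable[[dict, str], dict]) -> dict:
    """Run the eval suite only if this (config, snapshot) pair is new."""
    key = cache_key(model_config, snapshot_id)
    if key not in _cache:
        _cache[key] = run_evals(model_config, snapshot_id)
    return _cache[key]
```

Serializing with sorted keys means two configurations that differ only in key order share a cache entry, while any change to a parameter or to the snapshot produces a new key and a fresh run.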

Frequently Asked Questions

What is GEPA and how does it generate evaluations?

GEPA (Generative Evaluation Pipeline Algorithm) is Latitude’s algorithm for automatically generating evaluations from human-annotated production failure modes. The process: domain experts annotate production traces in Latitude’s annotation queues, classifying failure modes and defining what “good” looks like. GEPA analyzes these annotations to extract the signal that distinguishes failing outputs from passing ones, then generates an evaluator — either rule-based or LLM-as-judge. The generated evaluator is validated using Matthews Correlation Coefficient (MCC) to confirm it aligns with human judgment. As annotation volume grows, GEPA refines existing evaluators and generates new ones for newly observed failure mode categories.

How is Latitude’s evaluation approach different from other platforms?

Four things differentiate Latitude’s evaluation approach: (1) Evals are generated from production failure modes, not authored from scratch — the eval suite reflects actual production risk. (2) Eval quality is measured and tracked using MCC — most platforms don’t measure eval quality at all. (3) Eval suite coverage is tracked — Latitude tracks what percentage of active failure modes have a corresponding evaluator; no other platform offers this. (4) The eval system is connected to issue tracking — each eval has a provenance story (which issue it’s protecting), making the eval suite auditable.

What types of evaluations does Latitude support?

Latitude supports two primary evaluation types:

Rule-based evals (assertions, regex, schema validation, deterministic checks) are fast, cheap, and perfectly reproducible. They are used for structural requirements: output format validation, prohibited content patterns, tool call format checking.
LLM-as-judge evals use a language model to assess semantic quality dimensions that can’t be captured with rules: task completion, response accuracy, tone alignment, context coherence.

Both types can be generated by GEPA from annotated production failure modes, or authored manually. All LLM-as-judge evals are calibrated against human annotations using MCC before being deployed as quality gates.

Latitude’s free plan includes 50M eval tokens per month, enough to run GEPA-generated evaluations on your production traffic and validate the workflow before committing to a paid plan.
