
How to Generate AI Evaluations from Real Production Data

Generate AI evaluations from real production data, not synthetic benchmarks. Step-by-step guide: instrumentation, annotation, GEPA auto-generation, and eval quality measurement.

César Miguelañez

By Latitude · April 9, 2026

Key Takeaways

  • Synthetic benchmarks test what you anticipated — production data captures what users actually do. The gap between them is where most production AI failures live.

  • Generating evals from production data requires four steps: instrument traces → annotate failure modes → auto-generate evals with GEPA → measure eval quality with MCC.

  • Human annotation is the rate-limiting step. The goal isn't to annotate everything — it's to annotate the right traces: the ones most likely to contain failure modes worth capturing.

  • GEPA converts each annotated failure pattern into a reusable evaluator that runs in CI before every deployment. The eval suite grows automatically as annotation volume grows.

  • Eval quality is measurable. MCC tracks how well each generated eval correlates with human judgment — giving you signal on which evals are reliable and which need refinement.

  • The end state: every known production failure pattern has a corresponding eval. New deployments can't regress on known failure modes without being caught before release.

Most AI eval pipelines start with a spreadsheet. A few dozen input-output pairs, hand-curated by the team, covering the failure modes they expected to see. This works until it doesn't — which is usually within the first few months of real production traffic.

Users do things the team never anticipated. Usage patterns drift. New failure modes appear that nobody imagined during development. The benchmark suite stays fixed while production reality moves on, and the gap between "passing evals" and "working in production" quietly widens.

The alternative is to generate evals from production data: to use the actual distribution of user behavior as the source of your test cases, and to build a loop where every failure mode that surfaces in production automatically becomes a test that will catch it pre-deployment next time.

This guide walks through how to build that loop — from instrumentation through annotation workflows to automatic eval generation — with implementation details and code examples throughout.

Why Synthetic Benchmarks Fall Short

Synthetic benchmarks fail production AI teams in three distinct ways.

They test what you anticipated

Every hand-authored test case represents something the team expected might go wrong. Production users reliably find things the team didn't expect. They phrase requests in ways that trigger edge cases in the prompt. They provide partial information that breaks the agent's assumptions. They combine features in ways the developers never tested. No benchmark author anticipates the full distribution of real user behavior — and the tail of that distribution is where most production failures cluster.

They go stale

A benchmark built for v1 of your agent doesn't reflect the failure modes introduced by the v2 prompt changes, or the new tools added in v3, or the different user segments that started adopting the product in month 4. Maintaining a static benchmark suite requires continuous manual updates from the team — work that consistently loses priority to shipping features. The benchmark drifts away from production reality as the product evolves, until it's measuring something different from what users actually experience.

They can be gamed without improving production quality

Teams that optimize primarily for benchmark performance often find that scores improve while real user outcomes stay flat. This isn't intentional — it's a natural consequence of evaluating on a fixed, known test set. When the model and the benchmark share training signal, overfitting to the benchmark is easy and hard to detect. Production-based evals are harder to overfit because they sample from the full distribution of real behavior, not from a curated subset.

What Production-Based Evals Look Like

Production-based evals are evaluations derived from real production traces rather than synthetic examples. The source of each test case is an actual interaction that happened — an input a real user sent, an output your AI produced, a failure mode a domain expert identified in production data.

This matters for three reasons:

  1. Distribution alignment. The test cases reflect the actual distribution of how users interact with your product — including the long tail of unusual inputs, ambiguous phrasing, and edge cases that synthetic data doesn't capture.

  2. Automatic updating. As new failure modes appear in production, they become new sources of test cases. The eval suite grows in the direction of actual risk rather than anticipated risk.

  3. Ground truth grounding. When a human domain expert identifies a failure in a real production trace, that judgment is ground truth. Evals generated from annotated real failures have a clear, documented source of truth — the actual failure that was observed and the human judgment that identified it as a failure.

The practical challenge is workflow: how do you go from production traces to reusable eval cases, at scale, without requiring the team to manually author every test? That's what production-based eval generation solves.

The Four-Step Loop

Generating evals from production data follows a four-step loop that runs continuously as long as the product is in production.

Step 1: Observe — Capture Production Traces

Everything starts with instrumentation. You cannot generate evals from production data you haven't captured. For a production AI system, this means capturing full traces: all inputs, outputs, intermediate reasoning steps, tool calls, and conversation turns — connected by a session or request identifier so you can reconstruct what happened.

Here's a minimal Python instrumentation example using OpenTelemetry:

import json

from openai import OpenAI
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Configure OTLP exporter (Latitude accepts OTLP format)
exporter = OTLPSpanExporter(
    endpoint="https://otelgateway.latitude.so",
    headers={"Authorization": "Bearer YOUR_API_KEY"}
)
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("your-ai-app")
openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

def call_llm_with_tracing(messages: list[dict], model: str = "gpt-4o") -> str:
    with tracer.start_as_current_span("llm_call") as span:
        span.set_attribute("gen_ai.system", "openai")
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("input.value", json.dumps(messages))

        response = openai_client.chat.completions.create(
            model=model,
            messages=messages
        )
        output = response.choices[0].message.content

        span.set_attribute("output.value", output)
        span.set_attribute("gen_ai.usage.input_tokens", response.usage.prompt_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", response.usage.completion_tokens)

        return output

For agent workflows with tool calls, capture each tool call as a child span within the parent session span:

def run_agent_turn(session_id: str, messages: list[dict]) -> dict:
    with tracer.start_as_current_span("agent_turn") as turn_span:
        turn_span.set_attribute("session.id", session_id)

        response = openai_client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=AVAILABLE_TOOLS
        )

        if response.choices[0].message.tool_calls:
            for tool_call in response.choices[0].message.tool_calls:
                with tracer.start_as_current_span("tool_call") as tool_span:
                    tool_span.set_attribute("tool.name", tool_call.function.name)
                    tool_span.set_attribute("tool.input", tool_call.function.arguments)

                    result = execute_tool(tool_call)

                    tool_span.set_attribute("tool.output", str(result))
                    tool_span.set_attribute("tool.success", result.get("success", False))

        return response.choices[0].message

The critical requirement is that traces are complete and connected. A partial trace — missing tool calls, truncated outputs, or disconnected sessions — produces lower-quality annotations and lower-quality evals downstream.

Step 2: Annotate — Surface and Capture Failure Modes

Once traces are flowing, the next step is identifying which ones contain failures worth capturing. This is where the approach diverges from simply logging.

Randomly sampling production traces for review is inefficient. Most production traces are nominal — they represent the agent working correctly. Reviewing 100 nominal traces to find 3 failures wastes annotator time and produces a low signal-to-noise annotation dataset.

The right approach is anomaly-prioritized review: surface the traces most likely to contain failure modes based on signals like:

  • Unusually long sessions (context window pressure, agent loops)

  • High token usage relative to task complexity

  • Sessions where the user re-submitted the same request (implicit dissatisfaction signal)

  • Tool call sequences that deviate from expected patterns

  • Outputs that contain uncertainty markers ("I'm not sure", "I don't know", hedging language in high-confidence contexts)
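The signals above can be combined into a simple priority score. This is a minimal sketch; the trace fields (`duration_s`, `token_count`, `resubmitted`, `output`) and the weights are illustrative, not a fixed schema — adapt both to your own trace data.

```python
UNCERTAINTY_MARKERS = ("i'm not sure", "i don't know", "i cannot verify")

def anomaly_score(trace: dict) -> float:
    """Higher score = more likely to contain a failure worth annotating."""
    score = 0.0
    if trace.get("duration_s", 0) > 120:      # unusually long session
        score += 1.0
    if trace.get("token_count", 0) > 8_000:   # high token usage
        score += 1.0
    if trace.get("resubmitted", False):       # user retried the same request
        score += 2.0
    output = trace.get("output", "").lower()
    if any(m in output for m in UNCERTAINTY_MARKERS):
        score += 1.5
    return score

def prioritize(traces: list[dict], queue_size: int = 50) -> list[dict]:
    """Return the top-N traces for the annotation queue."""
    return sorted(traces, key=anomaly_score, reverse=True)[:queue_size]
```

Even a crude linear score like this beats random sampling, because it concentrates annotator attention on the tail of the distribution where failures cluster.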

Domain experts review these prioritized traces, classify the output quality, and identify specific failure modes. The key output of each annotation is:

  1. A quality verdict (good / bad / partial)

  2. A failure mode label — what specifically went wrong (e.g., "policy hallucination", "context loss after turn 8", "wrong tool called for lookup")

  3. Optionally, a corrected output that demonstrates what "good" looks like for this input

These annotations are the raw material for eval generation. Their quality determines the quality of the evals that get generated from them — which means the annotation workflow is the most important part of the pipeline to get right.
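The three annotation outputs above map naturally onto a small record type. This is an illustrative sketch, not a fixed Latitude schema — the field names are assumptions:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Annotation:
    trace_id: str
    verdict: str                           # "good" | "bad" | "partial"
    failure_mode: Optional[str] = None     # e.g. "policy hallucination"
    corrected_output: Optional[str] = None # what "good" looks like for this input
    notes: str = ""

    def __post_init__(self):
        # Enforce the invariants described above at construction time
        if self.verdict not in ("good", "bad", "partial"):
            raise ValueError(f"invalid verdict: {self.verdict}")
        if self.verdict == "bad" and not self.failure_mode:
            raise ValueError("bad verdicts require a failure mode label")
```

Requiring a failure mode label on every "bad" verdict is the design choice that matters: unlabeled failures can't seed an eval, so the schema should make them impossible to record.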

A practical note on annotation volume: You don't need thousands of annotations to start generating useful evals. Ten well-annotated examples of the same failure mode are enough for GEPA to generate a reliable evaluator for that pattern. The goal in the first few weeks is to identify and annotate your top 5–10 failure modes, not to achieve comprehensive coverage of every trace.

Step 3: Generate — Auto-Create Evals with GEPA

GEPA (Generative Evaluation Pipeline Algorithm) converts annotated failure patterns into evaluators that run automatically. The process works as follows:

  1. Pattern extraction: GEPA analyzes the annotated traces to identify what distinguishes "bad" outputs from "good" ones for each labeled failure mode. What linguistic or structural features appear in the failures? What's systematically absent in the passing outputs?

  2. Evaluator generation: Based on the extracted patterns, GEPA generates an evaluator — either a rule-based check (if the pattern is deterministic) or an LLM-as-judge prompt (if the pattern requires semantic understanding). The evaluator encodes the human judgment from the annotations into a reusable test.

  3. Quality validation: Before adding the generated eval to the suite, GEPA validates its quality against the annotation dataset using Matthews Correlation Coefficient (MCC). High MCC means the eval reliably identifies the same failures that humans flagged. Low MCC means the eval needs more annotation signal or manual refinement.

  4. Continuous refinement: As more annotations accumulate, GEPA re-runs for each failure mode. Evaluators are updated as the pattern library grows. MCC is recalculated periodically to track whether eval quality is improving or degrading over time.
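To make step 2 concrete, here is what a generated rule-based evaluator might look like for the "policy hallucination" failure mode mentioned earlier: the output may only cite policy IDs that actually exist. The `KNOWN_POLICIES` set, the ID format, and the case schema are all hypothetical:

```python
import re

# Hypothetical registry of real policy IDs for this product
KNOWN_POLICIES = {"POL-101", "POL-102", "POL-205"}

def eval_policy_hallucination(case: dict) -> bool:
    """Pass (True) only if every policy ID cited in the output is real."""
    cited = set(re.findall(r"POL-\d+", case["output"]))
    return cited <= KNOWN_POLICIES  # subset check: no invented IDs
```

A deterministic check like this is preferable to an LLM judge whenever the failure pattern is mechanically detectable: it's free to run, and its verdicts are perfectly reproducible.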

The result is an eval suite that grows automatically from production data. Every failure mode that gets annotated produces a corresponding evaluator. The team's annotation effort compounds over time — 10 annotations this week become a new eval that will catch that failure mode in every future deployment.

Step 4: Iterate — Measure Eval Quality and Close the Loop

Two metrics determine whether your production-based eval system is working.

Eval quality (MCC alignment score): For each eval in the suite, MCC measures the correlation between the eval's verdicts and the human annotations that generated it. A score above 0.6 indicates reliable alignment. Below 0.4 means the eval is inconsistent with human judgment and shouldn't be used as a deployment gate. Track MCC over time — if an eval's alignment score drops as new annotations come in, it means the eval was overfitted to the initial annotation set and needs refinement.
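MCC itself is straightforward to compute from the confusion matrix between the eval's verdicts and the human labels. A minimal stdlib implementation (libraries like scikit-learn also provide this):

```python
import math

def matthews_corrcoef(eval_verdicts: list[bool], human_verdicts: list[bool]) -> float:
    """MCC between an eval's pass/fail verdicts and human annotations.

    Returns a value in [-1, 1]; defined as 0.0 when any marginal is empty.
    """
    tp = sum(e and h for e, h in zip(eval_verdicts, human_verdicts))
    tn = sum((not e) and (not h) for e, h in zip(eval_verdicts, human_verdicts))
    fp = sum(e and (not h) for e, h in zip(eval_verdicts, human_verdicts))
    fn = sum((not e) and h for e, h in zip(eval_verdicts, human_verdicts))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

MCC is preferred over raw accuracy here because failure-mode datasets are heavily imbalanced: an eval that passes everything scores high accuracy on mostly-good traces but earns an MCC near zero.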

Eval suite coverage: What percentage of your known active failure modes have a corresponding eval? If you have 12 failure modes tracked in your issue dashboard and 8 corresponding evals, your coverage is 67%. The 4 uncovered failure modes can still regress without being caught pre-deployment. Coverage tracking makes the gap visible — and prioritizable.

The iteration loop runs continuously:

  • New production failures appear → annotation queue surfaces them for review

  • Domain experts annotate → GEPA generates or refines evals

  • Updated evals run in CI → quality-drop deployments are blocked

  • Post-deployment monitoring confirms regressions are caught → loop closes

Implementation: Putting the Pipeline Together

Setting Up Annotation Queues

An annotation queue is a filtered, prioritized view of production traces for a specific reviewer. Effective queues have three properties:

  1. Filtered by signal: Only traces that have anomaly indicators, low confidence scores, or specific patterns appear in the queue. Not all traces.

  2. Prioritized by impact: Traces representing high-frequency patterns appear before rare ones. Fixing the most common failure mode first produces more total improvement than fixing rare edge cases.

  3. Assigned to the right reviewer: Domain experts who understand what correct behavior looks like for a given part of the product — not just any engineer.

Building the Eval Runner

Once GEPA has generated evals, you need to run them in CI. Here's a minimal Python eval runner that executes a suite of generated evaluators against a test dataset:

from dataclasses import dataclass
from typing import Callable

@dataclass
class GeneratedEval:
    id: str
    failure_mode: str
    eval_type: str  # "rule" or "llm_judge"
    evaluator: Callable[[dict], bool]
    mcc_score: float
    coverage_target: str  # which issue this eval covers

def run_eval_suite(
    eval_cases: list[dict],
    evals: list[GeneratedEval],
    blocking_threshold: float = 0.80
) -> dict:
    """
    Run all evals against test cases.
    Returns pass/fail per eval plus overall deployment decision.
    """
    results = {}
    deployment_blocked = False
    blocking_reason = None

    for eval_ in evals:
        # Skip low-quality evals (MCC below threshold)
        if eval_.mcc_score < 0.4:
            results[eval_.id] = {"skipped": True, "reason": "low_mcc"}
            continue

        verdicts = []
        for case in eval_cases:
            verdict = eval_.evaluator(case)
            verdicts.append(verdict)

        pass_rate = sum(verdicts) / len(verdicts)
        results[eval_.id] = {
            "pass_rate": pass_rate,
            "passed": pass_rate >= blocking_threshold,
            "failure_mode": eval_.failure_mode,
            "sample_count": len(verdicts)
        }

        # High-MCC evals covering active issues block deployment
        if eval_.mcc_score >= 0.6 and not results[eval_.id]["passed"]:
            deployment_blocked = True
            blocking_reason = eval_.failure_mode

    # Compute the suite pass rate over evals that actually ran,
    # so skipped low-MCC evals don't count as failures
    scored = [r for r in results.values() if not r.get("skipped")]
    return {
        "deployment_blocked": deployment_blocked,
        "blocking_reason": blocking_reason,
        "eval_results": results,
        "suite_pass_rate": (
            sum(1 for r in scored if r["passed"]) / len(scored)
            if scored else 0.0
        )
    }

Connecting Eval Results Back to Issue Tracking

The final piece is closing the loop between eval results and the issue tracker. When an eval fails pre-deployment, it should automatically link to the issue that generated it — so the team can see whether the failure is a new regression or a known open issue.

When an eval passes after previously failing (i.e., a fix was deployed and the eval suite confirms it), the corresponding issue should be flagged for resolution verification. Human annotators then confirm whether the fix actually resolved the failure mode in production, or whether the eval was fooled.

This closed loop — production failure → annotated issue → generated eval → CI gate → resolution verification → annotation confirmation — is what distinguishes a production-aligned eval system from a static benchmark suite.

Scaling: Practical Considerations

Sampling rates

Running LLM-as-judge evals on 100% of production traffic is expensive and usually unnecessary. Use sampling: run evals on a configurable percentage of traces per evaluator. Start with 10–30% for most evaluators; reserve 100% sampling for evals covering your highest-severity failure modes (safety violations, critical incorrect information).

The right sampling rate depends on failure mode frequency. For a failure mode that appears in 5% of traces, a 10% sample will surface enough examples to track. For a failure mode that appears in 0.1% of traces, you need higher sampling or a targeted signal (like a rule-based pre-filter) to surface enough examples.
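A per-evaluator sampling policy can be a few lines. This is a sketch; the evaluator names and rates are illustrative, with severity-critical evals pinned to 100%:

```python
import random

SAMPLING_RATES = {
    "safety_violation": 1.00,      # highest severity: evaluate every trace
    "policy_hallucination": 0.30,
    "tone_drift": 0.10,
}

def should_evaluate(eval_id: str, rng: random.Random) -> bool:
    """Decide whether this trace gets scored by the given evaluator."""
    rate = SAMPLING_RATES.get(eval_id, 0.10)   # default: 10% sampling
    return rng.random() < rate
```

Passing the `random.Random` instance in explicitly keeps the decision reproducible in tests while staying uniform in production.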

Annotation throughput

Human annotation is the rate-limiting step. A realistic throughput for one domain expert doing focused annotation is 50–150 traces per week, depending on trace complexity. Plan annotation cycles accordingly — 2 hours per week of focused annotation is enough to grow the eval suite meaningfully over time.

The bottleneck is almost never annotation time — it's annotation prioritization. Reviewers who have to dig through random traces to find failures burn out and produce lower-quality annotations. Tooling that surfaces the right traces first makes the same 2 hours produce 5–10x more signal.

Eval proliferation

Over time, eval suites accumulate evals for failure modes that are no longer active. An eval for a bug that was fixed 6 months ago and never recurred is noise — it slows down CI and adds cognitive overhead to result interpretation. Periodically prune evals whose corresponding issues are resolved and haven't been seen in production for a defined window (e.g., 90 days). Keep the suite focused on your current active failure mode profile.
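A pruning pass along these lines can run on a schedule. The eval record fields (`issue_resolved`, `last_seen`) are illustrative assumptions, not a fixed schema:

```python
from datetime import datetime, timedelta

def prune_evals(evals: list[dict], now: datetime, window_days: int = 90) -> list[dict]:
    """Keep an eval unless its issue is resolved AND its failure mode
    hasn't been seen in production within the window."""
    cutoff = now - timedelta(days=window_days)
    return [
        e for e in evals
        if not (e["issue_resolved"] and e["last_seen"] < cutoff)
    ]
```

Note that both conditions are required: an unresolved issue keeps its eval indefinitely, and a recently sighted failure mode keeps its eval even after the issue is nominally closed.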

Platform Comparison

Not all platforms support production-based eval generation. The distinction to look for is whether the platform connects observability, annotation, and eval creation in a single workflow — or whether these are separate tools you need to connect yourself.

| Platform | Annotation Queues | Auto Eval Generation | Eval Quality Tracking | Issue Lifecycle |
| --- | --- | --- | --- | --- |
| <strong>Latitude</strong> | Built-in, anomaly-prioritized | GEPA algorithm, auto from annotations | MCC score, tracked over time | Full lifecycle, first sighting to resolution |
| Braintrust | Manual annotation only | None; evals are manually authored | None | Topics (beta): unsupervised clustering only |
| Langfuse | 1 queue on free plan, limited | None; fully manual workflow | Score analytics only, no quality metric | No concept of issue |
| LangSmith | Human annotation queues | None; datasets built manually | Align Evals tool, but doesn't persist over time | Insights (partial): no issue states |
| Galileo | Available via Signals | Partial: Signals generate groupings, no eval creation | None | No concept of issue |

The key differentiator is whether the platform has a concept of an issue — a tracked failure mode that persists from first sighting through annotation through eval generation through resolution. Without that, you're managing observability, annotation, and evals as separate tools and doing the integration work yourself.

Getting Started

If you have a production AI system and want to start generating evals from real data, here's the practical starting sequence:

  1. Instrument production traces. Connect your AI to an observability platform that captures full traces. Don't instrument a subset of calls — you need the full picture to identify failure patterns.

  2. Identify your top failure modes manually first. Before setting up annotation queues, review 50–100 production traces yourself. You'll find 3–5 recurring failure patterns quickly. Write them down as named categories.

  3. Set up annotation queues for those failure modes. Configure your annotation workflow to surface traces likely to contain those specific patterns. Give your domain experts a focused queue rather than a random sample.

  4. Annotate 10–20 examples per failure mode. This is enough signal to generate a first-generation eval. Quality at this stage matters more than quantity — a well-annotated trace is worth 10 hasty ones.

  5. Generate evals and validate quality. Run GEPA on the annotated data, check MCC scores, and add the evals to your CI pipeline. Block on evals with MCC above 0.6; monitor (don't block) on evals between 0.4 and 0.6.

  6. Run annotation cycles weekly. Two hours per week, focused on the highest-anomaly traces. Each cycle grows the annotation dataset, refines existing evals, and generates new ones for newly discovered failure modes.

The teams that reach stable production AI quality all converge on this loop. The specific tooling matters less than having all four steps connected — instrumentation, annotation, generation, and iteration — so that production failures automatically become future prevention.

Latitude is built around this loop. The free plan includes 5,000 traces per month and 500 trace scans — enough to instrument a production system, identify your first failure modes, and generate your first evals before spending anything. Start for free →

Frequently Asked Questions

How do you generate AI evaluations from production data?

Generating AI evaluations from production data follows four steps: (1) Instrument your production AI to capture full traces — all inputs, outputs, tool calls, and conversation turns. (2) Build an annotation workflow where domain experts review anomaly-prioritized traces and classify failure modes. (3) Use an algorithm like GEPA to automatically generate evaluations from those annotated failure modes — each annotation becomes the seed of a test that will catch similar failures in future deployments. (4) Measure eval quality using an alignment metric (like MCC) to verify that the generated evals actually correlate with human judgment. The key principle is that failure patterns live in production data, not in what your team can imagine ahead of time — generating evals from real failures produces a test suite that reflects your product's actual risk profile.
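To make step 3 concrete, here's a minimal sketch of how annotated failure modes seed evaluators. This illustrates only the grouping idea, not GEPA's actual implementation; the record fields are assumptions for the example:

```python
# Hypothetical annotation records produced by domain experts in step 2.
annotations = [
    {"trace_id": "t-101", "failure_mode": "wrong_tool", "verdict": "fail"},
    {"trace_id": "t-102", "failure_mode": "wrong_tool", "verdict": "fail"},
    {"trace_id": "t-103", "failure_mode": "hallucinated_policy", "verdict": "fail"},
]

def seed_evals(annotations):
    """Group annotations by failure mode; each group seeds one evaluator."""
    by_mode = {}
    for a in annotations:
        by_mode.setdefault(a["failure_mode"], []).append(a)
    return [
        {"eval_name": mode, "seed_examples": len(group)}
        for mode, group in by_mode.items()
    ]

# Each new annotation either grows an existing eval's seed set
# or creates a new eval for a newly discovered failure mode.
print(seed_evals(annotations))
```

This is why the eval suite grows automatically with annotation volume: the mapping from failure modes to evaluators is one-to-one, so discovering a new failure mode in annotation creates a new test by construction.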

Why are synthetic benchmarks not enough for AI evaluation?

Synthetic benchmarks fail production AI teams for three reasons. First, they test what you anticipated — not what users actually do. Real user behavior reliably produces edge cases, ambiguous inputs, and usage patterns that no benchmark author predicts. Second, they go stale as your product evolves. A benchmark built for v1 of your agent doesn't reflect the failure modes introduced by v3. Third, they optimize for benchmark performance, not production reliability. Teams that evaluate primarily on synthetic benchmarks often find that benchmark scores improve while real user outcomes stay flat or degrade. Production-based evals solve all three problems: they capture the actual distribution of user behavior, update automatically as new failure modes appear, and are inherently aligned with production reality because they were generated from it.

What is GEPA and how does it auto-generate evals?

GEPA (Generative Evaluation Pipeline Algorithm) is Latitude's approach to automatically generating evaluations from human-annotated production failures. The process works as follows: domain experts annotate production traces to classify failure modes and define what "good" means in their specific context. GEPA takes those annotations and generates evaluation criteria that capture the same judgment — turning each annotated failure pattern into a reusable test. As more annotations are added, GEPA refines existing evals and generates new ones, so the eval suite grows automatically. The quality of each generated eval is measured using the Matthews Correlation Coefficient (MCC), which tracks how well the eval aligns with human annotations over time. This means you end up with an eval library that grows from real production issues and whose quality is continuously tracked — not a static set of manually written test cases.

How do you measure the quality of AI evaluations?

Eval quality measures whether an evaluator correctly identifies failures that humans would also flag. The most rigorous approach is the Matthews Correlation Coefficient (MCC), which measures the correlation between the evaluator's verdicts and human annotations. MCC is preferred over simple accuracy because it handles class imbalance — in most production datasets, "pass" cases significantly outnumber "fail" cases, and an evaluator that always returns "pass" would achieve high accuracy but zero predictive value. Beyond individual eval quality, track eval suite coverage: what percentage of your known active failure modes have a corresponding eval? Coverage gaps are as important as individual eval accuracy.
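MCC's advantage over accuracy is easy to see with a small calculation. A minimal sketch, treating "fail" as the positive class:

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from a confusion matrix.
    Returns 0.0 when the denominator is zero (degenerate classifier)."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

# Imbalanced dataset: 90 passes, 10 fails. An evaluator that always
# returns "pass" never flags a failure: tp=0, fn=10, fp=0, tn=90.
accuracy = 90 / 100                    # 0.9 -- looks good
score = mcc(tp=0, tn=90, fp=0, fn=10)  # 0.0 -- zero predictive value
```

The always-pass evaluator scores 90% accuracy but an MCC of 0, which is exactly the gap the metric is designed to expose.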

What is the difference between rule-based and LLM-as-judge evals?

Rule-based evals (assertions, regex, schema validation, deterministic checks) are fast, cheap, and perfectly reproducible — they always give the same verdict for the same input. They work well for structural requirements: did the agent return valid JSON? Did it call the right tool? LLM-as-judge evals use a language model to assess quality dimensions that can't be captured with rules — tone, coherence, alignment with product-specific success criteria. LLM-as-judge evals are more expensive and less deterministic but can assess semantic quality. Production eval pipelines typically use both: rule-based evals as a fast first pass, LLM-as-judge for quality dimensions that require semantic understanding. The key is calibrating LLM-as-judge evals against human annotations to ensure they're measuring what you actually care about.
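As a sketch of the rule-based side, the check below validates structure only; the function and field names are illustrative, and semantic dimensions like tone would still need an LLM-as-judge pass:

```python
import json

def is_valid_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except (ValueError, TypeError):
        return False

def rule_based_eval(output: str, tool_calls: list, expected_tool: str):
    """Deterministic structural checks: the same input always
    produces the same verdict, at near-zero cost."""
    checks = {
        "valid_json": is_valid_json(output),
        "called_expected_tool": expected_tool in tool_calls,
    }
    return all(checks.values()), checks

ok, detail = rule_based_eval(
    output='{"refund_amount": 42.0}',
    tool_calls=["lookup_order", "issue_refund"],
    expected_tool="issue_refund",
)
# ok is True here; both named checks passed.
```

Returning the per-check breakdown alongside the verdict makes failures debuggable in CI logs, which is harder to get from a single LLM-as-judge score.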

Further Reading

Latitude's free plan is the fastest way to start the production-to-eval loop: instrument your traces, run the annotation queue for two weeks, and let GEPA generate your first eval suite from real failure modes. Get started free →

Build reliable AI.

Latitude Data S.L. 2026

All rights reserved.
