
Programmatic Rule Evaluations Explained

Learn what Programmatic Rule Evaluations are, how they work in LLM evaluation, and when to use methods like exact match, ROUGE, regex, schema validation, and length checks to measure deterministic output quality.

César Miguelañez

Feb 23, 2026

What Are LLM Evaluations?

LLM evaluations are custom metrics that run after each prompt execution to measure whether the output succeeds or fails according to criteria you define.

There are three primary types of evaluations:

  1. Programmatic Rule Evaluations

  2. LLM-as-Judge Evaluations

  3. Human-in-the-Loop Evaluations

Choosing the right evaluation type depends on:

  • What you are trying to measure

  • What your output looks like

  • How strict or flexible the criteria must be

Each evaluation type has advantages and tradeoffs depending on the use case. In this series, we will cover all three and explain when to use each one.

For more information about LLM evaluation in general, read our comprehensive article on LLM evaluation.

We begin with Programmatic Rule Evaluations, which are often the most conceptually challenging, especially if you are not familiar with NLP techniques.

What Are Programmatic Rule Evaluations?

Programmatic Rule Evaluations run your model’s output through a deterministic algorithm to verify that it satisfies predefined criteria.

These evaluations work with up to two values:

  • Expected output: The reference answer you provide for a given input.

  • Actual output: The model’s response that is being evaluated.

Note: The values for your expected output should come from a golden dataset. Learn more here: https://docs.latitude.so/guides/datasets/golden-datasets

Algorithms That Require Expected Output

These methods compare the model’s output against a reference answer.

  • Exact Match

  • Lexical Overlap

  • Semantic Similarity

  • Numeric Similarity

If expected output is required, you must provide an example response that serves as the comparison baseline.

Exact Match

Exact Match is a binary similarity metric.

It checks whether the model’s output is identical to the expected output for the same input.

  • If identical → Pass

  • If not identical → Fail

This is the strictest possible evaluation.
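The check itself is a single string comparison. Here is a minimal Python sketch (the function names are illustrative, not part of any particular library); a common variant normalizes case and surrounding whitespace before comparing:

```python
def exact_match(expected: str, actual: str) -> bool:
    """Pass only when the output is byte-for-byte identical to the reference."""
    return actual == expected


def exact_match_normalized(expected: str, actual: str) -> bool:
    """Looser variant: ignore leading/trailing whitespace and letter case."""
    return actual.strip().lower() == expected.strip().lower()
```

Even the normalized variant fails on any wording difference, which is why Exact Match suits short, canonical answers rather than free-form text.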

Lexical Overlap

Lexical overlap measures surface similarity between the model output and the reference answer. It evaluates word or character overlap, not meaning.

Substring Matching

Checks whether the reference answer appears exactly inside the model output.

Use case:

Strict factual validation.

Example:

Reference:

Paris

Output:

The capital of France is Paris.

Result: Match
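In code, substring matching is a containment test. A minimal sketch in Python (function name is illustrative):

```python
def substring_match(reference: str, output: str) -> bool:
    """Pass when the reference answer appears verbatim inside the output."""
    return reference in output
```

Note that this is case-sensitive and verbatim: "paris" would not match the reference "Paris" unless you normalize both strings first.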

Levenshtein Distance

Measures how many character edits are required to transform one string into another.

Edits include:

  • Insertions

  • Deletions

  • Substitutions

Lower distance means higher similarity.

Example:

color vs colour → Distance = 1

Use case:

Handling small spelling or formatting differences.
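The distance is usually computed with dynamic programming. A compact, stdlib-only Python sketch (one of several standard formulations):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings using a rolling DP row.

    Counts the minimum number of insertions, deletions, and
    substitutions needed to turn `a` into `b`.
    """
    if len(a) < len(b):
        a, b = b, a  # keep the inner loop over the shorter string
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if chars match)
            ))
        prev = curr
    return prev[-1]
```

For the example above, `levenshtein("color", "colour")` returns 1 (one inserted character). An evaluation would typically pass when the distance falls below a threshold you choose.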

ROUGE-1

Measures overlap of individual words between output and reference. Used with shorter responses.

Use case:

Basic content coverage.

ROUGE-2

Measures overlap of two-word sequences, also known as bigrams.

Stricter than ROUGE-1 because it rewards correct phrasing, not just word presence.

Used with longer responses.

Example:

Reference:

The cat sat on the mat

Output:

The cat sat on mat

  • ROUGE-1 → High

  • ROUGE-2 → Lower
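Both metrics reduce to counting shared n-grams. A minimal sketch of ROUGE-N recall (the fraction of reference n-grams that also appear in the output), using whitespace tokenization for simplicity; production implementations use proper tokenizers and also report precision and F1:

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """All contiguous n-token sequences."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n_recall(reference: str, output: str, n: int) -> float:
    """Clipped n-gram overlap divided by the number of reference n-grams."""
    ref = Counter(ngrams(reference.lower().split(), n))
    out = Counter(ngrams(output.lower().split(), n))
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, out[gram]) for gram, count in ref.items())
    return overlap / total
```

On the example above, ROUGE-1 recall is 5/6 (five of six reference words recovered) while ROUGE-2 recall drops to 3/5, because the missing "the" breaks two bigrams.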

Algorithms That Do Not Require Expected Output

These methods evaluate structure or rule adherence without comparing to a reference answer.

Regular Expressions

Regular expression evaluations check whether output matches a predefined pattern.

They use standard RegEx syntax.

Reference: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_expressions/Cheatsheet

Example:

^(?!\.)(?!.*\.\.)([a-zA-Z0-9._%+-]{1,64})(?<!\.)@gmail\.com$

This pattern matches a valid Gmail address.

Use case:

Ensuring that a model correctly formats tool inputs, such as email addresses.
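In Python the same Gmail pattern can be applied with the standard `re` module (written here as a raw string, so each backslash appears once):

```python
import re

# Local part: 1-64 allowed characters, no leading/trailing dot,
# no consecutive dots; must end with @gmail.com.
GMAIL = re.compile(r"^(?!\.)(?!.*\.\.)([a-zA-Z0-9._%+-]{1,64})(?<!\.)@gmail\.com$")


def matches_gmail(output: str) -> bool:
    """Pass when the entire output is a well-formed Gmail address."""
    return GMAIL.match(output) is not None
```

The `^...$` anchors mean the whole output must be the address; if the model wraps the address in a sentence, the check fails, which is usually what you want when validating tool inputs.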

Schema Validation

Schema validation checks whether the output conforms to a predefined JSON schema.

Example:

{
  "type": "object",
  "additionalProperties": false,
  "properties": {
    "email": {
      "type": "string",
      "format": "email"
    },
    "subject": {
      "type": "string",
      "minLength": 1
    },
    "body": {
      "type": "string",
      "minLength": 1
    },
    "priority": {
      "type": "string",
      "enum": ["low", "medium", "high"]
    }
  },
  "required": ["email", "subject", "body"]
}

Use case:

Validating structured outputs such as tool calls or API responses.
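In practice you would run the output through a JSON Schema validator library (such as the third-party `jsonschema` package). To keep this sketch self-contained with only the standard library, here is a hand-rolled check mirroring the schema above; it is illustrative only and uses a crude stand-in for `format: email`:

```python
import json

REQUIRED = {"email", "subject", "body"}          # schema "required"
ALLOWED = REQUIRED | {"priority"}                # additionalProperties: false
PRIORITIES = {"low", "medium", "high"}           # "enum" for priority


def validate_output(raw: str) -> bool:
    """Parse model output as JSON and check it against the schema's rules."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False                             # not valid JSON at all
    if not isinstance(obj, dict):
        return False
    if not REQUIRED <= obj.keys():               # all required keys present
        return False
    if not obj.keys() <= ALLOWED:                # no extra keys allowed
        return False
    for key in ("email", "subject", "body"):
        if not isinstance(obj[key], str):
            return False
    if len(obj["subject"]) < 1 or len(obj["body"]) < 1:   # minLength: 1
        return False
    if "@" not in obj["email"]:                  # crude "format: email" proxy
        return False
    if "priority" in obj and obj["priority"] not in PRIORITIES:
        return False
    return True
```

A dedicated validator library handles all of this generically from the schema itself and gives you precise error messages, which is preferable in production.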

Length Count

Length evaluations measure the size of the output. There are three variants:

Character Count

Counts total characters.

Example:

Good morning! How are you?

Length = 26 characters

Word Count

Counts total words.

Example:

Good morning! How are you?

Length = 5 words

Sentence Count

Counts total sentences.

Example:

Good morning! How are you?

Length = 2 sentences
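All three variants are simple to compute. A stdlib-only Python sketch; the sentence splitter here is deliberately naive (splitting on `.`, `!`, `?`), and real evaluators use a sentence tokenizer to handle abbreviations and decimals:

```python
import re


def char_count(text: str) -> int:
    """Total characters, including spaces and punctuation."""
    return len(text)


def word_count(text: str) -> int:
    """Whitespace-separated tokens."""
    return len(text.split())


def sentence_count(text: str) -> int:
    """Naive: split on runs of terminal punctuation, drop empty pieces."""
    return len([s for s in re.split(r"[.!?]+", text) if s.strip()])
```

For the example output "Good morning! How are you?", these return 26, 5, and 2 respectively, matching the counts above.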

When to Use a Programmatic Rule Evaluation

Programmatic rule evaluations are most useful when you know roughly what the prompt should output and want to track the presence or absence of a specific feature or failure mode. They are not useful for assessing the quality of language or the thoroughness of a response, because they cannot detect semantic features of language.

Build reliable AI.

Latitude Data S.L. 2026

All rights reserved.
