Learn what Programmatic Rule Evaluations are, how they work in LLM evaluation, and when to use methods like exact match, ROUGE, regex, schema validation, and length checks to measure deterministic output quality.

César Miguelañez

Feb 23, 2026
What Are LLM Evaluations?
LLM evaluations are custom metrics that run after each prompt execution to measure whether the output succeeds or fails according to criteria you define.
There are three primary types of evaluations:
Programmatic Rule Evaluations
LLM-as-Judge Evaluations
Human-in-the-Loop Evaluations
Choosing the right evaluation type depends on:
What you are trying to measure
What your output looks like
How strict or flexible the criteria must be
Each evaluation type has advantages and tradeoffs depending on the use case. In this series, we will cover all three and explain when to use each one.
For more information about LLM evaluation in general, read our comprehensive article on LLM evaluation.
We begin with Programmatic Rule Evaluations, which are often the most conceptually challenging, especially if you are not familiar with NLP techniques.
What Are Programmatic Rule Evaluations?
Programmatic Rule Evaluations run your model’s output through a deterministic algorithm to verify that it satisfies predefined criteria.
Two key terms come up throughout:
Expected output: The reference answer you provide for a given input.
Actual output: The model’s response that is being evaluated.
These evaluations fall into two categories: algorithms that require an expected output and algorithms that do not.
❗ The values for your expected output should come from a golden dataset. Learn more here: https://docs.latitude.so/guides/datasets/golden-datasets
Algorithms That Require Expected Output
These methods compare the model’s output against a reference answer.
Exact Match
Lexical Overlap
Semantic Similarity
Numeric Similarity
If expected output is required, you must provide an example response that serves as the comparison baseline.
Exact Match
Exact Match is a binary similarity metric.
It checks whether the model’s output is identical to the expected output for the same input.
If identical → Pass
If not identical → Fail
This is the strictest possible evaluation.
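A minimal sketch of the check in Python (the function name is illustrative, not a specific library API):

```python
def exact_match(expected: str, actual: str) -> bool:
    """Pass only if the output is character-for-character identical to the reference."""
    return actual == expected

print(exact_match("Paris", "Paris"))  # True
print(exact_match("Paris", "paris"))  # False: even a casing difference fails
```

Because any deviation fails, exact match is best reserved for short, fully constrained outputs such as classification labels.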
Lexical Overlap
Lexical overlap measures surface similarity between the model output and the reference answer. It evaluates word or character overlap, not meaning.
Substring Matching
Checks whether the reference answer appears exactly inside the model output.
Use case:
Strict factual validation.
Example:
Reference:
Paris
Output:
The capital of France is Paris.
Result: Match
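In Python, this reduces to a containment test (a sketch; the function name is illustrative):

```python
def substring_match(reference: str, output: str) -> bool:
    """Pass if the reference answer appears verbatim inside the model output."""
    return reference in output

print(substring_match("Paris", "The capital of France is Paris."))  # True
print(substring_match("Paris", "The capital of France is Lyon."))   # False
```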
Levenshtein Distance
Measures how many character edits are required to transform one string into another.
Edits include:
Insertions
Deletions
Substitutions
Lower distance means higher similarity.
Example:
color vs colour → Distance = 1
Use case:
Handling small spelling or formatting differences.
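The distance can be computed with a standard dynamic-programming approach; a compact sketch:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if chars match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("color", "colour"))  # 1
```

An evaluation would typically pass if the distance falls below a threshold you choose.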
ROUGE-1
Measures overlap of individual words (unigrams) between output and reference. Typically used with shorter responses.
Use case:
Basic content coverage.
ROUGE-2
Measures overlap of two-word sequences, also known as bigrams.
More strict than ROUGE-1 because it rewards correct phrasing, not just word presence.
Used with longer responses.
Example:
Reference:
The cat sat on the mat
Output:
The cat sat on mat
ROUGE-1 → High
ROUGE-2 → Lower
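The contrast above can be reproduced with a simplified ROUGE-N recall sketch (real implementations handle tokenization and stemming more carefully):

```python
from collections import Counter

def rouge_n(reference: str, output: str, n: int) -> float:
    """Fraction of the reference's n-grams that also appear in the output."""
    def ngrams(text: str) -> Counter:
        words = text.lower().split()
        return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    ref, out = ngrams(reference), ngrams(output)
    overlap = sum((ref & out).values())  # clipped n-gram overlap
    return overlap / max(sum(ref.values()), 1)

ref = "The cat sat on the mat"
out = "The cat sat on mat"
print(rouge_n(ref, out, 1))  # 5/6 ≈ 0.83 — high unigram overlap
print(rouge_n(ref, out, 2))  # 3/5 = 0.6  — lower once word order matters
```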
Algorithms That Do Not Require Expected Output
These methods evaluate structure or rule adherence without comparing to a reference answer.
Regular Expressions
Regular expression evaluations check whether output matches a predefined pattern.
They use standard RegEx syntax.
Reference: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_expressions/Cheatsheet
Example:
A pattern can check that the output is a valid Gmail address.
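A sketch of such a check in Python (the exact pattern is illustrative, not an official one):

```python
import re

# Illustrative pattern: one or more allowed local-part characters,
# followed by the literal domain @gmail.com
GMAIL_PATTERN = re.compile(r"^[A-Za-z0-9._%+-]+@gmail\.com$")

print(bool(GMAIL_PATTERN.match("jane.doe@gmail.com")))    # True
print(bool(GMAIL_PATTERN.match("jane.doe@example.com")))  # False
```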
Use case:
Ensuring that a model correctly formats tool inputs, such as email addresses.
Schema Validation
Schema validation checks whether the output conforms to a predefined JSON schema.
Example:
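A minimal hand-rolled sketch, assuming the schema requires an object with a string field `city` (the field name is illustrative; production setups typically use a dedicated JSON Schema validator library such as jsonschema):

```python
import json

def passes_schema(raw_output: str) -> bool:
    """Pass if the output parses as JSON and matches the expected shape."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False  # not valid JSON at all
    # Expected shape: an object with a required string field "city"
    return isinstance(data, dict) and isinstance(data.get("city"), str)

print(passes_schema('{"city": "Paris"}'))  # True
print(passes_schema('{"city": 42}'))       # False: wrong type
print(passes_schema('not even JSON'))      # False: parse failure
```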
Use case:
Validating structured outputs such as tool calls or API responses.
Length Count
Length evaluations measure the size of the output. There are three variants:
Character Count
Counts total characters.
Example:
Good morning! How are you?
Length = 26 characters
Word Count
Counts total words.
Example:
Good morning! How are you?
Length = 5 words
Sentence Count
Counts total sentences.
Example:
Good morning! How are you?
Length = 2 sentences
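All three variants above are simple to compute; a sketch (the sentence heuristic splits on terminal punctuation and is intentionally naive):

```python
import re

text = "Good morning! How are you?"

char_count = len(text)                              # counts every character, spaces included
word_count = len(text.split())                      # whitespace-delimited tokens
sentence_count = len(re.findall(r"[.!?]+", text))   # runs of terminal punctuation

print(char_count, word_count, sentence_count)  # 26 5 2
```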
When to Use a Programmatic Rule Evaluation
Programmatic rule evaluations are most useful when you know roughly what the prompt should output and want to track the presence or absence of a specific feature or failure mode. They are not useful for assessing the quality of language or the thoroughness of a response, since they cannot detect semantic features of language.


