Learn what Programmatic Rule Evaluations are, how they work in LLM evaluation, and when to use methods like exact match, ROUGE, regex, schema validation, and length checks to measure deterministic output quality.

César Miguelañez

Feb 23, 2026
What Are LLM Evaluations?
LLM evaluations are custom metrics that run after each prompt execution to measure whether the output succeeds or fails according to criteria you define.
There are three primary types of evaluations:
Programmatic Rule Evaluations
LLM-as-Judge Evaluations
Human-in-the-Loop Evaluations
Choosing the right evaluation type depends on:
What you are trying to measure
What your output looks like
How strict or flexible the criteria must be
Each evaluation type has advantages and tradeoffs depending on the use case. In this series, we will cover all three and explain when to use each one.
For more information about LLM evaluation in general, read our comprehensive article on LLM evaluation.
We begin with Programmatic Rule Evaluations, which are often the most conceptually challenging, especially if you are not familiar with NLP techniques.
What Are Programmatic Rule Evaluations?
Programmatic Rule Evaluations run your model’s output through a deterministic algorithm to verify that it satisfies predefined criteria.
Two key terms come up throughout:
Expected output: The reference answer you provide for a given input.
Actual output: The model’s response that is being evaluated.
These evaluations fall into two categories: algorithms that require an expected output and algorithms that do not.
❗ The values for your expected output should come from a golden dataset. Learn more here: https://docs.latitude.so/guides/datasets/golden-datasets
Algorithms That Require Expected Output
These methods compare the model’s output against a reference answer.
Exact Match
Lexical Overlap
Semantic Similarity
Numeric Similarity
If expected output is required, you must provide an example response that serves as the comparison baseline.
Exact Match
Exact Match is a binary similarity metric.
It checks whether the model’s output is identical to the expected output for the same input.
If identical → Pass
If not identical → Fail
This is the strictest possible evaluation.
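A minimal sketch of the check in Python (the function name is illustrative, not a specific library API):

```python
def exact_match(expected: str, actual: str) -> bool:
    """Pass only if the output is character-for-character identical to the reference."""
    return actual == expected

print(exact_match("Paris", "Paris"))  # True
print(exact_match("Paris", "paris"))  # False: even a casing difference fails
```

Because any deviation fails, exact match is best reserved for short, fully constrained outputs such as classification labels.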
Lexical Overlap
Lexical overlap measures surface similarity between the model output and the reference answer. It evaluates word or character overlap, not meaning.
Substring Matching
Checks whether the reference answer appears exactly inside the model output.
Use case:
Strict factual validation.
Example:
Reference:
Paris
Output:
The capital of France is Paris.
Result: Match
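In Python, this reduces to a containment test (a sketch; the function name is illustrative):

```python
def substring_match(reference: str, output: str) -> bool:
    """Pass if the reference answer appears verbatim inside the model output."""
    return reference in output

print(substring_match("Paris", "The capital of France is Paris."))  # True
print(substring_match("Paris", "The capital of France is Lyon."))   # False
```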
Levenshtein Distance
Measures how many character edits are required to transform one string into another.
Edits include:
Insertions
Deletions
Substitutions
Lower distance means higher similarity.
Example:
color vs colour → Distance = 1
Use case:
Handling small spelling or formatting differences.
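The distance can be computed with a standard dynamic-programming approach; a compact sketch:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if chars match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("color", "colour"))  # 1
```

An evaluation would typically pass if the distance falls below a threshold you choose.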
ROUGE-1
Measures overlap of individual words (unigrams) between output and reference. Typically used with shorter responses.
Use case:
Basic content coverage.
ROUGE-2
Measures overlap of two-word sequences, also known as bigrams.
More strict than ROUGE-1 because it rewards correct phrasing, not just word presence.
Used with longer responses.
Example:
Reference:
The cat sat on the mat
Output:
The cat sat on mat
ROUGE-1 → High
ROUGE-2 → Lower
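The contrast above can be reproduced with a simplified ROUGE-N recall sketch (real implementations handle tokenization and stemming more carefully):

```python
from collections import Counter

def rouge_n(reference: str, output: str, n: int) -> float:
    """Fraction of the reference's n-grams that also appear in the output."""
    def ngrams(text: str) -> Counter:
        words = text.lower().split()
        return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    ref, out = ngrams(reference), ngrams(output)
    overlap = sum((ref & out).values())  # clipped n-gram overlap
    return overlap / max(sum(ref.values()), 1)

ref = "The cat sat on the mat"
out = "The cat sat on mat"
print(rouge_n(ref, out, 1))  # 5/6 ≈ 0.83 — high unigram overlap
print(rouge_n(ref, out, 2))  # 3/5 = 0.6  — lower once word order matters
```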
Algorithms That Do Not Require Expected Output
These methods evaluate structure or rule adherence without comparing to a reference answer.
Regular Expressions
Regular expression evaluations check whether output matches a predefined pattern.
They use standard RegEx syntax.
Reference: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_expressions/Cheatsheet
Example:
A pattern can check that the output is a valid Gmail address.
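A sketch of such a check in Python (the exact pattern is illustrative, not an official one):

```python
import re

# Illustrative pattern: one or more allowed local-part characters,
# followed by the literal domain @gmail.com
GMAIL_PATTERN = re.compile(r"^[A-Za-z0-9._%+-]+@gmail\.com$")

print(bool(GMAIL_PATTERN.match("jane.doe@gmail.com")))    # True
print(bool(GMAIL_PATTERN.match("jane.doe@example.com")))  # False
```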
Use case:
Ensuring that a model correctly formats tool inputs, such as email addresses.
Schema Validation
Schema validation checks whether the output conforms to a predefined JSON schema.
Example:
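A minimal hand-rolled sketch, assuming the schema requires an object with a string field `city` (the field name is illustrative; production setups typically use a dedicated JSON Schema validator library such as jsonschema):

```python
import json

def passes_schema(raw_output: str) -> bool:
    """Pass if the output parses as JSON and matches the expected shape."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False  # not valid JSON at all
    # Expected shape: an object with a required string field "city"
    return isinstance(data, dict) and isinstance(data.get("city"), str)

print(passes_schema('{"city": "Paris"}'))  # True
print(passes_schema('{"city": 42}'))       # False: wrong type
print(passes_schema('not even JSON'))      # False: parse failure
```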
Use case:
Validating structured outputs such as tool calls or API responses.
Length Count
Length evaluations measure the size of the output. There are three variants:
Character Count
Counts total characters.
Example:
Good morning! How are you?
Length = 26 characters
Word Count
Counts total words.
Example:
Good morning! How are you?
Length = 5 words
Sentence Count
Counts total sentences.
Example:
Good morning! How are you?
Length = 2 sentences
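All three variants above are simple to compute; a sketch (the sentence heuristic splits on terminal punctuation and is intentionally naive):

```python
import re

text = "Good morning! How are you?"

char_count = len(text)                              # counts every character, spaces included
word_count = len(text.split())                      # whitespace-delimited tokens
sentence_count = len(re.findall(r"[.!?]+", text))   # runs of terminal punctuation

print(char_count, word_count, sentence_count)  # 26 5 2
```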
When to Use a Programmatic Rule Evaluation
Programmatic rule evaluations are most useful when you know roughly what the prompt should output and want to track the presence or absence of a specific feature or failure mode. They are not useful for assessing the quality of language or the thoroughness of a response, since they cannot detect semantic features of language.


