Choosing an AI model for an AI evaluation.

César Miguelañez

Perhaps you feel a connection to one LLM or another; it's only natural. You've been using GPT-4.5 for months and the responses it produces just make sense to you. Maybe you're a Claude user and you adore its elegant thoughts. That's nice, poetic even, but you should leave such sap for your helpful chatbot and take a saner approach when choosing models for LLM products.
The truth of the matter is that models each do different things better or worse than their peers. These differences are what benchmarks seek to delineate but, as any in-the-muck user of AI knows, benchmark scores don't mean jack shit in practice. No, what matters in real life is the feel of the model, that illusory je-ne-sais-quoi you sense when you see the difference in approach to a problem between Claude Opus and GPT-4.5. They might have the same benchmark scores in coding, but you know you're going to choose Opus every single time for a big build. These are the types of intuitions I'm going to try to give you for something a little different today: LLM evals.
LLM evals are built the same way as a prompt: they have an input, instructions, a model, a temperature, and an output. The idea is to have a second critical eye looking at your product's behavior and reporting whether it is successfully carrying out its task.
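The moving parts above can be sketched in a few lines. This is an illustrative skeleton, not a real library: `EvalSpec`, `run_eval`, and the `call_model` callable are all hypothetical names standing in for whatever client you actually use.

```python
from dataclasses import dataclass

@dataclass
class EvalSpec:
    instructions: str   # the rubric / judging prompt
    model: str          # the evaluator model id (assumption: your provider's naming)
    temperature: float  # keep this low for consistency

def run_eval(spec: EvalSpec, product_output: str, call_model) -> str:
    """Feed the product's output plus the rubric to the evaluator model.

    `call_model` is any callable wrapping your LLM provider's API;
    its signature here is a placeholder.
    """
    prompt = f"{spec.instructions}\n\nOutput to judge:\n{product_output}"
    return call_model(model=spec.model, prompt=prompt,
                      temperature=spec.temperature)
```

The point of the sketch is the shape: the evaluator is just another prompt with its own model and temperature, chosen separately from the product's.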
The first thing to get out of your head is that your evaluator needs to be the best model on the market. It doesn't. What it needs is a very specific set of qualities that are different from what you want in your product model.
Consistency above all else. An evaluator that gives you a 7 one day and a 4 the next for the same output is worse than useless. You want a model that is almost boring in how predictably it applies a rubric. This means you should be running the same eval prompt a few times on the same sample and checking variance before you commit to a model. High temperature (above 0.7), creative, expressive models are often terrible evaluators for exactly this reason.
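That variance check is easy to automate. A minimal sketch, assuming your judge is wrapped in a callable that returns a numeric score (the `judge` interface here is hypothetical):

```python
import statistics

def score_spread(judge, sample: str, runs: int = 5):
    """Run the same eval prompt several times on one sample and
    report the spread of scores. A non-trivial standard deviation
    on identical input is a red flag for an evaluator model."""
    scores = [judge(sample) for _ in range(runs)]
    return statistics.pstdev(scores), scores
```

If the standard deviation is meaningfully above zero on the same sample, either lower the temperature or drop the model as an evaluator.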
Instruction-following over raw intelligence. Your eval prompt is going to have a rubric, criteria, and maybe a scoring scale. The model needs to follow that structure to the letter: return JSON when you ask for JSON, score on a 1-5 when you say 1-5, not wander off and write you an essay about why the output was philosophically interesting. In practice this means smaller, fine-tuned instruction-following models often outperform frontier models in this role. GPT-mini models and Haiku have won many an eval pipeline over their bigger siblings for exactly this reason.
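You can enforce this mechanically: treat any reply that is not exactly the structure you asked for as a failed eval run. A minimal sketch, assuming the rubric asks for a JSON object like `{"score": n}` on a 1-5 scale:

```python
import json

def parse_judge_output(raw: str) -> int:
    """Strictly parse the evaluator's reply. Anything that is not a
    JSON object with an integer score in 1..5 raises -- essays about
    why the output was philosophically interesting do not count."""
    data = json.loads(raw)  # raises ValueError on non-JSON replies
    score = data["score"]
    if not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"score out of range: {score!r}")
    return score
```

Models that rarely trip this parser are the ones you want judging your pipeline, regardless of where they sit on a leaderboard.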
Cost matters more than you think. You are not running one eval. You are running thousands, maybe millions. A model that costs 10x more and performs 5% better on your rubric is a bad trade. Do the math early so you don't accidentally break the bank.
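Doing the math early really is a one-liner. The prices and token counts below are made-up illustrative numbers, not any provider's actual rates:

```python
def eval_cost(runs: int, tokens_per_run: int, price_per_million_tokens: float) -> float:
    """Back-of-the-envelope eval bill in dollars."""
    return runs * tokens_per_run * price_per_million_tokens / 1_000_000

# 1M evals a month at ~2k tokens each, at two hypothetical price points:
cheap = eval_cost(1_000_000, 2_000, 0.25)   # 500.0  -> $500/month
pricey = eval_cost(1_000_000, 2_000, 2.50)  # 5000.0 -> $5,000/month
```

A 5% bump on your rubric rarely justifies the second bill.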
Context window and output length. If your product outputs are long (multi-turn conversations, long documents), your evaluator needs to hold all of that in context comfortably and still have room to reason about it. This is one area where a frontier model sometimes earns its keep. The larger and more complex the agent, the more strategy you need when picking an evaluator model.
Can it handle your rubric's complexity? A simple binary pass/fail on tone? Almost any model will do. A nuanced multi-axis rubric judging factual accuracy, helpfulness, and safety simultaneously? Now you need something with real reasoning chops. Be honest about how hard your rubric actually is before you over- or under-engineer your choice. Start small and increase as necessary.
Not everything needs an LLM looking at it. In fact, one of the more expensive mistakes teams make is reaching for a model when something much dumber would do the job better, faster, and cheaper.
When a rule can answer the question. Did the model return valid JSON? Does the output contain a phone number? Is the response under 200 words? These are not LLM eval questions; code will do this much better. A regex that runs in microseconds, for free, will always beat a model call on questions that have a deterministic answer.
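For concreteness, all three example checks fit in a handful of lines of plain Python. The phone pattern is a deliberately simple US-style sketch, not production-grade validation:

```python
import json
import re

def valid_json(s: str) -> bool:
    """Did the model return valid JSON? No LLM required."""
    try:
        json.loads(s)
        return True
    except ValueError:
        return False

# Simplified US-style phone pattern, for illustration only.
PHONE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")

def contains_phone_number(s: str) -> bool:
    return bool(PHONE.search(s))

def under_word_limit(s: str, limit: int = 200) -> bool:
    return len(s.split()) <= limit
```

Each of these runs in microseconds, costs nothing, and never changes its mind between runs.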
TL;DR: if you can write a test for it, write a test. Go for an LLM eval when the judgment required is genuinely too fuzzy, contextual, or multidimensional for anything simpler to handle.


