AI Evals Are the Hottest New Skill for Product Builders
Hamel Husain & Shreya Shankar
AI engineers; eval experts; researchers
SEP 25 2025
The Thesis
Evals Are the Unit Tests of AI Products
"The difference between AI products that win and those that fail in production: evals. Not the model, not the prompt — the evaluation framework."
Evals = systematic testing of AI output quality
Most teams skip evals and learn about failures from users — that's backwards
Good evals enable fast iteration: change → eval → confidence → ship
The eval skill is cross-functional: PMs, engineers, and domain experts all need it
Framework
Building Your Eval System
20 golden examples to start
3 types of evals: unit, integration, human
10× faster model upgrades with eval coverage
Step 1: Collect 20-50 examples of good and bad outputs
Step 2: Write explicit criteria: what makes an output good or bad?
Step 3: Use LLMs to auto-evaluate against your criteria
Step 4: Human review on a sample to calibrate the LLM evaluator
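Steps 1–4 can be sketched as a small eval harness. Everything below is illustrative: `GoldenExample`, `RUBRIC`, and the pluggable `generate`/`judge` functions are hypothetical names, and in practice `judge` would wrap an LLM call (Step 3) that gets calibrated against human review (Step 4).

```python
from dataclasses import dataclass

@dataclass
class GoldenExample:
    input: str     # Step 1: a real input collected from usage
    expected: str  # what a good output looks like for this input
    label: str     # "good" or "bad", as collected

# Step 2: explicit criteria -- the rubric "good" is judged against.
RUBRIC = [
    "Output is factually consistent with the input",
    "Output is concise",
    "Output contains no unsafe content",
]

def run_evals(dataset, generate, judge):
    """Steps 3-4: auto-evaluate every golden example; return the pass rate.

    `generate` is the AI system under test; `judge` scores an output
    against the rubric (an LLM call in practice, but pluggable here so a
    human-calibrated judge can be swapped in).
    """
    passed = sum(judge(ex.input, generate(ex.input), RUBRIC) for ex in dataset)
    return passed / len(dataset)
```

A trivial judge (e.g. exact match against `expected`) is enough to get the loop running before wiring in an LLM judge.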
Hamel's rule: Your eval dataset IS your product specification. If you can't write it, you don't know what you're building.
Eval Types
The Three Levels of AI Evals
Unit evals: Does the AI output match the expected output for a specific input?
Integration evals: Does the full pipeline (RAG + LLM + post-processing) produce good results?
Human evals: Do real users find the output useful, accurate, and safe?
Automation: Use LLM-as-judge for unit evals; human panels for integration and human evals
LLM-as-judge
Use GPT-4 or Claude to evaluate outputs at scale. It's 80% as good as human eval at 1% of the cost.
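A minimal LLM-as-judge setup is a prompt template plus a verdict parser. The template structure and helper names below are assumptions, not a prescribed format; real judge prompts often also ask for a rationale before the verdict.

```python
# Hypothetical judge prompt -- the structure is an assumption, not a standard.
JUDGE_PROMPT = """You are an evaluator. Check the answer against each criterion.

Criteria:
{criteria}

Input: {input}
Answer: {answer}

Reply with exactly one word: PASS or FAIL."""

def build_judge_prompt(criteria, input_text, answer):
    """Render the rubric as bullets and fill in the template."""
    bullets = "\n".join(f"- {c}" for c in criteria)
    return JUDGE_PROMPT.format(criteria=bullets, input=input_text, answer=answer)

def parse_verdict(model_reply: str) -> bool:
    """Tolerate whitespace and casing; anything that isn't PASS counts as FAIL."""
    return model_reply.strip().upper().startswith("PASS")
```

The strict "one word" reply keeps parsing cheap and deterministic, which matters when the judge runs on every example in the golden set.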
The calibration step
Always validate your automated eval against 50 human judgments. Miscalibrated evals produce false confidence.
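The calibration step reduces to an agreement rate between judge verdicts and human labels on the same sample of ~50 outputs. The 0.8 threshold below is an illustrative assumption, not a number from the talk.

```python
def agreement_rate(judge_verdicts, human_verdicts):
    """Fraction of sampled outputs where the LLM judge matches the human label."""
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(human_verdicts)

def is_calibrated(judge_verdicts, human_verdicts, threshold=0.8):
    """Assumed rule of thumb: below the threshold, revise the judge prompt
    and re-check before trusting automated pass rates."""
    return agreement_rate(judge_verdicts, human_verdicts) >= threshold
```

An uncalibrated judge that passes everything still reports a high pass rate, which is exactly the false confidence the callout warns about.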
Playbook
Build Your Eval Framework
Write your first eval dataset before you finalize your first AI prompt
Define "good" explicitly: a rubric beats a gut feeling every time
Automate the high-volume evals; keep humans for the edge cases
Run evals before every model upgrade — it's your regression test suite
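The regression-suite framing can be made concrete as a simple gate: run the same eval suite against the current and candidate models, and ship only if the pass rate does not drop. Function and parameter names here are hypothetical.

```python
def gate_model_upgrade(eval_fn, baseline_model, candidate_model, dataset, margin=0.0):
    """Run the same eval suite on both models; allow the upgrade only if the
    candidate's pass rate does not regress below baseline minus `margin`."""
    base = eval_fn(baseline_model, dataset)
    cand = eval_fn(candidate_model, dataset)
    return cand >= base - margin, base, cand
```

A small nonzero `margin` tolerates eval noise; a zero margin makes the gate strict.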
The eval ROI: Teams with eval frameworks ship AI model upgrades 5× faster because they have confidence instead of anxiety.
Contrarian
Eval Myths
✗ "We'll know if it's bad from users" → ✓ You'll know after the damage is done. Evals catch failures before users see them.
✗ "Evals are too expensive to build" → ✓ A basic eval framework takes 2 days. The cost of skipping it is measured in user trust.
✗ "The model is the most important variable" → ✓ Your eval framework matters more than your model choice. You can't improve what you don't measure.
✗ "Human evals are the gold standard" → ✓ Human evals define the gold standard; automated evals scale it. Both are required.