AI Evals Are the Hottest New Skill for Product Builders
Hamel Husain & Shreya Shankar
AI engineers; eval experts; researchers
SEP 25 2025
The Thesis
Evals Are the Unit Tests of AI Products
"The difference between AI products that win and those that fail in production: evals. Not the model, not the prompt — the evaluation framework."
Evals = systematic testing of AI output quality
Most teams skip evals and learn about failures from users — that's backwards
Good evals enable fast iteration: change → eval → confidence → ship
The eval skill is cross-functional: PMs, engineers, and domain experts all need it
Framework
Building Your Eval System
20 golden examples to start
3 types of evals: unit, integration, human
10× faster model upgrades with eval coverage
Step 1: Collect 20-50 examples of good and bad outputs
Step 2: Write explicit criteria: what makes an output good or bad?
Step 3: Use LLMs to auto-evaluate against your criteria
Step 4: Human review on a sample to calibrate the LLM evaluator
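Steps 1–4 can be sketched as a small eval harness. Everything below is illustrative: `GoldenExample`, `RUBRIC`, and the pluggable `generate`/`judge` functions are hypothetical names, and in practice `judge` would wrap an LLM call (Step 3) that gets calibrated against human review (Step 4).

```python
from dataclasses import dataclass

@dataclass
class GoldenExample:
    input: str     # Step 1: a real input collected from usage
    expected: str  # what a good output looks like for this input
    label: str     # "good" or "bad", as collected

# Step 2: explicit criteria -- the rubric "good" is judged against.
RUBRIC = [
    "Output is factually consistent with the input",
    "Output is concise",
    "Output contains no unsafe content",
]

def run_evals(dataset, generate, judge):
    """Steps 3-4: auto-evaluate every golden example; return the pass rate.

    `generate` is the AI system under test; `judge` scores an output
    against the rubric (an LLM call in practice, but pluggable here so a
    human-calibrated judge can be swapped in).
    """
    passed = sum(judge(ex.input, generate(ex.input), RUBRIC) for ex in dataset)
    return passed / len(dataset)
```

A trivial judge (e.g. exact match against `expected`) is enough to get the loop running before wiring in an LLM judge.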
Hamel's rule: Your eval dataset IS your product specification. If you can't write it, you don't know what you're building.
Eval Types
The Three Levels of AI Evals
Unit evals: Does the AI output match the expected output for a specific input?
Integration evals: Does the full pipeline (RAG + LLM + post-processing) produce good results?
Human evals: Do real users find the output useful, accurate, and safe?
Automation: Use LLM-as-judge for unit evals; human panels for integration and human evals
LLM-as-judge
Use GPT-4 or Claude to evaluate outputs at scale. It's 80% as good as human eval at 1% of the cost.
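A minimal LLM-as-judge setup is a prompt template plus a verdict parser. The template structure and helper names below are assumptions, not a prescribed format; real judge prompts often also ask for a rationale before the verdict.

```python
# Hypothetical judge prompt -- the structure is an assumption, not a standard.
JUDGE_PROMPT = """You are an evaluator. Check the answer against each criterion.

Criteria:
{criteria}

Input: {input}
Answer: {answer}

Reply with exactly one word: PASS or FAIL."""

def build_judge_prompt(criteria, input_text, answer):
    """Render the rubric as bullets and fill in the template."""
    bullets = "\n".join(f"- {c}" for c in criteria)
    return JUDGE_PROMPT.format(criteria=bullets, input=input_text, answer=answer)

def parse_verdict(model_reply: str) -> bool:
    """Tolerate whitespace and casing; anything that isn't PASS counts as FAIL."""
    return model_reply.strip().upper().startswith("PASS")
```

The strict "one word" reply keeps parsing cheap and deterministic, which matters when the judge runs on every example in the golden set.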
The calibration step
Always validate your automated eval against 50 human judgments. Miscalibrated evals produce false confidence.
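The calibration step reduces to an agreement rate between judge verdicts and human labels on the same sample of ~50 outputs. The 0.8 threshold below is an illustrative assumption, not a number from the talk.

```python
def agreement_rate(judge_verdicts, human_verdicts):
    """Fraction of sampled outputs where the LLM judge matches the human label."""
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(human_verdicts)

def is_calibrated(judge_verdicts, human_verdicts, threshold=0.8):
    """Assumed rule of thumb: below the threshold, revise the judge prompt
    and re-check before trusting automated pass rates."""
    return agreement_rate(judge_verdicts, human_verdicts) >= threshold
```

An uncalibrated judge that passes everything still reports a high pass rate, which is exactly the false confidence the callout warns about.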
Playbook
Build Your Eval Framework
Write your first eval dataset before you finalize your first AI prompt
Define "good" explicitly: a rubric beats a gut feeling every time
Automate the high-volume evals; keep humans for the edge cases
Run evals before every model upgrade — it's your regression test suite
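The regression-suite framing can be made concrete as a simple gate: run the same eval suite against the current and candidate models, and ship only if the pass rate does not drop. Function and parameter names here are hypothetical.

```python
def gate_model_upgrade(eval_fn, baseline_model, candidate_model, dataset, margin=0.0):
    """Run the same eval suite on both models; allow the upgrade only if the
    candidate's pass rate does not regress below baseline minus `margin`."""
    base = eval_fn(baseline_model, dataset)
    cand = eval_fn(candidate_model, dataset)
    return cand >= base - margin, base, cand
```

A small nonzero `margin` tolerates eval noise; a zero margin makes the gate strict.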
The eval ROI: Teams with eval frameworks ship AI model upgrades 5× faster because they have confidence instead of anxiety.
Contrarian
Eval Myths
✗ "We'll know if it's bad from users" → ✓ You'll know after the damage is done. Evals catch failures before users see them.
✗ "Evals are too expensive to build" → ✓ A basic eval framework takes 2 days. The cost of skipping it is measured in user trust.
✗ "The model is the most important variable" → ✓ Your eval framework matters more than your model choice. You can't improve what you don't measure.
✗ "Human evals are the gold standard" → ✓ Human evals define the gold standard; automated evals scale it. Both are required.