Based on Lenny's Podcast data
Lenny's Knowledge Sketch

AI Evals Are the Hottest
New Skill for Product Builders

Hamel Husain & Shreya Shankar
AI engineers; eval experts; researchers
SEP 25 2025
The Thesis

Evals Are the Unit Tests
of AI Products

[2×2 quadrant: no evals vs. good evals × ship slow vs. ship fast]
"The difference between AI products that win and those that fail in production: evals. Not the model, not the prompt — the evaluation framework."
  • Evals = systematic testing of AI output quality
  • Most teams skip evals and learn about failures from users — that's backwards
  • Good evals enable fast iteration: change → eval → confidence → ship
  • The eval skill is cross-functional: PMs, engineers, and domain experts all need it
Framework

Building Your Eval System

Examples · Criteria · Automation
  • 20 golden examples to start
  • 3 types of evals: unit, integration, human
  • 10× faster model upgrades with eval coverage
  • Step 1: Collect 20-50 examples of good and bad outputs
  • Step 2: Write explicit criteria: what makes an output good or bad?
  • Step 3: Use LLMs to auto-evaluate against your criteria
  • Step 4: Human review on a sample to calibrate the LLM evaluator
Hamel's rule: Your eval dataset IS your product specification. If you can't write it, you don't know what you're building.
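Steps 1 through 3 translate directly into a small script. Below is a minimal sketch assuming the OpenAI Python SDK as the judge backend; the dataset, rubric, model name, and JSON output format are illustrative choices, not details from the episode. Step 4, calibration, is sketched under "The calibration step" further down.

    # Minimal sketch of Steps 1-3: golden examples, an explicit rubric, and an
    # LLM judge that applies the rubric at scale.
    import json
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Step 1: a tiny golden dataset of known-good and known-bad outputs.
    golden_examples = [
        {"input": "Summarize our refund policy for a customer.",
         "output": "Refunds are available within 30 days with proof of purchase.",
         "label": "good"},
        {"input": "Summarize our refund policy for a customer.",
         "output": "Refunds are always available, no questions asked.",
         "label": "bad"},  # invents a policy that doesn't exist
    ]

    # Step 2: explicit criteria, written down so both the judge and the team can apply them.
    RUBRIC = """Grade the response on these criteria:
    1. Consistent with the stated 30-day, proof-of-purchase refund policy.
    2. Directly answers the customer's question.
    3. Invents no guarantees or policies.
    Reply with JSON: {"verdict": "pass" or "fail", "reason": "..."}"""

    # Step 3: use an LLM to auto-evaluate each example against the criteria.
    def judge(example: dict) -> dict:
        response = client.chat.completions.create(
            model="gpt-4o",  # illustrative model choice
            response_format={"type": "json_object"},
            messages=[
                {"role": "system", "content": RUBRIC},
                {"role": "user",
                 "content": f"Input: {example['input']}\nResponse: {example['output']}"},
            ],
        )
        return json.loads(response.choices[0].message.content)

    if __name__ == "__main__":
        verdicts = [judge(ex) for ex in golden_examples]
        passed = sum(v["verdict"] == "pass" for v in verdicts)
        print(f"{passed}/{len(golden_examples)} golden examples passed")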
Eval Types

The Three Levels of AI Evals

  • Unit evals: Does the AI output match the expected output for a specific input?
  • Integration evals: Does the full pipeline (RAG + LLM + post-processing) produce good results?
  • Human evals: Do real users find the output useful, accurate, and safe?
  • Automation: Use LLM-as-judge for unit evals; human panels for integration and human evals
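One way to see the difference between the first two levels is as two kinds of tests over the same product: unit evals pin a single step to an expected output, while integration evals judge the end-to-end result. A rough sketch, where the pipeline functions are hypothetical stand-ins for your own code:

    # Pytest-style sketch of unit vs. integration evals. extract_invoice_date and
    # answer_question are hypothetical placeholders for your own pipeline code.

    def extract_invoice_date(text: str) -> str:
        """Hypothetical single LLM step: pull one field out of raw text."""
        raise NotImplementedError("wire up your extraction prompt here")

    def answer_question(question: str) -> str:
        """Hypothetical full pipeline: retrieval + LLM + post-processing."""
        raise NotImplementedError("wire up your RAG pipeline here")

    # Unit eval: one step, one input, one expected output.
    def test_unit_extracts_invoice_date():
        assert extract_invoice_date("Invoice dated 2024-03-01, total $90") == "2024-03-01"

    # Integration eval: the whole pipeline, checked against looser end-to-end
    # criteria (a keyword check here; a rubric or LLM judge in practice).
    def test_integration_answer_states_refund_window():
        answer = answer_question("How long do customers have to request a refund?")
        assert "30 days" in answer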
LLM-as-judge

Use GPT-4 or Claude to evaluate outputs at scale. It's 80% as good as human eval at 1% of the cost.

The calibration step

Always validate your automated eval against 50 human judgments. Miscalibrated evals produce false confidence.
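Calibration itself is just an agreement check. A minimal sketch, with made-up labels standing in for your judge's verdicts and the human review of the same outputs:

    # Compare the automated judge against human judgments on the same outputs.
    # The two lists below are illustrative; in practice the human labels come from
    # a review pass over ~50 sampled eval results.
    human_labels = ["pass", "fail", "pass", "pass", "fail"]
    judge_labels = ["pass", "fail", "fail", "pass", "fail"]

    pairs = list(zip(human_labels, judge_labels))
    agreement = sum(h == j for h, j in pairs) / len(pairs)
    print(f"Judge/human agreement: {agreement:.0%}")

    # The dangerous disagreement is the judge passing outputs humans failed:
    # that is exactly the false confidence a miscalibrated eval produces.
    false_passes = sum(h == "fail" and j == "pass" for h, j in pairs)
    print(f"Passed by the judge, failed by humans: {false_passes}")

    if agreement < 0.9:  # illustrative threshold, not from the episode
        print("Miscalibrated: tighten the rubric or add examples to the judge prompt.")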

Playbook

Build Your Eval Framework

  • Write your first eval dataset before you finalize your first AI prompt
  • Define "good" explicitly: a rubric beats a gut feeling every time
  • Automate the high-volume evals; keep humans for the edge cases
  • Run evals before every model upgrade — it's your regression test suite
The eval ROI: Teams with eval frameworks ship AI model upgrades 5× faster because they have confidence instead of anxiety.
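The "regression test suite" framing in the playbook above can be made literal: run the same eval set against the current model and the candidate, and gate the upgrade on the comparison. A sketch, where run_eval_suite is a hypothetical wrapper around the golden-example loop shown earlier and the model names are illustrative:

    # Gate a model upgrade on the eval suite instead of on vibes.
    def run_eval_suite(model_name: str) -> float:
        """Hypothetical: run every golden example through `model_name` plus the
        LLM judge, and return the pass rate in [0, 1]."""
        raise NotImplementedError("wire this up to your own eval loop")

    CURRENT_MODEL = "gpt-4o"      # illustrative model names
    CANDIDATE_MODEL = "gpt-4.1"

    baseline = run_eval_suite(CURRENT_MODEL)
    candidate = run_eval_suite(CANDIDATE_MODEL)
    print(f"{CURRENT_MODEL}: {baseline:.0%} pass | {CANDIDATE_MODEL}: {candidate:.0%} pass")

    # Ship the upgrade only if it holds or beats the baseline; otherwise the eval
    # set just caught a regression before any user saw it.
    if candidate >= baseline:
        print("Safe to upgrade.")
    else:
        print("Regression detected: hold the upgrade and inspect the failing examples.")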
Contrarian

Eval Myths

  • Myth: "We'll know if it's bad from users." → Instead: You'll know after the damage is done. Evals catch failures before users see them.
  • Myth: "Evals are too expensive to build." → Instead: A basic eval framework takes 2 days. The cost of skipping it is measured in user trust.
  • Myth: "The model is the most important variable." → Instead: Your eval framework is more important than your model choice. You can't improve what you don't measure.
  • Myth: "Human evals are the gold standard." → Instead: Human evals define the gold standard. Automated evals scale that standard. Both are required.