Why AI Evals Are the Hottest New Skill for Product Builders
Hamel Husain & Shreya Shankar
Creators of the #1 Evals Course on Maven · 2,000+ PMs & engineers trained · 500 companies
SEP 25 2025
The Hook
Evals = Data Analytics for Your AI Product
"Evals is a way to systematically measure and improve an AI application. At its core, it's data analytics on your LLM application." — Hamel Husain
Not a single tool — a broad spectrum of measurement approaches
The highest ROI activity when building an AI product
Takes you from chaos (vibe checks) to a confident feedback signal you can iterate on
Framework
The 4-Step Eval Flywheel
100 traces to review first · 4–7 LLM judges per product · 30 min per week after setup
Open Coding: Manually read ~100 traces and write a short freeform note on the first thing you see wrong in each. Human-only — an LLM can't catch what it doesn't know is wrong.
Axial Coding: Use an LLM to cluster your notes into failure-mode categories (e.g. "human handoff issue," "conversational flow error"). Refine until the categories are actionable (see the sketch after these steps).
Count & Prioritize: Pivot-table the categories. Most frequent isn't always most important — a rare but business-critical failure may rank higher.
Build an LLM Judge: Write a binary pass/fail prompt for your top failure mode. Validate it against your own manual labels before shipping it to production.
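A minimal sketch of steps 2 and 3, assuming the OpenAI Python client; the file name open_codes.txt, the model name, and the prompt wording are illustrative placeholders rather than the authors' exact setup.

```python
# Sketch: cluster open-coding notes into failure categories (axial coding),
# then count how often each category appears (count & prioritize).
# Assumes the OpenAI Python client; file names, model, and prompts are illustrative.
from collections import Counter
from openai import OpenAI

client = OpenAI()

# One freeform note per line, written during open coding.
with open("open_codes.txt") as f:
    notes = [line.strip() for line in f if line.strip()]

# Step 2: ask an LLM to propose failure-mode categories from the raw notes.
clustering_prompt = (
    "Here are freeform notes from reviewing AI-assistant traces, one per line.\n"
    "Cluster them into 5-8 actionable failure-mode categories "
    "(e.g. 'human handoff issue', 'conversational flow error').\n"
    "Return one category name per line.\n\n" + "\n".join(notes)
)
resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": clustering_prompt}],
)
categories = [c.strip() for c in resp.choices[0].message.content.splitlines() if c.strip()]
print("Proposed categories:", categories)

# Step 3: label each note with one category (plus "none of the above"),
# then pivot-table the counts to prioritize.
labels = []
for note in notes:
    label_prompt = (
        f"Note: {note}\n"
        f"Which one of these categories fits best? {categories + ['none of the above']}\n"
        "Answer with the category name only."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": label_prompt}],
    )
    labels.append(resp.choices[0].message.content.strip())

for category, count in Counter(labels).most_common():
    print(f"{count:4d}  {category}")
```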
The benevolent dictator rule
Appoint one domain expert to own open coding. Committees kill speed. The PM is usually the right person — they have product context that engineers lack.
Playbook
Building & Validating Your LLM-as-Judge
One failure mode per judge. Never ask an LLM to evaluate everything at once. Scope it to one narrow, specific failure per prompt.
Always binary. Pass or fail — never a 1–5 scale. Nobody knows what 3.2 vs 3.7 means. A binary decision forces clarity and actionability.
Validate against human labels. Build a confusion matrix: human said pass/fail vs. judge said pass/fail. Iterate until the off-diagonal errors approach zero (see the sketch after these tips).
Don't just report agreement rate. 90% agreement looks great — but if errors only occur 10% of the time, a judge that always says "pass" hits 90% agreement and is useless. Use the confusion matrix.
Run it online, not just offline. Sample 1,000 production traces every day. Your LLM judge becomes a live quality dashboard — not just a CI gate.
Code-based evals first. If you can check a failure with Python (JSON valid? String length under N? Contains keyword X?), do that. Cheaper and faster than an LLM judge.
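To make these tips concrete, here is a minimal sketch combining a code-based check, a binary judge scoped to a single failure mode, and confusion-matrix validation against human labels. It assumes the OpenAI Python client; the "human handoff" failure mode, the labeled_traces.jsonl format, and the prompt wording are hypothetical.

```python
# Sketch: one cheap code-based check, one binary LLM judge scoped to a single
# failure mode, and a confusion matrix against human labels. The failure mode,
# trace format, and model name are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

def output_is_valid_json(output: str) -> bool:
    """Code-based eval: prefer a plain Python check whenever one exists."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

JUDGE_PROMPT = """You are reviewing a transcript from a customer-facing AI assistant.
Failure mode to check: the user asked for a human, and the assistant failed to
hand the conversation off.

Transcript:
{transcript}

Did this failure occur? Answer with exactly one word: FAIL if it occurred, PASS if not."""

def judge_handoff(transcript: str) -> str:
    """Binary LLM judge scoped to one narrow failure mode."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(transcript=transcript)}],
    )
    return "fail" if "FAIL" in resp.choices[0].message.content.upper() else "pass"

# Validate the judge against your own manual labels before trusting it.
# labeled_traces.jsonl: one JSON object per line, e.g.
# {"transcript": "...", "human_label": "pass"}
with open("labeled_traces.jsonl") as f:
    labeled = [json.loads(line) for line in f]

confusion = {("pass", "pass"): 0, ("pass", "fail"): 0,
             ("fail", "pass"): 0, ("fail", "fail"): 0}
for row in labeled:
    confusion[(row["human_label"], judge_handoff(row["transcript"]))] += 1

print("             judge=pass  judge=fail")
print(f"human=pass   {confusion[('pass', 'pass')]:10d}  {confusion[('pass', 'fail')]:10d}")
print(f"human=fail   {confusion[('fail', 'pass')]:10d}  {confusion[('fail', 'fail')]:10d}")
# Iterate on JUDGE_PROMPT until the off-diagonal cells approach zero; a high
# raw agreement rate alone can hide a judge that always answers "pass".
```

The confusion matrix, not the agreement rate, is the thing to iterate on: the two off-diagonal cells are exactly the cases where the judge and your domain expert disagree.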
Theoretical saturation
Stop reviewing traces when you stop finding new failure types. For most products, this happens around 40–100 traces. You develop intuition for this quickly — just start.
The "none of the above" trick
When using an LLM to categorize open codes, always include "none of the above" as a valid label. Any traces landing there mean your axial code taxonomy is incomplete — go back and refine.
Evals are living PRDs
Your LLM judge prompt is a machine-readable product requirements document. It precisely describes how your product should behave — and it runs automatically, forever, on real traffic.
Tactics
Your First Eval This Week
Pick one AI feature in your product. Pull 50–100 traces from your logs today.
Open your observability tool (Braintrust, Arize Phoenix, or LangSmith). Read each trace and write one freeform note on the first thing wrong. Stop at the first issue per trace, then move on.
Drop the notes into Claude or ChatGPT: "Cluster these open codes into actionable axial codes — failure categories." Refine until they're specific enough to act on.
Count failures per category in a pivot table. Pick the top one that isn't a simple prompt fix.
Write a binary pass/fail LLM judge prompt for that failure. Validate against your manual labels. Hook it into CI and a weekly monitoring cron job.
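A sketch of that monitoring step, reusing the judge from the Playbook sketch above; the eval_judge module name and fetch_recent_traces are hypothetical stand-ins for wherever your judge lives and however your observability tool exports traces.

```python
# Sketch: run the validated judge over a sample of recent production traces
# and report a pass rate. judge_handoff is the judge sketched in the Playbook
# section (imported here from a hypothetical eval_judge module);
# fetch_recent_traces is a placeholder for your trace store's export API.
import random

from eval_judge import judge_handoff  # hypothetical module holding the judge sketch

def fetch_recent_traces(limit: int = 5000) -> list[str]:
    """Placeholder: pull recent transcripts from your logging/observability tool."""
    raise NotImplementedError("wire this up to your trace store")

traces = random.sample(fetch_recent_traces(), k=1000)  # sample; don't judge every trace
results = [judge_handoff(t) for t in traces]
pass_rate = results.count("pass") / len(results)

print(f"handoff judge pass rate: {pass_rate:.1%} over {len(results)} traces")
# Schedule this via cron (e.g. weekly) and alert when pass_rate drops below a
# threshold you choose; the same check, run on a small fixed set of traces,
# can double as a CI gate.
```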
Time investment
3–4 days upfront for initial error analysis and your first judge. Then 30 minutes per week to review new traces. One-time cost, permanent signal — and you'll immediately want to keep doing it.
Contrarian
Evals Myths Debunked
✗"Just plug in an AI eval tool — it'll do the analysis for you"INSTEAD →✓ An LLM can't catch what it doesn't know is wrong. A model reviewing a real estate agent trace will say "looks great" when it hallucinated a virtual tour — because it has no context that feature doesn't exist. Human domain expertise is irreplaceable in open coding.
✗"We dogfood and vibe check — we don't need evals"INSTEAD →✓ Dogfooding is evals in disguise. Claude Code's "no evals" stance ignores the massive eval infrastructure the base Claude model is built on. All great AI products do systematic error analysis — most just don't call it that.
✗"Score your LLM outputs on a 1–5 scale for more nuance"INSTEAD →✓ Likert-scale LLM judges are the #1 way to lose stakeholder trust in evals. A score of 3.2 vs 3.7 means nothing to anyone. Force a binary decision — "Is this a handoff failure: yes or no?" That you can actually act on.
✗"Skip error analysis — just write evals based on your PRD assumptions"INSTEAD →✓ You can't dream up your real failure modes from a spec. The most common production errors are ones nobody anticipated: garbled SMS inputs, hallucinated product features, silent tool calls that abandon users. Look at your actual data first, always.