Layer 3: Fine-tuning — only when you have thousands of examples and clear improvement
Layer 4: Training — almost never needed for product teams
Chip's rule
If you're fine-tuning before you've exhausted prompt engineering and RAG, you're solving the wrong problem.
Evals: The Critical Discipline
Measure What You Build
Evals = tests for AI systems: the unit tests of model behavior
What to evaluate: Correctness, safety, latency, cost, user satisfaction
How to build evals: Start with 20-50 examples your product gets wrong or right
The eval loop: Run evals before each model upgrade or prompt change
The eval trap
Most teams skip evals until things go wrong in production. By then, you have no baseline to fix from.
The human eval
Automated evals scale; human evals set the ground truth by defining what good looks like. You need both.
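The eval loop described above can be sketched as a small harness. This is a minimal illustration, not a framework from the book; names like `EvalCase`, `toy_model`, and `golden_set` are hypothetical, and a real model call would replace the stand-in function:

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    prompt: str
    expected: str  # ground truth, set by a human reviewer

def run_evals(model_fn, cases):
    """Run every golden case through the model and return the pass rate."""
    results = []
    for case in cases:
        output = model_fn(case.prompt)
        # Simple correctness check; real evals would also track
        # safety, latency, and cost per case.
        results.append(case.expected.lower() in output.lower())
    return sum(results) / len(results)

# Stand-in for a real model API call, so the sketch is runnable.
def toy_model(prompt):
    return "Paris is the capital of France."

golden_set = [
    EvalCase("What is the capital of France?", "Paris"),
    EvalCase("What is the capital of Germany?", "Berlin"),  # an edge case this toy model fails
]

pass_rate = run_evals(toy_model, golden_set)
print(f"pass rate: {pass_rate:.0%}")  # prints: pass rate: 50%
```

Run this before each model upgrade or prompt change; a drop in `pass_rate` against your recorded baseline is a regression caught before production.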
Playbook
Ship Better AI Products
Build your eval framework before your 3rd AI feature
Start with 20 golden examples: 10 that should work, 10 edge cases
Use LLMs to help write evals — it's the AI engineering meta-loop
Publish your eval results internally — it creates accountability and surfaces regressions
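"Use LLMs to help write evals" often takes the form of an LLM-as-judge: one model grades another's output against a rubric. A hedged sketch of the pattern, where `call_judge` is a stand-in for a real LLM API call (not any specific provider's SDK):

```python
def call_judge(prompt):
    # Stand-in for a real LLM call. This fake grader just checks
    # whether the prompt mentions refunds, so the sketch is runnable.
    return "PASS" if "refund" in prompt.lower() else "FAIL"

def llm_grade(question, answer, rubric):
    """Ask a judge model to grade an answer against a human-written rubric."""
    prompt = (
        f"Rubric: {rubric}\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Reply with exactly PASS or FAIL."
    )
    return call_judge(prompt) == "PASS"

ok = llm_grade(
    "How do I get a refund?",
    "Email support within 30 days.",
    "Answer must give a concrete next step.",
)
```

The meta-loop: humans write the rubric (the ground truth), the LLM applies it at scale, and human spot-checks keep the judge honest.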
The book insight
Chip's "AI Engineering" book is the first to bridge the gap between ML research and product-level AI engineering. Required reading for AI PMs.
Contrarian
AI Engineering Myths
✗ Better model = better product
✓ Instead: Better evals = better product. Model quality matters; eval quality determines whether you can tell the difference.
✗ Prompt engineering is not real engineering
✓ Instead: Prompt engineering is the highest-leverage skill in AI product development. Dismiss it and get outperformed.
✗ Fine-tune on your data early
✓ Instead: Fine-tune after you've fully exploited prompt + RAG. Early fine-tuning is premature optimization.
✗ AI products don't need tests
✓ Instead: AI products need MORE tests than traditional software, because the failure modes are probabilistic.