Based on Lenny's Podcast data
Lenny's Knowledge Sketch

The 100-Person Lab That Became
AI's Secret Weapon

Edwin Chen
Founder & CEO, Surge AI; teaches AI right from wrong
DEC 7 2025
The Thesis

Human Feedback IS
the AI Moat

HUMAN JUDGMENT → RLHF → ALIGNED AI
"Anthropic and Google's models are as good as they are because of the quality of human feedback they were trained on. That quality is Surge."
  • RLHF quality determines model quality more than architecture choices
  • The labelers are not a commodity — domain expertise + good judgment = rare combination
  • Surge built the expert network that taught AI what's good and bad
  • The invisible layer: most people don't know this work exists, but it determines what AI does
Framework

How RLHF Actually Works

RAW MODEL → HUMAN FEEDBACK → ALIGNED MODEL
  • $500M+: RLHF market in 2025
  • 100K+: expert labelers globally
  • 10×: output-quality gap between good and bad RLHF
  • Not just "is this good?" — expert raters understand nuance, context, and edge cases
  • Domain expertise matters: medical RLHF requires doctors, legal requires lawyers
  • Scale + quality: Surge's bet is expert-quality feedback at production scale
  • The eval loop: better feedback → better model → harder evals → need better feedback
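The pipeline above can be sketched in miniature. The snippet below is an illustrative toy, not Surge's or any lab's actual method: it fits a linear reward model to hypothetical pairwise preferences using the Bradley-Terry objective that underlies RLHF reward modeling. The feature names and data are invented for illustration.

```python
import math

# Hypothetical pairwise preference data: each item pairs the feature
# vector of a chosen response with that of a rejected one. Features
# here are made-up proxies (e.g., accuracy, clarity) that stand in
# for expert labeler judgments.
preferences = [
    ([0.9, 0.8], [0.2, 0.4]),
    ([0.7, 0.9], [0.3, 0.1]),
    ([0.8, 0.6], [0.5, 0.5]),
    ([0.6, 0.7], [0.1, 0.2]),
]

def reward(w, x):
    """Linear reward model: higher means 'better' per human judgment."""
    return sum(wi * xi for wi, xi in zip(w, x))

def train(preferences, lr=0.5, steps=200):
    """Fit weights with the Bradley-Terry pairwise objective:
    maximize sigmoid(reward(chosen) - reward(rejected))."""
    w = [0.0, 0.0]
    for _ in range(steps):
        for chosen, rejected in preferences:
            margin = reward(w, chosen) - reward(w, rejected)
            p = 1.0 / (1.0 + math.exp(-margin))  # P(chosen preferred)
            grad_scale = 1.0 - p                 # gradient of the log-likelihood
            for i in range(len(w)):
                w[i] += lr * grad_scale * (chosen[i] - rejected[i])
    return w

w = train(preferences)
# After training, the reward model ranks every chosen response
# above its rejected counterpart.
assert all(reward(w, c) > reward(w, r) for c, r in preferences)
```

This is where labeler quality enters: the preference pairs are the only signal the reward model ever sees, so noisy or inexpert judgments propagate directly into the aligned model.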
The hidden variable: The most important people in AI development are often not the researchers or the engineers — they're the expert labelers.
The RLHF Industry

What Most People Don't Know

  • Volume: Frontier model training requires millions of human preference judgments
  • Quality: Bad RLHF produces confidently wrong, helpful-sounding AI
  • Specialization: Code RLHF, reasoning RLHF, safety RLHF require different experts
  • The problem: Most AI companies underinvest in RLHF quality — and it shows
Why Claude feels different

Claude's helpfulness and safety balance comes from deliberate, expert human feedback — not just prompting.

The eval arms race

As models get better, the humans rating them need to be smarter. The bar keeps rising.

Playbook

Think About AI Quality

  • Ask: where does the AI's "judgment" come from? Trace it to the training data.
  • Build eval datasets with real domain experts, not just crowdsourced workers
  • The best product insight: what does the AI confidently get wrong in your domain?
  • Invest in evals before you invest in prompts — you can't improve what you can't measure
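The "evals before prompts" point can be made concrete with a minimal harness. Everything below is a hypothetical sketch (the cases, checkers, and `naive_model` are invented): the idea is simply to fix a set of expert-graded cases so every model or prompt change is measured against the same bar.

```python
# A minimal domain-eval sketch. Each case pairs a prompt with an
# expert-approved checker; the checkers here are deliberately simple
# string tests for illustration.
eval_cases = [
    ("Is 0.1 + 0.2 == 0.3 in IEEE-754 floats?", lambda a: "no" in a.lower()),
    ("What does HTTP status 404 mean?", lambda a: "not found" in a.lower()),
]

def run_eval(model_fn, cases):
    """Score a model function (prompt -> answer) against expert checks."""
    passed = sum(1 for prompt, check in cases if check(model_fn(prompt)))
    return passed / len(cases)

# A stand-in 'model' that confidently gets the float question wrong --
# exactly the kind of failure an eval set surfaces and a prompt tweak hides.
def naive_model(prompt):
    return "Yes." if "0.1" in prompt else "404 means Not Found."

score = run_eval(naive_model, eval_cases)
print(f"pass rate: {score:.0%}")  # prints "pass rate: 50%"
```

A fixed harness like this turns "the AI confidently gets X wrong" from an anecdote into a number you can track across model and prompt changes.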
Edwin's mission: Surge exists because good AI requires good humans. The quality of human feedback is the quality of the AI.
Contrarian

AI Quality Myths

  • Myth: Bigger models need less human feedback. Instead: Bigger models need better human feedback. The bar rises with model capability.
  • Myth: Crowdsourced labeling is fine. Instead: Crowdsourced labeling is fine for simple tasks. Expert labeling is required for nuanced judgment.
  • Myth: RLHF is just annotation. Instead: RLHF is judgment at scale. The best labelers are the world's most underpaid AI researchers.
  • Myth: Models will soon self-improve without humans. Instead: Models self-improve with AI feedback loops. But the root signal is still human values.