Based on Lenny's Podcast data
Lenny's Knowledge Sketch

The 100-Person Lab That Became
AI's Secret Weapon

Edwin Chen
Founder & CEO, Surge AI; teaches AI right from wrong
DEC 7 2025
The Thesis

Human Feedback IS
the AI Moat

HUMAN JUDGMENT → RLHF → ALIGNED AI
"Anthropic and Google's models are as good as they are because of the quality of human feedback they were trained on. That quality is Surge."
  • RLHF quality determines model quality more than architecture choices
  • The labelers are not a commodity — domain expertise + good judgment = rare combination
  • Surge built the expert network that taught AI what's good and bad
  • The invisible layer: most people don't know this work exists, but it determines what AI does
Framework

How RLHF Actually Works

RAW MODEL → HUMAN FEEDBACK → ALIGNED MODEL
  • $500M+: RLHF market in 2025
  • 100K+: expert labelers globally
  • 10×: output-quality gap between good and bad RLHF
  • Not just "is this good?" — expert raters understand nuance, context, and edge cases
  • Domain expertise matters: medical RLHF requires doctors, legal requires lawyers
  • Scale + quality: Surge's bet is expert-quality feedback at production scale
  • The eval loop: better feedback → better model → harder evals → need better feedback
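The pipeline above can be sketched in miniature. The snippet below is an illustrative toy, not Surge's or any lab's actual method: it fits a linear reward model to hypothetical pairwise preferences using the Bradley-Terry objective that underlies RLHF reward modeling. The feature names and data are invented for illustration.

```python
import math

# Hypothetical pairwise preference data: each item pairs the feature
# vector of a chosen response with that of a rejected one. Features
# here are made-up proxies (e.g., accuracy, clarity) that stand in
# for expert labeler judgments.
preferences = [
    ([0.9, 0.8], [0.2, 0.4]),
    ([0.7, 0.9], [0.3, 0.1]),
    ([0.8, 0.6], [0.5, 0.5]),
    ([0.6, 0.7], [0.1, 0.2]),
]

def reward(w, x):
    """Linear reward model: higher means 'better' per human judgment."""
    return sum(wi * xi for wi, xi in zip(w, x))

def train(preferences, lr=0.5, steps=200):
    """Fit weights with the Bradley-Terry pairwise objective:
    maximize sigmoid(reward(chosen) - reward(rejected))."""
    w = [0.0, 0.0]
    for _ in range(steps):
        for chosen, rejected in preferences:
            margin = reward(w, chosen) - reward(w, rejected)
            p = 1.0 / (1.0 + math.exp(-margin))  # P(chosen preferred)
            grad_scale = 1.0 - p                 # gradient of the log-likelihood
            for i in range(len(w)):
                w[i] += lr * grad_scale * (chosen[i] - rejected[i])
    return w

w = train(preferences)
# After training, the reward model ranks every chosen response
# above its rejected counterpart.
assert all(reward(w, c) > reward(w, r) for c, r in preferences)
```

This is where labeler quality enters: the preference pairs are the only signal the reward model ever sees, so noisy or inexpert judgments propagate directly into the aligned model.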
The hidden variable: The most important people in AI development are often not the researchers or the engineers — they're the expert labelers.
The RLHF Industry

What Most People Don't Know

  • Volume: Frontier model training requires millions of human preference judgments
  • Quality: Bad RLHF produces confidently wrong, helpful-sounding AI
  • Specialization: Code RLHF, reasoning RLHF, safety RLHF require different experts
  • The problem: Most AI companies underinvest in RLHF quality — and it shows
Why Claude feels different

Claude's helpfulness and safety balance comes from deliberate, expert human feedback — not just prompting.

The eval arms race

As models get better, the humans rating them need to be smarter. The bar keeps rising.

Playbook

Think About AI Quality

  • Ask: where does the AI's "judgment" come from? Trace it to the training data.
  • Build eval datasets with real domain experts, not just crowdsourced workers
  • The best product insight: what does the AI confidently get wrong in your domain?
  • Invest in evals before you invest in prompts — you can't improve what you can't measure
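The "evals before prompts" point can be made concrete with a minimal harness. Everything below is a hypothetical sketch (the cases, checkers, and `naive_model` are invented): the idea is simply to fix a set of expert-graded cases so every model or prompt change is measured against the same bar.

```python
# A minimal domain-eval sketch. Each case pairs a prompt with an
# expert-approved checker; the checkers here are deliberately simple
# string tests for illustration.
eval_cases = [
    ("Is 0.1 + 0.2 == 0.3 in IEEE-754 floats?", lambda a: "no" in a.lower()),
    ("What does HTTP status 404 mean?", lambda a: "not found" in a.lower()),
]

def run_eval(model_fn, cases):
    """Score a model function (prompt -> answer) against expert checks."""
    passed = sum(1 for prompt, check in cases if check(model_fn(prompt)))
    return passed / len(cases)

# A stand-in 'model' that confidently gets the float question wrong --
# exactly the kind of failure an eval set surfaces and a prompt tweak hides.
def naive_model(prompt):
    return "Yes." if "0.1" in prompt else "404 means Not Found."

score = run_eval(naive_model, eval_cases)
print(f"pass rate: {score:.0%}")  # prints "pass rate: 50%"
```

A fixed harness like this turns "the AI confidently gets X wrong" from an anecdote into a number you can track across model and prompt changes.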
Edwin's mission: Surge exists because good AI requires good humans. The quality of human feedback is the quality of the AI.
Contrarian

AI Quality Myths

  • Myth: Bigger models need less human feedback. Instead: Bigger models need better human feedback. The bar rises with model capability.
  • Myth: Crowdsourced labeling is fine. Instead: Crowdsourced labeling is fine for simple tasks. Expert labeling is required for nuanced judgment.
  • Myth: RLHF is just annotation. Instead: RLHF is judgment at scale. The best labelers are the world's most underpaid AI researchers.
  • Myth: Models will soon self-improve without humans. Instead: Models self-improve with AI feedback loops. But the root signal is still human values.