Based on Lenny's Podcast data
Lenny's Knowledge Sketch · Experimentation

The Ultimate Guide to
A/B Testing

Ronny Kohavi
Former VP, Airbnb · Microsoft · Amazon
Author, Trustworthy Online Controlled Experiments
JUL 27 2023
Core Truth

Most Ideas Fail — and That's the Point

Win rates: ~34% Microsoft overall · ~15% Bing · ~8% Airbnb search
"92% of our Airbnb search experiments failed. That’s not a bad team — it’s the humbling reality. Every team that starts running experiments believes they’ll be different. Every team is humbled."
  • Microsoft overall: ~66% of ideas fail
  • Bing (optimised domain): ~85% failure rate
  • Airbnb search: 92% failure rate
  • Booking, Google Ads: 80–90% failure rate
Framework

The Experimentation Operating System

[Diagram: component metrics (revenue ↑, user time, retention ↑) feed the OEC, a proxy for lifetime value, which drives the ship / no-ship decision]
  • $100M: Bing revenue from moving one ad title line
  • 200K: users needed before A/B testing reliably works
  • 25K: experiments per year at Microsoft by 2019
  • OEC (Overall Evaluation Criterion): One composite metric causally tied to lifetime value — not revenue or engagement alone
  • Constraint optimisation: "Increase revenue within a fixed vertical pixel budget for ads" — frames the problem correctly
  • Guardrail metrics: Counter-metrics that catch short-term gains destroying long-term value (churn, session success rate)
  • Platform goal: Marginal cost of running one more experiment should approach zero
  • Institutional memory: Quarterly review of most surprising experiments — winners AND surprising losers
Ronny's key insight: Optimise for revenue alone and you'll make more money short-term while destroying the product long-term. The OEC forces you to think in lifetime value, not quarterly metrics.
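The OEC-plus-guardrails decision can be sketched in a few lines of Python. Everything here (metric names, weights, thresholds) is illustrative, not Ronny's actual formula; a real OEC's weights have to be validated against long-term lifetime value:

```python
def oec_score(deltas: dict[str, float], weights: dict[str, float]) -> float:
    """Composite OEC: weighted sum of relative metric deltas."""
    return sum(weights[m] * deltas[m] for m in weights)

def ship_decision(deltas: dict[str, float], weights: dict[str, float],
                  guardrail_limits: dict[str, float]) -> bool:
    """Ship only if the OEC improves AND no guardrail breaches its limit."""
    for metric, worst_allowed in guardrail_limits.items():
        if deltas.get(metric, 0.0) < worst_allowed:
            return False          # short-term gain is destroying long-term value
    return oec_score(deltas, weights) > 0

# Hypothetical metric names, weights, and thresholds:
weights = {"sessions_per_user": 0.5, "session_success_rate": 0.3,
           "revenue_per_user": 0.2}
guardrails = {"retention_delta": -0.002}   # tolerate at most a 0.2% retention drop
deltas = {"sessions_per_user": 0.010, "session_success_rate": 0.004,
          "revenue_per_user": 0.030, "retention_delta": -0.001}
print(ship_decision(deltas, weights, guardrails))  # True: OEC up, guardrail intact
```

A revenue-only objective would skip the guardrail loop entirely; the loop is what encodes "don't destroy long-term value for a quarterly bump."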
Deep Dive

Trust Is the Foundation — How to Build (and Break) It

Sample Ratio Mismatch (SRM): the silent killer

  • Design says a 50/50 split — if you actually get 50.2/49.8 in an experiment with over a million users, the odds of that split arising by chance can be around 1 in 500,000: invalidate the result
  • 8% of Microsoft experiments had SRM before the check existed — meaning 8% of results shipped on false data
  • Common causes: bots behave differently in treatment; data pipeline drops rows selectively; marketing campaign skews one arm
  • Teams ignored warning banners — Ronny blanked the entire scorecard, forcing a click-through with red highlights on every number
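A minimal SRM check is a chi-square goodness-of-fit test on the assignment counts. A standard-library-only sketch (the p < 0.001 alert threshold is a common convention, not a universal rule):

```python
import math

def srm_p_value(control: int, treatment: int, expected_share: float = 0.5) -> float:
    """Chi-square goodness-of-fit test (1 df) for sample ratio mismatch.
    `expected_share` is the designed fraction of traffic in control."""
    total = control + treatment
    exp_c = total * expected_share
    exp_t = total - exp_c
    chi2 = (control - exp_c) ** 2 / exp_c + (treatment - exp_t) ** 2 / exp_t
    return math.erfc(math.sqrt(chi2 / 2))  # survival function of chi-square, 1 df

# A 50.2 / 49.8 split at a million users looks tiny but is a clear SRM:
p = srm_p_value(502_000, 498_000)
print(f"SRM p-value: {p:.1e}")  # far below a p < 0.001 alert threshold
```

Run this check before reading any other number on the scorecard; if it fires, every downstream metric is suspect.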
Twyman's Law: "Any figure that looks interesting or different is usually wrong." If your metric normally moves 0.5% and yours shows 10%, hold the celebration dinner. Nine times out of ten it's a bug, not a breakthrough.

P-values: the most misunderstood number in product

  • Most PMs read p < 0.05 as "95% chance we’re right." That is wrong.
  • Correct framing: apply Bayes’ rule using historical failure rate to get "false positive risk"
  • At Airbnb search (8% win rate), a p < 0.05 result still carries a 26% false positive risk — not 5%
  • Fix: require p < 0.01 AND replicate — combine experiments with Fisher’s or Stouffer’s method
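The Bayes adjustment can be sketched directly. Assuming, as an illustration, a one-sided α of 0.025 and 80% power, an 8% prior win rate yields roughly the 26% false positive risk cited above; a Stouffer combination of two replications is included for the replicate-and-combine step:

```python
import math

def false_positive_risk(prior_win_rate: float,
                        alpha: float = 0.025, power: float = 0.8) -> float:
    """Bayes' rule: P(no real effect | statistically significant result).
    alpha = one-sided significance level; power = P(detect a real effect).
    The defaults are illustrative assumptions, not universal constants."""
    false_pos = alpha * (1 - prior_win_rate)   # true nulls that cross the bar
    true_pos = power * prior_win_rate          # real effects that get detected
    return false_pos / (false_pos + true_pos)

def stouffer_combined_p(z_scores: list[float]) -> float:
    """Stouffer's method: combine z-scores from replicated experiments
    into a single one-sided p-value."""
    z = sum(z_scores) / math.sqrt(len(z_scores))
    return 0.5 * math.erfc(z / math.sqrt(2))

print(false_positive_risk(0.08))          # ~0.26 at an 8% win rate
print(stouffer_combined_p([1.96, 1.96]))  # two marginal wins combine to p < 0.01
```

Note how the risk falls as the prior win rate rises: the same p-value is far more trustworthy on a team whose ideas usually work.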
Speed without trust = junk data

Optimizely’s early real-time p-value stopping inflated type-1 error from 5% to ~30%. Companies thought they were shipping winners; most weren’t. And trust, once lost, is very hard to rebuild within an org.
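The peeking problem is easy to reproduce: simulate A/A experiments (no true effect) and re-check a z-test as data accumulates. The sizes and peek frequency below are arbitrary choices for the sketch, but the inflation it shows is the same mechanism:

```python
import math
import random

def aa_false_positive_rate(n_sims: int = 1_000, n_obs: int = 500,
                           peek_every: int = 10, z_crit: float = 1.96,
                           seed: int = 0) -> float:
    """Fraction of A/A runs (true effect = 0) declared 'significant' when
    a z-test is re-checked every `peek_every` observations as data arrives."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        total = 0.0
        for i in range(1, n_obs + 1):
            total += rng.gauss(0.0, 1.0)      # null hypothesis is true
            if i % peek_every == 0:
                z = total / math.sqrt(i)      # running z-statistic
                if abs(z) > z_crit:           # "significant" -> stop and ship
                    hits += 1
                    break
    return hits / n_sims

peeking = aa_false_positive_rate(peek_every=10)    # 50 looks per experiment
one_look = aa_false_positive_rate(peek_every=500)  # fixed-horizon test
print(f"peeking: {peeking:.0%}, one look: {one_look:.0%}")
```

The fixed-horizon test lands near the nominal 5%; the peeking version is several times higher, which is exactly the failure mode described above.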

Tactics

The Practical Experiment Playbook

  • 200K user threshold: below that, only huge effects (5–10%) are detectable. Build culture first, infrastructure second.
  • OFAT — One Factor At a Time: never batch 17 changes into one launch. Tested separately, only ~4 of the 17 will win; batching hides which ones. Ship those 4 alone.
  • Portfolio thinking: 80% incremental (known wins compound); 20% big bets (expect 80% to fail)
  • Variance reduction: cap skewed metrics (nights booked > 30/month) to reach significance faster with fewer users
  • CUPED: use pre-experiment data to reduce variance — same unbiased result, fewer users needed
  • Don’t ship flat: statistically insignificant = zero user benefit + added maintenance cost — kill it
  • Legal ≠ skip testing: if a legal requirement forces a change, run 3 variants and ship the one that hurts least
goodui.org shortcut: 140+ documented UI patterns with real A/B win rates. Check it before writing a brief — someone has likely already tested your idea.
Contrarian

A/B Testing Myths Ronny Hears Every Week

Myth: "Experiments kill big swings — they trap you in incremental thinking." Reality: experiments enable big bets by telling you fast whether they work. The $100M Bing ad swap was a "meh" backlog item; nobody predicted it. Only testing revealed it. The oracle is the experiment.
Myth: "Our team is better — we won’t see a 70% failure rate." Reality: every team says this, and every team is humbled within months. Booking, Google, Airbnb, Microsoft: all saw 80–92% failure. When Ronny joined Microsoft, they said "we have better PMs." They didn’t.
Myth: "p < 0.05 means there’s a 95% chance my result is real." Reality: at a typical product team, p < 0.05 still carries a 15–26% false positive risk. Report false positive risk, not the raw p-value; lower your threshold, replicate experiments, and combine results statistically.
Myth: "Flat results should still ship — the team worked hard on it." Reality: flat means zero user benefit plus added code complexity and maintenance cost forever. That’s the sunk-cost fallacy; data-driven orgs reward honest kills. The experiment already told you: don’t ship.