Former VP, Airbnb · Microsoft · Amazon · Author, Trustworthy Online Controlled Experiments
JUL 27 2023
Core Truth
Most Ideas Fail — and That's the Point
"92% of our Airbnb search experiments failed. That’s not a bad team — it’s the humbling reality. Every team that starts running experiments believes they’ll be different. Every team is humbled."
Microsoft overall: ~66% of ideas fail
Bing (a heavily optimised domain): ~85% failure rate
Airbnb search: 92% failure rate
Booking, Google Ads: 80–90% failure rate
Framework
The Experimentation Operating System
$100M: Bing revenue from moving one ad title line
200K: users needed before A/B testing reliably works
25K: experiments per year at Microsoft by 2019
OEC (Overall Evaluation Criterion): One composite metric causally tied to lifetime value — not revenue or engagement alone
Constraint optimisation: "Increase revenue within a fixed vertical pixel budget for ads" — frames the problem correctly
Guardrail metrics: Counter-metrics that catch short-term gains destroying long-term value (churn, session success rate)
Platform goal: Marginal cost of running one more experiment should approach zero
Institutional memory: Quarterly review of most surprising experiments — winners AND surprising losers
Ronny's key insight
Optimise for revenue alone and you make more money short-term and destroy the product long-term. The OEC forces you to think in lifetime value, not quarterly metrics.
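A minimal sketch of what an OEC plus guardrails can look like in code. The component metrics, weights, and guardrail threshold below are hypothetical illustrations, not Airbnb's or Microsoft's actual formula; the point is the shape: one composite score that proxies lifetime value, with counter-metrics that can veto a short-term win.

```python
# Hypothetical OEC sketch -- metric names, weights, and thresholds are
# illustrative, not any company's real formula.

# Weighted blend of relative lifts on metrics that proxy long-term value.
OEC_WEIGHTS = {
    "session_success_rate": 0.5,   # did users accomplish their task?
    "sessions_per_user": 0.3,      # engagement proxy
    "revenue_per_user": 0.2,       # revenue matters, but is not the whole OEC
}
GUARDRAILS = ["churn_rate", "page_load_ms"]  # lower is better; must not regress

def oec_lift(treatment: dict, control: dict) -> float:
    """Composite OEC: weighted sum of relative lifts on the component metrics."""
    return sum(w * (treatment[m] - control[m]) / control[m]
               for m, w in OEC_WEIGHTS.items())

def ship_decision(treatment: dict, control: dict) -> str:
    """Ship only if the OEC improves and no guardrail regresses.

    A real scorecard would also require the deltas to be statistically
    significant, not merely positive.
    """
    if any(treatment[g] > control[g] * 1.01 for g in GUARDRAILS):
        return "blocked by guardrail"
    return "ship" if oec_lift(treatment, control) > 0 else "do not ship"
```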
Deep Dive
Trust Is the Foundation — How to Build (and Break) It
Sample Ratio Mismatch (SRM): the silent killer
Design says a 50/50 split — if you actually observe 50.2/49.8 across more than a million users, the odds of that happening by chance are roughly 1 in 500,000: invalidate the result (see the sketch after this list)
8% of Microsoft experiments had SRM before the check existed — meaning roughly 8% of results rested on corrupted data
Common causes: bots behave differently in treatment; data pipeline drops rows selectively; marketing campaign skews one arm
Teams ignored warning banners — Ronny blanked the entire scorecard, forcing a click-through with red highlights on every number
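The SRM check itself is a one-line goodness-of-fit test against the designed split. A minimal sketch below; the counts are illustrative (in the spirit of the published example Kohavi cites: a 50.2/49.8 split over roughly 1.6M users), not data from this talk.

```python
from scipy.stats import chisquare

# Designed split: 50/50. Observed counts are illustrative -- a 50.2/49.8
# split over ~1.6M users, the shape of Kohavi's canonical SRM example.
control, treatment = 821_588, 815_482
total = control + treatment

stat, p_value = chisquare([control, treatment], f_exp=[total / 2, total / 2])
print(f"observed split: {control / total:.2%} / {treatment / total:.2%}")
print(f"SRM p-value: {p_value:.1e}")   # ~2e-06, i.e. roughly a 1-in-500,000 fluke

# Common practice: if the SRM p-value falls below ~1e-3, distrust every
# metric on the scorecard and hunt for the assignment/logging bug instead.
if p_value < 1e-3:
    print("Sample Ratio Mismatch: invalidate the experiment")
```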
Twyman's Law
"Any figure that looks interesting or different is usually wrong." If your metric normally moves 0.5% and yours shows 10%, hold the dinner. Nine out of ten: it's a bug, not a breakthrough.
P-values: the most misunderstood number in product
Most PMs read p < 0.05 as "95% chance we’re right." That is wrong.
Correct framing: apply Bayes' rule using the team's historical win rate to get the "false positive risk" (worked through after this list)
At Airbnb search (8% win rate), a p < 0.05 result still carries a 26% false positive risk — not 5%
Fix: require p < 0.01 AND replicate — combine experiments with Fisher’s or Stouffer’s method
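The Bayes arithmetic is short enough to sketch. Two assumptions below are mine: 80% power, and that only the favourable tail of a two-sided α = 0.05 test ever ships (so the effective false-alarm rate is α/2). With those assumptions, an 8% historical win rate gives the ~26% false positive risk quoted above; Stouffer's method from scipy shows how two replicate runs combine.

```python
from scipy.stats import combine_pvalues

def false_positive_risk(prior_win_rate: float, alpha: float = 0.05,
                        power: float = 0.80) -> float:
    """P(no real effect | stat-sig positive result), via Bayes' rule.

    Assumes only the favourable tail (alpha / 2) of a two-sided test would
    ever be shipped, and `power` probability of detecting a true win.
    """
    p_false_alarm = (alpha / 2) * (1 - prior_win_rate)
    p_true_detection = power * prior_win_rate
    return p_false_alarm / (p_false_alarm + p_true_detection)

for win_rate in (1 / 3, 0.15, 0.08):   # strong prior, Bing-like, Airbnb-search-like
    print(f"historical win rate {win_rate:4.0%} -> false positive risk "
          f"{false_positive_risk(win_rate):.0%}")
# 33% -> 6%, 15% -> 15%, 8% -> 26%

# Replication: combine the original run and the re-run into one p-value.
stat, p_combined = combine_pvalues([0.04, 0.03], method="stouffer")
print(f"combined p-value (Stouffer): {p_combined:.4f}")   # ~0.005
```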
Speed without trust = junk data
Optimizely’s early real-time p-value stopping inflated type-I error from 5% to ~30%. Companies thought they were shipping winners. Most weren’t. Trust, once lost, is very hard to rebuild inside an org.
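A small A/A simulation makes the inflation concrete. Everything below is synthetic and illustrative: no real effect exists, yet checking a fixed p < 0.05 threshold at 50 interim looks "finds" one roughly 30% of the time.

```python
import numpy as np

rng = np.random.default_rng(7)
n_sims, n_per_arm, looks = 2_000, 5_000, 50
check_at = np.arange(1, looks + 1) * (n_per_arm // looks)   # peek every 100 users/arm

false_positives = 0
for _ in range(n_sims):
    # A/A test: both arms draw from the same distribution, so any
    # "significant" stop is by construction a false positive.
    a, b = rng.normal(size=n_per_arm), rng.normal(size=n_per_arm)
    n = check_at
    mean_diff = np.cumsum(a)[n - 1] / n - np.cumsum(b)[n - 1] / n
    z = mean_diff / np.sqrt(2.0 / n)            # two-sample z, known unit variance
    if np.any(np.abs(z) > 1.96):                # naive rule: stop at first p < 0.05
        false_positives += 1

print(f"false positive rate with peeking: {false_positives / n_sims:.1%}")
# A single look at the end keeps this near 5%; 50 peeks push it toward ~30%.
```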
Tactics
The Practical Experiment Playbook
200K user threshold: below that, only huge effects (5–10%) are reliably detectable (see the sample-size sketch after this list). Build experimentation culture first, infrastructure second.
OFAT — One Factor At A Time: never batch 17 changes into one launch. At typical win rates only ~4 of the 17 help and the losers cancel them out; test each separately and ship those 4.
Portfolio thinking: 80% incremental (known wins compound); 20% big bets (expect 80% to fail)
Variance reduction: cap heavily skewed metrics (e.g. nights booked, capped at 30/month) to reach significance faster with fewer users
CUPED: use pre-experiment data as a covariate to reduce variance — same unbiased result, fewer users needed (sketched after this list)
Don’t ship flat: statistically insignificant = zero user benefit + added maintenance cost — kill it
Legal ≠ skip testing: if a legal requirement forces a change, run 3 variants and ship the one that hurts least
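A back-of-envelope power calculation shows where a ~200K threshold comes from. The 5% baseline conversion rate, two-sided α = 0.05, and 80% power below are my assumptions for illustration, not numbers from the talk.

```python
from scipy.stats import norm

def users_per_arm(p_base: float, relative_lift: float,
                  alpha: float = 0.05, power: float = 0.80) -> float:
    """Approximate users per arm needed to detect a relative lift in a rate."""
    p_treat = p_base * (1 + relative_lift)
    z_alpha, z_power = norm.isf(alpha / 2), norm.isf(1 - power)
    var_sum = p_base * (1 - p_base) + p_treat * (1 - p_treat)
    return (z_alpha + z_power) ** 2 * var_sum / (p_treat - p_base) ** 2

for lift in (0.02, 0.05, 0.10):
    print(f"{lift:4.0%} relative lift -> ~{2 * users_per_arm(0.05, lift):,.0f} users total")
# 2% -> ~1.5M, 5% -> ~240K, 10% -> ~60K: below a couple hundred thousand
# users, only large effects are detectable.
```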
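And a minimal CUPED sketch on synthetic data: the same user's pre-experiment metric is used as a covariate, and the adjusted metric keeps the same mean while shedding variance. The data and the 0.7 correlation are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic per-user metric: the experiment-period value is correlated with
# the same user's pre-experiment value (e.g. nights booked last month).
n = 50_000
pre = rng.gamma(shape=2.0, scale=1.0, size=n)       # pre-experiment covariate
post = 0.7 * pre + rng.normal(scale=1.0, size=n)    # experiment-period metric

# CUPED: subtract the part of `post` explained by `pre`.
theta = np.cov(post, pre, ddof=1)[0, 1] / np.var(pre, ddof=1)
post_cuped = post - theta * (pre - pre.mean())

print(f"mean unchanged:   {post.mean():.3f} -> {post_cuped.mean():.3f}")
print(f"variance reduced: {post.var():.3f} -> {post_cuped.var():.3f}")
# Treatment-vs-control deltas are then computed on post_cuped; the lower
# variance means the same effect reaches significance with fewer users.
```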
goodui.org shortcut
140+ documented UI patterns with real A/B win rates. Check this before writing a brief — someone likely already tested your idea.
Contrarian
A/B Testing Myths Ronny Hears Every Week
✗ Myth: Experiments kill big swings — they trap you in incremental thinking.
✓ Instead: Experiments enable big bets by telling you fast whether they work. The $100M Bing ad swap was a "meh" backlog item. Nobody predicted it. Only testing revealed it. The oracle is the experiment.
✗ Myth: Our team is better — we won't see a 70% failure rate.
✓ Instead: Every team says this. Every team is humbled within months. Booking, Google, Airbnb, Microsoft — all 80–92% failure. When Ronny joined Microsoft, they said "we have better PMs." They didn't.
✗ Myth: p < 0.05 means there's a 95% chance my result is real.
✓ Instead: For a typical product team, p < 0.05 still carries a 15–26% false positive risk. Report the false positive risk, not just the p-value; lower your significance threshold; replicate experiments and combine the results statistically.
✗ Myth: Flat results should still ship — the team worked hard on it.
✓ Instead: Flat = zero user benefit + added code complexity + ongoing maintenance cost, forever. That's the sunk-cost fallacy. Data-driven orgs reward honest kills. The experiment already told you: don't ship.