Former VP, Airbnb · Microsoft · Amazon · Author, Trustworthy Online Controlled Experiments
JUL 27 2023
Core Truth
Most Ideas Fail — and That's the Point
"92% of our Airbnb search experiments failed. That’s not a bad team — it’s the humbling reality. Every team that starts running experiments believes they’ll be different. Every team is humbled."
Microsoft overall: ~66% of ideas fail
Bing (a heavily optimised domain): ~85% failure rate
Airbnb search: 92% failure rate
Booking, Google Ads: 80–90% failure rate
Framework
The Experimentation Operating System
$100M: Bing revenue from moving one ad title line
200K: users needed before A/B testing reliably works
25K: experiments per year at Microsoft by 2019
OEC (Overall Evaluation Criterion): One composite metric causally tied to lifetime value — not revenue or engagement alone
Constraint optimisation: "Increase revenue within a fixed vertical pixel budget for ads" — frames the problem correctly
Guardrail metrics: Counter-metrics that catch short-term gains destroying long-term value (churn, session success rate)
Platform goal: Marginal cost of running one more experiment should approach zero
Institutional memory: Quarterly review of most surprising experiments — winners AND surprising losers
Ronny's key insight
Optimise for revenue alone and you make more money short-term and destroy the product long-term. The OEC forces you to think in lifetime value, not quarterly metrics.
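A minimal sketch of what an OEC plus guardrails can look like in code. The component metrics, weights, and guardrail threshold below are hypothetical illustrations, not Airbnb's or Microsoft's actual formula; the point is the shape: one composite score that proxies lifetime value, with counter-metrics that can veto a short-term win.

```python
# Hypothetical OEC sketch -- metric names, weights, and thresholds are
# illustrative, not any company's real formula.

# Weighted blend of relative lifts on metrics that proxy long-term value.
OEC_WEIGHTS = {
    "session_success_rate": 0.5,   # did users accomplish their task?
    "sessions_per_user": 0.3,      # engagement proxy
    "revenue_per_user": 0.2,       # revenue matters, but is not the whole OEC
}
GUARDRAILS = ["churn_rate", "page_load_ms"]  # lower is better; must not regress

def oec_lift(treatment: dict, control: dict) -> float:
    """Composite OEC: weighted sum of relative lifts on the component metrics."""
    return sum(w * (treatment[m] - control[m]) / control[m]
               for m, w in OEC_WEIGHTS.items())

def ship_decision(treatment: dict, control: dict) -> str:
    """Ship only if the OEC improves and no guardrail regresses.

    A real scorecard would also require the deltas to be statistically
    significant, not merely positive.
    """
    if any(treatment[g] > control[g] * 1.01 for g in GUARDRAILS):
        return "blocked by guardrail"
    return "ship" if oec_lift(treatment, control) > 0 else "do not ship"
```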
Deep Dive
Trust Is the Foundation — How to Build (and Break) It
Sample Ratio Mismatch (SRM): the silent killer
Design says a 50/50 split — if you actually observe 50.2/49.8 across more than a million users, the odds of that happening by chance are roughly 1 in 500,000: invalidate the result (see the sketch after this list)
8% of Microsoft experiments had SRM before the check existed — meaning roughly 8% of results rested on corrupted data
Common causes: bots behave differently in treatment; data pipeline drops rows selectively; marketing campaign skews one arm
Teams ignored warning banners — Ronny blanked the entire scorecard, forcing a click-through with red highlights on every number
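The SRM check itself is a one-line goodness-of-fit test against the designed split. A minimal sketch below; the counts are illustrative (in the spirit of the published example Kohavi cites: a 50.2/49.8 split over roughly 1.6M users), not data from this talk.

```python
from scipy.stats import chisquare

# Designed split: 50/50. Observed counts are illustrative -- a 50.2/49.8
# split over ~1.6M users, the shape of Kohavi's canonical SRM example.
control, treatment = 821_588, 815_482
total = control + treatment

stat, p_value = chisquare([control, treatment], f_exp=[total / 2, total / 2])
print(f"observed split: {control / total:.2%} / {treatment / total:.2%}")
print(f"SRM p-value: {p_value:.1e}")   # ~2e-06, i.e. roughly a 1-in-500,000 fluke

# Common practice: if the SRM p-value falls below ~1e-3, distrust every
# metric on the scorecard and hunt for the assignment/logging bug instead.
if p_value < 1e-3:
    print("Sample Ratio Mismatch: invalidate the experiment")
```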
Twyman's Law
"Any figure that looks interesting or different is usually wrong." If your metric normally moves 0.5% and yours shows 10%, hold the dinner. Nine out of ten: it's a bug, not a breakthrough.
P-values: the most misunderstood number in product
Most PMs read p < 0.05 as "95% chance we’re right." That is wrong.
Correct framing: apply Bayes' rule using the team's historical win rate to get the "false positive risk" (worked through after this list)
At Airbnb search (8% win rate), a p < 0.05 result still carries a 26% false positive risk — not 5%
Fix: require p < 0.01 AND replicate — combine experiments with Fisher’s or Stouffer’s method
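The Bayes arithmetic is short enough to sketch. Two assumptions below are mine: 80% power, and that only the favourable tail of a two-sided α = 0.05 test ever ships (so the effective false-alarm rate is α/2). With those assumptions, an 8% historical win rate gives the ~26% false positive risk quoted above; Stouffer's method from scipy shows how two replicate runs combine.

```python
from scipy.stats import combine_pvalues

def false_positive_risk(prior_win_rate: float, alpha: float = 0.05,
                        power: float = 0.80) -> float:
    """P(no real effect | stat-sig positive result), via Bayes' rule.

    Assumes only the favourable tail (alpha / 2) of a two-sided test would
    ever be shipped, and `power` probability of detecting a true win.
    """
    p_false_alarm = (alpha / 2) * (1 - prior_win_rate)
    p_true_detection = power * prior_win_rate
    return p_false_alarm / (p_false_alarm + p_true_detection)

for win_rate in (1 / 3, 0.15, 0.08):   # strong prior, Bing-like, Airbnb-search-like
    print(f"historical win rate {win_rate:4.0%} -> false positive risk "
          f"{false_positive_risk(win_rate):.0%}")
# 33% -> 6%, 15% -> 15%, 8% -> 26%

# Replication: combine the original run and the re-run into one p-value.
stat, p_combined = combine_pvalues([0.04, 0.03], method="stouffer")
print(f"combined p-value (Stouffer): {p_combined:.4f}")   # ~0.005
```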
Speed without trust = junk data
Optimizely’s early real-time p-value stopping inflated type-I error from 5% to ~30%. Companies thought they were shipping winners. Most weren’t. Trust, once lost, is very hard to rebuild inside an org.
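A small A/A simulation makes the inflation concrete. Everything below is synthetic and illustrative: no real effect exists, yet checking a fixed p < 0.05 threshold at 50 interim looks "finds" one roughly 30% of the time.

```python
import numpy as np

rng = np.random.default_rng(7)
n_sims, n_per_arm, looks = 2_000, 5_000, 50
check_at = np.arange(1, looks + 1) * (n_per_arm // looks)   # peek every 100 users/arm

false_positives = 0
for _ in range(n_sims):
    # A/A test: both arms draw from the same distribution, so any
    # "significant" stop is by construction a false positive.
    a, b = rng.normal(size=n_per_arm), rng.normal(size=n_per_arm)
    n = check_at
    mean_diff = np.cumsum(a)[n - 1] / n - np.cumsum(b)[n - 1] / n
    z = mean_diff / np.sqrt(2.0 / n)            # two-sample z, known unit variance
    if np.any(np.abs(z) > 1.96):                # naive rule: stop at first p < 0.05
        false_positives += 1

print(f"false positive rate with peeking: {false_positives / n_sims:.1%}")
# A single look at the end keeps this near 5%; 50 peeks push it toward ~30%.
```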
Tactics
The Practical Experiment Playbook
200K user threshold: below that, only huge effects (5–10%) are reliably detectable (see the sample-size sketch after this list). Build experimentation culture first, infrastructure second.
OFAT — One Factor At A Time: never batch 17 changes into one launch. At typical win rates only ~4 of the 17 help and the losers cancel them out; test each separately and ship those 4.
Portfolio thinking: 80% incremental (known wins compound); 20% big bets (expect 80% to fail)
Variance reduction: cap heavily skewed metrics (e.g. nights booked, capped at 30/month) to reach significance faster with fewer users
CUPED: use pre-experiment data as a covariate to reduce variance — same unbiased result, fewer users needed (sketched after this list)
Don’t ship flat: statistically insignificant = zero user benefit + added maintenance cost — kill it
Legal ≠ skip testing: if a legal requirement forces a change, run 3 variants and ship the one that hurts least
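A back-of-envelope power calculation shows where a ~200K threshold comes from. The 5% baseline conversion rate, two-sided α = 0.05, and 80% power below are my assumptions for illustration, not numbers from the talk.

```python
from scipy.stats import norm

def users_per_arm(p_base: float, relative_lift: float,
                  alpha: float = 0.05, power: float = 0.80) -> float:
    """Approximate users per arm needed to detect a relative lift in a rate."""
    p_treat = p_base * (1 + relative_lift)
    z_alpha, z_power = norm.isf(alpha / 2), norm.isf(1 - power)
    var_sum = p_base * (1 - p_base) + p_treat * (1 - p_treat)
    return (z_alpha + z_power) ** 2 * var_sum / (p_treat - p_base) ** 2

for lift in (0.02, 0.05, 0.10):
    print(f"{lift:4.0%} relative lift -> ~{2 * users_per_arm(0.05, lift):,.0f} users total")
# 2% -> ~1.5M, 5% -> ~240K, 10% -> ~60K: below a couple hundred thousand
# users, only large effects are detectable.
```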
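And a minimal CUPED sketch on synthetic data: the same user's pre-experiment metric is used as a covariate, and the adjusted metric keeps the same mean while shedding variance. The data and the 0.7 correlation are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic per-user metric: the experiment-period value is correlated with
# the same user's pre-experiment value (e.g. nights booked last month).
n = 50_000
pre = rng.gamma(shape=2.0, scale=1.0, size=n)       # pre-experiment covariate
post = 0.7 * pre + rng.normal(scale=1.0, size=n)    # experiment-period metric

# CUPED: subtract the part of `post` explained by `pre`.
theta = np.cov(post, pre, ddof=1)[0, 1] / np.var(pre, ddof=1)
post_cuped = post - theta * (pre - pre.mean())

print(f"mean unchanged:   {post.mean():.3f} -> {post_cuped.mean():.3f}")
print(f"variance reduced: {post.var():.3f} -> {post_cuped.var():.3f}")
# Treatment-vs-control deltas are then computed on post_cuped; the lower
# variance means the same effect reaches significance with fewer users.
```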
goodui.org shortcut
140+ documented UI patterns with real A/B win rates. Check this before writing a brief — someone likely already tested your idea.
Contrarian
A/B Testing Myths Ronny Hears Every Week
✗ Myth: Experiments kill big swings — they trap you in incremental thinking.
✓ Instead: Experiments enable big bets by telling you fast whether they work. The $100M Bing ad swap was a "meh" backlog item. Nobody predicted it. Only testing revealed it. The oracle is the experiment.
✗ Myth: Our team is better — we won't see a 70% failure rate.
✓ Instead: Every team says this. Every team is humbled within months. Booking, Google, Airbnb, Microsoft — all 80–92% failure. When Ronny joined Microsoft, they said "we have better PMs." They didn't.
✗ Myth: p < 0.05 means there's a 95% chance my result is real.
✓ Instead: For a typical product team, p < 0.05 still carries a 15–26% false positive risk. Report the false positive risk, not just the p-value; lower your significance threshold; replicate experiments and combine the results statistically.
✗ Myth: Flat results should still ship — the team worked hard on it.
✓ Instead: Flat = zero user benefit + added code complexity + ongoing maintenance cost, forever. That's the sunk-cost fallacy. Data-driven orgs reward honest kills. The experiment already told you: don't ship.