Online Controlled Experiments: Introduction, Learnings, and Humbling Statistics
Industry keynote at ACM Recommender Systems, Sept 12, 2012
The web provides an unprecedented opportunity to accelerate innovation by evaluating ideas quickly and accurately using controlled experiments (e.g., A/B tests and their generalizations). Whether for front-end user-interface changes, or backend recommendation systems and relevance algorithms, online controlled experiments are now utilized to make data-driven decisions at Amazon, Microsoft, eBay, Facebook, Google, Yahoo, Zynga, and at many other companies. While the theory of a controlled experiment is simple, and dates back to Sir Ronald A. Fisher’s experiments at the Rothamsted Agricultural Experimental Station in England in the 1920s, the deployment and mining of online controlled experiments at scale—thousands of experiments now—has taught us many lessons. We provide an introduction, share real examples, key learnings, cultural challenges, and humbling statistics.
What people said
- Greg Linden's tweet: Don't miss: Ronny Kohavi's RecSys 2012 slides (on A/B testing and recommendations).
- Dan Weld's tweet: I loved Ronny Kohavi's classic paper on A/B testing, but his Recsys2012 keynote slides are even better!
- Justin Hunter, Founder and CEO of Hexawise: Absolutely fantastic presentation. Phenomenal. The world would be a better place if more people (a) understood Design of Experiments, (b) listened and acted upon the data (instead of placating "hippo's"), and (c) could advocate for adoption of Design of Experiments-based methods even half as well as you can. Bravo!
- Xavier Amatriain from Netflix wrote a nice summary of Recsys 2012, which included:
I am glad to see that this has become a relevant topic for the conference, because many of us believe this is one of the most important topics that need to be addressed by both industry and academia. One of these people is Ron Kohavi, who delivered a great keynote on "Online Controlled Experiments: Introduction, Learnings, and Humbling Statistics", where he described his learnings of many years of AB Testing in Amazon and Microsoft. It is funny that I cited his KDD 2012 paper in two slides in my tutorial, not knowing that he was in the audience. I recommend you go through his slides, it was one of the best talks of the conference for sure.
- Ossi Mokryn's tweet: fascinating #recsys2012 industry keynote talk. Great presenter, deep observations. My vote for best talk of the day.
- Alan Said in his RecSys 2012 summary wrote: As usual, the industry session was filled with interesting talks from interesting people working at interesting companies. The two most interesting talks were given by Ron Kohavi from Microsoft and Paul Lamere from The Echonest. Ron's keynote "Online Controlled Experiments: Introduction, Learnings, and Humbling Statistics" (slides) was an insightful talk about online testing and concepts such as the "HiPPO" (Highest Paid Person’s Opinion) and A/A testing, i.e. testing the same settings/algorithms on different groups before going on with A/B tests.
- Sumanth Kolar's tweet: My Q for @ronnyk : With multiple key metrics for ABs, in the case some go up, some down. Whats the framework to decide if test is a success? Answer: Ideally, the OEC should be a formula, i.e., translate everything to lifetime value. Hard, of course. Tradeoffs must be made: is improving key metric sessions/UU by delta1 worth revenue/UU by delta2?
- denisparra's tweet: Btw, I found enlightening @ronnyk mention of doing A/A testing: test same condition in diff groups before A/B testing.
- Werner Geyer's note: Great talk by Ron Kohavi at RecSys 2012
- Steve Blank (blog): The presentation is awesome. Everyone doing Customer Development ought to read it.
During the talk, I asked the audience of about 270 people to guess the outcome of three A/B tests. Everyone started out standing, then had to sit down once they guessed wrong. Since each A/B test has three possible outcomes (A wins stat sig, B wins stat sig, or they're approximately the same), random guessing would leave about (1/3)^3 * 270 = 10 people standing, and about 8 people remained standing at the end. The room did about as well as random (perhaps slightly worse).
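The back-of-the-envelope figure for random guessing can be checked with a short sketch. It assumes every audience member guesses independently and uniformly at random among the three outcomes; that is a modeling assumption for illustration, not something measured at the talk. The function names are hypothetical.

```python
from math import comb


def expected_survivors(audience: int, outcomes: int, rounds: int) -> float:
    """Expected number still standing if everyone guesses uniformly at random."""
    return audience / outcomes ** rounds


def prob_at_most(k: int, n: int, p: float) -> float:
    """Binomial probability that at most k of n independent guessers survive."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


# Figures from the talk: about 270 people, 3 tests, 3 possible outcomes each.
audience, outcomes, rounds = 270, 3, 3

# Chance that one random guesser gets all three tests right.
p_all_correct = 1 / outcomes ** rounds

print("expected survivors:", expected_survivors(audience, outcomes, rounds))
print("P(at most 8 survive by chance):", prob_at_most(8, audience, p_all_correct))
```

Under these assumptions the second number indicates how often a purely random room would end with 8 or fewer people standing, which gives a sense of whether "slightly worse than random" is distinguishable from chance.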