Online randomized controlled experiments at scale: lessons and extensions to medicine

Ron Kohavi; Diane Tang; Ya Xu; Lars G Hemkens; John P A Ioannidis

doi:10.1186/s13063-020-4084-y

Trials. 2020 Feb 7;21(1):150. doi: 10.1186/s13063-020-4084-y.

Authors

Ron Kohavi¹ ², Diane Tang³, Ya Xu⁴, Lars G Hemkens⁵, John P A Ioannidis⁶ ⁷ ⁸ ⁹ ¹⁰

Affiliations

¹ Analysis & Experimentation, Microsoft, One Microsoft Way, Redmond, WA, 98052, USA.
² Airbnb, 888 Brannan St, San Francisco, CA, 94103, USA.
³ Google, 1600 Amphitheatre Parkway, Mountain View, CA, 94043, USA.
⁴ LinkedIn, 950 W Maude Ave, Sunnyvale, CA, 94085, USA.
⁵ Basel Institute for Clinical Epidemiology and Biostatistics, Department of Clinical Research, University Hospital Basel, University of Basel, 4031, Basel, Switzerland.
⁶ Stanford Prevention Research Center, Department of Medicine, Stanford University School of Medicine, Medical School Office Building, Room X306, 1265 Welch Rd, Stanford, CA, 94305, USA. jioannid@stanford.edu.
⁷ Meta-Research Innovation Center at Stanford (METRICS), Stanford University, Palo Alto, CA, 94305, USA. jioannid@stanford.edu.
⁸ Department of Health Research and Policy, Stanford University School of Medicine, Stanford, CA, 94305, USA. jioannid@stanford.edu.
⁹ Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA, 94305, USA. jioannid@stanford.edu.
¹⁰ Department of Statistics, Stanford University School of Humanities and Sciences, Stanford, CA, 94305, USA. jioannid@stanford.edu.

Abstract

Background: Many technology companies, including Airbnb, Amazon, Booking.com, eBay, Facebook, Google, LinkedIn, Lyft, Microsoft, Netflix, Twitter, Uber, and Yahoo!/Oath, run online randomized controlled experiments at scale, namely hundreds of concurrent controlled experiments on millions of users each, commonly referred to as A/B tests. Originally derived from the same statistical roots, randomized controlled trials (RCTs) in medicine are now criticized for being expensive and difficult, while in technology, the marginal cost of such experiments is approaching zero and the value for data-driven decision-making is broadly recognized.

Methods and results: This is an overview of key scaling lessons learned in the technology field. They include (1) a focus on metrics: an overall evaluation criterion plus thousands of metrics for insights and debugging, all automatically computed for every experiment; (2) quick release cycles with automated ramp-up and shut-down that afford agile and safe experimentation, leading to consistent incremental progress over time; and (3) a culture of 'test everything', because most ideas fail and tiny changes sometimes show surprising outcomes worth millions of dollars annually. Technological advances, online interactions, and the availability of large-scale data allowed technology companies to take the science of RCTs and apply it as online randomized controlled experiments at large scale, with hundreds of concurrent experiments running on any given day across a wide range of software products, be they web sites, mobile applications, or desktop applications. Rather than hindering innovation, these experiments accelerated it, with clear improvements to key metrics, including user experience and revenue. As healthcare increasingly interacts with patients through these modern channels of web sites and digital health applications, many of the lessons apply. The most innovative technological field has recognized that a systematic series of randomized trials, with numerous failures of the most promising ideas, leads to sustainable improvement.
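To make the mechanics of an A/B test concrete, the following is a minimal sketch, not the method of any company named in this abstract. It assumes a single binary metric (a hypothetical "conversion") and two common building blocks: deterministic hash-based assignment of users to variants, and a pooled two-proportion z-test to compare them. The function names `assign_variant` and `two_proportion_z_test`, the experiment label, and the counts in the usage note are illustrative assumptions.

```python
import hashlib
import math


def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'control' or 'treatment'.

    Hashing the experiment name together with the user ID (an assumed
    scheme) gives each experiment its own, repeatable split of users.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # roughly uniform in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


def two_proportion_z_test(conv_c: int, n_c: int, conv_t: int, n_t: int):
    """Return (absolute difference, z statistic, two-sided p-value).

    conv_* are counts of users with the outcome; n_* are users per arm.
    """
    p_c = conv_c / n_c
    p_t = conv_t / n_t
    pooled = (conv_c + conv_t) / (n_c + n_t)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    if se == 0:
        raise ValueError("standard error is zero; the metric has no variation")
    z = (p_t - p_c) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_t - p_c, z, p_value
```

For example, `two_proportion_z_test(1000, 10000, 1100, 10000)` compares a 10% control rate with an 11% treatment rate on invented counts. A production platform would compute many such metrics per experiment and would also need to handle multiple comparisons and repeated looks at the data, which this sketch does not.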

Conclusion: While there are many differences between technology and medicine, it is worth considering whether and how similar designs can be applied via simple RCTs that focus on healthcare decision-making or service delivery. Changes, small and large, should undergo continuous and repeated evaluation in randomized trials, and learning from their results will enable accelerated healthcare improvements.

Keywords: A/B tests; Healthcare decision-making; Online experiments; Randomization; Trials.

MeSH terms

Decision Making, Organizational*
Humans
Internet-Based Intervention*
Mobile Applications
Randomized Controlled Trials as Topic*
Research Design*