Big experiments: Big data’s friend for making decisions

April 3, 2014 by Eytan Bakshy, Dean Eckles, Michael S. Bernstein

When people think about the tools of data science, they often focus on machine learning, statistics, and data manipulation. Modeling massive datasets is indispensable for making predictions – like predicting which set of News Feed stories or search results is most relevant to a person. But such models are limited in their ability to uncover the cause-and-effect relationships that lead to better products and advance the behavioral sciences.

On the Data Science Team, part of our job is to inform strategic and product decisions. Does a new feature we are testing improve communication? Does having more friends on Facebook increase the value people get out of the service? While a correlation between variables may suggest a particular causal relationship, such data rarely answers these questions credibly because of confounding factors that are difficult to adjust for. Furthermore, when you change the rules of the game, like launching a completely new feature, there is often no data at all with which to anticipate the effects of the change.

Because of this, data scientists, engineers, and managers turn to randomized experiments, commonly referred to as “A/B tests”. Typically, A/B tests are used as a kind of “bake-off” between proposed alternatives. But experiments can go beyond bake-offs: they can be used to develop generalizable knowledge that is valuable throughout the design process.

Data science needs better tools for running experiments. Despite the abundance of experimentation in the Internet industry, there are few tools or standard practices for running online field experiments. Existing tools tend to focus on rolling out new features or on automatically optimizing some outcome of interest.

At Facebook, we run over a thousand experiments each day. While many of these experiments are designed to optimize specific outcomes, others aim to inform long-term design decisions. And because we run so many experiments, we need reliable ways of routinizing experimentation. As Ronald Fisher, a pioneer in statistics and experimental design, said, “To consult the statistician after an experiment is finished is often merely to ask him to conduct a post-mortem examination. He can perhaps say what the experiment died of.” Many online experiments are implemented by engineers who are not trained statisticians. While experiments are often simple to analyze when done correctly, it can be surprisingly easy to make mistakes in their design, implementation, logging, and analysis. One way to consult a statistician in advance is to build that advice into the tools for running experiments.

PlanOut: a framework for running online field experiments

Good tools not only enable good practices, they encourage them. That’s why we created PlanOut, a set of tools for running online field experiments, and are sharing an open source version of it as part of the Data Science Team’s first software release.

Importantly, PlanOut gives engineers and scientists a language for defining random assignment procedures. Experiments, ranging from simple A/B tests, to factorial designs that decompose large interface changes, to more complex within-subjects designs, can be expressed with only a few lines of code. In this way, PlanOut encourages running experiments that are more akin to the kind you see in the behavioral sciences.
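For instance, a simple experiment that independently randomizes the color and text of a sharing button could be written roughly as follows using the open-source Python version of PlanOut (the class name, parameter names, and choice values here are made up for illustration):

    from planout.experiment import SimpleExperiment
    from planout.ops.random import UniformChoice

    # Hypothetical experiment varying two aspects of a sharing button.
    # Because each parameter is randomized independently (hashed on the
    # user ID), this is a 2x2 factorial design rather than a plain A/B test.
    class ShareButtonExperiment(SimpleExperiment):
        def assign(self, params, userid):
            params.button_color = UniformChoice(
                choices=['#3b5998', '#44bec7'], unit=userid)
            params.button_text = UniformChoice(
                choices=['Share', 'Tell your friends'], unit=userid)

    # Assignment is deterministic given the unit, so the same user
    # always gets the same combination of parameters.
    exp = ShareButtonExperiment(userid=42)
    print(exp.get('button_color'), exp.get('button_text'))

Because the assignment procedure is ordinary code, moving from an A/B test to a factorial or within-subjects design is a matter of adding parameters rather than building new infrastructure.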

Logging can also be a pain point for many experiments. Logging is often treated as separate from the randomization process, which can make it difficult to keep track of, or even define, exactly which units (e.g., user accounts) are in your experiments. This is especially problematic if engineers change experiments by adding or removing treatments midway through. PlanOut helps reduce these kinds of errors by automatically logging exposures and by providing a way to tie outcomes to experiments. Finally, this kind of exposure logging increases the precision of experiments, which lowers the risk of false negatives (“Type II errors”).
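Continuing the ShareButtonExperiment sketch above: with SimpleExperiment, retrieving a parameter for a given unit triggers exposure logging automatically, and outcomes can be tied back to the same experiment with an event-logging call (the event name and extra data below are invented, and this assumes the log_event helper in the open-source release):

    # The first get() for a unit writes an exposure record containing the
    # experiment name, inputs (e.g., userid), and assigned parameters.
    exp = ShareButtonExperiment(userid=42)
    color = exp.get('button_color')

    # Outcomes are logged against the same experiment, so analysis is
    # largely a join between exposure records and outcome events.
    exp.log_event('button_click', {'source': 'timeline'})

Because only exposed units enter the analysis, comparisons are made over users who actually encountered the treatment, which is what improves precision.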

A single experiment is rarely definitive. Instead, follow-on experiments may need to be run as a new product is developed, or as decision-makers evaluate the effects of a major product rollout. While it is fairly straightforward to run a single experiment, there are few established protocols for how follow-on experiments should be run and how their data should be logged. PlanOut includes a management system that organizes experiments, and the parameters they manipulate, into namespaces. This allows distributed teams to work on related features and launch follow-on experiments in a way that minimizes threats to validity.
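As a rough sketch of how this works in the open-source Python release (the class names and segment allocation below are invented for illustration, and this assumes the SimpleNamespace and DefaultExperiment classes it provides, reusing the ShareButtonExperiment class sketched earlier): a namespace hashes the primary unit into segments and allocates those segments to experiments, so a follow-on experiment can only claim segments that earlier experiments are not using.

    from planout.namespace import SimpleNamespace
    from planout.experiment import DefaultExperiment

    class DefaultShareButton(DefaultExperiment):
        # Users not allocated to any experiment get the status-quo values.
        def get_default_params(self):
            return {'button_color': '#3b5998', 'button_text': 'Share'}

    class ShareButtonNamespace(SimpleNamespace):
        def setup(self):
            self.name = 'share_button'
            self.primary_unit = 'userid'
            self.num_segments = 100
            self.default_experiment_class = DefaultShareButton

        def setup_experiments(self):
            # Allocate 20 of 100 segments to the first experiment; a
            # follow-on experiment would be added here with its own
            # non-overlapping allocation of the remaining segments.
            self.add_experiment('share_button_v1', ShareButtonExperiment, 20)

    ns = ShareButtonNamespace(userid=42)
    print(ns.get('button_color'))

Callers ask the namespace, rather than any single experiment, for parameter values, so launching or retiring a follow-on experiment does not require touching application code.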

Open sourcing the code

The code for PlanOut is on GitHub. We welcome developers and researchers in the open source community who are interested in contributing directly to the original repository or forking it.

You can read more about PlanOut in the paper we will be presenting at this year’s ACM International Conference on the World Wide Web (WWW 2014): Designing and Deploying Online Field Experiments. Eytan Bakshy, Dean Eckles, and Michael S. Bernstein. Proceedings of WWW 2014.