Why we are still in the Alchemy days of influencing online behavior

Written by happybandits | Published 2016/05/26
Tech Story Tags: data-science | experimentation | machine-learning | online-behavior | influence-online-behavior


An interview with Ronny Kohavi on trustworthy experiments

Ronny Kohavi is a Microsoft Technical Fellow and Corporate Vice President, Analysis & Experimentation. He joined Microsoft in 2005 and founded the Experimentation Platform team in 2006.

His papers have over 27,000 citations and three of his papers are in the top 1,000 most-cited papers in Computer Science.

In short, Ronny is the big kahuna of experimentation and I am very happy he took time to have this discussion with me.

Experiments are run in many disciplines. Lately there has been a lot of attention on A/B tests, or online randomized controlled experiments, in social networks and on search engine results pages, for example.

In some engineering disciplines, experiments are also used to create a robust process, one that is minimally affected by external sources of variability. This can be done for new product design or formulation, manufacturing process development, and process improvement.

What do you see as the biggest difference between these two types of experiments? What can we learn from each other?

Let’s look at a few attributes required for experimentation and compare them for the two families of experiments, which for simplicity I’ll refer to as “lab” experiments and “randomized” experiments.

The dependent variable of interest

This is the phenomenon we are trying to understand. In physics, we may be interested in determining a constant, like the speed of light, or the relationship between factors in Ohm’s law. In online randomized experiments, the dependent variable is often driven by the business, and may be something like Sessions/user (the north-star metric for Bing, as noted in Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained). We often refer to the dependent variable as the Overall Evaluation Criterion (OEC) to highlight that it may be a combination of multiple factors (e.g., Sessions/user is one factor, but revenue is another).
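
As a rough illustration (not an actual formula used by any team), an OEC can be thought of as a weighted combination of normalized component metrics; the weights and components below are made up:

```python
# Illustrative sketch only: a hypothetical OEC combining two component
# metrics with made-up weights. Inputs are assumed to be normalized
# relative to the control (e.g., 1.02 means a 2% lift over control).

def oec(sessions_per_user: float, revenue_per_user: float,
        w_sessions: float = 0.7, w_revenue: float = 0.3) -> float:
    """Combine component metrics into a single Overall Evaluation Criterion."""
    return w_sessions * sessions_per_user + w_revenue * revenue_per_user

# Example: a treatment that lifts sessions/user by 2% but drops revenue by 1%
print(oec(1.02, 0.99))  # 1.011 -> net positive under these (made-up) weights
```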

Repeatability is critical for establishing causality in both lab and randomized experiments

In Robert Frost’s “The Road Not Taken,” he wrote:

Two roads diverged in a wood, and I —
I took the one less traveled by,
And that has made all the difference.

But that statement is causal and cannot really be known. Being a single person, he can’t be sure, as pointed out by Angrist and Pischke in Mastering ‘Metrics: The Path from Cause to Effect.

Another great example comes from the Stimulus program, called the American Recovery and Reinvestment Act of 2009, after the Great Recession of 2008. The value of the Act was heavily debated, but given that it was a singular event, it is practically impossible to estimate the counterfactual, or the outcome if it were not done. As Jim Manzi pointed out in the great book Uncontrolled, the only thing we can say with high confidence given the conflicting opinions is that at least several Nobel laureates in economics would be directionally incorrect about its effects. We would not even know which of them were right or wrong, even after the fact.

Both lab experiments and randomized experiments need a setting in which we can replicate the conditions and observe two actions, one of which may be the null action. In online controlled experiments for websites, we achieve this by randomly choosing between two (or more) versions of the website for each user, keeping the experience consistent (sticky).
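
A minimal sketch of how sticky assignment can be implemented (hypothetical code, not the ExP system): hash the user ID with an experiment-specific salt, so the same user always lands in the same variant.

```python
import hashlib

def assign_variant(user_id: str, experiment_salt: str,
                   variants=("control", "treatment")) -> str:
    """Deterministically assign a user to a variant.

    Hashing (experiment_salt, user_id) gives a pseudo-random but *sticky*
    assignment: the same user always sees the same variant for this
    experiment, while different salts randomize independently.
    (A sketch of the idea, not any production system.)"""
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]

print(assign_variant("user-123", "ranking-exp-42"))  # same output on every call
```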

Lab versus real-world experiments

Our ability to simulate the real world in the lab (or focus group) may fail, and lab results may not be predictive of actual consumer behavior. People say one thing and then act differently outside the lab.

Here are two great examples.

1. Philips Electronics ran a focus group to gain insights into teenagers’ preferences for boom box features. The focus group attendees expressed a strong preference for yellow boom boxes during the focus group, characterizing black boom boxes as “conservative.” Yet when the attendees exited the room and were given the chance to take home a boom box as a reward for their participation, most chose black (Cross and Dixit, 2005, Customer-centric pricing: The surprising secret for profitability).

2. A study of food consumption when the bill is paid individually or split equally among diners gave different results in the lab vs. in a restaurant setting. In the restaurant setting, the six participants consumed more food when they knew the cost was being split evenly, with each paying 1/6th of the total, relative to each paying for their own order; but a similar study in the lab did not show a stat-sig increase in consumption (Gneezy, Haruvy, Yafe, 2004, The Inefficiency of Splitting the Bill).

Factors impacting the dependent variable of interest

This is where the biggest differences emerge. We can control a few variables in the lab, but when the number of potential variables to control is large, one must resort to randomized experiments.

Historically, statistics-based randomized controlled experiments really started with R.A. Fisher’s agricultural experiments in the 1920s. Drug approval in the US evolved from expert opinions to more objective randomized controlled trials (RCTs). In the software domain, many companies are adopting online controlled experiments to help evaluate features and aid product development. In all these settings, the number of factors is so large that it is practically impossible to control all of them. It is much easier and more trustworthy to randomly assign users (or experimental units) to the different variants.

In the early software development days, Microsoft tested software as a lab experiment: read the spec, design test cases, and check for the expected output. In 2005 when I joined Microsoft, Office had a ratio of 1:1 between developers and testers!

As Bing scaled its A/B testing program, it became clear that this is a more scalable way to evaluate software, not just because the test matrix grew too fast, but because it gave a way to assess the value of features by looking at users’ actions, so we could assess the value of ideas, not just whether the developers coded what the spec said.

Let’s say you’re making a change to a search engine’s ranking algorithm (e.g., Bing, Google). This changes the search engine result page (SERP) for millions of queries. How can we determine if the new algorithm is better? We can (and do) have judges that evaluate some queries, but the set is of limited size and the cost is high. We also never really know what the user’s intent was. We can tell the judge the user’s device (e.g., desktop or mobile phone) and location (when available), but our overall information about the user’s state of mind and intent is limited (and there are many factors).

Take the query “taxi” as an example. Are you looking for a cab? Are you looking for the TV series from the late 1970s and early 1980s? Are you looking for taxi.com, which claims to be the World’s leading independent Artist & Repertoire company? Are you looking for the Uber app? Perhaps the recent news on a taxi collision that you heard has closed a major highway to your destination?

The controlled experiment “integrates” over all these factors to give you the key metrics you care about for the different variants in the experiment. We can evaluate key metrics, such as: are users coming back more (Sessions/user), are more users clicking successfully (no back button within X seconds), where on the page they click (higher is usually better), how long it takes them to click, how long the page took to load, and how much revenue we made. While we make our ship decisions based on just a few metrics, we share thousands of metrics with our experimenters, as these can lead to interesting (often surprising) insights. Everyone building a feature believes that their new treatment will outperform the control, but 80% of our experiments are flat or negative on Bing, so the experience is very humbling.
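
For a flavor of the analysis, here is a minimal sketch (with synthetic data, not Bing metrics) of comparing one per-user metric between control and treatment:

```python
# A minimal sketch of comparing one per-user metric (e.g., sessions/user)
# between two variants with a two-sample t-test. Data here is synthetic;
# real pipelines compute thousands of such metrics from logged events.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.poisson(lam=3.0, size=100_000)     # sessions/user, control
treatment = rng.poisson(lam=3.03, size=100_000)  # sessions/user, treatment (~1% lift)

lift = treatment.mean() / control.mean() - 1
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"lift: {lift:+.2%}, p-value: {p_value:.4f}")
```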

Concurrent experiments

From statistical power formulas, and the need to be sure we are not degrading key metrics, Bing experiment treatments run on 10%-20% of users (see this discussion). Because many experiments need custom controls for counterfactual triggering, if experiments had to be disjoint, you could run only about 5 experiments at a given point in time!
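
To see why the power constraint bites, here is a standard two-sample power calculation (illustrative numbers only, not Bing’s): small effects require so many users per variant that each experiment claims a sizable slice of traffic.

```python
# A sketch of a standard two-sample power calculation (hypothetical effect
# size, not Bing's numbers): detecting a tiny standardized effect at 80%
# power and alpha=0.05 requires hundreds of thousands of users per variant.
from statsmodels.stats.power import NormalIndPower

n_per_group = NormalIndPower().solve_power(
    effect_size=0.005,  # hypothetical standardized effect (Cohen's d)
    alpha=0.05,
    power=0.8,
)
print(f"users needed per variant: {n_per_group:,.0f}")  # roughly 628,000
```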

The need to run more controlled experiments at Bing led to the design of a system that supports concurrent experiments. A user was no longer in a single experiment, but in multiple concurrent experiments. In Online Controlled Experiments at Large Scale, we noted that with users falling into 15 concurrent experiments, they fall into one of 30 billion possible variants of Bing! Try that in a lab experiment 😉.
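
A sketch of the layered idea (hypothetical code, not Bing’s actual system): each experiment hashes users with its own salt, so assignments are independent across experiments and every experiment still sees the full population; the possible combined experiences multiply quickly (e.g., 15 experiments with 5 variants each already gives 5^15, about 30 billion, combinations).

```python
import hashlib

def layered_assignment(user_id: str, experiments: dict) -> dict:
    """Assign a user independently in each experiment 'layer'.

    Each experiment hashes the user id with its own salt, so assignments
    across experiments are statistically independent.
    (A sketch of the idea, not any production system.)"""
    assignment = {}
    for exp_name, variants in experiments.items():
        digest = hashlib.sha256(f"{exp_name}:{user_id}".encode()).hexdigest()
        assignment[exp_name] = variants[int(digest, 16) % len(variants)]
    return assignment

# e.g., 15 concurrent experiments with 5 variants each (hypothetical numbers)
experiments = {f"exp{i}": [f"v{j}" for j in range(5)] for i in range(15)}
print(layered_assignment("user-123", experiments))
print(f"{5**15:,} possible combined experiences")
```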

Douglas Montgomery’s Design and Analysis of Experiments is a bible for “experimental engineers”.

One thing that I really like about Montgomery’s approach is that he focuses on building an empirical model. Well-designed experiments can lead to a model of system performance. When the results of a designed experiment are turned into an empirical model of the system under study, scientists and engineers can manipulate that model much like they would a mechanistic model.

I agree that this is an ideal goal. In several of my talks (most recently my Keynote talk at KDD 2015), I shared a four-stage cultural model for adoption of experimentation, going from (1) hubris to (2) measurement and control to (3) Semmelweis reflex and finally (4) fundamental understanding.

Building empirical models as Montgomery suggests can help gain a better understanding of the underlying factors. In the case of Semmelweis, we cracked the cause of “childbed fever” with Louis Pasteur’s discovery of Streptococcus.

That said, our attempts to generalize the results of thousands of controlled experiments have had limited success. See Seven Rules of Thumb for Web Site Experimenters for some Rules of Thumb.

The reality is that our models of users’ online behavior are very limited, and probably akin to the early days of Alchemy, when scientists were trying to turn lead into gold. Steve Krug’s book Don’t Make Me Think is a great resource I recommend, but his recommendations are based on an expert’s experience, with little data to let us understand their applicability and limits. Alchemy led to Chemistry, which transformed the world, but we need to be humble about how far we are from having good models of user behavior: we are still in the Alchemy days, where models are severely limited and a lot of random ideas are tried.

When engineers talk about robustness, I guess this fits your idea of trustworthiness. As you’ve said many times, “Generating numbers is easy; generating numbers you should trust is hard!” But isn’t generating trustworthy numbers the most difficult thing about experiments?

Robustness is a property normally associated with resisting perturbations or outliers.

Trustworthiness is a more general concept that covers the overall end-to-end system.

Here is a real example that we have not previously shared. When we started running experiments on mobile devices, we were using our well-trusted ExP system for analysis. The metrics were carried over from the desktop world and considered robust to the patterns we normally observed (outliers, robots, etc.).

But it turned out that the incoming click data was seriously flawed: every scroll event on the phone was registered as a click.

The ability to audit the data with a skeptical eye, investigate those metrics that seem to be suspicious, and identify the underlying causes, is what leads to trustworthy results.

A couple of months ago I shared a set of pitfalls on Quora and at ConversionXL. Understanding these and avoiding them can help improve the trustworthiness of results.

Many engineering-focused experimental designs are factorial designs: they study the effects of two or more factors. I do see the value of these factorial designs, because the world is pretty complex, and in the context you want to study, multiple things are often at play.

But this raises the complexity of experiments, and I guess it also limits the number of people who understand them. The current situation is that not many people fully understand even simple experiments and the need for them. Do you see more complex experiments as a way forward, or do you think they might harm the wider acceptance of experimentation?

We discussed this point in our survey paper. As you say, there are certainly cases where factorial designs can help due to interactions. In many cases, you will end up at the same hill by doing OFAT (One Factor At a Time), but obviously not in all cases.

Our approach has been to encourage explicit testing of combinations, but normally at low interaction levels: pick two factors and try all combinations, or pick three factors and pick those that “make sense.” In many cases, UX guidelines may limit some combinations, so you don’t want to try the full factorial.
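
For example, here is a small sketch of explicitly enumerating the combinations of two hypothetical factors, filtering out combinations that a UX guideline would disallow (factor names and the rule are made up):

```python
from itertools import product

# A sketch of testing two factors in full combination rather than OFAT.
# Factor names, levels, and the UX rule below are hypothetical.
title_colors = ["blue", "darker_blue"]
snippet_lengths = ["short", "long"]

def allowed(combo) -> bool:
    """Placeholder for UX guidelines that rule out some combinations."""
    color, length = combo
    return not (color == "darker_blue" and length == "long")  # hypothetical rule

variants = [c for c in product(title_colors, snippet_lengths) if allowed(c)]
print(variants)  # each surviving combination becomes a treatment in the experiment
```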

In relation to the cultural model, where users view controlled experiments as a tool, we favor simplicity and agility.

If you take the experimenters as subjects, a lot of interesting cognitive biases and illusions are at play. You mentioned Twyman’s law in one of your talks: “Any figure that looks interesting or different is usually wrong.”

What cognitive biases and illusions do you see happening a lot, and how can they be overcome?

The most common bias is to accept good results and investigate bad results.

If an experiment made a change to some feature unrelated to revenue, but revenue improved unexpectedly, the inclination is to come up with some half-baked explanation and celebrate the success. Conversely, if revenue declined unexpectedly, the experimenter will be inclined to claim “noise” or attempt to find a segment that has a large decline and make some association (e.g., Tuesday was bad, but this was the day we had this other issue, so we can’t trust Tuesday).

The best data scientists have a skeptical eye towards surprising results in both directions.

A common bias is to end an experiment as soon as it is statistically significant. In my recent Pitfalls talk at ConversionXL (slide 20), I noted that two good A/B testing books get the procedure wrong. This bias is easy to overcome by forcing the experimenter to decide on the experiment duration in advance, and ensuring it is a multiple of 7 days (e.g., one or two weeks is common).
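
One simple way to enforce this (a sketch with made-up numbers): convert the required sample size into a duration up front and round it up to whole weeks, so day-of-week effects are balanced and there is no temptation to peek and stop early.

```python
import math

def experiment_duration_days(required_users_per_variant: int,
                             daily_users_per_variant: int) -> int:
    """Pick the experiment duration in advance and round it up to a whole
    number of weeks. The inputs below are hypothetical, for illustration."""
    days = math.ceil(required_users_per_variant / daily_users_per_variant)
    return math.ceil(days / 7) * 7

print(experiment_duration_days(required_users_per_variant=630_000,
                               daily_users_per_variant=60_000))  # -> 14 days
```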

A common bias is to assume that a new feature that is underperforming suffers from a Primacy Effect: users are just used to the old way, and it takes time to adopt. Sometimes it is easy to fall into this trap by looking at cumulative graphs, a point we made in Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained (Section 3.3).

It is extremely rare for a feature to be stat-sig negative and turn stat-sig positive. Most of the time the “hope” dies after a while when the negative results don’t improve.

