A/B testing is dying, and I’m here for it. Discrete, time-bound evaluation of a small set of interventions (sometimes only one intervention) just doesn’t consistently yield results that stay actionable.
There are too many things you could potentially test. In any real-world business situation, the number of things to test gets overwhelming very fast — if you’re using an A/B testing framework. The overwhelm is a limitation of the testing approach, not a feature of the testing environment.
It can take a long time to run a test, and a long, long time to run lots of tests. You have to be careful that different tests don’t overlap in the users they impact. You have to avoid days and locations that the business isn’t willing to tie up in a test. Running A/B tests monopolizes a lot of resources.
A test potentially sacrifices a lot of impact on the losing variant — if you run A vs. B for a month and find that A performed a lot better, that means you showed half of your users the low-performing variant for a whole month. You lost all of that value. No one is happy about that.
The long-term effectiveness of a test is never certain. The impact of any choice you make can be influenced by time of day, day of week, time of month, time of year, world events, changes in the market — just because A was better than B in the month that you tested it, doesn’t mean it will always be better. And no A/B test can tell you the shelf-life of its results.
If you want a bit more in-depth discussion of the problems with A/B testing, the folks over at Babbel have a good presentation on the subject, and this tutorial on bandit feedback is a great perspective from several industry leaders.
In a traditional A/B testing setting, you have variant A and you have variant B. In most real-world situations, either A or B is “better” only in the statistical sense.
If you run a test and A gets a 20% success rate and B gets a 10% success rate, A clearly “wins”…but what about the people who responded to B? Are they going to be ok with getting A? Both A/B tests and bandit algorithms force you to sacrifice your minority preferences for the sake of the majority. It doesn’t have to be that way — that’s just the way those particular instruments work. A better strategy is to get option A to the people who prefer option A, and option B to the people who respond to option B. So:
Let’s be generous and assume half of the people who responded to option B actually would have responded to option A if they’d seen that instead.
That means:

- Out of every 100 users, 20 respond to A, 10 respond to B, and 5 of those 10 would respond to either option.
- Show everyone the winning variant A, and you capture 20 responses per 100 users.
- Get each option to the people who respond to it, and you capture 25. That’s a quarter more value from exactly the same options.
So by adjusting how you deploy each treatment based on past results, you leave less value on the table. That’s all a bandit algorithm does — it hedges bets: B is half as successful as A, so you show B half as often. You can do this with lots of different options at the same time (not just A and B), the automatic deployment and readjustment makes it less costly to run tests, you don’t sacrifice as much value to losing variants, and the system can adjust to changes in user preferences or the larger decision environment.
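To make the hedging concrete, here’s a minimal sketch of one common bandit strategy, Thompson sampling, in Python. This is purely illustrative: the class, the priors, and the variant names are mine, not a description of any particular production system.

```python
import random

class BetaBandit:
    """Thompson sampling over two variants with Beta(1, 1) priors."""

    def __init__(self, variants=("A", "B")):
        # One [successes + 1, failures + 1] pair per variant.
        self.counts = {v: [1, 1] for v in variants}

    def choose(self):
        # Sample a plausible success rate from each variant's posterior
        # and show the variant whose sampled rate is highest. Weaker
        # variants still get shown sometimes -- that's the hedge.
        draws = {v: random.betavariate(a, b) for v, (a, b) in self.counts.items()}
        return max(draws, key=draws.get)

    def update(self, variant, success):
        # Shift that variant's posterior toward what we just observed.
        self.counts[variant][0 if success else 1] += 1

bandit = BetaBandit()
variant = bandit.choose()             # decide what to show this user
bandit.update(variant, success=True)  # record whether they responded
```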
All the problems of A/B testing solved!
But this can backfire.
How often are you going to show B to the people who prefer A, or show A to people who prefer B, because you’re basing your decision on aggregate statistics rather than individual preferences? It’s actually possible for the bandit algorithm to perform worse than the A/B test in these kinds of situations. And, of course, all of these things can change over time. Maybe half of the people who liked B actually change over time to prefer A. And a fourth of the people who liked A change over time to like B. Your aggregate statistics, and therefore the decision you’ll make about what to show to whom, will remain exactly the same. That’s not optimal.
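Run the numbers from the earlier example to see why (my arithmetic, using the 20%/10% rates from above):

```python
a_likers, b_likers = 20, 10           # per 100 users, from the example above
b_to_a = b_likers // 2                # half of the B-likers drift to A: 5
a_to_b = a_likers // 4                # a fourth of the A-likers drift to B: 5
a_after = a_likers - a_to_b + b_to_a  # 20
b_after = b_likers - b_to_a + a_to_b  # 10
print(a_after, b_after)               # 20 10: same aggregates, different people
```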
Regular bandit algorithms carry hidden costs — or, rather, they take the costs of A/B tests and shuffle them around to different places so you don’t notice them as easily. You set up your algorithm and start sending and everything looks great…until you start to realize some of the issues I mentioned in the previous paragraphs. Maybe the balance of preferences for A vs. B is different for, say, new users than it is for returning users. Maybe those preferences are different for different geographies. Maybe even experienced users can be divided into those who are power users and those who are just regulars. This is why people invented contextual bandits, which is really just a fancy term for bandits plus segmentation.
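In its simplest form, that can literally mean one independent bandit per segment. A hypothetical sketch, reusing the BetaBandit class from the snippet above (the segment attributes here are made up):

```python
from collections import defaultdict

# One independent bandit per (user type, geography) segment.
segment_bandits = defaultdict(BetaBandit)

def serve(user):
    # Route each user to the bandit for their segment.
    segment = (user["user_type"], user["geo"])  # hypothetical attributes
    return segment, segment_bandits[segment].choose()

def record(segment, variant, success):
    segment_bandits[segment].update(variant, success)
```

Every new segmentation dimension multiplies the number of bandits you’re running, which is exactly where the burden described next comes from.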
Now you have to do a lot more reporting to understand which segments of your user base might have different preference profiles. So you reduced reporting needed to analyze experiments, but you increased reporting needed to scope your bandit. And you’ve increased the amount of work needed to turn that reporting into actual scoping. And once you have these different segments, you realize that maybe you need more creatives to take that context into account, so that’s more work. And then there’s the engineering work to build out the pipelines that will get the right user into the right bandit. And there’s the work you need to do in your messaging system to make sure it supports all of this stuff going on in the background.
So bandits solve a lot of problems of A/B testing, but bandits that are truly effective create new analytic needs and new logistical hurdles that aren’t easy to solve. Which is one of the reasons A/B testing is still so popular: the process is common enough that there are a lot of tools to help with the heavy lifting.
So I helped design and build a product that makes complex contextual bandit testing easy — so easy that it creates a separate context for each individual user on your site or app. You can find more details about that product here, but that’s not really the point of this post, so I won’t talk any more about it. What I want to talk about here is how we solved the problem of evaluating hundreds of thousands of individualized adaptive tests per day.
The details can be found in our paper on arXiv.
I’ve written before about the practical, analytic, and sometimes even ethical challenges inherent in constructing a holdout group in order to evaluate experiments. I still stand by that. We evaluate our adaptive experiments using a synthetic control, because that doesn’t involve depriving any users of potentially beneficial interventions. However, traditional synthetic control methods can be full of analytic pitfalls, because you’re essentially modeling the baseline data-generating process for the environment in which you’re conducting your experiment. Throw in lots and lots of parallel experiments, many of which take place in overlapping environments, and an analytic solution to the control problem becomes…daunting.
Which is why we didn’t go that route.
Several years ago, Gary King and his colleagues at Harvard came up with a wonderfully simple method for drawing causal inference from observational data. It’s called Coarsened Exact Matching (CEM). You can find the seminal paper here and the theoretical underpinnings here.
CEM moves the complexity of causal inference away from analytic methods — you can use whatever method you prefer — and places it instead in how the dataset is created. It’s similar, conceptually, to over- or under-sampling an imbalanced dataset in a classification problem.
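The coarsening-then-matching step is easy to sketch. Here’s a toy version in pandas (my own simplification; real CEM implementations do more, such as computing imbalance metrics, and every column name here is invented):

```python
import pandas as pd

def coarsened_exact_match(df, treatment_col, coarsen):
    """Coarsen continuous covariates into bins, then keep only the
    strata that contain both treated and control observations."""
    df = df.copy()
    strata = []
    for col, bins in coarsen.items():
        df[col + "_bin"] = pd.cut(df[col], bins=bins)
        strata.append(col + "_bin")
    # A stratum is usable only if both groups appear in it.
    return df.groupby(strata, observed=True).filter(
        lambda s: s[treatment_col].nunique() == 2
    )

# Toy data: one row per user (entirely made up).
df = pd.DataFrame({
    "age":             [23, 27, 41, 45, 67, 62],
    "weekly_sessions": [ 2,  3,  8,  7,  1,  1],
    "got_treatment":   [ 1,  0,  1,  0,  1,  0],
})

matched = coarsened_exact_match(
    df,
    treatment_col="got_treatment",
    coarsen={"age": range(10, 91, 10), "weekly_sessions": [0, 1, 5, 20, 100]},
)
```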
What we realized was that we could use this same kind of logic to find appropriate control contexts for our bandit experiments by including time as one of the features to match on. We already match on certain intervention attributes — the type of intervention a user received and the level of activity the user exhibited on the app at the time of the intervention. But we also define an observation window and ensure that any matched user received an intervention close in time to the intervention for which we are seeking a control, but not within that intervention’s own observation window.
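Here’s a schematic of that time-aware matching rule, under my own assumptions about how the data is laid out (every field name here is illustrative, not taken from the paper):

```python
from datetime import timedelta

def candidate_controls(target, pool,
                       max_gap=timedelta(days=7),
                       observation_window=timedelta(days=2)):
    """Return interventions from `pool` that can serve as controls for
    `target`: same coarsened attributes, close in time, but outside the
    target's observation window. Field names are illustrative."""
    matches = []
    for other in pool:
        if other["user_id"] == target["user_id"]:
            continue  # don't match a user to themselves
        # Exact match on coarsened intervention attributes.
        if other["intervention_type"] != target["intervention_type"]:
            continue
        if other["activity_bin"] != target["activity_bin"]:
            continue
        gap = abs(other["sent_at"] - target["sent_at"])
        # Near in time, but not inside the observation window itself.
        if observation_window < gap <= max_gap:
            matches.append(other)
    return matches
```

In practice you would vectorize this over a dataframe rather than loop, but the rule itself is just a few comparisons.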
This allows us to have controls matched at the user level for the majority of the tests we run. Bandit algorithms get rid of some of the complexity of A/B testing at scale, but hide other parts of that complexity. Our control method takes that hidden complexity and resolves it so we can get the adaptive benefits of bandit assignment, but the clear inference and attribution of A/B testing.
Again, you can find more information in the paper on arXiv.