How to Run Tons of Experiments at the Same Time Using an Adaptive Control Group

by Schaun Wheeler, May 15th, 2023

A/B testing is dying, and I’m here for it. Discrete, time-bound evaluation of a small set of interventions (sometimes only one intervention) just doesn’t consistently yield lastingly actionable results.


  • There are too many things you could potentially test. In any real-world business situation, the number of things to test gets overwhelming very fast — if you’re using an A/B testing framework. The overwhelm is a limitation of the testing approach, not a feature of the testing environment.


  • It can take a long time to run a test, and a long, long time to run lots of tests. You have to be careful that different tests don’t overlap in the users they impact. You have to avoid days and locations that the business isn’t willing to tie up in a test. It monopolizes a lot of resources to run A/B tests.


  • A test potentially sacrifices a lot of impact on the losing variant — if you run A vs. B for a month and find that A performed a lot better, that means you showed half of your users the low-performing variant for a whole month. You lost all of that value. No one is happy about that.


  • The long-term effectiveness of a test is never certain. The impact of any choice you make can be influenced by time of day, day of week, time of month, time of year, world events, changes in the market — just because A was better than B in the month that you tested it, doesn’t mean it will always be better. And no A/B test can tell you the shelf-life of its results.


If you want a bit more in-depth discussion of the problems with A/B testing, the folks over at Babbel have a good presentation on the subject, and this tutorial on bandit feedback is a great perspective from several industry leaders.

Multi-armed bandits are the future, and the future is now, and now we have new problems.

In a traditional A/B testing setting, you have variant A and you have variant B. In most real-world situations, either A or B is “better” only in the statistical sense.


If you run a test and A gets a 20% success rate and B gets a 10% success rate, A clearly “wins”…but what about the people who responded to B? Are they going to be ok with getting A? Both A/B tests and bandit algorithms force you to sacrifice minority preferences for the sake of the majority. It doesn’t have to be that way; that’s just the way those particular instruments work. A better strategy is to get option A to the people who prefer option A, and option B to the people who prefer option B. So:


  • Send option A to 100 people and option B to 100 people.
  • Option A’s 20% success rate means you got 20 successes.
  • Option B’s 10% success rate means you got 10 successes.


Let’s be generous and assume half of the people who responded to option B actually would have responded to option A if they’d seen that instead.


That means:


  • Showing only option A after the test is done yields a 12.5% success rate (the 20 who responded to A, plus the 5 who responded to B but would have responded to A, divided by the 200 total people across both groups).
  • Sending option A to people who want A and B to people who want B yields a 15% success rate. (A quick check of this arithmetic follows below.)
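Here is that arithmetic in plain Python, using only the group sizes and success counts from the example above:

```python
# 100 users saw A (20 successes), 100 saw B (10 successes), as in the example.
n_total = 200
successes_a, successes_b = 20, 10

# Generous assumption from above: half of B's responders would also respond to A.
b_responders_who_would_take_a = successes_b // 2  # 5

# Strategy 1: show everyone option A after the test.
only_a_rate = (successes_a + b_responders_who_would_take_a) / n_total  # 0.125

# Strategy 2: route each user to the variant they actually prefer.
matched_rate = (successes_a + successes_b) / n_total  # 0.15

print(only_a_rate, matched_rate)
```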


So by adjusting how you deploy each treatment based on observed results, you leave less value on the table. That’s all a bandit algorithm does: it hedges bets. If B is half as successful as A, you show B about half as often as you show A. You can do this with lots of different options at the same time (not just A and B), the automatic deployment and readjustment makes it less costly to run tests, you don’t sacrifice as much value to losing variants, and the system can adjust to changes in user preferences or the larger decision environment.
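For concreteness, here is a minimal sketch of that kind of hedging using Thompson sampling with Beta posteriors. This is a generic illustration of a bandit, not the specific algorithm behind any particular product:

```python
import random

# Beta(successes + 1, failures + 1) posterior for each variant.
stats = {"A": {"successes": 0, "failures": 0},
         "B": {"successes": 0, "failures": 0}}

def choose_variant():
    # Sample a plausible success rate for each variant and play the best draw.
    draws = {v: random.betavariate(s["successes"] + 1, s["failures"] + 1)
             for v, s in stats.items()}
    return max(draws, key=draws.get)

def record_outcome(variant, converted):
    stats[variant]["successes" if converted else "failures"] += 1

# As evidence accumulates, a variant that converts half as often gets shown
# less and less, but it is never abandoned entirely: the algorithm keeps hedging.
```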


All the problems of A/B testing solved!


But this can backfire.


How often are you going to show B to the people who prefer A, or show A to people who prefer B, because you’re basing your decision on aggregate statistics rather than individual preferences? It’s actually possible for the bandit algorithm to perform worse than the A/B test in these kinds of situations. And, of course, all of these things can change over time. Maybe half of the people who liked B actually change over time to prefer A. And a fourth of the people who liked A change over time to like B. Your aggregate statistics, and therefore the decision you’ll make about what to show to whom, will remain exactly the same. That’s not optimal.


Regular bandit algorithms carry hidden costs. Or rather, they take the costs of A/B tests and shuffle them around to different places so you don’t notice them as easily. You set up your algorithm and start sending and everything looks great…until you start to realize some of the issues I mentioned in the previous paragraphs. Maybe the balance of preferences for A vs. B is different for, say, new users than it is for returning users. Maybe those preferences are different for different geographies. Maybe even experienced users can be divided into power users and regulars. This is why people invented contextual bandits, which is really just a fancy term for bandits plus segmentation.
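In that simplest “bandits plus segmentation” sense, a contextual bandit can be sketched as one independent bandit per segment. The segmentation rule below is made up purely for illustration:

```python
import random
from collections import defaultdict

VARIANTS = ["A", "B"]
# One Beta-Bernoulli bandit per (segment, variant) pair.
counts = defaultdict(lambda: {"successes": 0, "failures": 0})

def segment_of(user):
    # Hypothetical rule: split users into new, returning, and power users.
    if user["sessions"] < 3:
        return "new"
    return "power" if user["sessions"] > 50 else "returning"

def choose_variant(user):
    seg = segment_of(user)
    draws = {v: random.betavariate(counts[(seg, v)]["successes"] + 1,
                                   counts[(seg, v)]["failures"] + 1)
             for v in VARIANTS}
    return max(draws, key=draws.get)

def record_outcome(user, variant, converted):
    key = "successes" if converted else "failures"
    counts[(segment_of(user), variant)][key] += 1
```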


Now you have to do a lot more reporting to understand which segments of your user base might have different preference profiles. So you reduced reporting needed to analyze experiments, but you increased reporting needed to scope your bandit. And you’ve increased the amount of work needed to turn that reporting into actual scoping. And once you have these different segments, you realize that maybe you need more creatives to take that context into account, so that’s more work. And then there’s the engineering work to build out the pipelines that will get the right user into the right bandit. And there’s the work you need to do in your messaging system to make sure it supports all of this stuff going on in the background.


So bandits solve a lot of problems of A/B testing, but bandits that are truly effective create new analytic needs and new logistical hurdles that aren’t easy to solve. Which is one of the reasons A/B testing is still so popular: the process is common enough that there are a lot of tools to help with the heavy lifting.

Dynamic testing requires dynamic evaluation, which requires a dynamic control group.

So I helped design and build a product that makes complex contextual bandit testing easy: so easy that it creates a separate context for each individual user on your site or app. You can find more details about that product here, but that’s not really the point of this post, so I won’t talk any more about it. What I want to talk about here is how we solved the problem of evaluating hundreds of thousands of individualized adaptive tests per day.


The details can be found in our paper on arXiv.


I’ve written before about the practical, analytic, and sometimes even ethical challenges inherent in constructing a holdout group in order to evaluate experiments. I still stand by that. We evaluate our adaptive experiments using a synthetic control, because that doesn’t involve depriving any users of potentially beneficial interventions. However, traditional synthetic control methods can be full of analytic pitfalls, because you’re essentially modeling the baseline data-generating process for the environment in which you’re conducting your experiment. Throw in lots and lots of parallel experiments, many of which take place in overlapping environments, and an analytic solution to the control problem becomes…daunting.


Which is why we didn’t go that route.


Gary King and his colleagues at Harvard, several years ago, came up with a wonderfully simple method for drawing causal inference from observational data. It’s called Coarsened Exact Matching (CEM). You can find the seminal paper here and the theoretical underpinnings here.


The idea is simple (there’s a code sketch of these steps after the list):


  1. Collect all of your observations of your intervention (test) taking place.
  2. Collect a bunch of observations where the intervention did not take place but could have.
  3. Pick attributes that can measure the similarity between any particular pair of observations from the two groups.
  4. “Coarsen” attributes into categorical variables. So if “age” is an attribute, you can bin it into age categories.
  5. Match each intervention observation to a non-intervention observation based on exact matching of coarsened attributes. This means you’ll pick only a subset of non-intervention observations, and often you’ll also end up dropping some of your intervention observations as well, but what you have left will be matched.
  6. Model the difference between the two refined groups.
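Here is the promised sketch of those six steps using pandas, assuming a flat table of observations with a treated flag, an outcome column, and a few matching attributes. The column names and bin edges are placeholders for illustration, not anything from King’s implementation:

```python
import pandas as pd

def coarsened_exact_match(df, treated_col="treated", outcome_col="outcome"):
    """Match treated to untreated observations on coarsened attributes
    (steps 1-5), then estimate the effect on the matched set (step 6)."""
    df = df.copy()

    # Step 4: coarsen continuous attributes into categorical bins.
    df["age_bin"] = pd.cut(df["age"], bins=[0, 25, 40, 60, 120])
    df["activity_bin"] = pd.qcut(df["weekly_sessions"], q=4, duplicates="drop")

    # Step 3: attributes used to judge similarity between observations.
    strata = ["age_bin", "activity_bin", "platform"]

    # Step 5: keep only strata that contain both treated and untreated rows;
    # observations in strata without a match (from either group) are dropped.
    has_both = df.groupby(strata, observed=True)[treated_col].transform(
        lambda s: s.any() and not s.all())
    matched = df[has_both.astype(bool)]

    # Step 6: a simple difference in means on the matched sample
    # (any model you prefer could go here instead).
    effect = (matched.loc[matched[treated_col] == 1, outcome_col].mean()
              - matched.loc[matched[treated_col] == 0, outcome_col].mean())
    return matched, effect
```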


CEM moves the complexity of causal inference away from analytic methods — you can use whatever method you prefer — and places it instead in the dataset creation step. It’s similar, conceptually, to over- or under-sampling an imbalanced dataset in a classification problem.


What we realized was that we could use this same kind of logic to find appropriate control contexts for our bandit experiments by including time as one of the features to match on. We already match on certain intervention attributes — the type of intervention a user received and the level of activity the user exhibited on the app at the time of the intervention. But then we also define an observation window and ensure that any matched user will have received an intervention at a time period close to the intervention for which we are seeking a control, but not within the observation period of the intervention itself.


This allows us to have controls matched at the user level for the majority of the tests we run. Bandit algorithms get rid of some of the complexity of A/B testing at scale, but hide other parts of that complexity. Our control method takes that hidden complexity and resolves it so we can get the adaptive benefits of bandit assignment, but the clear inference and attribution of A/B testing.

So here’s your to-do list (sketched in code after the list):

  1. For every intervention you make, identify a look-ahead and a look-behind window. The look-ahead window is what you use to see how the user responded to the intervention, and the look-behind window is where you look for control cases.
  2. For each intervention, identify a pool of other interventions that (1) took place within the look-behind window, and (2) don’t have a look-ahead window that overlaps the look-ahead window of the intervention for which you are seeking a control.
  3. Match the users who received those potential control interventions with the user who received the intervention for which you are seeking a control. You can match on any criteria you want - level of activity, similarity of intervention received, etc.
  4. Randomly select one user from those who make it through the matching process.
  5. Pretend you sent the original intervention not only to the user who actually received it, but also to the user whom you have selected as a control.
  6. Measure the difference in response between your test and control users for whatever time period you are interested in.
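And here is that list sketched in pandas. The schema (user_id, sent_at, intervention_type, activity_level) and the window lengths are illustrative assumptions, not the exact implementation from the arXiv paper:

```python
import pandas as pd

LOOK_AHEAD = pd.Timedelta("7D")    # step 1: window for measuring response
LOOK_BEHIND = pd.Timedelta("30D")  # step 1: window for finding control candidates

def pick_control(target, interventions):
    """Return one matched control intervention for `target`, or None."""
    others = interventions[interventions["user_id"] != target["user_id"]]

    # Step 2: candidates sent within the look-behind window whose own
    # look-ahead window ends before the target's look-ahead window begins.
    in_window = others["sent_at"].between(target["sent_at"] - LOOK_BEHIND,
                                          target["sent_at"])
    no_overlap = (others["sent_at"] + LOOK_AHEAD) < target["sent_at"]
    candidates = others[in_window & no_overlap]

    # Step 3: match on whatever criteria you want; here, intervention type
    # and the user's activity level at send time.
    candidates = candidates[
        (candidates["intervention_type"] == target["intervention_type"])
        & (candidates["activity_level"] == target["activity_level"])
    ]
    if candidates.empty:
        return None

    # Step 4: randomly select one matched user as the control.
    return candidates.sample(n=1).iloc[0]

# Steps 5-6: treat the selected control user as if they had received the
# target intervention, then compare the two users' responses over the
# look-ahead window.
```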




Again, you can find more information in the paper on arXiv.