Sometime in the last year, we came across Zach Holman’s post “Move Fast and Break Nothing,” about how GitHub replaced parts of its codebase with minimal risk of breaking things. At Work Market, we’re migrating from a monolithic application to microservices, and we wanted a safe way to extract components of the monolith into the new architecture. So today, we’re introducing Jan20.
The idea of succession is that you run the old “happy” path code and the new code in parallel. The results of both code paths are compared to catch any unwanted deviation, and then one of the two results is returned (more on that later). This way, you can verify that your new code behaves like the old code with minimal risk. Some kinds of changes can’t really be done this way: experimentally sending emails from both the old and the new code would be a terrible experience for end users, who would get duplicates of everything. But for many kinds of changes, it works really well.
We looked around for an experiment/succession framework for Java but didn’t find any acceptable options, so we wrote one. We’ve now “successioned” out five microservices with great success, and have two more in progress. With those successes behind us, we feel it’s solid enough to share. It’s the first of what we hope will be many projects Work Market releases as open source.
It was written to be usable from Java 7, since our monolith still runs on Java 7, but many parts of the succession framework read more cleanly with Java 8 lambdas. It also uses the RxJava library, which we use extensively in our microservices and their client libraries. The succession library itself reports metrics using Dropwizard Metrics (aka Codahale Metrics).
The code paths run for the control and the experiment are both Callables that return an Observable of the result type, or succinctly, Callable<Observable<T>>. The control and experiment Callables are then passed to the trial, along with an IsEqual and the experiment name.
As an example, let’s look at replacing the way we validate usernames and passwords. Assuming our old APIs already returned Observables, you’d have something like this:
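A minimal sketch of the two paths, assuming hypothetical LegacyAuth (the monolith) and AuthServiceClient (the new microservice’s client) types. To keep it runnable without RxJava, a List<Boolean> stands in for the Observable<Boolean> the real Callables would return. The control is written as a Java 7 anonymous class and the experiment as a Java 8 lambda, to show the difference in style mentioned above:

```java
import java.util.List;
import java.util.concurrent.Callable;

// Hypothetical old API in the monolith. A List stands in for rx.Observable here.
interface LegacyAuth {
    List<Boolean> checkPassword(String username, String password);
}

// Hypothetical client for the new password microservice.
interface AuthServiceClient {
    List<Boolean> checkPassword(String username, String password);
}

final class PasswordPaths {
    private PasswordPaths() {}

    // Control: the existing "happy path", written as a Java 7 anonymous class.
    static Callable<List<Boolean>> control(final LegacyAuth legacy,
                                           final String username, final String password) {
        return new Callable<List<Boolean>>() {
            @Override
            public List<Boolean> call() {
                return legacy.checkPassword(username, password);
            }
        };
    }

    // Experiment: the same check against the new service, written as a Java 8 lambda.
    static Callable<List<Boolean>> experiment(AuthServiceClient client,
                                              String username, String password) {
        return () -> client.checkPassword(username, password);
    }
}
```

Neither Callable does any work until the trial calls it, which is what lets the trial decide where (and whether) each path runs.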
Now we need to define exactly what we mean by “these behave the same.” That’s what the IsEqual interface is for. Why not just use Comparable? For the purpose of a trial, we don’t care about ordering, only about equality. Why not just use .equals()? Often, the notion of equality for an experiment isn’t the one you want for .equals(). For example, the new API might return either of two different enum values in a case where the old one returned just one. Or maybe only a few fields are significant for equality. Timestamp fields, in particular, often won’t match exactly between the two systems. Having a separate IsEqual gives you the flexibility to define what’s “close enough.”
For an experiment, there are actually two things to check: the result (in the example below, a Boolean), and the possibility that either the control or the experiment throws an exception. In most cases we’ve encountered, we only care that either both or neither of the control and the experiment threw. But there are cases where you might want to check what kind of exception was thrown, or other attributes of it. For the both-or-neither case, this is what the IsEqual would look like:
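A hedged sketch of such a both-or-neither check, assuming a simplified, hypothetical IsEqual<T> with a single isEqual(control, experiment) method (Jan20’s actual interface may differ). A null Throwable means that path didn’t throw, and each mismatch is recorded with a short description so it can be investigated later:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical stand-in for Jan20's IsEqual.
interface IsEqual<T> {
    boolean isEqual(T control, T experiment);
}

// Exceptions "match" when both paths threw or neither did; null means "did not throw".
final class BothOrNeitherThrew implements IsEqual<Throwable> {
    // Not thread-safe; a real implementation would log each mismatch instead of storing it.
    private final List<String> mismatches = new ArrayList<>();

    @Override
    public boolean isEqual(Throwable control, Throwable experiment) {
        boolean result = (control == null) == (experiment == null);
        if (!result) {
            // Record what differed, not just that something did.
            mismatches.add("control threw: " + describe(control)
                    + ", experiment threw: " + describe(experiment));
        }
        return result;
    }

    List<String> mismatches() {
        return mismatches;
    }

    private static String describe(Throwable t) {
        return t == null ? "nothing" : t.getClass().getSimpleName();
    }
}
```

To also check the flavor of exception, you could additionally compare control.getClass() with experiment.getClass() when both are non-null.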
It could be as simple as just returning result, but in experiments like this, when the results don’t match, you want to know what was different about them. In our code, we log mismatches to Kafka as JSON objects so that we can investigate them. We didn’t write a Kafka consumer for this; we just use the Kafka console consumer to read the topic and dump its contents to a file. You could instead just log them with whatever logging framework you’re already using.
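As a rough sketch of the logging side, the helper below turns a trial name and its list of mismatch descriptions into one JSON object and hands it to java.util.logging. The class name and JSON field names are made up for illustration; in a Kafka setup you would pass the same string to a producer instead:

```java
import java.util.List;
import java.util.logging.Logger;

// Hypothetical helper: one JSON object per mismatching trial run.
final class MismatchJson {
    private static final Logger LOG = Logger.getLogger("succession.mismatches");

    private MismatchJson() {}

    static String toJson(String trialName, List<String> mismatches) {
        StringBuilder sb = new StringBuilder("{\"trial\":").append(quote(trialName))
                .append(",\"mismatches\":[");
        for (int i = 0; i < mismatches.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(quote(mismatches.get(i)));
        }
        return sb.append("]}").toString();
    }

    // Send the JSON to whatever sink you use; here, just a logger.
    static void log(String trialName, List<String> mismatches) {
        LOG.warning(toJson(trialName, mismatches));
    }

    // Minimal JSON string escaping: quotes, backslashes and control characters.
    private static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"': sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }
}
```

One object per line keeps the dump from the console consumer easy to grep or load into other tools.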
For the IsEqual to actually check the result, we do something like this:
There’s a bit here, so let’s unpack it. For a single boolean value this is overkill, but when you’re comparing a bunch of fields, it makes things much more convenient: it lets you build a list of strings identifying where things didn’t match. startCompare returns an EqualChain, which offers several ways to “equal things.” Here, you see dotEquals, which uses .equals as you might expect. There’s also doubleEquals, which uses == instead, and compareTo, which uses the compareTo method (assuming your type implements Comparable). It’s also smart enough to warn you (your IDE shows deprecation warnings, right?) if you use one of these with a double or a float, in which case you should use the three-argument version and specify a delta.
The output of the whole thing is the list of mismatching elements (super handy when you’re sorting through mismatch data for more than a couple of fields), along with whether they were ultimately equal.
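To make the chained-comparison idea concrete, here is a small stand-in written from scratch; it is not Jan20’s EqualChain, and its method names (byEquals, byCompareTo, withinDelta) are hypothetical. Each call records a field description when the values differ, and the chain ends with both the mismatch list and an overall verdict:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Objects;

// Hypothetical chained comparison, illustrating the idea rather than Jan20's API.
final class FieldComparison {
    private final List<String> mismatches = new ArrayList<>();

    private FieldComparison() {}

    static FieldComparison start() {
        return new FieldComparison();
    }

    // Null-safe .equals() comparison.
    FieldComparison byEquals(String field, Object control, Object experiment) {
        if (!Objects.equals(control, experiment)) {
            record(field, control, experiment);
        }
        return this;
    }

    // compareTo() comparison, for types whose equals() is stricter than "same value".
    <T extends Comparable<? super T>> FieldComparison byCompareTo(String field, T control, T experiment) {
        boolean same = (control == null || experiment == null)
                ? control == experiment
                : control.compareTo(experiment) == 0;
        if (!same) {
            record(field, control, experiment);
        }
        return this;
    }

    // Floating-point comparison with a tolerance; NaN on either side counts as a mismatch.
    FieldComparison withinDelta(String field, double control, double experiment, double delta) {
        if (!(Math.abs(control - experiment) <= delta)) {
            record(field, control, experiment);
        }
        return this;
    }

    boolean isEqual() {
        return mismatches.isEmpty();
    }

    List<String> mismatches() {
        return Collections.unmodifiableList(mismatches);
    }

    private void record(String field, Object control, Object experiment) {
        mismatches.add(field + ": " + control + " vs " + experiment);
    }
}
```

A compareTo-style check matters for types like BigDecimal: new BigDecimal("1.0").equals(new BigDecimal("1.00")) is false because the scales differ, while compareTo returns 0.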
With these two things, we can then make the IsEqual<TrialResult<Boolean>> for the experiment.
Trial.makeIsEquals simply combines the two IsEquals for you. The .pairwiseEqual is there because an Observable can emit more than one element, so it does a pairwise comparison between the control and the experiment, failing if one emits more items than the other. Most of our APIs return Observables that emit either one element or an error, but that’s not always the case.
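A sketch of the same two ideas in plain Java, under stated assumptions: a hypothetical Outcome<T> holds the items a path emitted (a List standing in for the Observable’s emissions) plus the error it ended with, pairwise compares item lists element by element, and both combines an items check with an exception check. None of these names are Jan20’s:

```java
import java.util.List;

// Simplified, hypothetical stand-in for Jan20's IsEqual.
interface IsEqual<T> {
    boolean isEqual(T control, T experiment);
}

// Hypothetical result of one path: what it emitted, and the error it ended with (or null).
final class Outcome<T> {
    final List<T> items;
    final Throwable error;

    Outcome(List<T> items, Throwable error) {
        this.items = items;
        this.error = error;
    }
}

final class Combine {
    private Combine() {}

    // Equal only if both emitted the same number of items and each pair matches.
    static <T> IsEqual<List<T>> pairwise(IsEqual<T> itemEqual) {
        return (control, experiment) -> {
            if (control.size() != experiment.size()) {
                return false;
            }
            for (int i = 0; i < control.size(); i++) {
                if (!itemEqual.isEqual(control.get(i), experiment.get(i))) {
                    return false;
                }
            }
            return true;
        };
    }

    // Both checks must pass. The non-short-circuit & makes both always run,
    // so each can record its own mismatch details.
    static <T> IsEqual<Outcome<T>> both(IsEqual<List<T>> itemsEqual, IsEqual<Throwable> errorsEqual) {
        return (control, experiment) ->
                itemsEqual.isEqual(control.items, experiment.items)
                        & errorsEqual.isEqual(control.error, experiment.error);
    }
}
```

Using & rather than && matters only if your IsEquals record mismatch details as a side effect; with pure checks, either works.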
Now, it’d be good to actually run the trial and get the results.
The arguments should be pretty obvious, except for the last one: “checkpassword” is the name under which metrics will be reported. The returned value is the result of either the control or the experiment.
But how does this report metrics, and how does it decide which result to return (or whether to run each path at all)? That’s all set up where we construct the trial instance that we run trials under.
The metricRegistry is a regular Codahale metric registry. The third argument is the root metric name, under which individual trial runs create their own succession metrics. So for the trial example above, the succession metrics will be created under “password.succession.checkpassword”.
The fourth argument, a Supplier<WhichReturn>, tells the trial which result to return: the control’s or the experiment’s. The supplier can also return values that tell the trial to run only the control or only the experiment, without running the other at all. If the supplier returns CONTROL, the control runs in the calling thread and the experiment runs in a background thread. If it returns EXPERIMENT, the experiment runs in the calling thread and the control runs in the background. Either way, whichever path runs in the background doesn’t block the caller: we don’t wait for both to finish before returning, and the comparison happens in the background thread once both have completed.
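To illustrate the threading model, here is a simplified, hypothetical trial runner built on java.util.concurrent; it is not Jan20’s implementation, and the WhichReturn constants for the run-only-one-side modes are made-up names. The returned path runs on the calling thread. The other path runs on the executor, which then waits for the foreground result and calls onBothDone(control, experiment) there, so the caller never waits for the comparison:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.Executor;
import java.util.concurrent.RejectedExecutionException;
import java.util.function.BiConsumer;
import java.util.function.Supplier;

// Hypothetical: Jan20's real constants for "run only one side" may be named differently.
enum WhichReturn { CONTROL, EXPERIMENT, CONTROL_ONLY, EXPERIMENT_ONLY }

// The result of one code path: a value, or the exception it threw.
final class Outcome<T> {
    final T value;
    final Throwable error;

    Outcome(T value, Throwable error) {
        this.value = value;
        this.error = error;
    }

    static <T> Outcome<T> of(Callable<T> path) {
        try {
            return new Outcome<>(path.call(), null);
        } catch (Exception e) {
            return new Outcome<>(null, e);
        }
    }
}

final class MiniTrial {
    private final Executor background;
    private final Supplier<WhichReturn> whichReturn;

    MiniTrial(Executor background, Supplier<WhichReturn> whichReturn) {
        this.background = background;
        this.whichReturn = whichReturn;
    }

    <T> T run(Callable<T> control, Callable<T> experiment,
              BiConsumer<Outcome<T>, Outcome<T>> onBothDone) throws Exception {
        WhichReturn which = whichReturn.get();
        if (which == WhichReturn.CONTROL_ONLY) {
            return control.call();
        }
        if (which == WhichReturn.EXPERIMENT_ONLY) {
            return experiment.call();
        }
        boolean returnControl = which == WhichReturn.CONTROL;
        Callable<T> foreground = returnControl ? control : experiment;
        Callable<T> other = returnControl ? experiment : control;
        CompletableFuture<T> foregroundDone = new CompletableFuture<>();
        try {
            background.execute(() -> {
                Outcome<T> otherOutcome = Outcome.of(other);
                Outcome<T> foregroundOutcome;
                try {
                    // Blocks only this background thread, never the caller.
                    foregroundOutcome = new Outcome<>(foregroundDone.join(), null);
                } catch (CompletionException e) {
                    foregroundOutcome = new Outcome<>(null, e.getCause());
                }
                // Always hand results over in (control, experiment) order.
                if (returnControl) {
                    onBothDone.accept(foregroundOutcome, otherOutcome);
                } else {
                    onBothDone.accept(otherOutcome, foregroundOutcome);
                }
            });
        } catch (RejectedExecutionException e) {
            // Pool full: skip the background path; a real trial would count this in its metrics.
        }
        try {
            T value = foreground.call();
            foregroundDone.complete(value);
            return value;
        } catch (Exception | Error e) {
            foregroundDone.completeExceptionally(e);
            throw e;
        }
    }
}
```

If the executor rejects the background task, this sketch simply skips the comparison, and the caller still gets the foreground result (or exception) unchanged.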
The last two arguments are trial wrappers. They’re used when you need to set things up specially for whichever path runs in the main line (the first of the two) or in the background (the last argument). In this example, we don’t do anything for the foreground case, but the background wrapper makes sure the Hibernate context is properly set up for whichever of the control or experiment runs in the background.
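A sketch of what a trial wrapper could look like, assuming a hypothetical TrialWrapper that decorates a Callable. SessionContext is a made-up ThreadLocal stand-in for thread-bound state such as a Hibernate session; the background wrapper opens it before the path runs and always clears it afterwards, since pooled threads are reused:

```java
import java.util.concurrent.Callable;

// Hypothetical wrapper shape; Jan20's trial wrapper interface may look different.
interface TrialWrapper {
    <T> Callable<T> wrap(Callable<T> path);
}

// Made-up stand-in for per-thread context such as a bound Hibernate session.
final class SessionContext {
    private static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    private SessionContext() {}

    static String current() { return CURRENT.get(); }

    static void open(String name) { CURRENT.set(name); }

    static void close() { CURRENT.remove(); }
}

final class Wrappers {
    private Wrappers() {}

    // Foreground: the caller's thread already has its context, so pass the path through.
    static final TrialWrapper NO_OP = new TrialWrapper() {
        @Override
        public <T> Callable<T> wrap(Callable<T> path) {
            return path;
        }
    };

    // Background: set up context before the path runs, and always tear it down
    // so a reused pool thread doesn't leak state into the next trial.
    static final TrialWrapper WITH_SESSION = new TrialWrapper() {
        @Override
        public <T> Callable<T> wrap(final Callable<T> path) {
            return () -> {
                SessionContext.open("background-session");
                try {
                    return path.call();
                } finally {
                    SessionContext.close();
                }
            };
        }
    };
}
```

The wrapper has a generic method, so it is implemented with anonymous classes rather than lambdas.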
That leaves the thread pool, which runs whatever executes in the background. Why use an executor with limits like these? If the background code gets stuck or takes too long relative to the main-line code, we don’t want to wedge our clients, nor do we want a zillion blocked threads. The pool only covers the execution of the callable, and whatever manages the Observable may run into its own issues; either way, we want the whichReturned code to return unimpeded no matter what the other path does, and we want anything that goes awry in the background to have minimal effect on the overall application. If the executor’s queue fills up, that’s reported in the metrics. When we start a succession, we normally set the queue size to one and adjust if necessary.
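One way to build such a bounded pool with the standard library: a ThreadPoolExecutor with a fixed number of threads and an ArrayBlockingQueue of the chosen size, plus a rejection handler that counts and drops overflow work. The create method and its counter are hypothetical; in practice the count would feed your metrics:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

final class BoundedTrialPool {
    private BoundedTrialPool() {}

    // Fixed-size pool with a small bounded queue (queueSize must be at least 1).
    // When both are full, new background work is dropped and counted instead of
    // blocking the caller or spawning more threads.
    static ThreadPoolExecutor create(int threads, int queueSize, AtomicLong rejected) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threads, threads,
                60L, TimeUnit.SECONDS,
                new ArrayBlockingQueue<Runnable>(queueSize),
                (task, executor) -> rejected.incrementAndGet()); // drop the task, count it
        // Let idle threads exit so a quiet succession pool holds no threads.
        pool.allowCoreThreadTimeOut(true);
        return pool;
    }
}
```

Dropping the task, rather than throwing or running it on the caller (as ThreadPoolExecutor.CallerRunsPolicy would), keeps the returned path unimpeded when the background falls behind.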
Additionally, there’s a demo directory in the main repo with simple examples of how to run trials, plus example cases showing how each of the different metrics gets triggered.
Trials produce a number of metrics, such as the number of attempts, the number of successes and failures, whether exceptions were thrown while even trying to start a trial, and whether the thread pool was already full. In the end, this lets you make graphs that look like this (not all the metrics reported by the trials are shown):
This graph is actually pretty boring because everything is working, but that’s really the ideal, no? And when things do go sideways, you’ll know.
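As a stdlib-only sketch of how per-trial counters can hang off the root name, the class below builds names like password.succession.checkpassword.attempt. The event names are hypothetical, not Jan20’s actual metric names; with Dropwizard Metrics you would use MetricRegistry.name(...) and registry.counter(...) instead of a map of AtomicLongs:

```java
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical trial events, mirroring the kinds of metrics described above.
enum TrialEvent { ATTEMPT, MATCH, MISMATCH, START_ERROR, POOL_FULL }

final class TrialCounters {
    private final Map<String, AtomicLong> counters = new ConcurrentHashMap<>();
    private final String prefix;

    TrialCounters(String rootName, String trialName) {
        // e.g. "password" and "checkpassword" -> "password.succession.checkpassword"
        this.prefix = rootName + ".succession." + trialName;
    }

    // Thread-safe increment; the counter is created on first use.
    void mark(TrialEvent event) {
        counters.computeIfAbsent(nameOf(event), k -> new AtomicLong()).incrementAndGet();
    }

    long count(TrialEvent event) {
        AtomicLong c = counters.get(nameOf(event));
        return c == null ? 0 : c.get();
    }

    String nameOf(TrialEvent event) {
        return prefix + "." + event.name().toLowerCase(Locale.ROOT);
    }
}
```

Keeping every counter under one per-trial prefix is what makes it easy to graph a single trial’s attempts against its mismatches.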
Oh yeah, how do you get it? It’s called Jan20 (a reference to the date when US presidential succession occurs), and it’s available here.
I’d be remiss if I didn’t give proper credit to Matt Yott, my partner in crime on this :)