Amazon can do it, Google can do it, a lot of others can do it too: interruption-free service migrations.
So why are so many enterprise businesses, banks, and others not able to do it? Why do I still get a notification from my bank that there will be service interruptions between Saturday 4 p.m. and Sunday 10 a.m.? Why is there still that senior manager asking, upon a change request: “Will that change be disruptive?” 20 years ago, that might have been a reasonable question, but when was the last time you went to google.com and got a “service down due to maintenance” message?
I do not know why we are still talking about this. However, I do know how to do interruption-free service migrations. It’s very simple: just watch the cashiers at your local supermarket.
Let’s say the supermarket has 4 counters. Counters #1, #2, and #3 are open and have a cashier behind them. Now the cashier behind counter #2 is done for the day and needs a replacement.
Here’s what the cashier won’t do: stop everything, take her cash drawer out, and wait for her replacement, who has to put in his cash drawer, all while the whole queue of customers at that counter is waiting.
Instead, this is what will happen: a fourth cashier will open counter #4 and start serving customers. Next, the cashier at counter #2 will put up a closed sign to avoid getting more customers in her queue. She will still serve the remaining customers in her queue. Then she will close the counter.
Notice how this is a 4-step process:

1. Open the new counter (#4) and start serving customers there.
2. Put up a closed sign at the old counter (#2) so no new customers join its queue.
3. Serve the customers remaining in the old queue.
4. Close the old counter.
Also, notice how this leaves all possibilities open: the cashier at counter #2 could suddenly decide to re-open, or not to close at all, while the cashier at counter #4 still serves. Or the cashier at counter #4 could decide that it wasn’t a good idea after all, put up a closed sign, and leave once her remaining queue was served. It’s all seamless.
So why do we still have these nail-biting upgrade processes in production systems? These all-or-nothing, high-risk, no-rollback upgrade processes, potentially with service interruption, when those cashiers provide us with the perfect blueprint for how to do it?
Okay, I do realize that production systems are a bit more complex than what I described above. But as always, the way to deal with complexity is to break it down into small, digestible chunks. That is the hard part. Once it is done, any service upgrade can be performed gradually, with low risk and the ability to roll back at any point in time, just like the cashiers above could.
Let me use a few examples to make my point clear.
Almost any non-trivial system needs to deal with certificates, and certificates need to be replaced eventually. The problem is always the same: how do you make sure your clients and your servers get updated simultaneously?
The answer is: they don’t. Instead, the side verifying the validity of a certificate should accept both the new one and the old one for a transition period.
Here are the steps:

1. Deploy a change to the verifying side so that it accepts both the old and the new certificate.
2. Roll out the new certificate to the machines presenting it, one by one.
3. Once everything presents the new certificate, deploy a final change that removes the old certificate from the verifier’s trusted set.
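As a minimal sketch in Python (the fingerprint values and function name are made up for illustration), the verifying side could compare a peer certificate’s SHA-256 fingerprint against a set of trusted fingerprints that temporarily contains both certificates:

```python
import hashlib

# Sketch only: fingerprint values are made-up placeholders.
# During the transition, the verifier trusts BOTH certificates.
TRUSTED_FINGERPRINTS = {
    "3f2a-placeholder-old-cert-sha256",  # old certificate: removed in step 3
    "9b1c-placeholder-new-cert-sha256",  # new certificate
}

def is_peer_certificate_trusted(cert_der: bytes) -> bool:
    """Accept the peer if its certificate fingerprint is in the trusted set."""
    fingerprint = hashlib.sha256(cert_der).hexdigest()
    return fingerprint in TRUSTED_FINGERPRINTS
```

Step 3 then simply shrinks the set back to a single entry, and at every step you can stop or roll back.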
Let’s say you have 1000 machines accessing an SQL database to feed a web app. A new version of the web app uses a renamed column in the database. So you need to rename that column in the database and simultaneously upgrade all your 1000 machines to the new web app version. How can you do that?
You can’t. Instead, your new version must be able to understand a schema that contains both the new and the old column name. It must always try to use the new one first, falling back to the old one when necessary. When writing, it should always write to both the new and the old column.
Here’s how you deploy:

1. Add the new column to the database schema, keeping the old one in place.
2. Roll out the new app version machine by machine; old and new versions can coexist because both understand the schema.
3. Backfill the rows that still only have a value in the old column.
4. Once all 1000 machines run the new version and all data is migrated, drop the old column and remove the fallback code in a follow-up release.
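Here is a minimal sketch of the transition logic, assuming a hypothetical customers table whose name column is being renamed to customer_name (both columns exist in the schema during the transition):

```python
import sqlite3  # stand-in for your SQL client; table and column names are made up

def read_name(conn: sqlite3.Connection, customer_id: int) -> str:
    """Prefer the new column; fall back to the old one for unmigrated rows."""
    row = conn.execute(
        "SELECT customer_name, name FROM customers WHERE id = ?",
        (customer_id,),
    ).fetchone()
    new_value, old_value = row
    return new_value if new_value is not None else old_value

def write_name(conn: sqlite3.Connection, customer_id: int, value: str) -> None:
    """Write both columns during the transition, so old and new app
    versions can run side by side."""
    conn.execute(
        "UPDATE customers SET customer_name = ?, name = ? WHERE id = ?",
        (value, value, customer_id),
    )
    conn.commit()
```

The fallback read covers rows written before the backfill reaches them; the dual write keeps machines that still run the old version working.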
A very similar case to the previous one. Let’s say you have used S3 so far, but want to migrate to an S3-compatible storage service from another vendor. Your service may consist of thousands of machines.
Again, to do this, you introduce a code change: writes now go to both the new and the old storage service. Reads go to the new one, but fall back to the old one when a resource is not found. Deletes also go to both.
The migration plan goes like this:

1. Roll out the dual-write, fallback-read version of your service, machine by machine.
2. Copy the existing data from the old storage service to the new one in the background; new writes already land in both.
3. Once all data is present in the new store, roll out a version that talks only to the new service.
4. Decommission the old storage service.
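A minimal sketch of such a storage layer in Python with boto3 (the bucket name and endpoint URLs are placeholders):

```python
import boto3
from botocore.exceptions import ClientError

BUCKET = "app-data"  # made-up bucket name

# Both vendors speak the S3 API; the endpoints are placeholders.
new_store = boto3.client("s3", endpoint_url="https://new-vendor.example")
old_store = boto3.client("s3", endpoint_url="https://old-vendor.example")

def put(key: str, body: bytes) -> None:
    """Writes go to BOTH stores during the migration."""
    new_store.put_object(Bucket=BUCKET, Key=key, Body=body)
    old_store.put_object(Bucket=BUCKET, Key=key, Body=body)

def get(key: str) -> bytes:
    """Reads prefer the new store and fall back to the old one."""
    try:
        return new_store.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    except ClientError as err:
        if err.response["Error"]["Code"] != "NoSuchKey":
            raise
        return old_store.get_object(Bucket=BUCKET, Key=key)["Body"].read()

def delete(key: str) -> None:
    """Deletes go to both stores, so nothing reappears via the fallback."""
    new_store.delete_object(Bucket=BUCKET, Key=key)
    old_store.delete_object(Bucket=BUCKET, Key=key)
```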
Note that you might need to temporarily increase the number of machines for your service, because the duplicated storage operations are more expensive.
Looking at the examples, specifically the last one, one thing has become clear by now: to do migrations like these, you need to own the code. If you’re operating a system that you cannot change, you won’t be able to make the changes that enable interruption-free migrations.
It’s probably for that reason that newer forms of organization, e.g. Amazon’s two-pizza teams following a “You build it, you run it” culture, are better suited for such interruption-free service migrations than the more classic ones where there is a strict separation between development and operations.
It’s also obvious that these kinds of migration plans require teams to be able to do frequent deployments with small changes. At any point in time during such a migration, you must be able to stop or roll back. That’s difficult when your update contains other, unrelated changes and when your deployment process is manual and slow.
It also helps to have a mechanism to deploy to so-called canaries: single machines that get the update first. Only if they prove to work correctly will the update be rolled out to more machines.
If you wonder what can help you do these kinds of deployments, use BOSH. It probably cannot compete in scale with Amazon’s internal deployment system, Apollo, but it’s one of the most powerful tools you can currently get, especially being free and open source.
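For illustration, canaries in BOSH are configured in the deployment manifest’s update block, roughly like this (the values shown are placeholders):

```yaml
update:
  canaries: 1                      # update a single canary instance first
  canary_watch_time: 30000-60000   # ms to wait for the canary to report healthy
  max_in_flight: 2                 # then update at most 2 instances at a time
  update_watch_time: 30000-60000   # ms to wait for each updated instance
```

If the canary fails to come up healthy, the deployment stops before the change reaches the rest of the fleet.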
Cashiers at your local supermarket provide the blueprint for interruption-free service migrations. Using this blueprint, you can break down your (potentially huge) migration into small changes that together form a seamless, interruption-free migration. Embracing an organizational structure that merges development and operations helps you implement the small changes that make up the actual migration. Finally, a powerful deployment system like BOSH lets you deploy those changes in an automated way.