Amazon can do it, Google can do it, a lot of others can do it too: interruption-free service migrations.

So why are so many enterprise businesses, banks, and others not able to do it? Why do I still get a notification from my bank that there will be service interruptions between Saturday 4 p.m. and Sunday 10 a.m.? Why is there still that senior manager asking, upon a change request: "Will that change be disruptive?" 20 years ago, that might have been a reasonable question, but when was the last time you went to google.com and got a “service down due to maintenance”?

I do not know why we are still talking about this. However, I do know how to do interruption-free service migrations. It’s very simple: just watch and learn from the cashiers at your local supermarket.

Let’s say they have 4 counters. #1, 2, and 3 are open and have a cashier behind them. Now the cashier behind counter #2 is done for the day and needs a replacement.

Here’s what the cashier won’t do: stop everything, take the cash drawer out, and wait for her replacement, who has to put in his cash drawer, all while the whole queue of customers at that counter is waiting.

In fact, this is what will happen: A fourth cashier will open counter #4 and start serving customers. Next, the cashier at counter #2 will put up a closed sign to avoid getting more customers in the queue. She will still serve the remaining customers in her queue. Then she will close it.

Notice how this is a 4-step process:

1. Open a new service
2. Route traffic to the new service
3. Serve all remaining traffic in the old service
4. Close the old service

Also, notice how this leaves all possibilities open: The cashier at counter #2 could suddenly decide to re-open, or not close at all, while the cashier at counter #4 still serves. Or the cashier at counter #4 could decide that it wasn’t a good idea, put up a closed sign, and leave after the remaining queue was served. It’s all seamless.
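To make the four steps concrete, here is a minimal sketch in Python. It is only a toy model of the supermarket, not a real load balancer: the `Counter`, `Store`, and `migrate` names are made up for this illustration, and the "shortest queue" routing rule is an assumption.

```python
from collections import deque


class Counter:
    """One service instance (a checkout counter) with its own queue."""

    def __init__(self, name):
        self.name = name
        self.accepting = True  # False once the "closed" sign is up
        self.queue = deque()
        self.served = []

    def serve_all(self):
        # Serve everyone already waiting; nobody in the queue is dropped.
        while self.queue:
            self.served.append(self.queue.popleft())


class Store:
    """Routes each new customer to the accepting counter with the shortest queue."""

    def __init__(self):
        self.counters = {}

    def open(self, name):
        self.counters[name] = Counter(name)

    def put_up_closed_sign(self, name):
        self.counters[name].accepting = False

    def route(self, customer):
        candidates = [c for c in self.counters.values() if c.accepting]
        target = min(candidates, key=lambda c: len(c.queue))
        target.queue.append(customer)
        return target.name

    def close(self, name):
        counter = self.counters[name]
        if counter.queue:
            raise RuntimeError(f"counter {name} still has customers waiting")
        return self.counters.pop(name)


def migrate(store, old, new):
    store.open(new)                  # 1. open a new service
    store.put_up_closed_sign(old)    # 2. new traffic is routed to the new service
    store.counters[old].serve_all()  # 3. serve all remaining traffic in the old service
    return store.close(old)          # 4. close the old service
```

Because the old counter keeps existing until step 4, a rollback at any earlier point is just a matter of setting `accepting` back to `True`.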
Production Systems

So why do we still have these nail-biting upgrade processes in production systems? These all-or-nothing, high-risk, no-rollback upgrade processes, potentially with service interruption, when those cashiers provide us with the perfect blueprint of how to do it?

Okay, I do realize that production systems are a bit more complex than what I described above. But as always, the way to deal with complexity is to break it down into small, digestible chunks. This is the hard part. Once that is done, any service upgrade can be done gradually, with low risk and with rollback at any point in time, just like the cashiers could do it above.

Let me use some examples to make my point clear.

Examples

Changing Certificates

Almost any non-trivial system needs to deal with certificates. The problem is always: how do you make sure your clients and your servers get updated simultaneously?

The answer is, they don’t. Instead, the side verifying the validity of a certificate should allow a new one and an old one for the transition period.

Here are the steps:

1. Add the new certificate to the list of accepted certificates on the client (assuming this is the verifying side).
2. On the server, replace the old certificate with the new one.
3. Wait for all requests using the old certificate to finish.
4. On the client, remove the old certificate.

Update DB Schema

Let’s say you have 1000 machines accessing an SQL database to feed a web app. A new version of the web app uses a renamed column in the database. So you need to rename that column in the database and simultaneously upgrade all your 1000 machines to the new web app version. How can you do that?

You can’t. Instead, your new version must be able to understand a schema which contains both the new column name and the old column name. When reading, it must always try the new column first and fall back to the old one when necessary. When writing, it should always write to both the new and the old column.
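The read and write rules for the transition period could look roughly like the following sketch. It uses Python's built-in `sqlite3` with an in-memory database as a stand-in for the real SQL database; the `users` table, the column names `username` and `login_name`, and all function names are hypothetical and only serve the illustration.

```python
import sqlite3

OLD_COL = "username"    # hypothetical old column name
NEW_COL = "login_name"  # hypothetical new column name


def make_demo_db():
    """Create an in-memory database whose schema already has both columns."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        f"CREATE TABLE users (id INTEGER PRIMARY KEY, {OLD_COL} TEXT, {NEW_COL} TEXT)"
    )
    # A row written by the old app version: only the old column is filled.
    conn.execute(f"INSERT INTO users (id, {OLD_COL}) VALUES (1, 'alice')")
    conn.commit()
    return conn


def read_name(conn, user_id):
    """Try the new column first, fall back to the old one if it is empty."""
    row = conn.execute(
        f"SELECT {NEW_COL}, {OLD_COL} FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    if row is None:
        return None
    new_value, old_value = row
    return new_value if new_value is not None else old_value


def write_name(conn, user_id, value):
    """Always write both columns so old and new app versions see the same data."""
    conn.execute(
        f"UPDATE users SET {NEW_COL} = ?, {OLD_COL} = ? WHERE id = ?",
        (value, value, user_id),
    )
    conn.commit()
```

With `conn = make_demo_db()`, `read_name(conn, 1)` falls back to the old column, and after `write_name(conn, 1, "alice2")` both columns hold the new value. Writing both columns is what keeps machines still running the old version working during the staggered rollout.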
Here’s how you deploy:

1. Add the new column to the database. (Open counter #4)
2. Deploy your new web app version to your machines in a staggered manner, potentially with a canary. (Reroute traffic to counter #4, put up a closed sign)
3. Have a background job that copies all old column values to the new one. (Serve all remaining customers at counter #2)
4. Remove the fallback mechanism in the web app and deploy again (again staggered). (Close counter #2)
5. Optionally: Remove the old column. (Tear down counter #2 :-))

A New Storage Service Backend

This one is very similar to the previous example. Let’s say you used S3 so far, but want to migrate to another S3-compatible storage service from another vendor. Your service may consist of thousands of machines.

Again, to do this, you introduce a code change: writes now go to the new and the old storage service. Reads go to the new one, but fall back to the old one when a resource is not found. Deletes also go to both.

The migration plan goes like this:

1. Set up the new data store. (Open counter #4)
2. Deploy your updated system to your machines. (Reroute traffic to counter #4, put up a closed sign)
3. Have a background job that copies all your data from the old storage service to the new one. (Serve all remaining customers at counter #2)
4. Remove the code change, configure the new data store only. (Close counter #2)

Note that you might need to increase the number of machines for your service temporarily, because of the more expensive data storage operations.

Implications

Looking at the examples, specifically the last one, one thing has become clear by now: to do migrations like these, you need to own the code. If you’re operating a system that you cannot change, you won’t be able to make changes that help interruption-free migrations.

That is probably why newer forms of organization, like Amazon’s two-pizza teams that follow a “You build it, you run it” culture, are better suited for such interruption-free service migrations than the more classic ones with a strict separation between development and operations.

It’s also obvious that these kinds of migration plans require teams to be able to do frequent deployments with small changes. At any point in time during such a migration you must be able to stop or roll back. That’s difficult when your update contains other unrelated changes and when your deployment process is manual and slow.

It also helps to have a mechanism to deploy to so-called canaries: single machines that get the update first. Only if they prove to work correctly will the update be rolled out to more machines.

If you wonder what can help you do this kind of deployment, use BOSH. It probably cannot compete in scale with Amazon’s internal deployment system, Apollo. But it’s one of the most powerful you can currently get, especially being free and open source.

Conclusion

Cashiers at your local supermarket can give you the blueprint for doing interruption-free service migrations. Using this blueprint, you can break down your (potentially huge) migration into small changes that provide seamless, interruption-free migrations. Embracing an organizational structure that merges developers and operators helps to implement such small changes that aid the actual migration. Finally, use a powerful deployment system like BOSH to deploy the changes in an automated way.

Interruption-free service migrations: what you can learn from the cashiers at your local…
