Your infrastructure will fail. It’s not an if but a when. As the rise of microservices and serverless makes apps more distributed, the number of potential fault points is rising exponentially. We may engineer our systems to expect certain failures only to make things worse, such as well-intended retry logic that further overloads an already stressed server and causes failures to cascade across the enterprise.
In the old days you had a person or one team who understood your system so well that they could engineer against most failures or immediately diagnose and fix the unexpected production failures that did slip through. That was possible with a monolithic app. With a microservices architecture, those days are coming to an end. We now need an approach to system resilience that assumes the system is too complex for humans to understand and assumes things will break in ways we cannot predict. Chaos engineering is a methodology that takes that approach.
Chaos engineering discovers your system’s faults by intentionally injecting problems into production systems in a controlled manner. The injected faults are wide ranging: added latency, a simulated disk failure, a node outage, or even the simulated outage of an entire region.
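As a rough illustration of what "injecting a problem in a controlled manner" can mean for the latency case, here is a minimal Python sketch. The function name with_injected_latency and all of its parameters are hypothetical, made up for this example; it is not how any particular chaos tool works.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_injected_latency(
    call: Callable[[], T],
    delay_seconds: float,
    injection_rate: float,
    rng: random.Random,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, first adding an artificial delay to a fraction of invocations.

    `injection_rate` is the probability (0.0 to 1.0) that a given call is
    delayed. `sleep` is injectable so the delay can be faked in unit tests.
    """
    if rng.random() < injection_rate:
        sleep(delay_seconds)  # the injected fault: extra latency before the real call
    return call()


if __name__ == "__main__":
    rng = random.Random(42)
    # Delay roughly 1 in 10 calls by 200 ms; the lambda stands in for a real dependency.
    for _ in range(5):
        print(with_injected_latency(lambda: "response", 0.2, 0.1, rng))
```

Keeping the injection rate and delay as explicit parameters is what makes the fault "controlled": the experiment can start small and be dialed down to zero at any time.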
Benefits of Chaos Engineering
Pioneered by Netflix, it is still a fairly new methodology. Many firms have shown interest in it, but few outside the tech industry are executing it. Of those that have, few have reached a high level of maturity. Yet the firms I have spoken with that have either implemented it or have clients who have implemented it all spoke positively of chaos engineering and see strong value in it. Here are some benefits they found.
Characteristics of a Chaos Test
Although some people think chaos engineering is a Wild West of just going into production and seeing what happens, that is absolutely not the case. There is a lot of infrastructure maturity you must have in place as a prerequisite. Every chaos test has the following key characteristics, and your infrastructure must have the maturity to meet these characteristics before you can execute chaos testing.
Reaching that level of maturity is difficult and is likely why many companies have not fully embraced it yet. You should be able to see from those steps that it requires a maturity in public cloud or containers, CI/CD, and end-to-end monitoring that many companies aspire to have for many reasons besides chaos engineering but have not yet attained.
Nonetheless, as with any IT investment, sometimes the highest level of maturity does not bring enough value to your unique situation to justify the cost. Many companies find value in going partway there, similar to how many companies are decomposing their monoliths without going to full-blown microservices. One company I’ve talked to, for example, did significantly more chaos testing in a prod-mirror acceptance environment than in true production. They found this produced sufficient value that they saw no need to invest further in maturing production chaos testing.
A common argument against chaos engineering is that we shouldn’t run it against critical systems because we can’t risk an outage caused by a failed chaos test. While this seems intuitive, it is very wrong and stems from a misunderstanding of what a chaos test is. Recall that you do not run chaos tests you expect to fail, that you run them on a small subset of your traffic, and that you have a way to abort the test immediately if things go haywire.
If you’re too worried to run a chaos test on a critical system, even with a test group as small as 0.1% of traffic, then you must expect it to fail for some reason or else you wouldn’t have that fear. Thus you have not reached the point where a chaos test is appropriate, but you have also validated the need for chaos testing: you don’t have confidence in the system’s stability! Determine what that fear is, then do tests in non-prod to either identify the faults you need to fix or convince yourself the system really is stable enough to test in production, your fear having been unwarranted.
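The two safeguards mentioned above, a small test group and an immediate abort, can be sketched in a few lines of Python. This is only an illustration under assumptions: the ChaosExperiment class, its fields, and the idea of bucketing by a hashed user ID are hypothetical choices for this example, not the method of any specific tool.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class ChaosExperiment:
    """Hypothetical experiment descriptor: who is exposed, and a kill switch."""

    name: str
    traffic_fraction: float  # e.g. 0.001 for 0.1% of traffic
    aborted: bool = False

    def abort(self) -> None:
        """Kill switch: once set, nobody is placed in the test group."""
        self.aborted = True

    def in_test_group(self, user_id: str) -> bool:
        """Deterministically place a stable fraction of users in the test group."""
        if self.aborted:
            return False
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).digest()
        # Map the first 8 bytes of the hash to a number in [0, 1).
        bucket = int.from_bytes(digest[:8], "big") / 2**64
        return bucket < self.traffic_fraction


if __name__ == "__main__":
    experiment = ChaosExperiment(name="billing-latency", traffic_fraction=0.001)
    exposed = sum(experiment.in_test_group(f"user-{i}") for i in range(100_000))
    print(f"users in test group: {exposed} of 100000")

    experiment.abort()  # things went haywire: stop exposing anyone
    print(experiment.in_test_group("user-1"))  # False after abort
```

Hashing the user ID keeps each user consistently inside or outside the test group for the life of the experiment, and checking the abort flag on every request means the blast radius drops to zero as soon as the switch is flipped.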
Finally, weigh the cost of not doing chaos testing against the cost of a failed chaos test. For example, suppose a retailer runs a test on its billing system with 0.5% of its traffic in the test group. The test passes in non-prod, but then in prod it fails. The retailer loses 20% of the test group’s purchases (or 0.1% of all purchases) over a 10-minute period before aborting the test.
Yes, that handful of lost sales is a bad outcome. But consider the alternative: had they not found and fixed this issue, then come Black Friday their system might have failed for real, losing 20% of all sales all day long with no way to abort. That makes the bad impact of a failed production chaos test sound pretty good!
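To put rough numbers on that comparison, here is a small Python sketch of the arithmetic. The 0.5%, 20%, and 10-minute figures come from the example above; the purchase volume of 1,000 per minute and the 24-hour outage duration are assumptions added purely for illustration.

```python
def lost_purchases(
    purchases_per_minute: float,
    minutes: float,
    traffic_share: float,
    failure_rate: float,
) -> float:
    """Expected number of purchases lost while a fault is active."""
    return purchases_per_minute * minutes * traffic_share * failure_rate


# Hypothetical volume, not from the article: 1,000 purchases per minute.
PURCHASES_PER_MINUTE = 1_000

# Failed chaos test: 0.5% of traffic exposed, 20% of it lost, aborted after 10 minutes.
chaos_test_loss = lost_purchases(PURCHASES_PER_MINUTE, 10, 0.005, 0.20)

# Unplanned outage: all traffic exposed, 20% lost, assumed to last 24 hours.
outage_loss = lost_purchases(PURCHASES_PER_MINUTE, 24 * 60, 1.0, 0.20)

print(f"chaos test: ~{chaos_test_loss:.0f} lost purchases")
print(f"real outage: ~{outage_loss:.0f} lost purchases")
```

Under these assumed numbers the failed chaos test costs about 10 purchases while the all-day outage costs about 288,000. The exact figures depend entirely on the assumed volume, but the ratio between them does not: it is driven by the share of traffic exposed and by how quickly the fault can be stopped.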
If you’re at the point where you’re ready to take your first steps into chaos testing, how do you start? For much deeper reading, I highly recommend Chaos Engineering: Building Confidence in System Behavior through Experiments by Casey Rosenthal et al., if you can find a copy. Unfortunately, it appears to be out of print, but he and Nora Jones have published a longer book, Chaos Engineering: System Resiliency in Practice, that perhaps goes even more in depth.
There are open source tools for chaos engineering as well as some commercial vendors. The original tool from Netflix is called Chaos Monkey, which Netflix released as open source. It integrates with Spinnaker to randomly terminate VMs and container instances.
A growing chaos engineering community can be found on GitHub, and from there you can find many other tools targeting different tech stacks and different types of chaos. As for commercial vendors, Gremlin is the only company I am aware of in this space.
Chaos engineering is an interesting development in IT, and I expect we will see it grow in the coming years. Developing basic chaos engineering skills and maturity now will put your company ahead of its competition and lay the foundation for the rapid growth of a future microservices architecture.
Previously published at https://medium.com/swlh/an-architects-introduction-to-chaos-engineering-21b9ee20d997