Chaos has become a symptom of the tech world. Every day, thousands of developers are putting out fires at work and getting caught up in one crisis after another.
The better part of those fires has been lit by the rise of microservices and distributed cloud architectures. The popularity of those advancements is at an all-time high, yet failures remain frequent and complex.
According to an IHS Markit survey, downtime cost 400 companies a collective $700 billion per year, a staggering figure.
We all need a magic pill to alleviate this headache; waiting for your service to crash is a bleak option.
Let’s do it the Netflix way and chill during deployment.
Welcome to chaos engineering - a discipline where mistakes are intentional and failures are embraced.
Its history dates back to 2010, when the Netflix Eng Tools team created Chaos Monkey to test the resilience of its IT infrastructure. Today, chaos engineering ‘celebrates failure’ to help engineers and systems build muscle memory and maintain more resilient complex systems.
In layman’s terms, chaos engineering is the process of hacking things on purpose.
Just like a vaccination, you inject a fault - latency, say, or CPU exhaustion - to trigger an immune response within the system.
The main goal is to identify hidden problems before they wreck production.
As a chaos engineer, you test the system's ability to handle real-world problems - server errors, traffic spikes, corrupted messages - in a series of controlled experiments.
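To make the vaccination analogy concrete, here's a minimal Python sketch (the function names are hypothetical, not taken from any particular chaos tool) that wraps a service call and slows down a fraction of requests:

```python
import random
import time

def inject_latency(func, probability=0.2, delay_seconds=2.0):
    """Wrap a service call so that a fraction of calls get a dose of latency."""
    def wrapper(*args, **kwargs):
        if random.random() < probability:
            # The "vaccine": a small, controlled dose of harm.
            time.sleep(delay_seconds)
        return func(*args, **kwargs)
    return wrapper

# Hypothetical service call; in a real experiment this would be a
# downstream dependency such as a database or an internal API.
def fetch_recommendations(user_id):
    return ["movie-1", "movie-2"]

fetch_recommendations = inject_latency(fetch_recommendations)
```

In a real experiment, a dedicated chaos tool would inject the fault at the infrastructure level, but the principle is the same: a small, deliberate dose of harm.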
To stress your system out, you need to follow a five-step process (tied together in the code sketch after this list):
Define the steady state of the system. Develop a deep understanding of the system so that you know what it looks like when functioning normally. This state will serve as your measurable baseline.
Build a hypothesis around the steady state. Choose the damaging action you want to enact, and simulate realistic scenarios by replicating problems that have previously occurred in your system. For example, if a traffic spike caused havoc a few months ago, opt for faults that mimic those effects.
Measure the impact. Keep tabs on your system while the fault is active. Focus on key metrics, but don’t forget to assess the entire system.
Minimize the blast radius. Safeguard the infrastructure by coordinating developer teams and business units. Start small and build up as you gain confidence in the system.
Invalidate your hypothesis. Finally, you’ll get one of two outcomes: either you confirm the resilience of the system, or you find a weak point to eliminate.
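Put together, a chaos experiment is just a controlled loop around those five steps. Here's a toy harness in Python (the metric, fault, and rollback functions are hypothetical placeholders, not a real chaos framework, and the metric is assumed to be a positive number such as latency in milliseconds):

```python
import statistics

def run_experiment(measure_metric, inject_fault, stop_fault,
                   samples=50, tolerance=0.10):
    """A toy harness for the five steps of a chaos experiment."""
    # 1. Define the steady state: a measurable baseline.
    baseline = statistics.mean(measure_metric() for _ in range(samples))

    # 2. The hypothesis: the metric stays within `tolerance` of the
    #    baseline while the fault is active.
    inject_fault()
    try:
        # 3. Measure the impact while the fault runs.
        observed = statistics.mean(measure_metric() for _ in range(samples))
    finally:
        # 4. Minimize the blast radius: always roll the fault back,
        #    even if measurement blows up.
        stop_fault()

    # 5. Try to invalidate the hypothesis.
    deviation = abs(observed - baseline) / baseline
    if deviation <= tolerance:
        print(f"Hypothesis holds ({deviation:.1%} drift): system looks resilient.")
    else:
        print(f"Hypothesis invalidated ({deviation:.1%} drift): weak point found.")
```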
Pro tip: Run chaos experiments in production to replicate the real state of things. If you only perform chaos testing in staging or integration environments, you won’t get an accurate picture of how the production system reacts.
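Running in production makes guardrails essential. A common pattern is to confine the fault to a tiny slice of traffic and keep a kill switch handy; here's a hypothetical sketch (the environment variable and percentage are illustrative assumptions):

```python
import os
import random

# Hypothetical production guardrails: confine the fault to a tiny slice
# of traffic and honor a kill switch so the experiment can be aborted
# the moment things go sideways.
BLAST_RADIUS = 0.01  # target 1% of requests

def should_inject_fault():
    if os.environ.get("CHAOS_KILL_SWITCH") == "1":
        return False  # abort immediately: someone pulled the plug
    return random.random() < BLAST_RADIUS
```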
Awesome! We’ve successfully shattered your application using controlled chaos and demystified chaos engineering. Next, you’ll want to right those wrongs to make your system invincible.
Credit for the above piece goes to Tatsiana Isakova, Hang Ngo, and Ellen Stevens.