A few weeks ago I presented Chaos Testing for Docker Containers at ContainerCamp in London. You can find the original recording and slides at the end of this post. I’ve made some small edits to the text for readability and added some links for more context.
Software development is about building software services that support business needs. The more complex the business processes we want to automate and integrate with, the more complex the software system we build. Solution complexity tends to grow over time and scope.
The reasons for growing complexity can vary. Some systems handle too many concerns or require lots of integrations with external services and internal legacy systems. These systems are written and rewritten multiple times over several years by different people with different skills, trying to satisfy changing business requirements, using different technologies, following different technology and architecture trends.
So, my point is that building software that unintentionally becomes more and more complex over time is easy. We’ve all done it before and probably do it now. Building a “good” software architecture for complex systems and preserving its “good” abilities for some period of time is really hard.
When you have too many “moving” parts, integrations and constantly changing requirements, while dealing with code changes, security upgrades, hardware modernization, multiple network communication channels, etc., it can become a “Mission Impossible” to avoid unexpected failures.
All systems fail from time to time. And your software system will fail too. Take this as a fact of life. There will always be something that can, and will, go wrong. No matter how hard we try, we can’t build perfect software, nor can the companies we depend on. Even the most stable and respected services fail, including those from companies that practice CI/CD and test-driven development (TDD/BDD) and that have huge QA departments and well-defined release procedures.
Just a few examples of outages from the last year:
2. GitLab, January 31
3. AWS, February 28
4. Microsoft Azure, March 16
Visit Outage.Report or Downdetector to see a constantly updating long list of outages reported by end-users.
As software engineers, we want to be proud of the software systems we are building. We want these systems to be of high quality: without functional bugs or security holes, providing exceptional performance, resilient to unexpected failures, self-healing, always available and easy to maintain and modernize.
Every new project starts with a “high quality” picture in mind and no one wants to create crappy software, but very few of us (or none) are able to achieve all the good “abilities” and keep them intact. So, what can we do to improve overall system quality? Should we do more testing?
I tend to say “Yes” — software testing is critical. But just running unit, functional and performance testing is not enough.
Today, building complex distributed systems is much easier with all the new amazing technology we have. Microservice architecture is a real trend nowadays, and various container technologies support this architecture. It’s much easier than it used to be to deploy, scale, link, monitor, update and manage distributed systems composed of multiple “microservices”.
When we build distributed systems, we choose P (Partition Tolerance) from the CAP theorem and second to it either A (Availability — the most popular choice) or C (Consistency). So, we need to find a good approach for testing AP or CP systems.
Traditional testing disciplines and tools do not provide a good answer to the question: how does your distributed system behave when unexpected stuff happens in production? Sure, you can learn from previous failures after the fact, and you should definitely do so. But learning from past experience should not be the only way to prepare for future failures.
Waiting for things to break in production is not an option. But what’s the alternative?
The alternative is to break things on purpose. And Chaos Engineering is an approach for doing just that. The idea of Chaos Engineering is to embrace the failure!
Chaos Engineering for distributed software systems was originally popularized by Netflix.
Chaos Engineering defines an empirical approach to resilience testing of distributed software systems. You are testing a system by conducting chaos experiments.
Typical chaos experiment:
The harder it is to disrupt the steady state, the more confidence we have in the behavior of the system.
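As a rough illustration of the idea, a chaos experiment can be sketched as a tiny shell loop: confirm the steady state, inject a failure, then check whether the steady state survived. This is only a sketch under assumptions: `PROBE_CMD` and `INJECT_CMD` are hypothetical placeholders (both default to `true` so the script runs anywhere), and in practice you would set them to your own health check and failure injection commands.

```shell
#!/bin/sh
# Minimal chaos experiment sketch.
# PROBE_CMD  - hypothetical command that exits 0 while the system is in its steady state
# INJECT_CMD - hypothetical command that injects a failure
PROBE_CMD=${PROBE_CMD:-true}
INJECT_CMD=${INJECT_CMD:-true}

# Returns 0 when the steady-state probe succeeds.
steady_state_ok() {
  sh -c "$PROBE_CMD" >/dev/null 2>&1
}

run_experiment() {
  # 1. The system must be healthy before we disturb it.
  if ! steady_state_ok; then
    echo "aborted: no steady state before injection"
    return 2
  fi
  # 2. Inject the failure.
  sh -c "$INJECT_CMD" >/dev/null 2>&1
  # 3. Check whether the steady state survived.
  if steady_state_ok; then
    echo "steady state held"
    return 0
  fi
  echo "steady state disrupted"
  return 1
}

run_experiment
```

With the defaults the script simply reports that the steady state held; the interesting runs are the ones where the probe and the injection point at a real system.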
Of course, it’s possible to practice Chaos Engineering manually. But there are some nice tools to use.
Netflix built a number of useful tools for practicing Chaos Engineering in a public cloud (AWS):
These are very good tools, and I encourage you to use them. But when I started my new container-based project (2 years ago), it felt like these tools provided the wrong granularity for the chaos I wanted to create. I wanted to create chaos not only in a real cluster, but also on a single developer machine, to be able to debug and tune my application. I searched Google for Chaos Monkey for Docker, but did not find anything besides some basic Bash scripts.
So, I decided to create my own tool. From day one, I’ve shared it with the community as an open source project. It’s a Chaos Warthog (rather than a Chaos Monkey) for Docker: Pumba.
What is Pumba(a)?
Those of us who have kids or were kids in the 90s should remember this character from Disney’s animated film The Lion King. In Swahili, pumbaa means “to be foolish, silly, weak-minded, careless, negligent”. I like the Swahili meaning. It matched perfectly with the tool I wanted to create.
Pumba disturbs a running Docker runtime environment by injecting different failures. Pumba can kill, stop, remove or pause Docker containers.
Pumba can also perform network emulation, simulating different network failures such as delay, packet loss (using different probability loss models), bandwidth rate limits and more. For network emulation, Pumba uses the Linux kernel traffic control tool tc with the netem queueing discipline. If tc is not available within the target container, Pumba uses a sidekick container with tc on board, attaching it to the target container’s network.
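For orientation, netem rules of this kind look roughly like the following when written by hand. This sketch only prints the tc commands rather than running them (applying them needs root and the iproute2 package). The interface name eth0 is an assumption, `netem_cmd` is a hypothetical helper, and the exact commands Pumba issues internally may differ.

```shell
#!/bin/sh
# Print (not run) a tc command that attaches a netem qdisc to an interface.
# $1 = network interface, remaining arguments = netem options
netem_cmd() {
  dev=$1
  shift
  echo "tc qdisc add dev $dev root netem $*"
}

# 3000ms delay with 50ms jitter, normally distributed
netem_cmd eth0 delay 3000ms 50ms distribution normal
# 20% random packet loss
netem_cmd eth0 loss 20%
# Remove the netem qdisc again when the experiment is over
echo "tc qdisc del dev eth0 root netem"
```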
You can pass a list of containers to Pumba or just write a regular expression to select matching containers. If you do not specify containers, Pumba will try to disturb all running containers. Use the --random option to randomly select only one target container from the provided list. It’s also possible to define a repeatable time interval and duration parameters to better control the amount of chaos you want to create.
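To make the selection behavior concrete, here is a small sketch of "filter by regex, then pick one at random" in plain shell. It is not Pumba’s implementation: `pick_target` is a hypothetical helper, and it uses grep’s extended regular expressions, which agree with Pumba’s RE2 syntax for simple patterns like ^test but not necessarily for more complex ones.

```shell
#!/bin/sh
# Read container names on stdin, keep those matching the regex in $1,
# and print one of them chosen at random (prints nothing if none match).
pick_target() {
  grep -E -- "$1" | awk '
    BEGIN { srand() }                 # seed from the current time
    { names[NR] = $0 }                # remember every matching name
    END { if (NR > 0) print names[int(rand() * NR) + 1] }'
}

# Example with a fixed list; with Docker available you could feed it
# the output of: docker ps --format "{{.Names}}"
printf '%s\n' test0 test1 skipme test2 | pick_target '^test'
```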
Pumba is available as a single binary file for Linux, macOS and Windows, or as a Docker container.
    # Download binary from https://github.com/gaia-adm/pumba/releases
    # (-L follows the redirect that GitHub release downloads return)
    curl -L https://github.com/gaia-adm/pumba/releases/download/0.4.6/pumba_linux_amd64 --output /usr/local/bin/pumba
    chmod +x /usr/local/bin/pumba && pumba --help

    # Install with Homebrew (macOS only)
    brew install pumba && pumba --help

    # Use Docker image
    docker run gaiaadm/pumba pumba --help
First of all, run pumba --help to get help about the available commands and options, and pumba <command> --help to get help for a specific command and sub-command.
    # pumba help
    pumba --help

    # pumba kill help
    pumba kill --help

    # pumba netem delay help
    pumba netem delay --help
Killing a randomly chosen Docker container whose name matches the ^test regex.
    # on main pane/screen, run 8 test containers (test0..test7) that do nothing
    for i in {0..7}; do docker run -d --rm --name test$i alpine tail -f /dev/null; done
    # run an additional container with 'skipme' name
    docker run -d --rm --name skipme alpine tail -f /dev/null

    # run this command in another pane/screen to see running docker containers
    watch docker ps -a

    # go back to main pane/screen and kill (once in 10s) random 'test' container, ignoring 'skipme'
    pumba --random --interval 10s kill re2:^test
    # press Ctrl-C to stop Pumba at any time
Adding a 3000ms (±50ms) delay to the egress traffic of the ping container for 20 seconds, using a normal distribution model.
    # run "ping" container on one screen/pane
    docker run -it --rm --name ping alpine ping 8.8.8.8

    # on second screen/pane, run pumba netem delay command, disturbing "ping" container; sidekick a "tc" helper container
    pumba netem --duration 20s --tc-image gaiadocker/iproute2 delay --time 3000 --jitter 50 --distribution normal ping
    # pumba will exit after 20s, or stop it with Ctrl-C
To demonstrate the packet loss capability, we will need three screens/panes. I will use the iperf network bandwidth measurement tool. On the first pane, run a server Docker container with iperf on board and start a UDP server there. On the second pane, start a client Docker container with iperf and send datagrams to the server container. Then, on the third pane, run the pumba netem loss command, adding packet loss to the client container. Enjoy the chaos.
    # create docker network
    docker network create -d bridge testnet

    # > Server Pane
    # run server container
    docker run -it --name server --network testnet --rm alpine sh -c "apk add --no-cache iperf; sh"
    # shell inside server container: run a UDP Server listening on UDP port 5001
    sh$ iperf -s -u -i 1

    # > Client Pane
    # run client container
    docker run -it --name client --network testnet --rm alpine sh -c "apk add --no-cache iperf; sh"
    # shell inside client container: send datagrams to the server -> see no packet loss
    sh$ iperf -c server -u

    # > Server Pane
    # see server receives datagrams without any packet loss

    # > Pumba Pane
    # inject 20% packet loss into client container, for 1m
    pumba netem --duration 1m --tc-image gaiadocker/iproute2 loss --percent 20 client

    # > Client Pane
    # shell inside client container: send datagrams to the server -> see ~20% packet loss
    sh$ iperf -c server -u
Chaos Engineering for Docker from Alexei Ledenev
I hope you find this post useful. I look forward to your comments and any questions you have.
Originally published at codefresh.io on October 4, 2017.