Chaos Testing for Docker Containers

A few weeks ago I presented “Chaos Testing for Docker Containers” at ContainerCamp in London. You can find the original recording and slides at the end of this post. I’ve made some small edits to the text for readability and added some links for more context.

Intro

Software development is about building software services that support business needs. The more complex the business processes we want to automate and integrate with, the more complex the software system we build. Solution complexity tends to grow over time and scope.

The reasons for growing complexity can vary. Some systems handle too many concerns or require lots of integrations with external services and internal legacy systems. These systems are written and rewritten multiple times over several years by different people with different skills, trying to satisfy changing business requirements, using different technologies, following different technology and architecture trends.

So, my point is that it is easy to build software that unintentionally becomes more and more complex over time. We’ve all done it before and probably still do. Building a “good” software architecture for a complex system and preserving its “good” abilities over time is really hard.

When you have too many “moving” parts, integrations and constantly changing requirements, while also dealing with code changes, security upgrades, hardware modernization, multiple network communication channels, etc., avoiding unexpected failures can become a “Mission Impossible”.

Stuff happens!

All systems fail from time to time, and your software system will fail too. Take this as a fact of life. There will always be something that can, and will, go wrong. No matter how hard we try, we can’t build perfect software, nor can the companies we depend on. Even the most stable and respected services fail, including those from companies that practice CI/CD and test-driven development (TDD/BDD) and have huge QA departments and well-defined release procedures.

Here are just a few examples of outages from the past year:

1. IBM, January 26

IBM’s cloud credibility took a hit at the start of the year when a management portal used by customers to access its Bluemix cloud infrastructure went down for several hours. While no underlying infrastructure actually failed, users were frustrated to find they couldn’t manage their applications or add and remove the cloud resources powering their workloads.
IBM said the problem was intermittent and stemmed from a botched update to the interface.

2. GitLab, January 31

GitLab’s popular online code repository, GitLab.com, suffered an 18-hour service outage that ultimately couldn’t be fully remediated.
The problem resulted when an employee removed a database directory from the wrong database server during maintenance procedures.

3. AWS, February 28

This was the outage that shook the industry.
An Amazon Web Services engineer trying to debug an S3 storage system in the provider’s Virginia data center accidentally typed a command incorrectly, and much of the Internet, including many enterprise platforms like Slack, Quora, and Trello, was down for four hours.

4. Microsoft Azure, March 16

Storage availability issues plagued Microsoft’s Azure public cloud for more than eight hours, mostly affecting customers in the Eastern U.S.
Some users had trouble provisioning new storage or accessing existing resources in the region. A Microsoft engineering team later identified the culprit as a storage cluster that lost power and became unavailable.

Visit Outage.Report or Downdetector to see a constantly updating long list of outages reported by end-users.

Chasing Software Quality

As software engineers, we want to be proud of the software systems we are building. We want these systems to be of high quality: free of functional bugs and security holes, providing exceptional performance, resilient to unexpected failures, self-healing, always available, and easy to maintain and modernize.

Every new project starts with a “high quality” picture in mind, and no one wants to create crappy software, but very few of us (if any) are able to achieve all the good “abilities” and keep them intact. So, what can we do to improve overall system quality? Should we do more testing?

I tend to say “Yes” — software testing is critical. But just running unit, functional and performance testing is not enough.

Today, building a complex distributed system is much easier with all the new amazing technology we have. Microservice Architecture is a real trend nowadays, and miscellaneous container technologies support this architecture. It’s much easier than it used to be to deploy, scale, link, monitor, update and manage distributed systems composed of multiple “microservices”.

When we build distributed systems, we choose P (Partition Tolerance) from the CAP theorem and second to it either A (Availability — the most popular choice) or C (Consistency). So, we need to find a good approach for testing AP or CP systems.

Traditional testing disciplines and tools do not provide a good answer to the question: how does your distributed system behave when unexpected stuff happens in production? Sure, you can learn from previous failures after the fact, and you definitely should. But learning from past experience should not be the only way to prepare for future failures.

Waiting for things to break in production is not an option. But what’s the alternative?

Chaos Engineering

The alternative is to break things on purpose. And Chaos Engineering is an approach for doing just that. The idea of Chaos Engineering is to embrace the failure!

Chaos Engineering for distributed software systems was originally popularized by Netflix.

Chaos Engineering defines an empirical approach to resilience testing of distributed software systems. You are testing a system by conducting chaos experiments.

Typical chaos experiment:

1. define a normal/steady state of the system (e.g. by monitoring a set of system and business metrics)
2. pseudo-randomly inject faults (e.g. by terminating VMs, killing containers or changing network behavior)
3. try to discover system weaknesses by looking for deviations from the expected or steady-state behavior

The harder it is to disrupt the steady state, the more confidence we have in the behavior of the system.
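
The three steps above can be sketched as a small shell skeleton. This is only an illustration under stated assumptions: measure_metric, inject_fault, within_tolerance and run_experiment are hypothetical names, the metric is a fake constant, and the 10% tolerance is an arbitrary default. A real experiment would plug in its own probes and faults.

```shell
#!/bin/sh
# Sketch of one chaos experiment. All function names are hypothetical.

# Placeholder probe: print one integer metric (e.g. requests per second).
# Here it returns a fake constant; replace it with a real measurement.
measure_metric() { echo "${FAKE_METRIC:-100}"; }

# Placeholder fault: a real experiment would kill a container, add latency, etc.
inject_fault() { echo "injecting fault: $1" >&2; }

# Succeed if "seen" deviates from "base" by at most "pct" percent.
within_tolerance() {
  base=$1 seen=$2 pct=$3
  delta=$((seen - base))
  if [ "$delta" -lt 0 ]; then delta=$((0 - delta)); fi
  [ $((delta * 100)) -le $((base * pct)) ]
}

run_experiment() {
  baseline=$(measure_metric)   # 1. record the steady state
  inject_fault "$1"            # 2. inject a fault
  observed=$(measure_metric)   # 3. compare with the steady state
  if within_tolerance "$baseline" "$observed" "${2:-10}"; then
    echo "steady state held"
  else
    echo "weakness found: metric moved from $baseline to $observed"
  fi
}

run_experiment "kill a random container" 10
```

Keeping the tolerance check in its own function makes it easy to reuse the same comparison for several metrics.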

Chaos Engineering tools

Of course, it’s possible to practice Chaos Engineering manually. But there are some nice tools to use.

Netflix built a number of useful tools for practicing Chaos Engineering in a public cloud (AWS):

Chaos Monkey — kill EC2, kill processes, burn CPU, fill disk, detach volumes, add network latency, etc
Chaos Kong — remove whole AWS Regions

These are very good tools, and I encourage you to use them. But when I started my new container-based project (2 years ago), it felt like these tools provided the wrong granularity for the chaos I wanted to create. I wanted to create chaos not only in a real cluster, but also on a single developer machine, to be able to debug and tune my application. I searched Google for Chaos Monkey for Docker, but did not find anything besides some basic Bash scripts.

So, I decided to create my own tool. From day one, I’ve shared it with the community as an open source project. It’s a Chaos ~~Monkey~~ Warthog for Docker — Pumba

Pumba — Chaos Testing for Docker

What is Pumba(a)?

Those of us who have kids or were kids in the 90s should remember this character from Disney’s animated film The Lion King. In Swahili, pumbaa means “to be foolish, silly, weak-minded, careless, negligent”. I like the Swahili meaning. It matched perfectly with the tool I wanted to create.

What can Pumba do?

Pumba disturbs a running Docker runtime environment by injecting different failures. Pumba can kill, stop, remove or pause Docker containers.

Pumba can also perform network emulation, simulating different network failures such as delay, packet loss (using different probability loss models), bandwidth rate limits and more. For network emulation, Pumba uses the Linux kernel traffic control tool tc with the netem queueing discipline (read more here). If tc is not available within the target container, Pumba uses a sidekick container with tc on board, attaching it to the target container’s network.
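
To give a feel for what netem does, here is a minimal sketch that prints (but does not run) the kind of tc commands used for delay and packet loss. The helper names are hypothetical, eth0 is an assumed interface name, and whether Pumba issues exactly these commands internally is an assumption. Only the tc/netem syntax itself is standard.

```shell
#!/bin/sh
# Print the tc/netem command for a given fault. Nothing is executed,
# so this is safe to run on any machine. Helper names are hypothetical.

# usage: netem_delay_cmd <iface> <delay_ms> <jitter_ms> [distribution]
netem_delay_cmd() {
  cmd="tc qdisc add dev $1 root netem delay ${2}ms ${3}ms"
  if [ -n "${4:-}" ]; then cmd="$cmd distribution $4"; fi
  echo "$cmd"
}

# usage: netem_loss_cmd <iface> <percent>
netem_loss_cmd() {
  echo "tc qdisc add dev $1 root netem loss ${2}%"
}

# usage: netem_clear_cmd <iface>  (removes the netem qdisc again)
netem_clear_cmd() {
  echo "tc qdisc del dev $1 root netem"
}

netem_delay_cmd eth0 3000 50 normal
netem_loss_cmd eth0 20
netem_clear_cmd eth0
```

Running the printed commands by hand requires root (or the NET_ADMIN capability) inside the network namespace of the container you want to disturb.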

You can pass a list of containers to Pumba or just write a regular expression to select matching containers. If you do not specify containers, Pumba will try to disturb all running containers. Use the --random option to randomly select only one target container from the provided list. It’s also possible to define a repeat interval and a duration to better control the amount of chaos you want to create.
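
The selection logic described above (filter by regular expression, then optionally pick one target at random) can be approximated in plain shell. This is a sketch, not Pumba's implementation: filter_targets and pick_random are hypothetical helpers, and grep -E uses POSIX extended regular expressions rather than the RE2 syntax Pumba accepts, although a simple pattern like ^test behaves the same in both.

```shell
#!/bin/sh
# Sketch of regex-based target selection with a random pick.
# filter_targets and pick_random are hypothetical helper names.

# Keep only the names (one per line on stdin) that match an extended regex.
filter_targets() { grep -E "$1"; }

# Pick one line at random from stdin; prints nothing for empty input.
pick_random() {
  awk 'BEGIN { srand() }
       { lines[NR] = $0 }
       END { if (NR > 0) print lines[int(rand() * NR) + 1] }'
}

# Example with a fixed list of names. On a Docker host the names could
# instead come from: docker ps --format '{{.Names}}'
printf '%s\n' test0 test1 test2 skipme | filter_targets '^test' | pick_random
```

The example prints one of test0, test1 or test2 and never skipme, which mirrors what --random combined with a re2:^test pattern is meant to achieve.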

Pumba is available as a single binary file for Linux, MacOS and Windows, or as a Docker container.

# Download binary from https://github.com/gaia-adm/pumba/releases
# (-L follows the redirect GitHub uses for release downloads)
curl -L https://github.com/gaia-adm/pumba/releases/download/0.4.6/pumba_linux_amd64 --output /usr/local/bin/pumba
chmod +x /usr/local/bin/pumba && pumba --help

# Install with Homebrew (MacOS only)
brew install pumba && pumba --help

# Use Docker image
docker run gaiaadm/pumba pumba --help

Pumba commands examples

First of all, run pumba --help to get help about available commands and options and pumba <command> --help to get help for the specific command and sub-command.

# pumba help
pumba --help
# pumba kill help
pumba kill --help
# pumba netem delay help
pumba netem delay --help

Kill a randomly chosen Docker container whose name matches the ^test regex.

# on main pane/screen, run 8 test containers (test0..test7) that do nothing
for i in {0..7}; do docker run -d --rm --name test$i alpine tail -f /dev/null; done
# run an additional container with 'skipme' name
docker run -d --rm --name skipme alpine tail -f /dev/null
# run this command in another pane/screen to see running docker containers
watch docker ps -a
# go back to main pane/screen and kill (once in 10s) random 'test' container, ignoring 'skipme'
pumba --random --interval 10s kill re2:^test
# press Ctrl-C to stop Pumba at any time

Add a 3000ms (±50ms) delay to the egress traffic of the ping container for 20 seconds, using the normal distribution model.

# run "ping" container on one screen/pane
docker run -it --rm --name ping alpine ping 8.8.8.8
# on second screen/pane, run pumba netem delay command, disturbing "ping" container; use a sidekick "tc" helper container
pumba netem --duration 20s --tc-image gaiadocker/iproute2 delay --time 3000 --jitter 50 --distribution normal ping
# pumba will exit after 20s, or stop it with Ctrl-C

To demonstrate the packet loss capability, we will need three screens/panes. I will use the iperf network bandwidth measurement tool. On the first pane, run a server Docker container with iperf on board and start a UDP server there. On the second pane, start a client Docker container with iperf and send datagrams to the server container. Then, on the third pane, run the pumba netem loss command, adding packet loss to the client container. Enjoy the chaos.

# create docker network
docker network create -d bridge testnet

# > Server Pane
# run server container
docker run -it --name server --network testnet --rm alpine sh -c "apk add --no-cache iperf; sh"
# shell inside server container: run a UDP Server listening on UDP port 5001
sh$ iperf -s -u -i 1

# > Client Pane
# run client container
docker run -it --name client --network testnet --rm alpine sh -c "apk add --no-cache iperf; sh"
# shell inside client container: send datagrams to the server -> see no packet loss
sh$ iperf -c server -u

# > Server Pane
# see server receives datagrams without any packet loss

# > Pumba Pane
# inject 20% packet loss into client container, for 1m
pumba netem --duration 1m --tc-image gaiadocker/iproute2 loss --percent 20 client

# > Client Pane
# shell inside client container: send datagrams to the server -> see ~20% packet loss
sh$ iperf -c server -u

Session and slides

ContainerCamp UK 2017 session

Slides from above session

Chaos Engineering for Docker from Alexei Ledenev

I hope you find this post useful. I look forward to your comments and any questions you have.

Originally published at codefresh.io on October 4, 2017.