cloud architect, open-source developer
Updated on 28-July-2016: Pumba CLI changes; network emulation support added.
The best defense against unexpected failures is to build resilient services. Testing for resiliency enables teams to discover these failures before the customer notices. By intentionally causing failures as part of resiliency testing, you can enforce your policy for building resilient systems. The resilience of a system can be defined as its ability to continue functioning even if some of its components are failing (ephemerality). The growing popularity of distributed and microservice architectures makes resilience testing critical for applications that now require 24x7x365 operation. Resilience testing is an approach where you intentionally inject different types of failures at the infrastructure level (VM, network, containers, and processes) and let the system try to recover from them, just as it would have to in production. Simulating realistic failures at any time is the best way to enforce highly available and resilient systems.
First of all, Pumba (or Pumbaa) is a supporting character from Disney’s animated film The Lion King. In Swahili, pumbaa means “to be foolish, silly, weakminded, careless, negligent”. This actually reflects the desired behavior of the application I’ve tried to build :-)
Pumba is inspired by the highly popular Netflix Chaos Monkey resilience testing tool for the AWS cloud. Pumba takes a similar approach, but applies it at the container level. It connects to the Docker daemon running on some machine (local or remote) and brings some level of chaos to it: “randomly” killing, stopping and removing running containers.
If your system is designed to be resilient, it should be able to recover from such failures. “Failed” services should be restarted and lost connections should be recovered. This is not as trivial as it sounds. You need to design your services differently. Be aware that a service can fail (for whatever reason), or a service it depends on can disappear at any point in time (and may reappear later). Expect the unexpected!
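One common way to recover a lost connection to a dependency that disappears and later reappears is to retry with backoff. The following is a minimal sketch, assuming a POSIX-like shell; `retry` is a hypothetical helper written for illustration and is not part of Pumba.

```shell
# Hypothetical helper: retry a command with exponential backoff.
# Usage: retry <max_attempts> <initial_delay_seconds> <command> [args...]
retry() {
  local max_attempts=$1
  local delay=$2
  local attempt=1
  shift 2
  # Run the command until it succeeds or the attempts are used up.
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))   # double the wait before the next attempt
  done
}
```

For example, `retry 5 1 curl -fsS http://backend:8080/health` (the URL is made up) would try the health check up to five times, waiting 1, 2, 4 and 8 seconds between attempts.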
Failures happen, and they inevitably happen when least desired. If your application cannot recover from system failures, you are going to face angry customers and maybe even lose them. If you want to be sure that your system is able to recover from unexpected failures, it is better to take charge of them and inject failures yourself instead of waiting until they happen. This is not a one-time effort. In the age of Continuous Delivery, you need to be sure that every change to any of the system's services does not compromise system availability. That’s why you should practice continuous resilience testing. With Docker gaining popularity, people are deploying and running clusters of containers in production. Using a container orchestration framework (e.g. Kubernetes, Swarm, CoreOS fleet), it’s possible to restart a “failed” container automatically. But how can you be sure that restarted services and other system services can properly recover from failures? If you are not using a container orchestration framework, life is even harder: you will need to handle container restarts yourself.
This is where Pumba shines. You can run it on every Docker host in your cluster, and Pumba will “randomly” stop running containers that match the specified names or name patterns. You can even specify the signal that will be sent to “kill” the container.
Pumba can create different failures for your running Docker containers. It can kill, stop or remove running containers. It can also pause all processes within a running container for a specified period of time. Pumba can also do network emulation, simulating different network failures such as delay, packet loss/corruption/reordering, bandwidth limits and more. Disclaimer: the netem command is under development, and only the delay command is supported in Pumba v0.2.0.
You can pass a list of containers to Pumba or just write a regular expression (prefixed with re2:) to select matching containers. If you do not specify any containers, Pumba will try to disturb all running containers. Use the --random option to randomly select only one target container from the provided list.
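To make the selection logic concrete, here is a rough sketch of "filter names by a pattern, then pick one at random". It only illustrates the idea and is not how Pumba is implemented: `match_names` and `pick_one` are hypothetical helpers, and `grep -E` uses POSIX extended regular expressions, which differ in places from the RE2 syntax Pumba accepts.

```shell
# Hypothetical helper: keep only the names (one per line on stdin)
# that match the given extended regular expression.
match_names() {
  grep -E -- "$1" || true   # no match is not an error here
}

# Hypothetical helper: print one randomly chosen line from stdin,
# or nothing if stdin is empty (conceptually what --random does).
pick_one() {
  awk 'BEGIN { srand() }
       { lines[NR] = $0 }
       END { if (NR > 0) print lines[int(rand() * NR) + 1] }'
}
```

Against a real Docker host, you could feed it the running container names, for example `docker ps --format '{{.Names}}' | match_names '^hp' | pick_one`.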
There are two ways to run Pumba.
First, you can download the Pumba application (a single binary file) for your OS from the project release page and run pumba --help to see a list of supported commands and options.
$ pumba help
Pumba version v0.2.0
Pumba - Pumba is a resilience testing tool, that helps applications tolerate random Docker container failures: process, network and performance.
pumba [global options] command [command options] containers (name, list of names, RE2 regex)
  kill     kill specified containers
  netem    emulate the properties of wide area networks
  pause    pause all processes
  stop     stop containers
  rm       remove containers
  help, h  Shows a list of commands or help for one command

  --host value, -H value      daemon socket to connect to (default: "unix:///var/run/docker.sock") [$DOCKER_HOST]
  --tls                       use TLS; implied by --tlsverify
  --tlsverify                 use TLS and verify the remote [$DOCKER_TLS_VERIFY]
  --tlscacert value           trust certs signed only by this CA (default: "/etc/ssl/docker/ca.pem")
  --tlscert value             client certificate for TLS authentication (default: "/etc/ssl/docker/cert.pem")
  --tlskey value              client key for TLS authentication (default: "/etc/ssl/docker/key.pem")
  --debug                     enable debug mode with verbose logging
  --json                      produce log in JSON format: Logstash and Splunk friendly
  --slackhook value           web hook url; send Pumba log events to Slack
  --slackchannel value        Slack channel (default #pumba) (default: "#pumba")
  --interval value, -i value  recurrent interval for chaos command; use with optional unit suffix: 'ms/s/m/h'
  --random, -r                randomly select single matching container from list of target containers
  --dry                       dry runl does not create chaos, only logs planned chaos commands
  --help, -h                  show help
  --version, -v               print the version
# stop a random container once every 10 minutes
$ ./pumba --random --interval 10m kill --signal SIGSTOP
# every 15 minutes kill `mysql` container and
# every hour remove containers starting with "hp"
$ ./pumba --interval 15m kill --signal SIGTERM mysql &
$ ./pumba --interval 1h rm re2:^hp &
# every 30 seconds kill "worker1" and "worker2" containers
# and every 3 minutes stop "queue" container
$ ./pumba --interval 30s kill --signal SIGKILL worker1 worker2 &
$ ./pumba --interval 3m stop queue &
# Once every 5 minutes, Pumba will delay for 2 seconds (2000ms)
# egress traffic for some (randomly chosen) container,
# named `result...` (matching `^result` regexp) on `eth2`
# network interface.
# Pumba will restore normal connectivity after 2 minutes.
# Print debug trace to STDOUT too.
$ ./pumba --debug --interval 5m --random netem --duration 2m --interface eth2 delay --amount 2000 re2:^result
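For readers unfamiliar with network emulation on Linux, the effect described above corresponds to what the standard `tc` tool does with the `netem` queueing discipline. The sketch below only prints the commands one would run by hand inside the target container's network namespace; it does not execute them. Whether Pumba issues exactly these commands is an assumption, and `netem_delay_plan` is a hypothetical helper written for illustration.

```shell
# Hypothetical helper: print (do not execute) the tc commands for a
# temporary egress delay on one network interface.
# Usage: netem_delay_plan <interface> <delay_ms> <duration_seconds>
netem_delay_plan() {
  local iface=$1
  local delay_ms=$2
  local duration_s=$3
  # Add a netem qdisc that delays every outgoing packet.
  echo "tc qdisc add dev ${iface} root netem delay ${delay_ms}ms"
  # Keep the delay in place for the requested duration.
  echo "sleep ${duration_s}"
  # Remove the qdisc to restore normal connectivity.
  echo "tc qdisc del dev ${iface} root netem"
}
```

Calling `netem_delay_plan eth2 2000 120` prints the three steps for the example above: add a 2000ms delay on `eth2`, wait 120 seconds (2 minutes), then remove the delay.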
The second approach is to run it in a Docker container.
In order to give Pumba access to the Docker daemon on the host machine, you will need to mount the /var/run/docker.sock Unix socket.
# run latest stable Pumba docker image (from master repository)
$ docker run -d \
    -v /var/run/docker.sock:/var/run/docker.sock \
    gaiaadm/pumba:master pumba \
    --interval 10s kill --signal SIGTERM re2:^hp
Pumba will not kill its own container.
I’ve just created the Pumba project and will gladly accept any ideas, Pull Requests, issues and other contributions to the project.
Originally published at blog.terranillius.com.