A few weeks ago I presented Chaos Testing for Docker Containers at ContainerCamp in London. You can find the original recording and slides at the end of this post. I’ve made some small edits to the text for readability and added some links for more context.

Intro

Software development is about building software services that support business needs. The more complex the business processes we want to automate and integrate with, the more complex the software system we build. Solution complexity tends to grow over time and scope.

The reasons for growing complexity vary. Some systems handle too many concerns or require lots of integrations with external services and internal legacy systems. These systems are written and rewritten multiple times over several years by different people with different skills, trying to satisfy changing business requirements, using different technologies and following different technology and architecture trends.

So, my point is that building software that unintentionally becomes more and more complex over time is easy. We’ve all done it before and probably do it now. Building a “good” software architecture for complex systems, and preserving its “good” abilities for some period of time, is really hard.

When you have too many “moving” parts, integrations and constantly changing requirements, while dealing with code changes, security upgrades, hardware modernization, multiple network communication channels, etc., it can become a “Mission Impossible” to avoid unexpected failures.

Stuff happens!

All systems fail from time to time. And your software system will fail too. Take this as a fact of life. There will always be something that can, and will, go wrong. No matter how hard we try, we can’t build perfect software, nor can the companies we depend on. Even the most stable and respected services, from companies that practice CI/CD and test-driven development (TDD/BDD) and have huge QA departments and well-defined release procedures, fail.
Just a few examples from last year’s outages:

1. IBM, January 26
IBM’s cloud credibility took a hit at the start of the year when a management portal used by customers to access its Bluemix cloud infrastructure went down for several hours. While no underlying infrastructure actually failed, users were frustrated to find they couldn’t manage their applications or add and remove the cloud resources powering their workloads. IBM said the problem was intermittent and stemmed from a botched update to the interface.

2. GitLab, January 31
GitLab’s popular online code repository, GitLab.com, suffered an 18-hour service outage that ultimately couldn’t be fully remediated. The problem resulted when an employee removed a database directory from the wrong database server during maintenance procedures.

3. AWS, February 28
This was the outage that shook the industry. An Amazon Web Services engineer trying to debug an S3 storage system in the provider’s Virginia data center accidentally typed a command incorrectly, and much of the Internet, including many enterprise platforms like Slack, Quora, and Trello, was down for four hours.

4. Microsoft Azure, March 16
Storage availability issues plagued Microsoft’s Azure public cloud for more than eight hours, mostly affecting customers in the Eastern U.S. Some users had trouble provisioning new storage or accessing existing resources in the region. A Microsoft engineering team later identified the culprit as a storage cluster that lost power and became unavailable.

Visit Outage.Report or Downdetector to see a constantly updating long list of outages reported by end-users.

Chasing Software Quality

As software engineers, we want to be proud of the software systems we are building. We want these systems to be of high quality: without functional bugs or security holes, providing exceptional performance, resilient to unexpected failures, self-healing, always available and easy to maintain and modernize.
Every new project starts with a “high quality” picture in mind, and no one wants to create crappy software, but very few of us (if any) are able to achieve and keep intact all the good “abilities”.

So, what can we do to improve overall system quality? Should we do more testing?

I tend to say “Yes”: software testing is critical. But just running unit, functional and performance tests is not enough.

Today, building a complex distributed system is much easier with all the new amazing technology we have. Microservice Architecture is a real trend nowadays, and miscellaneous container technologies support this architecture. It’s much easier to deploy, scale, link, monitor, update and manage distributed systems composed of multiple “microservices” than it used to be.

When we build distributed systems, we choose P (Partition Tolerance) from the CAP theorem and, second to it, either A (Availability, the most popular choice) or C (Consistency). So, we need to find a good approach for testing AP or CP systems.

Traditional testing disciplines and tools do not provide a good answer to the question: how does your distributed system behave when unexpected stuff happens in production? Sure, you can learn from previous failures after the fact, and you should definitely do it. But learning from past experience should not be the only way to prepare for future failures.

Waiting for things to break in production is not an option. But what’s the alternative?

Chaos Engineering

The alternative is to break things on purpose. And Chaos Engineering is an approach for doing just that. The idea of Chaos Engineering is to embrace the failure!

Chaos Engineering for distributed software systems was originally popularized by Netflix.

Chaos Engineering defines an empirical approach to resilience testing of distributed software systems. You test a system by conducting chaos experiments.

A typical chaos experiment:

1. define a normal/steady state of the system (e.g. by monitoring a set of system and business metrics)
2. pseudo-randomly inject faults (e.g. by terminating VMs, killing containers or changing network behavior)
3. try to discover system weaknesses by deviation from expected or steady-state behavior

The harder it is to disrupt the steady state, the more confidence we have in the behavior of the system.

Chaos Engineering tools

Of course, it’s possible to practice Chaos Engineering manually. But there are some nice tools to use.

Netflix built a number of useful tools for practicing Chaos Engineering in a public cloud (AWS):

1. Chaos Monkey: kill EC2, kill processes, burn CPU, fill disk, detach volumes, add network latency, etc.
2. Chaos Kong: remove whole AWS Regions

These are very good tools, and I encourage you to use them. But when I started my new container-based project (2 years ago), it felt like these tools provided the wrong granularity for the chaos I wanted to create. I wanted to create chaos not only in a real cluster, but also on a single developer machine, to be able to debug and tune my application. I searched Google for Chaos Monkey for Docker, but did not find anything besides some basic Bash scripts.

So, I decided to create my own tool. From day one, I’ve shared it with the community as an open source project. It’s a Chaos Warthog (rather than a Monkey) for Docker: Pumba.

What is Pumba(a)?

Those of us who have kids or were kids in the 90s should remember this character from Disney’s animated film The Lion King. In Swahili, pumbaa means “to be foolish, silly, weak-minded, careless, negligent”. I like the Swahili meaning. It matched perfectly with the tool I wanted to create.

What can Pumba do?

Pumba disturbs a running Docker runtime environment by injecting different failures. Pumba can kill, stop, remove or pause Docker containers.

Pumba can also do network emulation, simulating different network failures such as delay, packet loss (using different probability loss models), bandwidth rate limits and more. For network emulation, Pumba uses Linux kernel traffic control (tc) with the netem queueing discipline. If tc is not available within the target container, Pumba uses a sidekick container with tc on-board, attaching it to the target container network.

You can pass a list of containers to Pumba or just write a regular expression to select matching containers. If you do not specify containers, Pumba will try to disturb all running containers. Use the --random option to randomly select only one target container from the provided list. It’s also possible to define a repeatable time interval and duration parameters to better control the amount of chaos you want to create.

Pumba is available as a single binary file for Linux, MacOS and Windows, or as a Docker container.

    # Download binary from https://github.com/gaia-adm/pumba/releases
    # (-L makes curl follow the redirect that GitHub returns for release downloads)
    curl -L https://github.com/gaia-adm/pumba/releases/download/0.4.6/pumba_linux_amd64 --output /usr/local/bin/pumba
    chmod +x /usr/local/bin/pumba && pumba --help

    # Install with Homebrew (MacOS only)
    brew install pumba && pumba --help

    # Use Docker image
    docker run gaiaadm/pumba pumba --help

Pumba command examples

First of all, run pumba --help to get help about available commands and options, and pumba <command> --help to get help for a specific command and sub-command.

    # pumba help
    pumba --help

    # pumba kill help
    pumba kill --help

    # pumba netem delay help
    pumba netem delay --help

Killing a randomly chosen Docker container from the ^test regex list.
    # on main pane/screen, run 8 test containers (test0 to test7) that do nothing
    for i in {0..7}; do docker run -d --rm --name test$i alpine tail -f /dev/null; done

    # run an additional container with 'skipme' name
    docker run -d --rm --name skipme alpine tail -f /dev/null

    # run this command in another pane/screen to see running docker containers
    watch docker ps -a

    # go back to main pane/screen and kill (once in 10s) random 'test' container, ignoring 'skipme'
    pumba --random --interval 10s kill re2:^test
    # press Ctrl-C to stop Pumba at any time

Adding a 3000ms (+-50ms) delay to the egress traffic of the ping container for 20 seconds, using the normal distribution model.

    # run "ping" container on one screen/pane
    docker run -it --rm --name ping alpine ping 8.8.8.8

    # on second screen/pane, run pumba netem delay command, disturbing "ping" container; sidekick a "tc" helper container
    pumba netem --duration 20s --tc-image gaiadocker/iproute2 delay --time 3000 --jitter 50 --distribution normal ping
    # pumba will exit after 20s, or stop it with Ctrl-C

To demonstrate the packet loss capability, we will need three screens/panes. I will use the iperf network bandwidth measurement tool. On the first pane, run the server Docker container with iperf on-board and start a UDP server there. On the second pane, start the client Docker container with iperf and send datagrams to the server container. Then, on the third pane, run the pumba netem loss command, adding packet loss to the client container. Enjoy the chaos.

    # create docker network
    docker network create -d bridge testnet

    # > Server Pane
    # run server container
    docker run -it --name server --network testnet --rm alpine sh -c "apk add --no-cache iperf; sh"
    # shell inside server container: run a UDP Server listening on UDP port 5001
    sh$ iperf -s -u -i 1

    # > Client Pane
    # run client container
    docker run -it --name client --network testnet --rm alpine sh -c "apk add --no-cache iperf; sh"
    # shell inside client container: send datagrams to the server -> see no packet loss
    sh$ iperf -c server -u

    # > Server Pane
    # see server receives datagrams without any packet loss

    # > Pumba Pane
    # inject 20% packet loss into client container, for 1m
    pumba netem --duration 1m --tc-image gaiadocker/iproute2 loss --percent 20 client

    # > Client Pane
    # shell inside client container: send datagrams to the server -> see ~20% packet loss
    sh$ iperf -c server -u

Session and slides

ContainerCamp UK 2017 session

Slides from the above session: Chaos Engineering for Docker, from Alexei Ledenev

I hope you find this post useful. I look forward to your comments and any questions you have.

Originally published at codefresh.io on October 4, 2017.