Having flaky tests in your CI is a nightmare. You can't tell whether your new code broke something or if it's just those tests being flaky again. So anytime we see strange, random failures in CI for our open source project, Adapt, we try to track down the culprit ASAP. This is the story of how we discovered we were (accidentally) flooding our DNS server with traffic and how we used a DNS cache in Docker to solve the problem.

Background

One of the open source projects I work on, AdaptJS, can deploy apps into multiple clouds and technologies, so there's a ton of system testing and end-to-end testing with Docker, Kubernetes, AWS, Google Cloud and other similar technologies.

We make heavy use of Docker in our tests, so we end up creating lots of short-lived containers that start up, do some work, like building or installing an app, and then get deleted. And as we added more and more of those tests, we started seeing previously stable system tests fail randomly in CI.

The symptom: test timeouts

The first symptoms we saw were test timeouts. We have fairly short timeouts on many of our end-to-end tests so we can detect if new code suddenly makes things take longer for end users. But now, a test that should normally take a half second would sometimes take 5.5 seconds.

The additional 5 seconds was a great clue: 5 seconds sounded like it could be a timeout of some kind. Armed with that hunch, we looked back through all the seemingly random test failures and found the common thread: they were all tests that initiated network requests. We also noticed a few tests that had taken even longer to fail, always in increments of 5 seconds.

There weren't too many network protocols that could be involved here, so some quick Googling pointed us in the right direction. The default timeout for DNS server queries on Linux just happens to be 5 seconds.

To see what was going on with DNS, we reached for probably the single most important tool for debugging network issues on Linux: tcpdump.
(Or, if you prefer a GUI version, Wireshark is great too.) We ran tcpdump on the host system (an Amazon WorkSpaces Linux instance) and used a filter to see the DNS traffic:

    $ tcpdump -n -i eth1 port 53
    11:35:59.474735 IP 172.16.0.131.54264 > 172.16.0.119.domain: 64859+ AAAA? registry-1.docker.io. (38)
    11:35:59.474854 IP 172.16.0.131.49631 > 172.16.0.119.domain: 43524+ A? registry-1.docker.io. (38)
    11:35:59.476871 IP 172.16.0.119.domain > 172.16.0.131.49631: 43524 8/0/1 A 34.197.189.129, A 34.199.40.84, A 34.199.77.19, A 34.201.196.144, A 34.228.211.243, A 34.232.31.24, A 52.2.186.244, A 52.55.198.220 (177)
    11:35:59.476957 IP 172.16.0.119.domain > 172.16.0.131.54264: 64859 0/1/1 (133)

The first thing that we noticed was that we were generating a huge flood of DNS queries to the AWS default DNS server for our VPC. It looked like all those short-lived containers tended to do a bunch of DNS lookups when starting up, for various reasons.

Next, we noticed that some of those DNS queries just went unanswered. It's pretty common for shared DNS servers to implement rate limits so that a single user can't degrade performance for everyone else. Here, we suspected that the AWS DNS servers were doing exactly that.

We weren't able to find a way to confirm whether we were actually hitting AWS rate limits, but it seemed wise for us not to DoS our DNS server.

The solution: a Docker DNS cache, using dnsmasq

In order to isolate DNS traffic within the host, we needed a local DNS server to act as a cache. A great choice for a cache like this is dnsmasq. It's reliable, widely used, and super simple to set up. And since all of our testing runs inside Docker containers, it made sense to run the DNS server in Docker too.

The basic idea is pretty simple: run a dnsmasq container as the DNS cache on the Docker host network and then run our test containers with the --dns option pointing to the cache container's IP address.
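
Before setting up a cache, it can help to put a number on the unanswered queries seen in the tcpdump capture. The following is a minimal sketch and not part of our actual tooling: `count_unanswered` is a hypothetical helper that reads tcpdump text output in the format shown earlier and pairs each query with its response by client address, client port, and DNS transaction ID. It assumes one packet per line and that the DNS server side appears as port `53` or `domain`.

```shell
# Hypothetical helper: count DNS queries in tcpdump text output that never
# received a response. Reads the capture on stdin.
count_unanswered() {
  awk '
    $2 == "IP" {
      src = $3
      dst = $5
      sub(/:$/, "", dst)          # strip the trailing colon after the destination
      id = $6
      sub(/[^0-9].*$/, "", id)    # drop flag characters such as "+" after the ID

      if (dst ~ /\.(53|domain)$/) {
        # Packet sent to the DNS server: a query (a retry reuses the same key)
        pending[src " " id] = 1
        queries++
      } else if (src ~ /\.(53|domain)$/) {
        # Packet from the DNS server: a response, so the query was answered
        delete pending[dst " " id]
      }
    }
    END {
      unanswered = 0
      for (key in pending) unanswered++
      printf "queries=%d unanswered=%d\n", queries, unanswered
    }
  '
}
```

One way to use it is to save a capture to a file first (for example, a hypothetical `capture.txt`) and then run `count_unanswered < capture.txt`. Queries still in flight when the capture stops are counted as unanswered, so treat the result as an estimate.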

Here's the script, named dns_cache, that starts the DNS cache container:

    #!/usr/bin/env bash

    : "${IMAGE:=andyshinn/dnsmasq:2.76}"
    : "${NAME:=dnsmasq}"
    : "${ADAPT_DNS_IP_FILE:=/tmp/adapt_dns_ip}"

    # Get IP address for an interface, as visible from inside a container
    # connected to the host network
    interfaceIP() {
        # Run a container and get ifconfig output from inside
        # We need the ifconfig that will be visible from inside the dnsmasq
        # container. Note that gensub is a GNU awk (gawk) extension.
        docker run --rm --net=host busybox ifconfig "$1" 2>/dev/null | \
            awk '/inet /{print(gensub(/^.*inet (addr:)?([0-9.]+)\s.*$/, "\\2", 1))}'
    }

    if docker inspect --type container "${NAME}" >& /dev/null ; then
        if [ -f "${ADAPT_DNS_IP_FILE}" ]; then
            # dnsmasq is already started
            cat "${ADAPT_DNS_IP_FILE}"
            exit 0
        else
            echo "DNS cache container running but file ${ADAPT_DNS_IP_FILE} does not exist." >&2
            exit 1
        fi
    fi

    # We only support attaching to the default (host) bridge named "bridge".
    DOCKER_HOST_NETWORK=bridge

    # Confirm that "bridge" is the default bridge
    IS_DEFAULT=$(docker network inspect "${DOCKER_HOST_NETWORK}" --format '{{(index .Options "com.docker.network.bridge.default_bridge")}}')
    if [ "${IS_DEFAULT}" != "true" ]; then
        echo "Cannot start DNS cache. The Docker network named \"${DOCKER_HOST_NETWORK}\" does not exist or is not the default bridge." >&2
        exit 1
    fi

    # Get the Linux interface name for the bridge, typically "docker0"
    INTF_NAME=$(docker network inspect "${DOCKER_HOST_NETWORK}" --format '{{(index .Options "com.docker.network.bridge.name")}}')
    if [ -z "${INTF_NAME}" ]; then
        echo "Cannot start DNS cache. Unable to determine default bridge interface name." >&2
        exit 1
    fi

    # Get the IP address of the bridge interface. This is the address that
    # dnsmasq will listen on and other containers will send DNS requests to.
    IP_ADDR=$(interfaceIP "${INTF_NAME}")
    if [ -z "${IP_ADDR}" ]; then
        echo "Cannot start DNS cache. Docker bridge interface ${INTF_NAME} does not exist." >&2
        exit 1
    fi

    # Run the dnsmasq container. The host's /etc/resolv.conf configuration will
    # be used by dnsmasq to resolve requests.
    if ! docker run --rm -d --cap-add=NET_ADMIN --name "${NAME}" --net=host \
        -v/etc/resolv.conf:/etc/resolv.conf "${IMAGE}" \
        --bind-interfaces --listen-address="${IP_ADDR}" --log-facility=- > /dev/null
    then
        echo "Cannot start DNS cache. Docker run failed." >&2
        exit 1
    fi

    # Remember what IP address to use as DNS server, then output it.
    echo "${IP_ADDR}" > "${ADAPT_DNS_IP_FILE}"
    echo "${IP_ADDR}"

In addition to starting the container (if it's not already running), the script outputs the cache container's IP address. We'll use that on the command line of any other containers we start. The script also ensures that dnsmasq only listens for DNS requests within Docker (on the Docker bridge interface), so there's a little additional work to determine the IP address to listen on.

Here's an example of how to start the DNS cache, remembering the IP address in the variable DNS_IP, and then run another container that will use the cache.

    $ DNS_IP=$(dns_cache)
    $ docker run --dns ${DNS_IP} --rm busybox ping -c1 adaptjs.org

Verifying the cache works

After we started using the cache in our testing, the number of DNS queries that the host system sent to the AWS DNS server dropped to a small trickle.

We also confirmed that the cache was operating properly by checking the dnsmasq statistics. Sending a SIGUSR1 to dnsmasq causes it to print statistics to its log:

    $ docker kill -s USR1 dnsmasq
    $ docker logs dnsmasq
    dnsmasq[1]: cache size 150, 1085/4664 cache insertions re-used unexpired cache entries.
    dnsmasq[1]: queries forwarded 1712, queries answered locally 3940
    dnsmasq[1]: queries for authoritative zones 0
    dnsmasq[1]: server 172.16.0.119#53: queries sent 1172, retried or failed 0
    dnsmasq[1]: server 172.16.1.65#53: queries sent 252, retried or failed 0
    dnsmasq[1]: server 172.16.0.2#53: queries sent 608, retried or failed 0

In this sample, dnsmasq answered 3,940 queries locally and forwarded 1,712, so roughly 70% of lookups never left the host.

And most importantly, we saw a dramatic decrease in system test timeouts and our CI runs stabilized.

This issue took us a while to track down. But keeping CI healthy is extremely important. If you have too many sporadic test failures, developers tend to ignore CI results and push potentially broken code. So, even though it was time consuming to track down these failures, given the ease of the fix, it was definitely worth the investment.

Originally published on the Adapt blog.