When you deploy an application in Kubernetes, your code ends up running on one or more worker nodes. A node may be a physical machine or a virtual machine such as an AWS EC2 instance or a Google Compute Engine instance. Having several of them means you can run and scale your application across instances efficiently.

If you have a cluster made of three nodes and decide to scale your application to four replicas, Kubernetes spreads the replicas evenly across the nodes.

The architecture described above works particularly well in case of failures. If the first node were to become unavailable, the other two could still serve the application. Meanwhile, Kubernetes has enough time to reschedule the fourth replica to another node. Even better, if all of the nodes were to become isolated, they could still serve traffic.

Let's scale down the application to two replicas. Since each node can serve the application, how does the third node know that it doesn't run the application and has to route the traffic to one of the other nodes?

Kubernetes has a binary called kube-proxy that runs on each node and is in charge of routing the traffic to a specific pod. You can think of kube-proxy as a receptionist: it intercepts all the traffic directed to the node and routes it to the right pod.

But how does kube-proxy know where all the pods are? It doesn't. The master node knows and is in charge of creating the list with all the routing rules. kube-proxy is in charge of checking and enforcing the rules on the list. In the simple scenario above, the list looks like this:

- Application instance 1 is available on Node 1
- Application instance 2 is available on Node 2

It doesn't matter which node the traffic is coming from; kube-proxy knows where the traffic should be forwarded to by looking at the list.

But what happens when kube-proxy crashes? What if the list of rules is lost? What happens when there's no rule to forward the traffic to? Manabu Sakai had the same questions, so he decided to find out.

Let's assume you have a two-node cluster on GCP:

```
$ kubectl get nodes
NAME    STATUS   ROLES    AGE   VERSION
node1   Ready    <none>   17h   v1.8.8-gke.0
node2   Ready    <none>   18h   v1.8.8-gke.0
```

And you deployed Manabu's application with:

```
$ kubectl create -f https://raw.githubusercontent.com/manabusakai/k8s-hello-world/master/kubernetes/deployment.yml
$ kubectl create -f https://raw.githubusercontent.com/manabusakai/k8s-hello-world/master/kubernetes/service.yml
```

The application is simple: it displays the hostname of the current pod on a web page.
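For reference, the two manifests define a Deployment and a NodePort Service roughly along these lines. This is only a sketch reconstructed from the names and ports that appear in the kubectl output later in the article, not a copy of the files at the URLs above; in particular the labels, the image name and the Deployment apiVersion are assumptions.

```yaml
# Approximate shape of deployment.yml and service.yml (a sketch, not the actual files).
apiVersion: apps/v1            # assumption: a v1.8 cluster would use an older API group
kind: Deployment
metadata:
  name: k8s-hello-world
spec:
  replicas: 1                  # scaled to ten later in the article
  selector:
    matchLabels:
      app: k8s-hello-world     # assumption: label chosen for illustration
  template:
    metadata:
      labels:
        app: k8s-hello-world
    spec:
      containers:
        - name: k8s-hello-world
          image: manabusakai/k8s-hello-world   # assumption: inferred from the repository name
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: k8s-hello-world
spec:
  type: NodePort               # opens the same port on every node
  selector:
    app: k8s-hello-world
  ports:
    - port: 8080
      targetPort: 8080
      nodePort: 30000          # matches the 8080:30000/TCP mapping shown below
```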
You should scale the deployment to ten replicas with:

```
$ kubectl scale --replicas 10 deployment/k8s-hello-world
```

The ten replicas are distributed evenly across the two nodes:

```
$ kubectl get pods
NAME                               READY   STATUS    NODE
k8s-hello-world-55f48f8c94-7shq5   1/1     Running   node1
k8s-hello-world-55f48f8c94-9w5tj   1/1     Running   node1
k8s-hello-world-55f48f8c94-cdc64   1/1     Running   node2
k8s-hello-world-55f48f8c94-lkdvj   1/1     Running   node2
k8s-hello-world-55f48f8c94-npkn6   1/1     Running   node1
k8s-hello-world-55f48f8c94-ppsqk   1/1     Running   node2
k8s-hello-world-55f48f8c94-sc9pf   1/1     Running   node1
k8s-hello-world-55f48f8c94-tjg4n   1/1     Running   node2
k8s-hello-world-55f48f8c94-vrkr9   1/1     Running   node1
k8s-hello-world-55f48f8c94-xzvlc   1/1     Running   node2
```

A Service was created to load balance the requests across the ten replicas:

```
$ kubectl get services
NAME              TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
k8s-hello-world   NodePort    100.69.211.31   <none>        8080:30000/TCP   3h
kubernetes        ClusterIP   100.64.0.1      <none>        443/TCP          18h
```

The Service is exposed to the outside world using NodePort on port 30000. In other words, each node has port 30000 open to the public internet and can accept incoming traffic.

But how is the traffic routed from port 30000 to the pod? kube-proxy is in charge of setting up the rules that route the incoming traffic from port 30000 to one of the ten pods.

You should try to request the node on port 30000, for example with `curl http://<node ip>:30000` (please note that you can retrieve the node's IP with `kubectl get nodes -o wide`). The application replies with Hello World! and the hostname of the container it is running on. In other words, you should be greeted by `Hello world! via <hostname>`.

If you keep requesting the same URL, you may notice how sometimes you get the same response and sometimes it changes: kube-proxy is acting as a load balancer, looking at the routing list and distributing the traffic across the ten pods.

What's more interesting is that it doesn't matter which node you request. The response could come from any pod, even one that is not hosted on the same node you requested.
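If you want to see that distribution with your own eyes, a quick optional check is to fire a batch of requests at a single node and tally which pods answered. This is a sketch that assumes the `Hello world! via <pod name>` response format described above; replace `<node-ip>` with an address from `kubectl get nodes -o wide`.

```bash
# Send 50 requests to one node and count how many times each pod replied.
# You should see several different pod names, including pods running on the other node.
for i in $(seq 1 50); do
  curl -sS "http://<node-ip>:30000/" | grep -o 'k8s-hello-world-[a-z0-9-]*'
done | sort | uniq -c | sort -rn
```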
To complete your setup, you should have an external load balancer routing the traffic to your nodes on port 30000. The load balancer routes the incoming traffic from the internet to one of the two nodes.

If you're confused by how many load-balancer-like things we have, let's quickly recap:

1. Traffic coming from the internet is routed to the primary load balancer
2. The load balancer forwards the traffic to one of the two nodes on port 30000
3. The rules set up by kube-proxy route the traffic from the node to a pod
4. The traffic reaches the pod

Phew! That was long!

## It's time to break things

Now that you know how things are plugged together, let's get back to the original question. What if you tamper with the routing rules? Will the cluster still work? Do the pods still serve requests?

Let's go ahead and delete the routing rules. In a separate shell, you should monitor the application for dropped requests and note the time of each response. You could write a loop that, every second, prints the time and requests the application:

```
$ while sleep 1; do date +%X; curl -sS http://<your load balancer ip>/ | grep ^Hello; done
```

In this case, you have the time in the first column and the response from the pod in the other:

```
10:14:41 Hello world! via k8s-hello-world-55f48f8c94-vrkr9
10:14:43 Hello world! via k8s-hello-world-55f48f8c94-tjg4n
```

The first call was made to the pod k8s-hello-world-55f48f8c94-vrkr9 at 10:14:41. The second call was made to the pod k8s-hello-world-55f48f8c94-tjg4n at 10:14:43.

Let's delete the routing rules from the node. kube-proxy can operate in three modes: userspace, iptables and ipvs. The default since Kubernetes 1.2 is iptables. In iptables mode, kube-proxy writes the list of routing rules to the node using iptables rules.

So you could log in to one of the nodes and delete the iptables rules with `iptables -F`. Please note that `iptables -F` may interfere with your SSH connection.

If everything went according to plan, you should experience something similar to this:

```
10:14:41 Hello world! via k8s-hello-world-55f48f8c94-xzvlc
10:14:43 Hello world! via k8s-hello-world-55f48f8c94-tjg4n
# this is when `iptables -F` was issued
10:15:10 Hello world! via k8s-hello-world-55f48f8c94-vrkr9
10:15:11 Hello world! via k8s-hello-world-55f48f8c94-vrkr9
```

As you noticed, it took about 27 seconds from when you dropped the iptables rules to the next response, from 10:14:43 to 10:15:10. What happened in those 27 seconds? And why is everything back to normal afterwards?

Perhaps it's just a coincidence. Let's flush the rules again:

```
11:29:55 Hello world! via k8s-hello-world-55f48f8c94-xzvlc
11:29:56 Hello world! via k8s-hello-world-55f48f8c94-tjg4n
# this is when `iptables -F` was issued
11:30:25 Hello world! via k8s-hello-world-55f48f8c94-npkn6
11:30:27 Hello world! via k8s-hello-world-55f48f8c94-vrkr9
```

There was a gap of 29 seconds, from 11:29:56 to 11:30:25, but the cluster is back to normal. Why does it take about 30 seconds to reply? Is the node receiving traffic despite having no routing table?

Maybe you could investigate what happens to the node in those 30 seconds. In another terminal, write a loop that makes a request to the application every second. But this time, request the node and not the load balancer:

```
$ while sleep 1; do printf '%s\n' "$(curl -sS http://<ip of the node>:30000)"; done
```

And let's drop the iptables rules. The log from the previous command is:

```
Hello world! via k8s-hello-world-55f48f8c94-xzvlc
Hello world! via k8s-hello-world-55f48f8c94-tjg4n
# this is when `iptables -F` was issued
curl: (28) Connection timed out after 10003 milliseconds
curl: (28) Connection timed out after 10004 milliseconds
Hello world! via k8s-hello-world-55f48f8c94-npkn6
Hello world! via k8s-hello-world-55f48f8c94-vrkr9
```

It shouldn't come as a surprise that connections to the node time out after you drop the iptables rules. What's more interesting is that curl waits for ten seconds before giving up.

What if, in the previous example, the load balancer was waiting for the connection to be made? That would explain the 30-second delay. But it doesn't explain why the node is ready to accept a connection when you wait long enough.

So why does the traffic recover after 30 seconds? Who is putting the iptables rules back?

Before you drop the iptables rules, you can inspect them with `iptables -L`. Soon after you drop the rules, keep running `iptables -L` and you should notice that the rules are back within a few seconds! Is this you, kube-proxy? Yes, it is.

Digging into the official documentation for kube-proxy reveals two interesting flags:

- `--iptables-sync-period`: the maximum interval of how often iptables rules are refreshed (e.g. '5s', '1m', '2h22m'). Must be greater than 0. (default 30s)
- `--iptables-min-sync-period`: the minimum interval of how often the iptables rules can be refreshed as endpoints and services change (e.g. '5s', '1m', '2h22m'). (default 10s)
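If you want to verify the sync period on your own node, here is a rough sketch, not part of the original experiment: it flushes the nat table (a variation on the plain `iptables -F` used above, which targets the filter table) and times how long the KUBE-SERVICES rules take to reappear. It assumes kube-proxy runs in iptables mode and that you execute it as root on a worker node; like the flush above, it disrupts service routing on that node until the next sync.

```bash
#!/bin/sh
# Rough timing of kube-proxy's rule restore (run as root on a worker node).
iptables -t nat -F             # flush the nat table, including the rules kube-proxy wrote
start=$(date +%s)
# In iptables mode, kube-proxy appends per-Service rules to the KUBE-SERVICES chain
# of the nat table; wait until at least one of those rules is back.
until iptables -t nat -S KUBE-SERVICES 2>/dev/null | grep -q '^-A KUBE-SERVICES'; do
  sleep 1
done
echo "Rules restored after $(( $(date +%s) - start ))s"
```

On this cluster, the reported value should be at most about 30 seconds, in line with the gaps measured above.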
kube-proxy refreshes the iptables rules every 10 to 30 seconds. If you drop the iptables rules, it takes up to 30 seconds for kube-proxy to realise and restore them. That explains why it took about 30 seconds to get your node back!

It also explains how routing tables are propagated from the master node to the worker nodes: kube-proxy is in charge of syncing them on a regular basis. In other words, every time a pod is added or deleted, the master node recomputes the routing list, and at a regular interval kube-proxy syncs the rules onto the current node.

Let's recap how Kubernetes and kube-proxy recover from someone tampering with the iptables rules on a node:

1. The iptables rules are deleted from the node
2. A request is forwarded to the load balancer and routed to the node
3. The node doesn't accept incoming requests, so the load balancer waits
4. After at most 30 seconds, kube-proxy restores the iptables rules
5. The node can serve traffic again: the iptables rules forward the request from the load balancer to the pod
6. The pod replies to the load balancer, with a delay of up to 30 seconds

Waiting for 30 seconds may be unacceptable for your application, so you may be interested in tweaking the default refresh interval for kube-proxy. Where are the settings and how can you change them?

It turns out that there's an agent on the node, the kubelet, that is in charge of starting kube-proxy as a static pod on each node. The documentation for static pods suggests that the kubelet scans a specific folder and creates all the resources described by the manifests contained in that folder.

If you inspect the kubelet process on the node, you should see it running with `--pod-manifest-path=/etc/kubernetes/manifests`. Running a simple `ls` reveals the truth:

```
$ ls -l /etc/kubernetes/manifests
total 4
-rw-r--r-- 1 root root 1398 Feb 24 08:08 kube-proxy.manifest
```

And a quick `cat` of kube-proxy.manifest reveals the content:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: kube-proxy
spec:
  hostNetwork: true
  containers:
  - name: kube-proxy
    image: gcr.io/google_containers/kube-proxy:v1.8.7-gke.1
    command:
    - /bin/sh
    - -c
    - >
      echo -998 > /proc/$$$/oom_score_adj &&
      exec kube-proxy
      --master=https://35.190.207.197
      --kubeconfig=/var/lib/kube-proxy/kubeconfig
      --cluster-cidr=10.4.0.0/14
      --resource-container=""
      --v=2
      --feature-gates=ExperimentalCriticalPodAnnotation=true
      --iptables-sync-period=30s
      1>>/var/log/kube-proxy.log 2>&1
```

Please note that the content was truncated and is not shown in full.

Mystery unravelled! You can see how `--iptables-sync-period=30s` is used to refresh the iptables rules every 30 seconds. You could go ahead and modify that command to customise the minimum and maximum time between rule updates for that node.

## Lessons learned

Dropping the iptables rules is similar to making a node unavailable: the traffic is still routed to the node, but the node is not able to forward it any further. Kubernetes can recover from such a failure by monitoring the state of the routing rules and updating them when necessary.

Many thanks to Manabu Sakai, whose blog post was a huge inspiration, and to Valentin Ouvrard for investigating the issue with the iptables propagation.

If you liked the article, you should stay tuned for more! Subscribe to our newsletter at learnk8s.io!

Originally published at learnk8s.io on May 15, 2018.