In Kubernetes pod is the smallest deployable unit of workload. So the obvious question : “Where should the pods be deployed?” Pods always execute inside a Node. But but.. there are so many Nodes - which one should node should I deploy this pod to ??! Hello - “Kubernetes Scheduler” Let’s break down how the Kubernetes Scheduler works and the way it chooses a node in plain english with an analogy Lets say we have a “social-restaurant” where we have several tables and several seats around each table, lots of customers and a waiter for the hotel. “Social-restaurant” meaning different set of customers can sit around the same table, if there are enough seats and all conditions are met. Table = Node (VM or a Physical Machine) Seats = Resources availability on the VM Waiter = Kube-Scheduler Customer-Group = Pod A single customer inside the group = Container 1. Resource requirements and availability A Customer-Group comes into the restaurant and makes a simple request for a table to be seated. The waiter analyzes the requirement of the Customer-Group and looks at how many Seats they would need. He then looks through all the available tables, filters the tables that cannot be “ scheduled ” and assigns ( binds ) them a table which would meet their Seating requirement. This is the basic kind of scheduling - where the kube scheduler constantly watches the API server to see if there are any pods which are unscheduled. Looks through the resource requirement for each of the container inside the pods. Remember the containers are the ones which have the resource requirement in the spec not the pods themselves. In the example below we have a CPU and Memory requirement under the container spec for the pod deployment. Requirements are 500 millicpu and an 128 MiB of memory. apiVersion: v1 kind: Pod metadata: name: nginx spec: containers: - name: nginx image: nginx:1.7.9 resources: requests: memory: "128Mi" cpu: "500m" Now let’s take a look at one of the nodes ( ) to ensure they have capacity. The way we would do that is run the following command : restaurant tables kubectl describe nodes <node-name> which outputs all the properties of the nodes, but the ones we are interested are the Capacity and Allocatable. Remember CPU and memory are not the only filter criteria, there are lots more like persistent storage, network port requirements. Detailed list here. 2. Node Selector Another Customer-Group come in to the restaurant with a requirement to be placed in any table which is “ blue ”. The waiter looks through his inventory and finds all the tables which has a label of blue and assigns the Customer-Group to the appropriate table In this scenario the pod has a nodeSelector (key-value pair) specified which requests the pod to be deployment to any node which matches the key-value pair. Here is how the new YAML file will look like: apiVersion: v1 kind: Pod metadata: name: nginx-blue spec: containers: - name: nginx image: nginx:1.7.9 nodeSelector: color: blue In order to query all my nodes to check if we have the label “blue”, we run the following command. kubectl get nodes --show-labels From the list we can see that “worker-2” has a label of . Also there are several built-in labels which Kubernetes provides us with too. color=blue Great ! If you now deploy this, the scheduler automatically assigns it to the right node. (worker-2). We can confirm this by running the following command. kubectl get pod -o wide Note that if you did not have a node with the appropriate label the deployment would be Pending. 3. Node affinity and anti-affinity Node affinity and anti-affinity are a lot like node selectors, but it gives you more flexibility by supporting expressive language and soft/hard preference rather than just a hard . requirement Lets say another Customer-Group come in to the restaurant . They have a preference to be placed in any table which is “ ocean-view ”, but its not required . The waiter looks through his inventory and finds all the tables which has a label of “ ocean ” and assigns the Customer-Group to the appropriate table In this example the pod has a nodeAffinity defined which states that we a “ ” which matches the key value pair –> ( and we do that by the matchExpressions below) prefer node view : ocean There are two options here : : which means that the nodes that match the criteria will be preferred, but not guaranteed when the allocation to the node happens. preferredDuringSchedulingIgnoredDuringExecution IgnoredDuringExecution - If you remove or change the label of the node after the pod is scheduled, the pod won’t be removed. In other words, the affinity selection works only at the time of scheduling the pod but not at execution : which means that the nodes that match the criteria will be required when selecting the node. IgnoredDuringExecution is the same as before. requiredDuringSchedulingIgnoredDuringExecution apiVersion: v1 kind: Pod metadata: name: nginx-oceanview spec: containers: - name: nginx image: nginx:1.7.9 affinity: nodeAffinity: preferredDuringSchedulingIgnoredDuringExecution: - weight: 1 preference: matchExpressions: - key: view operator: In values: - ocean The operator in this case can also be other values such as In, NotIn, Exists, DoesNotExist, Gt, Lt. NotIn and DoesNotExist will create the opposite effect of nodeAntiAffinity 4. Pod affinity and anti-affinity Another vegan girl-gang Customer-Group come in to the restaurant . They have a requirement not to be placed in any table which contain seats which are already occupied by meat eaters. They are a little more choosy - they also want to be seated in tables which contains seats which are already occupied by only girls. In other words they have a non-affinity towards meat-eaters, but have an affinity towards girls. Lets take a real world scenario where you have a set of redis-cache and web-server deployments. Here are the conditions: You want the pods deployed as close to the pods as possible (podAffinity)You don’t want two pods in the same node (podAntiAffinity)You don’t want to deploy two pods in the same node (podAntiAffiinity)You want these rules to hold good for the of a node. (topology) redis-cache web-servers redis-cache web-server scope Here is how the deployment YAML would look. : redis-cache apiVersion: apps/v1 kind: Deployment metadata: name: redis-cache spec: selector: matchLabels: apptype: redis-cache replicas: 3 template: metadata: labels: apptype: redis-cache spec: affinity: podAntiAffinity: requiredDuringSchedulingIgnoredDuringExecution: - labelSelector: matchExpressions: - key: apptype operator: In values: - redis-cache topologyKey: "kubernetes.io/hostname" containers: - name: redis-server image: redis:3.2-alpine In the example above: In the example above you see that the redis-cache label ( ) is added to every pod which gets deployed as part of this deploymentThe is described such that no two redis-cache pods are deployed inside the same server. This is defined by the built-in apptype=redis-cache podAntiAffinity apiVersion: apps/v1 kind: Deployment metadata: name: web-server spec: selector: matchLabels: apptype: web-server replicas: 3 template: metadata: labels: apptype: web-server spec: affinity: podAntiAffinity: requiredDuringSchedulingIgnoredDuringExecution: - labelSelector: matchExpressions: - key: apptype operator: In values: - web-server topologyKey: "kubernetes.io/hostname" podAffinity: requiredDuringSchedulingIgnoredDuringExecution: - labelSelector: matchExpressions: - key: apptype operator: In values: - redis-cache topologyKey: "kubernetes.io/hostname" containers: - name: web-app image: nginx:1.12-alpine In the example above: In the example above you see that the web-server label ( ) is added to every pod which gets deployed as part of this deployment apptype=web-server The is described such that no two pods are deployed inside the same server. This is defined by the built-in “kubernetes.io/hostname” which means its a single node. This can also be extended to zones or any other legal-key if required. podAntiAffinity web-server topologyKey The is described such that the pods are deployed as close to the as possible. podAffinity web-server redis-cache Once you deploy this - we got what we were aiming for - 3 web-servers and 3 redis-cache servers - one copy of each on one node ! 5. Taint and Tolerations This time around the restaurant got one of the tables “ tainted ” with a peanut spillage disaster. So they have said no new Customer-Groups will be scheduled on this table to avoid allergic reactions. So any new Customer-Groups are placed on every other table except this tainted one. So far we have been looking at scheduling from a pod perspective. But what if the other way around the node decides not to schedule anymore new pods ? This is where come in. Once you a node you have two options : taints taint - This means no new pods should be scheduled on this pod once its tainted. *unless they have a toleration NoSchedule - Existing pods will be evicted from the pod once its tainted. *unless they have a toleration (we will talk about tolerations in just a minute) NoExecute So how do we taint a node ? kubectl taint nodes <nodename> mytaintkey=mytaintvalue:NoSchedule Once we have this set the node is now tainted with the following key value pair ( ). So no new pods can be scheduled. mytaintkey=mytaintvalue But what if you want to evict the existing pods from the nodes ? kubectl taint nodes <nodename> mytaintkey=mytaintvalue:NoExecute This will evict all pods from the current node and move them to another available node. But after a while a Customer-Group comes along and says - “Oh that’s fine. We have “toleration” for peanut allergies. So please proceed and place us in the “ tainted ” table”. The Kube scheduler verifies their toleration and places them in the tainted table Now if the pod has a toleration for the taint key value that the node has specified, then this pod will get exempted from the taint and will be placed on the node if necessary. apiVersion: v1 kind: Pod metadata: name: web-server spec: containers: - name: web-app image: nginx:1.12-alpine tolerations: - key: "mytaintkey" operator: "Equal" value: "mytaintvalue" effect: "NoExecute" This post originally appeared on https://www.azuremonk.com/blog/kube-scheduler

Zones

Kubernetes scheduler visually explained in plain English with a story

About Author

Comments

TOPICS

THIS ARTICLE WAS FEATURED IN

Related Stories

Kubernetes Scheduler: Explained in Plain English with Comics

Containers 101: Kube Explained Part 2

How CI/CD and Microservices Led to Kubernetes: Kube Explained Part 1

Inside Kubernetes Scheduling: How Your Pods Fight for a Place to Exist

105 Stories To Learn About K8s

101 Stories To Learn About Cloud Infrastructure

Kubernetes Scheduler: Explained in Plain English with Comics

Containers 101: Kube Explained Part 2

How CI/CD and Microservices Led to Kubernetes: Kube Explained Part 1

Inside Kubernetes Scheduling: How Your Pods Fight for a Place to Exist

105 Stories To Learn About K8s

101 Stories To Learn About Cloud Infrastructure

Light-Mode

Classic

Newspaper

Minty

Dark-Mode

Neon Noir

Minty

HN StartUps