Important note: as with almost everything in Kubernetes (K8s from here on), scheduling is a process that can be customized and extended by users to fit their needs. In this guide we will talk about kube-scheduler with the default plugins enabled.

## 1. Inside kube-scheduler

### What is scheduling

In Kubernetes, scheduling refers to making sure that Pods are matched to Nodes so that the kubelet can run them. The central component responsible for scheduling decisions is kube-scheduler.

From a bird's-eye view, kube-scheduler works like this: it watches for newly created or updated Pods and adds them to the scheduling queue. The top-priority item from the queue then goes through the scheduling process, which consists of the scheduling and binding cycles. At the end, the Pod is either scheduled or sent back to the scheduler's queueing mechanism to wait until it is considered schedulable again.

### Scheduler's queueing mechanism

The scheduling cycle is synchronous, so Pods have to wait for their turn to be scheduled. If the conditions specified by a Pod are not yet met during scheduling (existence of a persistent volume, compliance with affinity rules, etc.), the Pod needs to be moved back to the waiting line. For that reason kube-scheduler has a queueing mechanism that consists of multiple data structures serving different purposes^:

- Active queue (`ActiveQ`) -- provides Pods for immediate scheduling. The Pods here are either newly created or ready to be retried. It is implemented as a heap which orders Pods using plugins that implement the `QueueSort` extension point. By default the `PrioritySort` plugin is used -- as the name suggests, it sorts Pods by highest priority.
- Unschedulable pods map -- when scheduling of a Pod fails, either during the scheduling or the binding cycle, the Pod is considered unschedulable and placed in this map (or straight into `BackoffQ` if a move request was received, see details below). Pods are held here until some change in the cluster happens (a new node added, a PV created, Pod affinity rules satisfied, etc.) that could make them schedulable.
- Backoff queue (`BackoffQ`) -- holds previously unschedulable Pods for a backoff period before they go back to `ActiveQ`. The backoff period grows exponentially with the number of unsuccessful scheduling attempts for that Pod.

Pods from `ActiveQ` are popped by the scheduler when it is ready to process them. Pods in `BackoffQ` and in the Unschedulable map are waiting for certain condition(s) to happen.

Pods that failed to be scheduled are placed first in the Unschedulable pods map, from which they can move either to `BackoffQ` or directly to `ActiveQ`. Pods are moved out of the Unschedulable map on a few occasions:

- `flushUnschedulablePodsLeftover` -- a routine that runs every 30 seconds (hard-coded value^). It selects Pods that have stayed in the map longer than the required amount of time (set by `PodMaxInUnschedulablePodsDuration`) and, using queueing hints^, determines whether a Pod could be schedulable again -- if so, it moves the Pod either to `BackoffQ` or to `ActiveQ` (if the Pod's backoff period has already ended).
- Move request -- can be triggered either by changes to nodes, PVs, etc.^ or by plugins^. When triggered, it uses the same logic as `flushUnschedulablePodsLeftover`.

Pods placed in `BackoffQ` wait for their backoff period to end. The `flushBackoffQCompleted` routine runs every second (hard-coded value^) and simply moves all Pods that completed backoff to `ActiveQ`.

### Scheduling process

When the top-priority Pod is popped from `ActiveQ` by the scheduler, it goes through the scheduling and binding cycles.

First comes the scheduling cycle^, which is synchronous (meaning that only one Pod at a time goes through the cycle) and consists of 2 stages:

- Filtering the nodes on which the Pod can be deployed, based, for example, on node labels, resource utilization and so on.
- Scoring the nodes returned by the filtering stage based on preferences and optimization rules, such as topology spread constraints, to select the best option.

After the decision is made in the scheduling cycle, it is time for the binding cycle^, which runs asynchronously (allowing another Pod to go through the scheduling cycle) and is responsible for notifying the API server about the decision.

A fundamental part of each cycle are extension points, which are implemented by plugins. Basically, kube-scheduler implements the glue between calls to plugins, which are responsible for the actual scheduling decisions. For example, the `NodeName` plugin implements the `Filter` extension point -- it checks if there is a node name in the Pod spec and matches it to an actual node. If this plugin is disabled, users will not be able to assign Pods to specific nodes.

The list of default plugins can be found in the Kubernetes docs^.
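To make the plugin/extension-point relationship more concrete, here is a minimal sketch of a `KubeSchedulerConfiguration` that disables the `NodeName` plugin mentioned above and leaves everything else at defaults. Treat it as an illustration, not a recommendation: verify the `apiVersion` and field names against the kube-scheduler version you run, and the config is passed to the scheduler via its `--config` flag.

```yaml
# Sketch of a scheduler configuration (kubescheduler.config.k8s.io/v1);
# double-check field names for your cluster version before using it.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    # Each extension point (queueSort, filter, score, bind, ...) accepts enabled/disabled lists.
    filter:
      disabled:
      - name: NodeName   # example: switch off the plugin discussed above
```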
## 2. Quick note on preemption and evictions

I prefer to formulate the concepts of preemption and eviction slightly differently than the official docs^ do:

- Preemption is the process of freeing a node from Pods with lower priority (look into priority classes^) to make space for a Pod with higher priority.
- Eviction (which comes in different forms) is the removal of a Pod from a node. Therefore, eviction can be a part of the preemption process.

Concerning the scheduler: if a Pod fails to be scheduled during the scheduling cycle, the `PostFilter` plugins are called^. By default, that is only the `DefaultPreemption` plugin. `DefaultPreemption` goes through the nodes and checks whether preempting a node would allow the Pod to be scheduled there. If so, it evicts the lower-priority Pods and sends the currently processed Pod back to be rescheduled.

The eviction of a Pod can be done in multiple ways. For example, API-initiated eviction (e.g. by calling `kubectl drain`) uses the Eviction API^, which respects the Pod Disruption Budget (PDB). Eviction during node preemption works by removing the `nominatedNodeName` field from the evicted Pods' statuses, without respecting PDBs or QoS^. The scheduler's preemption process will try to respect PDBs when selecting Pods for eviction, but if a Pod with a PDB is the only option, it will select it.
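As a quick illustration of the priority classes mentioned above, here is a sketch of a PriorityClass and a Deployment that references it. With the default plugins, such a Pod sorts ahead of lower-priority Pods in the queue and can trigger preemption if no feasible node is found. The names and the value are made up for the example.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: important            # illustrative name
value: 100000                # higher value = higher priority; pick a scheme that fits your cluster
globalDefault: false
description: "Example priority class for latency-critical workloads."
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: important-app        # illustrative name
spec:
  selector:
    matchLabels:
      app: important-app
  replicas: 1
  template:
    metadata:
      labels:
        app: important-app
    spec:
      priorityClassName: important   # links the Pod to the PriorityClass above
      containers:
      - name: nginx
        image: nginx
```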
## 3. Let's schedule some Pods

The examples in this section can be run on a local cluster provided by kind. This command will create the cluster: `kind create cluster --config=kind.conf`.

```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
  labels:
    zone: west
    disktype: ssd
- role: worker
  labels:
    zone: west
- role: worker
  labels:
    zone: east
- role: worker
  labels:
    zone: east
```

If you want to dive deeper into the runtime operations of kube-scheduler, you can increase its log level by getting into the control plane node (for example, using the node-shell plugin for kubectl: `kubectl node-shell kind-control-plane`) and then modifying the kube-scheduler manifest `/etc/kubernetes/manifests/kube-scheduler.yaml`, for example with the command: `sed -i '19i \ \ \ \ - --v=10' /etc/kubernetes/manifests/kube-scheduler.yaml`.

kube-scheduler will be restarted automatically after the manifest changes are applied. It is important to note, though, that kube-scheduler doesn't log calls to filter plugins and relies on logging on the plugin side -- which most default plugins don't do that well.

### First steps

As we discussed earlier, a standard kube-scheduler setup has a number of default plugins activated. A bunch of those don't need any additional config in the Pod spec to affect scheduling.

Let's apply a simple manifest, without any additional rules for the scheduler:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: simple
spec:
  selector:
    matchLabels:
      app: simple
  replicas: 1
  template:
    metadata:
      labels:
        app: simple
    spec:
      containers:
      - name: nginx
        image: nginx
```

Several plugins will do some work while scheduling this Pod:

- `PrioritySort` -- is, basically, a `Less` function applied when a Pod is added to the queue (heap).
- `NodeUnschedulable` -- at the `Filter` extension point, filters out nodes with `.spec.unschedulable` set to `true`.
- `DefaultPreemption` -- its `PostFilter` is called only if the `Filter` phase didn't find any feasible nodes for the Pod. `DefaultPreemption`, as the name suggests, tries to remove lower-priority Pods to make scheduling of the Pod being processed possible.
- During scoring, all plugins that implement the `Score` extension point will be called to score the feasible nodes. Relevant in this example are: `ImageLocality` -- favors nodes that already have the container image the Pod runs; `NodeResourcesFit` -- by default uses the "least allocated" (max available resources) strategy to score nodes; `NodeResourcesBalancedAllocation` -- favors nodes that would end up with more balanced resource usage if the Pod is scheduled there.
- `DefaultBinder` -- when a feasible node is found, the scheduler updates `nodeName` in the Pod's spec.
### Dangers of specifying the NodeName field

plugins: `NodeName`, `NodeUnschedulable`

The most straightforward way of dealing with scheduling is setting the node in the Pod spec via the `nodeName` field. However, the behavior in this case is somewhat unintuitive.

Set a worker node to be unschedulable: `kubectl cordon kind-worker`. Then deploy the nginx manifest below, which specifies the `kind-worker` node in the spec. You'll see that, despite the node being unschedulable, the Pod is still deployed and running on it.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      nodeName: kind-worker
      containers:
      - name: nginx
        image: nginx
```

`nodeName` is intended for use by custom schedulers or advanced use cases where you need to bypass any configured schedulers. Bypassing schedulers might lead to failed Pods if the assigned Nodes get oversubscribed. You can use node affinity or the `nodeSelector` field to assign a Pod to a specific Node without bypassing the schedulers.
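For comparison, here is a sketch of the safer alternative mentioned above: pinning the Pod to the same node through `nodeSelector` and the well-known `kubernetes.io/hostname` label instead of `nodeName`. Because this path goes through the scheduler, a cordoned node is filtered out and the Pod stays Pending rather than landing on an unschedulable node. The Deployment name is illustrative.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-pinned          # illustrative name
spec:
  selector:
    matchLabels:
      app: nginx-pinned
  replicas: 1
  template:
    metadata:
      labels:
        app: nginx-pinned
    spec:
      nodeSelector:
        kubernetes.io/hostname: kind-worker   # standard node label; check with `kubectl get node kind-worker --show-labels`
      containers:
      - name: nginx
        image: nginx
```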
### Setting resource requirements

plugins: `NodeResourcesFit`, `NodeResourcesBalancedAllocation`

When deploying a Pod you can request and limit the usage of the following resources: CPU, memory and ephemeral storage. Resource constraints play a significant role in scheduling, but also in how Pods are treated when there are not enough resources on a node (known as Quality of Service^).

Two plugins play the main role during scheduling here:

- `NodeResourcesFit` filters the nodes that have all the resources the Pod is requesting.
- `NodeResourcesBalancedAllocation`, during scoring, favors nodes that would obtain more balanced resource usage if the Pod is scheduled there.

Apply this manifest:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-126
spec:
  selector:
    matchLabels:
      app: nginx-126
  replicas: 1
  template:
    metadata:
      labels:
        app: nginx-126
    spec:
      containers:
      - name: nginx
        image: nginx:1.26
        resources:
          requests:
            cpu: "1"
```

If you search the kube-scheduler logs (at an appropriate log level) for the Pod name, you will see something like this:

```
pod="nginx1.26" plugin="NodeResourcesFit" node="kind-worker" score=91
pod="nginx1.26" plugin="NodeResourcesBalancedAllocation" node="kind-worker" score=93
...
pod="nginx1.26" plugin="NodeResourcesFit" node="kind-worker2" score=91
pod="nginx1.26" plugin="NodeResourcesBalancedAllocation" node="kind-worker2" score=93
```

If you used the kind config provided above, you will see the scores for all 4 worker nodes. As there were no other Pods on these nodes, meaning identical allocatable resources on each, `NodeResourcesBalancedAllocation` scores all nodes the same.

Given that the final score of the nodes is the same, one is chosen randomly^. In our case the `kind-worker2` node was selected.

Now apply this manifest in addition:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-127
spec:
  selector:
    matchLabels:
      app: nginx-127
  replicas: 1
  template:
    metadata:
      labels:
        app: nginx-127
    spec:
      containers:
      - name: nginx
        image: nginx:1.27
        resources:
          requests:
            cpu: "1"
```

You will see different scoring for the new Pod in the logs:

```
pod="nginx1.27" plugin="NodeResourcesFit" node="kind-worker" score=91
pod="nginx1.27" plugin="NodeResourcesBalancedAllocation" node="kind-worker" score=93
...
pod="nginx1.27" plugin="NodeResourcesFit" node="kind-worker2" score=83
pod="nginx1.27" plugin="NodeResourcesBalancedAllocation" node="kind-worker2" score=87
```

As we already have a Pod deployed on the `kind-worker2` node, the scores given by the `NodeResourcesFit` and `NodeResourcesBalancedAllocation` plugins for it are lower. The final score for the other worker nodes is therefore higher, and the Pod is scheduled to one of them.
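Note that the examples above only set requests, which is what scheduling looks at. Limits do not influence node selection, but together with requests they determine the Quality of Service class referenced earlier. A minimal sketch (values are arbitrary): with requests equal to limits for every container the Pod gets the Guaranteed class, with requests lower than limits it is Burstable, and with neither set it is BestEffort -- which matters when the kubelet has to reclaim resources on a pressured node.

```yaml
# Illustrative Pod; resource values are arbitrary.
apiVersion: v1
kind: Pod
metadata:
  name: qos-guaranteed        # hypothetical name
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      requests:
        cpu: "500m"
        memory: 256Mi
      limits:
        cpu: "500m"           # requests == limits for all containers -> Guaranteed QoS
        memory: 256Mi
```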
### Affinity rules

The dictionary definition of affinity says that it is an "attraction or connection between things/ideas". So when we define affinity/anti-affinity rules in K8s, it is helpful to think of them as rules of attraction either to nodes or to Pods that have certain characteristics.

#### Node affinity

plugins: `NodeAffinity`

The simplest example of an affinity rule for nodes is the `nodeSelector` field. Let's say we have nodes in different regions, and these nodes have a `zone` label that carries that info:

```yaml
nodeSelector:
  zone: east
```

When a Pod's `nodeSelector` is set, the scheduler will only schedule the Pod to one of the nodes that has all the specified labels.

But what if Pods must be deployed in a certain zone, and it is merely preferred that they are scheduled on nodes with SSD storage? If `nodeSelector` is set to:

```yaml
nodeSelector:
  zone: east
  disktype: ssd
```

the Pod will end up unschedulable, as there is no node with SSD in the east zone. `nodeSelector` isn't expressive enough for selection logic with optional conditions.
For such cases affinity rules come to help:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: east-ssd
spec:
  selector:
    matchLabels:
      app: east-ssd
  replicas: 1
  template:
    metadata:
      labels:
        app: east-ssd
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: zone
                operator: In
                values:
                - east
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: disktype
                operator: In
                values:
                - ssd
      containers:
      - name: nginx
        image: nginx
```

In the manifest above the following affinity rules are set:

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: zone
          operator: In
          values:
          - east
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      preference:
        matchExpressions:
        - key: disktype
          operator: In
          values:
          - ssd
```

There are 2 types of affinity rules:

- `requiredDuringSchedulingIgnoredDuringExecution` -- filters out nodes based on the provided rules. Important note: rules in `nodeSelectorTerms` are ORed, while rules in `matchExpressions` are ANDed. Besides `In`, other operators are available: `NotIn`, `Exists`, `DoesNotExist`, `Gt`, and `Lt` (see "Assigning Pods to Nodes" in the Kubernetes docs)^; a couple of them are shown in the sketch after this list.
- `preferredDuringSchedulingIgnoredDuringExecution` -- scores nodes based on the provided weighted rules. As mentioned before, each plugin that implements the score extension point returns a score for each feasible node (the ones that passed the filter stage). The affinity plugins (both Node and InterPod) return scores based on the weights provided in the manifest. Therefore, in the given example, even if a node with SSD is present, it is not guaranteed to be chosen for the Pod -- it will be scored and compared with the other feasible nodes.
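As a small sketch of the other operators (the labels and values are illustrative), the following required rule matches nodes that are not in the west zone and that carry any `disktype` label at all; note that `Exists` and `DoesNotExist` take no `values` list:

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: zone
          operator: NotIn      # node's zone label must not be one of the listed values
          values:
          - west
        - key: disktype
          operator: Exists     # node must have a disktype label, its value doesn't matter
```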
Important note: both rules have the suffix `IgnoredDuringExecution`, meaning that if the conditions change after the Pod has already been scheduled -- for example, a node label value changes -- the Pod will not be rescheduled.

#### Pod affinity

plugins: `InterPodAffinity`

There are scenarios when co-location of different services is desired. For example, there could be interdependent microservices that constantly communicate. Placing such workloads close to each other improves performance by minimizing latency. Pod affinity allows defining such constraints of the form "this Pod should (or, in the case of anti-affinity, should not) run in an X if that X is already running one or more Pods that meet rule Y, where X is a topology domain like node, rack, cloud provider zone or region, or similar"^.

Pod affinity rules are similar to node affinity. The names are the same: `requiredDuringSchedulingIgnoredDuringExecution` and `preferredDuringSchedulingIgnoredDuringExecution`. But the rules have an additional required field, `topologyKey`, which should point to the node label based on which co-location is defined.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: colocated-app1
spec:
  selector:
    matchLabels:
      app: colocated-app1
  replicas: 1
  template:
    metadata:
      labels:
        app: colocated-app1
    spec:
      containers:
      - name: nginx
        image: nginx
```

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: colocated-app2
spec:
  selector:
    matchLabels:
      app: colocated-app2
  replicas: 1
  template:
    metadata:
      labels:
        app: colocated-app2
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - colocated-app1
            topologyKey: zone
      containers:
      - name: nginx
        image: nginx
```

In these examples, Pod affinity is defined for `colocated-app2` as:

```yaml
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - colocated-app1
      topologyKey: zone
```

Based on this, `colocated-app2` will be scheduled to a node that has the same value of the `zone` label (topology domain) as the node on which `colocated-app1` is deployed.
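The `topologyKey` decides how close "close" is. With `zone` as the key, any node in the same zone satisfies the rule; if the Pods need to land on the very same node, the same rule can point at the standard `kubernetes.io/hostname` label instead. A minimal sketch of that variant:

```yaml
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - colocated-app1
      topologyKey: kubernetes.io/hostname   # co-locate on the same node, not just the same zone
```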
#### Pod anti-affinity

plugins: `InterPodAffinity`

Another scenario is when it is preferred that Pods are deployed in different topology domains (for example, for availability) -- so there is anti-affinity between Pods.

Pod anti-affinity rules are the same ones that are used for Pod affinity:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: aaapp
spec:
  selector:
    matchLabels:
      app: aaapp
  replicas: 2
  template:
    metadata:
      labels:
        app: aaapp
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - aaapp
            topologyKey: zone
      containers:
      - name: nginx
        image: nginx
```

The anti-affinity rule here is:

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - aaapp
      topologyKey: zone
```

It is identical to how it would be defined for Pod affinity, with the exception that it is part of `podAntiAffinity`. Therefore, the 2 replicas of the anti-affinity app will be scheduled to nodes in different topology domains.

### Pod Topology Spread

plugins: `PodTopologySpread`

Pod Topology Spread (PTS), at first glance, can look similar to affinity rules. But in fact it is a very different concept. Affinity rules are concerned with attraction between Pods and nodes -- or, in simple terms, with keeping Pods close to or at a distance from each other. PTS is about controlling how evenly Pods are distributed across different topology domains.

Let's utilize the cluster from the affinity examples. Assume we want an even distribution across the east and west zones -- so that when we scale an app from 2 to 4 replicas, or from 4 to 8, the same number of replicas will be running in each zone.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dapp
spec:
  selector:
    matchLabels:
      app: dapp
  replicas: 6
  template:
    metadata:
      labels:
        app: dapp
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: dapp
      containers:
      - name: nginx
        image: nginx
```

The spread constraint here is:

```yaml
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: dapp
```

The `maxSkew` field defines the allowed degree of unevenness in the distribution of Pods between domains. `whenUnsatisfiable` defines whether the scheduler uses the skew to filter nodes out (`DoNotSchedule`) or only to prefer the nodes that cause the minimum skew (`ScheduleAnyway`).

For example, with `maxSkew: 1` and the `DoNotSchedule` strategy, assume we already have 3 replicas distributed between the east and west zones -- 2 Pods in west and 1 in east, so the skew between the domains is 1. If we add another replica, it can't be placed in the west zone, as the skew would become greater than `maxSkew`, so it has to be placed in the east zone.

PTS can be useful for achieving:

- resilience -- if one zone is down, the workload is still available, since it is also deployed in another zone.
- balanced resource utilization -- ensuring that no single topology domain becomes a bottleneck and keeping the app close to its consumers, optimizing network latency.
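If the even spread is only a preference rather than a hard requirement, the same constraint can be made soft by switching `whenUnsatisfiable` to `ScheduleAnyway`: the scheduler then treats the skew as a scoring signal instead of a filter, so the Pod still gets placed even when no zone can satisfy `maxSkew`. A minimal sketch of that variant:

```yaml
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: zone
  whenUnsatisfiable: ScheduleAnyway   # soft constraint: prefer, but don't block on, an even spread
  labelSelector:
    matchLabels:
      app: dapp
```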