By nature, pods in Kubernetes clusters are ephemeral: they can be created, killed, and moved around by the scheduler. If pods are not configured properly, this churn can occasionally disrupt the microservices they run.
In this article, we will look at two scenarios in which pod eviction can impact stability: pod preemption and out-of-resource eviction.
We will also see how to protect our pods by ensuring they have an appropriate Quality of Service class and pod priority.
There is no direct way to specify the Quality of Service (QoS) class of a pod. Kubernetes derives it from the resource requests and limits of the pod's containers.
Each container can specify a request for a resource, which is the amount Kubernetes guarantees to the container, and a limit, which is the maximum amount Kubernetes will allow the container to use.
Pod-level requests and limits are computed by summing the per-resource requests and limits across all containers of the pod. Based on these, Kubernetes currently assigns one of three QoS classes:
Guaranteed
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-nginx
  namespace: demo
spec:
  containers:
    - name: guaranteed-nginx
      image: nginx
      resources:
        limits:
          memory: "512Mi"
          cpu: "1024m"
        requests:
          memory: "512Mi"
          cpu: "1024m"
Burstable
apiVersion: v1
kind: Pod
metadata:
  name: burstable-nginx
  namespace: demo
spec:
  containers:
    - name: burstable-nginx
      image: nginx
      resources:
        limits:
          memory: "1024Mi"
        requests:
          memory: "512Mi"
Best Effort
apiVersion: v1
kind: Pod
metadata:
  name: besteffort-nginx
  namespace: demo
spec:
  containers:
    - name: besteffort-nginx
      image: nginx
Kubernetes exposes two pod specs to define priority: priority and priorityClassName. They are used along with preemptionPolicy, which can be either Never or PreemptLowerPriority.
A pod with higher priority is placed ahead in the scheduling queue. If preemptionPolicy is set to PreemptLowerPriority and no node satisfies the pod's requirements, the scheduler will evict lower-priority pods to make room for it.
PriorityClass config with preemption disabled
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
preemptionPolicy: Never
value: 1000000
globalDefault: false
description: "This priority class will not preempt other pods"
Pod with priority class
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: demo
spec:
  containers:
    - name: nginx
      image: nginx
      imagePullPolicy: IfNotPresent
  priorityClassName: high-priority
Let’s analyze the roles that Quality of Service and pod priority play in the stability of pods during preemption and eviction.
Pod Preemption
When pods are created, they are placed into the scheduling queue based on their priority. The scheduler picks up a pod and filters nodes based on the requirements the pod specifies. If the scheduler cannot find any suitable node, preemption logic is invoked for the pending pod, provided the pending pod's preemptionPolicy is not set to Never.
Preemption logic tries to find nodes running pods of lower priority than the pending pod, so that the pending pod can be scheduled there after the lower-priority pods are removed.
Quality of Service has no impact on pod preemption; preemption is governed only by pod priority and preemptionPolicy.
1. Setting up Priority for First Time on Existing Cluster
When you set up pod priority for the first time, start with the highest-priority pods, or set preemptionPolicy to Never. The default priority of a pod is 0, so if you assign priority to low-priority pods first, they may preempt critical pods that do not yet have a priority set, which can result in an outage.
Grafana faced a ~30-minute outage, as blogged here, which was attributed to applying pod priorities in the wrong order.
2. PodDisruptionBudget (PDB) is Not Guaranteed
During preemption, a PDB is honored only on a best-effort basis. The scheduler tries to find a node where evicting lower-priority pods will not violate a PDB; if it cannot find one, it will still evict low-priority pods to schedule the high-priority pod, even if the eviction violates the PDB and may cause an outage.
So, what purpose is served by PodDisruptionBudget?
PodDisruptionBudget comes into the picture during voluntary disruptions, e.g., a node drain or a downscale during cluster autoscaling. A PodDisruptionBudget limits the number of pods of an application that can be down simultaneously, thereby ensuring quality of service is not impacted.
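As a sketch, a PDB keeping at least two replicas of an application available during voluntary disruptions might look like this (the namespace and the `app: nginx` selector are assumptions for illustration):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb
  namespace: demo
spec:
  minAvailable: 2        # at least 2 matching pods must stay up during voluntary disruptions
  selector:
    matchLabels:
      app: nginx         # assumed label on the application's pods
```

You can also express the budget as `maxUnavailable` instead of `minAvailable`, whichever maps more naturally to your replica count.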
3. Affinity with Low Priority Pod
If a high-priority pod (H) has inter-pod affinity with a lower-priority pod (L), the scheduler may evict L from a node to make space for H. Once that happens, the inter-pod affinity is no longer satisfied, so H will not be scheduled on that node. This loop can repeat and can negatively impact the availability of services.
You can avoid this by ensuring that a pod with preemptionPolicy PreemptLowerPriority only declares inter-pod affinity with pods of equal or higher priority.
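As an illustrative sketch (the label, topology key, and class names are assumptions), the high-priority pod's affinity should target pods that themselves run with an equal or higher PriorityClass:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: high-priority-web
  namespace: demo
spec:
  priorityClassName: high-priority
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: cache   # assumed label; these pods should use an equal or higher PriorityClass
          topologyKey: kubernetes.io/hostname
  containers:
    - name: web
      image: nginx
```

If the `app: cache` pods carried a lower priority, this pod could trigger the eviction loop described above.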
4. Preemption May Not Follow Strict Priority Order
The scheduler prefers nodes where it can free capacity by evicting the lowest-priority pods. However, if the pending pod cannot run on such a node even after eviction, or if those low-priority pods are protected by a PodDisruptionBudget, the scheduler may instead pick a node whose victims have higher priority: still lower than the pending pod's priority, but higher than the victims available on other nodes.
Out of Resource Eviction
On overcommitted nodes, pods will be killed if the system runs out of resources. The kubelet proactively monitors compute resources and makes eviction decisions based on incompressible resources, namely memory and disk (nodefs and imagefs).
Eviction does not happen under pressure on compressible resources, e.g., CPU.
Kubernetes allows us to define two thresholds to control the eviction policy of the pods.
Soft Eviction Threshold
If a soft eviction threshold is reached, pods are evicted with a grace period, calculated as the minimum of the pod's termination grace period and the soft eviction grace period. If no soft eviction grace period is specified, pods are killed immediately.
Hard Eviction Threshold
If a hard eviction threshold is reached, pods are evicted immediately, without any grace period.
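A minimal sketch of a KubeletConfiguration combining both kinds of threshold (the specific values are assumptions, not recommendations):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionSoft:
  memory.available: "500Mi"    # soft threshold: evict with a grace period
evictionSoftGracePeriod:
  memory.available: "1m30s"    # grace period paired with the soft threshold above
evictionHard:
  memory.available: "100Mi"    # hard threshold: evict immediately
  nodefs.available: "10%"
```

Every signal in evictionSoft must have a matching entry in evictionSoftGracePeriod, otherwise the kubelet rejects the configuration.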
Eviction Policy
In case of imagefs or nodefs pressure, the kubelet ranks pods for eviction by their disk usage: local volumes plus logs plus the writable layers of all containers.
In case of memory pressure, pods are sorted first by whether their memory usage exceeds their request, then by pod priority, and then by memory consumption relative to the memory request. Pods that do not exceed their memory request are not evicted: a lower-priority pod that stays within its memory request will not be evicted, whereas a higher-priority pod that exceeds its request will be.
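It follows that sizing memory requests to cover actual usage is the main defense against memory-pressure eviction. As a sketch (the sizes are assumed values), a pod whose request equals its limit gets the Guaranteed QoS class and, as long as it stays within that request, is not a candidate for eviction:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: stable-nginx
  namespace: demo
spec:
  containers:
    - name: stable-nginx
      image: nginx
      resources:
        requests:
          memory: "512Mi"   # sized to cover observed peak usage (assumed value)
        limits:
          memory: "512Mi"   # equal to the request, so the pod is Guaranteed
```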
Node Out of Memory (OOM) Kill
If a node experiences an OOM event before the kubelet is able to reclaim memory, the node depends on the oom_killer to respond.
The oom_killer calculates an oom_score such that containers with the lowest Quality of Service class that consume the most memory relative to their request are killed first.
Unlike evicted pods, OOM-killed pods may be restarted by the kubelet, depending on their restart policy.
Previously published on: https://devtron.ai/blog/ultimate-guide-of-pod-eviction-on-kubernetes/