by Prashant Ghildiyal, Co-Founder, Devtron Labs
Cost saving has always been an important objective for organizations, but it matters now more than ever. With the uncertainty in the business world, the earlier motto of "growth at all costs" has been replaced with "responsible growth".
This post will focus on how you can leverage spot instances of AWS for cost saving in Kubernetes clusters without compromising on stability.
You might think this is trivial since Kubernetes supports it out of the box, but hold that thought. I promise that by the end of this post, you will know the best possible way to use spot instances in Kubernetes clusters using the mechanisms Kubernetes itself provides.
AWS spot instances are usually available at a steep discount (often up to 90%) compared to on-demand instances, but their reliability is lower. If the spot price rises above your maximum (bid) price, AWS terminates the instances with only a two-minute warning. Therefore, we must distribute the pods of our microservices judiciously across spot and on-demand instances.
Handling the termination notice and draining the node gracefully is also important for maintaining the SLA of your microservices, but that is beyond the scope of this article; we will cover it in a separate post.
Before we go into the details, it helps to understand how node autoscaling works in a Kubernetes cluster: when a pod cannot be scheduled due to insufficient capacity, the cluster autoscaler increases the desired instance count of an auto-scaling group (ASG), and once the new node is provisioned, the Kube scheduler places the pod on it.
Without further ado, let's start our journey.
Spoiler alert: the first two attempts are failures, and the third attempt is successful.
Attempt 1: Provision nodes in the right ratio
Based on my discussions, this is the second most popular approach to using spot instances with Kubernetes. It goes like this: if the nodes have the right spot-to-on-demand ratio, the pods will automatically end up in the right ratio too.
Kops has supported mixed instance groups for AWS since version 1.14. Mixed instance groups can be used to achieve the right ratio of spot and on-demand instances.
Let’s look at a relevant part of a sample instance group configuration.
spec:
  mixedInstancesPolicy:
    onDemandBase: 3
    onDemandAboveBase: 30
    spotAllocationStrategy: capacity-optimized
  nodeLabels:
    lifecycle: Spot
As per the above configuration, a minimum of 3 on-demand nodes will always be available. Of any additional capacity, 30% of the nodes will be on-demand and the remaining 70% will be spot.
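For added context, here is a minimal sketch of what the full kops InstanceGroup manifest could look like around this fragment; the group name, cluster name, instance types, and node counts below are illustrative assumptions, not values from the original setup.

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  name: nodes-mixed                                # assumed instance group name
  labels:
    kops.k8s.io/cluster: my-cluster.example.com    # assumed cluster name
spec:
  role: Node
  machineType: m5.xlarge                           # assumed primary instance type
  minSize: 3
  maxSize: 20
  subnets:
  - us-east-1a
  mixedInstancesPolicy:
    instances:                                     # instance types the ASG may draw from
    - m5.xlarge
    - m5a.xlarge
    - m4.xlarge
    onDemandBase: 3                                # the first 3 nodes are always on-demand
    onDemandAboveBase: 30                          # 30% of nodes above the base are on-demand
    spotAllocationStrategy: capacity-optimized
  nodeLabels:
    lifecycle: Spot                                # label used by the pod affinity below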
For node affinity, the following is the relevant portion of the pod spec:
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: lifecycle
            operator: In
            values:
            - Spot
Case 1: Scaling not Required
If the cluster can schedule the pod, the Kube scheduler will use its filter and priority algorithms to schedule the pod on the best possible node.
The priority algorithm of the Kube scheduler doesn't differentiate between spot and on-demand nodes. Therefore, even though the nodes will be in an approximately 70:30 spot-to-on-demand ratio, the distribution of pods across these nodes may not follow that ratio.
Case 2: Scaling Required
If the cluster doesn't have the capacity to schedule the pod, then the cluster autoscaler will increase the desired instance count in the ASG.
The ASG will then provision a new node such that the 70:30 spot-to-on-demand ratio is maintained.
After provisioning, the Kube scheduler will schedule the pod to the new node, assuming there were no pod evictions in between. So in case of a scaling event, the pod will be assigned to the right kind of node.
Can we do better?
We can use inter-pod anti-affinity for a better distribution of pods, but it still will not guarantee a 70:30 split unless the number of pods equals the number of nodes; a sketch of such a rule is shown below.
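For illustration, this is roughly what such a soft inter-pod anti-affinity rule over the lifecycle label could look like, assuming the pods carry an app: sample label (the same label used in attempt 3 later); this is a sketch, not the configuration used in the original setup.

spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: sample        # assumed pod label
          topologyKey: lifecycle # spread relative to the node's lifecycle label

Even with this rule, the scheduler only prefers to spread the pods evenly across the two lifecycle values, which is still not the 70:30 split we are after.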
Outcome: failure
Even though the nodes will have the desired spot-to-on-demand ratio, the pods may or may not be spread in this ratio, resulting in unstable services in case of a spot outage.
Attempt 2: Node affinity
This is the most often cited approach for using spot instances in Kubernetes clusters: use node affinity to control the distribution of pods across spot and on-demand nodes.
For this to work, at least two node groups are needed: one with spot instances only and another with on-demand instances only.
Following are the relevant configurations for the two node groups (this could also be done without using mixed instance groups at all).
For spot:
spec:
  mixedInstancesPolicy:
    onDemandBase: 0
    onDemandAboveBase: 0
    spotAllocationStrategy: capacity-optimized
  nodeLabels:
    lifecycle: Spot
Similarly, for on-demand:
spec:
  mixedInstancesPolicy:
    onDemandBase: 3
    onDemandAboveBase: 100
  nodeLabels:
    lifecycle: OnDemand
Following is the relevant pod spec for node affinity:
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 70
        preference:
          matchExpressions:
          - key: lifecycle
            operator: In
            values:
            - Spot
      - weight: 30
        preference:
          matchExpressions:
          - key: lifecycle
            operator: In
            values:
            - OnDemand
What does weight mean?
Weights of 70 and 30 do not mean the scheduler will distribute pods between these two node labels in a 70:30 ratio.
The scheduler adds the weight of each matching preference to the scores it computes using its priority functions and assigns the pod to the node with the highest total score.
Case 1: Scaling not required
If scaling is not required, the scheduler will prefer to place the pod on spot instances because of the higher weight of 70, though the actual placement also depends on the scores from the other priority functions used by the Kube scheduler.
Case 2: Scaling required
If the scaling of nodes is required to schedule the pod, then the cluster auto-scaler will filter all node groups and prioritize eligible node groups based on its priority algorithm.
The prioritization used by the cluster autoscaler is not the same as the Kube scheduler's. By default, it uses the random expander, which picks one of the eligible node groups at random and increases the desired instance count of its ASG.
Once ASG has provisioned the node, the Kube scheduler will assign the pod to this new node.
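For reference, the expander behaviour mentioned above is controlled by the cluster autoscaler's --expander flag. A sketch of how it typically appears in the cluster-autoscaler container spec follows; the image tag, instance group names, and node counts are illustrative assumptions.

containers:
- name: cluster-autoscaler
  image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.18.3   # assumed version
  command:
  - ./cluster-autoscaler
  - --cloud-provider=aws
  - --expander=random        # the default; other expanders include least-waste, most-pods, price, priority
  - --nodes=0:20:nodes-spot.my-cluster.example.com        # assumed spot instance group / ASG name
  - --nodes=3:10:nodes-ondemand.my-cluster.example.com    # assumed on-demand instance group / ASG name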
Can we do better?
No. Pod anti-affinity will not help here because the nodes themselves are not in the desired spot-to-on-demand ratio.
Outcome: failure
Neither the nodes nor the pods will have the desired spot-to-on-demand ratio. This turns out to be worse than attempt 1.
Attempt 3: Pod topology spread constraints
This is the least-mentioned approach; it uses Pod Topology Spread Constraints, which were introduced in Kubernetes 1.16 and graduated to beta in 1.18. We will use pod topology spread constraints to control how pods are spread across the spot and on-demand instances in the cluster.
Following are the relevant configurations for the two node groups; again, this could also be done without using mixed instance groups.
For spot:
spec:
  mixedInstancesPolicy:
    onDemandBase: 0
    onDemandAboveBase: 0
    spotAllocationStrategy: capacity-optimized
  nodeLabels:
    lifecycle: Spot
Similarly, for on-demand:
spec:
  mixedInstancesPolicy:
    onDemandBase: 3
    onDemandAboveBase: 100
  nodeLabels:
    lifecycle: OnDemand
Following is the relevant portion of the pod spec for pod topology constraints:
metadata:
  labels:
    app: sample
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: lifecycle
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: sample
How do topologySpreadConstraints work?
Topology spread constraints use node labels to identify the topology domain(s) of each node; topologyKey is the key of such a node label. The Kube scheduler tries to place a balanced number of pods across all unique values of this node label (the topologyKey).
In our example, the topologyKey is lifecycle, which has 2 unique values: Spot and OnDemand. The Kube scheduler will place pods across nodes with these two values such that the difference in pod count between the two values is never more than maxSkew (1 in this case).
If this label were missing in any node group, the scheduler would not have scheduled a pod to that node group.
An important point to note is that maxSkew doesn't favor any particular value of the topologyKey; the skew can go in either direction based on the availability and priority of nodes. It only ensures that the imbalance between label values is never more than maxSkew.
Since whenUnsatisfiable is set to DoNotSchedule, the Kube scheduler will not schedule the pod at all if doing so would violate maxSkew.
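Putting it together, here is a minimal sketch of how the constraint might sit inside a Deployment's pod template; the Deployment name, replica count, and container image are assumptions for illustration.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample                    # assumed name
spec:
  replicas: 5
  selector:
    matchLabels:
      app: sample
  template:
    metadata:
      labels:
        app: sample               # must match the labelSelector in the constraint
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: lifecycle    # node label set by the instance groups above
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: sample
      containers:
      - name: app
        image: nginx:1.19         # assumed image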
Case 1: Scaling not required
If scaling is not required, the Kube scheduler will filter and prioritize nodes that honor maxSkew, so pods will be scheduled in the desired ratio.
Case 2: Scaling required
When scaling is required, the cluster autoscaler will filter for node groups that honor the topology constraints and increment the desired instance count in the related ASG.
After the ASG has provisioned the instance, the Kube scheduler will assign the pod to the new node; therefore, pods will be scheduled in the desired ratio.
What’s the catch?
maxSkew is an absolute number, not a ratio, which means that when pods scale up and down with HPA, the effective spot-to-on-demand ratio of pods will change.
maxSkew can be on either side; it is possible to have
number of pods on spot = number of pods on on-demand + maxSkew
number of pods on on-demand = number of pods on spot + maxSkew
This means that for a replica count of 5 and a maxSkew of 1, the spot-to-on-demand split can be 3:2 or 2:3. The spread becomes more unpredictable as the value of maxSkew grows.
To achieve a more skewed split (for example, more pods on spot), it is better to create more buckets (distinct label values) for the topologyKey rather than increasing maxSkew; a sketch follows.
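One way to create such buckets, sketched below under the assumption that extra label values like Spot1 and Spot2 are acceptable, is to run more than one spot node group, each with its own lifecycle value. With three values and maxSkew: 1, roughly two-thirds of the pods end up on spot nodes.

# Spot node group 1 (fragment); label value Spot1 is an assumption
spec:
  mixedInstancesPolicy:
    onDemandBase: 0
    onDemandAboveBase: 0
    spotAllocationStrategy: capacity-optimized
  nodeLabels:
    lifecycle: Spot1

# Spot node group 2 (fragment); label value Spot2 is an assumption
spec:
  mixedInstancesPolicy:
    onDemandBase: 0
    onDemandAboveBase: 0
    spotAllocationStrategy: capacity-optimized
  nodeLabels:
    lifecycle: Spot2

# The on-demand node group keeps lifecycle: OnDemand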
Outcome: success
We cannot get an exact spot-to-on-demand ratio, but we do get a predictable ratio nonetheless.
And that is how you can leverage AWS spot instances for cost saving in Kubernetes clusters without compromising on stability. The complete configuration of the samples used in this blog is available in this git repo.