AWS has just made a new open-source tool called Karpenter generally available. I’ll dig into how it works and, unavoidably, compare it with what we have built at CAST AI.
What exactly is Karpenter, how does its autoscaling mechanism work, and should you use it?
Karpenter is a very simple autoscaler with two reactive policies. Its biggest selling point? No node pools.
Let’s take a look at how Karpenter works and compare it to the autoscaling mechanisms of CAST AI.
After trying Karpenter out, I’d say that onboarding isn’t trivial. Still, it’s not as hard as the Kubernetes Cluster Autoscaler.
Karpenter onboarding requires you to have eksctl CLI installed and configured, even if you created an EKS cluster without the help of eksctl. It takes around 6 to 7 steps to get started with Karpenter.
CAST AI doesn’t have an eksctl dependency, and you’re looking at a 2-step process your IT manager could complete on their own in a minute or two.
Karpenter has only two of the most basic autoscaling policies. At first glance, they are quite similar to two of the policies in CAST AI (Unscheduled Pod Policy and Node Deletion policy). Let’s compare the two.
1. Provisioner
In its simplest form (without sub-features), CAST AI’s Unscheduled Pod Policy works in a similar way as Karpenter’s Provisioner.
If any pods can’t start because the vanilla Kubernetes scheduler has found no suitable node, a new node is created based on the pending pods’ constraints, such as zone, subnet, PodAntiAffinity, and capacity preference (spot or on-demand).
Karpenter has these constraints available for NodeSelector, NodeAffinity, TopologySpreadConstraints:
requirements:
  - key: node.k8s.aws/instance-type #If not included, all instance types are considered
    operator: In
    values: ["m5.large", "m5.2xlarge"]
  - key: "topology.kubernetes.io/zone" #If not included, all zones are considered
    operator: In
    values: ["us-east-1a", "us-east-1b"]
  - key: "kubernetes.io/arch" #If not included, all architectures are considered
    operator: In
    values: ["arm64", "amd64"]
  - key: "karpenter.sh/capacity-type" #If not included, the webhook for the AWS cloud provider will default to on-demand
    operator: In
    values: ["spot", "on-demand"]
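To make the effect of requirements like these concrete, here is a minimal Go sketch of how `In` requirements could narrow down candidate nodes. The `Requirement` type and the `matches` function are hypothetical names I made up for illustration, not Karpenter’s actual implementation; only the `In` operator is modeled, and a key with no requirement is treated as unconstrained.

```go
package main

import "fmt"

// Requirement is a hypothetical, simplified stand-in for one entry of a
// Provisioner's requirements list. Only the In operator is modeled.
type Requirement struct {
	Key    string
	Values []string
}

// matches reports whether a candidate's labels satisfy every requirement.
// A key that has no requirement is unconstrained.
func matches(labels map[string]string, reqs []Requirement) bool {
	for _, r := range reqs {
		allowed := false
		for _, v := range r.Values {
			if labels[r.Key] == v {
				allowed = true
				break
			}
		}
		if !allowed {
			return false
		}
	}
	return true
}

func main() {
	reqs := []Requirement{
		{Key: "topology.kubernetes.io/zone", Values: []string{"us-east-1a", "us-east-1b"}},
		{Key: "karpenter.sh/capacity-type", Values: []string{"spot", "on-demand"}},
	}
	candidate := map[string]string{
		"topology.kubernetes.io/zone": "us-east-1c",
		"karpenter.sh/capacity-type":  "spot",
	}
	// false: us-east-1c is not in the list of allowed zones
	fmt.Println(matches(candidate, reqs))
}
```

Every value you add to a list widens the set of nodes that can be created, and every key you add narrows it.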
Karpenter supports GPU and ARM with the following tolerations: nvidia.com/gpu, amd.com/gpu, and aws.amazon.com/neuron. To add a GPU node, one needs to define a specific list of instance types (and we’re back to the node pool idea).
CAST AI allows you to specify more pod constraints, such as “get me instances with fast local storage for Spark jobs,” without requiring you to specify the instance types on your own. In case you haven’t checked lately, there are 475 available types on AWS, which makes it easy to get back to implicit Node Pools by accident.
You can also specify the spot interruption rate per cluster and override per workload in CAST AI. For example, you may want your casual spot workload in a cluster to be on more cost-efficient spot instances – even if that means these instances could be interrupted once in a while.
But those huge Jupyter notebook pods with a ton of resources? You want them to run on spot instances that are so unpopular that they are barely ever interrupted.
Back to the Node Pools against one’s will
Running your Kubernetes workloads on burstable T3a instance types is a crime against your business and DevOps. Unfortunately, this is what you will get by default with Karpenter, as they are the most affordable instances.
Not excited about burstable instances? You’ll have to list all the acceptable instance types that you want your Kubernetes cluster to have in the Karpenter Provisioner CRD. And with that, you have just specified a cherry-picked node list and are implicitly back to Node Pools.
And what if these cherry-picked node lists are not available during Black Friday or Christmas holidays? You will need to extend the list with even more alternatives.
2. Delete Empty Node policy
As the name implies, Karpenter can and will delete empty nodes. This policy allows you to set the Time To Live for empty nodes (ttlSecondsAfterEmpty with a default of 30).
CAST AI’s Delete Node policy works in the same fashion. The CAST AI engine will not remove nodes if there are workloads running on them (excluding DaemonSets of course).
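The rule both tools follow here can be sketched in a few lines of Go. This is only an illustration of the policy as described above, not code from either product; the `Node` type, its fields, and `shouldDelete` are hypothetical names.

```go
package main

import "fmt"

// Node is a hypothetical, simplified view of a cluster node.
type Node struct {
	Name            string
	WorkloadPods    int // pods that are not managed by a DaemonSet
	EmptyForSeconds int // how long the node has had no workload pods
}

// shouldDelete reports whether a node is empty and has stayed empty for at
// least the configured TTL. DaemonSet pods do not count as workloads.
func shouldDelete(n Node, ttlSecondsAfterEmpty int) bool {
	return n.WorkloadPods == 0 && n.EmptyForSeconds >= ttlSecondsAfterEmpty
}

func main() {
	nodes := []Node{
		{Name: "node-a", WorkloadPods: 0, EmptyForSeconds: 45},
		{Name: "node-b", WorkloadPods: 0, EmptyForSeconds: 10},
		{Name: "node-c", WorkloadPods: 3, EmptyForSeconds: 0},
	}
	for _, n := range nodes {
		// Only node-a is deleted: node-b has not been empty long enough,
		// and node-c still runs workloads.
		fmt.Printf("%s delete=%t\n", n.Name, shouldDelete(n, 30))
	}
}
```

Note that a single workload pod is enough to keep a node alive indefinitely, which is exactly the weakness discussed next.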
The Kubernetes scheduler by default acts fairly and will always distribute workloads for maximum performance and availability. And if you’re running EKS, you can’t change the Kubernetes scheduler profile.
Let’s look at this with an analogy:
Imagine a train (Kubernetes cluster) with 10 train cars (nodes). Each train car can seat 100 passengers (pods), so the train can transport 1000 passengers at peak.
The train leaves the central station fully utilized, with all 1000 seats taken, and at every stop more people get off than get on. A few stops before the end of the journey, the train may be carrying only 15 passengers out of 1000, yet someone still needs to make sure that there are empty train cars to remove. More likely, each train car has 1 to 2 passengers enjoying their privacy, so the Delete Empty Node policy can’t find any car to remove.
So how does Karpenter help with over-provisioned resources for off-peak times when the load is lower?
Enter Node Expiry
Karpenter offers a Node Expiry feature. If a node expiry time-to-live value (ttlSecondsUntilExpired) is reached, that node is drained of pods, and the node is deleted.
Great! But wait a minute, what about your workloads that are running on these nodes? Sorry, end of shift. Nodes are going down.
Alright, but how would you know what the expiration time should be? If you set it too short, your pods go down, and you then have to wait several minutes for the Karpenter Provisioner to add new nodes with a refreshed ttlSecondsUntilExpired.
If you set the ttlSecondsUntilExpired too long, you’re wasting money. So what we see here is that Karpenter reacts in time to add new capacity but leaves you hanging with reducing any excess capacity.
CAST AI, on the other hand, offers Evictor that shrinks the cluster to the minimum number of nodes by bin packing pods (simulating evictions, honoring Pod Disruption Budgets, etc). Once Evictor makes a node empty, it’s deleted automatically.
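The train example shows the gap between the two approaches, and the arithmetic is easy to check. The Go sketch below is not Evictor’s algorithm; it assumes, purely for simplicity, that every pod takes exactly one seat, whereas real pods differ in size and are subject to constraints such as Pod Disruption Budgets. The function names are hypothetical.

```go
package main

import "fmt"

// emptyNodes counts nodes without any pods. Only these nodes can be
// removed by a policy that deletes empty nodes.
func emptyNodes(podsPerNode []int) int {
	n := 0
	for _, pods := range podsPerNode {
		if pods == 0 {
			n++
		}
	}
	return n
}

// nodesNeeded returns the fewest nodes that could hold all pods, assuming
// (for simplicity) that every pod takes exactly one of `capacity` seats.
func nodesNeeded(podsPerNode []int, capacity int) int {
	total := 0
	for _, pods := range podsPerNode {
		total += pods
	}
	return (total + capacity - 1) / capacity // ceiling division
}

func main() {
	// 15 passengers spread over 10 train cars with 100 seats each.
	train := []int{2, 1, 2, 1, 2, 1, 2, 1, 2, 1}
	fmt.Println("removable empty cars:", emptyNodes(train))              // 0
	fmt.Println("cars needed after repacking:", nodesNeeded(train, 100)) // 1
}
```

An empty-node policy frees nothing here, while repacking the same 15 passengers would let you drop 9 of the 10 cars.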
At the moment, Karpenter only supports AWS. The code has a lot of AWS-specific hardcoded logic, such as the use of EC2 Fleet and Spot Fleet, with no clear way forward for other clouds.
On the other hand, Karpenter has empty interfaces created in the code so the project looks pluggable and open enough if someone were to take the initiative of adding other cloud provider support.
CAST AI supports AWS (EKS, kOps), Google Cloud Platform (GKE), and Microsoft Azure (AKS).
The promise of being free of node pools sounds great. But as it stands, Karpenter offers users almost no control over the scheduling decisions when using a “naked” Provisioner.
Provisioner is Karpenter’s Kubernetes CRD which defines the node creation constraints. For example, if you had workloads that needed a GPU and you wanted the GPU nodes to have a taint (so that only the pods that actually need it are placed on those nodes), you would create a Provisioner like this:
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu
spec:
  requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values:
        - p3.2xlarge
        - p3.8xlarge
        - p3.16xlarge
  taints:
    - key: nvidia.com/gpu
      value: "true"
      effect: "NoSchedule"
In Karpenter, the engineer still needs to have deep knowledge of the available instance types with all of their nuances.
To use local SSDs in a K8s cluster, for example, you would need to create a user-data script that makes a single logical disk out of those local SSDs and creates a symbolic link for the kubelet to use. That requires creating an EC2 Launch Template and referencing it in the Provisioner resource. And, most importantly, you need to keep it up to date if you want to achieve cost savings.
Conclusion: In more complicated clusters, you’d have to create many different Provisioners, i.e., node pools: one for GPUs, another one for storage optimized, yet another one for infrequently interrupted Spot instances, etc.
Provisioner Conflicts
When there are multiple Provisioner resources defined and an unschedulable pod can tolerate/select any of them, the result will be non-deterministic.
Here’s an example:
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: team-a
spec:
  labels:
    team: a
  requirements:
    - key: "node.kubernetes.io/instance-type"
      operator: In
      values: ["m5.large"]
---
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: team-b
spec:
  labels:
    team: b
  requirements:
    - key: "node.kubernetes.io/instance-type"
      operator: In
      values: ["m5.2xlarge"]
---
apiVersion: v1
kind: Pod
metadata:
  name: pod
spec:
  containers:
    - image: nginx
      name: pod
The nginx pod would be scheduled on either the m5.large or the m5.2xlarge node, depending on the order in which the API server returns the Provisioner records. This is another dimension that the DevOps engineer would need to think about.
The Provisioners should clearly define constraints, and all pods should clearly define the selectors to reduce or eliminate randomness.
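Here is a minimal Go sketch of why an unconstrained pod lands on an unpredictable node. It assumes a simple first-match selection over the list of Provisioners, which illustrates the behavior described above rather than reproducing Karpenter’s code; the `Provisioner` struct and `firstMatch` are hypothetical.

```go
package main

import "fmt"

// Provisioner is a hypothetical, trimmed-down view of the CRD.
type Provisioner struct {
	Name         string
	InstanceType string
	Labels       map[string]string
}

// firstMatch returns the first provisioner whose labels satisfy the pod's
// nodeSelector. With an empty selector every provisioner matches, so the
// result depends purely on the order of the list.
func firstMatch(nodeSelector map[string]string, provisioners []Provisioner) (Provisioner, bool) {
	for _, p := range provisioners {
		ok := true
		for k, v := range nodeSelector {
			if p.Labels[k] != v {
				ok = false
				break
			}
		}
		if ok {
			return p, true
		}
	}
	return Provisioner{}, false
}

func main() {
	a := Provisioner{Name: "team-a", InstanceType: "m5.large", Labels: map[string]string{"team": "a"}}
	b := Provisioner{Name: "team-b", InstanceType: "m5.2xlarge", Labels: map[string]string{"team": "b"}}

	// No selector: the winner changes with the list order.
	p1, _ := firstMatch(nil, []Provisioner{a, b})
	p2, _ := firstMatch(nil, []Provisioner{b, a})
	fmt.Println(p1.InstanceType, p2.InstanceType) // m5.large m5.2xlarge

	// With a selector the outcome no longer depends on the order.
	sel := map[string]string{"team": "b"}
	p3, _ := firstMatch(sel, []Provisioner{a, b})
	p4, _ := firstMatch(sel, []Provisioner{b, a})
	fmt.Println(p3.Name, p4.Name) // team-b team-b
}
```

Adding a `nodeSelector` such as `team: b` to the pod is what removes the randomness.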
In CAST AI, the selectors/tolerations are predefined, and engineers don’t have to come up with them on their own.
Karpenter’s GitHub states that the solution uses bin packing, but only in the nodePlacer sense. That is, it groups pods to be placed on nodes to avoid creating a separate node for each pod.
In CAST AI, we use Permutations and Clique functions. Their purpose is to find the optimal pod groupings.
Karpenter’s algorithm is basically creating a hash of the pod constraints. This means that otherwise compatible pods will end up on different nodes (Karpenter will create multiple nodes).
Consider this example:
apiVersion: v1
kind: Pod
metadata:
  name: pod1
spec:
  nodeSelector:
    topology.kubernetes.io/zone: "eu-central-1a"
  containers:
    - image: nginx
      name: pod
---
apiVersion: v1
kind: Pod
metadata:
  name: pod2
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                  - "eu-central-1a"
                  - "eu-central-1b"
  containers:
    - image: nginx
      name: pod
Even though the pods could fit into a single node placed in the eu-central-1a zone, Karpenter will create two nodes (that might even end up in the same zone) because the hash of the pod constraints is different.
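The difference between grouping by identical constraints and grouping by compatible constraints can be sketched in Go. This is an illustration under simplifying assumptions, not Karpenter’s or CAST AI’s code: each pod is reduced to the list of zones it may run in, a sorted string stands in for the hash, and all names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Pod is reduced to the zones it is allowed to run in.
type Pod struct {
	Name  string
	Zones []string
}

// constraintKey builds a deterministic key from a pod's constraints.
// It stands in for a hash of the constraints.
func constraintKey(p Pod) string {
	zones := append([]string(nil), p.Zones...)
	sort.Strings(zones)
	return strings.Join(zones, ",")
}

// groupByKey puts pods into the same group only if their constraints
// are identical.
func groupByKey(pods []Pod) map[string][]string {
	groups := make(map[string][]string)
	for _, p := range pods {
		k := constraintKey(p)
		groups[k] = append(groups[k], p.Name)
	}
	return groups
}

// commonZones returns the zones that every pod allows.
func commonZones(pods []Pod) []string {
	count := make(map[string]int)
	for _, p := range pods {
		seen := make(map[string]bool)
		for _, z := range p.Zones {
			if !seen[z] {
				seen[z] = true
				count[z]++
			}
		}
	}
	var out []string
	for z, c := range count {
		if c == len(pods) {
			out = append(out, z)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	pods := []Pod{
		{Name: "pod1", Zones: []string{"eu-central-1a"}},
		{Name: "pod2", Zones: []string{"eu-central-1a", "eu-central-1b"}},
	}
	fmt.Println(len(groupByKey(pods))) // 2: the constraint keys differ
	fmt.Println(commonZones(pods))     // [eu-central-1a]: one node there fits both
}
```

Grouping by key yields two groups and therefore two nodes, even though the intersection of the constraints shows that a single node in eu-central-1a would do.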
When Karpenter bin-packs pods, it produces a list of viable instance types for each pod group. For example, if there’s a pod group that needs 2 CPU and 8 GiB, then a list of all instance types that have at least 2 CPU and 8 GiB is retrieved. The list is sorted by resource weight.
Karpenter sorts those instance types using a weight function, where 1 CPU = 1 GiB, favoring nodes with higher CPU and lower memory:
// weightOf uses a euclidean distance function to compare the instance types.
// Units are normalized such that 1cpu = 1gb mem. Additionally, accelerators
// carry an arbitrarily large weight such that they will dominate the priority,
// but if equal, will still fall back to the weight of other dimensions.
func weightOf(instanceType cloudprovider.InstanceType) float64 {
	return euclidean(
		float64(instanceType.CPU().Value()),
		float64(instanceType.Memory().ScaledValue(resource.Giga)), // 1 gb = 1 cpu
		float64(instanceType.NvidiaGPUs().Value())*1000,           // Heavily weigh gpus x 1000
		float64(instanceType.AMDGPUs().Value())*1000,              // Heavily weigh gpus x 1000
		float64(instanceType.AWSNeurons().Value())*1000,           // Heavily weigh neurons x 1000
	)
}
After sorting, only the first 20 instance types are picked.
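To see what this sorting does in practice, here is a self-contained Go sketch of the same idea. It assumes that `euclidean` is the square root of the sum of squares (the helper isn’t shown in the snippet above), uses a simplified `InstanceType` struct of my own instead of Karpenter’s `cloudprovider.InstanceType`, and folds all accelerators into one `GPUs` field. The instance sizes are approximate.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// InstanceType is a simplified, hypothetical stand-in for Karpenter's type.
type InstanceType struct {
	Name     string
	CPU      float64 // vCPUs
	MemoryGB float64
	GPUs     float64
}

// weightOf mirrors the idea of the function above: 1 CPU weighs as much
// as 1 GB of memory, and accelerators are multiplied by 1000.
// Assumption: euclidean means sqrt(x1^2 + x2^2 + ...).
func weightOf(it InstanceType) float64 {
	gpu := it.GPUs * 1000
	return math.Sqrt(it.CPU*it.CPU + it.MemoryGB*it.MemoryGB + gpu*gpu)
}

// lightest sorts a copy of the list by ascending weight and keeps the
// first n entries.
func lightest(types []InstanceType, n int) []InstanceType {
	sorted := append([]InstanceType(nil), types...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return weightOf(sorted[i]) < weightOf(sorted[j])
	})
	if len(sorted) > n {
		sorted = sorted[:n]
	}
	return sorted
}

func main() {
	types := []InstanceType{
		{Name: "p3.2xlarge", CPU: 8, MemoryGB: 61, GPUs: 1},
		{Name: "m5.2xlarge", CPU: 8, MemoryGB: 32},
		{Name: "r5.large", CPU: 2, MemoryGB: 16},
		{Name: "c5.xlarge", CPU: 4, MemoryGB: 8},
		{Name: "m5.large", CPU: 2, MemoryGB: 8},
	}
	for _, it := range lightest(types, 3) {
		// Prints m5.large, c5.xlarge, r5.large (in that order).
		fmt.Printf("%s weight=%.2f\n", it.Name, weightOf(it))
	}
}
```

Price never enters the calculation, so whatever falls outside the cut-off is dropped regardless of what it costs.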
This list might not contain the cheapest instance types because of Karpenter’s particular weighing system. There is no inventory database that can tell exactly how much something costs.
After the bin-packing and sorting, the list is passed to the EC2 Fleet API. The mode of choosing the instance type depends on whether it’s an on-demand node or spot instance.
On-demand instances
In this case, Karpenter passes the lowest-price allocation strategy, which makes the EC2 Fleet API choose the cheapest instance type out of the 20 viable instance types it received. There might be cheaper instance types, but the API works with what it was given.
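The consequence of handing the API a pre-trimmed list can be shown with a small Go sketch. The `Offer` type, the `cheapest` function, the instance names, and the prices are all made up for illustration; this does not call the EC2 Fleet API.

```go
package main

import "fmt"

// Offer pairs an instance type with an hourly price.
// Names and prices are hypothetical, for illustration only.
type Offer struct {
	InstanceType string
	HourlyPrice  float64
}

// cheapest mimics a lowest-price choice that is restricted to the
// candidates it is handed.
func cheapest(candidates []Offer) (Offer, bool) {
	if len(candidates) == 0 {
		return Offer{}, false
	}
	best := candidates[0]
	for _, o := range candidates[1:] {
		if o.HourlyPrice < best.HourlyPrice {
			best = o
		}
	}
	return best, true
}

func main() {
	all := []Offer{
		{InstanceType: "type-a", HourlyPrice: 0.10},
		{InstanceType: "type-b", HourlyPrice: 0.08},
		{InstanceType: "type-c", HourlyPrice: 0.05},
	}
	// Suppose the weight-based cut-off kept only the first two entries.
	shortlist := all[:2]

	fromShortlist, _ := cheapest(shortlist)
	overall, _ := cheapest(all)
	fmt.Println(fromShortlist.InstanceType) // type-b
	fmt.Println(overall.InstanceType)       // type-c
}
```

The cheapest option overall is never considered because it didn’t make the shortlist.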
Spot instances
Karpenter passes the capacity-optimized-prioritized allocation strategy, which makes the EC2 Fleet API choose the instance types that have the highest availability, prioritizing the instance types at the top of the list. This has nothing to do with cost optimization and could be considered something similar to the CAST AI “Least interrupted” Spot reliability setting.
CAST AI offers an entirely different approach to cost optimization. The platform selects the most cost-efficient instances that match the workload requirements and automatically rightsizes instances to avoid overallocation. CAST AI also comes with features such as cost reporting on any level, information about cluster cost, or potential savings. Karpenter doesn’t have any of those.
As you can see, Karpenter offers a limited set of functionalities right now.
AWS Karpenter is an option if you’re on AWS and considering the native Kubernetes Cluster Autoscaler. Today, AWS offers 400+ instance types, so the node pools that the Kubernetes Cluster Autoscaler relies on become unmanageable, especially considering the exhaustible instance inventory. This is essentially the problem Karpenter tries to solve.
If you’re looking for an autoscaler that can significantly reduce your cloud bill for production workloads without impacting their availability, you need to look elsewhere.
CAST AI offers a substantially more robust set of features which makes it a great fit for reliable automated cost reduction on production and development workloads. It also works with all three major cloud providers: AWS, Google Cloud, and Azure.
To see what results CAST AI could bring you, you can register and start with a free cluster analysis or book a quick demo by clicking the button below.