Lead DevOps Engineer, Co-Founder of ReferrsMe & CrowdFind
Running GPU workloads on Amazon EKS requires configuring GPU-enabled nodes, installing necessary drivers, and ensuring proper scheduling. Follow these steps to set up GPU nodes in your EKS cluster.
First, create an EKS cluster without worker nodes using eksctl (for simplicity, we don’t use Terraform/OpenTofu):
eksctl create cluster --name kvendingoldo-eks-gpu-demo --without-nodegroup
A separate CPU node group ensures that:
- Kubernetes system components (kube-system pods) have a place to run.
- The GPU Operator and its dependencies will be deployed successfully.
- Non-GPU workloads don’t end up on GPU nodes.
Create at least one CPU node to maintain cluster stability:
eksctl create nodegroup --cluster kvendingoldo-eks-gpu-demo \
--name cpu-nodes \
--node-type t3.medium \
--nodes 1 \
--nodes-min 1 \
--nodes-max 3
GPU nodes should carry a taint that prevents non-GPU workloads from running on them. Use an NVIDIA-compatible instance type for these nodes, typically g4dn.xlarge or p3.2xlarge:
eksctl create nodegroup --cluster kvendingoldo-eks-gpu-demo \
--name gpu-nodes \
--node-type g4dn.xlarge \
--nodes 1 \
--node-taints only-gpu-workloads=true:NoSchedule
The custom taint only-gpu-workloads=true:NoSchedule guarantees that only pods with a matching toleration are scheduled on these nodes.
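To confirm that the taint actually landed on the GPU nodes, a small helper like the one below can list the taint keys per node. This is a minimal sketch: the function name is hypothetical, and it assumes a managed node group, which EKS labels with eks.amazonaws.com/nodegroup.

```shell
# Print each node of a node group together with the keys of its taints.
# Assumption: the node group is an EKS managed node group, so its nodes
# carry the eks.amazonaws.com/nodegroup label.
show_nodegroup_taints() {
  kubectl get nodes \
    -l "eks.amazonaws.com/nodegroup=$1" \
    -o custom-columns='NODE:.metadata.name,TAINTS:.spec.taints[*].key'
}
```

Calling `show_nodegroup_taints gpu-nodes` should show only-gpu-workloads in the TAINTS column for every GPU node.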
The NVIDIA GPU Operator installs drivers, CUDA, toolkit, and monitoring tools. To install it, use the following steps:
Create gpu-operator-values.yaml:
daemonsets:
  tolerations:
    - key: "only-gpu-workloads"
      value: "true"
      effect: "NoSchedule"
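Deploy the gpu-operator via Helm, taking the repository URL from the NVIDIA GPU Operator installation guide:
helm repo add nvidia <nvidia-helm-repo-url>
helm repo update
helm install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace -f gpu-operator-values.yaml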
Pay attention to the values file: it sets tolerations for the gpu-operator daemonsets. Without them, the operator pods cannot start on the tainted GPU nodes, and you won’t be able to schedule GPU workloads there.
After deploying the GPU Operator, check that NVIDIA devices are detected on the GPU nodes with the following command:
kubectl get nodes -o json | jq '.items[].status.allocatable' | grep nvidia
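If jq is not installed, roughly the same check can be done with kubectl alone. The sketch below is one possible variant, with a hypothetical function name; it prints a per-node table of allocatable GPUs and hides nodes that report none.

```shell
# Show allocatable NVIDIA GPUs per node, keeping the header row and
# dropping nodes where kubectl prints <none> for the GPU column.
list_gpu_nodes() {
  kubectl get nodes \
    -o custom-columns='NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu' |
    awk 'NR == 1 || $2 != "<none>"'
}
```

If `list_gpu_nodes` prints only the header, no node is advertising GPUs yet, which usually points back at the operator pods.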
If you need to manually debug a GPU node, connect using AWS SSM (Systems Manager Session Manager) instead of SSH.
Ensure the IAM role of your EKS worker nodes has the AmazonSSMManagedInstanceCore policy attached:
aws iam attach-role-policy --role-name <NodeInstanceRole> \
--policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
Find the Instance ID of your GPU node:
aws ec2 describe-instances --filters "Name=tag:eks:nodegroup-name,Values=gpu-nodes" \
--query "Reservations[].Instances[].InstanceId" --output text
Start the AWS SSM session:
aws ssm start-session --target <Instance-ID>
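The lookup and the session start can be combined into one helper. This is a sketch under a few assumptions: the function name is hypothetical, it picks the first running instance of the node group, and it relies on the AWS CLI printing None in text mode when the query matches nothing.

```shell
# Open an SSM session to the first running instance of a node group.
ssm_into_nodegroup() {
  instance_id="$(aws ec2 describe-instances \
    --filters "Name=tag:eks:nodegroup-name,Values=$1" \
              "Name=instance-state-name,Values=running" \
    --query "Reservations[].Instances[].InstanceId | [0]" \
    --output text)"
  # An empty result or the literal None means no instance matched.
  if [ -z "$instance_id" ] || [ "$instance_id" = "None" ]; then
    echo "no running instance found in node group $1" >&2
    return 1
  fi
  aws ssm start-session --target "$instance_id"
}
```

For the node group created earlier, the call would be `ssm_into_nodegroup gpu-nodes`.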
Inside the node, check the GPU state:
lspci | grep -i nvidia
to check if the GPU hardware is detected
nvidia-smi
to verify the NVIDIA driver and GPU status
If nvidia-smi fails or the GPU is missing, it may indicate that the NVIDIA driver did not load or that the instance type has no NVIDIA GPU.
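Deploy a test pod to verify GPU scheduling. This pod tolerates the GPU node taint, selects a GPU node, requests one GPU, and runs nvidia-smi:
apiVersion: v1
kind: Pod
metadata:
  name: kvendingoldo-gpu-test
spec:
  restartPolicy: Never  # nvidia-smi prints its report and exits
  tolerations:
    - key: "only-gpu-workloads"
      value: "true"
      effect: "NoSchedule"
  nodeSelector:
    # Assumed label key: the GPU Operator sets nvidia.com/gpu.present on GPU nodes
    nvidia.com/gpu.present: "true"
  containers:
    - name: cuda-container
      image: nvidia/cuda:12.0.8-base
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1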
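To apply the test pod and check that the scheduler placed it, something like the following may help. It is a sketch only: the function name and the manifest file name gpu-test.yaml are hypothetical, and the 120-second timeout is an arbitrary choice.

```shell
# Apply a pod manifest, wait until the scheduler has bound the pod to a
# node, then show which node it landed on. Stops at the first failing step.
schedule_gpu_test() {
  kubectl apply -f "$1" &&
    kubectl wait --for=condition=PodScheduled "pod/$2" --timeout=120s &&
    kubectl get pod "$2" -o wide
}
```

A typical call is `schedule_gpu_test gpu-test.yaml kvendingoldo-gpu-test`. Once the container has run, `kubectl logs kvendingoldo-gpu-test` should print the nvidia-smi table.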
Typically, users may face failing pods and errors like:
0/2 nodes are available: 1 Insufficient nvidia.com/gpu
This means that all GPUs are already allocated or Kubernetes is not recognizing the available GPUs. The following fixes may help.
Check GPU Allocations
kubectl describe node <gpu-node-name> | grep "nvidia.com/gpu"
If you don’t see any nvidia.com/gpu labels on your GPU node, the operator isn’t working and you should debug it. This is typically caused by taints or tolerations. Note that an nvidia-device-plugin pod should exist on each GPU node.
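One way to check the device plugin pods is to list them together with the node each one runs on. The helper below is a sketch with a hypothetical function name; it searches all namespaces because the operator's namespace depends on how it was installed.

```shell
# List NVIDIA device plugin pods with their node and phase.
# grep exits non-zero when no such pod exists, so the function does too.
list_device_plugin_pods() {
  kubectl get pods --all-namespaces \
    -o custom-columns='POD:.metadata.name,NODE:.spec.nodeName,PHASE:.status.phase' |
    grep nvidia-device-plugin
}
```

Every GPU node should appear once in the NODE column; a missing node suggests the daemonset cannot tolerate that node's taints.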
Verify the GPU Operator
Check the status of the operator pods:
kubectl get pods -n gpu-operator
If some pods are stuck in Pending or CrashLoopBackOff, restart the operator:
kubectl delete pod -n gpu-operator --all
Restart kubelet
Sometimes the kubelet gets stuck. In such cases, logging in to the node and restarting the kubelet may help.
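On the node itself (for example, inside an SSM session), the restart could look like the sketch below. It assumes the kubelet runs as a systemd unit named kubelet, as on the EKS-optimized Amazon Linux AMIs; the function name is hypothetical.

```shell
# Restart the kubelet and report whether the unit came back up.
# is-active prints the unit state and exits non-zero if it is not active.
restart_kubelet() {
  sudo systemctl restart kubelet &&
    sudo systemctl is-active kubelet
}
```

If the unit does not report active, `sudo journalctl -u kubelet` is a reasonable next place to look.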
Scale Up GPU Nodes
Increase the GPU node count, raising the group’s maximum size as well because it was created with a single node:
eksctl scale nodegroup --cluster=kvendingoldo-eks-gpu-demo --name=gpu-nodes --nodes=3 --nodes-max=3
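New nodes take a few minutes to join the cluster. A small counter like this one can confirm that the scale-up has finished; it is a sketch with a hypothetical function name and again assumes the eks.amazonaws.com/nodegroup label of managed node groups.

```shell
# Count nodes of a node group whose STATUS column is exactly Ready
# (cordoned nodes show Ready,SchedulingDisabled and are not counted).
count_ready_nodes() {
  kubectl get nodes -l "eks.amazonaws.com/nodegroup=$1" --no-headers |
    awk '$2 == "Ready" { n++ } END { print n + 0 }'
}
```

After scaling to three nodes, `count_ready_nodes gpu-nodes` should eventually print 3.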
Congrats! Your EKS cluster is all set to tackle GPU workloads. Whether you’re running AI models, processing videos, or crunching data, you’re ready to go. Happy deploying!