Running GPU workloads on Amazon EKS requires configuring GPU-enabled nodes, installing the necessary drivers, and ensuring proper scheduling. Follow these steps to set up GPU nodes in your EKS cluster.

**1. Create an Amazon EKS Cluster**

First, create an EKS cluster without worker nodes using eksctl (for simplicity, we don’t use Terraform/OpenTofu):

```bash
eksctl create cluster --name kvendingoldo-eks-gpu-demo --without-nodegroup
```

**2. Create a Default CPU Node Group**

A separate CPU node group ensures that:

- Kubernetes system components (`kube-system` pods) have a place to run.
- The GPU Operator and its dependencies can be deployed successfully.
- Non-GPU workloads don’t end up on GPU nodes.

Create at least one CPU node to maintain cluster stability:

```bash
eksctl create nodegroup --cluster kvendingoldo-eks-gpu-demo \
  --name cpu-nodes \
  --node-type t3.medium \
  --nodes 1 \
  --nodes-min 1 \
  --nodes-max 3 \
  --managed
```

**3. Create a GPU Node Group**

GPU nodes should carry an appropriate taint to prevent non-GPU workloads from running on them. Use an NVIDIA-compatible instance type (you can compare the options at instances.vantage.sh; typically it’s `g4dn.xlarge` or `p3.2xlarge`) for such nodes:

```bash
eksctl create nodegroup --cluster kvendingoldo-eks-gpu-demo \
  --name gpu-nodes \
  --node-type g4dn.xlarge \
  --nodes 1 \
  --node-taints only-gpu-workloads=true:NoSchedule \
  --managed
```

The custom taint `only-gpu-workloads=true:NoSchedule` guarantees that only pods with a matching toleration are scheduled on these nodes.

**4. Install the NVIDIA GPU Operator**

The NVIDIA GPU Operator installs the drivers, CUDA, the container toolkit, and monitoring tools. To install it, use the following steps.

Create `gpu-operator-values.yaml`:

```yaml
tolerations:
  - key: "only-gpu-workloads"
    value: "true"
    effect: "NoSchedule"
```

Deploy the gpu-operator via Helm:

```bash
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
helm install gpu-operator nvidia/gpu-operator -f gpu-operator-values.yaml
```

Pay attention to two things:

- The YAML deployment of `k8s-device-plugin` shouldn’t be used for production.
- The `gpu-operator-values.yaml` values set up tolerations for the gpu-operator daemonsets; without them, the operator pods cannot land on the tainted GPU nodes and you won’t be able to schedule GPU workloads there.
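Before moving on, it’s worth confirming that the operator’s daemonsets actually tolerate the taint and landed on the GPU node. A minimal check, assuming you installed the chart into a dedicated `gpu-operator` namespace (e.g. by adding `-n gpu-operator --create-namespace` to the helm install above; use the default namespace otherwise):

```bash
# List the operator and operand pods together with the nodes they landed on;
# the driver, container-toolkit, and device-plugin pods should be running on
# the gpu-nodes instances, not stuck in Pending.
kubectl get pods -n gpu-operator -o wide

# Confirm the daemonsets report the expected number of scheduled pods.
kubectl get daemonsets -n gpu-operator
```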
**5. Verify GPU Availability**

After deploying the GPU Operator, check whether NVIDIA devices are correctly exposed as allocatable resources on the GPU nodes:

```bash
kubectl get nodes -o json | jq '.items[].status.allocatable' | grep nvidia
```

**Check the GPU Status on the Node Using AWS SSM (In Case of Issues)**

If you need to debug a GPU node manually, connect using AWS SSM (Systems Manager Session Manager) instead of SSH.

**Step 1: Attach the SSM IAM Policy**

Ensure your EKS worker nodes have the AmazonSSMManagedInstanceCore policy:

```bash
aws iam attach-role-policy --role-name <NodeInstanceRole> \
  --policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
```

**Step 2: Start an SSM Session**

Find the Instance ID of your GPU node:

```bash
aws ec2 describe-instances --filters "Name=tag:eks:nodegroup-name,Values=gpu-nodes" \
  --query "Reservations[].Instances[].InstanceId" --output text
```

Start the AWS SSM session:

```bash
aws ssm start-session --target <Instance-ID>
```

Inside the node, check the GPU state:

- `lspci | grep -i nvidia` to check whether the GPU hardware is detected
- `nvidia-smi` to verify the NVIDIA driver and GPU status

If `nvidia-smi` fails or the GPU is missing, it may indicate that:

- The GPU Operator is not installed correctly.
- The node does not have an NVIDIA GPU.
- The NVIDIA driver failed to load.

Check the official NVIDIA documentation to solve these issues.

**6. Schedule a GPU Pod**

Deploy a test pod to verify GPU scheduling. This pod:

- Requests a GPU.
- Uses tolerations to run on GPU nodes.
- Runs `nvidia-smi` to confirm GPU access.

```yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: kvendingoldo-gpu-test
spec:
  tolerations:
    - key: "only-gpu-workloads"
      value: "true"
      effect: "NoSchedule"
  nodeSelector:
    nvidia.com/gpu: "true"
  containers:
    - name: cuda-container
      image: nvidia/cuda:12.0.8-base
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
```
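To run the test, apply the manifest and read the pod’s logs; once the pod finishes, the logs should contain the familiar `nvidia-smi` table. A quick sketch, assuming the manifest above was saved as `gpu-test-pod.yaml` (a hypothetical filename):

```bash
kubectl apply -f gpu-test-pod.yaml

# Re-run until the pod is scheduled on a GPU node and shows status Completed.
kubectl get pod kvendingoldo-gpu-test -o wide

# The logs should show the nvidia-smi output: driver/CUDA versions and the GPU model.
kubectl logs kvendingoldo-gpu-test
```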
**7. Handling “Insufficient nvidia.com/gpu” Errors**

Typically, users may face failing pods and errors like:

```
0/2 nodes are available: 1 Insufficient nvidia.com/gpu
```

It means that all GPUs are already allocated, or that Kubernetes is not recognizing the available GPUs. The following fixes may help.

**Check GPU Allocations**

```bash
kubectl describe node <gpu-node-name> | grep "nvidia.com/gpu"
```

If you don’t see any `nvidia.com` resources on your GPU node, the operator isn’t working and you should debug it; this is typically caused by taints and tolerations. Pay attention that an `nvidia-device-plugin` pod should exist on each GPU node.

**Verify the GPU Operator**

Check the status of the operator pods:

```bash
kubectl get pods -n gpu-operator
```

If some pods are stuck in Pending or CrashLoopBackOff, restart the operator:

```bash
kubectl delete pod -n gpu-operator --all
```

**Restart the Kubelet**

Sometimes the kubelet gets stuck. In such cases, logging in to the node (for example via SSM, as shown above) and restarting the kubelet (typically `sudo systemctl restart kubelet`) may help.

**Scale Up GPU Nodes**

Increase the GPU node count:

```bash
eksctl scale nodegroup --cluster=kvendingoldo-eks-gpu-demo --name=gpu-nodes --nodes=3
```

**Conclusion**

Congrats! Your EKS cluster is all set to tackle GPU workloads. Whether you’re running AI models, processing videos, or crunching data, you’re ready to go. Happy deploying!
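If this cluster was only a demo, keep in mind that GPU instances are billed while they run, so you may want to tear everything down afterwards. A hedged cleanup sketch, assuming the cluster name used throughout this guide:

```bash
# Deletes the demo cluster along with the node groups created above.
eksctl delete cluster --name kvendingoldo-eks-gpu-demo
```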