AWS EKS allows you to host Windows and Linux containers in the same cluster. Linux nodes, however, start much faster: they take only around 30s to become available for scheduling pods, compared with the 5 or so minutes it takes to start a Windows node on an m5a.xlarge (4 vCPUs, 16GB RAM). I’ll share some tricks I used to start Windows nodes in roughly 90s with Karpenter.
Anyone looking for ways to improve the startup performance of EKS Windows nodes will have come across the article on the subject on the AWS blog.
Broadly speaking, two options are advised to address the slow boot: pre-caching container base images in the AMI, and enabling EC2 fast launch.
A common use case for EKS Windows nodes is hosting legacy .NET Framework applications under IIS. The problem is that the IIS container image takes a very long time to download (10m), making autoscaling a no-go. Pre-caching base images essentially eliminates the time taken to ‘download IIS’: the IIS layers that make up your Docker images are already baked into the AMI, so only your application files are downloaded at node startup.
Before a pod can even run on the node, the node itself needs to boot. AWS Image Builder syspreps Windows images when producing AMIs, so the first time a new Windows EC2 instance starts from one, it has to go through a long initialization process.
Every Amazon EC2 Windows instance must go through the standard Windows operating system (OS) launch steps, which include several reboots, and often take 15 minutes or longer to complete. Amazon EC2 Windows Server AMIs that have the Windows fast launch feature enabled complete some of those steps and reboots in advance to reduce the time it takes to launch an instance.
With fast launch enabled, freshly started EC2 instances take around 50–60s to ‘reach the lock screen’ on a Windows Server 2022 m5a.xlarge.
Even with these strategies employed, it still takes 5 minutes for an m5a.xlarge Windows node to start and join the cluster.
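For an AMI you already own, fast launch can also be switched on from the command line. The following is a rough sketch only, assuming the AWS CLI is installed and has credentials; Enable-AmiFastLaunch is a hypothetical helper name and the AMI ID in the usage comment is a placeholder. It may not match how fast launch was enabled for the images in this post.

```powershell
# Hypothetical helper: enable EC2 fast launch for a Windows AMI and report its state.
# Assumption: the AWS CLI is on PATH and credentials/region are already configured.
function Enable-AmiFastLaunch {
    param([Parameter(Mandatory)][string]$ImageId)

    # Ask EC2 to start pre-provisioning snapshots for this AMI (default settings).
    aws ec2 enable-fast-launch --image-id $ImageId

    # Fast launch is not ready immediately; this shows the current state for the AMI.
    aws ec2 describe-fast-launch-images --image-ids $ImageId --query 'FastLaunchImages[].State' --output text
}

# Usage (placeholder AMI ID, not a real image):
# Enable-AmiFastLaunch -ImageId 'ami-0123456789abcdef0'
```

Pre-provisioned snapshots are stored in your account, so expect some extra storage cost while fast launch stays enabled.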
When Karpenter requests a new Windows 2022 EC2 instance, it uses the EKS-optimized AMIs provided by AWS. This AMI contains a couple of PowerShell scripts to bootstrap the Kubernetes components:
Start-EKSBootstrap.ps1
EKS-StartupTask.ps1
Digging around, it appears there are 3 parts to the EKS Windows node boot process:
Start the EC2 instance (50s with fast launch enabled)
Start-EKSBootstrap.ps1 (4–6 mins depending on instance size)
EKS-StartupTask.ps1 (40 sec)
So the longest task is the bootstrap script. This script creates config files for kube-proxy, the kubelet and the CNI plugin, registers the kubelet and kube-proxy with the Windows Service Control Manager (SCM), then triggers EKS-StartupTask.ps1 via a scheduled task, which goes on to set up container networking and start the Kubernetes services.
Looking through the Start-EKSBootstrap.ps1 file (4–6 min runtime), it appears most of the time is wasted loading the large AWSPowershell module, which the script requires to get EC2 instance metadata and EKS cluster information. Investigating the impact of antivirus on this script showed that Windows Defender was slowing down the load of this module.
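If you want to check the module load cost on your own node, one option is to time the import in a fresh process, since a module that is already loaded in the current session imports almost instantly. This is a minimal sketch, not the measurement method used for this post; Measure-ModuleImport is a hypothetical helper name, and the figure it reports includes the start-up time of the child PowerShell process.

```powershell
# Hypothetical helper: time how long a fresh PowerShell process takes to import a module.
function Measure-ModuleImport {
    param(
        [Parameter(Mandatory)][string]$Name,
        # Defaults to the executable running the current session
        # (powershell.exe on a stock Windows node).
        [string]$Shell = (Get-Process -Id $PID).Path
    )

    # A new process guarantees the module is not already loaded or cached in-session.
    $elapsed = Measure-Command {
        & $Shell -NoProfile -NonInteractive -Command "Import-Module -Name '$Name' -ErrorAction Stop"
    }

    [pscustomobject]@{
        Module   = $Name
        Seconds  = [math]::Round($elapsed.TotalSeconds, 2)
        # The child process exits non-zero when the import fails (e.g. module not installed).
        Imported = ($LASTEXITCODE -eq 0)
    }
}

# Compare the monolithic module with the modular ones installed later in this post.
'AWSPowershell', 'Aws.Tools.Ec2', 'Aws.Tools.Eks' |
    ForEach-Object { Measure-ModuleImport -Name $_ }
```

Running it once with Windows Defender real-time protection on and once with it off gives a rough idea of how much of the load time is down to antivirus scanning.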
The first thing I could do was uninstall the AWSPowershell module from the EC2 instance entirely.
I looked through the scripts to see which PowerShell cmdlets were actually in use, which narrowed the modules needed by the bootstrap process down to the EKS and EC2 ones.
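One way to do this kind of audit without reading every line by hand is to let the PowerShell parser list the commands a script calls. The sketch below is an illustration rather than the exact process followed here; Get-ScriptCommandName is a hypothetical helper name, and the filter relies on the AWS cmdlet naming convention in which the service name follows the verb (for example Get-EC2... or Get-EKS...).

```powershell
# Hypothetical helper: list the distinct command names a script invokes, without running it.
function Get-ScriptCommandName {
    param([Parameter(Mandatory)][string]$Path)

    $tokens = $null
    $errors = $null
    # ParseFile expects a full path, so resolve relative paths first.
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath
    $ast = [System.Management.Automation.Language.Parser]::ParseFile($fullPath, [ref]$tokens, [ref]$errors)

    # Walk the syntax tree, including nested script blocks, and keep command invocations.
    $ast.FindAll({ $args[0] -is [System.Management.Automation.Language.CommandAst] }, $true) |
        ForEach-Object { $_.GetCommandName() } |   # returns $null for dynamic invocations
        Where-Object { $_ } |
        Sort-Object -Unique
}

# On a node, narrow the list down to commands that look like EC2/EKS cmdlets.
$bootstrap = 'C:\Program Files\Amazon\EKS\Start-EKSBootstrap.ps1'
if (Test-Path -LiteralPath $bootstrap) {
    Get-ScriptCommandName -Path $bootstrap | Where-Object { $_ -match '-(EC2|EKS)' }
}
```

Any script that the bootstrap script dot-sources or calls would need the same treatment, since the parser only sees the file it is given.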
The effect of removing the AWSPowershell module is that the bootstrap script no longer needs to load everything AWS offers via PowerShell just to get a few bits of information about EC2 and EKS.
This was easy to slot in because we were already using Image Builder to pre-cache container images. I created a new Image Builder component with the following content:
name: OptimizeEKSNodeModules
description: OptimizeEKSNodeModules
schemaVersion: 1.0
phases:
  - name: build
    steps:
      - name: OptimizeEKSNodeModules
        action: ExecutePowerShell
        inputs:
          commands:
            - Uninstall-Module AWSPowershell
            - Install-PackageProvider -Name NuGet -MinimumVersion 2.8.5.201 -Force
            - Install-Module -Name Aws.Tools.Eks -Force -AllowClobber -Scope AllUsers
            - Install-Module -Name Aws.Tools.Ec2 -Force -AllowClobber -Scope AllUsers
This had a good impact on boot time, which dropped to about 3–4 mins.
This could still be improved…
However, the PowerShell scripts had limited avenues for optimization. In the end, I decided to rewrite the bootstrap script in C#/.NET, where I had a bit more control.
The eks-windows-bootstrapper repo (https://github.com/atg-cloudops/eks-windows-bootstrapper) contains the code for the C# version of Start-EKSBootstrap.ps1. It parallelizes as much as possible and is AOT (ahead-of-time) compiled to native code in the GitHub Action to reduce the startup latency of the release binary. It also removes all the checks that made the script idempotent, as these are unnecessary under Karpenter and just slow down the boot.
To use it, you can add another component to your Image Builder recipe:
name: InstallEksWindowsBootstrapper
description: InstallEksWindowsBootstrapper
schemaVersion: 1.0
phases:
  - name: build
    steps:
      - name: InstallEksWindowsBootstrapper
        action: ExecutePowerShell
        inputs:
          commands:
            - |
              Invoke-WebRequest "https://github.com/atg-cloudops/eks-windows-bootstrapper/releases/download/v1.29.0/EKS-Windows-Bootstrapper.exe" -OutFile "C:/EKS-Windows-Bootstrapper.exe"
              Invoke-WebRequest "https://github.com/atg-cloudops/eks-windows-bootstrapper/releases/download/v1.29.0/Start-EKSBootstrap.ps1" -OutFile "C:/Start-EKSBootstrap.ps1"
              Move-Item -Path "C:/EKS-Windows-Bootstrapper.exe" -Destination "C:/Program Files/Amazon/EKS/EKS-Windows-Bootstrapper.exe" -Force -Verbose
              Move-Item -Path "C:/Start-EKSBootstrap.ps1" -Destination "C:/Program Files/Amazon/EKS/Start-EKSBootstrap.ps1" -Force -Verbose
The above component downloads the 1.29.0 release of the bootstrap app (I am using Kubernetes 1.29, so that’s what the code is targeting).
It replaces the existing C:/Program Files/Amazon/EKS/Start-EKSBootstrap.ps1 with another ps1 file that simply calls the exe, and places the exe next to it. When the node boots, the new code runs instead.
In the end, I measured the new code taking less than 400ms to configure a node, typically around 350ms. It still finishes by starting the scheduled task that calls EKS-StartupTask.ps1.
Booting this image takes around 93s on an m5a.xlarge and around 77s on an m5a.2xlarge. In a separate test, uninstalling Windows Defender to see its impact on boot performance brought node startup down to about 60s!
To improve boot times further, we must find a way to cut the time Windows takes to finish the ‘OOBE’ process beyond what EC2 fast launch allows. There is also around 40s that could be shaved off in EKS-StartupTask.ps1.
If anyone reading this has ideas on how to achieve this, it would be great to hear them!