How I Reduced EKS Windows Node Start Time From 5 Min to ~90s (by @x86potato)

by Omorr Faruk, June 8th, 2024

Too Long; Didn't Read

Learn how to accelerate AWS EKS Windows node startup times from 5 minutes to under 90 seconds using Karpenter, optimized PowerShell scripts, and pre-cached images for faster deployment and improved performance.


AWS EKS allows you to host Windows and Linux containers in the same cluster. Linux nodes, however, start much faster: they take only about 30s to become available for scheduling pods, compared to the 5 or so minutes it takes to start a Windows node on an m5a.xlarge (4 vCPUs, 16GB RAM). I’ll share some tricks I used to start Windows nodes in around 90s with Karpenter.

Background

Anyone looking for ways to improve the startup performance of EKS Windows nodes will have come across this article on the AWS blog:


Broadly speaking, two options are advised to improve the slow boot issue:


Pre-cache base images using AWS Image Builder

A common use case for EKS Windows nodes is hosting legacy .NET Framework applications under IIS. The problem is that the IIS container image takes a very long time to download (around 10 minutes), which makes autoscaling a no-go. Pre-caching base images eliminates the time taken to ‘download IIS’: the IIS layers that make up your Docker images are already baked into the AMI, so only your application files are downloaded at node startup.

Configure the images produced by Image Builder to use ‘Fast Launch’

Before a pod can even run on this node, the node needs to boot. AWS Image Builder runs Sysprep on Windows images when it produces AMIs, so the first time a new Windows EC2 instance launches from one, it has to go through a long initialization process.


Every Amazon EC2 Windows instance must go through the standard Windows operating system (OS) launch steps, which include several reboots, and often take 15 minutes or longer to complete. Amazon EC2 Windows Server AMIs that have the Windows fast launch feature enabled complete some of those steps and reboots in advance to reduce the time it takes to launch an instance.

source.


By enabling fast launch, we can see freshly started EC2 instances take around 50–60s to ‘reach the lock screen’ on a Windows Server 2022 m5a.xlarge.


Even with these strategies employed, it still takes 5 minutes for an m5a.xlarge Windows node to start and join the cluster.



Where is time spent starting up?

When Karpenter requests a new Windows Server 2022 EC2 instance, it uses the EKS-optimized AMIs provided by AWS. This AMI contains a couple of PowerShell scripts that bootstrap the Kubernetes components.

  • Start-EKSBootstrap.ps1
  • EKS-StartupTask.ps1


Digging around, it appears there are 3 parts to the EKS Windows node boot process:

  • Start the EC2 Instance (50s with fast launch enabled)

  • Start-EKSBootstrap.ps1 (4–6 mins depending on instance size)

  • EKS-StartupTask.ps1 (40 sec)


So the longest task is the bootstrap script. This script creates config files for kube-proxy, the kubelet, and the CNI plugin, and registers the kubelet and kube-proxy with the Windows Service Control Manager (SCM). It then triggers EKS-StartupTask.ps1 via a scheduled task, which goes on to set up container networking and start the Kubernetes services.


Looking through the Start-EKSBootstrap.ps1 file (4–6 min runtime), it appears that most of the time is spent loading the large AWSPowerShell module, which the script requires to get EC2 instance metadata and EKS cluster information. Investigating the impact of antivirus on this script showed that Windows Defender was slowing down the loading of this module.
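A rough way to confirm where the time goes is to time the module import on its own. The following is a minimal sketch, not what the bootstrap script itself does; the function name `Measure-ModuleImport` is made up for illustration.

```powershell
# Sketch: time how long a module takes to import.
# Measure-ModuleImport is a hypothetical helper name, not part of any AWS tooling.
function Measure-ModuleImport {
    param(
        [Parameter(Mandatory)]
        [string]$Name
    )

    # -Force re-imports the module even if it is already loaded in this session.
    $elapsed = Measure-Command {
        Import-Module -Name $Name -Force -ErrorAction Stop
    }

    [pscustomobject]@{
        Module  = $Name
        Seconds = [math]::Round($elapsed.TotalSeconds, 2)
    }
}

# Example (run on the node, in a fresh PowerShell process):
# Measure-ModuleImport -Name AWSPowerShell
```

The first measurement in a new process is the one that matters, because .NET assemblies that are already loaded are not loaded again on a re-import.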

How do we make it better?

The first thing I could do was completely uninstall the AWSPowerShell module from the EC2 instance.

I looked through the scripts and determined which PowerShell cmdlets were in use to narrow down the modules needed by the bootstrap process:

  • AWS.Tools.EC2
  • AWS.Tools.EKS
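An audit like this can also be done with the PowerShell parser rather than by reading the scripts by hand. The sketch below lists every command a script invokes and the module that provides it on the current machine. `Get-ScriptCommandModule` is a hypothetical helper name, and this is not necessarily how the list above was produced.

```powershell
# Sketch: list the commands a script calls and the module each one resolves to.
# Get-ScriptCommandModule is a hypothetical helper name.
function Get-ScriptCommandModule {
    param(
        [Parameter(Mandatory)]
        [string]$Path
    )

    $tokens = $null
    $errors = $null
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath
    $ast = [System.Management.Automation.Language.Parser]::ParseFile(
        $fullPath, [ref]$tokens, [ref]$errors)

    # Find every command invocation, including those nested inside functions.
    $commandAsts = $ast.FindAll(
        { param($node) $node -is [System.Management.Automation.Language.CommandAst] },
        $true)

    $commandAsts |
        ForEach-Object { $_.GetCommandName() } |
        Where-Object { $_ } |
        Sort-Object -Unique |
        ForEach-Object {
            # Resolves only commands that are available on this machine.
            $info = Get-Command -Name $_ -ErrorAction SilentlyContinue |
                Select-Object -First 1
            [pscustomobject]@{
                Command = $_
                Module  = if ($info) { $info.ModuleName } else { $null }
            }
        }
}

# Example:
# Get-ScriptCommandModule -Path 'C:/Program Files/Amazon/EKS/Start-EKSBootstrap.ps1'
```

Commands that show an empty `Module` are either functions defined in the script itself or not installed on the machine running the audit.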


The effect of removing the AWSPowerShell module is that the bootstrap script no longer has to load every piece of functionality AWS offers via PowerShell just to get a few bits of information about EC2 and EKS.


This was easy to slot in because we were already using Image Builder to pre-cache container images. I created a new Image Builder component with the following content:

name: OptimizeEKSNodeModules
description: OptimizeEKSNodeModules
schemaVersion: 1.0

phases:
  - name: build
    steps:
      - name: OptimizeEKSNodeModules
        action: ExecutePowerShell
        inputs:
          commands:
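            # Remove the monolithic module so the bootstrap script no longer loads it
            - Uninstall-Module AWSPowershell
            # Install-Module needs the NuGet package provider
            - Install-PackageProvider -Name NuGet -MinimumVersion 2.8.5.201 -Force
            # Install only the service-specific modules the bootstrap scripts use
            - Install-Module -Name Aws.Tools.Eks -Force -AllowClobber -Scope AllUsers
            - Install-Module -Name Aws.Tools.Ec2 -Force -AllowClobber -Scope AllUsers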

This had a good impact on boot time, which dropped to about 3–4 minutes.
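Before relying on the new image, it can be worth checking that the module layout is what the component intended. A minimal sketch follows, with `Get-ModuleInstallState` as a hypothetical helper name. The expectation is that AWSPowerShell reports as not installed and the two AWS.Tools modules report as installed.

```powershell
# Sketch: report whether each named module is discoverable on this machine.
# Get-ModuleInstallState is a hypothetical helper name.
function Get-ModuleInstallState {
    param(
        [Parameter(Mandatory)]
        [string[]]$Name
    )

    foreach ($moduleName in $Name) {
        # -ListAvailable searches the module paths without importing anything.
        $found = Get-Module -ListAvailable -Name $moduleName
        [pscustomobject]@{
            Module    = $moduleName
            Installed = [bool]$found
        }
    }
}

# Example:
# Get-ModuleInstallState -Name 'AWSPowerShell', 'AWS.Tools.EC2', 'AWS.Tools.EKS'
```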


This could still be improved…


However, the PowerShell scripts had limited avenues for optimization. In the end, I decided to rewrite the bootstrap script in C#/.NET, where I had a bit more control.


The atg-cloudops/eks-windows-bootstrapper repo on GitHub contains the code for the C# version of Start-EKSBootstrap.ps1. It parallelizes as much as possible and is AOT (ahead-of-time) compiled to native code in a GitHub Action to reduce the startup latency of the release binary. It also removes all the checks that made the script idempotent, as these are unnecessary under Karpenter and just slow down the boot.


To use it, you can add another component to your Image Builder recipe:

name: InstallEksWindowsBootstrapper
description: InstallEksWindowsBootstrapper
schemaVersion: 1.0

phases:
  - name: build
    steps:
      - name: InstallEksWindowsBootstrapper
        action: ExecutePowerShell
        inputs:
          commands:
            - |
              Invoke-WebRequest "https://github.com/atg-cloudops/eks-windows-bootstrapper/releases/download/v1.29.0/EKS-Windows-Bootstrapper.exe" -OutFile "C:/EKS-Windows-Bootstrapper.exe"
              Invoke-WebRequest "https://github.com/atg-cloudops/eks-windows-bootstrapper/releases/download/v1.29.0/Start-EKSBootstrap.ps1" -OutFile "C:/Start-EKSBootstrap.ps1"

              Move-Item -Path "C:/EKS-Windows-Bootstrapper.exe" -Destination "C:/Program Files/Amazon/EKS/EKS-Windows-Bootstrapper.exe" -Force -Verbose
              Move-Item -Path "C:/Start-EKSBootstrap.ps1" -Destination "C:/Program Files/Amazon/EKS/Start-EKSBootstrap.ps1" -Force -Verbose

The above component downloads the 1.29.0 release of the bootstrap app (I am using Kubernetes 1.29, so that’s what the code targets).

It replaces the existing C:/Program Files/Amazon/EKS/Start-EKSBootstrap.ps1 with a ps1 file that simply calls the exe, and places the exe next to that ps1 file. When the node boots, the new code runs instead.
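To illustrate the wrapper idea, here is a sketch of a ps1 shim that hands its arguments to a native executable and reports its exit code. It is an assumption about the general shape only: the real Start-EKSBootstrap.ps1 shipped in the release may pass arguments differently, and `Invoke-Bootstrapper` is a hypothetical name.

```powershell
# Sketch: thin wrapper that delegates to a native executable.
# Invoke-Bootstrapper is a hypothetical name; the shipped wrapper may differ.
function Invoke-Bootstrapper {
    param(
        [Parameter(Mandatory)]
        [string]$ExePath,

        [string[]]$Arguments = @()
    )

    if (-not (Test-Path -LiteralPath $ExePath -PathType Leaf)) {
        throw "Bootstrapper not found at '$ExePath'"
    }

    # Run the binary with the arguments passed through unchanged.
    # Out-Host keeps its output out of this function's return value.
    & $ExePath @Arguments | Out-Host

    # Hand the exit code back so the caller can react to failures.
    return $LASTEXITCODE
}

# In a wrapper script sitting next to the exe, the call could look like:
# $exe = Join-Path $PSScriptRoot 'EKS-Windows-Bootstrapper.exe'
# exit (Invoke-Bootstrapper -ExePath $exe -Arguments $args)
```

Keeping the original file name for the ps1 means whatever invokes Start-EKSBootstrap.ps1 at boot does not need to change.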

In the end, I measured the new code taking less than 400ms to configure a node, typically around 350ms. It still finishes by starting the scheduled task that calls EKS-StartupTask.ps1.



The Final Result

Booting this image takes around 93s on an m5a.xlarge and around 77s on an m5a.2xlarge. In a separate test to see the impact on boot performance, I uninstalled Windows Defender, and the nodes started in about 60s!


If we wish to improve boot times further, we must find a way to cut down the time Windows takes to finish the ‘OOBE’ process beyond what EC2 fast launch allows. EKS-StartupTask.ps1 also still accounts for around 40s that could be trimmed.


If anyone reading this has ideas on how to achieve this, it would be great to hear them!