How to Optimize Kubernetes for Large Docker Images

A brief overview of the problem

One day, during a planned update of the k8s cluster, we discovered that almost all of our PODs (approximately 500 out of 1,000) on new nodes were unable to start, and the minutes quickly turned into hours. We are actively searched for the root cause, but after three hours, the PODS were still in the ContainerCreating status.

Thankfully, this was not the prod environment and the maintenance window was scheduled for the weekend. We had time to investigate the issue without any pressure.

Where should you begin your search for the root cause? Would you like to learn more about the solution we found? Strap in and enjoy!

An original CI/CD design

The main idea of the solution is to warm up the nodes before the CD process starts by the largest part of the docker image (JS dependencies layer), which uses as the root image for all our apps. We have at least 7 types of the root images with the JS dependencies, which are related to the type of the app. So, let’s analyze the original CI/CD design.

In our CI/CD pipeline, we have 3 pillars:

An original CI/CD pipeline:

On the Init it step: we prepare the environment/variables, define the set of images to rebuild, etc...
On the Build step: we build the images and push them to the ECR
On the Deploy step: we deploy the images to the k8s (update deployments, etc...)

More details about the original CICD design:

Our feature branches (FB) forked from the main branch. In the CI process, we always analyze the set of images which were changed in the FB and rebuild them. The main branch is always stable, as the definition, there is should be always the latest version of the base images.
We separately build the JS dependencies docker images (for each environment) and push it to the ECR to reuse it as the root(base) image in the Dockerfile. We have around 5–10 types of the JS dependencies docker image.
The FB is deployed to the k8s cluster to the separate namespace, but to the common nodes for the FB. The FB can have ~200 apps, with the image size up to 3 GB.
We have the cluster autoscaling system, which scales the nodes in the cluster based on the load or pending PODS with the according nodeSelector and toleration.
We use the spot instances for the nodes.

Implementation of the warm-up process

There are requirements for the warm-up process.

Mandatory:

Issue Resolution: Addresses and resolves ContainerCreating issues.
Improved Performance: Significantly reduces startup time by utilizing preheated base images(JS dependencies).

Nice to have improvements:

Flexibility: Allows for easy changes to the node type and its lifespan (e.g., high SLA or extended time to live).
Transparency: Provides clear metrics on usage and performance.
Cost Efficiency: Saves costs by deleting the VNG immediately after the associated feature branch is deleted.
Isolation: This approach ensures that other environments are not affected.

Solution

After analyzing requirements and constraints, we decided to implement a warm-up process that would preheat the nodes with the base JS cache images. This process would be triggered before the CD process starts, ensuring that the nodes are ready for the deployment of the FB, and we have a max chance to hit the cache.

This improvement we split into tree big steps:

Create the set of the nodes (Virtual Node Group) per each FB
Add base images to the cloud-init script for the new nodes
Add a pre-deploy step to run the DaemonSet with the initContainers section to download the necessary docker images to the nodes before the CD process starts.

An updated CI/CD pipeline would look like this:

An updated CI/CD pipeline:

Initstep
1.1.(new step)Init deploy: If it’s a first start of the FB, then create a new personal set of the node instances (in our terms it’s Virtual Node Group or VNG) and download all JS base images (5–10 images) from the main branch. It’s fair enough to do it, because we forked the FB from the main branch. An important point, it’s not a blocking operation.
Build step
Pre deploystep Download fresh baked JS base images with the specific FB tag from the ECR.
3.1.(new step)Important points: It’s a blocking operation, because we should reduce the disk pressure. One by one, we download the base images for each related node.
Btw, thanks for the “init deploy” step, we already have the base docker images from the main branch, that is give us a big chance to hit the cache on the first start.
**Deploy
**There are no changes in this step. But thanks to the previous step, we already have all heavy docker image layers on the necessary nodes.

Init deploy step

Create a new set of the nodes for each FB via API call (to the 3rd party autoscaling system) from our CI pipeline.

Resolved issues:

Isolation: Each FB has its own set of nodes, ensuring that the environment is not affected by other FBs.
Flexibility: We can easily change the node type and its lifespan.
Cost Efficiency: We can delete the nodes immediately after the FB is deleted.
Transparency: We can easily track the usage and performance of the nodes (each node has a tag related to the FB).
Effective usage of the spot instances: The spot instance is starting with already predefined base images, that means, after the spot node starts, there already the base images on the node (from the main branch).

Download all JS base images from the main branch to the new nodes via cloud-initscript.

While the images are being downloaded in the background, the CD process can continue to build new images without any issues. Moreover, the next nodes (which will be created by autoscaling system) from this group will be created with the updatedcloud-init data, which already have instructions to download images before start.

Resolved issues:

Issue Resolution: Disk pressure is gone, because we updated the cloud-init script by adding the download of the base images from the main branch. This allows us to hit the cache on the first start of the FB.
Effective usage of the spot instances: The spot instance is starting with updated cloud-init data. It means, that after the spot node starts, there already the base images on the node (from the main branch).
Improved Performance: The CD process can continue to build new images without any issues.

This action added ~17 seconds(API call) to our CI/CD pipeline.

This action makes sense only the first time when we start the FB. Next time, we deploy our apps to already existing nodes, which already have the base images, which we delivered on the previous deployment.

Pre deploy step

We need this step, because the FB images are different from the main branch images. We need to download the FB base images to the nodes before the CD process starts. This will help to mitigate the extended cold start times and high disk utilization that can occur when multiple heavy images are pulled simultaneously.

Objectives of the Pre-Deploy Step

Prevent Disk Pressure: Sequentially download docker most heavy images. After the init-deploy step, we already have the base images on the nodes, which means we have a big chance to the hit cache.
Improve Deployment Efficiency: Ensure nodes are preheated with essential docker images, leading to faster(almost immediately) POD startup times.
Enhance Stability: Minimize the chances of encountering ErrImagePull/ContainerCreating errors and ensure system daemon sets remain in a “ready” state.

Under this step, we add 10–15 minutes to the CD process.

Pre deploy step Details:

In the CD we create a DaemonSet with the initContainers section.
The initContainers section is executed before the main container starts, ensuring that the necessary images are downloaded before the main container starts.
In the CD we are continuously checking the status of the daemonSet. If the daemonSet is in a “ready” state, we proceed with the deployment. Otherwise, we wait for the daemonSet to be ready.

Comparison

Comparison of the original and updated steps with the preheat process.

Step	Init deploy step	Pre deploy step	Deploy	Total time	Diff
Without preheat	0	0	11m 21s	11m 21s	0
With preheat	8 seconds	58 seconds	25 seconds	1m 31s	-9m 50s

The main thing, the “Deploy” time changed (from the first apply command to the Running state of the pods) from 11m 21s to 25 seconds. The total time changed from 11m 21s to 1m 31s.

An important point, if there are no base images from the main branch, then the “Deploy” time will be the same as the original time or a little bit more. But anyway, we resolved an issue with the disk pressure and the cold start time.

Conclusion

The main issue ContainerCreatingwas resolved by the warm-up process. As a benefit, we significantly reduced the cold start time of the PODs.
The disk pressure was gone, because we already have the base images on the nodes. The system daemonSets are in a "ready" and “healthy”state (because there is no disk pressure), and we have not encountered anyErrImagePull errors related to this issue.

Possible solutions and links

Use on demand instances for the nodes instead of the spot instances
We can’t use this way, because it’s out of the scope of our budget for non-production environments.
Use the Amazon EBS gp3(or better) volume type with the increased IOPS
We can’t use this way, because this feature also outs for the scope of our budget for non-production environments. Moreover, AWS has thelimits of th IOPS for your account per region.
Reduce container startup time on Amazon EKS with Bottlerocket data volume
Actually we can’t move this way, because it’s too high impact on the production and other environments, but it’s also a good solution to our issue.
Troubleshoot Kubernetes Cluster Autoscaler takes 1 hr to scale 600 pods up

P.S: I’d like to give a shout-out to the great technical team at Justt (https://www.linkedin.com/company/justt-ai)for the tireless work and really creative approach to any issue they are faced with. In particular, a shout-out to Ronny Sharaby, the superb lead who is responsible for the great work that the team is doing. I am looking forward to seeing more and more great examples of how your creativity impacts the Justt product.

How to Optimize Kubernetes for Large Docker Images

Too Long; Didn't Read

A brief overview of the problem

More details about the problem

Issues Faced

Hotfix to recovery the cluster

An original CI/CD design

Implementation of the warm-up process

Solution

An updated CI/CD pipeline would look like this:

An updated CI/CD pipeline:

Init deploy step

Pre deploy step

Comparison

Conclusion

Possible solutions and links

About Author

TOPICS

THIS ARTICLE WAS FEATURED IN...

Trending Topics

Classic

Neon Noir

Minty

Newspaper

HN StartUps

How to Optimize Kubernetes for Large Docker Images

Too Long; Didn't Read

A brief overview of the problem

More details about the problem

Issues Faced

Hotfix to recovery the cluster

An original CI/CD design

Implementation of the warm-up process

Solution

An updated CI/CD pipeline:

Init deploy step

Pre deploy step

Comparison

Possible solutions and links

About Author

TOPICS

THIS ARTICLE WAS FEATURED IN...

RELATED STORIES

Trending Topics

Classic

Neon Noir

Minty

Newspaper

HN StartUps