We Moved 250 Microservices to Kubernetes With No Downtime

Written by augustlakia | Published 2022/11/16
Tech Story Tags: spring | microservices | kubernetes | storytelling | spring-boot | migration | microservices-to-kubernetes | kubernetes-infrastructure

TL;DR: August Vilakia is a Senior Software Engineer at AlfaBank working on the backend for a mobile application. He explains how the team moved from Mesos/Marathon to Kubernetes and how they organize parallel development and testing with Istio. The approach: move everything without downtime, then redirect traffic.

Hello, my name is August Vilakia.

I have been a Senior Software Engineer at AlfaBank for more than a year, working on the backend for a mobile application. So why am I talking about moving microservices from one container orchestration platform to another?

At AlfaBank, any developer who wants to do something besides business tasks is allowed to spend 20% of their time on infrastructure tasks.

Plan:

  1. Why did we decide to move?
  2. What problems did we face, and how did we solve them?
  3. How to organize parallel development and testing using Istio?

Our mobile application serves 5 million daily users and runs on more than 250 microservices.

Our entire cluster ran on Mesos. Mesos is a cluster manager that can host different kinds of workloads. The foundation of our architecture was Mesos masters and slaves orchestrated by Marathon. Conceptually it is close to Kubernetes: the applications run in containers either way, but the cluster is organized slightly differently.

The choice of where to move was obvious: everyone had heard of Kubernetes, and many had already moved to it ages ago.

Reasons why we decided to move:

  1. The last Mesos release was on December 12, 2019

  2. Kubernetes has become hugely popular and is now the de facto orchestration standard, which makes it easier to find developers who already know k8s.

  3. Move away from Netflix Eureka, which we used for service discovery.

  4. Remove code unrelated to business logic (tracing and the like).

  5. We needed a service mesh: a way to control how different parts of the application share data with one another.

In addition, we had problems with Mesos itself. One day, all services started timing out for no apparent reason.

We didn't know what to do!!!

Then we noticed that the CPU load was really high on some hosts.

Let's explain CPU load first. CPU load is the number of processes that are either running on the CPU or waiting to run on it. CPU load average is that number averaged over the past 1, 5, and 15 minutes.
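
For instance, on a 4-core host a 1-minute load average of 8.0 means roughly twice as many runnable processes as cores, so work is queueing. Purely as an illustration (this snippet is not from the article), the 1-minute load average is also visible from inside a JVM through the standard OperatingSystemMXBean:

```java
// Prints the 1-minute system load average as seen by this JVM.
// On Linux this mirrors the first value in /proc/loadavg; -1.0 means
// the value is not available on this platform.
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class LoadAverageCheck {
    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double oneMinuteLoad = os.getSystemLoadAverage();
        int cores = os.getAvailableProcessors();
        System.out.printf("load avg (1m) = %.2f on %d cores%n", oneMinuteLoad, cores);
    }
}
```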

It turned out that old hosts had been removed from the data center and new ones put in, and Mesos has no mechanism for rebalancing resources across hosts.

How does Kubernetes solve this?

  • It uses liveness and readiness probes plus the Scheduler to know the current state of each container.

  • Pod affinity and anti-affinity control how microservices are placed across hosts.

  • HorizontalPodAutoscaler checks the current load and scales accordingly. Descheduler, an additional component, evicts pods from nodes according to configurable policies, and the Scheduler then redistributes them. A minimal autoscaling sketch follows below.
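
As an illustration of the autoscaling part (the names are hypothetical, not a manifest from the article), a minimal HorizontalPodAutoscaler that keeps a deployment between 2 and 10 replicas at roughly 70% average CPU utilization looks like this:

```yaml
# Minimal sketch: scale the hypothetical "orders" deployment on CPU utilization.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: orders
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: orders
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```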

The next problem: when we deployed services, we got 503 errors even though all the health checks passed. It was as if someone had removed the routes, yet after 1-2 minutes everything was fine again. Perhaps our config was too big, or something was wrong with the EventBus that triggered route updates.

What were the challenges of moving to Kubernetes?

Service dependency hell: the services all depended on each other and effectively worked as one.

So how do you move without downtime? Stand up an additional cluster, move everything there, and then redirect traffic. We decided to split traffic using headers, load-balance between the clusters, and, as long as a service behaved well, keep increasing the share going to k8s. Once 100% of the traffic was on k8s, we disabled the service in Mesos.
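
The article doesn't name the component that actually performed the split, so the following is only a sketch of how a header-plus-weight split could be expressed, assuming Spring Cloud Gateway at the edge; the route names, hosts, and header are hypothetical. The k8s weight is then raised step by step until it reaches 100%.

```yaml
spring:
  cloud:
    gateway:
      routes:
        # Force a request into the new cluster with a header (handy for testing).
        - id: orders-k8s-forced
          uri: http://orders.prod.svc.cluster.local
          predicates:
            - Header=X-Route-To, k8s
            - Path=/orders/**
        # Otherwise split traffic 90/10 between the old and the new cluster.
        - id: orders-mesos
          uri: http://orders.mesos-legacy.example.internal
          predicates:
            - Path=/orders/**
            - Weight=orders, 90
        - id: orders-k8s
          uri: http://orders.prod.svc.cluster.local
          predicates:
            - Path=/orders/**
            - Weight=orders, 10
```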

Our clusters used different service discovery mechanisms, so Feign clients in one cluster couldn't find services in the other. We didn't want to keep per-orchestrator branches in each repository, and we decided against orchestrator-specific dependencies because they would require giving services access to the Kubernetes API.

We have a Spring starter for every Feign client, so we added a special conditional bean there that runs while the application context is starting. It checks every client and sets a URL if Eureka is turned off or the Eureka client is not on the classpath.

We extract every bean definition created by the factory bean, in our case ReactiveFeignClientFactoryBean, read its client URL property, and rewrite it with String.format() so it points at the Kubernetes service. Problem solved.
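
A minimal sketch of that idea is below. It assumes the Playtika reactive Feign starter; the factory-bean class name, the "name" and "url" properties, and the Kubernetes DNS template are assumptions for illustration rather than AlfaBank's actual code.

```java
// Sketch: rewrite reactive Feign client URLs when Eureka is disabled.
import org.springframework.beans.MutablePropertyValues;
import org.springframework.beans.factory.config.BeanDefinition;
import org.springframework.beans.factory.config.BeanFactoryPostProcessor;
import org.springframework.boot.autoconfigure.condition.ConditionalOnProperty;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class FeignUrlRewriteConfiguration {

    // Assumed class name of the reactive Feign factory bean; may differ by version.
    private static final String FACTORY_BEAN_CLASS =
            "reactivefeign.spring.config.ReactiveFeignClientFactoryBean";
    // Hypothetical template: address services by their Kubernetes DNS name.
    private static final String K8S_URL_TEMPLATE = "http://%s.prod.svc.cluster.local";

    // Only active when Eureka is explicitly disabled; the "Eureka not on the
    // classpath" case from the article could be handled with @ConditionalOnMissingClass.
    @Bean
    @ConditionalOnProperty(name = "eureka.client.enabled", havingValue = "false")
    public static BeanFactoryPostProcessor feignClientUrlRewriter() {
        return beanFactory -> {
            for (String beanName : beanFactory.getBeanDefinitionNames()) {
                BeanDefinition bd = beanFactory.getBeanDefinition(beanName);
                if (!FACTORY_BEAN_CLASS.equals(bd.getBeanClassName())) {
                    continue; // only touch reactive Feign client definitions
                }
                MutablePropertyValues props = bd.getPropertyValues();
                Object serviceName = props.get("name"); // the client's logical service name
                if (serviceName != null) {
                    // Point the client straight at the Kubernetes service instead of Eureka.
                    props.add("url", String.format(K8S_URL_TEMPLATE, serviceName));
                }
            }
        };
    }
}
```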

How to organize parallel development and testing using Istio?

Problems appear when a lot of teams are developing at once and some of them change similar functionality: they create a shared branch, wait for the other team to finish, and QA can't test properly either. Multiple dev/test environments are expensive, so what can you do? Istio traffic management helped us.

Istio is a service mesh that sits on top of business services and takes care of encryption, tracing, and canary deployments. Essentially, every business service gets an additional proxy (sidecar) that intercepts its traffic and does something with it. Routing is configured with a VirtualService, which describes the traffic properties of the corresponding hosts, including multiple HTTP and TCP ports, and defines where to send traffic, and with DestinationRules, which define how to split it.
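
For example, with hypothetical service, subset, and header names, a DestinationRule can declare a stable subset and a feature subset, and a VirtualService can route requests that carry a team's header to the feature build while everyone else stays on stable:

```yaml
# Sketch: header-based routing for parallel feature testing (names are hypothetical).
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payments
spec:
  host: payments
  subsets:
    - name: stable
      labels:
        version: stable
    - name: feature-x
      labels:
        version: feature-x
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payments
spec:
  hosts:
    - payments
  http:
    - match:
        - headers:
            x-feature:
              exact: feature-x
      route:
        - destination:
            host: payments
            subset: feature-x
    - route:
        - destination:
            host: payments
            subset: stable
```

With a setup like this, a team can deploy its build under its own version label and send its header from test tooling, while sharing the rest of the environment with everyone else.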

Everyone is happy, and everything works. Comparing before and after: we went from Mesos/Marathon to an infrastructure stack of Kubernetes, Istio, and Fluent Bit.


Written by augustlakia | Software Engineer at AlfaBank. A self-taught developer with an interest in Startups and Ultralearning.
Published by HackerNoon on 2022/11/16