Sometimes taking a new job means taking on a lot of unexpected trouble with it. That happened to me once. What promised to be tough but mostly hassle-free work nearly turned into a Mission Impossible task. The good news is that even humble humans can sometimes keep the sky from falling, Atlas-style. I want to share my story to show that ultimately no task is impossible. You don’t have to be James Bond himself: if I did it, you can probably do it as well.
“So, 007... Lots to be done”
Like many good thriller movies, this episode of my career started peacefully, without any suspense. In June 2021 I was hunted down by a big company that had grown out of a mobile telecom provider. Over time it had turned into a nationwide tech major by building a diverse digital ecosystem.
One of the ecosystem’s pillars was enterprise cloud services. The company had already had a strong IaaS and data center arm for a while. Additionally, about two years before my arrival it had acquired the market’s oldest virtualization provider for over 30 million euros. The next step was to gather all cloud assets on a new platform built around a ‘unified provider’ concept. The vision was that any business could get every imaginable cloud service from a single hub.
The new platform was being built by a remarkable team, led by the same guy who hired me (sticking to the 007 tradition, let’s call him M). He was an avid promoter of horizontal dev structures and ‘bubble escalation’ principles. That basically meant that in his team there were no bosses and all issues were supposed to be resolved by consensus.
If a problem was complex enough to expand beyond one product or team cell, owners of the products or teams involved were engaged. If no effective solution was found, the problem would be escalated until a common link for the conflicting products was found in the management chain.
So the team was good and progressive, and the company was big and getting even bigger. All of that promised a comfortable experience with decent prospects. My role as a Lead Engineer was to identify services that were unique to the recently acquired provider and lead their migration to the new hyperscaler cloud hub. The task was big and challenging but looked quite manageable.
Well, at least it did until hell broke loose…
My onboarding paperwork was not even completed yet, and I hadn’t visited the quartermaster to claim my duty PPK (oh, I mean my corporate laptop), when M summoned his entire team for an emergency meeting. He had bad news and good news. The bad news was that the legacy platform’s dev lead was leaving the company all of a sudden. And so were most of his backend and frontend developers. We had just two or three weeks to transfer all the knowledge about the platform to our team. The good news was that this task was assigned to me. Or was it really the good part?
“The whole office goes up in smoke and that bloody thing survives”
Well, that bloody thing had to survive; there were no other options. The legacy platform generated hefty revenue servicing over 2,500 paying corporate users and operated half a dozen data centers. I had to keep that system working, hastily hire, onboard, and train the missing personnel, AND also plan and execute the initial migration task. What assets did I have on my side? Well, I finally had my laptop. I also had all that was left of the legacy dev team: a junior developer recently promoted and retrained from a tester role; a brand-new outsourced frontend developer; and one more outsourced developer who had been working on the project for four years and later proved to be one of our saviors. Luckily, the turmoil did not affect the support, virtualization, and network teams.
My biggest nightmare was losing the knowledge and experience accumulated by the legacy team over the previous eight years. That could easily result in downtime, breed inefficiency, scare off clients, and dry up the cash flow. We wobbled on the edge of disaster for about four months.
By the end of summer, the product owner was gone and we had a new one who had previously worked in marketing. As if the situation were not difficult enough, we were also tasked with supporting the manual sales billing system. Being critically understaffed, we now had to dedicate a specialist to that job. Finally, in autumn, after countless interviews, we managed to forge a new team by hiring three DevOps specialists and five backend developers.
Just as the new team was assembled, the first trial by fire arrived: our platform came under attack. It took us one sleepless night to figure out what was going on. Code errors, server overload, and malware were ruled out one by one, but the problem persisted. Finally, we discovered a plethora of half-open connections to our IIS servers, a telltale sign of a DDoS attack.
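We did the original digging with ad hoc server-side tooling that I no longer have at hand, but the core check is easy to sketch. Here is a minimal C# illustration of the kind of signal we were looking at: a pile-up of half-open TCP connections. The alert threshold below is a made-up number for the example, not what we actually used.

```csharp
using System;
using System.Linq;
using System.Net.NetworkInformation;

// Minimal sketch: count half-open (SYN_RECEIVED) TCP connections on the host.
// A sudden spike in this number, with requests never completing, is a classic
// symptom of a SYN-flood style DDoS. The threshold is purely illustrative.
class HalfOpenConnectionCheck
{
    const int AlertThreshold = 500; // hypothetical value, tune for your servers

    static void Main()
    {
        var connections = IPGlobalProperties.GetIPGlobalProperties()
                                            .GetActiveTcpConnections();

        int halfOpen = connections.Count(c => c.State == TcpState.SynReceived);
        int established = connections.Count(c => c.State == TcpState.Established);

        Console.WriteLine($"Half-open: {halfOpen}, established: {established}");

        if (halfOpen > AlertThreshold)
            Console.WriteLine("Possible SYN flood: half-open connections exceed the threshold.");
    }
}
```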
The solution was simple: we put the platform behind a cloud-based protection service, which was actually one of our own products. That sentinel still stands, although it is not perfect. Once, as we altered the platform’s framework and server responses, it mistook our entire traffic for DDoS activity and tried to block it. A single call to tech support was enough, though: counters were reset, the cache was cleared, and the problem was solved.
“- Well, everybody needs a hobby.
- So what's yours?
- Resurrection.”
Dealing with legacy systems often means mastering and replacing outdated technology. This was exactly the case here. The eight-year-old platform relied heavily on older versions of .NET, and it was a pain to source experts in that technology to document the platform’s key flows, such as billing and server creation.
Luckily, I got help from two colleagues from the parent company. With their help, the platform was migrated bit by bit to a newer .NET framework version and a service-oriented architecture. Next came the user password generation algorithm, which was far from modern cryptographic standards. We replaced it with a more robust one and recommended that users switch to SSH keys.
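I won’t reproduce the old algorithm here, and the exact replacement stays behind closed doors, but the gist is simple: generate credentials from a cryptographically secure source rather than a predictable one. A minimal C# sketch along those lines (the alphabet and length are my illustrative assumptions, not the platform’s real policy):

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Minimal sketch of a cryptographically secure password generator.
// RandomNumberGenerator.GetInt32 draws from the OS CSPRNG without modulo bias,
// unlike the predictable System.Random. The alphabet and length are
// illustrative assumptions, not the platform's real policy.
class Program
{
    const string Alphabet =
        "abcdefghijkmnopqrstuvwxyzABCDEFGHJKLMNPQRSTUVWXYZ23456789!@#$%";

    static string GeneratePassword(int length = 20)
    {
        var sb = new StringBuilder(length);
        for (int i = 0; i < length; i++)
        {
            // GetInt32(max) returns a uniformly distributed value in [0, max).
            sb.Append(Alphabet[RandomNumberGenerator.GetInt32(Alphabet.Length)]);
        }
        return sb.ToString();
    }

    static void Main() => Console.WriteLine(GeneratePassword());
}
```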
The legacy platform also had a monolithic architecture that had become a bottleneck for performance and scalability. To tackle this, we broke it down into microservices, each with its own database and caching, improving speed and flexibility.
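To give a feel for the target shape, here is a tiny, hypothetical slice of such a service: an ASP.NET Core minimal-API endpoint with its own in-memory cache. The route, cache key, and 30-second TTL are illustrative assumptions, and it presumes a .NET 6+ web project with implicit usings enabled.

```csharp
using System;
using Microsoft.Extensions.Caching.Memory;

// Sketch of one small service owning its own cache, as opposed to the old
// monolith where everything shared a single database and code base.
var builder = WebApplication.CreateBuilder(args);
builder.Services.AddMemoryCache();
var app = builder.Build();

app.MapGet("/servers/{id}", (string id, IMemoryCache cache) =>
{
    // Each service caches its own hot data instead of hammering a shared DB.
    var status = cache.GetOrCreate($"server-status:{id}", entry =>
    {
        entry.AbsoluteExpirationRelativeToNow = TimeSpan.FromSeconds(30);
        return $"Status for server {id} loaded at {DateTime.UtcNow:O}";
    });
    return Results.Ok(status);
});

app.Run();
```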
Creating a new client server took about 15 minutes, which was way too long. To cut this down to 3-5 minutes, we created a pool of preconfigured machines that could be adapted to customer needs on the fly. We reworked the deployment pipelines. We transitioned from an IP-based to a DNS-based infrastructure. We discovered a bunch of undocumented processes. Finally, after six months of 24/7 work, we settled on a new technical infrastructure by early 2022.
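The real provisioning logic lives in the platform’s orchestration layer, but the warm-pool idea itself is easy to sketch. Here is a hypothetical C# illustration; the pool size, types, and customization step are all made up for the example.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Demo usage of the hypothetical warm pool described below.
var pool = new VmPool();
await pool.ReplenishAsync();
var vm = await pool.AcquireForCustomerAsync("example-config");
Console.WriteLine($"Handed over VM {vm.Id} based on {vm.BaseImage}");

// Sketch of the warm-pool idea: keep a queue of pre-provisioned, generic VMs
// so a customer order only needs a quick customization pass (minutes) instead
// of a full build from scratch (tens of minutes). Names, pool size, and
// timings are hypothetical, not the platform's real code.
record PooledVm(Guid Id, string BaseImage);

class VmPool
{
    private readonly ConcurrentQueue<PooledVm> _pool = new();
    private const int TargetPoolSize = 10; // illustrative target

    // Background job keeps the pool topped up with generic machines.
    public async Task ReplenishAsync()
    {
        while (_pool.Count < TargetPoolSize)
        {
            var vm = await ProvisionGenericVmAsync(); // the slow, from-scratch path
            _pool.Enqueue(vm);
        }
    }

    // Customer-facing path: grab a ready machine and only apply the
    // customer-specific configuration, which is the fast part.
    public async Task<PooledVm> AcquireForCustomerAsync(string customerConfig)
    {
        if (!_pool.TryDequeue(out var vm))
            vm = await ProvisionGenericVmAsync(); // fallback if the pool is empty

        await CustomizeAsync(vm, customerConfig); // resize, network, credentials...
        _ = ReplenishAsync();                     // backfill the pool in the background
        return vm;
    }

    private static Task<PooledVm> ProvisionGenericVmAsync() =>
        Task.FromResult(new PooledVm(Guid.NewGuid(), "base-image"));

    private static Task CustomizeAsync(PooledVm vm, string customerConfig) =>
        Task.CompletedTask; // placeholder for the customer-specific steps
}
```

The trade-off is that the pool consumes some idle capacity, but for us the drop from roughly 15 minutes to 3-5 was well worth it.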
In the end, what seemed like a bust in June 2021 turned into a success story in 2022.
Despite being understaffed, outdated, and in continuous transition, the platform maintained 99.96% SLA adherence. Our client base grew. The transition to the new framework cut infrastructure costs by a whopping 50%. Our revenue increased by 126% year-on-year in 2021 and by a further 102% in 2022, reaching about 2.8 million euros. What’s more, our platform was ranked second in the national IaaS Enterprise rating in 2021, our most difficult year. Two years later we were ranked #1.
“Hold your breath and count to ten”
The major lesson I learned from this experience is that you only grow when you do things you thought you were not qualified to do. Most miracles are done by holding your breath and doing the job.
Documentation is essential, especially for long-lived projects. But keep it reasonable: too much paperwork is hard to maintain, so parts of it will inevitably go stale. It is crucial to document all business requirements, every bit of business logic, deployment sequences, and solution descriptions. That way you (or your successors) will be spared the nightmare of reverse engineering an uncharted product in search of logic and reason.
Horizontal, no-hierarchy teams and consensus-based decision-making can be an advantage. However, it is crucial for people to listen to their colleagues and communicate with them. Toxicity should be off-limits.
Task distribution is an art, because you have to strike a fine balance between specialization and the risk of a low ‘bus factor’.
You should always have a testing environment, complete with the underlying DevOps tooling and reversible deployment infrastructure. Testing in near-real conditions will save you many hours and a lot of nerves, especially when it comes to integrations and moving to a newer framework version.
Stay connected with the business stakeholders. They deserve explanations of what is happening and why. Do your best to speak their language and, most importantly, to understand their goals.
Thanks for reading and good luck!
The lead image for this article was generated by HackerNoon's AI Image Generator via the prompt "surviving skyfall"