How We Migrated to the Cloud... of Ashes (Wallarm OVH Recovery in 2021)

Written by d0znpp | Published 2024/04/23
Tech Story Tags: disaster-recovery | wallarm | lessons-learned | ovh | cloud-migration | data-loss | recovering-your-data | ovh-data-center

TL;DR: The OVH data center in Strasbourg, France, burned down on March 10, 2021. Wallarm, the company behind the Wallarm API and application firewall, was forced to migrate to the cloud. The company had three clouds; the one that burned was the oldest infrastructure-wise. The migration project is now complete.

DISCLAIMER: This story happened in 2021, but I have only now found the time and emotional energy to write it down. Please enjoy the read!

Hi Ivan! I have some good news and bad news, as they say. It looks like we'll finish our migration to the cloud today.

It's been almost a year, and I've finally gathered the courage to write down the whole story, without embellishing or omitting any details. I hope this truth will help readers do better and learn from our mistakes.

7 PM in California is when Europe is still asleep, and our work here is done, allowing us some rest. That wasn't the case on March 10th. Honestly, I barely remember what March looked like after the 10th. And April too. Time seemed to speed up that spring and only slowed down by summer.

We migrated to a cloud of ash. The cloud migration project is now complete!

On March 10th (00:47 in France, still the evening of March 9th in California), the OVH data center in Strasbourg burned down. Technically two data centers, if you can call those container units separate buildings. Before this, I only knew Strasbourg as a border city in France with a human rights court. That day, I also learned that our servers were there—and that they were burning.

At that point, I had these myths in my mind:

  1. The fire will be put out soon.

  2. Our product (firewall) will continue working; clients just won't be able to access their accounts.

  3. We have backups; we'll recover everything.

None of these statements turned out to be entirely true. But to make it interesting and understandable, let's briefly introduce Wallarm:

  • We sell an API and application firewall that functions as an Ingress/Nginx/Envoy module, analyzes traffic, and stores the results (malicious requests) in the cloud.

  • Clients access the cloud UI to review results and receive alerts via API integrations with PagerDuty, Slack, etc.

  • We have three clouds; the one in Europe that burned at OVH was the oldest infrastructure-wise.

Now, let's address the myths one by one.

Myth #1: The Fire Will Be Put Out Soon

The first message said that a fire had started in one of the rooms of SBG2, nothing alarming. But within a couple of hours, it was clear things were bad, and the data center was lost—burned down completely. The fire also spread to the neighboring container, SBG1, and half of it burned down as well.

Here's a tweet from OVH's CEO Octave Klaba:

"We have a major incident on SBG2. The fire declared in the building. Firefighters were immediately on the scene but could not control the fire in SBG2. The whole site has been isolated which impacts all services in SGB1-4. We recommend activating your Disaster Recovery Plan."

Interestingly, I had lunch with Octave in San Francisco back in the summer of 2019, where he shared their big plans for expansion and going public. I should have given him our branded fire extinguisher then...

Myth #2: Our Product Will Continue Working

This was almost true, but not for all clients. We sell software that runs inside client infrastructure and is independent of our clouds, one of which had just burned down in France. However, that isn't entirely accurate, because each new pod in Kubernetes registers in the cloud via API at startup. Without this registration, it doesn't start. Although we had tested scenarios where the cloud disconnects from an already running service, the startup path was not covered.

We were in a situation where, after some time, all Kubernetes clients in Europe would have stopped working. Worse, their traffic would have halted until our module was removed from the ingress, because our software operates inline.

This problem was solved with a placeholder that returns a JSON response saying every pod registration is accepted, without any checks. Later, we refactored this part to work properly; now the ingress starts even if the cloud is unavailable.
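To illustrate the idea, here is a minimal sketch of such a placeholder in Go. The endpoint path, response shape, and field names are hypothetical and not Wallarm's actual API; the point is simply that every registration request gets an unconditional "accepted" answer so pods can start while the real cloud is down.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// registrationResponse is a hypothetical payload: the node agent only needs
// to see "accepted" so it proceeds with startup.
type registrationResponse struct {
	Status string `json:"status"`
	NodeID string `json:"node_id"`
}

func main() {
	// Accept any pod registration without validating it against the
	// (unavailable) cloud backend.
	http.HandleFunc("/api/v1/register", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(registrationResponse{
			Status: "accepted",
			NodeID: "stub", // placeholder ID; no real bookkeeping happens here
		})
	})
	log.Println("registration stub listening on :8080")
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

A stub like this buys time: traffic keeps flowing while the real registration backend is rebuilt.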

Myth #3: We'll Restore Everything from Backups

In short, the backups were in the same data center. When we opted for dedicated servers, we wanted them close together for better internal connectivity. Or perhaps we simply weren't paying attention to where our servers were placed.

Either way, backups definitely should have been stored separately and in the cloud. And that's what we do now.

We had to restore data from logs, from old and partial backups found on people's machines, and by decompiling client rules.

Let me elaborate on our rules, called LOM (Local Training Set)—yes, we coined that ourselves. These rules don't reach the client in the same form as they are stored in the cloud; they are compiled into something like a decision tree for faster firewall operation. We needed to create a decompiler for these rules, which we did. We asked clients for their compiled LOM files, decompiled them, and uploaded them back to our cloud.
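Wallarm's real LOM format and compiler are proprietary, so the following is only a toy sketch in Go, under the assumption that compiled rules form a decision tree whose leaves carry actions. It walks such a tree and emits flat, human-readable rule entries, which is the general shape of the decompilation problem described above.

```go
package main

import "fmt"

// node is a hypothetical compiled-rule tree node: internal nodes branch on a
// request attribute, leaves carry the action the firewall should take.
type node struct {
	Attribute string // e.g. "path", "header:X-Debug" (illustrative only)
	Value     string // value matched at this branch
	Action    string // non-empty only on leaves, e.g. "block", "allow"
	Children  []*node
}

// rule is one flat entry recovered from the compiled tree.
type rule struct {
	Conditions []string
	Action     string
}

// decompile walks the tree depth-first, accumulating match conditions along
// the path and emitting a rule at every leaf.
func decompile(n *node, path []string, out *[]rule) {
	if n == nil {
		return
	}
	if n.Attribute != "" {
		path = append(path, fmt.Sprintf("%s == %q", n.Attribute, n.Value))
	}
	if len(n.Children) == 0 {
		*out = append(*out, rule{
			Conditions: append([]string(nil), path...), // copy the path
			Action:     n.Action,
		})
		return
	}
	for _, c := range n.Children {
		decompile(c, path, out)
	}
}

func main() {
	// Tiny example tree: block requests to /admin carrying a suspicious header.
	tree := &node{Children: []*node{
		{Attribute: "path", Value: "/admin", Children: []*node{
			{Attribute: "header:X-Debug", Value: "1", Action: "block"},
		}},
	}}
	var rules []rule
	decompile(tree, nil, &rules)
	for _, r := range rules {
		fmt.Printf("IF %v THEN %s\n", r.Conditions, r.Action)
	}
}
```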

We couldn't just restart client training from scratch: the product operates in blocking mode, with settings tuned to prevent false positives and with structures describing client application APIs. This project was executed successfully with minimal losses, and we were able to bring the rules back up.

Client Support

Moving forward—I must say, we DID NOT LOSE A SINGLE CLIENT due to this incident. Despite everything initially going against us, we received not only words of support but real assistance in the form of LOM files, as mentioned earlier. Most importantly, we felt genuine understanding and love from our clients, which is hard to quantify or describe.

I want to thank everyone again; it was very heartwarming.

I Wrote to the CEO of Google Cloud, and He Replied

We decided to set up our new European cloud on Google Cloud. We didn't have an account manager in France, only in the USA. Google isn't the easiest company to communicate with, and they don't process tickets very quickly.

So, I decided to reach out directly to the top on a whim and sent an email to Thomas Kurian, CEO of Google Cloud, with the subject "URGENT: GCP Account Manager is missing in the middle crisis." He responded in just 3 minutes and resolved the issue with the manager. Everything had to be done quickly, and, surprisingly, it worked.

Memorable Souvenirs and Awards

After the incident, we awarded bonuses to all participants (our "firefighters"), tiered by their contribution to cleaning up the consequences and by sleepless nights. To add a bit of humor, we also released a series of 'This is fine' meme dogs, one for each category of "firefighter" who received it.

Hard Lessons Learned

To keep it short, just take this advice and share it with your team now. Use it as a checklist; believe me, it's mandatory.

  1. Don't store backups in the same data center as your data.
  2. Make secondary backups of critical data in the cloud (see the sketch after this list).
  3. Backup your Kubernetes master.
  4. Conduct annual "drills" to simulate which parts of your system will fail under various circumstances, especially if you have many software components in different locations.
  5. Don't be afraid to share accurate and honest status updates with clients; they will help and support you.
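As a concrete illustration of points 1 and 2, here is a minimal Go sketch that copies a local backup archive to an off-site bucket in Google Cloud Storage (the cloud we ended up on). The bucket and file names are placeholders; what matters is that the copy lands with a different provider and in a different region than the primary data.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"log"
	"os"
	"time"

	"cloud.google.com/go/storage"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
	defer cancel()

	// Placeholder names: point these at a bucket hosted with a different
	// provider and region than the servers being backed up.
	const bucket = "example-offsite-backups"
	localPath := "/var/backups/db-dump.tar.gz"
	objectName := fmt.Sprintf("db-dump-%s.tar.gz", time.Now().UTC().Format("2006-01-02"))

	client, err := storage.NewClient(ctx) // uses application default credentials
	if err != nil {
		log.Fatalf("storage client: %v", err)
	}
	defer client.Close()

	f, err := os.Open(localPath)
	if err != nil {
		log.Fatalf("open backup: %v", err)
	}
	defer f.Close()

	// Stream the archive into the bucket; Close commits the upload.
	w := client.Bucket(bucket).Object(objectName).NewWriter(ctx)
	if _, err := io.Copy(w, f); err != nil {
		log.Fatalf("upload: %v", err)
	}
	if err := w.Close(); err != nil {
		log.Fatalf("finalize upload: %v", err)
	}
	log.Printf("uploaded %s to gs://%s/%s", localPath, bucket, objectName)
}
```

Run something like this on a schedule, and test the restore path during the annual drills mentioned in point 4.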

Conclusions

We live in a world where a data center fire caused by a single uninterruptible power supply can burn down two buildings. Design, fire suppression, and isolation offer no guarantees. Rely only on yourself and your team, and mitigate risks with trusted providers (hint: we've now been tested).

Thanks to our clients and our teams: support for handling all the communication and frayed nerves, DevOps for their level-headed actions and the swift setup of a new European cloud, development for the rules decompiler and all the programming effort, detection, everyone I forgot to mention, and of course Viktor for coordinating the process and for the disaster recovery plan that was ready before the fire.

I'm out of words to express my gratitude to everyone involved in this project. Thank you!

If you're interested, consider joining us at Wallarm; we're hiring globally and can help with relocation or remote work. We're cool. Write to us at [email protected] (we're especially looking for strong technical product managers (TPMs), DevOps engineers, and Go/C developers).


Written by d0znpp | SSRF bible author, Wallarm founder
Published by HackerNoon on 2024/04/23