How To Reduce Risks And Prepare For The Unknown

Written by sweeti_bharti | Published 2020/09/01
Tech Story Tags: cloud-computing | cloud | google-cloud-platform | aws | data-migration | cloud-migration | cloudcover | cloud-native

TLDR Head of Kubernetes at CloudCover shares five steps that you can take now to reduce risks, prepare for the unknown, and find practical solutions. The elephant in our room is a distributed monolith, which precludes the lift-and-shift (rehosting) migration approach. The complexity of dealing with intricately interwoven services makes piecemeal and batch migrations unviable options. We are changing the entire stack, rearchitecting so to speak, to keep our customer’s promise of being Always On.

To be Always On in uncertain times, mundane must be your new black.
In this article I'll share with you five steps that you can take now to reduce risks, find practical solutions, and prepare for the unknown. It's also my team's five-step journey toward predictable outcomes and customer success. 
The context of being Always On
Keeping our customer’s promise to its user base means that we have to ensure zero data loss, zero downtime, and zero or minimal changes in the application’s code base. 
Unfortunately, knowledge of the operating environment is a preemptive strike that rarely gets the respect it’s due. It's also a lesson that we learned the hard way this year.
The elephant in our room is a distributed monolith, which precludes the lift-and-shift (rehosting) migration approach. 
We tested multiple hypotheses, but the complexity of dealing with intricately interwoven services makes piecemeal and batch migrations unviable options. Which leaves us with the big-bang approach.
Comparing apples to oranges in a cloud-native stack
I’m not saying that being a monolith is archaic or irrelevant, but being a distributed one in the cloud is an obnoxious burden.
My team isn't just migrating the infrastructure and data from one giant cloud to another. Instead, we are changing the entire stack, rearchitecting so to speak, to keep our customer’s promise of being Always On.
Here’s how that stacks up for us after some brainstorming sessions and experiments.
Let's now pull back the curtain on our five steps:
  1. Create a snapshot of the current cloud reality
  2. Choose the foundational pillars for infra and data care
  3. Load test the alternate cloud reality with live traffic
  4. Fix problems as they surface
  5. Brace yourself for the unknowable unknowns

[Prerequisite] Plan for failure when the rubber meets the road

A big-bang migration has only a binary outcome. You either fail or succeed, and even the smartest people cannot plan for what’s unknown and unknowable in the field.
And without observability tools, you’d simply be shooting in the dark. For instance, in the heat of the moment, we'd easily have 50 dashboards and hundreds of metrics being tracked for hours.
Monitoring, tracing, and logging certainly help us fix problems as they surface while we are shifting live traffic from AWS to GCP. 
There’s no rollback to AWS after 50 percent of the traffic moves to GCP because there’s no data to sync back to AWS after this point.
Segregation of duties might seem obvious in theory, but it’s a lifesaver in the field. Collaboration ceases to be a platitude when you have barely 6 hours to complete the cutover without hangups on the user experience front. 
It’s at such times that you truly appreciate sweating in peace, having fierce friends (Google's team of engineers), and heading out with a plan. 
Stuff still breaks, but you aren't left blindsided.

[Step 1] Create a snapshot of the current cloud reality

Risk mitigation is a key concern, which led us to set the following objectives.
To address these objectives, we created an alternate universe within GCP that matches the current reality within AWS. Here’s what the current cloud reality looks like in AWS.
Getting Down to Business
The cloud infrastructure team built the CI/CD pipeline to deploy all 75+ services and 200+ jobs across 15 GKE clusters. And the data team worked on replicating 220+ tables from the AWS universe to the GCP one and keeping them in sync.
The data migration journey has its share of challenges.
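A replication effort like this lives or dies by continuous verification. The sketch below is not our tooling; it's a minimal, hypothetical parity check that compares row counts between a source table (assumed to be DynamoDB, given the DAX reference in step 4) and its Spanner counterpart. The instance, database, and table names are placeholders, and a real check would also compare checksums and watch replication lag, not just counts.

```python
# Hypothetical parity check: compare row counts for a table replicated from
# DynamoDB (assumed source, per the DAX reference in step 4) into Spanner.
import boto3
from google.cloud import spanner


def dynamodb_count(table_name: str) -> int:
    """Count items in a DynamoDB table with a paginated COUNT scan."""
    client = boto3.client("dynamodb")
    total, kwargs = 0, {"TableName": table_name, "Select": "COUNT"}
    while True:
        page = client.scan(**kwargs)
        total += page["Count"]
        if "LastEvaluatedKey" not in page:
            return total
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]


def spanner_count(database, table_name: str) -> int:
    """Count rows in the corresponding Spanner table."""
    with database.snapshot() as snapshot:
        rows = snapshot.execute_sql(f"SELECT COUNT(*) FROM {table_name}")
        return list(rows)[0][0]


if __name__ == "__main__":
    db = spanner.Client().instance("core-instance").database("core-db")
    for table in ("Users", "Orders"):
        print(table, dynamodb_count(table), spanner_count(db, table))
```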

[Step 2] Choose the foundational pillars for infra and data care

Choosing Google Kubernetes Engine (GKE)
The customer has over 5K servers in their ECS clusters for services and jobs. 
To avoid the maintenance overhead associated with managing one giant cluster, we chose a multi-cluster setup.
The following are some of the implementation decisions around GKE:
  • 15 GKE clusters host 75+ services and 200+ jobs, consuming over 25K cores at peak traffic. Each GKE cluster scales up to 255 nodes.
  • Dedicated clusters are assigned to run critical jobs. Services are distributed across the GKE clusters based on throughput and the colocation of dependent services and jobs. 
  • Ingress-nginx is preferred over Istio and Layer 7 Internal Load Balancer (L7 ILB) for inter-cluster and intra-cluster communication.
  • A DaemonSet is deployed on each GKE cluster to customize the sysctl flags at the node level (see the sketch after this list).
  • NodeLocal DNSCache improves DNS lookup time and, in turn, latency. Without it, every lookup goes to kube-dns, which adds latency. With NodeLocal DNSCache enabled, the pods that run on a given node hit a local DNS cache, resulting in fewer calls to kube-dns and faster responses.
Note: The kube-dns pods not being scheduled on node pools with taints is still an open issue.
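To make the DaemonSet bullet above concrete, here's a minimal sketch, not our production manifest, of a privileged DaemonSet that tunes node-level sysctl flags, created through the Kubernetes Python client. The name sysctl-tuner, the image, and the specific flag values are illustrative assumptions.

```python
# Minimal sketch of a privileged DaemonSet that sets node-level sysctl flags.
# Name, image, and flag values are illustrative, not our production settings.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running in-cluster

daemonset = {
    "apiVersion": "apps/v1",
    "kind": "DaemonSet",
    "metadata": {"name": "sysctl-tuner", "namespace": "kube-system"},
    "spec": {
        "selector": {"matchLabels": {"app": "sysctl-tuner"}},
        "template": {
            "metadata": {"labels": {"app": "sysctl-tuner"}},
            "spec": {
                "hostNetwork": True,
                "containers": [{
                    "name": "tuner",
                    "image": "alpine:3.12",
                    "securityContext": {"privileged": True},
                    # Apply the flags once, then idle so the pod stays scheduled.
                    "command": ["sh", "-c",
                                "sysctl -w net.core.somaxconn=32768 "
                                "net.ipv4.tcp_tw_reuse=1 && tail -f /dev/null"],
                }],
            },
        },
    },
}

client.AppsV1Api().create_namespaced_daemon_set(namespace="kube-system", body=daemonset)
```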
Choosing Google Cloud Spanner and Bigtable
As maintenance overhead is undesirable to the customer, we aren't even considering open source software (OSS) solutions, such as Cassandra and Aerospike. 
Which leaves us with managed database services on GCP that offer comparable performance in terms of latency and support for secondary indexes.
Given these constraints, using Spanner and Bigtable in concert seems like the most logical conclusion. 
Bigtable offers comparable performance, but it does not support secondary indexes; Spanner does.
Our decision to use NoSQL and SQL databases in concert led to the discovery of the game changer that accelerated the cloud-to-cloud migration journey.
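To illustrate the "in concert" idea, here's a minimal sketch, not the customer's actual schema: Bigtable handles high-throughput lookups by row key, while Spanner serves the queries that need secondary indexes. The project, instance, table, column, and index names are all hypothetical.

```python
# Sketch of using Bigtable and Spanner side by side. All identifiers are placeholders.
from google.cloud import bigtable, spanner

# Bigtable: fast point reads by row key (no secondary indexes).
bt_client = bigtable.Client(project="my-project")
bt_table = bt_client.instance("events-instance").table("user_events")
row = bt_table.read_row(b"user#1234")  # single-key lookup; returns None if absent
if row:
    print(row.cells)

# Spanner: SQL with secondary indexes for everything keyed another way.
sp_client = spanner.Client(project="my-project")
database = sp_client.instance("core-instance").database("core-db")
with database.snapshot() as snapshot:
    results = snapshot.execute_sql(
        "SELECT UserId, Email FROM Users@{FORCE_INDEX=UsersByEmail} "
        "WHERE Email = @email",
        params={"email": "jane@example.com"},
        param_types={"email": spanner.param_types.STRING},
    )
    for user_id, email in results:
        print(user_id, email)
```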

[Step 3] Load test the alternate cloud reality with live traffic

After several rounds of functional validation for beta users on the GCP production infrastructure, it's time to load test it.
Due to our big-bang approach, running separate load tests for 75+ services in parallel isn't practical. Besides, we don't want to run the risk of corrupting our production data with test data. 
The only way to succeed with end-to-end testing is to rehearse the entire production traffic in a controlled manner. 
To learn how we simulated the users' experience on GCP without going live on GCP, see Inter cloud routing using the Zuul API gateway.
With the help of a custom-built solution on top of the Zuul API gateway, we can shadow live traffic from AWS to GCP and uncover several issues with our GCP setup.
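For readers who want the gist without the linked write-up: shadowing means serving every request from the current reality (AWS) while firing an asynchronous copy at the alternate one (GCP) and discarding its response. Our production implementation is a custom filter on the Zuul API gateway; the snippet below is only a conceptual Python sketch of the idea, with placeholder endpoints.

```python
# Conceptual sketch of traffic shadowing, not the Zuul-based production code.
# The AWS and GCP endpoints below are placeholders.
import concurrent.futures
import requests

PRIMARY = "https://api.aws.example.com"   # serves the real response
SHADOW = "https://api.gcp.example.com"    # receives a copy; its response is discarded

_pool = concurrent.futures.ThreadPoolExecutor(max_workers=32)


def handle(path: str, payload: dict) -> dict:
    # Fire-and-forget copy to the GCP stack so it sees real traffic shapes.
    _pool.submit(requests.post, f"{SHADOW}{path}", json=payload, timeout=2)
    # The user-facing response still comes only from AWS.
    return requests.post(f"{PRIMARY}{path}", json=payload, timeout=2).json()


if __name__ == "__main__":
    print(handle("/v1/orders", {"sku": "demo", "qty": 1}))
```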
Here’s how our current and alternate cloud realities stack up at the time of load testing.
You've already seen the AWS stack (left) in step 1. The following image illustrates the GCP stack (right).

[Step 4] Fix problems as they surface

At 50 percent of peak load testing, Istio buckled and the setup faced socket hangups, increased latency, and crashing of Istio Mixer’s policy functionality.
As we didn't have the luxury of time or an in-house Istio expert to handle load at this scale, we dropped Istio in favor of L7 ILB (Envoy).
Envoy performs well at 50 percent peak load, but it starts to fail at 60 percent. And that’s internal traffic of over 200K requests per second (RPS) that we are talking about here.
After much ado and yet another round of architectural changes, we chose ingress-nginx. Although that point of contention is finally put to rest, the overall performance is still unsatisfactory to me.
With no DAX equivalent on GCP and the customer’s requirement of zero or minimal change in the application’s code base, we spent days optimizing Redis and Spanner to prevent database snags at peak traffic. 
We've listed some of these problems in the following table.
Table: Problems surfaced and resolved (the table links to Spanner's design best practices).
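One recurring mitigation behind several of those rows was to front the hottest Spanner reads with Redis so that peak traffic doesn't hammer the database. The following is a minimal sketch of that caching pattern, not our production code; the key format, TTL, instance IDs, and table schema are assumptions.

```python
# Sketch of caching hot Spanner reads in Redis. Identifiers and TTL are assumptions.
import json

import redis
from google.cloud import spanner

cache = redis.Redis(host="localhost", port=6379)
database = spanner.Client().instance("core-instance").database("core-db")


def get_user(user_id: str, ttl_seconds: int = 60) -> dict:
    cache_key = f"user:{user_id}"
    cached = cache.get(cache_key)
    if cached:
        return json.loads(cached)  # cache hit: no Spanner round trip

    with database.snapshot() as snapshot:
        rows = snapshot.execute_sql(
            "SELECT UserId, Email FROM Users WHERE UserId = @id",
            params={"id": user_id},
            param_types={"id": spanner.param_types.STRING},
        )
        row = next(iter(rows), None)

    result = {"user_id": row[0], "email": row[1]} if row else {}
    cache.setex(cache_key, ttl_seconds, json.dumps(result))
    return result
```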

[Step 5] Brace yourself for unknowable unknowns

After performance optimization and several days of testing at peak load, it was time to flip the switch and make GCP the current cloud reality. 
Everyone involved was ready to get this big-bang migration over with. However, everything came to a halt 24 hours before the actual cutover! 
A stranger walks into our lives.
COVID-19 derailed our migration journey. How so?
Organisations of all shapes and sizes were switching to Google Cloud to ensure business continuity and support remote work as the pandemic hit.
Due to the sudden surge in cloud usage, we took a step back to reevaluate our infrastructure needs.
That pause proved helpful: we had time to review all our efforts and iron out minor issues.
Our efforts paid off when we performed that gargantuan leap from one cloud to another with zero service downtime and zero data loss. 
We kept the customer’s promise of being Always On by making mundane the new black.

Written by sweeti_bharti | Head of Kubernetes at CloudCover
Published by HackerNoon on 2020/09/01