Product Developer at CloudCover
Game changers emerge when you are deep in the trenches laying the foundation.
In this article I’m going to unveil the game changer that my colleagues Shashank and Sweeti alluded to in their articles "3 Risk-Mitigation Lessons That We Learned The Hard Way This Year" and "How To Reduce Risks And Prepare For The Unknown".
It accelerated our customer’s cloud-to-cloud migration journey with almost no change in the application’s code base and no adverse impact on latency.
Speaking of games, the belief some people hold that game changers have a 30,000-foot view of the problem piques my interest.
My two cents to this belief is that true transformation has its roots in the trenches where the foundation is laid.
And game changers, while working in those trenches, figure out exactly which foundational pieces provide the mechanical advantage to change the game itself.
From this lens, I feel that defining the problem space in terms of business challenges and customer needs seems essential to laying the foundation and gaining that advantage. So, that's where I'll start.
Let’s now look at the problem space that makes delivering on this customer’s promise a daunting task.
In his article, Shashank alluded to in-flight refueling maneuvers to illustrate the risks that are associated with a migration at this scale. And he shared five objectives to mitigate risks and to ensure predictability.
Laying the foundation, in this context, means knowing your operating environment and your formidable opponent.
My formidable opponent was none other than Amazon’s NoSQL beast on steroids: DynamoDB with DAX (managed cache solution).
It’s naive to think of competing against this beast with a SQL database. Nobody in their right mind would ever consider these two comparable.
DynamoDB with DAX is in a league of its own. You CANNOT tame this one with a SQL database.
So, my greatest challenge was two-fold: First, how do I tame this beast on steroids? And second, how do I accomplish this feat in less than 60 days?
Speed, cost, and quality of service are all critical, and all at the same time. I’m sure conventional project management practices balance these three pillars to offer a realistic outcome.
However, the gospel that Cloud Computing and Cloud Native Computing are proclaiming is winning over even those on the fence, which makes such discussions moot. So, let’s not even talk about tech startups that are soaking in cloud-native technologies.
Also, in 2020, it’s pointless resisting this wave when I might have a better shot at riding it. I rest my case exactly at this point.
Here I am, in the trenches, laying the foundation for that gravitational shift. And I need all the help I can get, and then some.
In her article, Sweeti described how we chose the foundational pillars for infrastructure and data care.
After testing our hypotheses, we learned that taming this beast on steroids needs three players to work in concert: Google Cloud Spanner, Bigtable, and Redis.
So, we benchmarked the setup for almost a month to figure out whether its latency was anywhere near that of DynamoDB and DAX combined. Obviously, it wasn't.
However, after evaluating its impact on the users' experience, which was negligible, our customer accepted a latency of 30-50 ms.
I don’t need to tell you that DynamoDB with DAX can probably offer a response time of 3 ms.
And I benchmarked the DynamoDB database in its entirety because I felt that I might never get another opportunity to develop a solution for over 500K queries per second (QPS).
The Foundational Test
This step is critical because our ability to deliver on the customer’s promise of being Always On depends on how well our foundational pieces hold up against the customer’s existing stack on AWS.
The following image illustrates how the GCP stack mirrored the customer’s AWS stack.
This image is probably hard to read at this size, and you can see the individual cloud stacks next. However, I used it to draw your attention to the line between the external Zuul API gateway on the AWS side and Google Cloud Load Balancer (GCLB) on the GCP side. It indicates how we used a portion of ShareChat's live traffic for load testing.
The following image illustrates the customer's AWS stack.
The following image illustrates the customer's GCP stack.
Let’s briefly look at the pieces on the GCP side.
With this first gigantic piece in place, which isn't as easy to come by as it is to write about, we continued to dig deeper to find the other levers to effect that gravitational shift.
Spanner Schema Choices
We had no clue to what extent the customer’s application depended on the DynamoDB interface, or what all the data types and fields per object were.
Getting the schema configuration right is the prerequisite to all the subsequent tasks.
We stored the schema configuration in a table in Spanner, which has four fields: tableName, column, originalColumn, and dataType.
Spanner supports only one special character in column names: the underscore (_). DynamoDB, however, supports others too. So, the column and originalColumn fields have different values whenever a special character other than an underscore is present.
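To make the mapping concrete, here is a minimal Python sketch of how such a schema-configuration row might be built. The helper names (`to_spanner_column`, `schema_config_row`) are hypothetical illustrations, not the adapter's actual API.

```python
import re

def to_spanner_column(original: str) -> str:
    """Replace any character Spanner doesn't allow in column names with an
    underscore. Hypothetical helper: the adapter's real rules may differ."""
    return re.sub(r"[^A-Za-z0-9_]", "_", original)

def schema_config_row(table: str, original: str, data_type: str) -> dict:
    """Build one row of the schema-configuration table described above.

    column and originalColumn differ only when the DynamoDB name contains
    a special character other than an underscore.
    """
    return {
        "tableName": table,
        "column": to_spanner_column(original),
        "originalColumn": original,
        "dataType": data_type,
    }
```

For example, a DynamoDB attribute named `user-id` would be stored with `column = "user_id"` and `originalColumn = "user-id"`.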
With the schema in place, we chose four application services to determine the extent of coupling between the services and DynamoDB.
The game changer that we now call dynamodb-adapter was a primitive service that I developed as a stopgap while we were all still in the trenches.
Its sole purpose was to enable the application services to field-test against it so that we could figure out the gaps in our coverage of DynamoDB's functionality.
Errors started to unravel the service-to-database coupling story. And that’s why Shashank emphasized the need to honor the mundane routines in his article.
It’s also why we decided to set our first objective of decoupling the services from the database.
Atomic Increments in Bigtable
Field testing moved dynamodb-adapter to its next iteration and put in place another component called Atomic Counter.
Our initial data-type mapping for DynamoDB's N (number) type was FLOAT64 in Spanner. However, of the 110 tables that we migrated to Bigtable, 12 had atomic counters of type INT.
Bigtable does not support atomic increments on float values. So, all counter fields are stored as integers, and the other N-type variables are stored as floats.
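One reason for this constraint: Bigtable's atomic increment (the ReadModifyWriteRow operation) interprets the cell value as a 64-bit big-endian signed integer. A minimal sketch of that encoding, assuming counters are stored exactly in that wire format:

```python
import struct

def encode_counter(value: int) -> bytes:
    # Bigtable's atomic increment operates on 64-bit big-endian signed
    # integers, which is why counter fields must be integers, not floats.
    return struct.pack(">q", value)

def decode_counter(raw: bytes) -> int:
    # Decode the 8-byte big-endian cell value back into a Python int.
    return struct.unpack(">q", raw)[0]
```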
Query and Data Transformation
As I dug deeper through the field errors and improved dynamodb-adapter, it was clear to me that I'd have to create a channel to enable communication among four distinct object types.
Bigtable’s storage and retrieval mechanism involves binary data, that of Spanner involves interface data, that of DynamoDB is totally different from Bigtable and Spanner, and DynamoDB Stream objects are a different beast altogether.
So, how do I effect change without changing the application’s code base? From this daunting question emerged the following components.
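As an illustration of one leg of that channel, here is a minimal sketch of converting DynamoDB's AttributeValue wire format into plain values that a Spanner or Bigtable write could consume. It covers only a few type tags; the real Query Transformer and Response Parser handle far more types and edge cases.

```python
def from_dynamodb(attr: dict):
    """Convert a DynamoDB AttributeValue, e.g. {"S": "hi"} or {"N": "3.5"},
    into a plain Python value. Minimal illustrative sketch only."""
    (type_tag, value), = attr.items()
    if type_tag == "S":
        return value
    if type_tag == "N":
        # N arrives as a string; per the mapping above, non-counter
        # N values become floats (counters are handled separately).
        return float(value)
    if type_tag == "BOOL":
        return value
    if type_tag == "L":
        return [from_dynamodb(item) for item in value]
    if type_tag == "M":
        return {k: from_dynamodb(v) for k, v in value.items()}
    raise ValueError(f"unsupported type tag: {type_tag}")
```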
Up to this point, I had built seven components to solve the technical challenges as they surfaced: Schema Configuration, Atomic Counter, Config Controller, Query Transformer, Response Parser, DynamoDB Stream Object Collector, and DynamoDB Stream Object Creator.
Each of these components has a behind-the-scenes drama that could lend itself to another article. But to spare you that drama, and for the sake of brevity (considering how long this article already is), I’ll try my best to cover only the highlights.
The Sidecar Pattern
I placed dynamodb-adapter (erstwhile DB Driver) within the same Kubernetes pod as the application service for three reasons:
This sidecar pattern, as illustrated in the following image, enables dynamodb-adapter (erstwhile DB Driver) to scale along with the application service and eliminates the burden of maintenance.
Session Count Manager
We had 15 GKE clusters hosting over 75 services and 200 jobs, consuming over 25K cores at peak traffic. And dynamodb-adapter had access to seven Spanner instances.
If we consider a minimum of 100 sessions per instance, our total session count is as follows:
Session_count = 25,000 × 7 × 100 = 17.5 million sessions
where 25,000 is the number of pods, 7 is the number of Spanner instances, and 100 is the minimum number of sessions per instance.
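The arithmetic is easy to check, using only the numbers above:

```python
pods = 25_000        # pods running dynamodb-adapter as a sidecar
instances = 7        # Spanner instances
min_sessions = 100   # minimum sessions per instance per pod

total = pods * instances * min_sessions
print(total)  # 17500000, i.e. 17.5 million sessions
```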
You might be wondering why we are engaging all seven Spanner instances. That seems ludicrous, right?
The answer lies in the second lesson that Shashank shared in his article:
"Run away as fast and as far as you can from a distributed monolith."
Because of the inherent nature of the customer's platform, we had no clue about service-to-table mapping.
It’s something that Sweeti also highlighted in her article when she was fixing problems as they surfaced.
All services have direct access to a common data pool, which is not the case in a microservices architecture. There, each microservice has access to only its own data, and access to another microservice's data happens through a service-to-service call.
As always, the Google Spanner team was by our side. Through their internal logs, they figured out which application service was hitting which Spanner instance, thereby enabling us to engage only the relevant Spanner instances.
Addressing this issue that was deep in the trenches led me to create Session Manager.
Session Manager tags each session in the session pool and enables each dynamodb-adapter service to engage only those Spanner instances that are relevant to its service.
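A minimal sketch of that idea, assuming we have the service-to-instance mapping recovered from Google's logs. The names and data shapes here are illustrative, not Session Manager's actual API:

```python
def sessions_for(service, service_instance_map, all_instances,
                 min_sessions=100):
    """Open sessions only against the Spanner instances a service uses.

    Falls back to all instances when the mapping is unknown, which is
    exactly the 17.5-million-session situation described above.
    """
    relevant = service_instance_map.get(service, all_instances)
    return {inst: min_sessions for inst in relevant}
```

With a hypothetical mapping of `{"feed-service": ["spanner-1", "spanner-4"]}`, each feed-service pod opens sessions against 2 instances instead of all 7, cutting its session footprint by more than two-thirds.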
Slow Query Logging
Having fixed problems as they surfaced, we continued to face new challenges. So, it was time to mitigate risks through Shashank’s third objective: bring more visibility to the services.
In her article, Sweeti explained why we replaced Istio and Envoy with ingress-nginx and how that addressed some performance issues.
However, she was unhappy with the overall performance because we couldn’t figure out why responses to four key services were exceeding the timeout threshold of two minutes.
We were aware that building a caching layer in front of Spanner could address some of these issues, so we added Redis. However, we still needed to figure out what was causing the performance snag.
To view all the queries and indexes across the system, we used Lightstep (an observability tool).
I investigated all the queries that hit the timeout threshold and developed a component called Slow Query Logging to identify and ignore all the slow queries, that is, queries with a response time of more than 10 seconds.
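A minimal sketch of the logging half of this component, assuming queries run through a single execution function. The real component also tracks the queries it has flagged so it can skip them on subsequent calls:

```python
import logging
import time
from functools import wraps

# Per the article: responses over 10 seconds count as slow.
SLOW_QUERY_THRESHOLD_S = 10.0

logger = logging.getLogger("slow-query")

def log_slow_queries(execute):
    """Wrap a query-execution function; log any call exceeding the
    threshold. Illustrative sketch, not the component's actual code."""
    @wraps(execute)
    def wrapper(query, *args, **kwargs):
        start = time.monotonic()
        try:
            return execute(query, *args, **kwargs)
        finally:
            elapsed = time.monotonic() - start
            if elapsed > SLOW_QUERY_THRESHOLD_S:
                logger.warning("slow query (%.1fs): %s", elapsed, query)
    return wrapper
```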
Data Consistency with DB Audit
The DB Audit component ensures data consistency between DynamoDB and Spanner/Bigtable.
When a query reaches Spanner or Bigtable, DB Audit fetches the row from DynamoDB and performs a row-to-row data audit.
It sends both objects, from DynamoDB and from Spanner/Bigtable, to the Audit service via a REST API, which calculates the difference between the objects. BigQuery serves as the datastore for this service.
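The row-to-row comparison can be sketched as a field-wise diff. The dict shapes and key names here are assumptions for illustration; the real Audit service persists these diffs to BigQuery:

```python
def row_diff(dynamodb_row: dict, target_row: dict) -> dict:
    """Return the fields whose values differ between the DynamoDB row
    and its Spanner/Bigtable counterpart. An empty result means the
    row crossed over intact."""
    keys = set(dynamodb_row) | set(target_row)
    return {
        k: {"dynamodb": dynamodb_row.get(k), "target": target_row.get(k)}
        for k in keys
        if dynamodb_row.get(k) != target_row.get(k)
    }
```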
NOTE: DB Audit isn’t a core component of dynamodb-adapter, but it’s critical to boost your confidence during such migrations because it tells you if everything crossed over.
Also, I recommend that you turn on DB Audit to ensure data consistency if you intend to explore a database-migration route from DynamoDB to Spanner.
Being born in the cloud helps us tremendously because it naturally makes the cloud our foundation (home turf), so helping our customers ride the cloud is second nature to us.
But while working in the trenches, we figured out exactly which foundational pieces provide that mechanical advantage to change this migration game.
The following image illustrates how the components fit together to provide this mechanical advantage.
All of this materialized without touching the application’s code base and without any increase in the agreed-upon latency.
Config Controller emerged as the primary lever, and it enabled dynamodb-adapter (erstwhile DB Driver) to become almost 80% configuration-driven.
Config Controller controls the read and write operations on a per-table basis. It also controls the operations of the DB Audit component on a percentage basis per table.
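A sketch of how such per-table, percentage-based control might look. The configuration field names (`auditPercent`, `writeEnabled`) are hypothetical, not the actual column names:

```python
import random

def should_audit(table: str, config: dict, rng=random.random) -> bool:
    """Decide whether to audit this operation, using a per-table
    percentage from the configuration. rng is injectable for testing."""
    percent = config.get(table, {}).get("auditPercent", 0)
    return rng() * 100 < percent

def is_write_enabled(table: str, config: dict) -> bool:
    """Per-table write switch, defaulting to disabled."""
    return config.get(table, {}).get("writeEnabled", False)
```

Driving behavior from configuration like this is what let dynamodb-adapter flip reads, writes, and audits per table without redeploying the application services.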
We stored its configuration in one of the Spanner tables with the following columns.
[Spoiler Alert] We are collaborating with the Google Cloud Spanner team to develop dynamodb-adapter as an open source project under the Cloud Spanner Ecosystem.
I’ll end this article with my quote that Shashank shared with you in his article. And I’m sure that you now have a better view of why I said that.
"We are creating a query translation engine that performs string processing logic and complicated database operations. Latency should increase, right? But, it does not. We delivered a solution with no increase in latency!"