Scaling Costs with DynamoDB

Original Post: https://blog.polymail.io/post/scaling-costs-with-dynamodb

As a company grows, more and more data is accumulated. The exciting and terrifying part about working with email is that users come in with a large data set from the get-go. Early on, we knew we needed a database that would allow us to scale fast.

There are generally two ways companies go about this.

Scaling Vertically

Just upgrade to a bigger instance with more CPU, more RAM, and more storage. All you have to do is pay more. 💸 💸 Contrary to popular belief, this is a viable short-term approach, since it requires little to no engineering investment.

Scaling Horizontally

Alternatively, you can split your database across multiple smaller, cheaper instances (or machines). The rise of open source and NoSQL technologies in the past decade has made this much more feasible.

Breaking it down even further, there are a few ways you can go about scaling horizontally, or sharding.

You can put different tables on different instances.

This is the easiest option. It also has the added benefit of the bulkhead pattern, meaning you can isolate your critical tables, such as the users table, from the less critical ones, such as the notifications table. As long as you never have to do joins across those tables, you’re solid! But this won’t work for us because we’re trying to shard a single thread table.

You can do range-based sharding.

In practice this is very difficult and messy to do. You need to pick suitable sizes for your ranges, handle assigning keys appropriately to make sure no instance is overloaded, and handle reallocation of keys when an item is deleted.
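
To make the idea concrete, here is a minimal sketch of a range-based lookup. The shard names and boundaries are hypothetical and picked only for illustration; choosing real boundaries that keep every instance evenly loaded is exactly the hard part described above.

```python
import bisect

# Hypothetical layout: each shard owns the keys below its upper bound
# and at or above the previous shard's upper bound.
UPPER_BOUNDS = ["h", "p", "~"]  # exclusive upper bounds, kept sorted
SHARDS = ["shard-a", "shard-b", "shard-c"]


def shard_for_key(key: str) -> str:
    """Return the shard whose key range contains `key`."""
    # bisect_right counts the bounds that are <= key, which is the
    # index of the first shard whose upper bound is still above it.
    index = bisect.bisect_right(UPPER_BOUNDS, key)
    if index >= len(SHARDS):
        raise KeyError(f"no range covers key {key!r}")
    return SHARDS[index]
```

With these made-up boundaries, a key like "alice" lands on the first shard and "zoe" on the last. If most of your keys happen to start with the same few letters, one shard ends up doing most of the work, and you are back to re-cutting ranges by hand.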

You can do hash-based sharding.

This is a bit easier than range-based sharding, because the hash function will handle balancing for you, as long as you pick a suitable key that distributes your dataset in a uniform manner. For example, a user ID might be a uniform key if each of your users has approximately the same amount of data.
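
As a rough sketch, hash-based routing can be as small as the function below. The shard count and the choice of SHA-256 are assumptions made for this example, not a description of any particular database.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical cluster size


def shard_for_user(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a user ID to a shard index using a stable hash."""
    # Python's built-in hash() is randomized per process for strings,
    # so use a digest that gives the same answer on every machine.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

The hash spreads user IDs evenly across shards, but it only balances the number of keys. If one user has far more data than the rest, the shard holding that user is still the hot one.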

Scaling the number of instances

For the first two it’s easy; if you have more tables or ranges, just put them on a new machine.

For hash-based sharding, you have to rebalance the cluster when you spin up a new instance, because keys that originally mapped to one instance may map to a different one once new instances are introduced. Of course, we’re engineers. We’re smart. We can use consistent hashing to reduce this rebalancing work.
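
To see why consistent hashing helps, here is a small, simplified sketch that compares how many keys change owner when a node is added. The `HashRing` class, the helper names, and the 100 virtual nodes per instance are all illustrative assumptions; production implementations differ in the details.

```python
import bisect
import hashlib


def stable_hash(value: str) -> int:
    """Hash a string to a 64-bit integer that is stable across processes."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes: int = 100):
        self._vnodes = vnodes
        self._points = []  # sorted hash positions on the ring
        self._owners = {}  # hash position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Each node is placed at several positions to smooth out the load.
        for i in range(self._vnodes):
            point = stable_hash(f"{node}#{i}")
            self._owners[point] = node
            bisect.insort(self._points, point)

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first position at or after the key's hash,
        # wrapping around to the start of the ring if needed.
        index = bisect.bisect_left(self._points, stable_hash(key))
        return self._owners[self._points[index % len(self._points)]]


def moved_fraction_modulo(keys, old_n: int, new_n: int) -> float:
    """Fraction of keys that change shard under plain hash-modulo routing."""
    moved = sum(stable_hash(k) % old_n != stable_hash(k) % new_n for k in keys)
    return moved / len(keys)


def moved_fraction_ring(keys, old_nodes, new_node: str) -> float:
    """Fraction of keys that change owner when one node joins the ring."""
    ring = HashRing(old_nodes)
    before = {k: ring.node_for(k) for k in keys}
    ring.add_node(new_node)
    moved = sum(before[k] != ring.node_for(k) for k in keys)
    return moved / len(keys)
```

Going from four instances to five, plain modulo routing reassigns roughly 80% of keys, while the ring moves only about one fifth of them, and every key that moves goes to the new instance. That is the rebalancing work consistent hashing saves.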

Other Factors

Scalability is only one aspect of a database, and the list doesn’t stop there. How about availability? How about redundancy? How about security? These are real questions infrastructure engineers have to think about.

Managed Services

To address these issues, we turned our eyes to managed database services such as AWS’s DynamoDB and GCP’s Spanner. Check out our post below to read more about how we capitalized on these services to handle the tens of millions of emails our users produce, and our real-world experiences with them!

Scaling Costs with DynamoDB: A story about how we capitalized on managed databases services to address scalability issues tens of millions of… (blog.polymail.io)