As explained in part one, I recently presented a talk at the AWS Community Day in Bangalore. The tweet following the talk became my most popular tweet ever and I received quite a few requests for more details.
This is part two of the blog post. I hope you enjoy it as well! Please do not hesitate to give feedback, share your own stories or simply like :)
Every single time I worked with applications that needed to scale up, the database was the first component to fail, or at least to have issues.
I never fully understood why, but it feels like most of the teams I have worked with were scared of doing anything with the database, me included by the way. Maybe that is related to the “traditional” difficulty of running databases and replicating them. For me, looking back, the reason is that I was always scared of losing the data, which also means I was not confident enough in my ability to restore the data in case of failure (I may delve into that topic in another post).
People often think that if you want to scale the database, you need to use NoSQL. The problem with this statement is that it is entirely wrong. Most companies don’t have the scale problem of Amazon.com, Facebook or Google, yet a lot of them embark blindly on the NoSQL road thinking it will solve all the problems. Before you hit a roadblock with SQL databases, you have a fair amount of margin.
Why start with SQL?
Why might you need NoSQL?
“A deep dive on how we were using our existing databases revealed that they were frequently not used for their relational capabilities. About 70 percent of operations were of the key-value kind, where only a primary key was used and a single row would be returned. About 20 percent would return a set of rows, but still operate on only a single table.” Werner Vogels, All Things Distributed (2012)
Assuming you have a clear data domain, that you are not using anti-patterns (see part one) and that you have decoupled the database layer from the business layer, the road to scaling your database has three steps (in no particular order).
Read replicas are my favourite way to start scaling the database, and also the simplest in my opinion. Push all the “write” operations to the master instance and all the “reads” to the read replicas. Of course, you need to understand and design your application to be somewhat resilient to data inconsistency (because of the replication lag), but I found that it was often not an issue.
Note1: Amazon Aurora supports up to 15 read replicas, and Amazon Aurora with the MySQL engine can process up to 200,000 writes/second on an R4.16xlarge instance, so really, you aren’t going to break it in your first 10 million users. No really, you won’t!
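To make the read/write split concrete, here is a minimal sketch of a router that sends writes to the master (called `primary` in the code) and spreads reads across replicas. The class name, the endpoint names and the `require_fresh` flag are all assumptions made for illustration; a real application would usually get this from its database driver, ORM or proxy instead.

```python
import itertools


class ReadWriteRouter:
    """Route writes to the primary and spread reads across replicas."""

    # Statement prefixes treated as reads; everything else goes to the primary.
    READ_PREFIXES = ("select", "show")

    def __init__(self, primary, replicas):
        self.primary = primary
        # Round-robin over replicas; fall back to the primary if there are none.
        self._replicas = itertools.cycle(replicas or [primary])

    def endpoint_for(self, sql, require_fresh=False):
        """Return the endpoint that should run this statement.

        require_fresh=True forces a read onto the primary, for the cases
        where replication lag would be visible to the user.
        """
        is_read = sql.lstrip().lower().startswith(self.READ_PREFIXES)
        if is_read and not require_fresh:
            return next(self._replicas)
        return self.primary


# Hypothetical endpoints, for illustration only.
router = ReadWriteRouter(
    primary="primary.db.example.internal",
    replicas=["replica-1.db.example.internal", "replica-2.db.example.internal"],
)
print(router.endpoint_for("SELECT * FROM users WHERE id = 1"))
print(router.endpoint_for("UPDATE users SET name = 'x' WHERE id = 1"))
```

The `require_fresh` escape hatch is one way to deal with replication lag: a user who has just saved a change can be served from the primary for that one read, while everything else tolerates slightly stale data.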
The important thing here is that each DB in the federation is self-sustained and independent, which helps spread the load between instances. However, if your queries involve a lot of relations, in this case between “products” and “users”, you will have to do a lot of join operations across databases, which isn’t great.
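As a rough sketch of what federation looks like from the application side, the snippet below maps each functional domain to its own database and shows the kind of join the application ends up doing itself. The hostnames, the `owner_id` column and both function names are hypothetical.

```python
# Hypothetical mapping of functional domains to independent databases.
FEDERATION = {
    "users": "users-db.example.internal",
    "products": "products-db.example.internal",
}


def database_for(domain):
    """Return the database that owns a functional domain."""
    try:
        return FEDERATION[domain]
    except KeyError:
        raise ValueError(f"no database registered for domain {domain!r}") from None


def join_in_application(products, users_by_id):
    """Attach the owner's name to each product row.

    The rows come from two different databases, so the lookup that a SQL
    JOIN would normally do is done here with a dictionary.
    """
    return [
        {**product, "owner_name": users_by_id[product["owner_id"]]["name"]}
        for product in products
    ]


# Rows as they might come back from the two databases (made-up data).
products = [{"id": 1, "owner_id": 7}]
users_by_id = {7: {"name": "Ada"}}
print(join_in_application(products, users_by_id))
```

Every relation that crosses databases needs a helper like `join_in_application`, which is why federation works best when the domains rarely need each other's data.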
Sharding, also referred to as horizontal or range partitioning, is where the application selects a partition by determining whether the partitioning key is within a certain range. Each shard is located on a separate instance in order to spread load. This technique is often used in geographical partitioning for global applications.
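Here is a minimal sketch of range-based shard selection, assuming a numeric user ID as the partitioning key. The boundaries and shard hostnames are made up for the example.

```python
import bisect

# Hypothetical layout: each entry is the exclusive upper bound of a shard's
# key range. The last shard takes every key at or above the final bound.
UPPER_BOUNDS = [1_000_000, 2_000_000]
SHARDS = [
    "shard-a.example.internal",  # user IDs 0 to 999,999
    "shard-b.example.internal",  # user IDs 1,000,000 to 1,999,999
    "shard-c.example.internal",  # user IDs 2,000,000 and above
]


def shard_for(user_id):
    """Return the shard whose key range contains user_id."""
    if user_id < 0:
        raise ValueError("user_id must be non-negative")
    # bisect_right counts how many bounds are <= user_id, which is exactly
    # the index of the shard that owns the key.
    return SHARDS[bisect.bisect_right(UPPER_BOUNDS, user_id)]


print(shard_for(42))
print(shard_for(1_500_000))
```

Keeping the boundaries in one sorted list means adding a shard is a data change rather than a code change, although moving existing rows to the new shard is still the hard part.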
Since you should always plan for the future, keep those three steps in mind when designing applications; you will make better overall design decisions.
Note2: Things are getting easier and easier though — especially since Amazon Aurora now supports Multi-Master and soon will go Serverless. This promises to further help in the journey to scaling your database.
Many of the lessons I have learned have had one thing in common: measuring, or the lack thereof. So one of the most important things to do in your team is to nurture a culture of measuring everything! Measure even things that you don’t think are important now, because one day you will need them, directly or indirectly. Storage is cheap anyway, so don’t save on that.
In general, you should always measure at three different levels: application, network and system (on the machine itself). Why three levels? Because you want to figure out relationships between causes and effects.
For example, if you want to measure (and you should) the number of update queries your database is doing on the application level, you should also measure the network latency of those requests, the number of sockets opened on the instance, the length of the waiting queue in the database, the number of timed-out requests on the load balancer, the number of people closing the application, etc. You will often be surprised by the relationships you find.
So, how do you nurture a culture of measuring everything? Make it ridiculously easy to measure, anywhere in the system, at all levels, and for anyone. One line of code should be all it takes to start a new measurement and graph it out.
Note: I love Python decorators for doing that on the application level.
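As an illustration of the “one line of code” idea, here is a minimal sketch of a timing decorator. The `measured` name and the in-memory `METRICS` store are assumptions for the example; in practice the recorded value would be sent to whatever metrics backend you use.

```python
import functools
import time
from collections import defaultdict

# In-memory stand-in for a real metrics backend (an assumption for this sketch).
METRICS = defaultdict(list)


def measured(name):
    """Record the wall-clock duration of every call under `name`."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Recorded even when the call raises, so failures are timed too.
                METRICS[name].append(time.perf_counter() - start)

        return wrapper

    return decorator


@measured("db.update_user")  # the one line it takes to start measuring
def update_user(user_id, email):
    # Placeholder for the real update query.
    return {"id": user_id, "email": email}


update_user(1, "ada@example.com")
print(len(METRICS["db.update_user"]))
```

Because the decorator only needs a metric name, anyone on the team can add a new measurement without touching the function body.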
But don’t forget to set up alarms and an escalation path once you have measurements in place. Test the alarms and make the response channel highly available too. If your response channel relies on email, test the email service regularly. It would be sad if your system went down because of email limits (story of my life).
Most responses can be automated so learn and work to automate responses — from the beginning. Incorporate those metrics and measurements in weekly reviews with your DevOps team, to see how things improve over time.
Note: One metric I found particularly important is the cost of one user in the system. “How much does a user cost?” I was once asked that question by one of my previous CEOs, and after some hard work, we were able to estimate, pretty accurately, the cost of each user. Needless to say, it became an extremely important metric, both financially and operationally.
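The simplest version of that estimate is total cost divided by active users. The sketch below assumes exactly that and uses made-up cost categories and numbers; it is not the method we actually used, which also had to attribute shared costs.

```python
def cost_per_user(monthly_costs, active_users):
    """Return the average monthly cost of one active user.

    monthly_costs maps a cost category to its monthly amount, all in the
    same currency. active_users is the number of active users that month.
    """
    if active_users <= 0:
        raise ValueError("active_users must be positive")
    return sum(monthly_costs.values()) / active_users


# Made-up numbers, for illustration only.
costs = {"compute": 12_000.0, "database": 6_000.0, "storage": 2_000.0}
print(round(cost_per_user(costs, active_users=50_000), 2))
```

Tracked month over month, even this crude number shows whether the system is getting cheaper or more expensive to run as it grows.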
Measuring without operational targets is often useless.
Another important reason to nurture a culture of measuring everything is for measuring progress. Measuring progress and regression, and showing it to the rest of the company, is a great motivator for developers!
But neither of those will make any difference if you have no idea of the operational targets. What should the response time be for your user to have a good experience? What should the number of concurrent connections per instance be? What should the scaling up or down time be? All those questions need answers, but you first need the questions.
There are plenty of ways for you to find the right questions to answer, but the simplest one is to sit down with the team and ask them. When was the last time you did that?
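Once those questions have answers, comparing measurements against them can be very little code. Here is a minimal sketch, with hypothetical metric names and target values standing in for the ones your team agrees on.

```python
# Hypothetical targets: metric name -> maximum acceptable value.
TARGETS = {
    "response_time_ms": 200,
    "connections_per_instance": 500,
    "scale_up_seconds": 120,
}


def breaches(measurements, targets=TARGETS):
    """Return {metric: (measured, target)} for every metric over its target.

    Metrics without a target are ignored: there is nothing to compare
    them against yet.
    """
    return {
        name: (value, targets[name])
        for name, value in measurements.items()
        if name in targets and value > targets[name]
    }


latest = {"response_time_ms": 350, "connections_per_instance": 120}
print(breaches(latest))
```

The output of a check like this is also a natural input for an alarm: an empty dictionary means every target is met.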
“Failures are a given and everything will eventually fail over time.” Werner Vogels
Chances are, even if you measure everything, test everything, and do everything by the Netflix book, things are still going to break. Sometimes it might be a squirrel, a shark, lightning, a truck driver, burglars … you can never plan for everything Mother Nature can throw at you (just search Google for amazing stories).
So, how do you prepare for the unexpected?
Put a lot of effort into preserving data! At the end of the day, data is gold. If everything crashes but you still have the data, you are kinda “safe”. This means two very important things:
1- Perform backups on a regular basis.
2- Test your backups. A backup is not a backup unless you have tested its recovery.
There are countless stories of nightmare events where companies have tried restoring data from backups that didn’t work. It happened to me and I can tell you that I never want it to happen again.
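To show what “testing the recovery” can mean in its smallest form, here is a sketch using SQLite from the Python standard library: it takes a backup, opens the copy, and checks that the data is really there. The `backup_and_verify` helper and the `users` table are hypothetical, and a row count is only the most basic check. A production database would need its own backup tooling and a more thorough restore drill.

```python
import sqlite3


def backup_and_verify(source, table):
    """Back up `source` into a fresh database and check the copy is usable.

    Returns the restored connection, or raises RuntimeError if the row
    count of `table` differs between the source and the restored copy.
    """
    restored = sqlite3.connect(":memory:")
    source.backup(restored)  # sqlite3's online backup API (Python 3.7+)

    # The table name is interpolated, so it must come from trusted code.
    count_sql = f"SELECT COUNT(*) FROM {table}"
    expected = source.execute(count_sql).fetchone()[0]
    actual = restored.execute(count_sql).fetchone()[0]
    if actual != expected:
        restored.close()
        raise RuntimeError(f"restore check failed: {actual} rows, expected {expected}")
    return restored


# A tiny "live" database with made-up data.
live = sqlite3.connect(":memory:")
live.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
live.executemany("INSERT INTO users (name) VALUES (?)", [("ada",), ("linus",)])
live.commit()

restored = backup_and_verify(live, "users")
print(restored.execute("SELECT name FROM users ORDER BY id").fetchall())
```

The point is the second half of the function: the backup only counts once something has read the data back out of it.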
As already mentioned in part one, embrace failure and chaos engineering methodologies and make sure everyone in the team understands what “limiting the blast radius” means. In fact, this is one of the reasons I really like the idea of micro-services: they are easier to manage in case of fire.
Everyone smiles at this meme, but unfortunately, many still operate that way.
My question to you is the following:
How do you create responsible development teams?
In my experience, giving more responsibility usually translates into becoming more confident and having more “team” purpose, and thus better overall performance. Of course, it also means you should hire great people, but this is not the topic of this post.
“You build it, you run it!” creates better resiliency from the get-go: no one wants to be paged at 3am on Christmas Eve, right? The main advantage of this approach is that when designing new applications, your team will already have a focus on the operating aspects of software development. But there are a few other advantages to it:
Learning from others is the single most important thing I have learned. And I have to admit, sometimes I still have to remind myself of the following:
Shut up and Listen.
Indeed, there is always someone in the room that knows more than you do. That person is just not necessarily broadcasting it.
Be open and ready to be challenged and to change your opinion. Share your ideas, challenge your opinions and especially let others challenge them. Fortunately, there are myriad ways to help.
Read technical blogs — here are some of my favourite ones.
Participate in meetups.
Participate in conferences.
I went a few times as a customer to re:Invent, AWS’s largest conference. Every single time, I came back as if I had found the holy grail. And it wasn’t because of the technical sessions (which are great by the way) but because I met, discussed and shared ideas with total strangers, like you and me. Strangers who only wanted to get better, strangers who didn’t mind sharing a bit of their knowledge. And for that, I am forever grateful to them because without them, I would not be able to write this blog post. Thank you!