Here at Space Ape Games we developed an in-house tech to auto scale DynamoDB throughput and have used it successfully in production for a few years. It’s even integrated with our LiveOps tooling and scales up our DynamoDB tables according to the schedule of live events. This way, our tables are always provisioned just ahead of that inevitable spike in traffic at the start of an event.
It looks as though the author’s test did not match the kind of workload that DynamoDB Auto Scaling is designed to accommodate:
In our case, we also have a high write-to-read ratio (typically around 1:1) because every action the players perform in a game changes their state in some way. So unfortunately we can’t use DAX as a get-out-of-jail free card.
When you modify the auto scaling settings on a table's read or write throughput, DynamoDB automatically creates or updates CloudWatch alarms for that table: four for writes and four for reads.
As you can see from the screenshot below, DynamoDB auto scaling uses CloudWatch alarms to trigger scaling actions. When the consumed capacity units breach the utilization level on the table (which defaults to 70%) for 5 consecutive minutes, it scales up the corresponding provisioned capacity units.
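To make the alarm maths concrete, here is a small sketch of how the scale-up threshold appears to be derived. The function name and the example numbers are mine, not the AWS API's; the only assumption is that the alarm compares a 1-minute SUM of consumed units against the provisioned rate scaled to the same window.

```javascript
// Hypothetical helper illustrating the alarm threshold maths.
// ConsumedWriteCapacityUnits is reported by CloudWatch as a SUM over
// each 1-minute period, so the per-second provisioned rate is
// multiplied by 60 to put both numbers on the same scale.
function scaleUpAlarmThreshold(provisionedUnits, targetUtilization) {
  return provisionedUnits * targetUtilization * 60;
}

// e.g. a table provisioned at 100 WCU with the default 70% target:
// the alarm fires when consumption exceeds this many units per minute
// for 5 consecutive 1-minute periods.
console.log(scaleUpAlarmThreshold(100, 0.7));
```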
From our own tests we found that DynamoDB's lacklustre performance at scaling up is rooted in two problems:
Based on these observations, we hypothesize that you can make two modifications to the system to improve its effectiveness:
As part of this experiment, we also prototyped these changes (by hijacking the CloudWatch alarms) to demonstrate their improvement.
The most important thing for this test is a reliable and reproducible way of generating the desired traffic patterns.
To do that, we have a recursive function that makes `BatchWrite` requests against the DynamoDB table under test every second. The items-per-second rate is calculated from the elapsed time (`t`) in seconds, which gives us a lot of flexibility to shape the traffic pattern we want.
Since a Lambda function can only run for a max of 5 mins, when `context.getRemainingTimeInMillis()` is less than 2000 the function recurses and passes the last recorded elapsed time (`t`) in the payload for the next invocation.
The result is the continuous, smooth traffic pattern you see below.
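A minimal sketch of that load generator, under stated assumptions: the bell-curve shape, the trough/peak numbers, and the helper names are mine for illustration (the real code is in the linked repo), and the actual `BatchWriteItem` and Lambda self-invocation calls are elided as comments so the sketch stays self-contained.

```javascript
// Items/second as a function of elapsed time t (seconds): a 24h bell
// curve oscillating between an assumed trough of 25 and peak of 300.
function itemsPerSecond(t) {
  const trough = 25;
  const peak = 300;
  const dayFraction = (t % 86400) / 86400;
  return Math.round(
    trough + (peak - trough) * (1 - Math.cos(2 * Math.PI * dayFraction)) / 2
  );
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function handler(event, context) {
  let t = (event && event.elapsed) || 0;
  // Keep writing one batch per second until we are nearly out of time.
  while (context.getRemainingTimeInMillis() >= 2000) {
    // await batchWrite(tableName, itemsPerSecond(t)); // real BatchWriteItem call elided
    await sleep(1000);
    t += 1;
  }
  // Out of time: self-invoke asynchronously, passing the elapsed time so
  // the next invocation continues the pattern where this one left off.
  // await lambda.invoke({ InvocationType: 'Event',
  //   Payload: JSON.stringify({ elapsed: t }) }).promise();
  return t;
}
```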
We tested with two traffic patterns we see regularly.
This should be a familiar traffic pattern for most — a slow & steady buildup of traffic from the trough to the peak, followed by a faster drop off as users go to sleep. After a period of steady traffic throughout the night things start to pick up again the next day.
For many of us whose user base is concentrated in the North America region, the peak is usually around 3–4am UK time — the more reason we need DynamoDB Auto Scaling to do its job and not wake us up!
This traffic pattern is characterised by a) steady traffic at the trough, b) slow & steady build up towards the peak, c) fast drop off towards the trough, and repeat.
This sudden burst of traffic is usually precipitated by an event — a marketing campaign, a promotion by the app store, or in our case a scheduled LiveOps event.
In most cases these events are predictable and we scale up DynamoDB tables ahead of time via our automated tooling. However, in the unlikely event of an unplanned burst of traffic (and it has happened to us a few times) a good auto scaling system should scale up quickly and aggressively to minimise the disruption to our players.
This pattern is characterised by a) a sharp climb in traffic, b) a slow & steady decline, c) staying at a steady level until the anomaly finishes and traffic returns to the Bell Curve pattern.
We tested these traffic patterns against several `utilization level` settings (the default is 70%) to see how DynamoDB auto scaling handles them. We measured the performance of the system by:
These results will act as our control group.
We then tested the same traffic patterns against the 2 hypothetical auto scaling changes we proposed above.
To prototype the proposed changes we hijacked the CloudWatch alarms created by DynamoDB auto scaling using CloudWatch events.
When a `PutMetricAlarm` API call is made, our `change_cw_alarm` function is invoked and replaces the existing CloudWatch alarms with the relevant changes, i.e. setting `EvaluationPeriods` to 1 (so the alarm breaches after 1 minute rather than 5) for hypothesis 1.
To avoid an invocation loop, the Lambda function only modifies a CloudWatch alarm if its `EvaluationPeriods` has not already been changed to 1.
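The core of that loop guard can be sketched as a pure function (the name and shape are mine; the real handler is wired to a CloudWatch Event on `PutMetricAlarm` and calls the AWS SDK, which is elided here):

```javascript
// Given the existing alarm configuration, return the updated
// PutMetricAlarm parameters, or null when no change is needed.
function rewriteAlarm(alarm) {
  // Loop guard: our own PutMetricAlarm call re-triggers the CloudWatch
  // Event, so only rewrite the alarm if it still has the default
  // 5-period breach setting.
  if (alarm.EvaluationPeriods === 1) {
    return null; // already rewritten, do nothing
  }
  // Preserve every other alarm attribute; only shorten the breach window.
  return Object.assign({}, alarm, { EvaluationPeriods: 1 });
}
```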
The change_cw_alarm function changed the breach threshold for the CloudWatch alarms to 1 min.
For hypothesis 2, we have to take over the responsibility of scaling up the table, as we need to calculate the new provisioned capacity units using a custom metric that tracks the actual request count. This is why the `AlarmActions` for the CloudWatch alarm are also overridden here.
The SNS topic is subscribed to a Lambda function which scales up the throughput of the table.
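The capacity calculation in that Lambda function might look something like the sketch below. The function name and the 1-minute window are assumptions for illustration, not the repo's exact code; the idea is simply to provision enough capacity that the observed traffic sits at the target utilization level.

```javascript
// Derive the new provisioned throughput from the actual request count
// (a 1-minute SUM from the custom metric) and the target utilization.
function newProvisionedUnits(requestCountPerMin, targetUtilization) {
  const requestsPerSec = requestCountPerMin / 60;
  // Scale up so that current traffic equals the target utilization of
  // the new capacity, e.g. 70 req/s at 70% target -> 100 units.
  return Math.ceil(requestsPerSec / targetUtilization);
}
```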
The test is set up as follows:
All the units in the diagrams are SUM/min, which is how CloudWatch tracks `ConsumedWriteCapacityUnits` and `WriteThrottleEvents`, but I had to normalise `ProvisionedWriteCapacityUnits` (which is tracked as a per-second unit) to make them consistent.
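That normalisation is a one-liner (the function name is mine):

```javascript
// ProvisionedWriteCapacityUnits is reported per second, while the other
// metrics are 1-minute sums; multiplying by 60 puts them on the same axis.
function provisionedAsSumPerMin(provisionedPerSec) {
  return provisionedPerSec * 60;
}
```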
Let’s start by seeing how the control group (vanilla DynamoDB auto scaling) performed at different utilization levels from 30% to 80%.
I’m not sure why the `total consumed units` and `total request count` metrics didn’t match exactly when the utilization level is between 30% and 50%, but seeing as there were no throttled events I’m going to put that difference down to inaccuracies in CloudWatch.
A few observations from these results:
Some observations:
Scaling on actual request count and using actual request count to calculate the new provisioned capacity units yields amazing results. There were no throttled events at 30%-70% utilization levels.
Even at the 80% utilization level, both the `success rate` and the total number of throttled events improved significantly.
This is an acceptable level of performance for an auto scaling system, one that I’d be happy to use in a production environment. That said, I’d still err on the side of caution and choose a utilization level at or below 70% to give the table enough headroom to deal with sudden spikes in traffic.
The test is set up as follows:
Once again, let’s start by looking at the performance of the control group (vanilla DynamoDB auto scaling) at various utilization levels.
Some observations from the results above:
Some observations:
Similar to what we observed with the Bell Curve traffic pattern, this implementation is significantly better at coping with sudden spikes in traffic at all utilization levels tested.
Even at the 80% utilization level (which really doesn’t leave you with a lot of headroom) an impressive 94% of write operations succeeded (compared with 73% recorded by the control group). Whilst there is still a significant number of throttled events, it compares favourably against the 500k+ count recorded by vanilla DynamoDB auto scaling.
I like DynamoDB, and I would like to use its auto scaling capability out of the box but it just doesn’t quite match my expectations at the moment. I hope someone from AWS is reading this, and that this post provides sufficient proof (as you can see from the data below) that it can be vastly improved with relatively small changes.
Feel free to play around with the demo, all the code is available here.
theburningmonk/better-dynamodb-scaling: Make DynamoDB’s autoscaling action happen faster (github.com)
Like what you’re reading? Check out my video course Production-Ready Serverless and learn how to run a serverless application in production.
We will cover topics including:
and include all the latest changes announced at the recent AWS re:Invent conference!
Production-Ready Serverless: See it. Do it. Learn it! Production-Ready Serverless: Operational Best Practices introduces you to leading patterns and… (bit.ly)