
Methodology Of Autoscaling For Kinesis Streams


@hariohmprasathHari Ohm Prasath

https://www.linkedin.com/in/hariohmprasath/

Problem statement

Since AWS doesn’t support auto-scaling for Kinesis streams, most of the time we either over-provision the shards and pay more, or under-provision them and take a hit on performance.

Solution:

Using a combination of CloudWatch alarms, SNS topics, and Lambda, we can implement auto-scaling for Kinesis streams, managing the shard count to hit the right balance between cost and performance.

Overview of the solution:

Both scale up and scale down are implemented by monitoring the “PutRecords.Bytes” metric of the stream:
  • Scale up (doubling the shards) happens automatically once the stream is utilized beyond 80% of its capacity at least once in a given 2-minute rolling window.
  • Scale down (halving the shards) happens automatically once the stream is utilized below 40% of its capacity at least 3 times over consecutive 15-minute evaluation windows.
Note:
We can also use the “IncomingBytes” or “IncomingRecords” metrics to implement the same.
We scale up quickly so we don’t take a hit on performance, and scale down slowly so we avoid flapping between too many scale-up and scale-down operations.
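The scaling policy above boils down to two small helpers. This is a sketch in Node.js (matching the post’s Lambda); `scaleOut`/`scaleIn` are hypothetical names, and the clamps mirror the MAX_SHARDS/MIN_SHARDS environment variables the Lambda uses later in the post.

```javascript
// Scale out doubles the shard count, scale in halves it; both are
// clamped so the stream stays within the configured shard limits.
function scaleOut(currentShards, maxShards) {
  return Math.min(currentShards * 2, maxShards);
}

function scaleIn(currentShards, minShards) {
  return Math.max(Math.ceil(currentShards / 2), minShards);
}

console.log(scaleOut(49, 100)); // 98
console.log(scaleIn(98, 10));   // 49
```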

Implementation details:

Determine % of utilization for Kinesis streams:
Let’s say:
Payload size = 100 KB
Total records (per second) = 500
AWS recommended shard count = 49
To determine whether it’s 80% utilized, we do the following:
Maximum that can be written in 2 minutes = (100 * 500) * 60 * 2 = 6,000,000 KB
80% of it would be 4,800,000 KB
Similarly, 40% of it would be 2,400,000 KB
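The arithmetic above can be captured in a one-line helper (a sketch; `thresholdKb` is a hypothetical name): threshold = records/sec × payload KB × period seconds × target utilization.

```javascript
// Compute an alarm threshold, in KB, for a given write rate and window.
function thresholdKb(recordsPerSec, payloadKb, periodSec, utilization) {
  return recordsPerSec * payloadKb * periodSec * utilization;
}

console.log(thresholdKb(500, 100, 120, 0.8)); // 4800000 -> 80% of 2-minute capacity
console.log(thresholdKb(500, 100, 120, 0.4)); // 2400000 -> 40% of 2-minute capacity
```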

Scale out

The below diagram shows the flow when we perform a scale out operation
Configuration for “Scale out alarm” would be:
  • Metric Name: PutRecords.Bytes
  • Stream Name: ETL Kinesis stream
  • Period: 2 minutes
  • Threshold: >= 4800000
  • Datapoints: 1
  • Statistic: Sum
  • Action: Notify topic “Scale SNS topic”
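As a sketch, the bullet list above maps onto the input of the AWS SDK’s `putMetricAlarm()` roughly as follows. The SNS topic ARN is a placeholder, and note that the post expresses thresholds in KB while “PutRecords.Bytes” itself reports bytes, so verify the units against your stream.

```javascript
// "Scale out alarm" configuration expressed as putMetricAlarm() input.
const scaleOutAlarm = {
  AlarmName: 'Scale out alarm',
  Namespace: 'AWS/Kinesis',
  MetricName: 'PutRecords.Bytes',
  Dimensions: [{ Name: 'StreamName', Value: 'ETL Kinesis stream' }],
  Statistic: 'Sum',
  Period: 120,            // 2 minutes, in seconds
  EvaluationPeriods: 1,   // Datapoints: 1
  Threshold: 4800000,     // 80% figure from the worked example above
  ComparisonOperator: 'GreaterThanOrEqualToThreshold',
  AlarmActions: ['arn:aws:sns:...:Scale-SNS-topic'], // placeholder ARN
};

console.log(scaleOutAlarm.AlarmName);
```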
So when we reach 80% of the stream capacity the following happens:
  • CloudWatch alarm “Scale out alarm” will be triggered
  • A notification is sent to the SNS topic “Scale SNS topic”, which triggers “Scale out lambda”
  • Lambda will scale the number of shards to current shards * 2 and update the threshold based on the new shard count. Let’s say the new shard count is 98; then:
  • Max bytes that can be written in 2 minutes (100 KB * 1000 records/sec) = 12,000,000 KB
  • 80% of it would be 9,600,000 KB
  • 40% of it would be 4,800,000 KB
  • Reset the “Scale out alarm” back to the “OK” state from the “ALARM” state

Scale in

The below diagram shows the flow when we perform a scale in operation
Note:
The 40% value is calculated for a 15-minute interval = (100 * 1000 * 15 * 60) * 40/100 = 36,000,000 KB
Configuration for “Scale in alarm” would be:
  • Metric Name: PutRecords.Bytes
  • Stream Name: ETL Kinesis stream
  • Period: 15 minutes
  • Threshold <= 36000000
  • Datapoints: 3
  • Statistic: Sum
  • Action: Notify topic “Scale in SNS topic”
So when we utilize only 40% of the stream capacity the following happens:
  • CloudWatch alarm “Scale in alarm” will be triggered
  • A notification is sent to the SNS topic “Scale in SNS topic”, which triggers “Scale in lambda”
  • Lambda will scale the number of shards to current shards / 2 and update the threshold based on the new shard count.
  • Reset the “Scale in alarm” back to the “OK” state from the “ALARM” state

Lambda (scale out)

The attached code “Lambda.js” is written in Node.js and is just for demo purposes, so instead of dynamically calculating the threshold value it is hardcoded; the same code can be enhanced to determine all of the things mentioned in this post.

Environment variables:

  • ALARM_NAME = “Scale out alarm”
  • MAX_SHARDS = 100
  • MIN_SHARDS = 10
  • STREAM_NAME = “ETL Kinesis stream”

Pseudo code:

  • Describe the stream and only continue if the stream is not in the “UPDATING” state (i.e., any previous scale up/down action has completed)
  • Calculate the new shard count
  • Update the CloudWatch alarm threshold using “putMetricAlarm()”
  • Update the shard count using “updateShardCount()”
  • Wait for the shard scale up or scale down operation to complete using “describeStream()” (not implemented in the current code), then reset the state of the alarm to “OK” using “setAlarmState()”

COGS reduction:

In order to get 1,000 TPS for a payload of record size 100 KB we need to pay $1,223.62; with this approach we can control scale up and scale down, which directly reduces the cost of these AWS resources.
I am happy to take any feedback or suggestions for improving this approach; thanks for reading.
