Throughout my career, I have had to implement mechanisms for scheduling ad-hoc tasks a number of times. Often, they are part of a much bigger system.
The TL;DR is that I need a way to execute a piece of code at a specified point in time in the future. This is possible in just about every programming language. For example, .Net has the [Timer](https://docs.microsoft.com/en-us/dotnet/api/system.timers.timer?view=netframework-4.7.2) class and JavaScript has the [setInterval](https://www.w3schools.com/jsref/met_win_setinterval.asp) function. But I find myself wanting a service abstraction to work with instead. Sadly, AWS does not offer a service for this type of workload. CloudWatch Events is the closest thing, but not quite.
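To illustrate the same idea in Python (just a sketch of the language-level primitive, not anything AWS-specific), `threading.Timer` runs a callback once after a delay:

```python
import threading

def send_reminder():
    print("time to send that reminder!")

# fire send_reminder once, 60 seconds from now
timer = threading.Timer(60.0, send_reminder)
timer.start()
```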
CloudWatch Events delivers a near real-time stream of system events that describe changes in AWS resources. EC2 instance terminated, ECS task started, Lambda function created, etc.
To react to these system events, you can subscribe a Lambda function to an Event Pattern. Whenever an event is matched, CloudWatch Events would invoke the target Lambda function on your behalf.
In addition, CloudWatch Events also lets you create cron jobs easily.
This lets you invoke a Lambda function at a fixed rate (down to every minute). Or, you can specify a custom schedule using a cron expression.
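For example, a rule that triggers a Lambda function every minute could be set up with boto3 like this (the rule name and function ARN are placeholders; the function also needs a resource-based policy allowing `events.amazonaws.com` to invoke it):

```python
import boto3

events = boto3.client("events")

# create (or update) a rule that fires every minute; a cron expression
# such as "cron(30 9 ? * MON-FRI *)" could be used instead
events.put_rule(
    Name="my-scheduled-rule",
    ScheduleExpression="rate(1 minute)",
)

# point the rule at the target Lambda function
events.put_targets(
    Rule="my-scheduled-rule",
    Targets=[{
        "Id": "my-function",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:my-function",
    }],
)
```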
However, CloudWatch Events is not designed for running lots of ad-hoc tasks, each to be executed once, at a specific time.
The default limit on CloudWatch Events is a lowly 100 rules per region per account. It’s a soft limit, so it’s possible to request a limit increase. But the low initial limit suggests it’s not designed for use cases where you need to schedule millions of ad-hoc tasks.
CloudWatch Events is designed for executing recurring tasks.
Because there are no other suitable AWS services, I had to implement a scheduling service myself a few times in my career. I experimented with a number of different approaches, including exposing the `Timer` class as an HTTP endpoint.

Lately, I have seen a number of folks use DynamoDB Time-To-Live (TTL) to implement these ad-hoc tasks. In this post, we will take a look at this approach and see where it might be applicable for you.
For this type of ad-hoc task, we normally care about:

- **precision**: how close to the scheduled time is the task actually executed?
- **scale (number of open tasks)**: can the solution support many tasks waiting to be executed?
- **scale (throughput)**: can the solution handle many tasks scheduled to execute around the same time?
At a high level, this approach looks like this:

- A `scheduled_items` DynamoDB table, which holds all the tasks that are scheduled for execution.
- A `scheduler` function that writes the scheduled task into the `scheduled_items` table, with the TTL set to the scheduled execution time.
- An `execute-on-schedule` function that subscribes to the DynamoDB Stream for `scheduled_items` and reacts to `REMOVE` events. These events correspond to items having been deleted from the table.

Since the number of open tasks translates directly to the number of items in the `scheduled_items` table, this approach can scale to millions of open tasks.
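Here is a minimal sketch of what the `scheduler` function's write might look like (the table name, attribute names and `schedule_task` signature are all illustrative; the `ttl` attribute must be whichever attribute you enabled TTL on, stored as a Unix timestamp in seconds):

```python
import os
import time
import uuid

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table(os.environ.get("TABLE_NAME", "scheduled_items"))

def schedule_task(payload, execute_at):
    """Write a task into the table; DynamoDB's TTL process will delete it
    (and thereby trigger the stream event) around execute_at."""
    task_id = str(uuid.uuid4())
    table.put_item(
        Item={
            "id": task_id,
            "payload": payload,
            # the attribute TTL is enabled on; must be epoch seconds
            "ttl": execute_at,
        }
    )
    return task_id

# e.g. schedule a task for 5 minutes from now
schedule_task({"action": "send-reminder"}, int(time.time()) + 300)
```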
DynamoDB can handle large throughputs (thousands of TPS) too. So this approach can also be applied to scenarios where thousands of items are scheduled per second.
When many items are deleted at the same time, they are simply queued in the DynamoDB Stream. AWS also autoscales the number of shards in the stream, so the stream keeps pace as throughput increases.
But events are processed in sequence. So it can take some time for your function to process an event, depending on:

- how far back in the stream the event sits, and
- how long it takes to process the events ahead of it.

So, while this approach can scale to support many tasks all expiring at the same time, it cannot guarantee that tasks are executed on time.
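For illustration, a minimal sketch of the `execute-on-schedule` function (the payload shape and `run_task` helper are hypothetical):

```python
import json

def run_task(payload):
    """Hypothetical helper: whatever executing the task actually means."""
    print(f"executing task: {json.dumps(payload)}")

def handler(event, context):
    for record in event["Records"]:
        # deletions (including TTL expiry) arrive as REMOVE events;
        # INSERT/MODIFY events from normal writes are ignored
        if record["eventName"] != "REMOVE":
            continue

        # TTL deletions are performed by the DynamoDB service principal,
        # which distinguishes them from user-initiated deletes
        principal = record.get("userIdentity", {}).get("principalId")
        if principal != "dynamodb.amazonaws.com":
            continue

        # requires the stream to use the OLD_IMAGE (or NEW_AND_OLD_IMAGES)
        # view type so the deleted item's attributes are included
        run_task(record["dynamodb"]["OldImage"])
```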
This is the big question about this approach. According to the official documentation, expired items are deleted within 48 hours. That is a huge margin of error!
As an experiment, I set up a Step Functions state machine to:

- add items to the `scheduled_items` table, with TTL expiring between 1 and 10 mins
- compare each task's scheduled execution time against when it is actually picked up by the `execute-on-schedule` function

The state machine looks like this:
I performed several runs of tests. The results are consistent regardless of the number of items in the table. A quick glance at the results tells you that, on average, a task is executed over 11 mins AFTER its scheduled time.
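For reference, here is roughly how the lag can be measured inside the `execute-on-schedule` function, assuming the stream is configured with the `OLD_IMAGE` view type and the TTL attribute is named `ttl`:

```python
def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] != "REMOVE":
            continue

        # approximate time DynamoDB actually deleted the item (epoch seconds)
        deleted_at = record["dynamodb"]["ApproximateCreationDateTime"]
        # the time the task was scheduled for, from the item's TTL attribute
        scheduled_at = int(record["dynamodb"]["OldImage"]["ttl"]["N"])

        lag = deleted_at - scheduled_at
        print(f"task executed {lag:.0f}s after its scheduled time")
```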
I repeated the experiments in several other AWS regions:
I don’t know why there is such a marked difference between US-EAST-1 and the other regions. One explanation is that the TTL process requires a bit of time to kick in after a table is created. Since I was developing against the US-EAST-1 region initially, its TTL process had been “warmed” compared to the other regions.
Based on the results of my experiment, it would appear that using DynamoDB TTL as a scheduling mechanism cannot guarantee reasonable precision.
On the one hand, the approach scales very well. But on the other, the scheduled tasks are executed at least several minutes behind, which renders it unsuitable for many use cases.
Hi, my name is Yan Cui. I’m an AWS Serverless Hero and the author of Production-Ready Serverless. I have run production workloads at scale on AWS for nearly 10 years, and I have been an architect or principal engineer in a variety of industries, ranging from banking and e-commerce to sports streaming and mobile gaming. I currently work as an independent consultant focused on AWS and serverless.
You can contact me via Email, Twitter and LinkedIn.
Check out my new course, Complete Guide to AWS Step Functions.
In this course, we’ll cover everything you need to know to use the AWS Step Functions service effectively, including basic concepts, HTTP and event triggers, activities, design patterns and best practices.
Get your copy here.
Come learn about operational BEST PRACTICES for AWS Lambda: CI/CD, testing & debugging functions locally, logging, monitoring, distributed tracing, canary deployments, config management, authentication & authorization, VPC, security, error handling, and more.
You can also get 40% off the face price with the code ytcui.
Get your copy here.