How do you control the number of concurrent state machine executions that can access a shared resource?

Here at DAZN, we are migrating from our legacy platform into a brave new world of microfrontends and microservices. Along the way, we also discovered the delights that AWS Step Functions has to offer, for example:

- flexible error handling and retry
- the understated ability to wait between tasks
- the ability to mix automated steps with activities that require human intervention

In some cases, we need to control the number of concurrent state machine executions that can access a shared resource. This might be a business requirement. Or it could be due to scalability concerns for the shared resource. It might also be a result of the design of our state machine, which makes it difficult to parallelise.

We came up with a few solutions that fall into two general categories:

1. Control the number of executions that you can start.
2. Allow concurrent executions to start, but block an execution from entering the critical path until it's able to acquire a semaphore (i.e. a signal to proceed).

Control the number of concurrent executions

You can control the MAX number of concurrent executions by introducing an SQS queue. A CloudWatch schedule will trigger a Lambda function to:

1. check how many concurrent executions there are
2. if there are N running executions, then we can start up to MAX - N more
3. poll SQS for up to MAX - N messages, and start a new execution for each

We're not using the new SQS trigger for Lambda here, because the purpose is to slow down the creation of new executions, whereas the SQS trigger would push tasks to our Lambda function eagerly.

Also, you should use a FIFO queue so that tasks are processed in the same order they're added to the queue.

Block execution using semaphores

You can use the ListExecutions API to find out how many executions are in the RUNNING state. You can then sort them by startDate and only allow the eldest executions to transition to the states that access the shared resource.
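Going back to the first approach, the scheduled "drain" Lambda can be sketched along the lines below. This is only a sketch: `makeHandler`, `batchSize` and the injected clients are my own illustrative names (they are not from the post); the `.promise()` style matches the aws-sdk v2 snippets used elsewhere in this post.

```javascript
// Pure helper: how many messages to poll this tick.
// SQS caps a single ReceiveMessage call at 10 messages, so cap there too.
const batchSize = (running, max) => Math.max(0, Math.min(max - running, 10))

// Factory so the AWS clients can be injected, e.g.
// makeHandler({ sfn: new AWS.StepFunctions(), sqs: new AWS.SQS(), ... })
const makeHandler = ({ sfn, sqs, stateMachineArn, queueUrl, maxConcurrency }) =>
  async () => {
    // 1. check how many executions are currently RUNNING
    const { executions } = await sfn
      .listExecutions({ stateMachineArn, statusFilter: 'RUNNING' })
      .promise()

    const n = batchSize(executions.length, maxConcurrency)
    if (n === 0) return 0 // at capacity, try again on the next tick

    // 2. poll SQS for up to MAX - N messages
    const { Messages = [] } = await sqs
      .receiveMessage({ QueueUrl: queueUrl, MaxNumberOfMessages: n })
      .promise()

    // 3. start a new execution for each message,
    //    deleting the message only after the execution has started
    for (const msg of Messages) {
      await sfn.startExecution({ stateMachineArn, input: msg.Body }).promise()
      await sqs
        .deleteMessage({ QueueUrl: queueUrl, ReceiptHandle: msg.ReceiptHandle })
        .promise()
    }

    return Messages.length
  }

module.exports = { batchSize, makeHandler }
```

Injecting the clients keeps the capacity logic testable without AWS; deleting each message only after `startExecution` succeeds means a crash mid-loop leaves the task on the queue to be retried on the next tick.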
Take the following state machine, for instance. The OnlyOneShallRunAtOneTime state invokes the one-shall-pass Lambda function and returns a proceed flag. The Shall Pass? state then branches the flow of this execution based on the proceed flag.

```yaml
OnlyOneShallRunAtOneTime:
  Type: Task
  Resource: arn:aws:lambda:us-east-1:xxx:function:one-shall-pass
  Next: Shall Pass?

Shall Pass?:
  Type: Choice
  Choices:
    - Variable: $.proceed # check if this execution should proceed
      BooleanEquals: true
      Next: SetWriteThroughputDeltaForScaleUp
  Default: WaitToProceed # otherwise wait and try again later

WaitToProceed:
  Type: Wait
  Seconds: 60
  Next: OnlyOneShallRunAtOneTime
```

The tricky thing here is how to associate the Lambda invocation with the corresponding Step Function execution. Unfortunately, Step Functions does not pass the execution ARN to the Lambda function. Instead, we have to pass the execution name as part of the input when we start the execution.

```javascript
const name = uuid().replace(/-/g, '_')
const input = JSON.stringify({ name, bucketName, fileName, mode })

const req = { stateMachineArn, name, input }
const resp = await SFN.startExecution(req).promise()
```

When the one_shall_pass function runs, it can use the execution name from the input. It's then able to match the invocation against the executions returned by ListExecutions.

In this particular case, only the eldest execution can proceed. All other executions would transition to the WaitToProceed state.

```javascript
module.exports.handler = async (input, context) => {
  const executions = await listRunningExecutions()
  Log.info(`found ${executions.length} RUNNING executions`)

  const oldest = _.sortBy(executions, x => x.startDate.getTime())[0]
  Log.info(`the oldest execution is [${oldest.name}]`)

  if (oldest.name === input.name) {
    return { ...input, proceed: true }
  } else {
    return { ...input, proceed: false }
  }
}
```

Compare the approaches

Let's compare the two approaches against the following criteria:
- Scalability. How well does the approach cope as the number of concurrent executions goes up?
- Simplicity. How many moving parts does the approach add?
- Cost. How much extra cost does the approach add?

Scalability

Approach 2 (blocking executions) has two problems when you have a large number of concurrent executions. First, you can hit the regional throttling limit on the ListExecutions API call.

Second, if you have configured a timeout on your state machine (and you should!), then the blocked executions can also time out. This creates backpressure on the system.

Approach 1 (with SQS) is far more scalable by comparison. Queued tasks are not started until they are allowed to start, so there is no backpressure. Only the cron Lambda function needs to list executions, so you're also unlikely to reach the API limits.

Simplicity

Approach 1 introduces new pieces of infrastructure: SQS, a CloudWatch schedule and a Lambda function. It also forces the producers to change, since they now have to enqueue a task rather than start an execution directly.

With approach 2, a new Lambda function is needed for the additional step, but it's part of the state machine itself.

Cost

Approach 1 introduces a minimal baseline cost even when there are no executions. However, we are talking about cents here…

Approach 2 introduces additional state transitions, which cost around $25 per million. See the Step Functions pricing page for more details. Since each execution incurs 3 state transitions per minute while it's blocked, the cost of these transitions can pile up quickly.

Conclusions

Given the two approaches we considered here, using SQS is by far the more scalable. It is also more cost-effective as the number of concurrent executions goes up.

But you need to manage additional infrastructure and force upstream systems to change. This can impact other teams, and ultimately affects your ability to deliver on time.

If you do not expect a high number of executions, then you might be better off going with the second approach.

Hi, my name is Yan Cui. I'm an AWS Serverless Hero and the author of Production-Ready Serverless.
I have run production workloads at scale in AWS for nearly 10 years, and I have been an architect or principal engineer in a variety of industries ranging from banking, e-commerce and sports streaming to mobile gaming. I currently work as an independent consultant focused on AWS and serverless.

You can contact me via Email, Twitter and LinkedIn.

Check out my new course, Complete Guide to AWS Step Functions. In this course, we'll cover everything you need to know to use the AWS Step Functions service effectively, including basic concepts, HTTP and event triggers, activities, design patterns and best practices. Get your copy here.

Come learn about operational BEST PRACTICES for AWS Lambda: CI/CD, testing & debugging functions locally, logging, monitoring, distributed tracing, canary deployments, config management, authentication & authorization, VPC, security, error handling, and more. You can also get 40% off the face price with the code ytcui. Get your copy here.