Yan Cui

AWS Serverless Hero. Independent Consultant. Developer Advocate at Lumigo.

Step Functions: how to implement semaphores for state machines

How do you control the number of concurrent state machine executions that can access a shared resource?

Here at DAZN, we are migrat­ing from our lega­cy plat­form into a brave new world of microfron­tends and microser­vices. Along the way, we also dis­cov­ered the delights that AWS Step Func­tion has to offer, for exam­ple…

  • flex­i­ble error han­dling and retry
  • the under­stat­ed abil­i­ty to wait between tasks
  • the abil­i­ty to mix auto­mat­ed steps with activ­i­ties that require human interven­tion

In some cas­es, we need to con­trol the num­ber of con­cur­rent state machine exe­cu­tions that can access a shared resource. This might be a busi­ness require­ment. Or it could be due to scal­a­bil­i­ty con­cerns for the shared resource. It might also be a result of the design of our state machine which makes it dif­fi­cult to par­al­lelise.

We came up with a few solu­tions that fall into two gen­er­al cat­e­gories:

  1. Con­trol the num­ber of exe­cu­tions that you can start
  2. Allow con­cur­rent exe­cu­tions to start, but block an exe­cu­tion from enter­ing the crit­i­cal path until it’s able to acquire a sem­a­phore (i.e. a sig­nal to proceed)

Control the number of concurrent executions

You can con­trol the MAX num­ber of con­cur­rent exe­cu­tions by intro­duc­ing a SQS queue. A Cloud­Watch sched­ule will trig­ger a Lamb­da func­tion to…

  1. check how many con­cur­rent exe­cu­tions there are
  2. if there are N exe­cu­tions, then we can start MAX-N exe­cu­tions
  3. poll SQS for MAX-N mes­sages, and start a new exe­cu­tion for each

We’re not using the new SQS trig­ger for Lamb­da here because the pur­pose is to slow down the cre­ation of new exe­cu­tions. Where­as the SQS trig­ger would push tasks to our Lamb­da func­tion eager­ly.

Also, you should use a FIFO queue so that tasks are processed in the same order they’re added to the queue.

Block execution using semaphores

You can use the Lis­tEx­e­cu­tions API to find out how many exe­cu­tions are in the RUNNING state. You can then sort them by start­Date and only allow the eldest exe­cu­tions to tran­si­tion to states that access the shared resource.

Take the fol­low­ing state machine for instance.

The Only­One­Shall­RunA­tOne­Time state invokes the one-shall-pass Lambda func­tion and returns a proceed flag. The Shall Pass? state then branch­es the flow of this exe­cu­tion based on the proceed flag.

OnlyOneShallRunAtOneTime:
Type: Task
Resource: arn:aws:lambda:us-east-1:xxx:function:one-shall-pass
Next: Shall Pass?
Shall Pass?:
Type: Choice
Choices:
- Variable: $.proceed # check if this execution should proceed
BooleanEquals: true
Next: SetWriteThroughputDeltaForScaleUp
Default: WaitToProceed # otherwise wait and try again later WaitToProceed:
Type: Wait
Seconds: 60
Next: OnlyOneShallRunAtOneTime

The tricky thing here is how to asso­ciate the Lamb­da invo­ca­tion with the corre­spond­ing Step Func­tion exe­cu­tion. Unfor­tu­nate­ly, Step Func­tions does not pass the exe­cu­tion ARN to the Lamb­da func­tion. Instead, we have to pass the exe­cu­tion name as part of the input when we start the exe­cu­tion.

const name = uuid().replace(/-/g, '_')
const input = JSON.stringify({ name, bucketName, fileName, mode }) const req = { stateMachineArn, name, input }
const resp = await SFN.startExecution(req).promise()

When the one_shall_pass func­tion runs, it can use the exe­cu­tion name from the input. It’s then able to match the invo­ca­tion against the exe­cu­tions returned by Lis­tEx­e­cu­tions.

In this par­tic­u­lar case, only the eldest exe­cu­tion can pro­ceed. All other executions would tran­si­tion to the Wait­To­Pro­ceed state.

module.exports.handler = async (input, context) => {
const executions = await listRunningExecutions()
Log.info(`found ${executions.length} RUNNING executions`)
const oldest = _.sortBy(executions, x => x.startDate.getTime())[0]     
Log.info(`the oldest execution is [${oldest.name}]`)
if (oldest.name === input.name) {
return { ...input, proceed: true }
} else {
return { ...input, proceed: false }
}
}

Compare the approaches

Let’s com­pare the two approach­es against the fol­low­ing cri­te­ria:

  • Scal­a­bil­i­ty. How well does the approach cope as the num­ber of con­cur­rent exe­cu­tions goes up?
  • Sim­plic­i­ty. How many mov­ing parts does the approach add?
  • Cost. How much extra cost does the approach add?

Scalability

Approach 2 (block­ing exe­cu­tions) has two prob­lems when you have a large num­ber of con­cur­rent exe­cu­tions.

First, you can hit the region­al throt­tling lim­it on the ListExecutions API call.

Sec­ond, if you have con­fig­ured time­out on your state machine (and you should!) then they can also time­out. This cre­ates back­pres­sure on the sys­tem.

Approach 1 (with SQS) is far more scal­able by com­par­i­son. Queued tasks are not start­ed until they are allowed to start so no back­pres­sure. Only the cron Lamb­da func­tion needs to list exe­cu­tions, so you’re also unlike­ly to reach API lim­its.

Simplicity

Approach 1 intro­duces new pieces to the infra­struc­ture — SQS, Cloud­Watch sched­ule and Lamb­da. Also, it forces the pro­duc­ers to change as well.

With approach 2, a new Lamb­da func­tion is need­ed for the addi­tion­al step, but it’s part of the state machine.

Cost

Approach 1 intro­duces min­i­mal base­line cost even when there are no exe­cu­tions. How­ev­er, we are talk­ing about cents here…

Approach 2 intro­duces addi­tion­al state tran­si­tions, which is around $25 per mil­lion. See the Step Func­tions pric­ing page for more details. Since each exe­cu­tion will incur 3 tran­si­tions per minute whilst it’s blocked, the cost of these tran­si­tions can pile up quick­ly.

Conclusions

Giv­en the two approach­es we con­sid­ered here, using SQS is by far the more scal­able. It is also more cost effec­tive as the num­ber of con­cur­rent exe­cu­tions goes up.

But, you need to man­age addi­tion­al infra­struc­ture and force upstream sys­tems to change. This can impact oth­er teams, and ulti­mate­ly affects your abil­i­ty to deliv­er on time.

If you do not expect a high num­ber of exe­cu­tions, then you might be bet­ter off going with the sec­ond approach.

Hi, my name is Yan Cui. I’m an AWS Serverless Hero and the author of Production-Ready Serverless. I have run production workload at scale in AWS for nearly 10 years and I have been an architect or principal engineer with a variety of industries ranging from banking, e-commerce, sports streaming to mobile gaming. I currently work as an independent consultant focused on AWS and serverless.

You can contact me via Email, Twitter and LinkedIn.

Check out my new course, Complete Guide to AWS Step Functions.

In this course, we’ll cover everything you need to know to use AWS Step Functions service effectively. Including basic concepts, HTTP and event triggers, activities, design patterns and best practices.

Get your copy here.

Come learn about operational BEST PRACTICES for AWS Lambda: CI/CD, testing & debugging functions locally, logging, monitoring, distributed tracing, canary deployments, config management, authentication & authorization, VPC, security, error handling, and more.

You can also get 40% off the face price with the code ytcui.

Get your copy here.

More by Yan Cui

Topics of interest

More Related Stories