Incorporate AWS Fargate into Step Functions

Steps Functions is a great way to orchestrate a serverless workflow. Long-running processes are often inevitable due to the execution time limit of Lambda functions, and they usually rely on an “always-on” infrastructure. How could you integrate these long-running processes into your workflow while still remain serverless?

People have already heard of, or used AWSStep Functions to coordinate cloud native tasks (i.e. Lambda functions) to handle part/all of their production workloads. In this post, I will walk through how to incorporate non-lambda tasks into a state machine in Step Functions.

At a high level, what exactly is Step Functions?

This is probably the first question that everyone new to Step Functions will ask. After a bit of a play around it, I conclude it as this:

It is simply an orchestration tool that lets you compose a workflow from your distributed, individually managed services.

So, Step Functions, unlike what the name may suggest, are not actual functions that you need to write in order to link your application’s components together. At its core is a state machine, or more precisely, a finite state machine, which is a way of illustrating your application’s flow by defining individual states and transitions between these states. In practice, a state is usually fated to perform a task. This is how AWS defines a task:

A task can be an activity or a Lambda function.

Well, we all know what a Lambda function is, right? So what on earth is an activity? Here we go:

Activities are an AWS Step Functions feature that enables you to have a task in your state machine where the work is performed by a worker that can be hosted on Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Container Service (Amazon ECS), mobile devices — basically anywhere.

Simply put, an activity is a symbol, which has to be realised by a worker that does the actual work. The worker is basically your individually managed service, which can be hosted anywhere as documented, as long as it can connect to Step Functions over HTTP. The worker will have to poll for the activity to check whether the state machine has reached this state. Once the activity in this state has been activated, the worker will start to work and then report back to that state for the state machine to carry on.

So, how does Fargate fit in all this?

At this point, people might ask, why couldn’t AWS just trigger an elsewhere-hosted task right from within the state machine, like it does Lambda functions? Good question. That’s one of the things in my wish list for AWS. I don’t have the answer, but I guess it’s due to the mobility/unknowns of the worker, which would make it really tricky to define your states.

As mentioned earlier on, a worker must poll for a task in a state before it can execute. This essentially means you have to host it somewhere to maintain a socket to Step Functions in order to kick off when asked to. If you are dedicated to AWS, chances are you have got your workers hosted on EC2 or in containers on ECS clusters. The main drawback of that is you may end up paying for resources when your worker is off-duty. In an attempt to save cost, many people end up having to wire Lambda functions before and after worker execution, thereby introducing more states whose sole job is to warm up/decommission underlying infrastructure.

That, was a pain until the release of Fargate. This, in my opinion, is a game changer to running many things in AWS. It’s basically a serverless ECS offering, with which you don’t pay for any resources outside your container’s lifecycle. Which also means your container stays off until something wakes it up.

How would you implement this in practice?

So if the container is off by default, how does it poll for a task in Step Functions?

Asked one of my colleagues. I sort of already answered this question just now. If you knew how to warm up an ECS cluster to run your ECS worker, you surely would have known how to wake up a “fargated” container by using AWS SDK API. So the way I implemented it was by introducing one extra state just before the state in which the ECS worker is specified. This extra state will run a task in a Lambda function, which runs the container. Once this Lambda function finishes, the state machine proceeds to the next state which is what the triggered container will be polling for. I have extracted this part from my production use case and simplified it into the following diagram.

The task in trigger-ecs state runs a Lambda function, whose job is invoking runTask API to wake up and run the ECS container as shown below

exports.handler = async (event) => {

var params = {  
    taskDefinition: 'step-functions-task',  
    cluster: 'ecs-test',  
    launchType: 'FARGATE',  
    networkConfiguration: {  
        awsvpcConfiguration: {  
            subnets: \['my-subnet'\],  
            assignPublicIp: 'ENABLED'  
        }  
    }  
}  
  
await ecs.runTask(params).promise()  
  
console.log('ECS has been triggered.')

};

One thing to note in above Lambda function is that assignPublicIp must be set to ENABLED for it to work, as the fargated container needs a public IP address to establish a HTTP connection to Step Functions for polling.

Upon the completion of this Lambda function, the state machine transitions to the next state execute-ecs. After the container has warmed up and started running, the worker will receive a task token from the execute-ecs state knowing it’s ready for action. The code below in NodeJS shows how a worker polls for a task and report back after it’s completed its job.

var AWS = require('aws-sdk')

var options = {'region': 'ap-southeast-2'}

var client = new AWS.StepFunctions(options);

var getStepFunctionActivity = () => {let params = {activityArn: 'arn:aws:states:ap-southeast-2:<account-id>:activity:run-ecs-task'}return client.getActivityTask(params).promise()}

var reportBack = (token) => {let params = {taskToken: token,output: JSON.stringify({result: 'ecs done'})}return client.sendTaskSuccess(params).promise()}

var executeTask = () => {return new Promise(resolve => {setTimeout(() => {resolve('job done')}, 10000)})}

var execute = async () => {while (true) {var task = await getStepFunctionActivity()if (task.taskToken) {await executeTask()await reportBack(task.taskToken)break}}}

execute()

Noticed how I simulated executeTask function to return in 10 seconds? This is where you’d put your real logic in. At the end of the execution, the code will report back to the state machine with a json string output. Below is the execution history of this whole state machine.

Notice there’s a roughly 10-second gap between 9 and 10? That was the execution of the Fargate container. ID 10 also shows the output from the container execution.

That’s it. Hope you’ve enjoyed this walk-through. I’d appreciate any feedback and be happy to be pointed where I’ve got anything wrong.