getRemainingTimeInMillis() gives you a flexible way to time out requests on the client side based on the amount of invocation time left, rather than some arbitrary hardcoded value.
With API Gateway and Lambda, you’re forced to use relatively short timeouts on the server side.
However, as you have limited influence over a Lambda function’s cold start time and have no control over the amount of latency overhead API Gateway introduces, the actual client-facing latency you’d experience from a calling function is far less predictable.
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/api-gateway-metrics-dimensions.html
To prevent slow HTTP responses from causing the calling function to time out (and therefore impacting the user experience we offer), we should make sure we stop waiting for a response before the calling function times out.
“the goal of the timeout strategy is to give HTTP requests the best chance to succeed, provided that doing so does not cause the calling function itself to err” — me
Most of the time, I see folks use fixed timeout values (either hardcoded or specified via config), and the right value is often tricky to choose: set it too short and requests are not given the best chance to succeed; set it too long and a slow request can cause the calling function itself to time out.
This challenge of choosing the right timeout value is further complicated by the fact that we often perform more than one HTTP request during a function invocation. For example, reading from DynamoDB, talking to some internal API, then saving changes to DynamoDB adds up to 3 HTTP requests in one invocation.
Let’s look at two common approaches for picking timeout values and scenarios where they fall short of meeting our goal.
When the timeout is too short, requests are not given the best chance to succeed.
When the timeout is too long, requests are allowed too much time to execute and cause the function to time out.
Instead, we should set the request timeout based on the amount of invocation time left, whilst taking into account the time required to perform any recovery steps — e.g. return a meaningful error with application specific error code in the response body, or return a fallback result instead.
You can easily find out how much time is left in the current invocation through the context object your function is invoked with.
https://docs.aws.amazon.com/lambda/latest/dg/nodejs-prog-model-context.html
For example, if a function’s timeout is 6s, but by the time you make the HTTP request you’re already 1s into the invocation (perhaps you had to do some expensive computation first), and you reserve 500ms for recovery, then that leaves 4.5s to wait for the HTTP response.
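The arithmetic above can be sketched as a small helper. This is a minimal illustration rather than a definitive implementation: requestTimeoutMs and the 500ms default reserve are hypothetical names and values, and the fake context object only stands in for the real Lambda context so the snippet runs outside Lambda.

```javascript
// Hypothetical helper: how long an outbound request may take in this invocation.
// `context` is the Lambda context object; `reserveMs` is time held back for recovery.
function requestTimeoutMs(context, reserveMs = 500) {
  const remaining = context.getRemainingTimeInMillis();
  // Never return a negative timeout if we are already inside the reserve.
  return Math.max(0, remaining - reserveMs);
}

// Simulate the example from the text: 6s function timeout, 1s already used.
const fakeContext = { getRemainingTimeInMillis: () => 5000 };
console.log(requestTimeoutMs(fakeContext)); // 4500
```

Because the value is computed at the moment of each call, the second and third requests in an invocation automatically get a smaller budget than the first.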
With this approach, we get the best of both worlds:
requests are given the best chance to succeed, without being restricted by an arbitrarily determined timeout.
slow responses are timed out before they cause the calling function to time out
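One way to apply such a budget is to race the request against a timer. The sketch below assumes the request is any function returning a promise; TimeoutError, withTimeout and callWithBudget are hypothetical names, not part of the Lambda runtime or any HTTP client.

```javascript
// Hypothetical error type so callers can tell a timeout from other failures.
class TimeoutError extends Error {}

// Reject if `promise` has not settled within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new TimeoutError(`timed out after ${ms}ms`)), ms);
  });
  // Whichever settles first wins; clear the timer so it does not linger.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Give the request whatever time is left, minus a reserve for recovery steps.
async function callWithBudget(context, makeRequest, reserveMs = 500) {
  const budget = Math.max(0, context.getRemainingTimeInMillis() - reserveMs);
  return withTimeout(makeRequest(), budget);
}
```

Note that racing only stops the caller from waiting; it does not cancel the underlying request. If your HTTP client accepts a timeout or abort signal, passing the same budget to it is likely the cleaner option.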
But what are you going to do AFTER you time out these requests? Aren’t you still going to have to respond with an HTTP error since you couldn’t finish whatever operations you needed to perform?
At the minimum, the recovery actions should include:
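Track a metric such as serviceX.timedout so it can be monitored and the team can be alerted if the situation escalates.
Return an application-specific error code in the response body, for example:
{"errorCode": 10021, "requestId": "f19a7dca", "message": "service X timed out"}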
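As a rough sketch of these recovery steps, the function below logs the incident and builds an error response in the statusCode/body shape used by API Gateway’s Lambda proxy integration. The function name, the 504 status and the log format are assumptions made for illustration; turning the log line into an actual metric is left out.

```javascript
// Hypothetical recovery step: record the timeout and build an error response.
// The 504 status, log shape and parameter names are illustrative assumptions.
function onServiceTimeout(serviceName, requestId, budgetMs, log = console.log) {
  // Log with enough context to investigate later; the event name doubles
  // as the metric name a monitoring system could count.
  log(JSON.stringify({ level: "error", event: `${serviceName}.timedout`, requestId, budgetMs }));
  return {
    statusCode: 504,
    body: JSON.stringify({
      errorCode: 10021, // application-specific code, taken from the example above
      requestId,
      message: `${serviceName} timed out`,
    }),
  };
}

const response = onServiceTimeout("serviceX", "f19a7dca", 4500);
console.log(response.body);
```

The client can then branch on errorCode to show a meaningful message instead of a generic failure.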
In some cases, you can also recover even more gracefully using fallbacks.
Netflix’s Hystrix library, for instance, supports several flavours of fallbacks via the Command pattern it employs so heavily. In fact, if you haven’t read its wiki page already then I strongly recommend that you give it a thorough read; there are plenty of useful ideas and information there.
At the very least, every command lets you specify a fallback action.
You can also chain the fallbacks together by chaining commands via their respective getFallback methods.
For example:
CommandA executes the original request.
In CommandA’s getFallback method, execute CommandB, which would return a previously cached response if available.
If there is no cached response, CommandB would fail and trigger its own getFallback method.
That method executes CommandC, which returns a stubbed response.
Anyway, check out Hystrix if you haven’t already. Most of the patterns that are baked into Hystrix can be easily adopted in our serverless applications to help make them more resilient to failures, which is something I’m actively exploring in a separate series on applying principles of chaos engineering to serverless.
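The chained-fallback idea can be approximated in a Lambda function without Hystrix itself. The sketch below is not Hystrix’s API; withFallbacks and the three step functions are hypothetical stand-ins for CommandA, CommandB and CommandC.

```javascript
// Try each step in order; move to the next only when the current one fails.
async function withFallbacks(steps) {
  let lastError = new Error("no steps provided");
  for (const step of steps) {
    try {
      return await step();
    } catch (err) {
      lastError = err; // remember why this step failed, then try the next one
    }
  }
  throw lastError; // every step failed
}

// Hypothetical steps: a primary call that times out, a cache lookup, and a stub.
const cache = new Map();
const primary = async () => { throw new Error("service X timed out"); };
const cached = async () => {
  if (!cache.has("key")) throw new Error("no cached response");
  return cache.get("key");
};
const stubbed = async () => ({ items: [], stale: true });

withFallbacks([primary, cached, stubbed]).then((result) => console.log(result));
```

With an empty cache the stubbed response is returned; once the cache holds an entry for "key", the second step succeeds and the stub is never reached.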
As an aside, as Danilo mentioned in the comments, you can also use context.getRemainingTimeInMillis() to decide when to do more work vs recurse when writing recursive functions. You can read more about that as well as other tips for writing recursive Lambda functions in this post.
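A minimal sketch of that work-or-recurse decision follows. processItem and recurse are hypothetical callbacks supplied by the caller (recurse would typically re-invoke the function with the unprocessed items), and the 10s threshold is an arbitrary assumption.

```javascript
// Process items until the invocation is close to timing out, then hand off the rest.
async function processBatch(items, context, processItem, recurse, minRemainingMs = 10000) {
  const queue = [...items];
  while (queue.length > 0) {
    if (context.getRemainingTimeInMillis() < minRemainingMs) {
      await recurse(queue); // not enough time left: pass on the unprocessed items
      return { done: false, remaining: queue.length };
    }
    await processItem(queue.shift());
  }
  return { done: true, remaining: 0 };
}
```

The threshold should comfortably exceed the time one item takes to process, otherwise the function can still time out mid-item.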
Hi, my name is Yan Cui. I’m an AWS Serverless Hero and the author of Production-Ready Serverless. I have run production workloads at scale in AWS for nearly 10 years and I have been an architect or principal engineer in a variety of industries ranging from banking, e-commerce and sports streaming to mobile gaming. I currently work as an independent consultant focused on AWS and serverless.
You can contact me via Email, Twitter and LinkedIn.