Renato Byrro

Python Developer, Serverless enthusiast and Developer Advocate

Simple Steps to Avoid the Retry Behavior from AWS Lambda

When a Lambda function invocation crashes due to an uncaught application error, for example, AWS may automatically retry the same request.
This auto-retry happens when the function is invoked:
Asynchronously by any type of requester
By an event source mapping, such as data streams from DynamoDB or Kinesis
Invocations that are direct (using the Invoke API endpoint) and synchronous will not trigger the auto-retry in case of an error.
AWS Services that call a Lambda function synchronously will also not trigger the auto-retry, but these services may have their own logic to retry the request regardless of Lambda’s internal one.

What’s wrong with the retry behavior anyway?

In some cases, it’s just pointless to retry. If the first request failed, it has zero chance of succeeding in subsequent attempts. Retrying would be a waste of resources here.
Even worse, retrying could also lead to unwanted side-effects if our code is not idempotent (I recently published an article about this, if you’d like to dig deeper). You could end up with duplicate database entries or multiples of the same business transactions, for instance, which can be a serious issue.

How do we avoid it?

Quick observation: the retry behavior is not evil, and we might want it in many cases. This article is for the cases in which we would not like our functions to retry.
At the time of writing, AWS Lambda does not offer a way to disable automatic retries, but there is a simple-to-implement strategy to avoid them: handling errors globally.
You will find demonstrations on how to use Step Functions to avoid auto-retries. Personally, I find this solution cumbersome and way too complex, adding overhead not only to the implementation but also to debugging and monitoring.

Global error handling

Each Lambda function has only one entry point, its handler function, which makes it easier to catch errors globally.
You could either do it inside the event handler or use a middleware (such as Middy) to handle it.
The direct way would be something like this:
import logging
from my_code import my_event_handler

logger = logging.getLogger()
logger.setLevel(logging.WARNING)

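def lambda_handler(event, context):
    try:
        # Delegate the actual work to the application code
        response = my_event_handler(event=event, context=context)

    except Exception as error:
        # Log the full traceback so that the failure is visible in CloudWatch
        logger.exception(error)
        # Respond with a valid object so that Lambda sees a successful invocation
        response = {
            'status': 500,
            'error': {
                'type': type(error).__name__,
                'description': str(error),
            },
        }

    # Return after the try/except rather than from a `finally` block:
    # a `return` inside `finally` discards any exception still propagating
    return response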
We used Python in the example but the same result can be achieved in any modern programming language.
A cleaner way to implement this in Python would be by using a function decorator, but not all languages will offer such a feature and that’s why we avoided it in the example.
What we are doing is delegating the handling of the event to another function, `my_event_handler`, while `lambda_handler` works as a (kind of) middleware to handle any exceptions. Since all possible exceptions are handled and we then respond with a valid JSON object, Lambda will not retry any request.
The line `logger.exception(error)` is crucial here. It makes sure the error is logged for later investigation and debugging. Omitting this part is very dangerous because any and all errors will go silent: you will never even know that they happened. By logging the error, you will be able to view it on CloudWatch.
Third-party monitoring services like Dashbird, DataDog or Sentry should also allow you to be alerted by email, Slack or other channels whenever things go south, so you can promptly act on any issue. This is highly recommended for professional applications running in production.
Lastly, in case the response is going out to third parties, you will probably want to omit the error details from the response object for security reasons.
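The decorator variant mentioned earlier could look like the sketch below. The decorator name `suppress_retries`, its `expose_details` flag and the sample handler are made up for illustration. The flag reflects the advice to leave error details out of responses that reach third parties.

```python
import functools
import logging

logger = logging.getLogger()
logger.setLevel(logging.WARNING)


def suppress_retries(expose_details=False):
    """Hypothetical decorator factory: catch and log every error in a handler."""
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(event, context):
            try:
                return handler(event, context)
            except Exception as error:
                # Never skip the logging, or the error goes silent
                logger.exception(error)
                body = {'type': type(error).__name__}
                if expose_details:
                    body['description'] = str(error)
                return {'status': 500, 'error': body}
        return wrapper
    return decorator


@suppress_retries(expose_details=False)
def lambda_handler(event, context):
    # Illustrative business logic: fails with KeyError when 'name' is missing
    return {'status': 200, 'greeting': 'Hello, ' + event['name']}
```

With this in place, the handler body holds only business logic, and the same decorator can be reused across every function in a project.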

Deciding when to trigger auto-retry

The nice part of this implementation is that we can prepare our function to decide when we want retries to occur or not.
We can declare our own custom exceptions that share a common parent class. That parent class works as a signature telling our Lambda function that a retry would come in handy whenever one of them is raised.
Here’s how we would do it:
First, declare a parent exception that will act as the “retry is nice” signature:
class WorthRetryingException(Exception):
    pass
Now, inheriting from the one above, declare exceptions that will be raised by your application in case something goes wrong:
class ExternalAPIFailed(WorthRetryingException):
    pass

class SomeTemporaryGlitch(WorthRetryingException):
    pass
In your application logic, you can raise your own custom exceptions this way:
def sometimes_it_fails(*args, **kwargs):
    try:
        something()

    except SomeError as error:
        raise SomeTemporaryGlitch() from error
Then, in our `lambda_handler`, we should change the error handling this way:
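(...)

except Exception as error:
    if isinstance(error, WorthRetryingException):
        # Let the error propagate: Lambda sees a failed invocation and retries.
        # A bare `raise` keeps the original traceback. For this to work, the
        # handler must not `return` from a `finally` block.
        raise

    else:
        logger.exception(error)
        response = {
            'status': 500,
            'error': {
                'type': type(error).__name__,
                'description': str(error),
            },
        }

(...)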
It’s a simple change that empowers our function to behave differently depending on the type of exception raised.
When the error is worth retrying, `lambda_handler` will raise it, triggering the auto-retry feature within Lambda. Otherwise, it will only log the exception and respond in a nice way, suppressing retries.
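Putting the pieces together, a complete handler might look like the sketch below. The body of `my_event_handler` and its event fields (`api_down`, `order_id`) are hypothetical stand-ins for real application code. The sketch uses two separate `except` clauses, which behave the same as the `isinstance` check but read a little flatter.

```python
import logging

logger = logging.getLogger()
logger.setLevel(logging.WARNING)


class WorthRetryingException(Exception):
    pass


class ExternalAPIFailed(WorthRetryingException):
    pass


def my_event_handler(event, context):
    # Hypothetical stand-in for the application code
    if event.get('api_down'):
        raise ExternalAPIFailed('upstream service unavailable')
    if 'order_id' not in event:
        raise ValueError('order_id is required')
    return {'status': 200, 'order_id': event['order_id']}


def lambda_handler(event, context):
    try:
        return my_event_handler(event=event, context=context)

    except WorthRetryingException:
        # Log, then propagate: the invocation fails and Lambda may retry it
        logger.exception('Retryable error')
        raise

    except Exception as error:
        # Not worth retrying: log and answer normally to suppress the retry
        logger.exception(error)
        return {
            'status': 500,
            'error': {'type': type(error).__name__, 'description': str(error)},
        }
```

Here a missing `order_id` produces a logged 500 response with no retry, while an `ExternalAPIFailed` escapes the handler and leaves the retry decision to Lambda.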

How about other causes of failure?

Our Lambda functions can fail from causes other than application errors. Among those are timeouts, lack of memory and early exits.
They will not fall into the exception handling logic above, so we need to do something else in case we want to “catch” them.

Handling Timeouts

Make sure all IO-bound routines are handled well. For an external HTTP request or a database connection, for example, add a timeout parameter to the request that is comfortably lower than your function timeout. This alone avoids some of the most common causes of timeouts.
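As a rough illustration, the request timeout can be derived from the time the invocation has left. The helper names `io_timeout_seconds` and `fetch`, the one-second safety margin and the ten-second ceiling are all assumptions chosen for this sketch. Only `urllib.request.urlopen` and `context.get_remaining_time_in_millis()` are real APIs.

```python
import urllib.request


def io_timeout_seconds(context, safety_margin=1.0, ceiling=10.0):
    """Pick a request timeout that stays below the time Lambda has left."""
    remaining = context.get_remaining_time_in_millis() / 1000
    # Never wait longer than `ceiling`, and never pass a non-positive timeout
    return max(0.1, min(ceiling, remaining - safety_margin))


def fetch(url, context):
    # urlopen raises an exception once the timeout expires, which the global
    # error handler can log instead of letting the whole function time out
    timeout = io_timeout_seconds(context)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()
```

The same idea applies to database drivers and HTTP client libraries, most of which accept a comparable timeout argument.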
If a CPU-bound process may take too long to finish and cause a Lambda timeout error, there are three things we can do to mitigate it:
  1. Increase the timeout (if not already set at the Lambda maximum);
  2. Increase memory configuration;
  3. Monitor the remaining time and terminate execution before a timeout.
More memory allocated to your function may speed up the processing (read this article for more info about how to optimize CPU-intensive tasks on Lambda).
The remaining time is available in the `context` object, an argument provided to our `lambda_handler` along with the `event` payload. In Python functions, for example, the `context` object looks like this.
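What we could do is run a parallel thread that checks the value of `context.get_remaining_time_in_millis()` every few seconds. Before it reaches zero, this thread would raise an error (something like `AlmostTimingOutError`). Bear in mind that, in Python, an exception raised inside a secondary thread does not propagate to the thread running the handler, so the timing logic has to make the handler's own thread raise it.
This error should be handled by our `lambda_handler`, obviously, so that it does not trigger an auto-retry and the premature halt of the execution is properly logged.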
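A minimal sketch of this idea follows, with one variation. Instead of a watcher thread polling the clock, the work itself runs in a worker thread and the handler waits for it with a deadline, so that `AlmostTimingOutError` is raised in the handler's own thread. The names `run_with_deadline` and `my_cpu_bound_task`, as well as the one-second safety margin, are assumptions made for illustration.

```python
import concurrent.futures
import logging

logger = logging.getLogger()
logger.setLevel(logging.WARNING)


class AlmostTimingOutError(Exception):
    pass


def run_with_deadline(work, context, safety_margin=1.0):
    """Run work() in a worker thread and give up shortly before Lambda would."""
    budget = context.get_remaining_time_in_millis() / 1000 - safety_margin
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(work)
    # Wait in the handler's thread, so the error is raised where it can be caught
    done, not_done = concurrent.futures.wait([future], timeout=max(budget, 0))
    # Do not block on the worker: a thread cannot be killed, only abandoned
    executor.shutdown(wait=False)
    if not_done:
        raise AlmostTimingOutError('work did not finish within the time budget')
    return future.result()


def my_cpu_bound_task(event):
    # Hypothetical stand-in for the slow computation
    return sum(range(event.get('n', 1000)))


def lambda_handler(event, context):
    try:
        result = run_with_deadline(lambda: my_cpu_bound_task(event), context)
        return {'status': 200, 'result': result}
    except AlmostTimingOutError as error:
        # Log the premature halt and respond normally, so Lambda does not retry
        logger.exception(error)
        return {'status': 500, 'error': {'type': type(error).__name__}}
```

Note the trade-off: the abandoned worker thread is not terminated, so this approach fits work that can safely be left unfinished.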

Handling memory errors and early exits

These two types of errors aren’t actually possible to handle within our function. What we can do is try to avoid the errors themselves or properly deal with an auto-retry in case it happens.
To avoid memory errors, benchmark your function across different scenarios and allocate memory with a comfortable margin to minimize the chances that it will run out of resources. In case it happens anyway, there’s a way to cope with the auto-retry, which we’ll discuss shortly.
Early exits, on the other hand, are difficult. They can happen for a variety of reasons. It could be something related to your application, yes, but I’m referring to things you actually can’t handle in the app code.
I experienced this issue once when I was running a deep learning algorithm in Lambda that was compiled with a limitation for a specific type of CPU. It only manifested when Lambda would launch a container in a server that didn’t have this type of processor. It took me a few days to find out and solve the issue.
When this type of error happens, it’s impossible to handle within our application code. Even adding a global error handler doesn’t help to prevent the Lambda function from crashing and avoid an auto-retry.

Using Lambda Request ID

For those cases, there’s a last resort strategy we could use, which is identifying and halting invocations that are the result of an auto-retry.
We can do this by using the request ID that comes, again, with the context object. In the Python version, this is provided in the `aws_request_id` attribute.
The disadvantage of this implementation is that it adds cost and latency to all of your Lambda executions. You will need to store all request IDs in an external database (I would recommend DynamoDB).
At the very beginning of each invocation, you check whether the request ID already exists in the DB. If it does, this is a retry invocation, since Lambda will always use the same original request ID for all retry attempts, and you can halt right away. If it does not, store the ID and proceed as usual.
The latency added by DynamoDB would be around 10 to 50 milliseconds. If you use its cache service, DAX, latency could be lowered to single-digit milliseconds. DynamoDB is scalable, requires almost zero maintenance and is relatively cheap. Note that the list prices of $0.00065 per write and $0.00013 per read apply per provisioned capacity unit per hour (and vary by region), not per request. In our case, each Lambda execution would consume one write and one read, except when a retry occurs, when we would consume only one read and no write.
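The check described above can be sketched as follows. The helper names `is_retry` and `skip_retries`, the `request_id` key, the `expires_at` attribute and the table name in the comments are assumptions. `table` is expected to behave like a boto3 DynamoDB `Table` resource, of which only `get_item` and `put_item` are used.

```python
import time


def is_retry(table, request_id, ttl_seconds=86400):
    """Return True if this request ID was seen before, otherwise record it."""
    found = table.get_item(Key={'request_id': request_id})
    if 'Item' in found:
        return True  # one read, no write
    # `expires_at` only has an effect if TTL is enabled on that attribute
    table.put_item(Item={
        'request_id': request_id,
        'expires_at': int(time.time()) + ttl_seconds,
    })
    return False


def skip_retries(table, handler):
    """Wrap a handler so that retried invocations return early."""
    def wrapper(event, context):
        if is_retry(table, context.aws_request_id):
            return {'status': 'skipped', 'reason': 'auto-retry detected'}
        return handler(event, context)
    return wrapper


# In a real function (assumes boto3 and a table keyed by `request_id`):
#   table = boto3.resource('dynamodb').Table('lambda_request_ids')
#   lambda_handler = skip_retries(table, my_event_handler)
```

An alternative worth considering is a single `put_item` with `ConditionExpression='attribute_not_exists(request_id)'`, which folds the read and the write into one call at the cost of handling the conditional check failure.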

Wrapping up

We have covered a few strategies and best practices to prevent your Lambda functions from triggering an auto-retry invocation. Nevertheless, the safest approach is always to make your functions idempotent, since this would avoid any unwanted side-effects from running the same payload request multiple times.

Handling errors like a Pro

I thought it would be fair to state that this article was a product of our experience at Dashbird.io, a monitoring and debugging platform for serverless applications. If you or your company are using AWS Lambda and want to track application errors in a smart way, you should definitely check it out (it’s free up to 1M requests and does not require credit card).
Disclosure: I work as a Developer Advocate at Dashbird.io promoting serverless best-practices in general and the company service.


July 25th, 2019
