When a Lambda function invocation crashes due to an uncaught application error, for example, AWS may automatically retry the same request. This auto-retry happens when the function is invoked:

- Asynchronously, by any type of requester
- By an event source mapping, such as data streams from DynamoDB or Kinesis

Direct, synchronous invocations (using the Invoke API endpoint) will not trigger the auto-retry in case of an error. AWS services that call a Lambda function synchronously will also not trigger the auto-retry, but these services may have their own logic to retry the request regardless of Lambda's internal one.

What's wrong with the retry behavior anyway?

In some cases, it's simply pointless to retry. If the first request failed, it has zero chance of succeeding in subsequent attempts, and retrying would be a waste of resources. Even worse, retrying could lead to unwanted side-effects if our code is not idempotent (I recently published an article about this, if you'd like to dig deeper). You could end up with duplicate database entries or multiples of the same business transaction, for instance, which can be a serious issue.

How do we avoid it?

Quick observation: the retry behavior is not evil; we might want it in many cases. This article is for those cases in which we would not like our functions to retry.

Since AWS Lambda does not offer a way to disable the automatic retries, there is a simple-to-implement strategy to avoid them: handling errors globally.

You will also find demonstrations of how to use Step Functions to avoid auto-retries. Personally, I find that solution cumbersome and way too complex, adding overhead not only to the implementation but also to debugging and monitoring.

Global error handling

Each Lambda function has only one entry point, its handler function, which makes it easy to catch errors globally. You can either do it inside the event handler or use a middleware (such as Middy) to handle it.

The direct way would be something like this:

```python
import logging

from my_code import my_event_handler

logger = logging.getLogger()
logger.setLevel(logging.WARNING)


def lambda_handler(event, context):
    try:
        response = my_event_handler(event=event, context=context)
    except Exception as error:
        logger.exception(error)
        response = {
            'status': 500,
            'error': {
                'type': type(error).__name__,
                'description': str(error),
            },
        }
    finally:
        return response
```

We used Python in the example, but the same result can be achieved in any modern programming language. A cleaner way to implement this in Python would be with a function decorator, but not all languages offer such a feature, which is why we avoided it here (a sketch of the decorator approach follows at the end of this section).

What we are doing is delegating the handling of the event to another function, my_event_handler, while lambda_handler works as a (kind of) middleware to handle any exceptions. Since all possible exceptions are handled and we then respond with a valid JSON object, Lambda will not retry any request.

The logger.exception(error) line is crucial here. It makes sure the error is logged for later investigation and debugging. Omitting this part is very dangerous, because any and all errors will go silent; you will never even know that they happened. By logging the error, you will be able to view it on CloudWatch.

Third-party monitoring services like Dashbird, DataDog or Sentry should also allow you to be alerted by email, Slack or other channels whenever things go south, so you can promptly act on any issue. This is highly recommended for professional applications running in production.

Lastly, if the response is going out to third parties, you will probably want to omit the error details from the response object for security reasons.
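For reference, the decorator approach mentioned above could look roughly like the following. This is a minimal sketch rather than code from the original example; the decorator name catch_all_errors is an illustrative assumption, and the response shape simply mirrors the handler shown earlier.

```python
import functools
import logging

from my_code import my_event_handler  # same hypothetical module as above

logger = logging.getLogger()
logger.setLevel(logging.WARNING)


def catch_all_errors(handler):
    """Wrap a Lambda handler so that no exception ever escapes it."""
    @functools.wraps(handler)
    def wrapper(event, context):
        try:
            return handler(event, context)
        except Exception as error:
            # Log the full traceback so the failure is still visible in CloudWatch.
            logger.exception(error)
            return {
                'status': 500,
                'error': {
                    'type': type(error).__name__,
                    'description': str(error),
                },
            }
    return wrapper


@catch_all_errors
def lambda_handler(event, context):
    # Business logic lives here; any uncaught exception is absorbed by the
    # decorator, so Lambda sees a successful response and does not retry.
    return my_event_handler(event=event, context=context)
```

The advantage is that the retry-suppression logic can be reused across many handlers without repeating the try/except block in each of them.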
Deciding when to trigger auto-retry

The nice part of this implementation is that we can prepare our function to decide when we want retries to occur. We can declare our own custom exceptions with a signature that tells our Lambda function that a retry would come in handy in case they are raised. Here's how we would do it.

First, declare a parent exception that will act as the "retry is nice" signature:

```python
class WorthRetryingException(Exception):
    pass
```

Now, inheriting from the one above, declare the exceptions that will be raised by your application in case something goes wrong:

```python
class ExternalAPIFailed(WorthRetryingException):
    pass


class SomeTemporaryGlitch(WorthRetryingException):
    pass
```

In your application logic, you can raise your own custom exceptions this way:

```python
def sometimes_it_fails(*args, **kwargs):
    try:
        something()
    except SomeError as error:
        raise SomeTemporaryGlitch() from error
```

Then, in our lambda_handler, we change the error handling this way:

```python
    (...)
    except Exception as error:
        if isinstance(error, WorthRetryingException):
            raise error
        else:
            logger.exception(error)
            response = {
                'status': 500,
                'error': {
                    'type': type(error).__name__,
                    'description': str(error),
                },
            }
    (...)
```

It's a simple change that empowers our function to behave differently depending on the type of exception raised. When the error is worth retrying, lambda_handler will re-raise it, triggering the auto-retry feature within Lambda. Otherwise, it will only log the exception and respond in a nice way, suppressing retries.

How about other causes of failure?

Our Lambda functions can fail for causes other than application errors, among them timeouts, lack of memory, and early exits. These will not fall into the exception-handling logic above, so we need to do something else if we want to "catch" them.

Handling Timeouts

Make sure you are handling all of your IO-bound routines carefully. For an external HTTP request or a database connection, for example, add a timeout parameter to the request that is well below your function's timeout. This alone avoids some of the most common timeout causes.

If a CPU-bound process may take too long to finish and cause a Lambda timeout error, there are three things we can do to help it:

1. Increase the timeout (if it is not already set at the Lambda maximum);
2. Increase the memory configuration;
3. Monitor the remaining time and terminate execution before a timeout.

More memory allocated to your function may speed up the processing (read this article for more info about how to optimize CPU-intensive tasks on Lambda).

The remaining time is available in the context object, an argument provided to our lambda_handler along with the event payload; in Python functions, its structure is described in the AWS Lambda documentation.

What we could do is run a parallel thread that checks the value of context.get_remaining_time_in_millis() every few seconds. Before it reaches zero, this thread would raise an error (something like AlmostTimingOutError). This error should be handled by our lambda_handler, obviously, so that it does not trigger an auto-retry and properly logs the premature halt of the execution. A sketch of this idea follows below.
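Here is a minimal sketch of that idea, with one practical adjustment: an exception raised inside a background thread in Python would not propagate to the handler by itself, so this version runs the work in a worker thread and has the handler wait on it with a deadline derived from context.get_remaining_time_in_millis(). The names AlmostTimingOutError, run_with_deadline and SAFETY_MARGIN_MS are illustrative assumptions, not part of any library.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeoutError


class AlmostTimingOutError(Exception):
    """Raised when the function is about to hit the Lambda timeout."""


# Time (in ms) reserved for logging and building the error response.
SAFETY_MARGIN_MS = 2000


def run_with_deadline(func, event, context):
    """Run `func`, but give up shortly before the Lambda timeout is reached."""
    budget = (context.get_remaining_time_in_millis() - SAFETY_MARGIN_MS) / 1000.0
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, event=event, context=context)
    try:
        return future.result(timeout=budget)
    except FutureTimeoutError:
        # The worker keeps running in the background, but the handler regains
        # control early, logs the problem and answers gracefully instead of
        # letting Lambda kill the invocation and schedule a retry.
        raise AlmostTimingOutError('aborted before the Lambda timeout was hit')
    finally:
        # Do not block waiting for the (possibly still running) worker thread.
        executor.shutdown(wait=False)
```

In the lambda_handler shown earlier, you would call run_with_deadline(my_event_handler, event, context) instead of calling my_event_handler directly; since AlmostTimingOutError does not inherit from WorthRetryingException, the global handler will log it and suppress the retry.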
Handling memory errors and early exits

These two types of errors aren't actually possible to handle within our function. What we can do is try to avoid the errors themselves, or properly deal with an auto-retry in case one happens. To avoid memory errors, benchmark your function across different scenarios and allocate memory with a comfortable margin to minimize the chance it will run out of resources. If it happens anyway, there is a way to cope with the auto-retry that we will discuss shortly.

Early exits, on the other hand, are difficult. They can happen for a variety of reasons. It could be something related to your application, yes, but I'm referring to things you actually can't handle in the app code. I experienced this issue once when I was running a deep learning algorithm in Lambda that was compiled with a limitation for a specific type of CPU. It only manifested when Lambda launched a container on a server that didn't have this type of processor. It took me a few days to find and solve the issue. When this type of error happens, it's impossible to handle within our application code. Even adding a global error handler doesn't prevent the Lambda function from crashing, so it can't avoid an auto-retry.

Using Lambda Request ID

For those cases, there is a last-resort strategy we could use: identifying and halting invocations that are the result of an auto-retry. We can do this by using the request ID that comes, again, with the context object. In the Python version, it is provided in the aws_request_id attribute.

The disadvantage of this implementation is that it adds cost and latency to all of your Lambda executions. You will need to store all request IDs in an external database (I would recommend DynamoDB).

At the very beginning of each invocation, you check whether the request ID already exists in the database. If it does, this is a retry invocation, since Lambda always reuses the original request ID for all retry attempts. (A minimal sketch of this check is included at the end of the article.)

The latency added by DynamoDB would be around 10 to 50 milliseconds. If you use their caching service, DAX, latency could be lowered to single-digit milliseconds. DynamoDB is scalable, requires almost zero maintenance and is relatively cheap: $0.00065 per write and $0.00013 per read. In our case, we would consume one of each, totaling $0.00078 per Lambda execution, except when a retry occurs, in which case we consume only one read and no write.

Wrapping up

We have covered a few strategies and best practices to follow in order to keep your Lambda functions from triggering an auto-retry invocation. Nevertheless, the safest approach is always to make your functions idempotent, since that avoids any unwanted side-effects from running the same payload request multiple times.

Handling errors like a Pro

I thought it would be fair to state that this article is a product of our experience at Dashbird.io, a monitoring and debugging platform for serverless applications. If you or your company are using AWS Lambda and want to track application errors in a smart way, you should definitely check it out (it's free up to 1M requests and does not require a credit card).

Disclosure: I work as a Developer Advocate at Dashbird.io, promoting serverless best practices in general and the company's service.
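As a closing note, here is the minimal sketch referenced in the "Using Lambda Request ID" section above. It follows the read-then-write approach described there, using boto3; the table name lambda_request_ids and its request_id partition key are assumptions made for illustration, not part of any AWS default.

```python
import boto3

# Hypothetical DynamoDB table keyed by the Lambda request ID.
TABLE_NAME = 'lambda_request_ids'

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table(TABLE_NAME)


def is_retry(context):
    """Return True when this request ID has already been processed."""
    response = table.get_item(Key={'request_id': context.aws_request_id})
    if 'Item' in response:
        return True  # retried invocation: only the read is consumed
    # First attempt: record the request ID so future retries can be detected.
    table.put_item(Item={'request_id': context.aws_request_id})
    return False


def lambda_handler(event, context):
    if is_retry(context):
        # Halt early and answer successfully so Lambda stops retrying.
        return {'status': 200, 'detail': 'duplicate invocation ignored'}
    # ... normal processing goes here ...
```

In practice you would probably also enable a TTL on the table so that old request IDs expire on their own instead of accumulating indefinitely.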