This is part 2 of a multipart series that explores ideas on how we could apply the principles of chaos engineering to serverless architectures built around Lambda functions.
The most common issue I have encountered in production are latency/performance related. They can be symptomatic of a host of underlying causes ranging from AWS network issues (which can also manifest itself in latency/error-rate spikes in any of the AWS services), overloaded servers, to simple GC pauses.
Latency spikes are inevitable — as much as you can improve the performance of your application, things will go wrong, eventually, and often they’re out of your control.
So you must design for them, and degrade the quality of your application gracefully to minimize the impact on your users.
In the case of API Gateway and Lambda, there are additional considerations:
If you use moderate timeout settings for your API functions (and you should!) then you need to consider the effects of cold starts when calling an intermediate service.
Suppose our client application communicates directly with 2 public facing APIs, whom in turn depends on an internal API.
In this setup, I can think of 3 places where we can inject latency and each would validate a different hypothesis.
The first, and easiest place to inject latency is in the HTTP client library we use to communicate with the internal API.
This will test that our function has appropriate timeout on this HTTP communication and can degrade gracefully when this request time out.
We can inject latency to the HTTP client libraries for our internal APIs, hence validating that the caller function has configured appropriate timeout and error handling for timeouts.
Furthermore, this practice should also be applied to other 3rd party services we depend on, such as DynamoDB. We will discuss how we can inject latency to these 3rd party libraries later in the post.
We can also inject latency to 3rd party client libraries for other managed services we depend on.
This is a reasonably safe place to inject latency as the immediate blast radius is limited to this function.
However, you can (and arguably, should) consider applying this type of latency injection to intermediate services as well. Doing so does carry extra risk as it has a broader blast radius in failures cases — ie. if the function under test does not degrade gracefully then it can cause unintended problems to outer services. In this case, the blast radius for these failure cases is the same as if you’re injecting latency to the intermediate functions directly.
You can also inject latency directly to the functions themselves (we’ll look at how later on). This has the same effect as injecting latency to the HTTP client to each of its dependents, except it’ll affect all its dependents at once.
We can inject latency to a function’s invocation. If that function is behind an internal API that are used by multiple public-facing APIs then it can cause all its dependents to experience timeouts.
This might seem risky (it can be), but is an effective way to validate that every service that depends on this API endpoint is expecting, and handling timeouts gracefully.
It makes most sense when applied to intermediate APIs that are part of a bounded context (or, a microservice), maintained by the same team of developers. That way, you avoid unleashing chaos upon unsuspecting developers who might not be ready to deal with the chaos.
That said, I think there is a good counter-argument for doing just that.
We often fall into the pitfall of using the performance characteristics of dev environments as predictor for the production environment. Whilst we seldom experience load-related latency problems in the dev environments — because we don’t have enough load in those environments to begin with — production is quite another story. Which means, we’re not programmed to think about these failure modes during development.
So, a good way to hack the brains of your fellow developers and to programme them to expect timeouts is to expose them to these failure modes regularly in the dev environments, by injecting latency to our internal APIs in those environments.
In fact, if we make our dev environments exhibit the most hostile and turbulent conditions that our systems should be expected to handle, then we know for sure that any system that makes its way to production are ready to face what awaits it in the wild.
We can figuratively hold up a sign and tell other developers to expect latency spikes and timeouts by literally exposing them to these scenarios in dev environments, regularly, so they know to expect it.
So far, we have focused on validating the handling of latency spikes and timeouts in our APIs. The same validation is needed for our client application.
We can apply all the same arguments mentioned above here. By injecting latency to our public-facing API functions (in both production as well as dev environments), we can:
When I was working on a MMORPG at Gamesys years ago, we uncovered a host of frailties in the game when we injected latency spikes and faults to our APIs. The game would crash during startup if any of the first handful of requests fails. In some cases, if the response time was longer than a few seconds then the game would also get into a weird state because of race conditions.
Turns out I was setting my colleagues up for failure in production because the dev environment was so forgiving and gave them a false sense of comfort.
With that, let’s talk about how we can apply the practice of latency injection.
But wait, can’t you inject latency in the client-side HTTP clients too?
Absolutely! And you should! However, for the purpose of this post we are going to look at how and where we can inject latency to our Lambda functions only, hence why I have willfully ignored this part of the equation.
There are 2 aspects to actually injecting latencies:
If you read my previous posts on capturing and forwarding correlation IDs and managing configurations with SSM Parameter Store, then you have already seen the basic building blocks we need to do both.
Since you are unlikely to write a HTTP client from scratch, so I consider the problem for injecting latency to HTTP client and 3rd party clients (such as the AWS SDK) to be one and the same.
A couple of solutions jump to mind:
promisifyAll
function is a good example)Since I’m going to use Node.js as example, I’m going to focus on wrappers.
For the HTTP client, given the relatively small number of methods you will need, it’s feasible to craft the wrapper by hand, especially if you have a particular API design in mind.
Using the HTTP client I created for the correlation ID post as base, I modified it to accept a configuration object to control the latency injection behaviour.
{"isEnabled": true,"probability": 0.5,"minDelay": 100,"maxDelay": 5000}
You can find this modified HTTP client here, below is a simplified version of this client (which uses superagent
under the hood).
To configure the function and the latency injection behaviour, we can use the configClient
I first created in the SSM Parameter Store post.
First, let’s create the configs in the SSM Parameter Store.
You can create and optionally encrypt parameter values in the SSM Parameter Store.
The configs contains the URL for the internal API, as well as a chaosConfig
object. For now, we just have a httpClientLatencyInjectionConfig
property, which is used to control the HTTP client’s latency injection behaviour.
{"internalApi": "https://xx.amazonaws.com/dev/internal","chaosConfig": {"httpClientLatencyInjectionConfig": {"isEnabled": true,"probability": 0.5,"minDelay": 100,"maxDelay": 5000}}}
Using the aforementioned configClient
, we can fetch the JSON config from SSM Parameter Store at runtime.
const configKey = "public-api-a.config";const configObj = configClient.loadConfigs([ configKey ]);
let config = JSON.parse(yield configObj["public-api-a.config"]);let internalApiUrl = config.internalApi;let chaosConfig = config.chaosConfig || {};let injectionConfig = chaosConfig.httpClientLatencyInjectionConfig;
let reply = yield http({method : 'GET',uri : internalApiUrl,latencyInjectionConfig: injectionConfig});
The above configuration gives us a 50% chance of injecting a latency between 100ms and 5sec when we make the HTTP request to internal-api
.
This is reflected in the following X-Ray traces.
From this X-Ray trace, you can see that 287ms of latency was injected to the HTTP request.
In this trace, you see the amount of latency injected was a hefty 3.4 sec!
From this X-Ray trace, you can see no latency was injected to the HTTP request, so the overall invocation time was also very small by comparison.
With the AWS SDK, it’s not feasible to craft the wrapper by hand. Instead, we could do with a factory function like bluebird’s promisifyAll
.
We can apply the same approach here, and I made a crude attempt at doing just that. I must add that, whilst I consider myself a competent Node.js/Javascript programmer, I’m sure there’s a better way to implement this factory function.
My factory function will only work with promisified objects (told you it’s crude..), and replaces their xxxAsync
functions with a wrapper that takes in one more argument of the shape:
{"isEnabled": true,"probability": 0.5,"minDelay": 100,"maxDelay": 3000}
Again, it’s clumsy, but we can take the DocumentClient
from the AWS SDK, promisify it with bluebird, then wrap the promisified object with our own wrapper factory. Then, we can call its async functions with an optional argument to control the latency injection behaviour.
You can see this in action in the handler function for public-api-b
.
For some reason, the wrapped function is not able to record subsegments in X-Ray. I suspect it’s some nuance about Javascript or the X-Ray SDK that I do not fully understand.
Nonetheless, judging from the logs, I can confirm that the wrapped function does indeed inject latency to the getAsync
call to DynamoDB.
If you know of a way to improve the factory function, or to get the X-Ray tracing work with the wrapped function, please let me know in the comments.
The apiHandler
factory function I created in the correlation ID post is a good place to apply common implementation patterns that we want from our API functions, including:
// this is how you use the apiHandler factory function to create a// handler function for API Gateway event sourcemodule.exports.handler = apiHandler(co.wrap(function* (event, context) {... // do bunch of stuff
// instead of invoking the callback directly, you return the
// response you want to send, and the wrapped handler function
// would handle the serialization and invoking callback for you
// also, it takes care of other things for you, like logging
// the event source, and logging unhandled exceptions, etc.
return { message : "everything is awesome" };}));
In this case, it’s also a good place for us to inject latency to the API function.
However, to do that, we need to access the configuration for the function. Time to lift the responsibility for fetching configurations into the apiHandler
factory then!
The full apiHandler
factory function can be found here, below is a simplified version that illustrates the point.
Now, we can write our API function like the following.
Now that the apiHandler
has access to the config for the function, it can access the chaosConfig
object too.
Let’s extend the definition for the chaosConfig
object to add a functionLatencyInjectionConfig
property.
"chaosConfig": {"functionLatencyInjectionConfig": {"isEnabled": true,"probability": 0.5,"minDelay": 100,"maxDelay": 5000},"httpClientLatencyInjectionConfig": {"isEnabled": true,"probability": 0.5,"minDelay": 100,"maxDelay": 5000}}
With this additional configuration, we can modify the apiHandler
factory function to use it to inject latency to a function’s invocation much like what we did in the HTTP client.
Just like that, we can now inject latency to function invocations via configuration. This will work for any API function that is created using the apiHandler
factory.
With this change and both kinds of latency injections enabled, I can observe all the expected scenarios through X-Ray:
I can get further confirmation of the expected behaviour through logs, and the metadata recorded in the X-Ray traces.
From the logs, you can see how much latency was injected.
The amount of injected latency was also recorded as metadata in the X-Ray trace as well.
In this post we discussed:
The approach we have discussed here is driven by configuration, and the configuration is refreshed every 3 mins by default.
We can go much further with this.
The configurations can be more fine grained, and allow you to control latency injection to specific resources.
For example, instead of a blanket httpClientLatencyInjectionConfig
for all HTTP requests (including those requests to AWS services), the configuration can be specific to an API, or a DynamoDB table.
The configurations can be changed by an automated process to:
Again, we can look to Netflix for inspiration for such an automated platform.
Usually, you would want to enable one latency injection in a bounded context at a time. This helps contain the blast radius of unintended damages, and make sure your experiments are actually controlled. Also, when latency is injected at several places, it is harder to understand the causality we observe as there are multiple variables to consider.
Unless, of course, you’re validating against specific hypothesis such as:
The system can tolerate outage to both the primary store (DynamoDB) as well as the backup store (S3) for user preferences, and would return a hardcoded default value in that case.
Another good thing to do, is to inform the caller of the fact that latency has been added to the invocation by design.
This might take the form of a HTTP header in the response to tell the caller how much latency was injected in total. If you’re using an automated process to generate these experiments, then you should also include the id/tag/name for the specific instance of the experiment as HTTP header as well.
As I mentioned in the previous post, you need to apply common sense when deciding when and where you apply chaos engineering practices.
Don’t attempt an exercises that you know is beyond your abilities.
Before you even consider applying latency injection to your APIs in production, you need to think about how you can deal with these latency spikes given the inherent constraints of API Gateway and Lambda.
Unfortunately, we have run out of time to talk about this in this post, but come back in 2 weeks and we will talk about several strategies you can employ in part 3.
The code for the demo in this post is available on github here. Feel free to play around with it and let me know if you have any suggestions for improvement!
Hi, my name is Yan Cui. I’m an AWS Serverless Hero and the author of Production-Ready Serverless. I have run production workload at scale in AWS for nearly 10 years and I have been an architect or principal engineer with a variety of industries ranging from banking, e-commerce, sports streaming to mobile gaming. I currently work as an independent consultant focused on AWS and serverless.
You can contact me via Email, Twitter and LinkedIn.
Check out my new course, Complete Guide to AWS Step Functions.
In this course, we’ll cover everything you need to know to use AWS Step Functions service effectively. Including basic concepts, HTTP and event triggers, activities, design patterns and best practices.
Get your copy here.
Come learn about operational BEST PRACTICES for AWS Lambda: CI/CD, testing & debugging functions locally, logging, monitoring, distributed tracing, canary deployments, config management, authentication & authorization, VPC, security, error handling, and more.
You can also get 40% off the face price with the code ytcui.
Get your copy here.