We can apply latency injection to APIs created with API Gateway and AWS Lambda. Our approach should allow us to configure when to add arbitrary delay (and how much) to our API endpoints to ensure inter-service communications are tuned with proper timeout values. This is part 2 of a multipart series that explores ideas on how we could apply the principles of chaos engineering to serverless architectures built around Lambda functions. part 1: how can we apply principles of chaos engineering to Lambda? part 2: applying latency injection for APIs part 3: dealing with latency spikes and timeouts (coming?) part 4: applying fault injection for Lambda functions (coming ?) The most common issue I have encountered in production are latency/performance related. They can be symptomatic of a host of underlying causes ranging from AWS network issues (which can also manifest itself in latency/error-rate spikes in any of the AWS services), overloaded servers, to simple GC pauses. — as much as you can improve the performance of your application, things will go wrong, eventually, and often they’re out of your control. Latency spikes are inevitable So you must design for them, and degrade the quality of your application to minimize the impact on your users. gracefully In the case of API Gateway and Lambda, there are additional considerations: for integration points, so even if your Lambda function can run for up to 5 mins, API Gateway will timeout way before that API Gateway has a hard limit of 29s timeout You need to consider the effect of , which is . A cold start in an intermediate service can easily cause the outer service to timeout. cold starts heavily influenced by both language runtime and memory allocation Where to inject latency Suppose our client application communicates directly with 2 public facing APIs, whom in turn depends on an internal API. In this setup, I can think of 3 places where we can inject latency and each would validate a different hypothesis. Inject latency at HTTP clients The first, and easiest place to inject latency is in the HTTP client library we use to communicate with the internal API. This will test that our function . has appropriate timeout on this HTTP communication and can degrade gracefully when this request time out Furthermore, this practice , such as DynamoDB. We will discuss we can inject latency to these 3rd party libraries later in the post. should also be applied to other 3rd party services we depend on how This is a reasonably safe place to inject latency as the blast radius is limited to this function. immediate However, you can (and arguably, should) consider applying this type of latency injection to intermediate services as well. Doing so does carry as it has a in failures cases — ie. if the function under test does not degrade gracefully then it can cause unintended problems to outer services. In this case, the blast radius for these failure cases is the same as if you’re injecting latency to the intermediate functions directly. extra risk broader blast radius Inject latency to intermediate functions You can also inject latency directly to the functions themselves (we’ll look at later on). This has the same effect as injecting latency to the HTTP client to each of its dependents, except it’ll affect all its dependents at once. how This might seem risky (it can be), but is an effective way to validate that every service that depends on this API endpoint is , and handling timeouts gracefully. expecting It makes most sense when applied to intermediate APIs that are part of a bounded context (or, a microservice), maintained by the same team of developers. That way, you avoid unleashing chaos upon unsuspecting developers who might not be ready to deal with the chaos. That said, I think there is a good counter-argument for doing just that. We often fall into the pitfall of using the performance characteristics of dev environments as predictor for the production environment. Whilst we seldom experience load-related latency problems in the dev environments — because we don’t have enough load in those environments to begin with — production is quite another story. Which means, we’re not to think about these failure modes during development. programmed So, a good way to and to programme them to expect timeouts is to expose them to these failure modes regularly in the dev environments, by injecting latency to our internal APIs in those environments. hack the brains of your fellow developers In fact, if we make our dev environments exhibit the most hostile and turbulent conditions that our systems , then we know for sure that any system that makes its way to production are . should be expected to handle ready to face what awaits it in the wild Inject latency to public-facing functions So far, we have focused on validating the handling of latency spikes and timeouts in our APIs. . The same validation is needed for our client application We can apply all the same arguments mentioned above here. By injecting latency to our public-facing API functions (in both production as well as dev environments), we can: validate the client application handles latency spikes and timeouts gracefully, and offers the best UX as possible in these situations train our client developers to expect latency spikes and timeouts When I was working on a MMORPG at Gamesys years ago, we uncovered a host of frailties in the game when we to our APIs. The game would crash during startup if any of the first handful of requests fails. In some cases, if the response time was longer than a few seconds then the game would also get into a weird state because of race conditions. injected latency spikes and faults Turns out I was setting my colleagues up for failure in production because the dev environment was so forgiving and gave them a false sense of comfort. With that, let’s talk about we can apply the practice of latency injection. how But wait, can’t you inject latency in the client-side HTTP clients too? Absolutely! And you should! However, for the purpose of this post we are going to look at how and where we can inject latency to our Lambda functions only, hence why I have willfully ignored this part of the equation. How to inject latency There are 2 aspects to actually injecting latencies: adding delays to operations configuring how often and how much delay to add If you read my previous posts on and , then you have already seen the basic building blocks we need to do both. capturing and forwarding correlation IDs managing configurations with SSM Parameter Store How to inject latency to HTTP client Since you are unlikely to write a HTTP client from scratch, so I consider the problem for injecting latency to HTTP client and 3rd party clients (such as the AWS SDK) to be one and the same. A couple of solutions jump to mind: in static languages, you can consider using a static weaver such as AspectJ or PostSharp, this is the approach I took previously in static languages, you can consider using dynamic proxies, which many IoC frameworks offer (another form of AOP) you can , either manually or with a factory function ( ’s promisifyAll function is a good example) create a wrapper for the client bluebirdjs Since I’m going to use Node.js as example, I’m going to focus on wrappers. For the HTTP client, given the relatively small number of methods you will need, it’s feasible to craft the wrapper by hand, especially if you have a particular API design in mind. Using the HTTP client I created for the as base, I modified it to accept a configuration object to control the latency injection behaviour. correlation ID post { : , : , : , : } "isEnabled" true "probability" 0.5 "minDelay" 100 "maxDelay" 5000 You can find this modified HTTP client , below is a simplified version of this client (which uses superagent under the hood). here ; co = ( ); = ( ); injectLatency = co.wrap( { (config.isEnabled === && .random() < config.probability) { delayRange = config.maxDelay - config.minDelay; delay = .floor(config.minDelay + .random() * delayRange); .log( ); .delay(delay); } }); ... Req = { request = getRequest(options); fullResponse = options.resolveWithFullResponse === ; latencyInjectionConfig = options.latencyInjectionConfig; injectLatency(latencyInjectionConfig); { resp = exec(request); fullResponse ? resp : resp.body; } (e) { (e.response && e.response.error) { e.response.error; } e; } }; .exports = Req; 'use strict' const require 'co' const Promise require 'bluebird' let * ( ) function config if true Math let let Math Math console `injecting [ ms] latency to HTTP request...` ${delay} yield Promise // other helper functions, like getRequest, etc. // options: { // uri : string // method : GET (default) | POST | PUT | HEAD // headers : object // qs : object // body : object, // resolveWithFullResponse : bool (default to false) // latencyInjectionConfig : { isEnabled: bool, probability: Double, minDelay: Double, maxDelay: Double } // } let * ( ) function options let let true let yield // <- this is the import bit try let yield return catch if throw throw module To configure the function and the latency injection behaviour, we can use the configClient I first created in the . SSM Parameter Store post First, let’s create the configs in the SSM Parameter Store. The configs contains the URL for the internal API, as well as a . For now, we just have a property, which is used to control the HTTP client’s latency injection behaviour. chaosConfigobject httpClientLatencyInjectionConfig { : , : { : { : , : , : , : } } } "internalApi" "https://xx.amazonaws.com/dev/internal" "chaosConfig" "httpClientLatencyInjectionConfig" "isEnabled" true "probability" 0.5 "minDelay" 100 "maxDelay" 5000 Using the aforementioned , we can fetch the JSON config from SSM Parameter Store at runtime. configClient configKey = ; configObj = configClient.loadConfigs([ configKey ]); config = .parse( configObj[ ]); internalApiUrl = config.internalApi; chaosConfig = config.chaosConfig || {}; injectionConfig = chaosConfig.httpClientLatencyInjectionConfig; reply = http({ : , : internalApiUrl, : injectionConfig }); const "public-api-a.config" const let JSON yield "public-api-a.config" let let let let yield method 'GET' uri latencyInjectionConfig The above configuration gives us a 50% chance of injecting a latency between 100ms and 5sec when we make the HTTP request to . internal-api This is reflected in the following X-Ray traces. How to inject latency to AWSSDK With the AWS SDK, it’s not feasible to craft the wrapper by hand. Instead, we could do with a factory function like ’s . bluebird promisifyAll We can apply the same approach here, and I made a at doing just that. crude attempt I must add that, whilst I consider myself a competent Node.js/Javascript programmer, I’m sure there’s a better way to implement this factory function. My factory function will only work with promisified objects (told you it’s crude..), and replaces their functions with a wrapper that takes in one more argument of the shape: xxxAsync { : , : , : , : } "isEnabled" true "probability" 0.5 "minDelay" 100 "maxDelay" 3000 Again, it’s clumsy, but we can take the from the AWS SDK, promisify it with bluebird, then wrap the promisified object with our own wrapper factory. Then, we can call its async functions with an optional argument to control the latency injection behaviour. DocumentClient You can see this in action in the handler function for . public-api-b ; co = ( ); apiHandler = ( ); injectable = ( ); = ( ); AWS = ( ); asyncDDB = .promisifyAll( AWS.DynamoDB.DocumentClient()); dynamodb = injectable.injectableAll(asyncDDB); .exports.handler = apiHandler( , co.wrap( { chaosConfig = config.chaosConfig || {}; latencyInjectionConfig = chaosConfig.latencyInjectionConfig; req = { : , : { : } }; item = dynamodb.getAsync(req, latencyInjectionConfig); { : , : item }; }) ); 'use strict' const require 'co' const require '../lib/apiHandler' const require '../lib/injectable' const Promise require 'bluebird' const require 'aws-sdk' const Promise new const module "public-api-b.config" * ( ) function event, context, config let let let TableName 'latency-injection-demo-dev' Key id 'foo' // the wrapped (and promisified) DocumentClient's async functions can // take an optional argument (the last argument) which is expect to be // the format of the latencyInjectionConfig, ie. of the shape // { isEnabled, probability, minDelay, maxDelay } let yield return message "everything is still awesome" reply For some reason, the wrapped function is not able to record subsegments in X-Ray. I suspect it’s some nuance about Javascript or the X-Ray SDK that I do not fully understand. Nonetheless, judging from the logs, I can confirm that the wrapped function does indeed inject latency to the call to DynamoDB. getAsync If you know of a way to improve the factory function, or to get the X-Ray tracing work with the wrapped function, please let me know in the comments. How to inject latency to function invocations The apiHandler factory function I created in the is a good place to apply that we want from our API functions, including: correlation ID post common implementation patterns log the event source as debug log the response and/or error from the invocation (which, surprisingly, Lambda doesn’t capture by default) initialize global context (eg. for tracking correlation IDs) handle serialization for the response object etc.. .exports.handler = apiHandler( co.wrap( { ... { : }; }) ); // this is how you use the apiHandler factory function to create a // handler function for API Gateway event source module * ( ) function event, context // do bunch of stuff // instead of invoking the callback directly, you return the // response you want to send, and the wrapped handler function // would handle the serialization and invoking callback for you // also, it takes care of other things for you, like logging // the event source, and logging unhandled exceptions, etc. return message "everything is awesome" In this case, it’s also a good place for us to inject latency to the API function. However, to do that, we need to access the configuration for the function. Time to lift the responsibility for fetching configurations into the apiHandler factory then! The full apiHandler factory function can be found , below is a simplified version that illustrates the point. here ; co = ( ); configClient = ( ); ... { configObj = configKey ? configClient.loadConfigs([ configKey ]) : ; co.wrap( { .log( .stringify(event)); config = configObj ? .parse( configObj[configKey]) : ; { result = .resolve(f(event, context, config)); result = result || {}; .log( , .stringify(result)); cb( , OK(result)); } (err) { .error( , err); cb(err); } }); } .exports = createApiHandler; 'use strict' const require 'co' const require './configClient' // helper functions // configKey : eg. public-api-a.config // f : handler function with signature (event, context, configValue) => Promise[T] ( ) function createApiHandler configKey, f let undefined // return a wrapped handler function which takes care of common plumbing return * ( ) function event, context, cb console JSON let JSON yield undefined try let yield Promise console 'SUCCESS' JSON null catch console "Failed to process request" module Now, we can write our API function like the following. ; co = ( ); http = ( ); apiHandler = ( ); .exports.handler = apiHandler( , co.wrap( { uri = config.internalApi; chaosConfig = config.chaosConfig || {}; latencyInjectionConfig = chaosConfig.httpClientLatencyInjectionConfig; reply = http({ : , uri, latencyInjectionConfig }); { : , : reply }; }) ); 'use strict' const require 'co' const require '../lib/http' const require '../lib/apiHandler' module "public-api-a.config" * ( ) function event, context, config let let let let yield method 'GET' return message "everything is awesome" reply Now that the has access to the config for the function, it can access the object too. apiHandler chaosConfig Let’s extend the definition for the object to add a property. chaosConfig functionLatencyInjectionConfig : { : { : , : , : , : }, : { : , : , : , : } } "chaosConfig" "functionLatencyInjectionConfig" "isEnabled" true "probability" 0.5 "minDelay" 100 "maxDelay" 5000 "httpClientLatencyInjectionConfig" "isEnabled" true "probability" 0.5 "minDelay" 100 "maxDelay" 5000 With this additional configuration, we can modify the apiHandler factory function to use it to inject latency to a function’s invocation much like what we did in the HTTP client. exec = co.wrap( { { result = .resolve(f(event, context, config)); result || {}; } { latencyInjectionConfig = _.get(config, , {}); injectLatency(latencyInjectionConfig); } }); { configObj = configKey ? configClient.loadConfigs([ configKey ]) : ; co.wrap( { .log( .stringify(event)); config = configObj ? .parse( configObj[configKey]) : ; { result = exec(f, event, context, config); .log( , .stringify(result)); cb( , OK(result)); } (err) { .error( , err); cb(err); } }); } let * ( ) function f, event, context, config try let yield Promise return finally let 'chaosConfig.functionLatencyInjectionConfig' yield ( ) function createApiHandler configKey, f let undefined return * ( ) function event, context, cb console JSON let JSON yield undefined try let yield console 'SUCCESS' JSON null catch console "Failed to process request" Just like that, we can now inject latency to function invocations via configuration. This will work for any API function that is created using the apiHandler factory. With this change and both kinds of latency injections enabled, I can observe all the expected scenarios through X-Ray: no latency was injected latency was injected to the function invocation only latency was injected to the HTTP client only latency was injected to both HTTP client and the function invocation, but the invocation did not timeout as a result latency was injected to both HTTP client and the function invocation, and the invocation times out as a result I can get further confirmation of the expected behaviour through logs, and the metadata recorded in the X-Ray traces. Recap, and future works In this post we discussed: why you should consider applying the practice of latency injection to APIs created with API Gateway and Lambda additional considerations specific to API Gateway and Lambda where you can inject latencies, and why you should consider injecting latency at each of these places how you can inject latency in HTTP clients, AWS SDK, as well as the function invocation The approach we have discussed here is driven by configuration, and the configuration is . refreshed every 3 mins by default We can go much further with this. Fine grained configuration The configurations can be more fine grained, and allow you to control latency injection to specific resources. For example, instead of a blanket httpClientLatencyInjectionConfig for all HTTP requests (including those requests to AWS services), the configuration can be specific to an API, or a DynamoDB table. Automation The configurations can be changed by an automated process to: run routine validations daily stop all latency injections during off hours, and holidays forcefully stop all latency injections, eg. during an actual outage orchestrate complex scenarios that are difficult to manage by hand, eg. enable latency injection at several places at once Again, we can look to Netflix for for such an automated platform. inspiration Usually, you would want to enable one latency injection in a bounded context at a time. This helps contain the blast radius of unintended damages, and make sure your experiments are actually . Also, when latency is injected at several places, it is harder to understand the causality we observe as there are multiple variables to consider. controlled Unless, of course, you’re validating against specific hypothesis such as: The system can tolerate outage to both the primary store (DynamoDB) as well as the backup store (S3) for user preferences, and would return a hardcoded default value in that case. Better communication Another good thing to do, is to inform the caller of the fact that latency has been added to the invocation by design. This might take the form of a HTTP header in the response to tell the caller how much latency was injected in total. If you’re using an automated process to generate these experiments, then you should also include the id/tag/name for the specific instance of the experiment as HTTP header as well. What's next? As I mentioned in the previous post, you need to apply common sense when deciding when and where you apply chaos engineering practices. Don’t attempt an exercises that you know is beyond your abilities. Before you even consider applying latency injection to your APIs in production, you need to think about how you can deal with these latency spikes given the inherent constraints of API Gateway and Lambda. The code for the demo in this post is available on github . Feel free to play around with it and let me know if you have any suggestions for improvement! here References Hack the brains of other people with API design Design for latency issues Capture and forward correlation IDs through different Lambda event sources You should use SSM Parameter Store over Lambda env variables ChAP: Chaos Automation Platform Hi, my name is . I’m an and the author of . I have run production workload at scale in AWS for nearly 10 years and I have been an architect or principal engineer with a variety of industries ranging from banking, e-commerce, sports streaming to mobile gaming. I currently work as an independent consultant focused on AWS and serverless. Yan Cui AWS Serverless Hero Production-Ready Serverless You can contact me via , and . Email Twitter LinkedIn Check out my new course, .In this course, we’ll cover everything you need to know to use AWS Step Functions service effectively. Including basic concepts, HTTP and event triggers, activities, design patterns and best practices. Complete Guide to AWS Step Functions Get your copy . here Come learn about operational for AWS Lambda: CI/CD, testing & debugging functions locally, logging, monitoring, distributed tracing, canary deployments, config management, authentication & authorization, VPC, security, error handling, and more. BEST PRACTICES You can also get off the face price with the code . 40% ytcui Get your copy . here Originally published at https://theburningmonk.com on November 13, 2017.