All the way back in 2011, Simon Wardley identified Chaos Engines as a practice that would be employed by the next generation of tech companies, along with continuous deployment, being data-driven, and being organised around small, autonomous teams (think microservices and inverse-Conway’s law).

This is the first of a multipart series that explores ideas on how we could apply the principles of chaos engineering to serverless architectures built around Lambda functions.

- part 1: how can we apply principles of chaos engineering to Lambda?
- part 2: applying latency injection for APIs
- part 3: dealing with latency spikes and timeouts (coming 21/11/2017)
- part 4: applying fault injection for Lambda functions (coming ?)

There’s no question about it, Netflix has popularised the principles of chaos engineering. By open sourcing some of their tools (notably the Simian Army), they have also helped others build confidence in their system’s capability to withstand turbulent conditions in production.

There seems to be a renewed interest in chaos engineering recently. As Russ Miles noted in a recent post, perhaps many companies have finally come to understand that chaos engineering is not about “hurting production”, but about building a better understanding of, and confidence in, a system’s resilience through controlled experiments.

This trend has been helped by the valuable (and freely available) information that Netflix has published, such as the Chaos Engineering e-book and principlesofchaos.org.

Tools such as chaos-lambda by Shoreditch Ops (the folks behind the Artillery load test tool) look to replicate Netflix’s Chaos Monkey, but execute from inside a Lambda function instead of an EC2 instance, bringing you the cost savings and convenience that Lambda offers.

I want to ask a different question, however: how can one apply the principles of chaos engineering, and some of the current practices, to a serverless architecture composed of Lambda functions?
When your system runs on EC2 instances, you naturally build resilience by designing for the most likely failure mode: server crashes (due to both hardware and software issues). Hence, a controlled experiment to validate the resilience of your system would artificially recreate the failure scenario by terminating EC2 instances, then AZs, and then entire regions.

AWS Lambda, however, is a higher-level abstraction and has different failure modes to its EC2 counterparts. Hypotheses that focus on “what if we lose these EC2 instances” no longer apply, as the platform handles these failure modes for you out of the box.

We need to ask different questions in order to understand the weaknesses within our serverless architectures.

More inherent chaos, not less

“We need to identify weaknesses before they manifest in system-wide, aberrant behaviors. Systemic weaknesses could take the form of: improper fallback settings when a service is unavailable; retry storms from improperly tuned timeouts; outages when a downstream dependency receives too much traffic; cascading failures when a single point of failure crashes; etc. We must address the most significant weaknesses proactively, before they affect our customers in production. We need a way to manage the chaos inherent in these systems, take advantage of increasing flexibility and velocity, and have confidence in our production deployments despite the complexity that they represent.”
(quoted from the Principles of Chaos Engineering)

Having built and operated a non-trivial serverless architecture, I have some understanding of the dangers awaiting you in this new world.

If anything, there is a lot more inherent chaos and complexity in these systems built around Lambda functions:
- modularity (unit of deployment) shifts from “services” to “functions”, and there are a lot more of them
- it’s harder to harden around the boundaries, because you need to harden around each function as opposed to a service which encapsulates a set of related functionalities
- there are more intermediary services (e.g. Kinesis, SNS and API Gateway, just to name a few), each with their own failure modes
- there are more configurations overall (timeout, IAM permissions, etc.), and therefore more opportunities for misconfiguration

Also, since we have traded off more control of our infrastructure*, we now face more unknown failure modes**, and often there’s little we can do when an outage does occur***.

* For better scalability, availability, cost efficiency and more convenience, which I, for one, think is a fair trade in most cases.

** Everything the platform does for you (scheduling containers, scaling, polling Kinesis, retrying failed invocations, etc.) has its own failure modes. These are often not obvious to us since they’re implementation details that are typically undocumented and are prone to change without notice.

*** For example, if an outage happens and prevents Lambda functions from processing Kinesis events, then we have no meaningful alternative but to wait for AWS to fix the problem. Since the current position on the shards is abstracted away and unavailable to us, we can’t even replace the Lambda functions with KCL processors that run on EC2.

Applying chaos to AWS Lambda

A good exercise regime would continuously push you to your limits but never actually put you over the limit and cause injury. If there’s an exercise that is clearly beyond your current abilities then surely you would not attempt it, as the only possible outcome is getting yourself hurt!
The same common sense should be applied when designing controlled experiments for your serverless architecture. In order to “know” what the experiments tell us about the resilience of our system, we also need to decide what metrics to monitor, ideally using client-side metrics, since the most important metric is the quality of service our users experience.

There are plenty of failure modes that we know about and can design for, and we can run simple experiments to validate our design. For example, since a serverless architecture is (almost always) also a microservice architecture, many of its inherent failure modes still apply:

- improperly tuned timeouts, especially for intermediate services, which can cause services at the edge to also time out (intermediate services should have stricter timeout settings than services at the edge)
- missing error handling, which allows exceptions from downstream services to escape
- missing fallbacks for when a downstream service is unavailable or experiences an outage

Over the next couple of posts, we will explore how we can apply the practices of latency and fault injection to Lambda functions in order to simulate these failure modes and validate our design.

Further readings:

- ChAP: Chaos Automation Platform
- Deploying the Netflix API
- The Netflix Simian Army
- Testing in production: Yes you can (and should)
- Design for latency issues
- 1000 actors, one chaos monkey… and everything’s OK
- [Ebook] Chaos Engineering
- [Book] Release it!
- [Book] Drift into Failure
- principlesofchaos.org

Hi, my name is Yan Cui. I’m an AWS Serverless Hero and the author of Production-Ready Serverless. I have run production workloads at scale in AWS for nearly 10 years, and I have been an architect or principal engineer in a variety of industries ranging from banking, e-commerce and sports streaming to mobile gaming. I currently work as an independent consultant focused on AWS and serverless.
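As a rough illustration of the known failure modes listed earlier (improperly tuned timeouts, missing error handling, missing fallbacks), here is a minimal Python sketch. It is not taken from the series or from any AWS API: the `Hop` type, the function names, the service names and the timeout values are all hypothetical, chosen only to show the idea that each service further from the edge should have a stricter timeout than its caller, and that a failing downstream call should degrade to a fallback.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Hop:
    """One service in a call chain (hypothetical type for this sketch)."""
    name: str
    timeout_ms: int


def find_timeout_violations(call_chain: List[Hop]) -> List[Tuple[str, str]]:
    """Return (caller, callee) pairs where the callee's timeout is not
    stricter than its caller's. The chain is ordered from the edge inwards."""
    violations = []
    for caller, callee in zip(call_chain, call_chain[1:]):
        if callee.timeout_ms >= caller.timeout_ms:
            violations.append((caller.name, callee.name))
    return violations


def call_with_fallback(call: Callable[[], T], fallback: T) -> T:
    """Invoke a downstream call and return the fallback if it raises,
    so the exception does not escape to the caller."""
    try:
        return call()
    except Exception:
        return fallback


def unavailable_downstream() -> str:
    """Stand-in for a downstream service that is experiencing an outage."""
    raise ConnectionError("downstream service is unavailable")


# Hypothetical call chain: names and timeout values are assumptions.
chain = [
    Hop("api-handler", 6000),         # service at the edge
    Hop("orders-function", 3000),     # stricter than its caller: fine
    Hop("inventory-function", 5000),  # looser than its caller: a problem
]

print(find_timeout_violations(chain))
print(call_with_fallback(unavailable_downstream, "cached-default"))
```

With these made-up values, the check flags only the orders-function to inventory-function pair, since the inner function could still be running after its caller has given up. A check like this only validates configuration; it does not replace running controlled experiments against the real system.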


How can we apply the principles of chaos engineering to AWS Lambda?
