All the way back in 2011, Simon Wardley had identified Chaos Engines as a practice that will be employed by the next generation of tech companies, along with continuous deployment, being data-driven, and organised around small, autonomous teams (think microservices & inverse-conway’s law).
This is the first of a multipart series that explores ideas on how we could apply the principles of chaos engineering to serverless architectures built around Lambda functions.
There’s no question about it, Netflix has popularised the principles of chaos engineering. By open sourcing some of their tools — notably the Simian Army — they have also helped others build confidence in their system’s capability to withstand turbulent conditions in production.
There seems to be a renewed interest in chaos engineering recently. As Russ Miles noted in a recent post, perhaps many companies have finally come to understand that chaos engineering is not about “hurting production”, but to build better understanding of, and confidence in a system’s resilience through controlled experiments.
Tools such as chaos-lambda by Shoreditch Ops (the folks behind the Artillery load test tool) look to replicate Netflix’s Chaos Monkey, but execute from inside a Lambda function instead of an EC2 instance — hence bringing you the cost saving and convenience Lambda offers.
I want to ask a different question however: how can one apply the principles of chaos engineering and some of the current practices to a serverless architecture comprised of Lambda functions?
When your system runs on EC2 instances, then naturally, you build resilience by designing for the most likely failure mode — server crashes (due to both hardware and software issues). Hence, a controlled experiment to validate the resilience of your system would artificially recreate the failure scenario by terminating EC2 instances, and then AZs, and then entire regions.
AWS Lambda, however, is a higher-level abstraction and has different failure modes to its EC2 counterparts. Hypothesis that focus on “what if we lose these EC2 instances” no longer apply as the platform handles these failure modes for you out of the box.
We need to ask different questions in order to understand the weaknesses within our serverless architectures.
“We need to identify weaknesses before they manifest in system-wide, aberrant behaviors. Systemic weaknesses could take the form of: improper fallback settings when a service is unavailable; retry storms from improperly tuned timeouts; outages when a downstream dependency receives too much traffic; cascading failures when a single point of failure crashes; etc. We must address the most significant weaknesses proactively, before they affect our customers in production. We need a way to manage the chaos inherent in these systems, take advantage of increasing flexibility and velocity, and have confidence in our production deployments despite the complexity that they represent.”
Having built and operated a non-trivial serverless architecture, I have some understanding of the dangers awaiting you in this new world.
If anything, there are a lot more inherent chaos and complexity in these systems built around Lambda functions.
Also, since we have traded off more control of our infrastructure* it means we now face more unknown failure modes** and often there’s little we can do when an outage does occur***.
* For better scalability, availability, cost efficiency and more convenience, which I for one, think it’s a fair trade in most cases.
** Everything the platform does for you — scheduling containers, scaling, polling Kinesis, retry failed invocations, etc. — have their own failure modes. These are often not obvious to us since they’re implementation details that are typically undocumented and are prone to change without notice.
*** For example, if an outage happens and prevents Lambda functions from processing Kinesis events, then we have no meaningful alternative than to wait for AWS to fix the problem. Since the current position on the shards is abstracted away and unavailable to us, we can’t even replace the Lambda functions with KCL processors that run on EC2.
A good exercise regime would continuously push you to your limits but never actually put you over the limit and cause injury. If there’s an exercise that is clearly beyond your current abilities then surely you would not attempt it as the only possible outcome is getting yourself hurt!
The same common sense should be applied when designing controlled experiments for your serverless architecture. In order to “know” what the experiments tell us about the resilience of our system we also need to decide what metrics to monitor — ideally using client-side metrics, since the most important metric is the quality of service our users experience.
There are plenty of failure modes that we know about and can design for, and we can run simple experiments to validate our design. For example, since a serverless architecture is (almost always) also a microservice architecture, many of its inherent failure modes still apply:
Intermediate services should have more strict timeout settings compared to services at the edge.
Over the next couple of posts, we will explore how we can apply the practices of latency and fault injection to Lambda functions in order to simulate these failure modes and validate our design.
Hi, my name is Yan Cui. I’m an AWS Serverless Hero and the author of Production-Ready Serverless. I have run production workload at scale in AWS for nearly 10 years and I have been an architect or principal engineer with a variety of industries ranging from banking, e-commerce, sports streaming to mobile gaming. I currently work as an independent consultant focused on AWS and serverless.
Check out my new course, Complete Guide to AWS Step Functions.
In this course, we’ll cover everything you need to know to use AWS Step Functions service effectively. Including basic concepts, HTTP and event triggers, activities, design patterns and best practices.
Get your copy here.
Come learn about operational BEST PRACTICES for AWS Lambda: CI/CD, testing & debugging functions locally, logging, monitoring, distributed tracing, canary deployments, config management, authentication & authorization, VPC, security, error handling, and more.
You can also get 40% off the face price with the code ytcui.
Get your copy here.