![](https://hackernoon.com/hn-images/1*56lZO-ri61SY-ndXv5PePA.png)\n\nAll the way back in 2011, Simon Wardley identified Chaos Engines as a practice that would be employed by the next generation of tech companies, along with continuous deployment, being data-driven, and being organised around small, autonomous teams (think microservices & the inverse Conway’s law).\n\nThis is the first of a multipart series that explores ideas on how we could apply the principles of chaos engineering to serverless architectures built around Lambda functions.\n\n* **part 1: how can we apply principles of chaos engineering to Lambda?**\n* part 2: [applying latency injection for APIs](https://hackernoon.com/chaos-engineering-and-aws-lambda-latency-injection-ddeb4ff8d983)\n* part 3: dealing with latency spikes and timeouts (coming 21/11/2017)\n* part 4: applying fault injection for Lambda functions (coming ?)\n\nThere’s no question about it: Netflix has popularised the _principles_ of chaos engineering. By open sourcing some of their tools — notably the [Simian Army](https://github.com/Netflix/SimianArmy) — they have also helped others build confidence in their systems’ capability to withstand turbulent conditions in production.\n\nThere seems to be a renewed interest in chaos engineering recently. 
As Russ Miles noted in a recent [post](https://medium.com/russmiles/chaos-engineering-why-the-label-matters-35ddbb974fa5), perhaps many companies have finally come to understand that chaos engineering is not about “hurting production”, but about building a better understanding of, and confidence in, a system’s resilience through **_controlled_** experiments.\n\nThis trend has been helped by the valuable (and freely available) information that Netflix has published, such as the [Chaos Engineering](http://oreil.ly/2tZU1Sn) e-book, and [principlesofchaos.org](http://principlesofchaos.org/).\n\n![](https://hackernoon.com/hn-images/1*JlnapsrFEDH771EYRel0VA.png)\n\nTools such as [chaos-lambda](https://github.com/shoreditch-ops/chaos-lambda) by _Shoreditch Ops_ (the folks behind the [Artillery](https://artillery.io/) load test tool) look to replicate Netflix’s [Chaos Monkey](https://github.com/Netflix/SimianArmy/wiki/Chaos-Monkey), but execute from inside a Lambda function instead of an EC2 instance — hence bringing you the cost savings and convenience Lambda offers.\n\nI want to ask a different question, however: **how can one apply the principles of chaos engineering and some of the current practices to a serverless architecture composed of Lambda functions?**\n\nWhen your system runs on EC2 instances, you naturally build resilience by designing for the most likely failure mode — server crashes (due to both hardware and software issues). Hence, a controlled experiment to validate the resilience of your system would artificially recreate the failure scenario by terminating EC2 instances, then AZs, and then entire regions.\n\nAWS Lambda, however, is a higher-level abstraction and has different failure modes from its EC2 counterparts. 
Hypotheses that focus on _“what if we lose these EC2 instances”_ no longer apply, as the platform handles these failure modes for you out of the box.\n\nWe need to ask different questions in order to understand the weaknesses within our serverless architectures.\n\n### More inherent chaos, not less\n\n> “**We need to identify weaknesses before they manifest in system-wide, aberrant behaviors**. Systemic weaknesses could take the form of: improper fallback settings when a service is unavailable; retry storms from improperly tuned timeouts; outages when a downstream dependency receives too much traffic; cascading failures when a single point of failure crashes; etc. We must address the most significant weaknesses proactively, before they affect our customers in production. **We need a way to manage the chaos inherent in these systems**, take advantage of increasing flexibility and velocity, and **have confidence in our production deployments despite the complexity that they represent.**”\n\n> — [Principles of Chaos Engineering](http://principlesofchaos.org/)\n\nHaving [built and operated a non-trivial serverless architecture](https://hackernoon.com/yubls-road-to-serverless-part-1-overview-ca348370acde), I have some understanding of the dangers awaiting you in this new world.\n\nIf anything, there is a lot more inherent chaos and complexity in these systems built around Lambda functions.\n\n* modularity (the unit of deployment) shifts from “services” to “functions”, and there are a lot more of them\n* it’s harder to harden around the boundaries, because you need to harden around each function, as opposed to a service that encapsulates a set of related functionalities\n* there are more intermediary services (e.g. 
Kinesis, SNS and API Gateway, just to name a few), each with its own failure modes\n* there are more configurations overall (timeouts, IAM permissions, etc.), and therefore more opportunities for misconfiguration\n\nAlso, since we have traded away more control of our infrastructure\\*, we now face more unknown failure modes\\*\\* and often there’s little we can do when an outage does occur\\*\\*\\*.\n\n_\\* For better scalability, availability, cost efficiency and more convenience, which I, for one, think is a fair trade in_ **_most cases_**_._\n\n_\\*\\* Everything the platform does for you — scheduling containers, scaling, polling Kinesis, retrying failed invocations, etc. — has its own failure modes. These are often not obvious to us since they’re implementation details that are typically undocumented and are prone to change without notice._\n\n_\\*\\*\\* For example, if an outage happens and prevents Lambda functions from processing Kinesis events, then we have no meaningful alternative but to wait for AWS to fix the problem. Since the current position on the shards is abstracted away and unavailable to us, we can’t even replace the Lambda functions with KCL processors that run on EC2._\n\n### Applying chaos to AWS Lambda\n\nA good exercise regime would continuously push you to your limits but never actually put you over the limit and cause injury. If there’s an exercise that is clearly beyond your current abilities, then surely you would not attempt it, as the only possible outcome is getting yourself hurt!\n\n**The same common sense should be applied when designing controlled experiments for your serverless architecture**. 
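To make this concrete, here is a minimal sketch (in Python) of how a controlled latency-injection experiment could be wired into a Lambda function. The environment variable names and the decorator are hypothetical, not taken from any real chaos tool; the point is that the blast radius is governed by a configurable injection rate, so you can dial the chaos up gradually instead of hurting every invocation.

```python
import functools
import os
import random
import time

# Hypothetical chaos configuration, read from the function's environment
# variables so an experiment can be turned on/off without a redeploy.
INJECTION_RATE = float(os.environ.get("CHAOS_INJECTION_RATE", "0"))  # 0.0 to 1.0
MAX_DELAY_MS = int(os.environ.get("CHAOS_MAX_DELAY_MS", "0"))


def inject_latency(handler):
    """Wrap a Lambda handler so a fraction of invocations incur extra latency."""
    @functools.wraps(handler)
    def wrapper(event, context):
        if MAX_DELAY_MS > 0 and random.random() < INJECTION_RATE:
            # simulate a slow downstream dependency with a random delay
            time.sleep(random.randint(1, MAX_DELAY_MS) / 1000.0)
        return handler(event, context)
    return wrapper


@inject_latency
def handler(event, context):
    return {"statusCode": 200, "body": "hello"}
```

With both environment variables unset, the wrapper is a no-op, which is exactly what you want outside of an experiment.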
In order to “know” what the experiments tell us about the resilience of our system, we also need to decide what metrics to monitor — ideally using client-side metrics, since the most important metric is the quality of service our users experience.\n\nThere are plenty of failure modes that we know about and can design for, and we can run simple experiments to validate our design. For example, since a serverless architecture is (almost always) also a microservice architecture, many of its inherent failure modes still apply:\n\n* improperly tuned timeouts, especially for intermediate services, which can cause services at the edge to also time out\n\n![](https://hackernoon.com/hn-images/1*OFKVj5FywcNs-CpsXFPi0w.png)\n\nIntermediate services should have stricter timeout settings than services at the edge.\n\n* missing error handling, which allows exceptions from downstream services to escape\n\n![](https://hackernoon.com/hn-images/1*DSKdK5TqOy8FZ4F9xUmK8g.png)\n\n* missing fallbacks for when a downstream service is unavailable or experiences an outage\n\n![](https://hackernoon.com/hn-images/1*1ZK4_SQEk3X1g4PowOOCIg.png)\n\nOver the next couple of posts, we will explore how we can apply the practices of latency and fault injection to Lambda functions in order to simulate these failure modes and validate our design.\n\n### Further readings:\n\n* [ChAP: Chaos Automation Platform](https://medium.com/netflix-techblog/chap-chaos-automation-platform-53e6d528371f)\n* [Deploying the Netflix API](https://medium.com/netflix-techblog/deploying-the-netflix-api-79b6176cc3f0)\n* [The Netflix Simian Army](https://medium.com/netflix-techblog/the-netflix-simian-army-16e57fbab116)\n* [Testing in production: Yes you can (and should)](https://opensource.com/article/17/8/testing-production)\n* [Design for latency issues](http://theburningmonk.com/2015/04/design-for-latency-issues/)\n* [1000 actors, one chaos monkey… and everything’s 
OK](https://erlangcentral.org/blog/presentations/1000-actors-one-chaos-monkey-and-everything-ok/)\n* \\[Ebook\\] [Chaos Engineering](http://www.oreilly.com/webops-perf/free/chaos-engineering.csp)\n* \\[Book\\] [Release it!](http://amzn.to/1pedVvt)\n* \\[Book\\] [Drift into Failure](http://amzn.to/1CB0I6D)\n* [principlesofchaos.org](http://principlesofchaos.org/)\n\n![](https://hackernoon.com/hn-images/0*b_1R345KzKSaI8sg.png)\n\nHi, my name is **Yan Cui**. I’m an [**AWS Serverless Hero**](https://aws.amazon.com/developer/community/heroes/yan-cui/) and the author of [**Production-Ready Serverless**](https://bit.ly/production-ready-serverless). I have run production workloads at scale in AWS for nearly 10 years, and I have been an architect or principal engineer in a variety of industries, ranging from banking, e-commerce and sports streaming to mobile gaming. I currently work as an independent consultant focused on AWS and serverless.\n\nYou can contact me via [Email](mailto:theburningmonk.com), [Twitter](https://twitter.com/theburningmonk) and [LinkedIn](https://www.linkedin.com/in/theburningmonk/).\n\nCheck out my new course, [**Complete Guide to AWS Step Functions**](https://theburningmonk.thinkific.com/courses/complete-guide-to-aws-step-functions).\n\nIn this course, we’ll cover everything you need to know to use the AWS Step Functions service effectively. 
That includes basic concepts, HTTP and event triggers, activities, design patterns and best practices.\n\nGet your copy [here](https://theburningmonk.thinkific.com/courses/complete-guide-to-aws-step-functions).\n\n![](https://hackernoon.com/hn-images/0*ZYcHhOOzUf5VB-Ri.png)\n\nCome learn about operational **BEST PRACTICES** for AWS Lambda: CI/CD, testing & debugging functions locally, logging, monitoring, distributed tracing, canary deployments, config management, authentication & authorization, VPC, security, error handling, and more.\n\nYou can also get **40%** off the face price with the code **ytcui**.\n\nGet your copy [here](https://bit.ly/production-ready-serverless).