The Road So Far

part 1: overview
part 2: testing and continuous delivery strategies
part 3: ops
part 4: building a scalable push notifications system
part 5: building a better recommendation system

Having spoken to quite a few people about using AWS Lambda in production, testing and CI/CD are always high up the list of questions, so I’d like to use this post to discuss the approaches that we took at Yubl.

Please keep in mind that this is a recollection of what we did, and why we chose to do things that way. I have heard others advocate very different approaches, and I’m sure they too have their reasons and their approaches no doubt work well for them. I hope to give you as much context (or, the “why”) as I can so you can judge whether or not our approach would likely work for you, and feel free to ask questions in the comments section.

Testing

In Growing Object-Oriented Software, Guided by Tests, Nat Pryce and Steve Freeman talked about the 3 levels of testing [Chapter 1]:

- Acceptance — does the whole system work?
- Integration — does our code work against code we can’t change?
- Unit — do our objects do the right thing, are they easy to work with?

As you move from acceptance tests down to unit tests, the speed of the feedback loop becomes faster, but you also have less confidence that your system will work correctly when deployed.

Favour Acceptance and Integration Tests

With the FaaS paradigm, there is more “code we can’t change” than ever (AWS even describes Lambda as the “glue for your cloud infrastructure”), so the value of integration and acceptance tests is also higher than ever. Also, since that “code we can’t change” is easily accessible as a service, these tests are far easier to orchestrate and write than before.

The Lambda functions we wrote were fairly simple and didn’t have complicated logic (most of the time), but there were a lot of them, and they were loosely connected through messaging systems (Kinesis, SNS, etc.) and APIs.
The ROI for acceptance and integration tests is therefore far greater than for unit tests.

It’s for these reasons that we decided (early on in our Lambda journey) to focus our efforts on writing acceptance and integration tests, and to only write unit tests where the internal workings of a function are sufficiently complex.

No Mocks

In Growing Object-Oriented Software, Guided by Tests, Nat Pryce and Steve Freeman also talked about why you shouldn’t mock types that you can’t change [Chapter 8], because…

…We find that tests that mock external libraries often need to be complex to get the code into the right state for the functionality we need to exercise.

The mess in such tests is telling us that the design isn’t right but, instead of fixing the problem by improving the code, we have to carry the extra complexity in both code and test…

…The second risk is that we have to be sure that the behaviour we stub or mock matches what the external library will actually do…

Even if we get it right once, we have to make sure that the tests remain valid when we upgrade the libraries…

I believe the same principles apply here, and that you shouldn’t mock services that you can’t change.

Integration Tests

A Lambda function is ultimately a piece of code that AWS invokes on your behalf when some input event occurs. To test that it integrates correctly with downstream systems, you can invoke the function from your chosen test framework (we used Mocha).

Since the purpose is to test the integration points, it’s important to configure the function to use the same downstream systems as the real, deployed code. If your function needs to read from/write to a DynamoDB table, then your integration test should use the real table as opposed to something like dynamodb-local.

It does mean that your tests can leave artefacts in your integration environment and can cause problems when running multiple tests in parallel (e.g. the artefacts from one test affect the results of other tests).
Which is why, as a rule of thumb, I advocate:

- avoid hard-coded IDs, as they often cause unintentional coupling between tests
- always clean up artefacts at the end of each test

The same applies to acceptance tests.

Acceptance Tests

(Note that the Mocha tests are not invoking the Lambda function programmatically here, but rather invoking it indirectly via whatever input event the Lambda function is configured with — API Gateway, SNS, Kinesis, etc. More on this later.)

…Wherever possible, an acceptance test should exercise the system end-to-end without directly calling its internal code.

An end-to-end test interacts with the system only from the outside: through its interface…

…We prefer to have the end-to-end tests exercise both the system and the process by which it’s built and deployed…

This sounds like a lot of effort (it is), but it has to be done anyway, repeatedly, during the software’s lifetime…

– Growing Object-Oriented Software, Guided by Tests [Chapter 1]

Once the integration tests complete successfully, we have good confidence that our code will work correctly when it’s deployed. The code is then deployed, and the acceptance tests are run end-to-end against the deployed system.

Take our Search API, for instance: one of the acceptance criteria is “when a new user joins, he should be searchable by first name/last name/username”.

The acceptance test first sets up the test condition — a new user joins — by interacting with the system from the outside, calling the legacy API just like the client app would. From there, a new-user-joined event is fired into Kinesis; a Lambda function processes the event and adds a new document to the User index in CloudSearch; the test then validates that the user is searchable via the Search API.

Avoid Brittle Tests

Because a new user is added to CloudSearch asynchronously via a background process, it introduces eventual consistency to the system.
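One way to cope with this eventual consistency in a test is to retry the expectation rather than sleep for a single fixed period. A minimal sketch of such a retry helper (the name `eventually` and the default numbers are illustrative, not from our actual codebase):

```javascript
// Retry `expectation` up to `retries` times, waiting `delayMs` between
// attempts, instead of sleeping once for a fixed (and brittle) period.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function eventually(expectation, { retries = 5, delayMs = 1000 } = {}) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await expectation();
    } catch (err) {
      if (attempt >= retries) throw err; // out of retries: fail the test case
      await sleep(delayMs);
    }
  }
}
```

An assertion against the Search API can then be wrapped as `await eventually(() => assertUserIsSearchable(username))`, where `assertUserIsSearchable` is a hypothetical helper and the retry/delay numbers are tuned per test.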
This is a common challenge when you decouple features through events/messages. When testing these eventually consistent systems, you should avoid waiting fixed time periods (see protip 5 below) as that makes your tests brittle.

In the “new user joins” test case, this means you shouldn’t write a test that:

1. creates a new user
2. waits 3 seconds
3. validates the user is searchable

and should instead write something along the lines of:

1. create a new user
2. validate the user is searchable, with retries:
   2.1 if the expectation fails, wait X seconds before retrying
   2.2 repeat 2.1
   2.3 allow Y retries before failing the test case

Sharing test cases for Integration and Acceptance Testing

We also found that, most of the time, the only difference between our integration and acceptance tests is how our function code is invoked. Instead of duplicating a lot of code and effort, we used a simple technique to share the test cases.

Suppose you have a test case such as the one below. The interesting bit is on line 22:

let res = yield when.we_invoke_get_all_keys(region);

In the when module, the we_invoke_get_all_keys function will either:

- invoke the function code directly with a stubbed context object, or
- perform a HTTP GET request against the deployed API

depending on the value of process.env.TEST_MODE, an environment variable that is passed into the test via package.json (see below) or the bash script we use for deployment (more on this shortly).

Continuous Integration + Continuous Delivery

Whilst we had around 170 Lambda functions running in production, many of them work together to provide different features to the app.
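Before moving on to CI/CD, here is roughly what that switch inside the when module can look like. This is a self-contained sketch: the handler, the httpGet helper, the URL and the "handler" TEST_MODE value are all assumptions made for illustration; only the shape of the switch is the point.

```javascript
// Sketch of the `when` module's we_invoke_get_all_keys: in "handler" mode
// it invokes the function code directly (integration test); otherwise it
// goes through the deployed API (acceptance test).

// stand-in for the real handler (normally require'd from the project)
const handler = async (event, context) => ({ statusCode: 200, body: "[]" });

// stand-in for a real HTTP client calling the deployed API Gateway endpoint
const httpGet = async (url) => ({ statusCode: 200, body: "[]", via: url });

async function we_invoke_get_all_keys(region) {
  if (process.env.TEST_MODE === "handler") {
    // integration test: invoke the function code directly with a stubbed context
    return handler({ region }, { functionName: "get-all-keys" });
  }
  // acceptance test: exercise the deployed API end-to-end
  return httpGet(`https://api.example.com/${region}/keys`);
}
```

The same Mocha test cases then run in both modes, with only the environment variable changing between the integration and acceptance runs.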
Our approach was to group these functions such that:

- Lambda functions that form the endpoints of an API are grouped in a project
- background processing functions for a feature are grouped in a project
- each project has its own repo
- functions in a project are tested and deployed together

The rationale for this grouping strategy is to:

- achieve high cohesion for related functions
- improve code sharing where it makes sense (endpoints of an API are likely to share some logic since they operate within the same domain)

Although functions are grouped into projects, they can still be deployed individually. We chose to deploy them as a unit because:

- it’s simple, and all related functions (in a project) have the same version number
- it’s difficult to detect which functions are impacted by a change to shared code
- deployment is fast, so it makes little difference speed-wise whether we deploy one function or five

For example, in the Yubl app you have a feed of posts from people you follow (similar to your Twitter timeline). To implement this feature there was an API (with multiple endpoints) as well as a bunch of background processing functions (connected to Kinesis streams and SNS topics).

The API has two endpoints, but they also share a common custom auth function, which is included as part of this project (and deployed together with the get and get-yubl functions).

The background processing functions (initially only Kinesis but later expanded to include SNS as well, though the repo wasn’t renamed) have a lot of shared code, such as the distribute module and a number of modules in the lib folder.

All of these functions are deployed together as a unit.

Deployment Automation

We used the Serverless framework to do all of our deployments, and it took care of packaging, uploading and versioning our Lambda functions and APIs.
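For illustration, such a project can be declared in a single serverless.yml along these lines. Note this sketch uses present-day Serverless framework syntax and hypothetical handler paths; the version of the framework we used at the time had a different config format.

```yaml
# One project: the two endpoints of the yubl posts API plus the shared
# custom auth function, versioned and deployed together as a unit.
service: yubl-api

provider:
  name: aws
  runtime: nodejs16.x

functions:
  auth:
    handler: functions/auth.handler   # shared custom authorizer
  get:
    handler: functions/get.handler
    events:
      - http:
          path: yubl/{id}
          method: get
          authorizer: auth
  get-yubl:
    handler: functions/get-yubl.handler
    events:
      - http:
          path: yubl/{id}/full
          method: get
          authorizer: auth
```

A single deploy of this service versions and ships all three functions together, which is exactly the "deploy as a unit" behaviour described above.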
It’s super useful and took care of most of the problem for us, but we still needed a thin layer around it to allow an AWS profile to be passed in and to include testing as part of the deployment process.

We could have scripted these steps on the CI server, but I have been burnt a few times by magic scripts that only exist on the CI server (and not in source control).

To that end, every project has a simple build.sh script which gives you a common vocabulary to:

- run unit/integration/acceptance tests
- deploy your code

Our Jenkins build configs do very little and just invoke this script with different params.

Continuous Delivery

To this day I’m still confused by Continuous “Delivery” vs Continuous “Deployment”. There seem to be several interpretations, but the one I have heard most often draws the line at whether every change is deployed to production automatically.

Regardless of which definition is correct, what was most important to us was the ability to deploy our changes to production quickly and frequently.

Whilst there were no technical reasons why we couldn’t deploy to production automatically, we didn’t do so because:

- it gives the QA team an opportunity to do thorough tests using actual client apps
- it gives the management team a sense of control over what is being released and when

(I’m not saying whether this is a good or bad thing, merely that it was what we wanted.)

In our setup there were two AWS accounts:

- production
- non-prod, which has 4 environments: dev for development, test for the QA team, staging as a production-like environment, and demo for private beta builds for investors, etc.

In most cases, when a change is pushed to Bitbucket, all the Lambda functions in that project are automatically tested, deployed and promoted all the way through to the staging environment. The deployment to production is a manual process that can happen at our convenience, and we generally avoid deploying to production on a Friday afternoon (for obvious reasons ;-)).
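The build.sh wrapper described above is not reproduced here, but it was along these lines. This is a sketch rather than the original script: the npm script names, the stages and the DRY_RUN escape hatch are illustrative assumptions.

```shell
#!/bin/bash
# build.sh (sketch): a common vocabulary for testing and deploying a
# project. npm script names and the DRY_RUN flag are illustrative.
set -euo pipefail

run() {
  # In DRY_RUN mode just print the command, so the script can be
  # exercised without npm or AWS credentials being available.
  if [[ "${DRY_RUN:-0}" == "1" ]]; then echo "would run: $*"; else "$@"; fi
}

build() {
  local mode="$1"
  local stage="${2:-dev}"
  local region="${3:-us-east-1}"
  local profile="${4:-default}"

  case "$mode" in
    unit-test)        run npm run unit-test ;;
    integration-test) run npm run integration-test ;;
    acceptance-test)  run npm run acceptance-test ;;
    deploy)           run npx serverless deploy \
                        --stage "$stage" --region "$region" --aws-profile "$profile" ;;
    *)
      echo "usage: build.sh <unit-test|integration-test|acceptance-test|deploy> [stage] [region] [profile]" >&2
      return 1 ;;
  esac
}

# invoke only when run as a script with arguments
if [[ $# -gt 0 ]]; then build "$@"; fi
```

A Jenkins job then just calls, say, `./build.sh acceptance-test staging eu-west-1 ci-profile` with the parameters for the job in question, and the same command works identically on a developer's machine.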
Conclusions

The approaches we have talked about worked pretty well for our team, but they were not without drawbacks.

In terms of development flow, the focus on integration and acceptance tests meant slower feedback loops, as these tests take longer to execute. Also, because we don’t mock downstream services, we couldn’t run tests without an internet connection, which is an occasional annoyance when you want to work during your commute.

These were explicit tradeoffs we made; I stand by them even now, and AFAIK everyone in the team feels the same way.

In terms of deployment, I really missed the ability to do canary releases. This was offset by the fact that our user base was still relatively small, and the speed with which one can deploy and roll back changes with Lambda functions was sufficient to limit the impact of a bad change.

Whilst AWS Lambda and API Gateway don’t support canary releases out-of-the-box, it is possible to roll a DIY solution for APIs using weighted routing in Route53. Essentially you would:

- create a canary stage for API Gateway and the associated Lambda functions
- deploy production builds to the canary stage first
- use weighted routing in Route53 to direct X% of traffic to the canary stage
- monitor the metrics, and when you’re happy with the canary build, promote it to production

Again, this only works for APIs and not for background processing (SNS, Kinesis, S3, etc.).

So that’s it folks, hope you’ve enjoyed this post, see you in part 3!

Ciao!

Links

- Growing Object-Oriented Software, Guided by Tests
- [slides] AWS Lambda from the Trenches
- [InfoQ] Complexity is outside the Code

Like what you’re reading but want more help? I’m happy to offer my services as an independent consultant and help you with your serverless project: architecture reviews, code reviews, building proof-of-concepts, or advice on leading practices and tools.

I’m based in London, UK and currently the only UK-based AWS Serverless Hero. I have nearly 10 years of experience with running production workloads in AWS at scale. I operate predominantly in the UK, but I’m open to travelling for engagements that are longer than a week. To see how we might be able to work together, tell me more about the problems you are trying to solve here.

I can also run an in-house workshop to help you get production-ready with your serverless architecture. You can find out more about the two-day workshop here, which takes you from the basics of AWS Lambda all the way through to common operational patterns for log aggregation, distributed tracing and security best practices.

If you prefer to study at your own pace, then you can also find all the same content of the workshop as a video course I have produced for Manning. We will cover topics including:

- authentication & authorization with API Gateway & Cognito
- testing & running functions locally
- CI/CD
- log aggregation
- monitoring best practices
- distributed tracing with X-Ray
- tracking correlation IDs
- performance & cost optimization
- error handling
- config management
- canary deployment
- VPC
- security
- leading practices for Lambda, Kinesis, and API Gateway

You can also get 40% off the face price with the code ytcui. Hurry though, as this discount is only available while we’re in Manning’s Early Access Program (MEAP).