The Road So Far

part 1: overview
part 2: testing and continuous delivery strategies
part 3: ops
part 4: building a scalable push notifications system
part 5: building a better recommendation system

A couple of folks asked me about our strategy for monitoring, logging, etc. after part 2, and having watched Chris Swan talk about "Serverless Operations is not a Solved Problem" at the Serverless meetup, it's a good time for us to talk about how we approached ops with AWS Lambda.

NoOps != No Ops

The notion of "NoOps" has often been mentioned alongside serverless technologies (I have done it myself), but it doesn't mean that you no longer have to worry about ops.

To me, "ops" is the umbrella term for everything related to keeping my systems operational and performing within acceptable parameters, including (but not limited to) resource provisioning, configuration management, monitoring, and being on hand to deal with any live issues. The responsibility of keeping the systems up and running will always exist, regardless of whether your software is running on VMs in the cloud, on-premise hardware, or as small Lambda functions.

Within your organization, someone needs to fulfill these responsibilities. It might be that you have a dedicated ops team, or perhaps your developers share those responsibilities. NoOps, to me, means no ops specialization in my organization (ie. no dedicated ops team) because the skills and effort required to fulfill the ops responsibilities do not justify such specialization. As an organization, it's in your best interest to delay such specialization for as long as you can, both from a financial point of view and also, perhaps more importantly, because Conway's law tells us that having an ops team is the surefire way to end up with a set of operational procedures, processes, tools and infrastructure whose complexity will in turn justify the existence of said ops team.
At Yubl, as we migrated to a serverless architecture, our deployment pipeline became more streamlined, our toolchain became simpler, and we found less need for a dedicated ops team; in fact, we were in the process of disbanding our ops team altogether.

Logging

Whenever you write to stdout from your Lambda function (eg. when you call console.log in your nodejs code), it ends up in the function's Log Group in CloudWatch Logs.

Centralised Logging

However, CloudWatch Logs is not easily searchable, and once you have a dozen Lambda functions you will want to collect the logs in one central place. The ELK stack is the de facto standard for centralised logging these days; you can run your own ELK stack on EC2, and elastic.co also offers a hosted version of Elasticsearch and Kibana.

To ship your logs from CloudWatch Logs to ELK, you can subscribe each Log Group to a cloudwatch-logs-to-elk function that is responsible for shipping the logs.

You can subscribe a Log Group manually via the AWS management console. But you don't want a manual step that everyone needs to remember every time they create a new function. Instead, it's better to set up a rule in CloudWatch Events to invoke a subscribe-log-group Lambda function to set up the subscription for new Log Groups.
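As a minimal sketch of what such a subscription function might check before subscribing a new Log Group, the filtering logic can be a pure function (the function names here are the ones used in this post; the CloudWatch Events wiring and the actual subscribe call are assumed, not shown):

```javascript
// Sketch of the filtering logic a subscribe-log-group function might use.
// It subscribes only Lambda log groups, and never the log shipper itself,
// which would otherwise be triggered by its own logs and loop forever.
const SHIPPER_NAME = 'cloudwatch-logs-to-elk'; // the log-shipping function

function shouldSubscribe(logGroupName) {
  // only Lambda functions' log groups, which share the /aws/lambda/ prefix
  if (!logGroupName.startsWith('/aws/lambda/')) {
    return false;
  }
  // never subscribe the shipper's own log group (infinite loop otherwise)
  if (logGroupName === `/aws/lambda/${SHIPPER_NAME}`) {
    return false;
  }
  return true;
}

module.exports = { shouldSubscribe };
```

A handler would call `shouldSubscribe` with the log group name from the CloudWatch Events payload and only then create the subscription filter.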
Two things to keep in mind:

- lots of services create Log Groups in CloudWatch Logs, so you'd want to filter by name; Lambda function logs have the prefix /aws/lambda/
- don't subscribe the Log Group for the cloudwatch-logs-to-elk function itself (or whatever you decide to call it), otherwise you create an infinite loop where the cloudwatch-logs-to-elk function's own logs trigger the function and produce more logs, and so on

Distributed Tracing

Having all your logs in one easily searchable place is great, but as your architecture expands into more and more services that depend on one another, you will need to correlate logs from different services to understand all the events that occurred during one user request.

For instance, when a user creates a new post in the Yubl app we distribute the post to all of the user's followers. Many things happen along this flow:

1. user A's client calls the legacy API to create the new post
2. the legacy API fires a yubl-posted event into a Kinesis stream
3. the distribute-yubl function is invoked to handle this event
4. the distribute-yubl function calls the relationship-api to find user A's followers
5. the distribute-yubl function then performs some business logic, groups user A's followers into batches, and for each batch fires a message to an SNS topic
6. the add-to-feed function is invoked for each SNS message and adds the new post to each follower's feed

If one of user A's followers didn't receive the new post in his feed, then the problem can lie in a number of different places. To make such investigations easier, we need to be able to see all the relevant logs in chronological order, and that's where correlation IDs (eg. initial request-id, user-id, yubl-id, etc.) come in.

Because the handling of the initial user request flows through API calls, Kinesis events and SNS messages, the correlation IDs also need to be captured and passed through API calls, Kinesis events and SNS messages.
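To make the capture-and-forward idea concrete, here is a minimal sketch of how a handler wrapper might stash correlation IDs in a global for the rest of the invocation to use. The `x-correlation-` naming convention and helper names are illustrative, not the actual Yubl library:

```javascript
// Sketch: wrap a Lambda handler so correlation IDs from the incoming event
// are captured into global.CONTEXT for the duration of the invocation
// (safe because nodejs is single-threaded). Downstream client libraries
// can then read global.CONTEXT and forward the IDs onwards.
function captureCorrelationIds(event) {
  const ids = {};
  // copy any field that follows the (illustrative) correlation ID convention
  for (const key of Object.keys(event || {})) {
    if (key.startsWith('x-correlation-')) {
      ids[key] = event[key];
    }
  }
  // if we're the first hop, start a new correlation chain (placeholder ID)
  if (!ids['x-correlation-request-id']) {
    ids['x-correlation-request-id'] = `${Date.now()}`;
  }
  return ids;
}

function wrapHandler(handler) {
  return async (event, context) => {
    global.CONTEXT = captureCorrelationIds(event);
    return handler(event, context);
  };
}

module.exports = { wrapHandler, captureCorrelationIds };
```

An HTTP or Kinesis client library would then copy `global.CONTEXT` into outgoing headers or record payloads before each call.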
Our approach was to roll our own client libraries, which pass the captured correlation IDs along.

Capturing Correlation IDs

All of our Lambda functions are created with wrappers that wrap the handler code with additional goodness, such as capturing the correlation IDs into a global.CONTEXT object (which works because nodejs is single-threaded).

Forwarding Correlation IDs

Our HTTP client library is a thin wrapper around the superagent HTTP client and injects the captured correlation IDs into outgoing HTTP headers.

We also have a client library for publishing Kinesis events, which can inject the correlation IDs into the record payload.

For SNS, you can include the correlation IDs as message attributes when publishing a message.

Zipkin and Amazon X-Ray

Since then, AWS has announced X-Ray, but it's still in preview so I have not had a chance to see how it works in practice, and it doesn't support Lambda at the time of writing.

There is also Zipkin, but it requires you to run additional infrastructure on EC2, and whilst it has a wide range of support for instrumentation, the path to adoption in the serverless environment (where you don't have or need traditional web frameworks) is not clear to me.

Monitoring

Out of the box you get a number of basic metrics from CloudWatch: invocation counts, durations, errors, etc.

You can also publish custom metrics to CloudWatch (eg. user created, post viewed) using the AWS SDK. However, since these are HTTP calls, you have to be conscious of the latencies they'll add for user-facing functions (ie. those serving APIs). You can mitigate the added latencies by publishing the metrics in a fire-and-forget fashion, and/or budgeting the amount of time (say, a max of 50ms) you can spend publishing metrics at the end of a request.

Because you have to do everything during the invocation of a function, it forces you to make trade-offs.
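One way to implement the time budget just described is to race the publish calls against a timeout and move on once the budget is spent. This is a sketch under stated assumptions: `publishFn` stands in for whatever actually ships a metric (eg. a CloudWatch putMetricData call), which is not shown here:

```javascript
// Sketch: flush buffered custom metrics at the end of a request, but spend
// at most `budgetMs` waiting on them; beyond that they are fire-and-forget.
async function flushMetrics(metrics, publishFn, budgetMs = 50) {
  const publish = Promise.all(metrics.map((m) => publishFn(m)));
  const timeout = new Promise((resolve) =>
    setTimeout(() => resolve('timed-out'), budgetMs));
  // whichever settles first wins; a slow publish no longer delays the response
  return Promise.race([publish.then(() => 'published'), timeout]);
}

module.exports = { flushMetrics };
```

The caller gets back either 'published' or 'timed-out', so it can log when metrics were abandoned without ever blocking the user-facing response for more than the budget.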
Another approach is to take a leaf from Datadog's book and use special log messages, then process them after the fact. For instance, if you write logs in the format below:

MONITORING|epoch_timestamp|metric_value|metric_type|metric_name

console.log("MONITORING|1489795335|27.4|latency|user-api-latency");
console.log("MONITORING|1489795335|8|count|yubls-served");

then you can process these log messages (see the Logging section above) and publish them as metrics instead. With this approach you'll be trading off liveness of metrics for lower API latency overhead.

Of course, you can employ both approaches in your architecture and use the appropriate one for each situation:

- for functions on the critical path (which directly impact the latency your users experience), choose the approach of publishing metrics as special log messages
- for other functions (cron jobs, Kinesis processors, etc.) where invocation duration doesn't significantly impact a user's experience, publish metrics as part of the invocation

Dashboards + Alerts

We have a number of dashboards set up in CloudWatch as well as in Graphite (using hostedgraphite, for our legacy stack running on EC2), and they're displayed on large monitors near the server team area. We have also set up alerts against various metrics, such as API latencies and error counts, and use opsgenie to alert whoever's on call that week.

Consider alternatives to CloudWatch

Whilst CloudWatch is a good, cost-effective solution for monitoring (and in some cases the only way to get metrics out of AWS services, such as Kinesis and DynamoDB), it has its drawbacks.

Its UI and customization are not on par with competitors such as Graphite and Datadog, and it lacks advanced features, such as anomaly detection and finding correlations, that you find in Sysdig, Stackdriver and Wavefront.

The biggest limitation of CloudWatch, however, is that metrics are only granular to the minute.
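Going back to the log-based metrics approach for a moment: the log-processing function needs to pick the MONITORING lines out of the log stream. A minimal sketch of that parser, based on the format shown earlier, might look like this (the function name is illustrative):

```javascript
// Sketch: parse MONITORING|epoch_timestamp|metric_value|metric_type|metric_name
// log lines, ignoring everything else, so a log-processing function can
// publish them as metrics after the fact.
function parseMetricLine(line) {
  const parts = line.trim().split('|');
  if (parts.length !== 5 || parts[0] !== 'MONITORING') {
    return null; // not one of our special metric log lines
  }
  const [, timestamp, value, type, name] = parts;
  return {
    timestamp: Number(timestamp), // epoch seconds, as logged
    value: Number(value),
    type,                         // eg. 'latency' or 'count'
    name,                         // eg. 'user-api-latency'
  };
}

module.exports = { parseMetricLine };
```

The processing function would run this over each decoded log event and forward the non-null results to the metrics system of your choice.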
It means your time to discovery of issues is measured in minutes (you need a few data points to separate real issues that require manual intervention from temporary blips), and consequently your time to recovery is likely to be measured in tens of minutes. As you scale up and the cost of unavailability goes up, you need to invest effort to cut down both, which might mean you need more granular metrics than CloudWatch is able to give you.

Another good reason for not using CloudWatch is that you really don't want your monitoring system to fail at the same time as the system it monitors. Over the years we have experienced a number of AWS outages that impacted both our core systems running on EC2 and CloudWatch itself. As our system failed and recovered, we didn't have the visibility to see what was happening and how it was impacting our users.

Config Management

Whatever approach you use for config management, you should always ensure that:

- sensitive data (eg. credentials, connection strings) is encrypted in flight and at rest
- access to sensitive data is based on roles
- you can easily and quickly propagate config changes

Nowadays, you can add environment variables to your Lambda functions directly and have them encrypted with KMS. This was the approach we started with, albeit using environment variables in the Serverless framework, since it wasn't yet a feature of the Lambda service at the time. Once we had a dozen functions sharing config values (eg. MongoDB connection strings), this approach became cumbersome, and it was laborious and painfully slow to propagate config changes manually (by updating and re-deploying every function that requires the updated config value).

It was at this point in our evolution that we moved to a centralised config service.
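The client side of such a centralised config service mostly comes down to caching and periodic refresh, so functions don't hit the config API on every invocation. Here is a minimal sketch under stated assumptions: `fetchFn` stands in for the HTTP call to the config API (and any KMS decryption), which is not shown:

```javascript
// Sketch: a config client that caches values and refreshes them from the
// config service once the cached copy is older than `ttlMs`. `fetchFn`
// stands in for the HTTP call to the config API.
function createConfigClient(fetchFn, ttlMs = 3 * 60 * 1000) {
  const cache = new Map(); // key -> { value, fetchedAt }

  return {
    async get(key, now = Date.now()) {
      const entry = cache.get(key);
      if (entry && now - entry.fetchedAt < ttlMs) {
        return entry.value; // still fresh, no network call
      }
      const value = await fetchFn(key); // refresh from the source
      cache.set(key, { value, fetchedAt: now });
      return value;
    },
  };
}

module.exports = { createConfigClient };
```

Because the cache lives in module scope, warm Lambda invocations reuse it for free, and config changes propagate within one TTL without redeploying anything.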
Having considered consul (which I know a lot of folks use), we decided to write our own config service using API Gateway, Lambda and DynamoDB, because:

- we don't need many of consul's features, only the kv store
- consul is another thing we'd have to run and manage
- consul is another thing we'd have to learn
- even running consul with 2 nodes (you need some redundancy for production), it is still an order of magnitude more expensive

Sensitive data is encrypted (by a developer) using KMS and stored in the config API in its encrypted form; when a Lambda function starts up, it asks the config API for the config values it needs and uses KMS to decrypt the encrypted blob.

We secured access to the config API with API keys created in API Gateway; in the event these keys are compromised, attackers will be able to update config values via this API. You can take this a step further (which we didn't get around to in the end) by securing the POST endpoint with IAM roles, which would require developers to make signed requests to update config values.

Attackers could still retrieve sensitive data in its encrypted form, but they would not be able to decrypt it, as KMS also requires role-based access.

client library

As most of our Lambda functions need to talk to the config API, we invested effort into making our client library really robust, and baked in caching support and periodic polling to refresh config values from the source.

So, that's it folks. I hope you've enjoyed this post; do check out the rest of the series.

The emergence of AWS Lambda and other serverless technologies has significantly simplified the skills and tools required to fulfill the ops responsibilities inside an organization. However, this new paradigm has also introduced new limitations and challenges for existing toolchains, and requires us to come up with new answers. Things are changing at an incredibly fast pace, and I for one am excited to see what new practices and tools emerge from this space!
Like what you're reading but want more help? I'm happy to offer my services as an independent consultant and help you with your serverless project: architecture reviews, code reviews, building proof-of-concepts, or advice on leading practices and tools.

I'm based in London, UK and currently the only UK-based AWS Serverless Hero. I have nearly 10 years of experience with running production workloads in AWS at scale. I operate predominantly in the UK, but I'm open to travelling for engagements that are longer than a week. To see how we might be able to work together, tell me more about the problems you are trying to solve here.

I can also run an in-house workshop to help you get production-ready with your serverless architecture. You can find out more about the two-day workshop here, which takes you from the basics of AWS Lambda all the way through to common operational patterns for log aggregation, distributed tracing and security best practices.

If you prefer to study at your own pace, then you can also find all the same content of the workshop as a video course I have produced for Manning. We will cover topics including:

- authentication & authorization with API Gateway & Cognito
- testing & running functions locally
- CI/CD
- log aggregation
- monitoring best practices
- distributed tracing with X-Ray
- tracking correlation IDs
- performance & cost optimization
- error handling
- config management
- canary deployment
- VPC
- security
- leading practices for Lambda, Kinesis, and API Gateway

You can also get 40% off the face price with the code ytcui. Hurry though, this discount is only available while we're in Manning's Early Access Program (MEAP).