part 1: new challenges to observability
part 2: 1st party observability tools from AWS [this post]
part 3: 3rd party observability tools
part 4: the future of Serverless observability

In part 1 we talked about the challenges serverless brings to the table. In this post, let's look at the 1st party tools from AWS.

Out of the box we get a bunch of tools provided by AWS itself:

- CloudWatch for monitoring, alerting and visualization
- CloudWatch Logs for logs
- X-Ray for distributed tracing
- Amazon Elasticsearch for log aggregation

CloudWatch Logs

Whenever you write to stdout, those outputs are captured by the Lambda service and sent to CloudWatch Logs as logs. This is one of the few pieces of background processing you get, as it's provided by the platform.

All the log messages (technically they're referred to as events) for a given function appear in CloudWatch Logs under a single Log Group.

As part of a Log Group, you have many Log Streams. Each contains the logs from one concurrent execution (or container) of your function, so there's a one-to-one mapping.

So that's all well and good, but it's not easy to search for log messages in CloudWatch Logs, and there's currently no way to search the logs for multiple functions at once. Whilst AWS has been improving the service, it still pales in comparison to other alternatives on the market. It might suffice as you start out, but you'll probably find yourself in need of something more soon after.

Fortunately, it's straightforward to get your logs out of CloudWatch Logs. You can stream them to Amazon's hosted Elasticsearch service. But don't expect it to be a like-for-like experience with your self-hosted ELK stack. Liz Bennett wrote a detailed post on some of the problems they ran into when using Amazon Elasticsearch at scale. Please give that a read if you're thinking about adopting Amazon Elasticsearch.

Alternatively, you can stream the logs to a Lambda function, and ship them to a log aggregation service of your choice (there's a rough sketch of such a function below). I won't go into detail here as I have written about it at length previously, just go and read this post instead.

You can stream logs from CloudWatch Logs to just about any log aggregation service, via Lambda.

CloudWatch Metrics

With CloudWatch, you get some basic metrics out of the box. Invocation count, error count, invocation duration, etc. All the basic telemetry about the health of a function.

But CloudWatch is missing some valuable data points, such as:

- estimated costs
- concurrent executions: CloudWatch only reports this for functions with reserved concurrency
- cold starts
- billed duration: Lambda reports this in CloudWatch Logs, at the end of every invocation. Because Lambda invocations are billed in 100ms blocks, a 102ms invocation would be billed for 200ms. It's a useful metric to see alongside invocation duration to identify cost optimizations.
- memory usage: Lambda reports this in CloudWatch Logs too, but it's not recorded as a metric in CloudWatch

You get 6 basic metrics about the health of a function.

There are ways to record and track these metrics yourself, see this post on how to do that. Other providers like IOPipe (more on them in the next post) would also report these data points out of the box.
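Here's the rough sketch of a log-shipping function I mentioned earlier. This is my own illustration, not the full solution from the linked post: CloudWatch Logs delivers the payload to the subscribed function base64-encoded and gzipped, and ship_to_log_service is a hypothetical placeholder for whatever client your aggregation service uses.

```python
# Sketch of a Lambda function subscribed to a CloudWatch Logs log group.
import base64
import gzip
import json

def handler(event, context):
    # CloudWatch Logs delivers the payload base64-encoded and gzipped
    payload = base64.b64decode(event["awslogs"]["data"])
    log_data = json.loads(gzip.decompress(payload))

    for log_event in log_data["logEvents"]:
        ship_to_log_service(
            log_group=log_data["logGroup"],    # e.g. /aws/lambda/my-function
            message=log_event["message"],
            timestamp=log_event["timestamp"],  # epoch milliseconds
        )

def ship_to_log_service(log_group, message, timestamp):
    # hypothetical placeholder: POST to Logz.io, Splunk, Elasticsearch, etc.
    pass
```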
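And if you want to record one of the missing data points yourself, here's a minimal sketch (assuming Python and boto3; the namespace and metric name are my own placeholders, and this isn't necessarily the approach from the linked post) that tracks cold starts as a custom metric:

```python
import os
import boto3

cloudwatch = boto3.client("cloudwatch")

# module-level code runs once per container, so the first invocation
# a container handles is, by definition, a cold start
is_cold_start = True

def handler(event, context):
    global is_cold_start
    if is_cold_start:
        is_cold_start = False
        cloudwatch.put_metric_data(
            Namespace="MyApp/Lambda",  # hypothetical namespace
            MetricData=[{
                "MetricName": "ColdStart",
                "Dimensions": [
                    {"Name": "FunctionName",
                     "Value": os.environ["AWS_LAMBDA_FUNCTION_NAME"]}
                ],
                "Value": 1.0,
                "Unit": "Count",
            }]
        )
    # ... rest of the handler logic
    return {"statusCode": 200}
```

Bear in mind that a synchronous put_metric_data call like this adds latency to the invocation; writing the data point to stdout and extracting it asynchronously (e.g. with a metric filter) avoids that cost.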
You can set up Alarms in CloudWatch against any of these metrics, here are some good candidates:

- throttled invocations
- regional concurrent executions: set the threshold based on a % of your current regional limit
- tail (95 or 99 percentile) latency against some acceptable threshold
- 4xx and 5xx errors on API Gateway

And you can set up basic dashboards in CloudWatch too, at $3 per month per dashboard (first 3 are free).

X-Ray

For distributed tracing, you have X-Ray. To make the most of tracing, you should instrument your code to gain even better visibility.

Like CloudWatch Logs, collecting traces does not add additional time to your function's invocation. It's background processing that the platform provides for you.
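To give you an idea of what instrumentation looks like, here's a minimal sketch assuming the Python aws-xray-sdk (the function and annotation names are hypothetical, and it assumes active tracing is enabled on the function):

```python
from aws_xray_sdk.core import xray_recorder, patch_all

# patch supported libraries (boto3, requests, ...) so their calls
# show up as subsegments in your traces
patch_all()

@xray_recorder.capture("save_order")  # record this call as a subsegment
def save_order(order):
    # annotations are indexed by X-Ray, so you can filter traces by them
    xray_recorder.current_subsegment().put_annotation("order_id", order["id"])
    # ... write to DynamoDB, call other services, etc.

def handler(event, context):
    save_order(event)
    return {"statusCode": 200}
```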
From the tracing data, X-Ray can also show you a service map like this one.

X-Ray gives you a lot of insight into the runtime performance of a function. However, its focus is narrowly on one function; the distributed aspect is severely undercooked. As it stands, X-Ray currently doesn't trace over API Gateway, or asynchronous invocations such as SNS or Kinesis.

It's good for homing in on performance issues for a particular function. But it offers little to help you build intuition about how your system operates as a whole. For that, I need to step away from what happens inside one function, and be able to look at the entire call chain.

After all, when the engineers at Twitter were talking about the need for observability, it wasn't so much to help them debug performance issues of any single endpoint, but to help them make sense of the behaviour and performance of their system. A system that is essentially one big, complex and highly connected graph of services.

With Lambda, this graph is going to become a lot more complex, more sparse and more connected, because:

- instead of one service with 5 endpoints, you now have 5 functions
- functions are connected through a greater variety of mediums: SNS, Kinesis, API Gateway, IoT, you name it
- event-driven architecture has become the norm

Our tracing tools need to help us make sense of this graph. They need to help us visualize the connections between our functions. And they need to help us follow data as it enters our system as a user request, and reaches out to far corners of this graph through both synchronous and asynchronous events.

And of course, X-Ray does not span over non-AWS services such as Auth0, Google BigQuery, or Azure Functions. But those of us deep in the serverless mindset see the world through SaaS-tinted glasses. We want to use the service that best addresses our needs, and glue them together with Lambda.

At Yubl, we used a number of non-AWS services from Lambda: Auth0, Google BigQuery, GrapheneDB, MongoLab, and Twilio to name a few. And it was great, we didn't have to be bound by what AWS offers.

My good friend Raj also did a good talk at NDC on how he uses services from both AWS and Azure in his wine startup. You can watch his talk here.

And finally, I think of our system like a brain. Like a brain, our system is made up of:

- neurons (functions)
- synapses (connections between functions)
- and electrical signals (data) that flow through them

Like a brain, our system is alive, it's constantly changing and evolving, and it's constantly working! And yet, when I look at my dashboards and my X-Ray traces, that's not what I see. Instead, I see a tabulated list that does not reflect the movement of data and the areas of activity. It doesn't help me build up any intuitive understanding of what's going on in my system.

A brain surgeon wouldn't accept this as the primary source of information. How can they possibly use it to build a mental picture of the brain they need to cut open and operate on?

I should add that this is not a criticism of X-Ray; it is built the same way most observability tools are built. But maybe our tools need to evolve beyond human-computer interfaces (HCI) that wouldn't look out of place on a clipboard (the physical kind, if you're old enough to have seen one!). It actually reminds me of one of Bret Victor's seminal talks, Stop Drawing Dead Fish.

Netflix made great strides towards this idea of a live dashboard with Vizceral, which they have also kindly open sourced.

Conclusions

AWS provides us with some decent tools out of the box. Whilst they each have their shortcomings, they're good enough to get started with.

As 1st party tools, they also enjoy home field advantages over 3rd party tools. For example, Lambda collects logs and traces without adding to your function's invocation time. Since we can't access the server anymore, 3rd party tools cannot perform any background processing. Instead they have to resort to workarounds, or are forced to collect data synchronously.

However, as our serverless applications become more complex, these tools need to either evolve with us or they will need to be replaced in our stack. CloudWatch Logs, for instance, cannot search across multiple functions. It's often the first piece that needs to be replaced once you have more than a dozen functions.

In the next post, we will look at some 3rd party tools such as IOPipe, Dashbird and Thundra. We will discuss their value-add proposition as well as their shortcomings.

Like what you're reading but want more help? I'm happy to offer my services as an independent consultant and help you with your serverless project: architecture reviews, code reviews, building proof-of-concepts, or offering advice on leading practices and tools.

I'm based in London, UK and currently the only UK-based AWS Serverless Hero. I have nearly 10 years of experience with running production workloads in AWS at scale. I operate predominantly in the UK but I'm open to travelling for engagements that are longer than a week. To see how we might be able to work together, tell me more about the problems you are trying to solve here.

I can also run an in-house workshop to help you get production-ready with your serverless architecture. You can find out more about the two-day workshop here, which takes you from the basics of AWS Lambda all the way through to common operational patterns for log aggregation, distributed tracing and security best practices.

If you prefer to study at your own pace, then you can also find all the same content of the workshop as a video course I have produced for Manning. We will cover topics including:

- authentication & authorization with API Gateway & Cognito
- testing & running functions locally
- CI/CD
- log aggregation
- monitoring best practices
- distributed tracing with X-Ray
- tracking correlation IDs
- performance & cost optimization
- error handling
- config management
- canary deployment
- VPC
- security
- leading practices for Lambda, Kinesis, and API Gateway

You can also get 40% off the face price with the code ytcui. Hurry though, this discount is only available while we're in Manning's Early Access Program (MEAP).