The common practice of using agents/daemons to buffer and batch send logs and metrics is no longer applicable in the world of serverless. Here are some tips to help you get the most out of your logging and monitoring infrastructure for your functions.

This is part 2 of a 3-part mini series on managing your AWS Lambda logs. If you haven’t read part 1 yet, please give it a read now. We’ll be building on top of the basic infrastructure of shipping logs from CloudWatch Logs detailed in that post.

- part 1: centralise logging
- part 3: tracking correlation IDs

New paradigm, new problems

Much has changed with the serverless paradigm. It has solved many of the old problems we faced and replaced them with some new problems that (I think) are easier to deal with. Consequently, many of the old practices are no longer applicable, e.g. using agents/daemons to buffer and batch send metrics and logs to monitoring and log aggregation services.

However, even as we throw away these old practices for the new world of serverless, we are still after the same qualities that made our old tools “good”:

- able to collect a rich set of system and application metrics and logs
- publishing metrics and logs should not add user-facing latency (i.e. they should be performed in the background)
- metrics and logs should appear in realtime (i.e.
within a few seconds)
- metrics should be granular

Unfortunately, the current tooling for Lambda (CloudWatch metrics and CloudWatch Logs) is failing on a few of these, some more so than others:

- publishing custom metrics requires additional network calls that need to be made during the function’s execution, adding to user-facing latency
- CloudWatch metrics for AWS services are only granular down to 1 minute interval (custom metrics can be granular down to 1 second)
- CloudWatch metrics are often a few minutes behind (though custom metrics might have less lag now that they can be recorded at 1 second interval)
- CloudWatch Logs are usually more than 10s behind (not a precise measurement, but based on personal observation)

With Lambda, we have to rely on AWS to improve CloudWatch in order to bring us parity with existing “server-ful” services.

Many vendors have announced support for Lambda, such as Datadog and Wavefront. However, as they are using the same metrics from CloudWatch they will have the same lag.

IOPipe is a popular alternative for monitoring Lambda functions and they do things slightly differently: they give you a wrapper function around your code so they can inject monitoring code (it’s a familiar pattern to those who have used AOP frameworks in the past).

For their 1.0 release they also announced support for tracing (see the demo video below), which I think is interesting as AWS already offers X-Ray and it’s a more complete tracing solution (despite its own shortcomings as I mentioned in this post).

IOPipe seems like a viable alternative to CloudWatch, especially if you’re new to AWS Lambda and just want to get started quickly. I can totally see the value of that simplicity.

However, I have some serious reservations with IOPipe’s approach:

- A wrapper around every one of my functions? This level of pervasive access to my entire application requires a serious amount of trust that has to be earned, especially in times like
this.
- CloudWatch collects logs and metrics asynchronously without adding to my function’s execution time. But with IOPipe they have to send the metrics to their own system, and they have to do so during my function’s execution time, hence adding to user-facing latency (for APIs).
- Further to the above points, it’s another thing that can cause my function to error or time out even after my code has successfully executed. Perhaps they’re doing something smart to minimise that risk but it’s hard for me to know for sure and I have to anticipate failures.

Of all the above, the latency overhead is the biggest concern for me. With API Gateway and Lambda I already have to deal with cold starts and the latency between the two services. As your microservice architecture expands and the number of inter-service communications grows, these latencies will compound further.

For background tasks this is less of a concern, but a sizeable portion of the Lambda functions I have written have to handle HTTP requests, and I need to keep the execution time as low as possible for these functions.

Yubl’s road to Serverless - Part 1, Overview (medium.com): The Road So Far

Sending custom metrics asynchronously

I find Datadog’s approach for sending custom metrics very interesting. Essentially you write custom metrics as specially-formatted log messages that Datadog will process (you have to set up IAM permissions for CloudWatch to call their function) and track them as metrics. Datadog allows you to send custom metrics using log messages in their DogStatsD format.

It’s a simple and elegant approach, and one that we can adopt for ourselves even if we decide to use another monitoring service.

In part 1 we established an infrastructure to ship logs from CloudWatch Logs to a log aggregation service of our choice.
We can extend the log shipping function to look for log messages that look like these:

[Figure: Log custom metrics as specially formatted log messages]

For these log messages, we will interpret them as:

MONITORING|metric_value|metric_unit|metric_name|metric_namespace

And instead of sending them to the log aggregation service, we’ll send them as metrics to our monitoring service. In this particular case, I’m using CloudWatch in my demo (see link below), so the format of the log message reflects the fields I need to pass along in the PutMetricData call.

To send custom metrics, we write them as log messages. Again, there is no latency overhead, as the Lambda service collects these for us and sends them to CloudWatch in the background.

And moments later they’re available in CloudWatch metrics. Custom metrics are recorded in CloudWatch as expected.

Take a look at the custom-metrics function in this repo.

theburningmonk/lambda-logging-metrics-demo (github.com): How to apply Datadog's approach for sending custom metrics asynchronously.

Tracking the memory usage and billed duration of your AWS Lambda functions in CloudWatch

Lambda reports the amount of memory used and the billed duration at the end of every invocation. Whilst these are not published as metrics in CloudWatch, you can find them as log messages in CloudWatch Logs.

At the end of every invocation, Lambda publishes a REPORT log message detailing the max amount of memory used by your function during this invocation, and how much time is billed (Lambda charges at 100ms blocks).

I rarely find memory usage to be an issue as Node.js functions have such a small footprint. My choice of memory allocation is primarily based on getting the right balance between cost and performance. In fact, Alex Casalboni of CloudAcademy wrote a very nice blog post on using Step Functions to help you find that sweet spot.
Blog post: AWS Lambda Power Tuning with AWS Step Functions (serverless.com): During the last few months, I realized that most developers using serverless technologies have to rely on blind choices…

The Billed Duration, on the other hand, is a useful metric when viewed side by side with Invocation Duration. It gives me a rough idea of the amount of wastage I have. For example, if the average Invocation Duration of a function is 42ms but the average Billed Duration is 100ms, then there is a 58% wastage and maybe I should consider running the function on a lower memory allocation.

Interestingly, IOPipe records these in their dashboard out of the box. IOPipe records a number of additional metrics that are not available in CloudWatch, such as Memory Usage and CPU Usage over time, as well as cold starts.

However, we don’t need to add IOPipe just to get these metrics. We can apply a similar technique to the previous section and publish them as custom metrics to our monitoring service.

To do that, we have to look out for these REPORT log messages and parse the relevant information out of them. Each message contains 3 pieces of information we want to extract:

- Billed Duration (Milliseconds)
- Memory Size (MB)
- Memory Used (MB)

We will parse these log messages and return an array of CloudWatch metric data for each, so we can flat map over them afterwards.

This is a function in the “parse” module, which maps a log message to an array of CloudWatch metric data.

Flat map over the CloudWatch metric data returned by the above parse.usageMetrics function and publish them.

And sure enough, after subscribing the log group for an API (created in the same demo project to test this) and invoking the API, I’m able to see these new metrics show up in CloudWatch metrics.

Looking at the graph, maybe I can reduce my cost by running it on a much smaller memory size.

Take a look at the usage-metrics function in this repo.
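The parsing and flat-mapping steps described above can be sketched roughly as follows. This is a minimal illustration rather than the code from the demo repo: the function names, the `lambda-usage-demo` namespace, the `FunctionName` dimension and the sample log events are all assumptions made for the example. It also assumes each message has already had any timestamp or request ID prefix stripped, and that the REPORT line uses the `Billed Duration`, `Memory Size` and `Max Memory Used` labels. No AWS calls are made; the output is only shaped like the parameters for a PutMetricData call.

```javascript
// Sketch only: names, namespace and dimension below are illustrative assumptions.
const USAGE_NAMESPACE = 'lambda-usage-demo'; // hypothetical custom namespace

// Parse "MONITORING|metric_value|metric_unit|metric_name|metric_namespace".
// Returns an array so that callers can flat map over the results.
function customMetrics(message, timestamp) {
  const parts = message.trim().split('|');
  if (parts.length !== 5 || parts[0] !== 'MONITORING') return [];
  const [, rawValue, unit, name, namespace] = parts;
  const value = parseFloat(rawValue);
  if (Number.isNaN(value)) return [];
  return [{
    Namespace: namespace,
    Datum: { MetricName: name, Unit: unit, Value: value, Timestamp: new Date(timestamp) },
  }];
}

// Parse the REPORT line that Lambda writes at the end of an invocation.
function usageMetrics(functionName, message, timestamp) {
  if (!message.startsWith('REPORT RequestId:')) return [];
  const fields = [
    { label: 'Billed Duration', suffix: 'ms', name: 'BilledDuration', unit: 'Milliseconds' },
    { label: 'Memory Size', suffix: 'MB', name: 'MemorySize', unit: 'Megabytes' },
    { label: 'Max Memory Used', suffix: 'MB', name: 'MemoryUsed', unit: 'Megabytes' },
  ];
  return fields.flatMap(({ label, suffix, name, unit }) => {
    const match = message.match(new RegExp(`${label}:\\s*([\\d.]+)\\s*${suffix}`));
    if (!match) return []; // skip fields that are missing from the line
    return [{
      Namespace: USAGE_NAMESPACE,
      Datum: {
        MetricName: name,
        Unit: unit,
        Value: parseFloat(match[1]),
        Timestamp: new Date(timestamp),
        Dimensions: [{ Name: 'FunctionName', Value: functionName }],
      },
    }];
  });
}

// Group parsed metrics by namespace into PutMetricData-shaped request params.
function toPutMetricDataParams(metrics) {
  const byNamespace = new Map();
  for (const { Namespace, Datum } of metrics) {
    if (!byNamespace.has(Namespace)) byNamespace.set(Namespace, []);
    byNamespace.get(Namespace).push(Datum);
  }
  return [...byNamespace].map(([Namespace, MetricData]) => ({ Namespace, MetricData }));
}

// Made-up log events, as they might arrive at the log-shipping function.
const logEvents = [
  { timestamp: 1503000000000, message: 'MONITORING|42|Count|requests|demo-namespace' },
  {
    timestamp: 1503000000100,
    message: 'REPORT RequestId: abc\tDuration: 42.10 ms\tBilled Duration: 100 ms \t'
      + 'Memory Size: 1024 MB\tMax Memory Used: 20 MB\t',
  },
  { timestamp: 1503000000200, message: 'just an ordinary log line' },
];

// Flat map every log event into zero or more metrics, then batch by namespace.
const metrics = logEvents.flatMap(({ message, timestamp }) => [
  ...customMetrics(message, timestamp),
  ...usageMetrics('demo-function', message, timestamp),
]);
const requests = toPutMetricDataParams(metrics);
console.log(JSON.stringify(requests, null, 2));
```

Returning an array from each parser, even for a single metric, is what makes the flat map work: ordinary log lines contribute nothing, and a REPORT line contributes up to three data points. In a real log-shipping function each entry in `requests` would be passed to the monitoring service, and anything that is not a metric would continue on to the log aggregation service.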
theburningmonk/lambda-logging-metrics-demo (github.com, usage-metrics function): How to apply Datadog's approach for sending custom metrics asynchronously.

Mind the concurrency!

When processing CloudWatch Logs with Lambda functions, you need to be mindful of the number of concurrent executions it creates so as not to run afoul of the concurrent execution limit.

Since this is an account-wide limit, it means your log-shipping function can cause cascade failures throughout your entire application. Critical functions can be throttled because too many executions are used to push logs out of CloudWatch Logs, which is not a good way to go down ;-)

What we need is a more fine-grained throttling mechanism for Lambda. It’s fine to have an account-wide limit, but we should be able to create pools of functions that can have slices of that limit. For example, tier-1 functions (those serving the core business needs) get 90% of the available concurrent executions, whilst tier-2 functions (BI, monitoring, etc.) get the other 10%.

As things stand, we don’t have that, and the best you can do is to keep the execution of your log-shipping function brief. Maybe that means fire-and-forget when sending logs and metrics; or send the decoded log messages into a Kinesis stream where you have more control over parallelism.

Or, maybe you’ll monitor the execution count of these tier-2 functions and, when the number of executions per minute breaches some threshold, temporarily unsubscribe log groups from the log-shipping function to alleviate the problem.

Or, maybe you’ll install some bulkheads by moving these tier-2 functions into a separate AWS account and use cross-account invocation to trigger them. But this seems a really heavy-handed way to work around the problem!
Cross-account invocation: Public & Cross-account Functions on AWS Lambda (read.iopipe.com): Public and cross-account functions on Serverless platforms such as AWS Lambda offer compelling use-cases to build non…

Point is, it’s not a solved problem and I haven’t come across a satisfying workaround yet. AWS is aware of this gap and hopefully they’ll add support for better control over concurrent executions.

Hi, my name is Yan Cui. I’m an AWS Serverless Hero and the author of Production-Ready Serverless. I have run production workloads at scale in AWS for nearly 10 years and I have been an architect or principal engineer in a variety of industries ranging from banking, e-commerce and sports streaming to mobile gaming. I currently work as an independent consultant focused on AWS and serverless.

You can contact me via Email, Twitter and LinkedIn.

Check out my new course, Complete Guide to AWS Step Functions. In this course, we’ll cover everything you need to know to use the AWS Step Functions service effectively, including basic concepts, HTTP and event triggers, activities, design patterns and best practices. Get your copy here.

Come learn about operational BEST PRACTICES for AWS Lambda: CI/CD, testing & debugging functions locally, logging, monitoring, distributed tracing, canary deployments, config management, authentication & authorization, VPC, security, error handling, and more. You can also get 40% off the face price with the code ytcui. Get your copy here.


Tips and tricks for logging and monitoring AWS Lambda functions
