This is part 2 of a 3-part mini series on managing your AWS Lambda logs.
If you haven’t read part 1 yet, please give it a read now. We’ll be building on top of the basic infrastructure of shipping logs from CloudWatch Logs detailed in that post.
part 1 : centralise logging
part 3 : tracking correlation IDs
Much has changed with the serverless paradigm. It solves many of the old problems we faced and replaces them with some new problems that (I think) are easier to deal with.
Consequently, many of the old practices are no longer applicable — e.g. using agents/daemons to buffer and batch send metrics and logs to monitoring and log aggregation services. However, even as we throw away these old practices for the new world of serverless, we are still after the same qualities that made our old tools “good”:
Unfortunately, the current tooling for Lambda — CloudWatch metrics & CloudWatch Logs — falls short on a few of these, some more so than others:
With Lambda, we have to rely on AWS to improve CloudWatch in order to bring us parity with existing “server-ful” services.
Many vendors have announced support for Lambda, such as Datadog and Wavefront. However, as they are using the same metrics from CloudWatch they will have the same lag.
IOPipe is a popular alternative for monitoring Lambda functions and they do things slightly differently — by giving you a wrapper function around your code so they can inject monitoring code (it’s a familiar pattern to those who have used AOP frameworks in the past).
For their 1.0 release they also announced support for tracing (see the demo video below), which I find interesting as AWS already offers X-Ray, a more complete tracing solution (despite its own shortcomings, as I mentioned in this post).
IOPipe seems like a viable alternative to CloudWatch, especially if you’re new to AWS Lambda and just want to get started quickly. I can totally see the value of that simplicity.
However, I have some serious reservations with IOPipe’s approach:
Of all the above, the latency overhead is the biggest concern for me. I already have to deal with cold starts and the invocation latency between API Gateway and Lambda. As your microservice architecture expands and the number of inter-service communications grows, these latencies will compound further.
For background tasks this is less of a concern, but a sizeable portion of the Lambda functions I have written handle HTTP requests, and I need to keep the execution time of these functions as low as possible.
Yubl’s road to Serverless — Part 1: Overview (medium.com)
I find Datadog’s approach for sending custom metrics very interesting. Essentially you write custom metrics as specially-formatted log messages that Datadog will process (you have to set up IAM permissions for CloudWatch to call their function) and track them as metrics.
Datadog allows you to send custom metrics using log messages in their DogStatsD format.
It’s a simple and elegant approach, and one that we can adopt for ourselves even if we decide to use another monitoring service.
In part 1 we established an infrastructure to ship logs from CloudWatch Logs to a log aggregation service of our choice. We can extend the log shipping function to look for log messages that look like these:
Log custom metrics as specially formatted log messages
We will interpret these log messages as:
MONITORING|metric_value|metric_unit|metric_name|metric_namespace
And instead of sending them to the log aggregation service, we’ll send them as metrics to our monitoring service instead. In this particular case, I’m using CloudWatch in my demo (see link below), so the format of the log message reflects the fields I need to pass along in the PutMetricData call.
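As a rough sketch, the relevant branch of the log-shipping function might look something like the following. The function and variable names here are my own for illustration; the actual implementation lives in the demo repo linked below.

```javascript
// illustrative sketch of the custom-metrics branch of the log-shipping function
const AWS = require('aws-sdk');
const cloudwatch = new AWS.CloudWatch();

// "MONITORING|metric_value|metric_unit|metric_name|metric_namespace"
const isCustomMetric = message => message.startsWith('MONITORING|');

const toMetricData = message => {
  const [, value, unit, name, namespace] = message.trim().split('|');
  return { namespace, datum: { MetricName: name, Unit: unit, Value: parseFloat(value) } };
};

const publishCustomMetrics = async logMessages => {
  const metrics = logMessages.filter(isCustomMetric).map(toMetricData);

  // one PutMetricData call per metric for simplicity; a real implementation
  // would group data points by namespace and batch them
  await Promise.all(metrics.map(({ namespace, datum }) =>
    cloudwatch.putMetricData({ Namespace: namespace, MetricData: [datum] }).promise()
  ));
};
```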
To send custom metrics, we write them as log messages. Again, there is no latency overhead as the Lambda service collects these for us and sends them to CloudWatch Logs in the background.
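On the emitting side, all a function needs to do is write a log line in that format. A tiny helper along these lines would do (the logMetric name is just for illustration):

```javascript
// write a custom metric as a specially formatted log message; the log-shipping
// function will pick it up and publish it to the monitoring service for us
const logMetric = (value, unit, name, namespace) =>
  console.log(`MONITORING|${value}|${unit}|${name}|${namespace}`);

// e.g. record how many items this invocation processed
logMetric(42, 'Count', 'ItemsProcessed', 'my-service');
```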
And moments later they’re available in CloudWatch metrics.
Custom metrics are recorded in CloudWatch as expected.
Take a look at the custom-metrics function in this repo.
theburningmonk/lambda-logging-metrics-demo - How to apply Datadog's approach for sending custom metrics asynchronously (github.com)
Lambda reports the amount of memory used, and the billed duration at the end of every invocation. Whilst these are not published as metrics in CloudWatch, you can find them as log messages in CloudWatch Logs.
At the end of every invocation, Lambda publishes a REPORT log message detailing the max amount of memory used by your function during this invocation, and how much time is billed (Lambda charges in 100ms blocks).
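These REPORT messages look roughly like this (the request ID and numbers below are made up for illustration):

```
REPORT RequestId: 3604209a-e9a3-11e6-939a-754dd98c7be3  Duration: 42.31 ms  Billed Duration: 100 ms  Memory Size: 128 MB  Max Memory Used: 35 MB
```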
I rarely find memory usage to be an issue, as Node.js functions have such a small footprint. My choice of memory allocation is primarily based on getting the right balance between cost and performance. In fact, Alex Casalboni of CloudAcademy wrote a very nice blog post on using Step Functions to help you find that sweet spot.
AWS Lambda Power Tuning with AWS Step Functions (serverless.com)
The Billed Duration, on the other hand, is a useful metric when viewed side by side with Invocation Duration. It gives me a rough idea of the amount of wastage I have. For example, if the average Invocation Duration of a function is 42ms but the average Billed Duration is 100ms, then there is a 58% wastage and maybe I should consider running the function on a lower memory allocation.
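As a quick sanity check of that arithmetic:

```javascript
// illustrative only: estimating wastage from the two average durations
const avgInvocationMs = 42;  // average Invocation Duration
const avgBilledMs = 100;     // average Billed Duration
const wastage = 1 - avgInvocationMs / avgBilledMs; // 0.58, i.e. 58% of the billed time is unused
```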
Interestingly, IOPipe records these in their dashboard out of the box.
IOPipe records a number of additional metrics that are not available in CloudWatch, such as Memory Usage and CPU Usage over time, as well as cold starts.
However, we don’t need to add IOPipe just to get these metrics. We can apply a similar technique to the previous section and publish them as custom metrics to our monitoring service.
To do that, we have to look out for these REPORT log messages and parse the relevant information out of them. Each message contains 3 pieces of information we want to extract: the Billed Duration, the Memory Size and the Max Memory Used.
We will parse these log messages and return an array of CloudWatch metric data for each, so we can flat map over them afterwards.
This is a function in the “parse” module, which maps a log message to an array of CloudWatch metric data.
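The actual gist is in the demo repo; below is a minimal sketch of what such a parse.usageMetrics function could look like. The regexes, metric names and module layout are my own assumptions, so the real implementation may well differ.

```javascript
// sketch of a "parse" module that maps a REPORT log message to an array of
// CloudWatch metric data (and any other message to an empty array)
const extract = (message, regex) => {
  const match = message.match(regex);
  return match ? parseFloat(match[1]) : undefined;
};

const usageMetrics = (logGroup, message) => {
  if (!message.startsWith('REPORT RequestId:')) {
    return [];
  }

  // log group names follow the /aws/lambda/{function-name} convention
  const functionName = logGroup.split('/').pop();
  const dimensions = [{ Name: 'FunctionName', Value: functionName }];

  return [
    {
      MetricName: 'BilledDuration',
      Unit: 'Milliseconds',
      Value: extract(message, /Billed Duration: (\d+(\.\d+)?) ms/),
      Dimensions: dimensions
    },
    {
      MetricName: 'MemorySize',
      Unit: 'Megabytes',
      Value: extract(message, /Memory Size: (\d+) MB/),
      Dimensions: dimensions
    },
    {
      MetricName: 'MemoryUsed',
      Unit: 'Megabytes',
      Value: extract(message, /Max Memory Used: (\d+) MB/),
      Dimensions: dimensions
    }
  ];
};

module.exports = { usageMetrics };
```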
Flat map over the CloudWatch metric data returned by the above parse.usageMetrics function and publish them.
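And a rough sketch of the publishing side, assuming the CloudWatch Logs event has already been decoded into the log group name and a list of log messages (again, these names are illustrative rather than the repo's actual code):

```javascript
const AWS = require('aws-sdk');
const cloudwatch = new AWS.CloudWatch();
const parse = require('./parse'); // the module sketched above

// publish usage metrics for every REPORT message in a decoded CloudWatch Logs event
const publishUsageMetrics = async (logGroup, logMessages) => {
  // flat map: each REPORT message yields an array of metric datum objects
  const metricData = logMessages.flatMap(msg => parse.usageMetrics(logGroup, msg));

  if (metricData.length > 0) {
    // note: PutMetricData accepts at most 20 data points per call, so a real
    // implementation should batch accordingly
    await cloudwatch.putMetricData({
      Namespace: 'lambda-usage-metrics', // illustrative namespace
      MetricData: metricData
    }).promise();
  }
};
```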
And sure enough, after subscribing the log group for an API (created in the same demo project to test this) and invoking the API, I’m able to see these new metrics show up in CloudWatch metrics.
Looking at the graph, maybe I can reduce my cost by running it on a much smaller memory size.
Take a look at the usage-metrics function in this repo.
theburningmonk/lambda-logging-metrics-demo - How to apply Datadog's approach for sending custom metrics asynchronously (github.com)
When processing CloudWatch Logs with Lambda functions, you need to be mindful of the number of concurrent executions it creates, so as not to run afoul of the concurrent execution limit.
Since this is an account-wide limit, it means your log-shipping function can cause cascade failures throughout your entire application. Critical functions can be throttled because too many executions are used to push logs out of CloudWatch Logs — not a good way to go down ;-)
What we need is a more fine-grained throttling mechanism for Lambda. It’s fine to have an account-wide limit, but we should be able to create pools of functions that can have slices of that limit. For example, tier-1 functions (those serving the core business needs) get 90% of the available concurrent executions, whilst tier-2 functions (BI, monitoring, etc.) get the other 10%.
As things stand, we don’t have that, and the best you can do is to keep the execution of your log-shipping function brief. Maybe that means fire-and-forget when sending logs and metrics; or send the decoded log messages into a Kinesis stream where you have more control over parallelism.
Or, maybe you’ll monitor the execution count of these tier-2 functions, and when the number of executions per minute breaches some threshold you’ll temporarily unsubscribe log groups from the log-shipping function to alleviate the problem.
Or, maybe you’ll install some bulkheads by moving these tier-2 functions into a separate AWS account and use cross-account invocation to trigger them. But this seems a really heavy-handed way to work around the problem!
Public & Cross-account Functions on AWS Lambda (read.iopipe.com)
Point is, it’s not a solved problem and I haven’t come across a satisfying workaround yet. AWS is aware of this gap and hopefully they’ll add support for better control over concurrent executions.
Hi, my name is Yan Cui. I’m an AWS Serverless Hero and the author of Production-Ready Serverless. I have run production workloads at scale on AWS for nearly 10 years and have been an architect or principal engineer in a variety of industries, ranging from banking and e-commerce to sports streaming and mobile gaming. I currently work as an independent consultant focused on AWS and serverless.
You can contact me via Email, Twitter and LinkedIn.
Check out my new course, Complete Guide to AWS Step Functions.
In this course, we’ll cover everything you need to know to use the AWS Step Functions service effectively, including basic concepts, HTTP and event triggers, activities, design patterns and best practices.
Get your copy here.
Come learn about operational BEST PRACTICES for AWS Lambda: CI/CD, testing & debugging functions locally, logging, monitoring, distributed tracing, canary deployments, config management, authentication & authorization, VPC, security, error handling, and more.
You can also get 40% off the face price with the code ytcui.
Get your copy here.