Observability is about the ability to troubleshoot unknown issues that might happen in your application. If you are not familiar with it, I recommend watching "How to Build Observable Distributed Systems" and "The Present and Future of Serverless Observability" from QCon 2018.

In this article, I'm going to explain how some of the most prominent observability tools¹ performed against my test scenarios, and complement that with an overview of each tool's pros and cons. I have tested those tools against my Node.js-based Serverless app. It's deployed on AWS, built with the Serverless framework, and uses Proxy Integration. You can find the code in my GitHub.

P.S.: Since I published this post, some vendors have improved on their negative points / test results. I'm planning to write a revised blog post to test their claims and reflect their changes. Until then, please read their comments at the end of this article to learn about the improvements they claim.

Table of Contents

Test Scenarios
AWS X-RAY
Dashbird
Thundra
IOPipe
Workaround
Conclusion

Test Scenarios

I have tested all those tools against three scenarios, performing load testing with Gatling:

- Test Scenario 1: The Lambda function times out due to a high number of concurrent users and repetitions. Just as a reminder: currently, Lambda's built-in metric "Throttles" does not show timeout errors.
- Test Scenario 2: The Lambda function's response doesn't have the proper format; the "body" property hasn't been stringified. This error can happen due to negligence.
- Test Scenario 3: DynamoDB throws a ConditionalCheckFailedException because the app tries to record an item with a duplicate value for the partition key.

A minimal sketch of scenarios 2 and 3 follows at the end of this section.

In scenarios 1 and 2, the user receives a 502 Bad Gateway and the vague response "internal server error". That's why a proper observability tool is needed to troubleshoot these cases, especially in a big distributed application.

If a tool has passed a test, it means it was able to detect the problem and show it to the admin, enabling faster troubleshooting. If a tool has failed to meet the aforementioned criteria, the result is marked as failed. However, some tools have partly passed a test; in those cases I've explained their behaviour.
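To make scenarios 2 and 3 concrete, here is a minimal sketch of the kind of handler involved. This is not the code of the app under test; the table and attribute names are made up for illustration.

```javascript
// Minimal sketch of scenarios 2 and 3 (hypothetical table and key names).
const AWS = require('aws-sdk');
const dynamo = new AWS.DynamoDB.DocumentClient();

exports.handler = async (event) => {
  const item = JSON.parse(event.body);

  // Scenario 3: the conditional put throws a ConditionalCheckFailedException
  // when an item with the same partition key already exists.
  await dynamo.put({
    TableName: 'users',                           // hypothetical table
    Item: item,
    ConditionExpression: 'attribute_not_exists(userId)'
  }).promise();

  // Scenario 2: with Proxy Integration, "body" must be a string. Returning
  // the raw object instead makes API Gateway answer 502 to the caller,
  // while the function itself finishes without any error.
  return {
    statusCode: 200,
    // body: { message: 'created' }               // buggy: not stringified
    body: JSON.stringify({ message: 'created' })  // correct
  };
};
```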
AWS X-RAY

Results

There has been a new improvement since a few weeks ago: now, in case of a timeout error, X-Ray shows that there is an error, but it still doesn't clarify what the problem is. So it expects the user to guess it or troubleshoot it with other tools. Also, the UI is confusing: in one place it indicates that there is an error, but a few lines below it indicates that there is no error (pic1). Apparently, this new improvement is still under development.

- Test 1: Partly passed. Apart from the improvement described above, everything looks OK according to X-Ray, even though the end user gets an error.
- Test 2: Failed.
- Test 3: Passed. It indicates that there is an error and also shows the exception stack trace.

Pic1: X-Ray doesn't clarify the error type. The UI is also contradictory: hovering over the clock icon shows "no faults or error".

Pros
- Lambda has a built-in agent for X-Ray. The agent sends data in batches and asynchronously, so using it doesn't add extra latency to your function.
- Responsive support team. You can communicate with them in the AWS Developers forum.
- Managed service by AWS, so it's supposed to be richer in functionality and more compliant with AWS best practices.

Cons
- Confusing UI: besides pic1, look at pic2: even though I have chosen to see buggy traces, I see a confusing "200" response for each trace. The 200 response indicates that the X-Ray service has returned a response; it doesn't mean that the trace is successful. This is not what most users expect to see and it can lead to wrong interpretations. Yan Cui has addressed this in his blog post "aws x-ray and lambda: the good, the bad and the ugly". Worse, this issue hasn't been solved even though a year has passed.
- Slow-paced integration: still, only a few AWS services are integrated with X-Ray. So if you are using DynamoDB or S3, X-Ray provides you with inferred segments (which means a lack of detail), because even at the time of writing this article, those services haven't been actively integrated with X-Ray.
- Immature: there is still room for improvement. For example, they need to add more functionality for better debugging, especially to include custom errors in an easier and neater way. Apparently, they are working on making the Lambda-generated segment accessible.

Pic2: Buggy traces are shown with a 200 response, which is confusing.

Dashbird

Results

- Test 1: Passed
- Test 2: Failed
- Test 3: Partly passed. Dashbird's UI is not clear and can be confusing: at a high level, it shows the trace as successful, but digging into the trace, it shows the error (pic3). Surprisingly, this is unlike X-Ray's result, even though Dashbird's trace is based on X-Ray. Its behaviour can be justified by saying that "an exception might not necessarily mean the function has failed." This is true, but if there is an exception in the trace, the user should at least be notified of it (e.g. from the high-level picture); showing just a green and pretty "success" on the trace is misleading. In my opinion, having an exception usually means something is wrong and worth investigating, unless the exception has been proactively caught and handled.

Pic3: Dashbird shows a buggy trace as successful; the user would have to investigate all traces and dig through them thoroughly to find whether there is an exception.

Pros
- Very easy to set up; Dashbird's CloudFormation template does almost everything. Once the CloudFormation stack is set up, Dashbird starts observing all your functions in all regions (this can be bad from a security point of view; I have addressed this in the Cons section).
- Nice UI and some nice features like live tailing, which enables real-time monitoring of specific functions, as well as an alerting feature.
- Gets data from CloudWatch Logs and AWS X-Ray, so it doesn't add extra latency to your functions.
- Friendly and supportive customer service.
- Shows logs from the whole function execution time, not just the error. This makes debugging easier (pic4).

Pic4: Example log of a buggy trace: Dashbird shows logs from the whole function execution time.

Cons
- Documentation is outdated and misleading: in the "getting started" section, it asks you to set up a specific IAM policy. But this policy doesn't have any effect; it's there just because the documentation hasn't been updated.
- Provides few, basic and somewhat misleading statistics, e.g. average duration and average memory usage. Average is a misleading factor for web performance analysis, so the team needs to provide percentile-based statistics (see the sketch after this list).
- Shows cold starts, but only individually. Cold start statistics are needed.
- Inherits the limitations of CloudWatch (e.g. granularity, delay) and X-Ray, because it's based on them.
- Has a major security concern: it has access to all your data, and you cannot limit it. You can filter data in the Dashbird app, but this doesn't stop it from receiving your data.
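On the percentile point above: you don't have to wait for a vendor to add this, because CloudWatch itself can return percentile statistics for the built-in Lambda metrics, regardless of the tool sitting on top of it. Below is a minimal sketch using the AWS SDK for Node.js; the function name and time window are placeholders.

```javascript
// Query p95/p99 duration (in milliseconds) for a Lambda function directly
// from CloudWatch. Function name and time window are placeholders.
const AWS = require('aws-sdk');
const cloudwatch = new AWS.CloudWatch();

async function durationPercentiles(functionName) {
  const end = new Date();
  const start = new Date(end.getTime() - 24 * 60 * 60 * 1000); // last 24 hours

  const { Datapoints } = await cloudwatch.getMetricStatistics({
    Namespace: 'AWS/Lambda',
    MetricName: 'Duration',
    Dimensions: [{ Name: 'FunctionName', Value: functionName }],
    StartTime: start,
    EndTime: end,
    Period: 3600,                      // one datapoint per hour
    ExtendedStatistics: ['p95', 'p99'] // percentiles instead of averages
  }).promise();

  return Datapoints.map(d => ({
    time: d.Timestamp,
    p95: d.ExtendedStatistics.p95,
    p99: d.ExtendedStatistics.p99
  }));
}

durationPercentiles('my-function').then(console.log).catch(console.error);
```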
Acknowledgment: Thanks to Taavi Rehemägi, co-founder of Dashbird, for extending my trial period and enabling me to investigate their SaaS.

Thundra

Results

- Test 1: Failed
- Test 2: Failed
- Test 3: Failed

Pros
- Easy to set up.
- Instruments the code, which enables Thundra to provide a deep technical overview. This can also be an alternative to AWS X-Ray.
- Takes advantage of asynchronous publishing, so it doesn't add extra latency to function execution time.
- Has informative diagrams and dashboards. It also provides a clear list of functions, accompanied by their statistics (including the total number of cold starts each function has experienced).
- Has a more conservative approach toward security than e.g. Dashbird. It doesn't need access to all your data.
- Is based on a good idea, and it's great that there is at least an alternative to X-Ray.
- Has friendly and supportive customer support.

I talked with its product manager about support for Node.js apps. Apparently, Thundra is focused on Java applications, and its Node.js-related features are far behind. I haven't had time to investigate its Java features, but if you are running a Java-based Serverless app, I recommend taking a look at Thundra.

Cons
- Bad support for Node.js. At the time of writing this article, I don't see any convincing reason to observe my Node.js app with Thundra.
- Statistics are based on averages, which is misleading.

IOPipe

Results

- Test 1: Passed
- Test 2: Failed
- Test 3: Failed

Pros
- Provides a few percentile-based statistics.
- Easy to set up.
- Has an alerting feature.
- Has search functionality: you can search through your invocations by different keywords, e.g. the requestId from CloudWatch. However, my rough initial guess is that, to troubleshoot a complex distributed application, you may need a more sophisticated and comprehensive logging tool than IOPipe's. But this is of course up to your use case.

Cons
- In their current approach, IOPipe sends data to its own system synchronously; this adds extra latency to your function execution time and goes against the best practice described in "Serverless Architectures with AWS Lambda": "Capture the metric within your Lambda function code and log it using the provided logging mechanisms in Lambda." IOPipe's approach has been investigated further in "Tips and tricks for logging and monitoring AWS Lambda functions" as well as "Dashbird vs Datadog vs IOpipe". This is a serious concern, and that's why I don't recommend the use of IOPipe unless they resolve this issue. It seems that their team is working on this and trying to come up with an asynchronous and optimised alternative. Let's wait for the result!
- Doesn't show statistics for cold starts.
- Doesn't provide tracing.
- Shows the log just for the error stack trace, which might not be very convenient. It would be helpful if it showed the log from the whole function execution, similar to what Dashbird does.

Workaround

To detect errors, you can act proactively and use monitoring, instead of or in combination with observability tools. You are advised to use CloudWatch. Lambda has a built-in agent to send logs to CloudWatch, and using it doesn't add extra latency to your functions, unless you publish Custom Metrics. To achieve this in an optimised way, you can create a Metric Filter. For example, for error scenario 1, your Metric Filter can have a Filter Pattern such as "Task timed out after". The Metric Filter then searches your log events and, whenever it finds a match, increments the value of the corresponding CloudWatch metric. Subsequently, you can set a CloudWatch alarm on that metric and publish a notification, e.g. via SNS (see the sketch below). Also, to get the full potential of CloudWatch, you can use Structured Logging by logging in JSON.
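To make this workaround concrete, here is a minimal sketch that creates the Metric Filter and the alarm with the AWS SDK for Node.js. The log group name, metric namespace and SNS topic ARN are placeholders, and the same resources can just as well be declared in your CloudFormation or Serverless template.

```javascript
// Sketch: alert on Lambda timeouts (scenario 1) via a Metric Filter + alarm.
// Log group, namespace and SNS topic ARN are placeholders.
const AWS = require('aws-sdk');
const logs = new AWS.CloudWatchLogs();
const cloudwatch = new AWS.CloudWatch();

async function alarmOnTimeouts() {
  // Increment a custom metric whenever a log event matches the pattern.
  await logs.putMetricFilter({
    logGroupName: '/aws/lambda/my-function',
    filterName: 'timeouts',
    filterPattern: '"Task timed out after"',
    metricTransformations: [{
      metricName: 'Timeouts',
      metricNamespace: 'MyApp',
      metricValue: '1'
    }]
  }).promise();

  // Raise an alarm (and notify via SNS) as soon as one timeout appears.
  await cloudwatch.putMetricAlarm({
    AlarmName: 'my-function-timeouts',
    Namespace: 'MyApp',
    MetricName: 'Timeouts',
    Statistic: 'Sum',
    Period: 60,
    EvaluationPeriods: 1,
    Threshold: 1,
    ComparisonOperator: 'GreaterThanOrEqualToThreshold',
    TreatMissingData: 'notBreaching',
    AlarmActions: ['arn:aws:sns:eu-west-1:123456789012:alerts'] // placeholder
  }).promise();
}

alarmOnTimeouts().catch(console.error);
```

The same pattern works for any other log signature you want to be alerted on.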
Conclusion

I haven't had time to investigate all serverless observability tools. However, based on my investigation of the most prominent ones, all of them are immature or incomplete to some degree, and need to improve. There is no single solution that you can use to observe your distributed app thoroughly and perfectly. Surprisingly, some issues, like error 2, haven't been addressed by any of the solutions (not even in CloudWatch logs). But that error can happen due to negligence, as I experienced when a friend asked me to debug his app because he was wondering why the end user just gets an error. Everything looked OK, no error or exception, but the end user was getting an error. After debugging it for around an hour, the only thing that came to my mind was the output format. And I was right: he had forgotten to JSON.stringify() the body property of his function's output, and AWS Proxy Integration was failing silently. It was a simple application, but imagine this happening in a big and complex distributed app. How are you supposed to find it?

To achieve observability, you need to use different solutions in tandem and also get help from deep monitoring via Structured Logging. Pierre Vincent has addressed this in his QCon presentation "How to Build Observable Distributed Systems".

My 3 test scenarios are just examples. What other issues and errors do you think should be prioritized for observability? Do you know of any other tool that excels beyond the ones mentioned above? What's your opinion about the current status of serverless observability IN PRACTICE?

Footnotes

¹ This is just my opinion.