OK, I understand that it is kind of funny to have a monitor for your monitors and no, this is not a Halloween trick. The use case here is something that maybe a lot of you out there have faced before, like I did.
Let’s think for a second about a simple, common scenario that all of us IT guys deal with on a daily basis. You have your infrastructure, your applications and your fine monitoring, which gathers metrics from all sorts of components and aggregates them in CloudWatch (or any other place) so you can joyfully see all those colorful charts. One day your client calls and informs you that he cannot use your services due to an issue with your API or website. You start wondering what is happening, as you “didn’t receive any alerts”; by the time you get to your awesome metrics you see that all of them are empty because the application that feeds them to CloudWatch has DIED! #dramaticbeaver
Here at my work we have such a setup and, of course, we have felt this pain and, of course, it was on a weekend and, of course, it was past 10 PM and, of course, we were only alerted by a client who, of course, was having timeouts. #opslife
In our environment we have a lot of microservices, backend and user-facing applications that all output data that we grab and ship to CloudWatch so we can build metrics and charts and get alerted by them. This is an easy-to-build, AWS-managed setup that suits us really well and helps us troubleshoot all sorts of issues. The problem is that the software that groups and ships those metrics is not bug free. Like any other software it can fail and it WILL FAIL.
The first idea any IT professional will come up with is: “LET’S MONITOR IT!”. OK, this is nice, but if you think for a second you are creating a monitor for a monitor, and this new monitor can also fail, so maybe in the future you will need a new monitor to monitor the monitor for your monitors. But hey! This new monitor can also fail, so maybe we should also have a monitor to monitor the new monitor that is monitoring the monitor of your monitor... I think you know where I am going with this, right?
So, we need a new monitor. Period. But one that we can trust like 99% of the time (I’LL NEVER SAY THIS IS 100% GUARANTEED), so that we do not end up falling into the infinite loop of monitors.
You have your monitor or metric shipper, call it whatever you want, and you need to make sure all the metrics you push to CloudWatch are actually there. Thank Bezos, AWS has an awesome API for all their services and CloudWatch is no exception, so let’s build a Lambda function that simply reads the metrics you expect to have and alerts you when it cannot find one of them. In this example I am generating a PagerDuty alert, but you can swap it for whatever tool you have. For example, you can push a Slack notification instead.
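To make the idea concrete, here is a minimal sketch of the "read the metric and see if anything is there" check. It is not the code from the gist: the function name `find_missing_metrics`, the `EXPECTED_METRICS` list and the 10-minute window are all made up for illustration. The only real API call is Boto3's `get_metric_statistics`, and the sketch assumes your metrics are published without dimensions (otherwise you would also have to pass `Dimensions`).

```python
from datetime import datetime, timedelta, timezone

# Hypothetical (namespace, metric name) pairs that the shipper should publish.
EXPECTED_METRICS = [
    ("MyApp", "RequestCount"),
    ("MyApp", "Latency"),
]


def find_missing_metrics(cloudwatch, metrics, window_minutes=10):
    """Return the metrics that have no datapoints in the last `window_minutes`.

    `cloudwatch` is anything exposing get_metric_statistics(), for example
    boto3.client("cloudwatch").
    """
    end = datetime.now(timezone.utc)
    start = end - timedelta(minutes=window_minutes)
    missing = []
    for namespace, name in metrics:
        try:
            response = cloudwatch.get_metric_statistics(
                Namespace=namespace,
                MetricName=name,
                StartTime=start,
                EndTime=end,
                Period=60,
                Statistics=["SampleCount"],
            )
            if not response.get("Datapoints"):
                missing.append((namespace, name))
        except Exception:
            # Assumption: a failed read is treated the same as a missing metric.
            missing.append((namespace, name))
    return missing
```

Inside the Lambda you would call it as `find_missing_metrics(boto3.client("cloudwatch"), EXPECTED_METRICS)` and alert when the returned list is not empty. Passing the client in as a parameter also lets you try the logic locally with a fake object.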
Here in this gist you may find the lambda code.
An overview of the code:
- main(): iterates over our metrics, tries to read each one from the AWS CloudWatch service and counts how many of them failed during this process. If the total is greater than zero, it generates an incident alert;
- pagerduty(): this one basically creates the incident for us;
- lambda_handler(): the AWS Lambda service needs an entry point to call inside your code, a function that receives two parameters, the event and the context. We ignore most of both and only use the Event ID, which we write to our logs.
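The three pieces above fit together roughly like this. Treat it as a sketch under assumptions, not as the gist itself: I am assuming the PagerDuty Events API v2 endpoint, a routing key stored in a hypothetical `PAGERDUTY_ROUTING_KEY` environment variable, and a placeholder `read_failed_metrics()` standing in for the CloudWatch reads. The helper `build_event()` and the `metrics-watchdog` source name are also made up.

```python
import json
import os
import urllib.request

PAGERDUTY_URL = "https://events.pagerduty.com/v2/enqueue"


def build_event(routing_key, failed_metrics):
    """Build a PagerDuty Events API v2 'trigger' payload."""
    summary = "{} CloudWatch metric(s) missing: {}".format(
        len(failed_metrics), ", ".join(failed_metrics)
    )
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": "metrics-watchdog",  # hypothetical source name
            "severity": "critical",
        },
    }


def pagerduty(routing_key, failed_metrics):
    """Create the incident by POSTing the event to PagerDuty."""
    body = json.dumps(build_event(routing_key, failed_metrics)).encode("utf-8")
    request = urllib.request.Request(
        PAGERDUTY_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


def read_failed_metrics():
    """Placeholder: the real function would query CloudWatch for each metric."""
    return []


def main(check_metrics, notify):
    """Count the failed metrics and alert if there is at least one."""
    failed = check_metrics()
    if len(failed) > 0:
        notify(failed)
    return len(failed)


def lambda_handler(event, context):
    # Scheduled events carry an "id" field; it is the only part we log.
    print("Watchdog run for event {}".format(event.get("id")))
    routing_key = os.environ["PAGERDUTY_ROUTING_KEY"]
    failed = main(read_failed_metrics, lambda names: pagerduty(routing_key, names))
    return {"failed_metrics": failed}
```

Keeping `main()` free of any direct AWS or PagerDuty calls is a design choice of this sketch: you can run it on your laptop with a fake check and a fake notifier before deploying anything.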
In order to create the lambda you will need to generate a zip file containing not only the code above but also the libraries you will use. In this case we are only using some core libraries and the Boto3 package, which you need to install in the same directory as your code. Refer to this guide here to do that on your computer; the procedure will vary depending on your OS.
Creating the actual lambda function is outside the scope of this article, but you can use this document to do that. Basically, what you need is:
- A zip file containing the source code of the lambda that you created following the steps provided in the above link;
- A CloudWatch Event Rule to use as a scheduled trigger for your lambda (I am running mine like a cron job, every 5 minutes);
- An IAM Role for your lambda function to use; this role must have read access to CloudWatch.
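For the last two items, this is roughly what the schedule and the permissions could look like, written as Python data so you can print the JSON and paste it wherever you manage IAM. The exact action list is my assumption of a reasonable minimum (read metrics, write the function's own logs), and the names `READ_ONLY_POLICY`, `SCHEDULE_EXPRESSION` and `render_policy` are only illustrative; adjust them to your own setup.

```python
import json

# Assumed minimal permissions: read CloudWatch metrics and write Lambda logs.
READ_ONLY_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "cloudwatch:GetMetricStatistics",
                "cloudwatch:ListMetrics",
            ],
            "Resource": "*",
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents",
            ],
            "Resource": "*",
        },
    ],
}

# Schedule expression for the event rule: run every 5 minutes.
SCHEDULE_EXPRESSION = "rate(5 minutes)"


def render_policy(policy=READ_ONLY_POLICY):
    """Return the policy document as pretty-printed JSON."""
    return json.dumps(policy, indent=2)


if __name__ == "__main__":
    print(render_policy())
```

If you prefer cron syntax for the rule, `cron(0/5 * * * ? *)` should give the same 5-minute cadence.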
Now you have a cost-effective, simple but trustworthy solution that monitors your monitors and alerts you before your customer does!