auto-create CloudWatch Alarms for APIs with Lambda

by Yan Cui (@theburningmonk), May 13th, 2018

Too Long; Didn't Read

Yan Cui is an AWS Serverless Hero and the author of Production-Ready Serverless. Yan explains how to use CloudTrail and CloudWatch Events to automate many day-to-day operational steps with Lambda. These are manual steps that often get missed, but they can be easily automated. A Lambda function reacts to each API Gateway deployment to enable detailed metrics and create CloudWatch Alarms for each endpoint. Yan uses the serverless-iam-roles-per-function plugin to give the function a tailored IAM role, which includes the apigateway:PATCH permission needed to enable detailed metrics.

In a previous post we discussed how to auto-subscribe a CloudWatch Log Group to a Lambda function using CloudWatch Events, so that we don’t need a manual process to ensure all Lambda logs go to our log aggregation service.

Whilst this is useful in its own right, it only scratches the surface of what we can do. CloudTrail and CloudWatch Events make it easy to automate many day-to-day operational steps. With the help of Lambda of course ;-)

I work with API Gateway and Lambda heavily. Whenever you create a new API, or make changes, there are several things you need to do:

  • enable Detailed Metrics for the deployment stage
  • set up a dashboard in CloudWatch, showing request count, latencies and error counts
  • set up CloudWatch Alarms for p99 latencies and error counts

Because these are manual steps, they often get missed.

Have you ever forgotten to update the dashboard after adding a new endpoint to your API? And did you also remember to set up a p99 latency alarm on this new endpoint? How about alarms on the number of 4xx or 5xx errors?

Most teams I have dealt with have some conventions around these, but no way to enforce them. The result is that the convention is applied in patches and cannot be relied upon. I find this approach doesn’t scale with the size of the team.

It works when you’re a small team. Everyone has a shared understanding, and the necessary discipline to follow the convention. When the team gets bigger, you need automation to help enforce these conventions.

Fortunately, we can automate away these manual steps using the same pattern. In the Monitoring unit of my course Production-Ready Serverless, I demonstrated how you can do this in 3 simple steps:

  • CloudTrail captures the CreateDeployment request to API Gateway.
  • A CloudWatch Events pattern matches against this captured request.
  • A Lambda function is triggered to a) enable detailed metrics, and b) create alarms for each endpoint.
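As a rough illustration of the second step, the event pattern for such a rule might look like the dictionary below. The field values follow the usual shape of CloudTrail-sourced API Gateway events; the `matches` helper is hypothetical and only mimics, in a very simplified way, how a pattern is compared against an event so the idea can be tried locally.

```python
# Pattern for a CloudWatch Events rule that should fire when CloudTrail
# records a CreateDeployment call against API Gateway.
CREATE_DEPLOYMENT_PATTERN = {
    "source": ["aws.apigateway"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["apigateway.amazonaws.com"],
        "eventName": ["CreateDeployment"],
    },
}


def matches(pattern, event):
    """Hypothetical, simplified matcher for illustration only.

    Every key in the pattern must exist in the event; a list in the
    pattern means "the event value must be one of these".
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict):
                return False
            if not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True
```

Any event whose `eventName` is something other than `CreateDeployment` would be ignored by a rule using this pattern, so the Lambda function only runs when an API is actually deployed.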

If you use the Serverless framework, then you might define this function in your serverless.yml. A couple of things to note about its configuration:

  • I’m using the serverless-iam-roles-per-function plugin to give the function a tailored IAM role
  • The function needs the apigateway:PATCH permission to enable detailed metrics
  • The function needs the apigateway:GET permission to get the API name and REST endpoints
  • The function needs the cloudwatch:PutMetricAlarm permission to create the alarms
  • The environment variables specify SNS topics for the CloudWatch Alarms
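To make the permissions above concrete, here is a minimal sketch of the IAM statements such a role might contain, written as plain Python data. The three actions come from the list above; the resource ARNs, the default region and the environment variable name `ALARM_TOPIC_ARNS` are assumptions made for illustration and should be adapted to your own setup.

```python
def build_iam_statements(region="us-east-1"):
    """Return IAM policy statements covering the three permissions.

    The resource scoping below is an assumption, not a requirement;
    tighten or loosen it to suit your account.
    """
    rest_apis = f"arn:aws:apigateway:{region}::/restapis/*"
    return [
        # read the API name and its REST resources
        {"Effect": "Allow", "Action": "apigateway:GET", "Resource": rest_apis},
        # patch the deployment stage to enable detailed metrics
        {
            "Effect": "Allow",
            "Action": "apigateway:PATCH",
            "Resource": f"{rest_apis}/stages/*",
        },
        # create (or overwrite) the alarms
        {"Effect": "Allow", "Action": "cloudwatch:PutMetricAlarm", "Resource": "*"},
    ]


def read_topic_arns(env):
    """Read SNS topic ARNs from a comma-separated environment variable.

    ALARM_TOPIC_ARNS is a hypothetical variable name.
    """
    raw = env.get("ALARM_TOPIC_ARNS", "")
    return [arn.strip() for arn in raw.split(",") if arn.strip()]
```

Inside the function you would call `read_topic_arns(os.environ)` and pass the result to the alarms as their notification targets.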

In the captured event, we can find the restApiId and stageName inside the detail.requestParameters attribute. That’s all we need to figure out what endpoints are there, and so what alarms we need to create.
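A small sketch of pulling those two values out of the event is shown below. The exact nesting of `stageName` is an assumption: this helper accepts it either directly under `requestParameters` or nested under a `createDeploymentInput` object, and the sample event is heavily trimmed with made-up values.

```python
def extract_deployment_target(event):
    """Return (rest_api_id, stage_name) from a captured CreateDeployment event."""
    params = event["detail"]["requestParameters"]
    rest_api_id = params["restApiId"]

    # Assumption: stageName may sit at the top level or inside
    # createDeploymentInput, so check both places.
    stage_name = params.get("stageName")
    if stage_name is None:
        stage_name = (params.get("createDeploymentInput") or {}).get("stageName")
    if not stage_name:
        raise ValueError("CreateDeployment event does not contain a stageName")

    return rest_api_id, stage_name


# Trimmed, hypothetical example event (IDs and names are made up).
SAMPLE_EVENT = {
    "source": "aws.apigateway",
    "detail-type": "AWS API Call via CloudTrail",
    "detail": {
        "eventSource": "apigateway.amazonaws.com",
        "eventName": "CreateDeployment",
        "requestParameters": {
            "restApiId": "abc123",
            "createDeploymentInput": {"stageName": "dev"},
        },
    },
}
```

Raising an error when no stage name is present keeps the function from silently creating alarms for the wrong stage.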

Inside the handler function, which you can find here, we perform a few steps:

  • enable detailed metrics with an updateStage call to API Gateway
  • get the list of REST endpoints with a getResources call to API Gateway
  • get the REST API name with a getRestApi call to API Gateway
  • for each of the REST endpoints, create a p99 latency alarm in the AWS/ApiGateway namespace
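The API Gateway calls in these steps can be sketched with boto3 as follows. This is not the original handler, only an approximation under a few assumptions: the helper names are made up, a single page of up to 500 resources is fetched, and the client is passed in so the logic can be exercised with a stub.

```python
def enable_detailed_metrics(apigw, rest_api_id, stage_name):
    """Turn on detailed metrics for every resource and method in the stage."""
    apigw.update_stage(
        restApiId=rest_api_id,
        stageName=stage_name,
        patchOperations=[
            {"op": "replace", "path": "/*/*/metrics/enabled", "value": "true"}
        ],
    )


def list_endpoints(apigw, rest_api_id):
    """Return (path, method) pairs for every method defined on the API."""
    # Assumption: the API has at most 500 resources, so one page is enough.
    response = apigw.get_resources(restApiId=rest_api_id, limit=500)
    endpoints = []
    for item in response["items"]:
        # Resources without any methods have no resourceMethods key.
        for method in item.get("resourceMethods", {}):
            endpoints.append((item["path"], method))
    return endpoints


def prepare_api(rest_api_id, stage_name, apigw=None):
    """Enable metrics, then return the API name and its endpoints."""
    if apigw is None:
        import boto3  # imported lazily so a stub client can be injected

        apigw = boto3.client("apigateway")
    enable_detailed_metrics(apigw, rest_api_id, stage_name)
    api_name = apigw.get_rest_api(restApiId=rest_api_id)["name"]
    return api_name, list_endpoints(apigw, rest_api_id)
```

The API name matters because the per-endpoint metrics in the AWS/ApiGateway namespace are keyed by name, not by the REST API ID.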

Now, every time I create a new API, I will have CloudWatch Alarms to alert me when the 99th percentile latency for an endpoint stays over 1 second for 5 minutes in a row.
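One way to express that alarm with boto3's `put_metric_alarm` is sketched below. Treating "5 minutes in a row" as five consecutive one-minute periods is an assumption, as are the alarm naming scheme and the function names; the API Gateway Latency metric is reported in milliseconds, hence the threshold of 1000.

```python
def p99_latency_alarm(api_name, stage_name, resource, method, topic_arns):
    """Build put_metric_alarm arguments for one endpoint's p99 latency."""
    return {
        # Hypothetical naming scheme; any unique name works.
        "AlarmName": f"API [{api_name}] stage [{stage_name}] {method} {resource} : p99 > 1s",
        "Namespace": "AWS/ApiGateway",
        "MetricName": "Latency",
        "Dimensions": [
            {"Name": "ApiName", "Value": api_name},
            {"Name": "Resource", "Value": resource},
            {"Name": "Method", "Value": method},
            {"Name": "Stage", "Value": stage_name},
        ],
        "ExtendedStatistic": "p99",
        "Period": 60,            # seconds per evaluation period
        "EvaluationPeriods": 5,  # five breaching periods in a row
        "Threshold": 1000,       # milliseconds, i.e. 1 second
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": list(topic_arns),
        "OKActions": list(topic_arns),
    }


def create_p99_alarms(cloudwatch, api_name, stage_name, endpoints, topic_arns):
    """Create one alarm per (resource, method) pair; returns the alarm names."""
    names = []
    for resource, method in endpoints:
        alarm = p99_latency_alarm(api_name, stage_name, resource, method, topic_arns)
        cloudwatch.put_metric_alarm(**alarm)
        names.append(alarm["AlarmName"])
    return names
```

Because `put_metric_alarm` overwrites an existing alarm with the same name, re-running this on every deployment should be safe.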

All this, with just a few lines of code :-)

You can take this further, and have other Lambda functions to:

  • create CloudWatch Alarms for 5xx errors for each endpoint
  • create CloudWatch Dashboard for the API
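As a starting point for those two extensions, the sketch below builds the arguments for a 5xx alarm and a simple dashboard body. The threshold, period, widget layout and function names are all assumptions chosen for illustration; the results would be passed to boto3's `put_metric_alarm` and `put_dashboard` respectively.

```python
import json


def error_5xx_alarm(api_name, stage_name, resource, method, topic_arns, threshold=1):
    """Build put_metric_alarm arguments for 5xx errors on one endpoint."""
    return {
        "AlarmName": f"API [{api_name}] stage [{stage_name}] {method} {resource} : 5xx errors",
        "Namespace": "AWS/ApiGateway",
        "MetricName": "5XXError",
        "Dimensions": [
            {"Name": "ApiName", "Value": api_name},
            {"Name": "Resource", "Value": resource},
            {"Name": "Method", "Value": method},
            {"Name": "Stage", "Value": stage_name},
        ],
        "Statistic": "Sum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": threshold,  # hypothetical default: alert on any 5xx
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",  # no traffic is not an error
        "AlarmActions": list(topic_arns),
    }


def dashboard_body(api_name, stage_name, region):
    """Return a JSON dashboard body with request count, latency and errors."""
    dims = ["ApiName", api_name, "Stage", stage_name]

    def widget(title, metric_names, stat, y):
        return {
            "type": "metric",
            "x": 0, "y": y, "width": 24, "height": 6,
            "properties": {
                "title": title,
                "region": region,
                "period": 60,
                "stat": stat,
                "metrics": [["AWS/ApiGateway", name] + dims for name in metric_names],
            },
        }

    widgets = [
        widget("Request count", ["Count"], "Sum", 0),
        widget("p99 latency", ["Latency"], "p99", 6),
        widget("Errors", ["4XXError", "5XXError"], "Sum", 12),
    ]
    return json.dumps({"widgets": widgets})
```

With a CloudWatch client, the dashboard could then be created with `cloudwatch.put_dashboard(DashboardName=api_name, DashboardBody=dashboard_body(...))`.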

So there you have it, a useful pattern for automating away manual ops tasks!

And before you even have to ask, yes I’m aware of this serverless plugin by the ACloudGuru folks. It looks neat, but it’s ultimately still something the developer has to remember to do.

That requires discipline.

My experience tells me that you cannot rely on discipline, ever. Which is why I prefer to have a platform in place that will generate these alarms instead.

Hi, my name is Yan Cui. I’m an AWS Serverless Hero and the author of Production-Ready Serverless. I have run production workloads at scale in AWS for nearly 10 years, and I have been an architect or principal engineer in a variety of industries ranging from banking, e-commerce and sports streaming to mobile gaming. I currently work as an independent consultant focused on AWS and serverless.

You can contact me via Email, Twitter and LinkedIn.
