Yan Cui

AWS Serverless Hero. Independent Consultant. Developer Advocate at Lumigo.

Serverless observability, what can you use out of the box?

part 1 : new challenges to observability

part 2 : 1st party observability tools from AWS [this post]

part 3 : 3rd party observability tools

part 4 : the future of Serverless observability

In part 1 we talked about the challenges serverless brings to the table. In this post, let’s look at the 1st party tools from AWS.

Out of the box we get a bunch of tools provided by AWS itself:

  • CloudWatch for monitoring, alerting and visualization
  • CloudWatch Logs for logs
  • X-Ray for distributed tracing
  • Amazon Elasticsearch for log aggregation

CloudWatch Logs

Whenever you write to stdout, the output is captured by the Lambda service and sent to CloudWatch Logs as logs. This is one of the few pieces of background processing you get, as it’s provided by the platform.

All the log messages (technically they’re referred to as events) for a given function appear in CloudWatch Logs under a single Log Group.

A Log Group contains many Log Streams. Each Log Stream contains the logs from one concurrent execution (or container) of your function, so there’s a one-to-one mapping.
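To make this concrete, here is a minimal sketch of a handler that writes structured (JSON) log events to stdout. The `log` helper and its field names are my own assumptions, not anything Lambda requires. The `aws_request_id`, `log_group_name` and `log_stream_name` attributes come from the Lambda context object in the Python runtime.

```python
import json
import sys
import time


def log(level, message, context=None, **fields):
    """Write one JSON log event to stdout and return the line that was written."""
    event = {
        "level": level,
        "message": message,
        "timestamp": int(time.time() * 1000),
    }
    if context is not None:
        # Attributes of the Lambda context object (Python runtime).
        event["request_id"] = context.aws_request_id
        event["log_group"] = context.log_group_name
        event["log_stream"] = context.log_stream_name
    event.update(fields)
    line = json.dumps(event)
    # Anything written to stdout ends up in the function's Log Group.
    sys.stdout.write(line + "\n")
    return line


def handler(event, context):
    # Hypothetical handler: log how many records arrived, then return.
    log("INFO", "processing event", context,
        record_count=len(event.get("Records", [])))
    return {"statusCode": 200}
```

Writing JSON rather than free text costs nothing extra, and it makes the events much easier to filter once they leave CloudWatch Logs.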

That’s all well and good, but it’s not easy to search for log messages in CloudWatch Logs. There’s currently no way to search the logs for multiple functions at once. Whilst AWS has been improving the service, it still pales in comparison to the alternatives on the market.

It might suffice as you start out, but you’ll probably find yourself in need of something more soon after.

Fortunately, it’s straightforward to get your logs out of CloudWatch Logs.

You can stream them to Amazon’s hosted Elasticsearch service. But don’t expect it to be a like-for-like experience with your self-hosted ELK stack. Liz Bennett wrote a detailed post on some of the problems they ran into when using Amazon Elasticsearch at scale. Please give that a read if you’re thinking about adopting Amazon Elasticsearch.

Alternatively, you can stream the logs to a Lambda function, and ship them to a log aggregation service of your choice. I won’t go into detail here as I have written about it at length previously. Just go and read this post instead.

You can stream logs from CloudWatch Logs to just about any log aggregation service, via Lambda.
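As a rough sketch of what the log-shipping function has to do: CloudWatch Logs delivers each batch to the subscribed function as base64-encoded, gzipped JSON under `awslogs.data`. The `ship` parameter below is a hypothetical stand-in for the client of whichever log aggregation service you use, and the output field names are my own choice.

```python
import base64
import gzip
import json


def decode_log_events(event):
    """Decode the batch that CloudWatch Logs sends to a subscribed function."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    payload = json.loads(gzip.decompress(compressed))
    # Control messages carry no log data, so there is nothing to ship.
    if payload.get("messageType") != "DATA_MESSAGE":
        return []
    return [
        {
            "log_group": payload["logGroup"],
            "log_stream": payload["logStream"],
            "timestamp": log_event["timestamp"],
            "message": log_event["message"],
        }
        for log_event in payload["logEvents"]
    ]


def handler(event, context, ship=print):
    """Ship every decoded record; `ship` is a placeholder for a real client."""
    records = decode_log_events(event)
    for record in records:
        ship(record)
    return len(records)
```

Keeping the decoding separate from the shipping means you can swap the destination without touching the parsing logic.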

CloudWatch Metrics

With CloudWatch, you get some basic metrics out of the box: invocation count, error count, invocation duration, etc. All the basic telemetry about the health of a function.

But CloudWatch is missing some valuable data points, such as:

  • estimated costs
  • concurrent executions : CloudWatch only reports this for functions with reserved concurrency
  • cold starts
  • billed duration : Lambda reports this in CloudWatch Logs, at the end of every invocation. Because Lambda invocations are billed in 100ms blocks, a 102ms invocation would be billed for 200ms. It would be a useful metric to see alongside Invocation Duration to identify cost optimizations
  • memory usage : Lambda reports this in CloudWatch Logs too, but it’s not recorded in CloudWatch

You get 6 basic metrics about the health of a function.

There are ways to record and track these metrics yourself. See this post on how to do that. Other providers like IOPipe (more on them in the next post) also report these data points out of the box.
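One way to get at billed duration and memory usage is to parse the REPORT line that Lambda writes at the end of every invocation. The sketch below assumes the usual layout of that line (the regex is my assumption about the format, not an official contract) and the 100ms billing blocks described above.

```python
import math
import re

# Assumed layout of the REPORT line that Lambda writes per invocation.
REPORT_PATTERN = re.compile(
    r"REPORT RequestId: (?P<request_id>\S+)\s+"
    r"Duration: (?P<duration_ms>[\d.]+) ms\s+"
    r"Billed Duration: (?P<billed_duration_ms>\d+) ms\s+"
    r"Memory Size: (?P<memory_size_mb>\d+) MB\s+"
    r"Max Memory Used: (?P<max_memory_used_mb>\d+) MB"
)


def parse_report(line):
    """Return the metrics in a REPORT line, or None if the line doesn't match."""
    match = REPORT_PATTERN.search(line)
    if match is None:
        return None
    return {
        "request_id": match.group("request_id"),
        "duration_ms": float(match.group("duration_ms")),
        "billed_duration_ms": int(match.group("billed_duration_ms")),
        "memory_size_mb": int(match.group("memory_size_mb")),
        "max_memory_used_mb": int(match.group("max_memory_used_mb")),
    }


def billed_duration(duration_ms, block_ms=100):
    """Round a duration up to the next billing block."""
    return math.ceil(duration_ms / block_ms) * block_ms
```

With these two helpers, `billed_duration(102)` gives 200, which matches the example above. The gap between `billed_duration_ms` and `duration_ms` shows how much of each billing block you pay for but don't use.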

You can set up Alarms in CloudWatch against any of these metrics. Here are some good candidates:

  • throttled invocations
  • regional concurrent executions : set the threshold based on a % of your current regional limit
  • tail (95th or 99th percentile) latency against some acceptable threshold
  • 4xx and 5xx errors on API Gateway
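Here is a small sketch of how two of those alarms might be described in code. The functions only build parameter dictionaries in the shape that CloudWatch's PutMetricAlarm API expects. The alarm names, periods, evaluation periods and the 80% default are assumptions for illustration, so pick values that suit your own system.

```python
def concurrency_alarm(regional_limit, percent=80):
    """Alarm when regional concurrency exceeds a % of the regional limit."""
    return {
        "AlarmName": f"lambda-concurrency-above-{percent}-percent",
        "Namespace": "AWS/Lambda",
        "MetricName": "ConcurrentExecutions",
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": regional_limit * percent / 100,
        "ComparisonOperator": "GreaterThanThreshold",
    }


def tail_latency_alarm(function_name, threshold_ms, percentile="p99"):
    """Alarm when a function's tail latency exceeds an acceptable threshold."""
    return {
        "AlarmName": f"{function_name}-{percentile}-latency",
        "Namespace": "AWS/Lambda",
        "MetricName": "Duration",  # reported in milliseconds
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "ExtendedStatistic": percentile,
        "Period": 60,
        "EvaluationPeriods": 5,
        "Threshold": threshold_ms,
        "ComparisonOperator": "GreaterThanThreshold",
    }
```

If you use boto3, each dictionary can be passed along as keyword arguments, e.g. `boto3.client("cloudwatch").put_metric_alarm(**tail_latency_alarm("my-function", 1000))`, where `my-function` is a made-up function name.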

And you can set up basic dashboards in CloudWatch too, at $3 per month per dashboard (the first 3 are free).

X-Ray

For distributed tracing, you have X-Ray. To make the most of tracing, you should instrument your code to gain even better visibility.

Like CloudWatch Logs, collecting traces does not add additional time to your function’s invocation. It’s background processing that the platform provides for you.
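Short of full instrumentation with the X-Ray SDK, one small, dependency-free step is to include the current trace ID in your own logs, so you can jump from a log message to the matching trace. Lambda exposes the tracing header through the `_X_AMZN_TRACE_ID` environment variable. The helper names and the shape of the returned dictionary below are my own assumptions.

```python
import os


def parse_trace_header(header):
    """Split a header like 'Root=...;Parent=...;Sampled=1' into its parts."""
    parts = dict(
        item.split("=", 1) for item in header.split(";") if "=" in item
    )
    return {
        "trace_id": parts.get("Root"),
        "parent_id": parts.get("Parent"),
        "sampled": parts.get("Sampled") == "1",
    }


def current_trace():
    """Read the tracing header that Lambda sets for the current invocation."""
    return parse_trace_header(os.environ.get("_X_AMZN_TRACE_ID", ""))
```

Outside of Lambda the environment variable is missing, in which case `current_trace()` returns `None` for both IDs and `False` for `sampled`.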

From the tracing data, X-Ray can also show you a service map like this one.

X-Ray gives you a lot of insight into the runtime performance of a function. However, its focus is narrowly on one function, and the distributed aspect is severely undercooked. As it stands, X-Ray doesn’t trace over API Gateway, or asynchronous invocations such as SNS or Kinesis.

It’s good for homing in on performance issues for a particular function. But it offers little to help you build intuition about how your system operates as a whole. For that, I need to step away from what happens inside one function, and be able to look at the entire call chain.

After all, when the engineers at Twitter were talking about the need for observability, it wasn’t so much to help them debug performance issues of any single endpoint, but to help them make sense of the behaviour and performance of their system. A system that is essentially one big, complex and highly connected graph of services.

With Lambda, this graph is going to become a lot more complex, more sparse and more connected because:

  • instead of one service with 5 endpoints, you now have 5 functions
  • functions are connected through a greater variety of mediums: SNS, Kinesis, API Gateway, IoT, you name it
  • event-driven architecture has become the norm

Our tracing tools need to help us make sense of this graph. They need to help us visualize the connections between our functions. And they need to help us follow data as it enters our system as a user request, and reaches out to far corners of this graph through both synchronous and asynchronous events.
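Until the tools catch up, one common workaround is to pass a correlation ID along with every request and event yourself. The sketch below picks the ID out of an API Gateway proxy event or an SNS event, and builds the message attributes needed to forward it on the next SNS publish. The `x-correlation-id` name is a hypothetical convention of mine, not something AWS defines.

```python
import uuid

CORRELATION_ID = "x-correlation-id"  # hypothetical naming convention


def get_correlation_id(event):
    """Find the correlation ID in an incoming event, or start a new one."""
    # Synchronous hop: API Gateway proxy events carry the HTTP headers.
    headers = event.get("headers") or {}
    for name, value in headers.items():
        if name.lower() == CORRELATION_ID:
            return value
    # Asynchronous hop: SNS records carry message attributes.
    for record in event.get("Records", []):
        attributes = record.get("Sns", {}).get("MessageAttributes", {})
        if CORRELATION_ID in attributes:
            return attributes[CORRELATION_ID]["Value"]
    # Nothing upstream set an ID, so this invocation starts the chain.
    return str(uuid.uuid4())


def sns_message_attributes(correlation_id):
    """Attributes to attach when publishing to SNS, so the ID survives the hop."""
    return {
        CORRELATION_ID: {"DataType": "String", "StringValue": correlation_id}
    }
```

Include the ID in every log message and you can at least stitch a call chain back together by searching your logs, even where the tracing stops.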

And of course, X-Ray does not span over non-AWS services such as Auth0, Google BigQuery, or Azure Functions.

But those of us deep in the serverless mindset see the world through SaaS-tinted glasses. We want to use the services that best address our needs, and glue them together with Lambda.

At Yubl, we used a number of non-AWS services from Lambda: Auth0, Google BigQuery, GrapheneDB, MongoLab, and Twilio to name a few. And it was great, we didn’t have to be bound by what AWS offers.

My good friend Raj also did a good talk at NDC on how he uses services from both AWS and Azure in his wine startup. You can watch his talk here.

And finally, I think of our system like a brain. Like a brain, our system is made up of:

  • neurons (functions)
  • synapses (connections between functions)
  • and electrical signals (data) that flow through them

Like a brain, our system is alive. It’s constantly changing and evolving, and it’s constantly working! And yet, when I look at my dashboards and my X-Ray traces, that’s not what I see. Instead, I see a tabulated list that does not reflect the movement of data and areas of activity. It doesn’t help me build up any intuitive understanding of what’s going on in my system.

A brain surgeon wouldn’t accept this as the primary source of information. How can they possibly use it to build a mental picture of the brain they need to cut open and operate on?

I should add that this is not a criticism of X-Ray. It is built the same way most observability tools are built.

But maybe our tools need to evolve beyond human-computer interfaces (HCI) that wouldn’t look out of place on a clipboard (the physical kind, if you’re old enough to have seen one!). It actually reminds me of one of Bret Victor’s seminal talks, stop drawing dead fish.

Netflix made great strides towards this idea of a live dashboard with Vizceral, which they have also kindly open sourced.

Conclusions

AWS provides us with some decent tools out of the box. Whilst they each have their shortcomings, they’re good enough to get started with.

As 1st party tools, they also enjoy home field advantages over 3rd party tools. For example, Lambda collects logs and traces without adding to your function invocation time. Since we can’t access the server anymore, 3rd party tools cannot perform any background processing. Instead they have to resort to workarounds or are forced to collect data synchronously.

However, as our serverless applications become more complex, these tools need to either evolve with us or be replaced in our stack. CloudWatch Logs, for instance, cannot search across multiple functions. It’s often the first piece that needs to be replaced once you have more than a dozen functions.

In the next post, we will look at some 3rd party tools such as IOPipe, Dashbird and Thundra. We will discuss their value-add proposition as well as their shortcomings.
