This is the first in a mini series that accompanies my “ ” talk at and . the present and future of Serverless observability ServerlessConf Paris QCon London part 1 : new challenges to observability [this post] part 2 : first party observability tools from AWS part 3 : 3rd party observability tools part 4: the future of Serverless observability 2017 was no doubt the year the concept of observability became mainstream, so much so we now have an entire track at a big industry event such as QCon. Observability This is no doubt thanks to the excellent writing and talks by some really smart people like Cindy Sridharan and Chairty Majors: by Cindy Sridharan Monitoring and Observability by Baron Schwartz Monitoring isn’t Observability by Charity Majors Observability for Emerging Infra: what got you here won’t get you there As Cindy mentioned in her post though, the first murmurs of came from a by Twitter way back in 2013 where they discussed many of the challenges they face to debug their complex, distributed system. observability post Twitter’s microservices deathstar, circa 2013 A few years later, Netflix started the related idea of around how do we design tools that can give us holistic understanding of our complex system — that is, how can we design our tools so that they present us the information about our system, at the , and the amount of time and cognitive energy we need to invest to build a mental model of our system. writing about intuition engineering most relevant right time minimize correct Challenges with Serverless observability With Serverless technologies like AWS Lambda, we face a number of new challenges to the practices and tools that we have slowly mastered as we learnt how to gain observability for services running inside virtual machines as well as containers. As a start, we lose access to the underlying infrastructure that runs our code. The execution environment is locked down, and we have nowhere to install agents & daemons for collecting, batching and publishing data to our observability system. These agents & daemons used to go about their job quietly in the background, far away from the critical paths of your code. For example, if you’re collecting metrics & logs for your REST API, you would collect and publish these observability data outside the request handling code where a human user is waiting for a response on the other side of the network. But with Lambda, everything you do has to be done inside your function’s invocation, which means you lose the ability to perform background processing. Except what the platform does for you, such as: collecting logs from stdout and sending them to CloudWatch Logs collecting tracing data and sending them to X-Ray Another aspect that has drastically changed, is how concurrency of our system is controlled. Whereas before, we would write our REST API with a web framework, and we’ll run it as an application inside an EC2 server or a container, and this application would handle many concurrent requests. In fact, one of the things we , is their ability to handle large no. of concurrent requests. compare different web frameworks with Now, we don’t need web frameworks to create a scalable REST API anymore, API Gateway and Lambda takes care of all the hard work for us. , and that’s great news! Concurrency is now managed by the platform However, this also means that any attempt to batch observability data becomes less effective (more on this later), and for the same volume of incoming traffic you’ll exert a much higher volume of traffic to your observability system. This in turn can have a non-trivial performance and cost implications at scale. You might argue that “well, in that case, I’ll just use a bigger batch size for these observability data and publish them less frequently so I don’t overwhelm the observability system”. Except, it’s not that simple, enter, the lifecycle of an AWS Lambda function. One of the benefits of Lambda is that you don’t pay for it if you don’t use it. To achieve that, the Lambda service would garbage collect containers (or, concurrent executions of your function) that have not received a request for some time. I did some experiments to see how long that idle time is, which you can read about in . this post And if you have observability data that have not been published yet, then you’ll loss those data when the container is GC’d. Even if the container is continuously receiving requests, maybe with the help of something like the for the framework, the Lambda service would still GC the container after it has been active for a few hours and replace it with a fresh container. warmup plugin Serverless Again, this is a good thing, as it eliminates common problems with long running code, such as memory fragmentation and so on. But it also means, you can still lose unpublished observability data when it happens. Also, as I explained in a , those attempts to keep containers warm stop being effective when you have even a moderate amount of load against your system. previous post on cold starts So, you’re back to sending observability data eagerly. Maybe this time, you’ll build an observability system that can handle this extra load, maybe you’ll build it using Lambda! But wait, remember, you don’t have background processing time anymore… So if you’re sending observability data eagerly as part of your function invocation, then that means you’re hurting the user-facing latency and we know that (well, at least in any reasonably competitive market where there’s another provider the customer can easily switch to). latency affects business revenue directly Talk about being caught between a rock and a hard place… Finally, one of the trends that I see in the Serverless space — and one that I have experienced myself when I — is how powerful, and how simple it is to build an event-driven architecture. And Randy Shoup seems to too. migrated a social network’s architecture to AWS Lambda think so _We’ve all heard of events and event-driven programming, but in my experience, events are not used nearly enough in our…_hackernoon.com Events As First-Class Citizens And in this event-driven, serverless world, function invocations are often chained through some asynchronous event source such as Kinesis Streams, SNS, S3, IoT, DynamoDB Streams, and so on. In fact, of for AWS Lambda, only a few are classified as synchronous, so by design, the cards are stacked towards asynchrony here. all the supported event sources And guess what, tracing asynchronous invocations is hard. I wrote a on how you might do it yourself, in the interest of collecting and forwarding correlation IDs for distributed tracing. But even with the approach I outlined, it won’t be easy (or in some cases, possible) to trace through every type of event source. post X-Ray doesn’t help you here either, although it sounds like . At the time of writing, but that too, is on their list. they’re at least looking at support for Kinesis X-Ray also doesn’t trace over API Gateway Until next time… So, I hope I have painted a clear picture of what tool vendors are up against in this space, so you really gotta respect the work people like , and has done. IOPipe Dashbird Thundra That said, there are also many things you have to consider . yourself For example, given the lack of background processing, when you’re building a user facing API where latency is important, you might want to avoid using an observability tool that doesn’t give you the option to send observability data asynchronously (for example, by ), or you need to use a rather sample rate. leveraging CloudWatch Logs stringent At the same time, when you’re processing events asynchronously, you don’t have to worry about invocation time quite as much. But you might care about the cost of writing so much data to CloudWatch Logs and the subsequent Lambda invocations to process them. Or maybe you’re concerned that low-priority functions (that process the observability data you send via CloudWatch Logs) are and can throttle high-priority functions (like the ones that serve user-facing REST APIs!). In this case, you might choose to publish observability data eagerly at the end of each invocation. eating up your quota of concurrent executions In part 2, we’ll look at the first party tools we get from AWS. In part 3, we’ll look at other third party tools that are designed for serverless, such as IOPipe, Dashbird, Thundra and Epsagon. We will see how these tools are affected by the aforementioned challenges, and how they affect the way we evaluate these third party tools. Like what you’re reading but want more help? I’m happy to offer my services as an and help you with your serverless project — architecture reviews, code reviews, building proof-of-concepts, or offer advice on leading practices and tools. independent consultant I’m based in and currently the only UK-based . I have nearly of with running production workloads in AWS at scale. I operate predominantly in the UK but I’m open to travelling for engagements that are longer than a week. To see how we might be able to work together, tell me more about the problems you are trying to solve . London, UK AWS Serverless Hero 10 years experience here I can also run an to help you get with your serverless architecture. You can find out more about the two-day workshop , which takes you from the basics of AWS Lambda all the way through to common operational patterns for log aggregation, distribution tracing and security best practices. in-house workshops production-ready here If you prefer to study at your own pace, then you can also find all the same content of the workshop as a I have produced for Manning. We will cover topics including: video course authentication authorization with API Gateway Cognito & & testing running functions locally & CI/CD log aggregation monitoring best practices distributed tracing with X-Ray tracking correlation IDs performance cost optimization & error handling config management canary deployment VPC security leading practices for Lambda, Kinesis, and API Gateway You can also get the face price with the code . Hur­ry though, this dis­count is only avail­able while we’re in Manning’s Ear­ly Access Pro­gram (MEAP). 40% off ytcui

Amazon

Netflix

Trace

Twitter

YouTube

AWS Lambda — use the invocation context to better handle slow HTTP responses

Applying principles of chaos engineering to AWS Lambda with latency injection

Ask me for help about serverless

Read My Stories

Too Long; Didn't Read

Serverless observability brings new challenges to current practices

Serverless observability brings new challenges to current practices

About Author

Comments

TOPICS

THIS ARTICLE WAS FEATURED IN

Related Stories

6 Tips To Scale an AppSync Project To 200+ Resolvers That Will Blow Your Mind

101 Stories To Learn About Cloud Infrastructure

10 Things in Engineering We Don't Spend Enough Time On

10 Things I Did To Increase CloudTrail Logs Security

10 reasons to give cloud computing a go

10 Lessons from 10 Years of AWS (part 1)

6 Tips To Scale an AppSync Project To 200+ Resolvers That Will Blow Your Mind

101 Stories To Learn About Cloud Infrastructure

10 Things in Engineering We Don't Spend Enough Time On

10 Things I Did To Increase CloudTrail Logs Security

10 reasons to give cloud computing a go

10 Lessons from 10 Years of AWS (part 1)

Light-Mode

Classic

Newspaper

Minty

Dark-Mode

Neon Noir

Minty

HN StartUps