Managing applications at scale often comes up as one of the biggest concerns for businesses. How can it work smoothly? How do we monitor so many resources? How do we maintain best practices with constantly evolving infrastructure? In this article, we run through the best approach for operational excellence, looking at serverless monitoring strategy, serverless alerting strategy, and security and compliance best practices.

The Serverless Challenge

In a containerized application, there would be anywhere between 5 and 20 servers with business logic within them. To monitor this, an agent would simply be attached to oversee the entire system and alert of any issues. The serverless model is quite different, as we know. At scale, there can be thousands of resources using tens of different services, making monitoring and alerting much more complex.

[Figure: a typical serverless application]

Challenge 1: The volume of data from a serverless application grows exponentially, particularly if we consider that for every resource there are at least five pieces of related data. Think logs, metrics, traces, and configuration data.

Challenge 2: As infrastructure grows, so does the variety of failures. While we may know of a few points of failure within our own infrastructure, it’s near impossible to know them all. For this reason, it’s best to view each resource as a potential source of failure that must be monitored.

The Solution

Observability. Different from visibility, observability is a measure of how well the internal states of a system can be inferred from knowledge of its external outputs. Using this method requires no interference and instead puts data output to work, helping to visualize and understand the health and performance of the serverless application. For success, it’s a case of breaking the traditional silos of data, then sorting and organizing the mass volume into workable, relevant, and actionable insights.

Serverless Application Monitoring Strategy

A strategy for serverless monitoring is key to understanding how your application is running, gathering insights, and discovering opportunities for performance and cost optimization.

First, let’s look at the primary goals of monitoring:

- Reduce the time to discovery for customer-affecting incidents.
- Increase confidence in building quickly and iterating on products with minimal risk.
- Enable developers to focus on customers, not operations.
- Learn from mistakes and pre-empt further issues.

Next, we need to conduct some risk mapping (SLAs), making clear what parameters are acceptable:

- What are the application requirements for speed, uptime, and cost? These three elements are intertwined and dependent on each other, so it’s important to make this clear early on.
- What is the acceptable time to discover failures?
- After discovery, what is the acceptable time to fix an error or failure?

Finally, state clearly the requirements of having monitoring in place. This may be completely obvious, but it’s important to have these to refer back to any time things go off track:

- For the infrastructure to ingest, navigate, and interrogate the data.
- To detect failures as quickly as possible.
- To debug the system quickly and understand the issue efficiently.

With the goals and requirements stated, next is the best approach to navigating and monitoring data. It’s important to move data out of silos for true interrogation and useful analysis. This means breaking down traditional barriers between logs, metrics, and tracing data. Having a unified view that correlates metric events with log events, along with cross-region and cross-account visibility, provides better context within the bigger picture of the application. It’s also important to be able to look at the data in different ways so that it really works for you and your needs. This might look like various reports, a dashboard, or search and query functions.

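
To make the idea of a unified view a little more concrete, here is a minimal sketch of correlating log, metric, and trace events by resource and time window across accounts and regions. The `Event` shape, the field names, and the 60-second window are assumptions chosen for illustration, not how any particular tool models its data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Event:
    # Hypothetical unified record; real sources would need to be mapped into it.
    account: str
    region: str
    resource: str
    timestamp: int  # Unix time in seconds
    kind: str       # "log", "metric" or "trace"
    payload: str


def correlate(events, window_seconds=60):
    """Group events from all sources by (account, region, resource, window)."""
    buckets = defaultdict(lambda: defaultdict(list))
    for event in events:
        # Round the timestamp down to the start of its time window.
        window_start = event.timestamp - (event.timestamp % window_seconds)
        key = (event.account, event.region, event.resource, window_start)
        buckets[key][event.kind].append(event.payload)
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {key: dict(kinds) for key, kinds in buckets.items()}
```

Feeding this function a metric data point and a log line from the same function within the same minute returns them side by side under one key, which is the kind of context a single silo cannot give.
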
Arguably, however, the most important quality for serverless monitoring is elasticity. We need the architecture to automatically digest the changing data from constantly changing resources, without the overhead of importing or configuring the data to make it readable. This is how Dashbird works, making full use of elastic data ingestion.

While big picture views are important, so are specific level views.

Account and Microservices Level

Goal: to detect major problems and fix these as early as possible.

Through this specific lens, we can understand trends for overall application health, the most concerning areas, and cost and activity metrics.

Resource Level

Goal: to understand the specific resource’s health, performance, costs, and associated problems.

Given this is a place where developers spend most of their time, it’s one of the most important views to include in your strategy. Looking at a Lambda resource, for example, there are multiple areas to analyze: invocations, errors, cost, cold starts, and memory usage. We need to be able to drill down into anomalies and into past and present errors in order to improve and better align with best practices.

[Screenshot: Dashbird resource-level view]

Execution Level

Goal: to understand the problem in detail.

It’s at the execution level that we are able to source full activity details like duration, memory usage, and start and end times for issues and optimization. Going deeper, we can look at the profile of the execution: requests to other resources, how long they took, and their level of success. We can also detect retries and cold starts here.

It’s with excellent monitoring that we can defend against known and unknown failures.

Serverless Application Alerting Strategy

For true operational excellence, monitoring needs to be paired with a good serverless alerting strategy. Failures and errors are inevitable, and so reducing the time to discover and fix them is imperative. As discussed, monitoring needs to be constant, with preemptive checks continuously running for security, best practice, cost, and performance.

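
As a small illustration of such a continuous check, the sketch below pulls the execution-level details discussed earlier (duration, memory usage, cold starts) out of the `REPORT` line that AWS Lambda writes to its logs at the end of an invocation. It assumes the tab-separated layout commonly seen in CloudWatch Logs, with `Init Duration` present only on cold starts; the function names and the 90% memory threshold are hypothetical.

```python
def parse_report_line(line):
    """Return execution details from a Lambda REPORT log line, or None."""
    if not line.startswith("REPORT RequestId:"):
        return None
    fields = {}
    # Assumed layout: "Name: value" pairs separated by tab characters.
    for part in line.strip().split("\t"):
        name, _, value = part.partition(": ")
        fields[name] = value
    size_mb = int(fields["Memory Size"].split()[0])
    used_mb = int(fields["Max Memory Used"].split()[0])
    return {
        "request_id": fields["REPORT RequestId"],
        "duration_ms": float(fields["Duration"].split()[0]),
        "memory_size_mb": size_mb,
        "max_memory_used_mb": used_mb,
        "memory_utilization": used_mb / size_mb,
        # Init Duration is only reported when the execution was a cold start.
        "cold_start": "Init Duration" in fields,
    }


def findings(details, memory_threshold=0.9):
    """List simple observations worth a closer look (illustrative threshold)."""
    notes = []
    if details["cold_start"]:
        notes.append("cold start")
    if details["memory_utilization"] >= memory_threshold:
        notes.append("memory near limit")
    return notes
```

Running `findings(parse_report_line(line))` over incoming log lines gives a crude per-execution check; a real setup would aggregate these over time rather than react to single invocations.
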
However, we also need to be able to filter log events for errors and failures; this is the first step in the alerting strategy.

Failure Detection from Logs

Filtering log streams in real time to detect failures is critical to operations. Dashbird has automated alerts for log data coming in, without the need for any code changes or agents. Filtering is also possible with other systems; however, it can be lengthy to set up and difficult to maintain as changes to the serverless architecture take place over time.

Metric Alarms

Metric alarms are understandably crucial; however, a common challenge is the scale of metrics and what to prioritize without too much noise.

We recommend starting with customer-affecting metrics. For example, start with an API Gateway and go downstream to the function and database used on that journey. All of these components should ideally be monitored. However, an alternative for those with time constraints is to focus on the API only. From here, errors or latency issues will inevitably show up, which you can investigate when needed.

Leading indicators are another great kind of metric to prioritize for maximum impact. For resources with high memory usage, for example, there can be an increased delay in downstream services. This snowball effect can add a huge amount of strain, making it important to identify early.

Operational excellence also takes into account efficient spending, and so cost monitoring and alarms are good to include in your strategy too. While you might not have many risks of unexpected increased costs, we recommend setting up limits and alerts from the start, ticking off one less thing to have to worry about.

When it comes to your alerting strategy, do it programmatically and centralize the alerts with other alerts. A serverless application is already heavily intertwined, and so it makes perfect sense for the alarms to have some level of integration too.

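
A minimal sketch of what "programmatic and centralized" could look like for the customer-affecting path above: alarm definitions are generated in code and all point at one notification topic. The dictionary keys follow the parameter names of the CloudWatch `PutMetricAlarm` API, and the namespaces and metric names are standard AWS ones; the helper names, thresholds, and the topic ARN are assumptions for illustration only. The database step is left out and would follow the same pattern.

```python
def build_alarm(namespace, metric, dimension, value, statistic, threshold, topic_arn):
    """Build one alarm definition using CloudWatch PutMetricAlarm parameter names."""
    return {
        "AlarmName": f"{value}-{metric}",
        "Namespace": namespace,
        "MetricName": metric,
        "Dimensions": [{"Name": dimension, "Value": value}],
        "Statistic": statistic,
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        # Every alarm notifies the same topic so alerts end up in one place.
        "AlarmActions": [topic_arn],
    }


def customer_facing_alarms(api_name, function_name, topic_arn):
    """Alarms for the API first, then the function behind it (illustrative thresholds)."""
    specs = [
        ("AWS/ApiGateway", "5XXError", "ApiName", api_name, "Sum", 0),
        ("AWS/ApiGateway", "Latency", "ApiName", api_name, "Average", 1000),
        ("AWS/Lambda", "Errors", "FunctionName", function_name, "Sum", 0),
    ]
    return [build_alarm(*spec, topic_arn) for spec in specs]
```

Each returned dictionary could then be passed to an AWS SDK call such as boto3's `put_metric_alarm(**alarm)`, or rendered into infrastructure-as-code templates, so that alarms are created alongside the resources they watch.
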
[Screenshot: a Dashbird example of the alerts available]

Serverless Security and Compliance Best Practice

Serverless best practice involves continuous risk management and assessment of optimization opportunities. Things like encryption, habitually setting up functions with the rule of least privilege, and detecting redundant services to improve costs are all part of this. Dashbird continuously checks across the whole infrastructure to ensure this.

[Screenshot: Dashbird infrastructure checks]

Also part of best practice is to plan and proactively practice incident management. Test failures and errors in a staging environment on a regular basis. Once a quarter is often enough, but how often you practice depends on the size and impact of the failure, which is an important consideration.

For unplanned incidents, it’s always best practice to document them, setting time aside for review to prevent them from happening again. Importantly too, tracking actual performance against stated goals and SLAs helps to give a full view of the application’s success and areas that can be improved.

It goes without saying that operational excellence takes a lot of time and practice, especially in serverless.

Some things you can do today to get started or continue doing:

- Read the AWS Well-Architected Serverless Whitepaper.
- Work with AWS solutions architects and the serverless community.
- Use third-party tools to implement best practices. There are many on the market with free trials, like Dashbird, reducing risk and commitment.

Previously published at https://dashbird.io/blog/operational-excellence-serverless-application/


How to Optimize Large Scale Serverless Applications for Operational Excellence
