Best Practices of Serverless at Scale

A serverless application in its infancy looks and runs vastly differently to one at scale. When there are more components to manage, the key to operational excellence is rooted in serverless best practices.

Dashbird was created with the mission to help developers succeed with modern cloud environments, no matter their size. As experienced developers ourselves, we’ve faced and understand the challenges of at-scale serverless architecture. In this article, we run through the common serverless challenges and the architectural patterns and best practices to combat them.

Find out more about scalable serverless designs for enterprises.

Exploring the Challenges

As with anything, we should constantly aspire to catch problems sooner rather than later. Here is an example of an established but early-stage serverless application:

As you can see, its workflow is simple and there is minimal load, meaning the requests, execution times, and concurrency are all manageable. In just a few months, that same architecture can look like this:

As load increases, the existing infrastructure comes under stress. This is a great exercise in identifying the potential points of failure in your system, and the scenarios in which they could happen. In this example, you can see clearly how each source has its own limit, leading to either failure or performance degradation. It’s important to remember that while different services have different API limits and throttling limits, failures can also happen through configuration mistakes and code errors.

Common issues at higher loads:

Lambda concurrency

Lambda concurrency is the number of requests that your function serves at any one time. A good formula for estimating Lambda concurrency is:

Average Execution Time * Average Requests Per Second = Estimated Concurrency

This helps to determine the number of containers that’ll be used simultaneously.
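As a quick sanity check, the formula above can be turned into a few lines of code. This is an illustrative sketch; the function name and the sample numbers are our own, not from any real workload:

```python
def estimate_concurrency(avg_execution_time_s: float, avg_requests_per_s: float) -> float:
    """Estimated Lambda concurrency = average execution time (in seconds)
    multiplied by average requests per second."""
    return avg_execution_time_s * avg_requests_per_s

# Example: a 250 ms average execution time at 3,000 requests/second
# needs roughly 750 containers running at once.
print(estimate_concurrency(0.25, 3000))  # → 750.0
```

Comparing this estimate against your account’s concurrency limit early tells you whether you’ll need a limit increase before the load arrives.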
With this in mind, let’s remind ourselves of some of the default AWS limits in place.

Function-Based Burst Limits

These can still occur even when concurrency is running fine. There is an initial burst limit of between 500 and 3,000 concurrent executions on functions (region dependent), with the ability to scale up by 500 every minute.

Account-Wide Limits

These are soft limits, built in for your protection. By default the limit is set to 1,000 concurrent executions; however, it can be changed.

API Gateway Limits

There is a 10k requests per second limit per region, which can be increased as needed. However, the 5k concurrency burst limit and the 29-second timeout limit cannot be changed.

Other AWS API Limits

All AWS APIs have limits, which is important to factor in when building and mapping out your application for scale. For example, KMS has a limit of between 5,500 and 10,000 requests per second, depending on the region.

As your application scales, or if it often experiences spiky loads, these limits need to be kept in mind for stable performance.

Architectural Patterns and Best Practices

An unoptimized at-scale serverless application would look like this:

With so many requests per second, the stress becomes clear as other resources multiply. For a relational database, 3,000 new connections per second is a huge load and can cause lag in your system. Additionally, the 7,500 containers now needed increase your costs significantly.

These are our top tips for code-level optimizations to help with this:

- Keep everything in the initialization phase, and only connect to the database once KMS queries have been cached. By doing this, executions will only run the main logic.
- Keep orchestration out of code.
- Manage all connections outside of the handler code.

Using the above, the optimized at-scale serverless architecture now looks like this:

You can see a huge reduction in the execution time, as the connection doesn’t need to be re-established, and in the total connections, resulting in far smoother performance.
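The initialization-phase and connection-handling tips above can be sketched in a few lines. This is a hypothetical illustration: `create_connection` and `handler` are our own names, and a simple counter stands in for a real database driver or KMS client:

```python
# Sketch: module scope corresponds to Lambda's initialization phase,
# which runs once per container; the handler runs once per request.

CONNECTIONS_OPENED = 0

def create_connection():
    """Stands in for an expensive call: opening a database
    connection or decrypting credentials via KMS."""
    global CONNECTIONS_OPENED
    CONNECTIONS_OPENED += 1
    return {"conn_id": CONNECTIONS_OPENED}

# Created once, in the initialization phase -- NOT inside the handler.
CONNECTION = create_connection()

def handler(event, context):
    # Warm invocations reuse CONNECTION; only the main logic runs here.
    return {"ok": True, "conn_id": CONNECTION["conn_id"]}

# Simulating three warm invocations on the same container:
results = [handler({}, None) for _ in range(3)]
print(CONNECTIONS_OPENED)  # → 1
```

Had `create_connection` been called inside the handler, three invocations would have opened three connections; at thousands of requests per second, that difference is exactly what overwhelms a relational database.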
Additional Serverless Patterns to Question

Do you need an API response?

A habit we can fall into is always having a detailed database response from the API, when sometimes a simple acknowledgment is all that’s needed. By accepting this, you can decouple the database from the KMS request and create an asynchronous processing model using SQS and Lambda, allowing you to set your concurrency limit and control the load. There is no change to the model.

Definitely need an API response?

If an API response is needed, there are a few optimization tweaks to consider:

- Switch to a serverless database such as DynamoDB or Serverless Aurora. Using the HTTP interface and the proxy/cache elements, there is no connection limit, and a non-relational store like DynamoDB also brings less lag and slowness.
- Implement client retries and backoffs, to wait for the response outside of the synchronous call.*
- Implement webhooks or polling for long tasks.*

*These features may have a negative impact on the client; however, at a very high scale the compromise can be worthwhile.

Don’t orchestrate in code

The purpose of serverless is to keep code focused on business logic, meaning that the elements of your serverless application of undifferentiated value can use managed services. Make use of the best services to support your application’s functionality. Additionally, don’t wait in code; instead, use Step Functions to run tasks in parallel and to enable automatic triggers and retries. This is one of the best optimization actions many of our customers have seen, from both a performance perspective and a reduction in costs.

Tackling Operating and Monitoring Challenges

With the benefits of serverless comes a new host of monitoring challenges to overcome, which is where Dashbird can provide value and expertise.

Challenges Using Managed Services

There is no code access like we are used to.
It’s no longer a case of attaching an agent to the API to send a failure alarm; instead, we have a more abstract control panel to work from. Serverless components also have a huge amount of data output, with each resource providing logs, tracing data, errors, and configuration data; it rapidly piles up.

Failures are very specific to the service used. The issues found in API Gateway vary from those in Lambda, for example, emphasizing the requirement for deep knowledge of individual services and all their possible errors.

Challenges Using a Distributed System

There is a lot of surface area to manage. There can be hundreds or thousands of parts to your infrastructure, which organically increases the likelihood of failures, errors, and vulnerabilities for attackers. Its large-scale nature naturally means challenges are potentially larger and more widespread.

It’s a dynamic and forever-changing system, adapting to demand and requirements. Understanding the resource relationships and their interactions is new in the serverless world.

Dashbird is built on three core pillars that target all of these issues:

- Centralized observability and visualization
- Automated failure and error alerts
- Actionable Well-Architected and best-practice insights

Centralized Observability and Visualization

It’s important to make the already available mass of data output work efficiently for us. Democratizing data breaks down traditional silos and enables users to navigate their own data more easily through customizable queries and searches. Dashbird’s prebuilt views and simple dashboard offer visualization of your data for easier and quicker understanding.

The centralized platform offers dynamic resource management, where you’re able to understand resource relationships and view your entire application in one place.

Automated Failure and Error Alerts

Monitoring is only effective if there is continuous alert coverage across your entire infrastructure.
Dashbird provides out-of-the-box automated alerts notifying you of failures and errors, which integrate seamlessly into a developer’s workflow by sending notifications in real-time via Slack or email.

Dashbird also proactively listens to log and metric data, meaning that any potential negative trends (not yet failures) are highlighted and can be investigated before they escalate.

Actionable Well-Architected and Best Practice Insights

Building serverless applications requires consistent best-practice habits, which can be difficult to maintain, or even to start. Using the AWS Well-Architected lens, Dashbird helps to ensure your system is built and fixed based on industry-standard best practices.

The Insights Engine detects non-binary issues such as delays, consumption issues, or limits, enabling users to take action and stay reliable at any scale. Within its periodic assessments, Dashbird also helps to improve and optimize your architecture and instill strong security and compliance practices, discovering areas needing encryption, inactive resources, and over- or under-provisioned components, all of which can increase exposure to attacks.

Previously published at https://dashbird.io/blog/challenges-of-serverless-at-scale/