As enterprises adopt cloud services and infrastructure for hosting their systems, the scale of operations increases drastically, which makes it much harder to maintain reliability, security, and uptime. While DevOps practices might be in place, a specialized team is needed to protect against failures in a consistent manner. Site Reliability Engineering (SRE) is the next step in the DevOps practice. It ensures that the target application remains available, performant, and optimally utilized.
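Before delving into the concept of SRE on AWS, it helps to understand three terms: the Service Level Agreement (SLA), which is the commitment made to customers; the Service Level Objective (SLO), which is the internal target the team sets to honor that commitment; and the Service Level Indicator (SLI), which is the measured value that shows whether the objective is being met.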
The SRE team works toward the SLAs and SLOs, using the SLIs to keep the solution well within its Error Budget, that is, the amount of unreliability the SLO permits. Tracking that budget is what makes improvements to the reliability of a system measurable.
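The relationship between an SLO, an SLI, and the Error Budget can be sketched as simple arithmetic. The following is a minimal illustration, not a prescribed method: the function names, the 30-day window, and the request counts are assumptions chosen for the example.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime (in minutes) that an availability SLO tolerates over a window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    if total_events <= 0 or not 0 <= good_events <= total_events:
        raise ValueError("need total_events > 0 and 0 <= good_events <= total_events")
    # The SLI here is good_events / total_events; its complement is the observed failure rate.
    observed_failure = (total_events - good_events) / total_events
    allowed_failure = 1.0 - slo
    return 1.0 - observed_failure / allowed_failure


# Hypothetical figures: a 99.9% SLO over 30 days tolerates roughly 43 minutes of downtime.
print(round(error_budget_minutes(0.999), 1))
# 500 failed requests out of 1,000,000 spend about half of a 99.9% budget.
print(round(budget_remaining(0.999, 999_500, 1_000_000), 3))
```

A negative result from `budget_remaining` signals that the SLO has already been missed for the window, which is typically the point at which a team pauses risky releases.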
It is important to note that SLAs and SLOs are defined independently of the cloud service provider. The provider does, however, affect the way an SLI is captured, processed, and acted upon.
When migrating to a cloud platform, enterprises target different objectives. The most important point to understand is that the SLO of the solution has to stay within what the cloud platform itself commits to, because a solution cannot be more reliable than the services it runs on. The SLAs are defined based on the business requirements of the solution, and both the application and the platform are optimized iteratively to meet those agreements.
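One way to check whether a solution's SLO fits within the platform's commitments is to multiply the availabilities of the services it depends on. This is only a sketch: it assumes the dependencies sit in series and fail independently, and the service names and percentages are hypothetical, not published AWS figures.

```python
from math import prod
from typing import Iterable


def composite_availability(dependencies: Iterable[float]) -> float:
    """Upper bound on availability when every dependency must be up at once."""
    values = list(dependencies)
    if not values or any(not 0.0 <= v <= 1.0 for v in values):
        raise ValueError("provide at least one availability between 0 and 1")
    return prod(values)


def slo_is_feasible(target_slo: float, dependencies: Iterable[float]) -> bool:
    """True if the target SLO does not exceed what the dependencies allow."""
    return target_slo <= composite_availability(dependencies)


# Hypothetical platform commitments for three services used in series.
platform_commitments = {"compute": 0.9999, "database": 0.9995, "load_balancer": 0.999}

ceiling = composite_availability(platform_commitments.values())
print(f"availability ceiling: {ceiling:.4%}")
print("99.9% SLO feasible:", slo_is_feasible(0.999, platform_commitments.values()))
```

Under these assumed numbers the ceiling falls just below 99.9%, so a 99.9% SLO would need redundancy or a different architecture rather than tuning alone.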
Common SLAs that are affected by the choice of cloud platform include the following:
AWS is one of the largest cloud service providers. It provides tools and services that can help the DevOps teams to implement SRE principles irrespective of the scale and volume of the solution deployed on the AWS cloud.
Various AWS services help the solution teams capture the SLIs that must be monitored to meet the SLOs of a solution running on the AWS cloud.
These services produce different types of actionable outcomes which help in improving the overall reliability of the solution. Each of these services looks for different indicators in the solution.
Amazon CloudWatch is the monitoring and observability service built for the SRE teams to gather data about the system performance and present a unified view of actionable insights and system health.
It helps in collecting the metrics of the solution components from across various cloud services, such as EC2, EKS, RDS, S3, Lambda, VPC, and others.
It also collects the logs and application events produced by the solution. Through customizable rule sets, it helps in creating actionable alerts by following the collect-monitor-alert-act-analyze approach. AWS CloudTrail, a separate service, maintains a complete trail of user activities and API calls that is very useful during the diagnosis and analysis of failures, risks, and threats.
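The alerting step can be sketched with the CloudWatch `PutMetricAlarm` API, which boto3 exposes as `put_metric_alarm`. The example below is a minimal sketch under stated assumptions: the alarm naming scheme, the 80% threshold, the evaluation window, and the helper names are illustrative choices, and the client is passed in so the request can be inspected without touching an AWS account.

```python
from typing import Any, Dict


def build_cpu_alarm_request(instance_id: str, topic_arn: str,
                            threshold_percent: float = 80.0) -> Dict[str, Any]:
    """Keyword arguments for put_metric_alarm on an EC2 instance's CPU metric."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,               # seconds per data point
        "EvaluationPeriods": 2,      # two consecutive breaches before alarming
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],  # e.g. an SNS topic that notifies the team
    }


def create_cpu_alarm(cloudwatch_client: Any, instance_id: str,
                     topic_arn: str) -> Dict[str, Any]:
    """Create (or update) the alarm and return the request that was sent."""
    request = build_cpu_alarm_request(instance_id, topic_arn)
    cloudwatch_client.put_metric_alarm(**request)
    return request
```

With boto3 installed and credentials configured, the client would typically come from `boto3.client("cloudwatch")`; the instance ID and topic ARN would be the team's own.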
AWS Elastic Load Balancers expose data about the performance load of the application, network bottlenecks, and regional demands, which can be used to decide when to spin up new instances of the solution components. These SLIs help in analyzing and upgrading the capacity of the solution infrastructure on AWS.
AWS services that handle application events, such as Simple Notification Service (SNS), can be modeled to capture the SLIs related to the internal functioning of the solution components. SNS can be used to monitor specific application events and raise custom alerts from within the application, and it is easy to set up an SNS topic using the AWS CLI.
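Raising a custom alert from inside the application can be sketched with the SNS `create_topic` and `publish` calls as exposed by boto3. This is an illustrative sketch only: the message layout, the `severity` attribute, and the helper names are assumptions, and the client is injected so the logic can be exercised without an AWS account.

```python
import json
from typing import Any, Dict


def build_event_message(event_name: str, severity: str,
                        detail: Dict[str, Any]) -> Dict[str, Any]:
    """Keyword arguments for sns.publish describing one application event."""
    return {
        # SNS requires the Subject to be shorter than 100 characters.
        "Subject": f"[{severity.upper()}] {event_name}"[:99],
        "Message": json.dumps({"event": event_name, "detail": detail}, sort_keys=True),
        # Attributes let subscribers filter, for example on severity.
        "MessageAttributes": {
            "severity": {"DataType": "String", "StringValue": severity},
        },
    }


def publish_event(sns_client: Any, topic_name: str, event_name: str,
                  severity: str, detail: Dict[str, Any]) -> str:
    """Ensure the topic exists, publish the event, and return the message ID."""
    # create_topic returns the existing topic's ARN if the name is already in use.
    topic_arn = sns_client.create_topic(Name=topic_name)["TopicArn"]
    response = sns_client.publish(
        TopicArn=topic_arn, **build_event_message(event_name, severity, detail)
    )
    return response["MessageId"]
```

In a real application the client would typically be `boto3.client("sns")`, and subscribers such as email addresses or on-call tools would be attached to the topic separately.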
Amazon GuardDuty is a threat detection service that continuously monitors for malicious activity and unauthorized behavior to protect the AWS workloads and solution components. It analyzes inputs from sources such as CloudTrail event logs, VPC Flow Logs, and DNS logs.
Amazon provides a host of services, each with its own tradeoffs, that can be used to capture the data that acts as SLIs for the SRE process. The area of SRE is evolving continuously, and more data is being produced and collected for analysis and action. With more and more inputs, SRE teams are aiming for near-100% reliability and availability of solutions.
AWS provides multiple services that together form the security layer for the hosted solutions. Network and infrastructure security is provided through services such as VPC and VPN. Access controls are provided by services such as IAM, which authenticate and authorize users. Data protection and encryption services are provided with the help of AWS partners.
This is a non-exhaustive list of the monitoring tools available on AWS, but these are the services that I find most successfully implemented in teams just starting out on their SRE journey with AWS. For a complete review of the services available on other platforms such as Azure and Alibaba Cloud, please see the following blogs in this series.
If you and your team are building out your best practices in site reliability engineering, feel free to reach out with your questions, comments, and favorite stories from the cutting edge of DevOps and SRE!