Site Reliability Engineering with Amazon Web Services

As enterprises adopt cloud services and infrastructure to host their systems, the scale of operations increases drastically, which makes it harder to maintain reliability, security, and uptime. While DevOps practices might be in place, a specialized team is needed to guard against failures in a consistent manner. Site Reliability Engineering (SRE) is the next step in the DevOps practice. It ensures that the target application remains available and performant, and that its resources are optimally utilized.

Before delving into SRE on AWS, it helps to understand three key terms:

Service Level Agreements (SLA): These are the assurances of availability and reliability given for the solution, and they generally cover the application as well as the platform.
Service Level Objectives (SLO): These are the targets that need to be met to fulfill the SLAs.
Service Level Indicators (SLI): These are the actual data points that show the current state of a solution.

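The SRE team works toward the SLAs and SLOs with the help of SLIs to keep the solution well within its Error Budget, the amount of unreliability that the SLO permits over a given period. Tracking that budget is what lets the team improve the reliability of a system in a measurable way.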
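To make the relationship between SLO, SLI, and Error Budget concrete, here is a minimal sketch. It assumes a request-based SLI (successful requests divided by total requests); the function name and the sample numbers are hypothetical and only illustrate the arithmetic.

```python
def error_budget_report(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Compare an observed request-based SLI against an SLO target.

    slo_target is a fraction such as 0.999 (hypothetical example value).
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")

    # SLI: the measured fraction of successful requests.
    sli = (total_requests - failed_requests) / total_requests
    # Error budget: how many failures the SLO tolerates for this traffic volume.
    allowed_failures = (1 - slo_target) * total_requests
    return {
        "sli": sli,
        "allowed_failures": allowed_failures,
        "budget_consumed": failed_requests / allowed_failures,
        "within_budget": failed_requests <= allowed_failures,
    }


if __name__ == "__main__":
    # Hypothetical month: 1,000,000 requests, 400 of which failed.
    print(error_budget_report(0.999, 1_000_000, 400))
```

A team could run a calculation like this over each reporting window and slow down risky releases once most of the budget has been consumed.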

What are the common SLAs and SLOs?

It is important to note that SLAs and SLOs are independent of the cloud service provider. The provider does, however, affect how an SLI is captured, processed, and acted upon.

When migrating to a cloud platform, enterprises target different objectives. The most important factor to understand is that the SLO of the solution cannot be stricter than what the SLOs of the underlying cloud platform support. The SLAs are defined based on the business requirements of the solution, and the application as well as the platform are optimized iteratively to meet those agreements.
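One way to see why the solution's SLO is bounded by the platform is to multiply the availabilities of the services the solution depends on. The sketch below assumes every dependency must be up for the solution to work and that failures are independent; the service names and figures are hypothetical, not published AWS numbers.

```python
from math import prod


def composite_availability(dependency_availabilities) -> float:
    """Availability ceiling of a solution whose dependencies must all be up.

    Assumes independent failures, which is a simplification.
    """
    values = list(dependency_availabilities)
    for value in values:
        if not 0 <= value <= 1:
            raise ValueError("each availability must be a fraction between 0 and 1")
    return prod(values)


if __name__ == "__main__":
    # Hypothetical platform commitments, for illustration only.
    platform = {"compute": 0.9999, "database": 0.9995, "load_balancer": 0.9999}
    ceiling = composite_availability(platform.values())
    proposed_slo = 0.9999
    print(f"ceiling={ceiling:.6f}, achievable={proposed_slo <= ceiling}")
```

If the proposed SLO is above the ceiling, the options are to relax the objective or to add redundancy so that a single dependency failure no longer takes the solution down.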

A common set of SLAs that are impacted by the choice of cloud platform includes the following:

Data security and Data privacy levels are defined to meet the standards of the business domain.
High Availability of the platform and the solution is defined in terms of minutes of downtime during the entire year.
Low network latency values are identified for the real-time, on-field, or robotic applications and devices.
Response time and Performance requirements are driven by the business load and are met through elastic expansion, that is, vertical and horizontal scaling of resources.
Failure management is defined through a detection and prevention strategy and recovery timelines.
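Since High Availability is expressed as minutes of downtime per year, it can help to convert an availability percentage into that figure. The following is a small sketch; the function name is hypothetical and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return (100 - availability_percent) / 100 * MINUTES_PER_YEAR


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        print(f"{target}% -> {allowed_downtime_minutes(target):.1f} minutes/year")
```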

The most common SLOs set by the solution owners to meet their SLAs are below:

Vulnerability Assessment and Penetration Testing: It must be done every month. All critical and high-level threats should be addressed immediately.
Access control tests: Access controls should be tested every three months to confirm that the correct users have access to the correct resources only.
Availability: Solution should run on multiple nodes and zones so that at least one instance is up at any given point in time. Switchovers and failovers should be tested once every quarter.
Performance Test: A thorough test should be done with a varying load of data, transactions, and concurrency to determine the scalability, throughput, and latency.
Failure Detection and Prevention Strategy: The solution should produce enough indications of an impending failure, and solution health checks should reveal the possible failure ahead of time.
Failure Diagnosis: The solution must provide event and incident logs to diagnose the failure and threat points quickly.
Failure Recovery: It should be tested alongside availability tests to determine how quickly the existing instances of the solution can be back online in case of a failure without the need for new instances.

How does AWS help SRE teams?

AWS is one of the largest cloud service providers. It provides tools and services that can help the DevOps teams to implement SRE principles irrespective of the scale and volume of the solution deployed on the AWS cloud.

Various AWS services help the solution teams capture the SLIs that need to be monitored to meet the SLOs of a solution running on the AWS cloud.

These services produce different types of actionable outcomes which help in improving the overall reliability of the solution. Each of these services looks for different indicators in the solution.

Observability and Monitoring

Amazon CloudWatch is the monitoring and observability service built for the SRE teams to gather data about the system performance and present a unified view of actionable insights and system health.

It helps in collecting the metrics of the solution components from across various cloud services, such as EC2, EKS, RDS, S3, Lambda, VPC, and others.

It collects the logs and application events produced by the solution. Through customizable rule sets, it helps in creating actionable alerts by following the collect-monitor-alert-act-analyze approach. Alongside it, AWS CloudTrail maintains a complete trail of user activities, which is very useful during the diagnosis and analysis of failures, risks, and threats.
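As an illustration of the "alert" step, the sketch below builds the parameters for a CloudWatch alarm on EC2 CPU utilization. The function name, alarm name, instance ID, threshold, and topic ARN are hypothetical. The dictionary keys follow the parameters of the boto3 CloudWatch client's `put_metric_alarm` call, which is shown only in a comment because it needs boto3 and AWS credentials.

```python
def build_cpu_alarm(alarm_name: str, instance_id: str, threshold_percent: float,
                    sns_topic_arn: str) -> dict:
    """Build keyword arguments for a CloudWatch alarm on average CPU utilization."""
    return {
        "AlarmName": alarm_name,
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,             # evaluate the metric in 5-minute windows
        "EvaluationPeriods": 3,    # alarm after 3 consecutive breaching windows
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],  # notify this SNS topic when in ALARM
    }


if __name__ == "__main__":
    params = build_cpu_alarm(
        "checkout-high-cpu",                               # hypothetical alarm name
        "i-0123456789abcdef0",                             # hypothetical instance ID
        80.0,
        "arn:aws:sns:us-east-1:123456789012:sre-alerts",   # hypothetical topic ARN
    )
    print(params)
    # With boto3 installed and credentials configured, the alarm could be created with:
    #   import boto3
    #   boto3.client("cloudwatch").put_metric_alarm(**params)
```

Keeping the parameter construction separate from the API call makes the alarm definition easy to review and test without touching an AWS account.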

Availability

AWS Elastic Load Balancers distribute incoming traffic across healthy instances of the solution components in multiple Availability Zones and run health checks against them. Paired with Auto Scaling, the load and health data they report can be used to spin up new instances in response to performance load, network bottlenecks, and regional demand. They provide SLIs that help in analyzing and upgrading the capacity of the solution infrastructure on AWS.

Health Checks

AWS services that capture application events, such as Simple Notification Service (SNS), can be modeled to capture the SLIs related to the internal functioning of the solution components. SNS can be used to monitor specific application events and raise custom alerts from within the application, and it is easy to set up an SNS topic using the AWS CLI.
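The sketch below shows one way an application might raise a custom health alert through SNS. It uses the boto3 SDK rather than the AWS CLI; the function name, topic name, and event fields are hypothetical. The client is passed in as an argument so the function can be exercised without AWS access, and the real client setup appears only in comments.

```python
import json


def publish_health_event(sns_client, topic_arn: str, component: str,
                         status: str, detail: str) -> str:
    """Publish a custom health event to an SNS topic and return its message ID.

    sns_client is expected to behave like boto3.client("sns").
    """
    message = json.dumps({"component": component, "status": status, "detail": detail})
    response = sns_client.publish(
        TopicArn=topic_arn,
        Subject=f"[{status}] {component}"[:100],  # SNS subjects are limited to 100 characters
        Message=message,
    )
    return response["MessageId"]


# With boto3 installed and credentials configured, usage could look like:
#   import boto3
#   sns = boto3.client("sns")
#   topic_arn = sns.create_topic(Name="solution-health-events")["TopicArn"]
#   publish_health_event(sns, topic_arn, "checkout", "DEGRADED", "p95 latency above target")
```

Subscribers to the topic, such as an email address or an on-call tool, would then receive the event.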

Threat Detection

Amazon GuardDuty is a threat detection service that continuously monitors for malicious activity and unauthorized behavior to protect the AWS workloads and solution components. It analyzes inputs from sources such as CloudTrail event logs, VPC Flow Logs, and DNS logs.

Amazon provides a host of services, each with its own tradeoffs, that can be used to capture the data that acts as SLIs for the SRE process. SRE is evolving continuously, and more data is being produced and collected for analysis and action. With more and more inputs, SRE teams aim to provide near-100% reliability and availability of solutions.

Data Security and Privacy

AWS provides multiple services that together create a layered security posture for the hosted solutions. Network and infrastructure security is provided through services such as VPC and VPN. Access controls are provided by services such as IAM, which authenticate and authorize users. Data protection and encryption are provided by AWS services such as AWS Key Management Service (KMS) and with the help of AWS partners.

This is a non-exhaustive list of the monitoring tools available on AWS, but these are the services that I find most successfully implemented in teams just starting out on their SRE journey with AWS. For a complete review of the services available on other platforms like Azure and Alibaba Cloud, please see the following blogs in this series.

If you and your team are building out your best practices in site reliability engineering, feel free to reach out with your questions, comments, and favorite stories from the cutting edge of DevOps and SRE!