Site Reliability Engineering with Alibaba Cloud's Monitoring Services

by Sal Kimmich, June 18th, 2021

Too Long; Didn't Read

Alibaba Cloud is a major cloud services provider in China and specific regions of Asia, and it is rapidly expanding both its service portfolio and its geographic reach. Services offered by Alibaba include Auto Scaling, Application Real-Time Monitoring Service (ARMS), Global Accelerator, Global Traffic Manager (GTM), and Application High Availability Service (AHAS). Cloud Monitor is the enterprise-grade service on Alibaba Cloud that is used to monitor cloud-native and custom solution metrics. The best part is that the service itself is free.

Why Consider Alibaba from an SRE Perspective?

Alibaba Cloud is a major cloud services provider in China and specific regions of Asia. Measured against the top global cloud service providers, Alibaba is rapidly expanding both its service portfolio and the regions it serves.

In this review, we take a quick look at the services Alibaba Cloud offers to monitor and scale solutions reliably. It is meant as a starting point for anyone new to the Alibaba ecosystem, or for anyone looking for an easy way to compare it with other cloud providers.

Reliability and low cost are the top two objectives of all cloud platforms across the globe. User experience is directly related to the reliability of the solution and the platform.

A solution that is unreliable in terms of availability, performance, scalability, or security can adversely impact the business of any organization.

To overcome these challenges, organizations employ a dedicated team to bridge the gaps between development, delivery, and operations teams, often with the help of monitoring and observability tooling.

Beyond tooling, defining the correct SLAs and SLOs is equally important for achieving the desired level of reliability. For a deeper understanding of what SLAs, SLOs, and SLIs are, you can reference the other article in this series on SRE for AWS.

Auto Scaling

Alibaba offers elastic compute services, elastic desktop services, and supercomputing services.

Auto Scaling is a management service that performs scale-out and scale-in of computing resources within a scaling group. It receives the trigger for scaling activity from the monitoring services, such as Cloud Monitor, once the defined threshold is reached, for example, 80% load for scale-out and 30% load for scale-in.
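The threshold logic described above can be sketched in a few lines of Python. This is only an illustration of the decision rule, not Alibaba Cloud's implementation or API: the `ScalingPolicy` class, the `desired_instance_count` function, the one-instance step size, and the minimum and maximum group sizes are all assumptions made for the example. Only the 80% and 30% thresholds come from the text.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    """Hypothetical policy object; names and defaults are illustrative only."""
    scale_out_threshold: float = 80.0  # percent load that triggers scale-out
    scale_in_threshold: float = 30.0   # percent load that triggers scale-in
    min_instances: int = 1             # assumed lower bound for the group
    max_instances: int = 10            # assumed upper bound for the group


def desired_instance_count(current: int, load_percent: float,
                           policy: ScalingPolicy) -> int:
    """Return the instance count a scaling group should move to."""
    if load_percent >= policy.scale_out_threshold:
        target = current + 1   # scale out by one instance (assumed step)
    elif load_percent <= policy.scale_in_threshold:
        target = current - 1   # scale in by one instance (assumed step)
    else:
        target = current       # load is inside the comfortable band
    # Never leave the configured bounds of the scaling group.
    return max(policy.min_instances, min(policy.max_instances, target))


if __name__ == "__main__":
    policy = ScalingPolicy()
    for load in (85.0, 50.0, 20.0):
        print(load, desired_instance_count(4, load, policy))
```

In a real deployment the load figure would come from a monitoring service such as Cloud Monitor, and the gap between the two thresholds helps keep the group from flapping between scale-out and scale-in.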

It helps in maintaining performance stability for the entire solution. The best part is that the service itself is free.

Cloud Monitor

Monitoring is a key activity for the SRE teams. Without monitoring the SLIs, it is not possible to maintain and improve the service reliability. Cloud Monitor is the enterprise-grade service on Alibaba Cloud that is used to monitor cloud-native and custom solution metrics.

It uses application events, data events, host and container events, and log messages to monitor the solution and raise alerts. It can visualize the results on its dashboard or send triggers to other services so that they can act.

Application Real-Time Monitoring Service (ARMS)

ARMS is a closely related service to Cloud Monitor. It helps in monitoring the solutions end-to-end broadly under 3 categories: Browser monitoring, Application monitoring, and Prometheus monitoring.

It is extremely helpful for large, distributed solutions with multiple application modules running in tandem to provide business functionality.

It can perform health checks and monitor performance down to the level of individual browser activities and service invocations.

Resource Orchestration Service (ROS)

ROS is one of the most useful tools for the SRE teams for managing the cloud resources of the entire solution.

It helps in creating standard deployment templates and using them to provision fully configured new platform and solution instances.

It can be used to provision the very first solution instance, additional instances in multiple zones for performance needs, and failover instances during a disaster recovery process.

Server Load Balancer (SLB)

As the name suggests, SLB helps with the distribution of concurrent request loads across the server cluster. It can perform load balancing at Layer 4 in Classic mode and at Layer 7 in Application mode.

It supports protocols such as TCP, HTTP/S, UDP, and QUIC. It can make use of custom attributes from Layer 7 (the application layer) to provide application health monitoring and throttling features.
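To make the idea of distributing requests across a cluster concrete, here is a minimal sketch of round-robin balancing that skips unhealthy servers. It is not how SLB is implemented, and round-robin is only one of several possible scheduling strategies. The `RoundRobinBalancer` class and the server names are hypothetical.

```python
from typing import Dict, List


class RoundRobinBalancer:
    """Toy round-robin balancer that skips backends marked unhealthy."""

    def __init__(self, backends: List[str]) -> None:
        if not backends:
            raise ValueError("at least one backend is required")
        self._backends = list(backends)
        self._healthy: Dict[str, bool] = {b: True for b in backends}
        self._next = 0  # index of the backend to try first on the next pick

    def mark(self, backend: str, healthy: bool) -> None:
        """Record the result of a health check for one backend."""
        self._healthy[backend] = healthy

    def pick(self) -> str:
        """Return the next healthy backend in rotation."""
        n = len(self._backends)
        for offset in range(n):
            idx = (self._next + offset) % n
            candidate = self._backends[idx]
            if self._healthy[candidate]:
                self._next = (idx + 1) % n
                return candidate
        raise RuntimeError("no healthy backends available")


if __name__ == "__main__":
    lb = RoundRobinBalancer(["server-a", "server-b", "server-c"])
    lb.mark("server-b", False)  # simulate a failed health check
    print([lb.pick() for _ in range(4)])
```

The health flag stands in for the health monitoring mentioned above: a backend that fails its check simply drops out of the rotation until it is marked healthy again.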

Global Traffic Manager (GTM)

GTM uses intelligent DNS resolution to redirect requests to the servers in the nearest geographic location. Apart from application health checks and load balancing, it provides automatic failover needed in case of geo-disaster recovery and zone disaster recovery. It ensures the availability of applications even in case of a failed zone.
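The routing decision behind this kind of geo-aware failover can be sketched as "pick the nearest endpoint that is still healthy". The example below is a simplified model of that decision only. It does not touch DNS, and it is not GTM's actual algorithm. The `Endpoint` class, the `resolve` function, the endpoint names, and the approximate coordinates are all assumptions for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Endpoint:
    """Hypothetical endpoint record: a name, a location, and a health flag."""
    name: str
    lat: float
    lon: float
    healthy: bool = True


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points, in kilometres."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def resolve(client_lat: float, client_lon: float,
            endpoints: List[Endpoint]) -> Optional[Endpoint]:
    """Return the nearest healthy endpoint, or None if all are down."""
    candidates = [e for e in endpoints if e.healthy]
    if not candidates:
        return None
    return min(candidates,
               key=lambda e: haversine_km(client_lat, client_lon, e.lat, e.lon))


if __name__ == "__main__":
    endpoints = [
        Endpoint("singapore", 1.29, 103.85),
        Endpoint("hangzhou", 30.27, 120.15),
        Endpoint("frankfurt", 50.11, 8.68),
    ]
    client = (1.35, 103.82)  # a client located near Singapore
    print(resolve(*client, endpoints).name)
    endpoints[0].healthy = False  # simulate a failed zone
    print(resolve(*client, endpoints).name)
```

When the nearest location fails its health check, the same call returns the next-closest healthy one, which is the failover behavior described above.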

Global Accelerator

Global Accelerator uses Border Gateway Protocol (BGP) to accelerate network traffic by leveraging points of presence (PoP) for serving requests from the nearest geolocation.

It helps in reducing network latency, network fluctuations, and data loss. At the same time, it helps in achieving both high availability and high security.

Application High Availability Service (AHAS)

AHAS is a SaaS-based service that offers features such as topology detection, service fallback options, and availability assessment based on fault injection.

It helps in traffic shaping for the entire solution in a fast and cost-effective manner. This is a very handy tool for maintaining the high availability of the solution without manual intervention.

Security Center

Security Center is a unified security console for identifying, analyzing, alerting, and managing all security threats for the platform and the solution.

It helps in protecting the solution from virus attacks, hacking attacks, ransomware attacks, web-tampering attacks, and other similar threats.

It can also be configured to take automated action in case of attacks. It is extremely helpful in maintaining the overall reliability of the solution.

Action Trail

Action Trail allows delivering events and messages to Log Stores or Object Stores. Once the data is collected, it allows performing behavior analysis, security analysis, auditing, and compliance checks. SRE teams use it to detect and prevent unwanted activities.

Although it serves customers in a limited set of geographic regions, Alibaba Cloud is a cost-effective platform equipped with the tools and features needed to enable reliability.

If you and your team are working with Alibaba in your pursuit of Site Reliability Engineering excellence, I'd love to hear your stories!