AIOps is a fast-developing field, but AIOps platforms still face challenges in efficiency, reliability, and cost.
This article is part of the Search AIOps mini-series.
AIOps, a term that once stood for ‘Algorithmic IT Operations’, has recently adopted a more topical moniker: Artificial Intelligence for IT Operations. As both names suggest, AIOps refers to IT operations built on AI technologies and algorithms.
The rapid development of search services has transformed search systems into veritable platforms. Operations have transitioned first from manual work to automated scripts and then to DevOps. However, the rapid progress of big data and AI technologies has made it increasingly difficult for conventional operations solutions to meet demands.
In an attempt to make platforms more efficient and reliable at a lower cost, Alibaba has launched two transformative tools: Hawkeye, an online service optimization solution, and Torch, a capacity planning platform.
Hawkeye improves efficiency and reliability, primarily through smart diagnosis and optimization. The Hawkeye architecture is detailed in the following figure:
Hawkeye has three major layers: the analysis layer, the web layer, and the service layer.
Analysis layer
This layer is broken down into two sections:
· Hawkeye-blink
An engineering section for bottom-layer analysis. It relies on Blink’s powerful data processing capacity to analyze the access logs and full data of all Ha3 applications on the search platform.
· Hawkeye-experience
An engineering section for one-key diagnosis. Using the analysis results of Hawkeye-blink as a basis, Hawkeye-experience conducts analysis that is closer to users, such as field information detection (including field type rationality and field value monotonicity). Its service spectrum also covers kmon invalidity alarms, the recording of smoke cases, engine degradation configurations, memory-related configurations, recommended replica/partition numbers, and minimum service replicas when switching.
The engineering goal of Hawkeye-experience is to build a shared layer of search engine diagnosis rules that incorporates the valuable experience of operations staff into the system. This way, every new application has instant access to that experience instead of hitting a whole host of hurdles along the way. We aim to give each user a role (such as a smart diagnosis expert) to optimize their engines. A flow chart that depicts how Hawkeye-experience processes data is shown below:
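As a rough illustration of how such a repository of diagnosis rules might be organized, here is a minimal Python sketch. It is not Hawkeye’s actual implementation; the rule names, application fields, and thresholds are illustrative assumptions.

```python
# Sketch of a diagnosis-rule registry in the spirit of Hawkeye-experience:
# operations experience is codified once as rules, then run against every
# application. All names and thresholds below are illustrative.

RULES = []

def rule(name):
    """Decorator that registers a diagnosis rule under a readable name."""
    def register(fn):
        RULES.append((name, fn))
        return fn
    return register

@rule("field type rationality")
def field_type_rationality(app):
    # Flag string fields whose sampled values are all numeric (placeholder check).
    return [f["name"] for f in app["fields"]
            if f["type"] == "string" and all(v.isdigit() for v in f["samples"])]

@rule("minimum replicas when switching")
def minimum_replicas(app):
    return ["replicas < 2"] if app["replicas"] < 2 else []

def diagnose(app):
    """Run every registered rule; an empty list means the rule passed."""
    return {name: fn(app) for name, fn in RULES}
```

Because new rules register themselves through the decorator, each piece of accumulated experience immediately applies to every application on the platform.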
Web layer
This layer presents Hawkeye analysis results in various visual monitoring tables and reports.
Service layer
This layer offers API output for Hawkeye analysis and optimization.
Alibaba has implemented the following diagnosis and optimization functions based on the aforementioned architecture:
· Resource optimization: Engine lock memory optimization (invalid fields analysis), real-time memory optimization, among others
· Performance optimization: TopN slow query optimization, buildservice resource configuration optimization
· Smart diagnosis: Routine inspection and smart Q&A, among others
The following sections detail a key function from each category: lock memory optimization, slow query analysis, and smart Q&A.
For Ha3 engines, engine fields are divided into the inverted index (index), forward index (attribute), and summary index (summary). An engine’s lock strategy can be set to Lock Memory or Unlock Memory for each of these three indexes. The advantages of Lock Memory are self-evident, namely faster access and lower response time (RT). But it is also highly likely that out of 100 fields, only 50 are accessed in the space of a few months, with the others having zero access records in the index, which is a colossal waste of memory. To avoid this, Hawkeye analyzes field access and downsizes the indexes of head applications. The figure below shows the process of Lock Memory optimization, a process that has saved millions of RMB in total.
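For illustration, here is a minimal sketch of the invalid-field analysis behind this optimization, assuming the access logs have already been aggregated into per-field access counts over the observation window; the field names and input format are hypothetical.

```python
# Find locked index fields with zero accesses in the observation window;
# unlocking (or removing) them frees the memory they pin. The input format
# and field names are illustrative, not Hawkeye's real data model.

def unlock_candidates(field_access_counts, locked_fields):
    """Return locked fields that were never accessed in the window."""
    return [f for f in locked_fields if field_access_counts.get(f, 0) == 0]

# Example: four locked fields, two of which are never queried.
counts = {"title": 90_210, "price": 55_003, "old_tag": 0}
locked = ["title", "price", "old_tag", "legacy_score"]
print(unlock_candidates(counts, locked))  # ['old_tag', 'legacy_score']
```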
Slow query data comes from applications’ access logs. The number of queries is tied to the number of page views, usually in the tens or hundreds of millions, so extracting the TopN slow queries from such massive logs is a big data analysis problem. Relying on Blink’s big data analysis capacity, we obtain the TopN slow queries using divide-and-conquer + hash + min-heap: first parse the query format and extract the query time, then take the md5 value of the parsed k-v data and shard by that md5 value; calculate the TopN slow queries within each shard, and finally merge the per-shard results into the overall TopN. Personalized optimization recommendations based on the analyzed TopN slow queries are pushed to users to help them boost engine query performance, which indirectly increases engine capacity.
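The following single-machine Python sketch mirrors that divide-and-conquer + hash + min-heap flow. In production this runs as a Blink job over massive logs; the log format assumed here (query text and latency separated by a tab) is illustrative.

```python
import hashlib
import heapq
from collections import defaultdict

def parse_log_line(line):
    # Hypothetical log format: "<parsed k-v query>\t<latency in ms>".
    query, latency = line.rsplit("\t", 1)
    return query, float(latency)

def top_n_slow_queries(log_lines, n=10, num_shards=16):
    """Divide-and-conquer + hash + min-heap, as described above."""
    # 1. Shard parsed queries by the md5 of their k-v text.
    shards = defaultdict(list)
    for line in log_lines:
        query, latency = parse_log_line(line)
        shard_id = int(hashlib.md5(query.encode()).hexdigest(), 16) % num_shards
        shards[shard_id].append((latency, query))

    # 2. Keep only the TopN slowest queries inside each shard;
    #    heapq.nlargest maintains a size-n min-heap internally.
    per_shard_top = [heapq.nlargest(n, shard) for shard in shards.values()]

    # 3. Merge the per-shard TopN lists into the final TopN.
    return heapq.nlargest(n, (q for top in per_shard_top for q in top))
```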
The health score is Alibaba’s metric for measuring the health of an engine and tells users how healthy their service is. The diagnosis report specifies the diagnosis time, describes unsuitable configurations in detail, and highlights the benefits of optimization. Shown below are the diagnosis logic and a page of faulty results after one-key diagnosis. The diagnosis details page is omitted here due to space limitations.
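The article does not disclose how the health score is computed; as an illustration only, here is a simple weighted-deduction model that aggregates one-key diagnosis results into a score and a findings list.

```python
from dataclasses import dataclass

@dataclass
class RuleResult:
    name: str        # e.g. "unused lock-memory fields" (illustrative)
    passed: bool
    penalty: int     # points deducted when the rule fails (assumed weights)
    detail: str = ""

def health_score(results, base=100):
    """Aggregate diagnosis results into a 0-100 score plus findings."""
    score, findings = base, []
    for r in results:
        if not r.passed:
            score -= r.penalty
            findings.append((r.name, r.detail))
    return max(score, 0), findings
```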
The growing number of applications has led to more and more questions being raised on the platform. In answering them, it became apparent that some were being asked time and again, such as those about stopped increments and common resource alarms. Questions that can be handled by an established procedure can be built into chatOps and then handled by an answering robot. Currently, Hawkeye combines kmon metrics with customizable alarm message templates and handles the smart Q&A of these questions by adding a diagnosis to the alarm texts. Users can paste alarm messages into the Q&A group and @ the robot, and they then receive the reason for the alarm.
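A minimal sketch of that chatOps loop, assuming hypothetical alarm templates; the real kmon templates and diagnosis routines are internal.

```python
import re

# Hypothetical alarm templates: a pattern plus the diagnosis it triggers.
ALARM_TEMPLATES = [
    (re.compile(r"increment .* stopped", re.I),
     "increment stopped: check the build service and upstream data source"),
    (re.compile(r"cpu .* exceeds \d+%", re.I),
     "cpu alarm: check TopN slow queries and consider expanding replicas"),
]

def answer_alarm(message: str) -> str:
    """Match a pasted alarm message against known templates and return a
    diagnosis, mimicking the @robot workflow described above."""
    for pattern, diagnosis in ALARM_TEMPLATES:
        if pattern.search(message):
            return diagnosis
    return "no template matched; escalating to a human operator"

print(answer_alarm("kmon: increment of app_x stopped at 10:02"))
```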
While Hawkeye improves efficiency and reliability, Torch focuses on lowering costs through capacity management. Applications on the search platform are growing in number, and the issues specified below have to be properly addressed; otherwise, resources are used inefficiently and wastefully.
1. Businesses apply for container resources at will, resulting in inevitable waste. They need guidance on how to apply for resources (including CPU, memory, and disks) to minimize container costs. Alternatively, users could be prevented from managing resources themselves.
2. Businesses grow and shrink all the time, and nobody knows their true online capacity. What qps limit can be supported? Does capacity need to be expanded when a business experiences higher traffic? If so, what are the options: expanding the replicas or increasing the CPU of individual containers? And when a business needs more data, is it better to unstack columns or to increase the memory of individual containers? Each of these questions needs an answer.
Consider the resources currently at hand: the statuses of the online systems are reported to kmon, so kmon data is readily available. Here’s a thought: how about using the kmon data directly for capacity estimation?
Experiments have found that this is not enough. Many online applications run at low load levels, so extrapolating their high-load capacity from that data is unrealistic. That is why a stress test is required to get a true picture of performance capacity. The question is where to conduct it: online stress tests are risky, and pre-release stress tests do not reflect real online capacity, especially given the limited pre-release resources and poor device configurations. This is where clone simulation comes in: cloning a single online instance for a stress test is both accurate and safe. With stress test data in place, the next task is to find the cheapest resource configuration through algorithm analysis. The task management module can then manage every task and estimate capacity automatically.
This is the proposed solution:
The following sections give details on the system architecture of this solution, on how clone simulation is realized within it, and on the stress test and algorithm services.
At the bottom of the figure is the access layer. To switch a platform in, just provide the application information and cluster information for all applications on the platform (currently, only Ha3 and sp on TISPLUS are switched in). The application management module integrates the application information, and the task management module abstracts every application into a capacity estimation task.
The process of capacity estimation is as follows. First, clone a single instance and subject it to an automatic stress test until it reaches its capacity limit. The stress test data and day-to-day data are then formatted in a data factory and submitted to the decision-making center. The decision-making center estimates the capacity with the algorithm service using both data sets and determines the gains. If the gains are considerable, the decision-making center takes the algorithm’s capacity optimization suggestions and carries out clone stress tests to verify them. If the verification passes, the results are stored permanently; if it fails, a simple capacity estimation follows (roughly estimating the capacity from the performance limit found in the stress test). Once the capacity estimation passes or fails, the decision-making center clears the temporary resources used for the clone and stress test, thereby avoiding waste.
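Here is the same decision flow condensed into an illustrative Python skeleton. Every helper is a placeholder standing in for a real Torch subsystem, and the gain threshold is an assumed cut-off for "considerable" gains.

```python
# Placeholder subsystems; the real ones are Torch's clone, stress test,
# data factory, and algorithm services.
def clone_single_instance(app): return f"{app}-clone"
def run_stress_test(clone): return {"qps_limit": 1200}
def load_daily_data(app): return {"avg_qps": 300}
def algorithm_service(stress, daily): return {"cpu": 4, "mem_gb": 16}, 0.35
def verify_with_clone(app, plan): return True
def simple_estimate(stress): return {"qps_limit": stress["qps_limit"]}
def release_resources(clone): pass
def persist(result): print("stored:", result)

GAIN_THRESHOLD = 0.2  # assumed definition of "considerable" gains

def estimate_capacity(app):
    clone = clone_single_instance(app)          # 1. clone one online instance
    try:
        stress = run_stress_test(clone)         # 2. stress it to its limit
        daily = load_daily_data(app)            # 3. day-to-day kmon data
        plan, gain = algorithm_service(stress, daily)
        if gain >= GAIN_THRESHOLD and verify_with_clone(app, plan):
            persist(plan)                       # verification passed
        else:
            persist(simple_estimate(stress))    # fall back to a rough estimate
    finally:
        release_resources(clone)                # always free temporary resources

estimate_capacity("app_x")
```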
At the top of the figure is the application layer. Considering that Torch capacity management is not meant for TISPLUS alone, the application layer provides capacity indexes, capacity estimation, capacity statements, and gains indexes, allowing other platforms to be switched in, in embedded form. A capacity API is also provided for other systems to call.
Capacity estimation also relies on a number of other search systems, such as maat, kmon, Hawkeye, drogo, and the cost system, which combine to form a closed loop.
Simply put, clone simulation clones a single instance of an online application. For Ha3 applications, entire replicas are cloned; for sp, an independent service is cloned. With the powerful search weapon hippo on the scene, resources are used in the form of containers, and the rise of DevOps platforms like suez ops and sophon has made the fast cloning of an application a reality. Below are the concrete steps the clone management module follows:
There are two types of clones: shallow and deep. Shallow clones mainly apply to Ha3 applications and directly reference the index of the master application through shadow tables, omitting the index build step and thereby speeding up the clone. For deep clones, the cloned application’s index needs to be built offline, as sketched below.
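As a sketch of that shallow-versus-deep decision, with placeholder helpers standing in for the real suez ops/sophon calls:

```python
# Placeholder helpers; real cloning goes through hippo containers and
# DevOps platforms such as suez ops and sophon.
def allocate_containers(app): return {"app": app, "index": None}
def attach_shadow_tables(clone, master): clone["index"] = f"shadow:{master}"
def build_index_offline(clone): clone["index"] = "built-offline"

def clone_application(app, deep=False):
    """Shallow clones reuse the master index via shadow tables;
    deep clones rebuild the index offline first."""
    clone = allocate_containers(app)
    if deep:
        build_index_offline(clone)
    else:
        attach_shadow_tables(clone, master=app)
    return clone
```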
The advantages of cloning are self-evident:
· Service isolation: Real online capacity can be estimated by stress testing the cloned environment without touching online services.
· Verifiability: Resource optimization suggestions can be verified in the cloned environment via stress tests.
· Automatic release: After use, the cloned environment is released automatically, so no online resources are wasted.
Daily kmon data mostly lacks high-load metrics, and real engine capacity cannot be determined without an actual stress test, so a stress test service is needed. An early-stage survey of both Alibaba’s Amazon stress test platform and the Alimama stress test platform found that neither could meet the demand for automatic stress tests. As a result, Alibaba developed a self-adaptive distributed stress test service based on hippo that applies stress by adding workers.
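The article does not detail the stress algorithm; as an illustration, here is a self-adaptive loop that keeps adding load until latency breaches a threshold, then reports the last sustainable qps. The latency probe is simulated; the real service schedules stress workers on hippo and reads metrics from kmon.

```python
LATENCY_SLO_MS = 100  # assumed latency ceiling for the stress test

def probe(qps):
    """Simulated latency curve; replace with real measurements from kmon."""
    return 20 + 0.08 * qps  # latency grows with load in this toy model

def find_qps_limit(start_qps=100, step=100, max_qps=10_000):
    qps = start_qps
    while qps <= max_qps:
        if probe(qps) > LATENCY_SLO_MS:
            return qps - step   # last load level that still met the SLO
        qps += step             # self-adapt: add another stress worker's load
    return max_qps

print("estimated qps limit:", find_qps_limit())  # -> 1000 in this toy model
```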
The aim of capacity estimation is to minimize resource costs and increase the resource utilization rate. This has one precondition, namely that resources have to be quantifiable by cost. Cost is an important dimension for running search as a platform and for evaluating platform value, so the Alibaba search team worked with financial staff to formulate a price equation that satisfies this precondition. Extensive experimental analysis conducted with the algorithm teams found that this can be converted into a planning problem with constraints (a sketch follows the two points below):
· The target function of optimization is the price equation (which includes variables such as memory, CPU, and disk).
· The constraints are that the specs and number of the provided containers must guarantee the minimum required qps, memory, and disk.
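As a toy instance of that planning problem: minimize a hypothetical price equation over container specs and counts, subject to qps, memory, and disk guarantees. The coefficients, specs, and per-CPU qps figure are all made up for illustration; the real values come from Alibaba’s cost model and the clone stress tests.

```python
from itertools import product

def price(cpu, mem_gb, disk_gb, count):
    # Hypothetical price equation (RMB/hour); the real one is internal.
    return count * (0.5 * cpu + 0.1 * mem_gb + 0.01 * disk_gb)

QPS_PER_CPU = 250                                    # from stress test data
REQUIRED = {"qps": 5000, "mem_gb": 96, "disk_gb": 800}
CPU_SPECS, MEM_SPECS, DISK_SPECS = [2, 4, 8, 16], [8, 16, 32, 64], [100, 200, 500]

def cheapest_plan():
    """Brute-force search: minimize the price equation subject to the
    qps / memory / disk constraints described above."""
    best = None
    for cpu, mem, disk, count in product(CPU_SPECS, MEM_SPECS, DISK_SPECS, range(1, 33)):
        if (count * cpu * QPS_PER_CPU >= REQUIRED["qps"]
                and count * mem >= REQUIRED["mem_gb"]
                and count * disk >= REQUIRED["disk_gb"]):
            cost = price(cpu, mem, disk, count)
            if best is None or cost < best[0]:
                best = (cost, {"cpu": cpu, "mem_gb": mem, "disk_gb": disk, "count": count})
    return best

print(cheapest_plan())
```

At real scale this would be solved as a proper constrained optimization (for example, integer programming) rather than brute force, but the objective and constraints take the same form.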
Launching Hawkeye and Torch on the TISPLUS search platform slashed costs, improved efficiency and reliability, and boosted Alibaba’s ability to apply AIOps to other online systems. The next target is to bring Hawkeye and Torch together into one AIOps platform and to give other online services access to the benefits of AIOps. Openness and accessibility have therefore become the two most important focuses of the platform.
To this end, constructing the following four operation libraries is the next major step:
· Metrics library
Standardizes and integrates information on online system logs, monitoring metrics, events, and applications to ensure easy acquisition of various operational metrics.
· Knowledge library
Provides search and computation functions based on a question set accumulated in ES and on the experience of daily Q&A sessions, facilitating the automatic diagnosis and self-healing of similar online problems.
· Components library
Modularizes clone simulation, stress tests, and algorithm models, allowing users to flexibly select algorithms to implement strategies and to use clone simulation and stress tests to verify their validity.
· Strategy library
Lets users quickly implement the operation strategies of their systems by dragging components and writing UDFs (user-defined functions) on a canvas. The metrics, knowledge, and components libraries offer diversified data and components, making operation strategies easy to implement.
Once this infrastructure is built and combined with strategies, data from various operation scenarios can be produced and then used for troubleshooting, smart Q&A, capacity management, performance optimization, and other applications in different scenarios.
If one thing is certain, it is that there is still a long way to go before it is possible to rid the world of poorly run search engines. From SaaS capacity to treating search algorithms as products, and from DevOps to AIOps on the cloud, a number of challenges still lie ahead.
(Original article by Cai Yunlai (蔡云雷) & Li Xuefeng (李雪峰))
First-hand and in-depth information about Alibaba’s latest technology → Facebook: “Alibaba Tech”. Twitter: “AlibabaTech”.