How the Alibaba Tech Team is using DevOps to improve the agility and adaptability of their search enabling platform
This article is part of the Search AIOps mini series.
At the end of 2015, the Alibaba tech team released its enabling platform strategy for building organizational and business mechanisms around the creative and flexible concept of “Large Enabling Platform, Small Frontend.” The idea is to make the frontend more agile so that it can adapt quickly to the market, while the enabling platform consolidates digital operation, product, and technology capabilities to provide strong support for the various frontline services.
Bringing search into the enabling platform is a crucial link in Alibaba’s enabling platform layout. However, the intrinsic complexity and large business scale of search technologies expose the search enabling platform to huge technical and product challenges.
The Alibaba search enabling platform was established to support frontend services by increasing their agility, helping them adapt more quickly to changes in the market, and ultimately to eliminate inconvenient search functions. To accomplish this goal, the Alibaba tech team constructed their search enabling platform from scratch over a three-year period, accumulating cutting-edge knowledge and experience in the running of DevOps, AIOps, and offline services on the platform.
The following figure shows the predicted development trends for the search enabling platform over a three-year period, including a summary of its implementation.
The three key developmental stages represented in this figure are manual control, automated script operations, and integrated development and operations.
During the manual control stage, the operation of both the search service department and open source search technologies relied entirely on human effort, and massive amounts of manpower were wasted on redundant, inefficient jobs. Expertise and domain know-how were only beginning to accumulate at this time. As PE gained experience, however, they determined that automated scripts could implement common, repeated operations, saving manpower and improving operational efficiency.
The automated script operations stage was characterized by the use of open source technology systems. However, working in this fashion naturally separated developers and operators and placed the two roles in opposition to each other. Developers required fast iterations, while operators focused on minimizing iterations to maintain online stability. This separation led to mutual distrust, fed by online failures resulting from configuration changes and software updates. Ultimately, the two sides reached a compromise under which updates would be released only during a release window every Tuesday and Thursday. However, this compromise lowered business operation efficiency, which in turn created a large gap between system capacity and the business side's demand for iterations.
Integrated development and operations is the current stage of development for the Alibaba search enabling platform.
To solve the issues with automated script operations, the team developed a new control system based on DevOps for integrating development and operations that provides better solutions for iteration releases.
Because business scenarios are essentially a process of managing technical systems, the team believes DevOps should be more than a simple methodology for integrating development and operations within individual systems. Alibaba hopes to establish DevOps as the “ops” above individual system ops. This intrinsically separates the work from other Alibaba DevOps platforms, as represented by Apsara Base, which manages end-to-end processes from deployment to service source code updates.
In essence, Apsara Base users are still considered operators. Apsara Base therefore builds its product around IaC (Infrastructure as Code), with deployment configurations managed in Git. This is a typical approach for designing a DevOps platform and meets the main job requirements.
However, Alibaba often faces end users who lack expertise in online system operations and get lost when all they can see is configurations or code. Fundamentally, the team must advance their understanding of DevOps and move towards viewing platforms as products. For this purpose, the team must avoid exposing configurations, code, and the complexities of industry expertise to users, and must transform system collaboration to control the end-to-end experience. Radical improvements in the iteration efficiency of complex search can only be achieved by simplifying processes and controlling the end-to-end, full-link experience.
Years of effort were put into trying these two approaches, which resulted in the implementation of a series of DevOps systems, including Sophon-Bitmain, Bahamut, and Maat. These undertakings are described in detail in subsequent sections.
Before examining the Sophon-Bitmain system for integrating development and operations, it is important to understand which systems are involved and how they work together when switching to a service with complex search scenarios.
The system outlined in the previous figure is divided into three modules: Ops, Online, and Offline. As shown in the figure, the Ops layer is divided into online stateful service ops, online stateless service ops, and offline ops. In other words, each service relies on its own ops for control. However, the figure also shows that complicated business scenarios result from the collaboration of multiple service systems.
Before tisplus was launched, whenever a complex project was about to be switched in, the online stateful service team, online stateless service team, offline DUMP team, business side, and PE were brought together to exchange opinions on how to release the project cooperatively. After release, online changes and troubleshooting were handled amid frequent, tense exchanges in the support group. This hurt efficiency, and the approach supported only ten individual business units.
With these pain points in mind, the team can now go back and examine the must-have features of DevOps for a complicated search system:
1. Full link OPS that offers an end-to-end experience and accurately matches the definition of DevOps for complex scenarios.
2. An upgrade from common process-oriented operations to target-driven operations control across the complex operation control link.
3. Sound operations and product abstractions that better enable users.
4. Improved business iteration efficiency, built on a foundation of ensured business reliability.
Alibaba created the Sophon-Bitmain platform, elaborated on in subsequent sections, to address these pain points.
Many users are unfamiliar with target-driven operations and initially find the concept too abstract. However, examining operation scenarios from real searches can help achieve a better understanding of the need for target-driven operations control.
Consider the following example: The index system is using index Version A and is requested to switch to Version B. While the system is rolling out Version B, a new instruction is received to switch to Version C. Under previous methodologies, switching in this manner resulted in crashes. PEs could only complete the process by killing the current switch process, checking which step each node had reached, clearing intermediate states, and reissuing the operation. This demonstrates how unproductive process-oriented operations control can be in complicated operation systems.
Conversely, under a target-driven process, the same switch is performed simply by setting Version C as the new target for the system to roll out. The system compares this latest target with the target it is currently progressing toward. Once it detects that the target has changed, the system immediately terminates the current execution route, automatically clears inconsistent states, and begins distributing the notice to execute the critical route toward the latest target state. Upon receiving the latest orders, each node starts progressing toward the new target.
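The A, B, C scenario above can be illustrated with a minimal sketch of a target-driven control loop. This is not Sophon-Bitmain's implementation: the `Node` and `Controller` classes, the two-step "load then serve" rollout, and the log messages are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """One index node (hypothetical model, not a real Sophon-Bitmain type)."""
    name: str
    current: str                        # version currently being served
    in_progress: Optional[str] = None   # version being rolled out, if any
    log: list = field(default_factory=list)

    def reconcile(self, target: str) -> None:
        """Take one step toward `target`, whatever state the node is in."""
        if self.in_progress is not None and self.in_progress != target:
            # The target changed mid-rollout: discard the stale intermediate state.
            self.log.append(f"abort {self.in_progress}")
            self.in_progress = None
        if self.current == target:
            return
        if self.in_progress is None:
            self.log.append(f"load {target}")
            self.in_progress = target
        else:
            self.log.append(f"serve {target}")
            self.current = target
            self.in_progress = None


class Controller:
    """Holds only the desired target; nodes converge toward it on each tick."""

    def __init__(self, nodes, target: str) -> None:
        self.nodes = nodes
        self.target = target

    def set_target(self, version: str) -> None:
        self.target = version

    def tick(self) -> None:
        for node in self.nodes:
            node.reconcile(self.target)

    def converged(self) -> bool:
        return all(
            n.current == self.target and n.in_progress is None
            for n in self.nodes
        )


if __name__ == "__main__":
    nodes = [Node("n1", "A"), Node("n2", "A")]
    ctl = Controller(nodes, "A")
    ctl.set_target("B")
    ctl.tick()               # nodes begin loading B
    ctl.set_target("C")      # a new target arrives mid-rollout
    while not ctl.converged():
        ctl.tick()
    print(nodes[0].log)
```

The caller never kills a process or inspects per-node steps. It only changes the target, and each node decides for itself how to get from its current state to the desired one.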
This progressive and consistent operation method naturally shields the complexity of intermediate operation states, making complex operations control simpler and more flexible. Because of these benefits, Alibaba’s operation platforms have all been changed to target-driven operation from the top down.
Another commonly-used method for simplified operations is replacing hosting with enabling, meaning that users must shoulder more responsibilities before enjoying more powerful search abilities. However, enabling end users does not mean they should be exposed to the abstruse, complex concepts of search systems and their operations. Simplifying the system’s operation concepts and sealing complex information and industry know-how inside the system is one of the core tasks of Alibaba’s Sophon-Bitmain.
The lower portion of the previous figure displays the infrastructure and online services for each data center from the viewpoint of PE. Without a control abstraction layer, users are exposed to the same level of complexity as PE, which can be extremely confusing. Sophon-Bitmain limits complexity in several ways. First, it abstracts objects into a group of data relation models called operation control models (as shown on the right side of the previous figure). However, the resulting operation abstraction layer is still too complex for users, who should only be shown business abstractions that correspond to specific business scenarios.
As a solution, Sophon-Bitmain added a business abstraction layer on top of the operation abstraction layer. An example is the three concept layers in the top left corner of the previous figure: business logic (plugin, configuration), service (deployment relations), and data (data source & offline data processing). Users can accept the definition of this layer at almost no cost. Therefore, Sophon-Bitmain’s ability to abstract operations and simplify business concepts makes it possible to switch from hosting to enabling users.
Sophon-Bitmain guarantees service reliability in several aspects. As Sophon-Bitmain supports an increasing number of leading core businesses, Alibaba must provide SLA guarantees to search services, and respond to each business’ reliability requirements by deploying online and offline services flexibly. Automatic disaster tolerance switchover is also a required feature.
At its current service reliability level, Sophon-Bitmain supports the unitization of online and offline search service, cold standby deployment of offline data, and automatic disaster tolerance switchover of the query and data backflow links, as shown in the following figure:
One indicator of improved iteration efficiency is that iterations can now be released anytime, anywhere, instead of using the previous time window-based online release method. However, this does not mean release can be performed at will without considering potential risks for quick iteration releases.
To achieve safe and efficient iteration releases, Alibaba designed and standardized a set of iteration release rules. For example, a normal business iteration must be verified in both the daily and pre-release environments. The team also added a multi-layer verification mechanism to ensure the reliability of releases. For example, when plugins and algorithm strategies are upgraded, the team requires that a pressure test be performed on a clone; if performance degrades too much, the release is rejected. In addition, phased release by switching traffic one data center at a time, smoke testing, and similar features can be defined in launch procedures.
These changes have given Sophon-Bitmain powerful multi-layer verification and quick disaster tolerance switchover abilities, reducing the risks of quickly released business iterations while improving iteration efficiency.
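A release procedure with layered checks like the ones described above could be sketched as a chain of gates that stops at the first failure. Everything here is an assumption for illustration: the gate names, the `GateResult` type, the use of throughput as the performance metric, and the 10% degradation threshold do not come from Sophon-Bitmain.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class GateResult:
    gate: str
    passed: bool
    detail: str = ""


def env_gate(name: str, verified: bool) -> GateResult:
    """Records whether the iteration was verified in a given environment."""
    return GateResult(name, verified)


def pressure_test_gate(baseline_qps: float, clone_qps: float,
                       max_drop: float = 0.10) -> GateResult:
    """Fails when throughput on the clone drops by more than `max_drop`.

    The 10% default is an illustrative assumption, not a documented value.
    """
    if baseline_qps <= 0:
        raise ValueError("baseline_qps must be positive")
    drop = (baseline_qps - clone_qps) / baseline_qps
    return GateResult("pressure_test", drop <= max_drop,
                      f"throughput drop {drop:.1%}")


def run_release(gates: List[Callable[[], GateResult]]) -> Tuple[bool, List[GateResult]]:
    """Runs gates in order and rejects the release at the first failure."""
    results: List[GateResult] = []
    for gate in gates:
        result = gate()
        results.append(result)
        if not result.passed:
            return False, results
    return True, results


if __name__ == "__main__":
    ok, results = run_release([
        lambda: env_gate("daily", True),
        lambda: env_gate("pre_release", True),
        lambda: pressure_test_gate(baseline_qps=1000.0, clone_qps=850.0),
    ])
    print(ok, [(r.gate, r.passed, r.detail) for r in results])
```

Because each gate is just a callable, further checks such as a smoke test or a single data center phased rollout could be added to the list without changing `run_release`.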
While the functions of search technology systems are strong, they operate under the control of numerous house rules (e.g., expertise requirements for complex operations control and business iterations) that overwhelm everyday users.
For example, in search scenarios, changing a single field on the business side can necessitate multiple online and offline changes to associated configurations. Involving users in this process can require them to make complex judgments, such as:
· Will online services, offline services, or both be affected?
· Will pushed configurations take effect online or offline first?
· Will all configurations take effect simultaneously following a full backup?
Requiring users to make this sort of judgment can leave them irritated and confused. Therefore, it is important for Alibaba to use the engine service expertise in the Sophon-Bitmain DevOps layer to contain these complexities within the platform, and to manage all associated complex knowledge and decision-making behind the scenes. The operations platform can break down complicated operations and then execute them internally in organized stages, according to a pre-defined expert operations DAG, as shown in the following figure:
Continuously feeding operational expertise into the system (as the execution flow of the operations DAG) decreases the platform’s usage costs while increasing the efficiency of iterations. As operations become increasingly complex (for example, when the operations exposed to users cover more and more services), the operations DAG execution link can evolve from a simple stage to a complicated stage where multiple execution branches exist. Determining the optimal execution link then becomes a detailed undertaking (as shown on the right side of the previous figure). Alibaba refers to this operation as “shortest path routing.” This represents an attempt at achieving smart operations, and it indicates the direction Alibaba is taking for future efforts in this discipline.
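Both ideas, staged execution of an operations DAG and choosing the cheapest of several execution branches, can be sketched with the Python standard library. The operation names, dependency structure, and edge costs below are hypothetical, and the routing function is a plain shortest-path computation over a DAG rather than Alibaba's actual algorithm.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set, Tuple


def stages(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Groups operations into stages.

    `deps` maps each operation to the operations it depends on. Every
    operation in a stage has all of its prerequisites in earlier stages.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    result = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        result.append(ready)
        sorter.done(*ready)
    return result


def cheapest_path(edges: Dict[str, Dict[str, float]],
                  start: str, goal: str) -> Tuple[float, List[str]]:
    """Finds the lowest-cost route through a DAG of operations.

    `edges[u][v]` is the (assumed) cost of running operation v after u.
    """
    preds: Dict[str, Set[str]] = {}
    for u, outs in edges.items():
        preds.setdefault(u, set())
        for v in outs:
            preds.setdefault(v, set()).add(u)
    best = {start: (0.0, [start])}
    # Relaxing edges in topological order is sufficient for a DAG.
    for node in TopologicalSorter(preds).static_order():
        if node not in best:
            continue
        cost, path = best[node]
        for nxt, weight in edges.get(node, {}).items():
            candidate = cost + weight
            if nxt not in best or candidate < best[nxt][0]:
                best[nxt] = (candidate, path + [nxt])
    if goal not in best:
        raise ValueError(f"{goal!r} is not reachable from {start!r}")
    return best[goal]


if __name__ == "__main__":
    deps = {
        "update_offline_config": set(),
        "update_online_config": set(),
        "full_build": {"update_offline_config"},
        "switch_index": {"full_build", "update_online_config"},
    }
    print(stages(deps))

    edges = {
        "start": {"full_build": 60.0, "hot_reload_config": 5.0},
        "full_build": {"done": 10.0},
        "hot_reload_config": {"done": 1.0},
    }
    print(cheapest_path(edges, "start", "done"))
```

In this toy example a field change that can be satisfied by reloading configuration is routed around the much more expensive full build, which is the kind of decision the platform makes on the user's behalf.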
Before discussing offline platform technologies, it is important to cover some background on the universal demand placed on offline processing by search. This has been a common theme for offline cross-team discussions since before offline platforms for complex business were established.
For engines, business data does not consist of a single simple database table; it can come from a number of homogeneous or heterogeneous data sources. Every search business demands both full backup and incremental backup. It is therefore important to determine how to turn the data source relations of each business into an upper-layer abstraction, hide internal processes, and consolidate incremental backup and full backup.
Implementing full backup code and incremental backup code for every new business is not ideal. Looking back at previous work, Alibaba’s insufficient offline support boiled down to the data sources defined by the engine schema being reduced to resource pairs for abstraction and management, which meant that the fundamental abstractions were never extracted.
Currently, all switched-in data resources are Dynamic Tables. When defining these resources with a table abstract, certain universal APIs can converge and remove the need for repeated development, including:
· Creating tables
· Deleting tables
· Modifying tables
· Adding, deleting, modifying, and querying table data
· Defining table relations
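As a rough illustration of how the operations listed above converge into one small interface, here is an in-memory sketch. The class name, method names, and storage layout are assumptions for illustration only; they are not the Dynamic Table API.

```python
from typing import Any, Dict, List, Optional, Tuple


class InMemoryTableStore:
    """Toy table abstraction (hypothetical; not Alibaba's actual API)."""

    def __init__(self) -> None:
        self.columns: Dict[str, List[str]] = {}
        self.rows: Dict[str, Dict[Any, Dict[str, Any]]] = {}
        self.relations: List[Tuple[str, str, str]] = []

    # Creating, deleting, and modifying tables
    def create_table(self, name: str, columns: List[str]) -> None:
        if name in self.columns:
            raise ValueError(f"table {name!r} already exists")
        self.columns[name] = list(columns)
        self.rows[name] = {}

    def drop_table(self, name: str) -> None:
        del self.columns[name]
        del self.rows[name]
        # Relations that mention the dropped table are no longer valid.
        self.relations = [r for r in self.relations if name not in r[:2]]

    def add_column(self, name: str, column: str) -> None:
        if column not in self.columns[name]:
            self.columns[name].append(column)

    # Adding, deleting, modifying, and querying table data
    def upsert(self, name: str, key: Any, row: Dict[str, Any]) -> None:
        unknown = set(row) - set(self.columns[name])
        if unknown:
            raise KeyError(f"unknown columns: {sorted(unknown)}")
        self.rows[name].setdefault(key, {}).update(row)

    def delete(self, name: str, key: Any) -> None:
        self.rows[name].pop(key, None)

    def get(self, name: str, key: Any) -> Optional[Dict[str, Any]]:
        return self.rows[name].get(key)

    # Defining table relations
    def relate(self, left: str, right: str, on: str) -> None:
        if right not in self.columns:
            raise KeyError(f"unknown table {right!r}")
        if on not in self.columns[left]:
            raise KeyError(f"{left!r} has no column {on!r}")
        self.relations.append((left, right, on))


if __name__ == "__main__":
    store = InMemoryTableStore()
    store.create_table("item", ["title", "seller_id"])
    store.create_table("user", ["name"])
    store.upsert("item", 1, {"title": "lamp", "seller_id": 10})
    store.relate("item", "user", "seller_id")
    print(store.get("item", 1), store.relations)
```

With every data source exposed through the same handful of calls, full and incremental processing can be written once against the interface instead of once per business.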
This inspired Alibaba’s approach and overall mentality when it established the offline components platform, Bahamut.
Platformizing offline components, however, only platformizes offline data processing abilities. A search scenario is always a dynamic process of online and offline collaboration. As mentioned previously, all business scenarios are the result of coordinating technical systems. The most important and challenging part of this process is ensuring efficient online and offline collaboration and offering users an end-to-end experience.
The following figure shows that, when using offline data, end users only ever see visual data relation definitions and the simple execution list dump->Build->switchindex. Behind this sits a sophisticated state machine that manages online and offline collaboration; the team has hidden all of this complexity from users.
Next, the team will share how they transformed the personalized demand of every online search business on offline data processing into an abstract, and finally, satisfied the demand via a platform.
The Bahamut platform lets users define relations between their data source information and tables (join operations between heterogeneous tables, such as ODPS and MySQL, are supported). This frontend graph is then submitted to Bahamut, which parses, optimizes, splits, and translates it into several graphs that can be executed by Blink, including:
· Incremental backup syncBlink tasks
· Full backup BulkLoad MR tasks
· Blink Join tasks.
The two most critical graph nodes in this instance are merge and left join. Merge consolidates the processing of all 1:1 and 1:N relational tables into an HBase staging table. For N:1 relational tables, driving is currently supported only from the N side, which is the master table (for example, the merchandise table). In other words, tables on the N side are updated via Blink sync and then merged with tables on the 1 side (for example, user tables) via Blink Join to form complete row records. These row records are then sent to SwiftSink (incremental) and HDFSSink (full). Finally, they flow back to BuildService (an index construction service) to construct indexes, as shown in the following figure:
Online and offline control and collaboration, together with the construction of the Bahamut components platform, give users a powerful, visual way to process and compute complicated offline data relations. This ensures significantly more efficient business support and marks the platform as a milestone in terms of its capabilities for offline consolidation and for providing an end-to-end offline experience.
Alibaba is also working on converting offline capacity to general online service capacity. Alibaba believes that in the near future, the offline component platform will no longer focus only on HA3 search scenarios, but will extend to online search services in their entirety.
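The N:1 left join handled by Bahamut, described earlier, can be illustrated in plain Python. This is only a conceptual sketch: it does not use Blink, HBase, or any Bahamut API, and the column names (`item_id`, `seller_id`, `seller_name`) and the list standing in for the incremental sink are hypothetical.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def join_row(item: Row, users: Dict[Any, Row], key: str = "seller_id") -> Row:
    """Left join for one master-table row: the item is always kept.

    Columns from the 1-side table are added when a match exists; on a
    column name collision the master-table value wins.
    """
    user = users.get(item.get(key), {})
    return {**user, **item}


def full_join(items: List[Row], users: Dict[Any, Row],
              key: str = "seller_id") -> List[Row]:
    """Full pass: join every row of the N-side master table."""
    return [join_row(item, users, key) for item in items]


def on_item_update(item: Row, users: Dict[Any, Row],
                   incremental_sink: List[Row], key: str = "seller_id") -> None:
    """Incremental pass: a change on the N side drives a re-join of that row."""
    incremental_sink.append(join_row(item, users, key))


if __name__ == "__main__":
    users = {10: {"seller_name": "Ann"}}
    items = [
        {"item_id": 1, "seller_id": 10, "title": "lamp"},
        {"item_id": 2, "seller_id": 10, "title": "desk"},
        {"item_id": 3, "seller_id": 99, "title": "chair"},
    ]
    print(full_join(items, users))

    sink: List[Row] = []
    on_item_update({"item_id": 1, "seller_id": 10, "title": "floor lamp"}, users, sink)
    print(sink)
```

Because only the N side drives the join, an update to an item produces exactly one complete row record, whereas an update on the 1 side would fan out to every matching item.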
(The original article was written by Liu Ming 柳明)