How is a new era of data challenges changing operation and maintenance for IT? Best practices from the Alibaba tech team.
This article is part of the AIOps for Big Data series.
The heart of artificial intelligence is data, and companies with a lot of it are constantly working to boost what it can do for them. Beyond that simple fact, though, many continue to wonder what it means to say that data is involved in development practices like AIOps (AI for IT operations), or how data can be used in place of human and even machine-driven analytics.
Further questions abound. How can we use machine learning algorithms together with big data-based business operation and maintenance platforms? How can machine learning enhance alarm filtering, anomaly monitoring, automated repairs, and other tasks to truly liberate operation and maintenance?
Faced with these questions, Alibaba is moving away from a belief in AIOps as a long-term evolution toward a data-centric approach to IT. In that spirit, the group has departed from common industry practices and invested in a robust foundation in DataOps — an automated, process-oriented methodology that data analysts use to improve the quality and rate of analysis cycles.
In this article, we look at the challenges and opportunities facing operation and maintenance teams as they move beyond outdated practices and into the data-driven era.
From ScriptOps to AIOps, Level by Level
Scripted operation and maintenance
- Scripts replace manual operation
- Execution: human + script
- Decision-making: human
Automated operation and maintenance
- Most operation and maintenance work is done automatically or by processes
- Execution: human + system
- Decision-making: human
Highly automated + single-point intelligence
- Operation and maintenance is done by data system construction
- Execution: human + system (80%)
- Decision-making: human + system (20%)
L4: DataOps (advanced)
Highly automated + series intelligence
- Main operation and maintenance scenarios are handled by processes, free of manual intervention
- Execution: human + system (95%)
- Decision-making: human + system (80%)
Fully automatic smart operation and maintenance
- Can be easily adjusted between cost, quality, and efficiency
- Execution: system (100%)
- Decision-making: human + system (95%)
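The levels above can be captured as a small data structure, which makes them easy to compare or query. This is only an illustrative sketch: the class name `OpsLevel`, the field names, and the helper `first_level_with_decision_pct` are hypothetical, and reading each percentage as the system's share of the work is an assumption rather than something the list states.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class OpsLevel:
    """One maturity level, transcribed from the list above."""
    name: str
    execution: str
    decision: str
    # Percentages as printed in the list; None where the list gives none.
    # Treating them as the system's share is an assumption.
    execution_pct: Optional[int] = None
    decision_pct: Optional[int] = None


LEVELS: List[OpsLevel] = [
    OpsLevel("Scripted operation and maintenance", "human + script", "human"),
    OpsLevel("Automated operation and maintenance", "human + system", "human"),
    OpsLevel("Highly automated + single-point intelligence",
             "human + system", "human + system", 80, 20),
    OpsLevel("Highly automated + series intelligence",
             "human + system", "human + system", 95, 80),
    OpsLevel("Fully automatic smart operation and maintenance",
             "system", "human + system", 100, 95),
]


def first_level_with_decision_pct(min_pct: int) -> Optional[OpsLevel]:
    """Return the lowest level whose decision percentage reaches min_pct."""
    for level in LEVELS:
        if level.decision_pct is not None and level.decision_pct >= min_pct:
            return level
    return None
```

For example, `first_level_with_decision_pct(50)` returns the "Highly automated + series intelligence" entry, the first level in the list where the decision-making figure is at least 50%.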
Rough Beginnings: ScriptOps
Operation and maintenance work requires a high level of skill, and its scope exceeds that of other IT fields. Nevertheless, many think of it as limited to releases, modifications, alerts, and device migration, a view that reflects the dated practices known collectively as ScriptOps.
In some ways, this impression is understandable. All big Internet companies begin as small companies where these issues (and all manner of other problems) threaten the company’s survival. Pressure and the pursuit of short-term results, though, have led many to rely on simplistic solutions from online technical forums or even personal blogs, leaving a legacy of misunderstanding that today’s professionals must move beyond.
A Case for ToolOps
The view described above is more than an outside misconception. Anyone who has led newcomers in the field is likely aware of their tendency to deploy one-click batch release software, one-click cleanup, interactive wizard execution, or other “black screen” scripts. Often, they simply re-implement some such solution according to their personal sense of it, failing to grasp the potential for mishap in different scenarios. This invites inefficiencies and security risks, and the history of the Internet is riddled with the disastrous consequences of mistakes as simple as typing in the wrong characters.
Today, it is better understood that novices should not be left to run free on systems they have a limited grasp of. Instead, there is an ongoing push to merge more and more functional scripts into workable tools that can ensure the effective handover of the capabilities they provide — ToolOps, for short.
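As a rough illustration of what turning a one-click cleanup script into a tool might look like, the sketch below wraps the destructive step in two guard rails: a check that the target lies under an explicitly allowed root, and a dry-run default. The names `validate_target`, `cleanup`, and `UnsafeTargetError` are hypothetical and chosen only for this example; nothing here is taken from Alibaba's actual tooling.

```python
from pathlib import Path
from typing import Iterable, List


class UnsafeTargetError(ValueError):
    """Raised when a cleanup target falls outside the allowed roots."""


def validate_target(target: str, allowed_roots: Iterable[str]) -> Path:
    """Resolve the target and confirm it sits strictly below an allowed root."""
    resolved = Path(target).resolve()
    for root in allowed_roots:
        # The root itself is never a valid target, only paths beneath it.
        if Path(root).resolve() in resolved.parents:
            return resolved
    raise UnsafeTargetError(f"refusing to clean {resolved}: not under an allowed root")


def cleanup(target: str, allowed_roots: Iterable[str], dry_run: bool = True) -> List[Path]:
    """Return the files under target; delete them only when dry_run is False."""
    directory = validate_target(target, allowed_roots)
    files = sorted(p for p in directory.rglob("*") if p.is_file())
    if not dry_run:
        for path in files:
            path.unlink()
    return files
```

With this shape, a mistyped path such as `/` is rejected before anything is touched, and the default call only reports what would be removed. The operator has to pass `dry_run=False` deliberately to delete files.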
Shifting to Platform-Based DevOps
When an Internet company’s commercial success raises the scale of its operations, quantitative changes begin to create qualitative changes at the data level. Today, operation and maintenance at a large Internet company demands entirely new computing practices, and simply adding staff is not a solution.
Put another way, when an application grows from hundreds of platform units to tens or hundreds of thousands, data processing changes from a simple matter of CPU, memory, and mechanical hard disks to an elaborate mix of GPUs, FPGAs, ASICs, Optane SSDs, and other hardware, software, and big data distributions.
As issues threaten a large platform’s business and resources, data workers often face tasks bordering on the impossible. At such times, the operation and maintenance job description more closely resembles:
- Global architecture planning
- Resource operation and cost optimization
- Automated platform development
- Stability protection
- Massive data analysis
- Any number of unforeseen scenarios…
For Alibaba, developing platforms to assist operation and maintenance workers in these cases is now a given.
Entering the DataOps Phase
As Alibaba’s business grows, its operation and maintenance capabilities have likewise grown in depth and precision. Through software engineering and data-based innovation, operation and maintenance tools must adapt to handle ultra-large-scale distributed cluster management and improve the stability, efficiency, and cost of the overall product. This presents tremendous challenges for operation and maintenance personnel and sets a high requirement for skill on their part.
Simultaneously, the broader industry has also evolved toward a prevailing concept of AIOps. The field as a whole is pushing for greater awareness of these practices, driven by the idea that a powerful algorithm can replace the intelligence now afforded by human labor. Total automation is still an ambitious goal beyond today’s reality, much like driverless transit.
At Alibaba, the prevailing thought is that if an algorithm is the kernel, then its value depends on the amount of engineering devoted to implementing it. This essentially describes the thinking behind the DataOps stage, in which data figures in all operation and maintenance goals, and data-driven operation and maintenance has been effectively implemented.
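To make the "algorithm as kernel" idea concrete, here is a minimal sketch in which a very small algorithm (a z-score check) is surrounded by the engineering that makes it usable for monitoring: a sliding window, a minimum sample count, and a cooldown that filters repeated alarms. The z-score method, the class name `MetricMonitor`, and all parameter values are assumptions made for illustration, not a description of Alibaba's systems.

```python
from collections import deque
from statistics import mean, pstdev
from typing import Iterable


def zscore(value: float, history: Iterable[float]) -> float:
    """The 'kernel': distance of value from the history mean, in standard deviations."""
    data = list(history)
    sigma = pstdev(data)
    if sigma == 0:
        return 0.0 if value == mean(data) else float("inf")
    return abs(value - mean(data)) / sigma


class MetricMonitor:
    """The 'engineering' around the kernel: windowing, warm-up, alarm filtering."""

    def __init__(self, window: int = 30, min_samples: int = 5,
                 threshold: float = 3.0, cooldown: int = 3) -> None:
        self.history = deque(maxlen=window)  # sliding window of normal values
        self.min_samples = min_samples       # no alarms until enough data exists
        self.threshold = threshold           # z-score above which a value is anomalous
        self.cooldown = cooldown             # observations to stay quiet after an alarm
        self._quiet_for = 0

    def observe(self, value: float) -> bool:
        """Record one observation; return True when an alarm should be raised."""
        anomalous = (
            len(self.history) >= self.min_samples
            and zscore(value, self.history) > self.threshold
        )
        if not anomalous:
            # Only normal points update the baseline, so a spike cannot mask itself.
            self.history.append(value)
        if self._quiet_for > 0:
            self._quiet_for -= 1
            return False
        if anomalous:
            self._quiet_for = self.cooldown
            return True
        return False
```

The kernel is a few lines, while most of the code handles when to trust it and how often to alert. That proportion reflects the point made above: the value of the algorithm depends on the engineering around it.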
The following image illustrates the aforementioned comparison to autonomous driving.
In the big data era, operation and maintenance personnel need to cultivate a new set of skills for a transforming industry. Key abilities now include architectural skills, research and development, operation and maintenance business sense, algorithm engineering, and TPM (technical project management) capability.
Developing effective AIOps is essential to Alibaba’s operation and maintenance platforms and products. Beyond initial construction, human input and participation will continue to define progress toward a working model for the long term. This requires a large number of workers, as well as experts and analysts of different backgrounds and business calibers. Beyond people, it involves supporting visualization technology, machine learning technology, big data analysis, development scenario analysis, and platform implementation, all contributing to the ultimate business value developers are seeking.
Transforming an entire operation and maintenance team is often a painful process. Changing qualifications naturally bring about organizational changes, and the impact on veteran personnel is relatively high. From maintenance to research and development, the first changes are always ideological. In technical innovation, there are always initial principles, followed by a gradual implementation of the new project, while traditional operation and maintenance workers ensure ongoing stability.
From putting out fires to transforming operations, this kind of pain is also part of the transformation any team has to go through. Ultimately, this should be a shift from problem-driven work to value-driven work, from operation and maintenance labor to operation and maintenance development, and from relying on experience to relying on smarter insight. It means the transformation not only of technological capabilities but also of operational and systemic thinking, in the spirit of embracing change and evolving with the times.
(Original article by Ke Min 柯旻)