1st-hand & in-depth info about Alibaba's tech innovation in AI, Big Data, & Computer Engineering
As the size of Alibaba’s data center (DC) grows with the expansion of its business footprint, the increasing frequency of everyday operations like drills and optimization introduces exponential levels of complexity into the system. This makes it increasingly difficult for DC experts to forecast the failures a change to the system could cause, while the business impact of DC failures continues to rise beyond estimation.
To counter such problems, DC operators need a reliable, standardized verification system that can indicate whether changes will affect DC safety and whether more suitable options are available. Now, following a year of dedicated efforts, the collaborative Alibaba-NTU Singapore Joint Research Institute has implemented and launched a sandbox system based on online Computational Fluid Dynamics (CFD) for high-precision, real-time monitoring.
This article looks at the development of this original sandbox in detail, from its origins in research to testing and verification on off-the-shelf CFD software.
Compared to changes of power topology, the more intangible thermal and airflow organization changes involved in heating, ventilation, and air conditioning (HVAC) adjustments are relatively difficult to simulate in the real world. To this end, Alibaba’s IDC operation optimization team has conducted extensive research on how CFD can effectively achieve production standards for DC room simulations.
CFD offers a general DC room simulation solution for checking the thermodynamic effects of various changes in the DC room. CFD modeling calculates airflow distribution and temperatures inside a compartment by constructing a physical model and loading specific thermodynamic settings, such as heat source volumes and cold supply airflow volumes. It is a relatively mature technology that is widely used in aerodynamics- and thermodynamics-related fields. For data centers, CFD simulation applications range from the compartment level down to the chip level, but they are generally used only for pre-design and planning purposes because of their precision limitations.
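To make the relationship between the two inputs named above (heat source and cold supply airflow) concrete, here is a minimal sketch of a bulk energy balance for air passing through a heat source. It is not CFD and is not taken from the Alibaba sandbox; it only shows the first-order link that a full CFD model resolves in three dimensions. The constants are assumed round values for air at typical room conditions, and the function name is hypothetical.

```python
# Assumed properties of air at typical room conditions (illustrative values).
AIR_DENSITY_KG_PER_M3 = 1.2
AIR_SPECIFIC_HEAT_J_PER_KG_K = 1005.0


def air_temperature_rise(heat_load_w: float, airflow_m3_per_s: float) -> float:
    """Return the bulk temperature rise (in kelvin) of air crossing a heat source.

    Uses the steady-state energy balance: delta_T = P / (rho * c_p * Q).
    """
    if airflow_m3_per_s <= 0:
        raise ValueError("airflow must be positive")
    mass_flow_kg_per_s = AIR_DENSITY_KG_PER_M3 * airflow_m3_per_s
    return heat_load_w / (mass_flow_kg_per_s * AIR_SPECIFIC_HEAT_J_PER_KG_K)


if __name__ == "__main__":
    # A hypothetical 500 W server with 0.05 m^3/s of airflow.
    print(round(air_temperature_rise(500.0, 0.05), 2))
```

Halving the airflow doubles the temperature rise in this simple balance, which is why airflow settings matter as much as heat load when a room model is built.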
For the purposes of building a sandbox system that will serve as a digital testing environment, CFD presents a number of challenges.
First, commercially available CFD software obtains heat and airflow pattern data based on compartment simulation, with incomplete information at the design stage leading to low accuracy. As such, rough simulation results can diverge significantly from real scenarios. In temperature prediction, for example, simulations can miscalculate by three degrees Celsius or more, while the sandbox system demands higher precision.
Second, existing CFD software is designed for human-computer interaction rather than automated operation. It therefore cannot meet the requirements for automatic data acquisition and result export, which makes it difficult to integrate as a building block of the DC operation pipeline.
Lastly, critical heat source and air conditioning configuration data needed for simulation can only be obtained at runtime, and this data can only ensure accurate modeling if it has itself been verified. In turn, the modeling process can reveal differences between design and implementation, and can promote standardized storage of DC configuration and operational data.
Alibaba’s successful launch of a high-precision CFD-based sandbox with real-time monitoring reflects almost a year of research, development, and tests in collaboration with Professor Wen Yonggang of the School of Computer Science and Engineering at Nanyang Technological University, Singapore. This effort designated an Alibaba data center room as the pilot test-bed, with an emphasis on physical modeling, model calibration, and project implementation for the target data center.
The physical modeling process primarily defines the physical structures of DC room spaces, and thus provides the basis for simulation. To achieve the highest possible accuracy in reproduction, the project team implemented several modeling operations.
With structural modeling, the room structure, walls, vents, ceilings, and pipe layout for the center were set up; with IT deployment modeling, rows of cabinets and server slot positions were set up; with environmental modeling, the air conditioning equipment and sensors were configured in the model; with device modeling, the servers were installed into the model according to vendor type.
Model calibration aims to achieve a true reproduction of three key aspects of the DC room.
First, it needs to reproduce the amount of heat generated and the amount of cooling supplied in the DC room. Second, it needs to reproduce all causes of change in airflow in the DC room and to confirm that the resulting airflow model is consistent with actual conditions. Finally, it needs to ensure that the model’s predicted temperature for the room is consistent with actual conditions.
To ensure the model can reach industrial-grade precision, the project team carried out extensive data auditing and model adjustment work. Through these efforts, all relevant information and settings for the entire DC room were systematically reviewed and verified, and a complete set of standardized calibration documents was accumulated, laying a foundation for modeling and wider rollout. These calibration efforts fall into two categories: data auditing and model adjustment.
Data auditing can again be divided into two categories: server auditing (such as server position conflict cleaning and server power consumption calibration) and sensor auditing (including air supply temperature, ACU fan speed calibration, and the positions and readings of sensors in the hot and cold aisles).
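As an illustration of what server position conflict cleaning might involve, the following sketch flags pairs of servers whose recorded rack positions overlap within the same cabinet. The record fields (`cabinet`, `start_u`, `height_u`) and the function name are hypothetical; the article does not describe the actual data schema or the audit procedure used by the team.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class ServerRecord:
    """Hypothetical inventory record for one server."""
    server_id: str
    cabinet: str
    start_u: int   # lowest rack unit the server occupies
    height_u: int  # number of rack units the server occupies


def find_position_conflicts(records: Iterable[ServerRecord]) -> List[Tuple[str, str]]:
    """Return pairs of server IDs whose rack-unit ranges overlap in one cabinet."""
    by_cabinet: Dict[str, List[ServerRecord]] = defaultdict(list)
    for record in records:
        by_cabinet[record.cabinet].append(record)

    conflicts: List[Tuple[str, str]] = []
    for cabinet_records in by_cabinet.values():
        ordered = sorted(cabinet_records, key=lambda r: (r.start_u, r.server_id))
        for index, lower in enumerate(ordered):
            lower_end = lower.start_u + lower.height_u  # exclusive upper bound
            for upper in ordered[index + 1:]:
                if upper.start_u >= lower_end:
                    break  # sorted by start_u, so no later record can overlap
                conflicts.append((lower.server_id, upper.server_id))
    return conflicts


if __name__ == "__main__":
    inventory = [
        ServerRecord("A", "cab-1", start_u=1, height_u=2),
        ServerRecord("B", "cab-1", start_u=2, height_u=1),
        ServerRecord("C", "cab-1", start_u=3, height_u=1),
    ]
    print(find_position_conflicts(inventory))
```

Conflicting records like these would need to be resolved against the physical room before the servers are placed in the model, since two heat sources cannot occupy the same slot.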
Model adjustments include adjusting hot air leak settings (as hot air leaks will cause cold channel temperatures to rise), adjusting the simulation mode to meet the accuracy requirements, and adjusting server air volume settings.
The above calibration work enabled the final model to reach a level of precision never before achieved by a design-phase CFD model. This precision stems from the team’s accurate reproduction of hardware layouts, data auditing at the level of each operation, and fine-grained calculation of server flow rates.
The above flow chart illustrates the implementation of the CFD-based sandbox. After meeting the predefined target for model accuracy, the team took a further step toward CFD simulation automation. By accessing Alibaba’s self-developed data center infrastructure management (DCIM) system, the team obtained real-time server power consumption and air conditioning settings, and wrote this data into the CFD model through a data exchange layer so that the model simulates real DC conditions in real time. Once a simulation is completed, its results (including images and data sheets) are automatically exported and transferred to the DCIM system.
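The automation loop described here can be sketched roughly as follows: fetch a snapshot from DCIM, translate it into model inputs, run the simulation, and send the results back. Every name in this sketch (`Snapshot`, `to_boundary_conditions`, `run_sandbox_cycle`, and the key naming scheme) is an assumption made for illustration; the real DCIM interface, the CFD solver API, and the format of the data exchange layer are not described in the article.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Snapshot:
    """Hypothetical real-time readings pulled from DCIM."""
    server_power_w: Dict[str, float]
    acu_settings: Dict[str, Dict[str, float]]


@dataclass
class SimulationResult:
    """Hypothetical simulation output: predicted temperature per sensor."""
    sensor_temperatures_c: Dict[str, float]


def to_boundary_conditions(snapshot: Snapshot) -> Dict[str, float]:
    """Stand-in for the data exchange layer: flatten a snapshot into model inputs."""
    conditions: Dict[str, float] = {}
    for server_id, power_w in snapshot.server_power_w.items():
        conditions[f"heat_source/{server_id}/power_w"] = power_w
    for acu_id, settings in snapshot.acu_settings.items():
        for name, value in settings.items():
            conditions[f"acu/{acu_id}/{name}"] = value
    return conditions


def run_sandbox_cycle(
    fetch_snapshot: Callable[[], Snapshot],
    run_cfd: Callable[[Dict[str, float]], SimulationResult],
    publish_result: Callable[[SimulationResult], None],
) -> SimulationResult:
    """Run one fetch, simulate, and export cycle with injected dependencies."""
    snapshot = fetch_snapshot()
    result = run_cfd(to_boundary_conditions(snapshot))
    publish_result(result)  # e.g. hand the result back to DCIM
    return result
```

Passing the DCIM reader, the solver, and the exporter in as callables keeps the cycle testable without a live data center or a CFD license, which is one plausible way to structure such an integration.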
With these steps, the team realized the integration of the sandbox system into the DCIM system, such that the entire process could be automated. This laid a solid foundation for further application and generalization of the sandbox in the future.
In terms of model accuracy, the team used real monitoring data as the input for the sandbox system and computed the mean absolute error (MAE) between the predicted values and the real monitoring data for designated sensors in the cold and hot aisles. In tests over a sufficiently long time period, its accuracy met the requirements of Alibaba’s DC standards. In theory, the sandbox system could replace the sensors in the DC room to monitor operating states.
In terms of implementation, the model is now connected to the DCIM system and can automatically retrieve data from it and return results. The current simulation time is around one hour, and this is expected to drop to about 10 minutes. With this live CFD simulation system, Alibaba’s DCIM system has become the only data center cloud management system worldwide to provide high-precision, real-time CFD simulation modules.
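For reference, MAE is the average of the absolute differences between predicted and measured values. The sketch below computes it for one sensor's time series; the sample readings are invented for illustration and are not results from the Alibaba tests.

```python
from typing import Sequence


def mean_absolute_error(predicted: Sequence[float], measured: Sequence[float]) -> float:
    """Return the mean absolute error between two equally long series."""
    if len(predicted) != len(measured):
        raise ValueError("series must have the same length")
    if not predicted:
        raise ValueError("series must not be empty")
    total = sum(abs(p - m) for p, m in zip(predicted, measured))
    return total / len(predicted)


if __name__ == "__main__":
    # Invented cold-aisle readings in degrees Celsius.
    simulated = [22.0, 23.0, 24.0]
    observed = [22.5, 22.5, 24.0]
    print(mean_absolute_error(simulated, observed))
```

Because MAE is expressed in the same unit as the readings, it can be compared directly against a temperature tolerance such as the three-degree error mentioned earlier for commercial tools.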
The CFD-based sandbox system offers key advantages in DC visualization, fault discovery, design verification, design optimization, HVAC control recommendation, and service dispatch recommendation.
Regarding DC visualization, the model upgrades the display mode from a 2D digital visualization to a 3D digital and graphical visualization covering 3D layout, thermal status, and airflow patterns. With it, data center management experts gain a better assessment of the state of the DC room. With the sandbox system, operators can quickly identify heat and ventilation issues occurring in the room.
Regarding fault discovery, the simulation results are of centimeter-level granularity, enabling detection of fine-grained temperature rises (or local hot spots). This supports faster, stronger risk identification capabilities to prevent the emergence of temperature rises.
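A simple way to picture hot spot detection on a fine-grained result is a threshold scan over a temperature grid. The sketch below assumes the simulation output has been sampled onto a 2D horizontal slice with a known cell size; the grid layout, the threshold, and the function name are all hypothetical, and the sandbox's actual detection logic is not described in the article.

```python
from typing import List, Sequence, Tuple


def find_hot_spots(
    temperature_grid_c: Sequence[Sequence[float]],
    threshold_c: float,
    cell_size_cm: float = 1.0,
) -> List[Tuple[float, float, float]]:
    """Return (x_cm, y_cm, temperature_c) for every cell above the threshold.

    Rows of the grid are taken as the y direction and columns as the x direction.
    """
    hot_spots: List[Tuple[float, float, float]] = []
    for row_index, row in enumerate(temperature_grid_c):
        for col_index, temperature_c in enumerate(row):
            if temperature_c > threshold_c:
                hot_spots.append(
                    (col_index * cell_size_cm, row_index * cell_size_cm, temperature_c)
                )
    return hot_spots


if __name__ == "__main__":
    # Invented 2 x 2 slice; one cell exceeds a hypothetical 32 degree limit.
    grid = [[22.0, 23.0], [35.5, 24.0]]
    print(find_hot_spots(grid, threshold_c=32.0))
```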
For design verification purposes, the physical setup information needed for the modeling process is usually determined at the design phase. The error feedback information obtained during the modeling process can directly verify differences between design and implementation.
With design optimization, the model can guide changes and simulate the operation of different data center designs. It can thus be used as an a priori platform for design optimization and data center changes.
For HVAC control recommendations, the model allows users to apply different air conditioning settings to determine the best cooling control approach from a perspective of lowest energy consumption, thus achieving reliable, smart control of DC cooling.
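One way to read this is as a constrained search: simulate each candidate air conditioning setting, discard those that let temperatures exceed a safety limit, and keep the one with the lowest cooling energy. The sketch below follows that reading. The candidate fields, the inlet temperature limit, and the `simulate` callable are assumptions for illustration; the article does not state which settings are varied or how safety is defined.

```python
from typing import Callable, Iterable, NamedTuple, Optional, Tuple


class Candidate(NamedTuple):
    """Hypothetical air conditioning setting to evaluate."""
    supply_temp_c: float
    fan_speed_pct: float


class Outcome(NamedTuple):
    """Hypothetical summary of one simulation run."""
    max_inlet_temp_c: float
    cooling_power_kw: float


def recommend_setting(
    candidates: Iterable[Candidate],
    simulate: Callable[[Candidate], Outcome],
    inlet_limit_c: float,
) -> Optional[Tuple[Candidate, Outcome]]:
    """Return the safe candidate with the lowest cooling power, or None."""
    best: Optional[Tuple[Candidate, Outcome]] = None
    for candidate in candidates:
        outcome = simulate(candidate)
        if outcome.max_inlet_temp_c > inlet_limit_c:
            continue  # assumed safety constraint: reject settings that run too hot
        if best is None or outcome.cooling_power_kw < best[1].cooling_power_kw:
            best = (candidate, outcome)
    return best
```

In practice `simulate` would wrap a full sandbox run, so with simulation times of an hour or so the candidate list would need to stay short.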
Finally, regarding service dispatch recommendation, the model can supply the service dispatch system with the detailed temperature distribution in the DC room as a reference. Services can then be distributed so that temperatures across the room are more uniform, which reduces cooling energy consumption and improves server health.
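As a rough illustration of temperature-aware dispatch, the sketch below picks the coolest cabinet that still has free capacity for a new workload. This greedy rule, the input dictionaries, and the function name are assumptions made for the example; the article does not describe how the service dispatch system actually uses the temperature data.

```python
from typing import Dict, Optional


def pick_cabinet(
    inlet_temp_c: Dict[str, float],
    free_slots: Dict[str, int],
) -> Optional[str]:
    """Return the coolest cabinet with at least one free slot, or None.

    Ties on temperature are broken by cabinet name so the choice is deterministic.
    """
    eligible = [cabinet for cabinet in inlet_temp_c if free_slots.get(cabinet, 0) > 0]
    if not eligible:
        return None
    return min(eligible, key=lambda cabinet: (inlet_temp_c[cabinet], cabinet))


if __name__ == "__main__":
    # Invented predicted inlet temperatures and remaining capacity per cabinet.
    temperatures = {"A": 24.0, "B": 21.5, "C": 20.0}
    capacity = {"A": 2, "B": 1, "C": 0}
    print(pick_cabinet(temperatures, capacity))
```

Repeatedly steering new load toward cooler cabinets tends to even out the temperature field, which is the effect the paragraph above describes.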
One future direction for the model is to develop it into an industry-level application with the goal of making it an industry standard. The sandbox system can be applied to a greater number of DC rooms in the Alibaba data center to further validate DC design and optimize operational controls.
In the future, the team hopes to extend the sandbox system to the entire HVAC system, including elements such as refrigeration equipment located outside the DC room. This can realize simulation for the full cooling chain and achieve design verification and control optimization for the entire HVAC system.
In summary, the sandbox system will significantly promote automation levels in areas ranging from data center design to operation and maintenance, and will provide support for more stable and efficient data center management. This reflects an effort to move “from zero to one”, as it marks the first completion of a real-time, high-precision sandbox system to help data center operators validate whether a change will cause issues, thus reducing the chance of outages. Further, Alibaba can apply the sandbox system to building operation recommendations and even to achieving automatic DC reconfiguration, among other operations. In this way, Alibaba is approaching its goal of unattended DC management.
(The original article was written by Chen Li 陈丽.)