This article presents the collaboration of Alibaba, Alluxio, and Nanjing University in tackling the problem of Deep Learning model training in the cloud. Various performance bottlenecks are analyzed with detailed optimizations of each component in the architecture. This content was previously published on Alluxio's Engineering Blog, featuring Alibaba Cloud Container Service Team's case study (White Paper here). Our goal was to reduce the cost and complexity of data access for Deep Learning training in a hybrid environment, which resulted in over 40% reduction in training time and cost.
Background
Artificial neural networks are trained with increasingly massive amounts of data, driving innovative solutions to improve data processing. Distributed Deep Learning (DL) model training can take advantage of multiple technologies, such as:
The merger of these technologies as a combined solution is emerging as the industry trend for DL training.
Data access challenges of the conventional solutions
Data is often stored in private data centers rather than in the cloud for various reasons, such as security compliance, data sovereignty, or legacy infrastructure. A conventional solution for DL model training in the cloud typically involves synchronizing data between the private data storage and its cloud counterpart. As the size of data grows, the associated costs for maintaining this synchronization becomes overwhelming:
A hybrid solution that connects private data centers to cloud platforms is needed to mitigate these unnecessary costs. Because the solution separates the storage and compute architectures, it introduces performance issues since remote data access will be limited by network bandwidth. In order for DL models to be efficiently trained in a hybrid architecture, the speed bump of inefficient data access must be addressed.
We designed and implemented a model training architecture based on container and data orchestration technologies as shown below:
Core components of system architecture
Recently introduced features in Alluxio have further cemented its utility for machine learning frameworks. A POSIX filesystem interface based on FUSE provides an efficient data access method for existing AI training models. Helm charts, jointly developed by the Alluxio and Alibaba Cloud Container Service teams, greatly simplify the deployment of Alluxio in the Kubernetes ecosystem.
Initial Performance
In the performance evaluation using the ResNet-50 model, we found that upgrading from NVidia P100 to NVidia V100 resulted in a 3x improvement in the training speed of a single card. This computing performance improvement puts additional pressure on data storage access, which also poses new performance challenges for Alluxio's I/O.
This bottleneck can be quantified by comparing the performance of Alluxio with a synthetic data run of the same training computation. The synthetic run represents the theoretical upper limit of the training performance with no I/O overhead since the data utilized by the training program is self-generated. The following figure measures the image processing rate of the two systems against the number of GPUs utilized.
Initially, both systems perform similarly but as the number of GPUs increases, Alluxio noticeably lags behind. At 8 GPUs, Alluxio is processing at 30% of the synthetic data rate.
Analysis and Performance Optimization
To investigate what factors are affecting performance, we analyzed Alluxio's technology stack as shown below, and identified several major areas of performance issues. For a complete list of optimizations we applied, please refer to the full-length whitepaper.
1. Filesystem RPC overhead
Solution:
alluxio.user.metadata.cache.enabled
to true
. The cache size and expiration time, determined by alluxio.user.metadata.cache.max.size
and alluxio.user.metadata.cache.expiration.time
respectively should be set to keep the cached information relevant throughout the entire workload. In addition we extend the heartbeat interval to reduce the frequency of worker updates by setting the property alluxio.user.worker.list.refresh.interval
to be 2 minutes or longer.
2. Data caching and eviction strategies
Solution:
alluxio.user.ufs.block.read.location.policy
to alluxio.client.block.policy.LocalFirstAvoidEvictionPolicy
, which will avoid evicting data to reduce the load on each individual worker, and setting alluxio.user.file.passive.cache.enabled
to false to avoid caching additional copies.
3. Alluxio and FUSE configuration for handling numerous concurrent read requests
Solution:
max_idle_threads
as a configuration value, we patched its code to be able to configure a much larger value to better support more concurrent requests.
4. Impact of running Alluxio in containers on its thread pool
Solution:
-XX:ActiveProcessorCount
.
Results
After optimizing Alluxio, the single-machine eight-card training performance of ResNet50 is improved by 236.1%. The performance gains are reflected in the four-machine eight-card scenario with a performance loss of 3.29% when compared against the synthetic data measurement. Compared to saving data to an SSD cloud disk in a four-machine eight-card scenario, Alluxio's performance is better by 70.1%.
The total training time of the workload takes 65 minutes when using Alluxio on four machines with eight cards each, which is very close to the synthetic data scenario that takes 63 minutes.
In this article, we present the challenges of using Alluxio in a high-performance distributed deep learning model training and dive into our work in optimizing Alluxio. Various optimizations were made to improve the experience of AlluxioFUSE performance in high concurrency reading workload. These improvements enabled us to achieve a near-optimal performance when executing a distributed model training scheme in a four-machine eight-card scenario with ResNet50.
For future work, Alluxio is working on enabling page cache support and FUSE layer stability. Alibaba Cloud Container Service team is collaborating with both the Alluxio Open Source Community and Nanjing University through Dr. Haipeng Dai and Dr. Rong Gu. We believe that through the joint effort of both academia and the open-source community, we can gradually reduce the cost and complexity of data access for Deep Learning training when compute and storage are separated to further help advance AI model training on the cloud.
Special thanks to Dr. Bin Fan, Lu Qiu, Calvin Jia, and Cheng Chang from the Alluxio team for their substantial contribution to the design and optimization of the entire system. Their work significantly improved the metadata cache system and advanced Alluxio’s potential in AI and Deep Learning.
Authors
Yang Che, Sr. Technical Expert at Alibaba Cloud. Yang is heavily involved in the development of Kubernetes and container related products, specifically in building machine learning platforms with cloud-native technology. He is the main author and core maintainer of GPU shared scheduling.
Rong Gu, Associate Researcher at Nanjing University and core maintainer of the Alluxio Open Source project. Rong received a Ph.D. degree from Nanjing University in 2016 with a focus on big data processing. Prior to that, Rong worked on R&D of big data systems at Microsoft Asia Research Institute, Intel, and Baidu.