Authors:
(1) Iason Ofeidis, Department of Electrical Engineering, and Yale Institute for Network Science, Yale University, New Haven {Equal contribution};
(2) Diego Kiedanski, Department of Electrical Engineering, and Yale Institute for Network Science, Yale University, New Haven {Equal contribution};
(3) Leandros Tassiulas, Department of Electrical Engineering, and Yale Institute for Network Science, Yale University, New Haven;
(4) Levon Ghukasyan, Activeloop, Mountain View, CA, USA.
This section describes several efforts in the community to benchmark deep learning libraries, models, and frameworks.
A large body of work exists on benchmarking deep learning tools and methods. MLPerf (Mattson et al., 2020) is arguably the most popular ML benchmarking project for modern ML workloads; it targets both training and inference and spans a variety of AI tasks. The authors use as their objective metric the training time required to reach a given accuracy level. This metric demands substantial computational resources and is not well suited for testing dataloader parameters. DeepBench (Baidu-Research, 2020) is an open-source project from Baidu Research focused on kernel-level operations within the deep learning stack; it benchmarks the performance of individual operations (e.g., matrix multiplication) as implemented in libraries and executed directly on the underlying hardware. Similarly, AI Matrix (Zhang et al., 2019) uses microbenchmarks to cover basic operators, measuring performance for fully connected and other common layers, and matches the characteristics of real workloads by offering synthetic benchmarks.
Comparison of frameworks: This part covers efforts to benchmark and compare different deep learning frameworks, such as PyTorch and TensorFlow.
In Deep500 (Ben-Nun et al., 2019), the authors provide a modular software framework for measuring DL-training performance; while customizable, it lacks hyperparameter benchmarking and does not offer an easy way to add and experiment with novel libraries and workflows. AIBench (Gao et al., 2020) and DAWNBench (Coleman et al., 2019) are both end-to-end benchmarks, with the latter being the first multi-entrant benchmark competition to measure the end-to-end performance of deep-learning systems. As with MLPerf, neither examines the effect of alternative loading libraries in its workflow. In (Wu et al., 2019), the authors present a systematic analysis of the CPU and memory usage patterns of different parallel computing libraries and batch sizes and their impact on accuracy and training efficiency. This analysis is close to our work; however, it does not provide an open-source resource for interacting with and benchmarking new libraries.
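To make concrete what is meant by dataloader parameters in the discussion above, the sketch below (not taken from any of the cited benchmarks, and assuming only the standard torch.utils.data API) times one pass over a synthetic dataset with PyTorch's built-in DataLoader for several worker counts and batch sizes; a third-party loading library could be substituted at the loader-construction step.

```python
import time

import torch
from torch.utils.data import DataLoader, TensorDataset


def time_one_epoch(loader: DataLoader) -> float:
    """Iterate once over the loader and return the elapsed wall-clock time."""
    start = time.perf_counter()
    for _batch, _labels in loader:
        pass  # a real benchmark would run a forward/backward pass here
    return time.perf_counter() - start


if __name__ == "__main__":
    # Synthetic stand-in for an image/label dataset (10,000 samples of 3x32x32).
    dataset = TensorDataset(
        torch.randn(10_000, 3, 32, 32),
        torch.randint(0, 10, (10_000,)),
    )

    # Sweep two of the dataloader parameters discussed in the text.
    for num_workers in (0, 2, 4):
        for batch_size in (32, 128):
            loader = DataLoader(
                dataset,
                batch_size=batch_size,
                shuffle=True,
                num_workers=num_workers,
            )
            print(
                f"workers={num_workers:>2} batch_size={batch_size:>4}: "
                f"{time_one_epoch(loader):.2f}s per epoch"
            )
```

The parameter values here are illustrative only; the point is that loading throughput can be measured in isolation from model accuracy, which system-level suites such as MLPerf do not do.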
In (Shi et al., 2016), the authors compare deep learning frameworks based on the performance of different neural networks (e.g., Fully Connected, Convolutional, and Recurrent Neural Networks). dPRO (Hu et al., 2022) focuses on distributed (multi-GPU) training benchmarks by utilizing a profiler that collects runtime traces of distributed DNN training across multiple frameworks. DLBench (Heterogeneous Computing Lab at HKBU, 2017) is a benchmark framework for measuring different deep learning tools, such as Caffe, TensorFlow, and MXNet. In (Liu et al., 2018), the authors study the impact of each framework's default configurations on model performance (time and accuracy), demonstrating the complex interactions of DNN parameters and hyperparameters with dataset-specific characteristics. Yet, the experiments include only the default configurations of each framework and lack any analysis of non-default settings. In (Wu et al., 2018), the authors test default configurations of frameworks and attempt to find the optimal ones for each dataset; they also examine the data loading process but do not evaluate third-party libraries. While the works cited in this paragraph bear numerous similarities to ours, they differ in one significant respect: none of them analyzes or benchmarks PyTorch, which, as stated in the introduction, is currently one of the most popular deep learning frameworks in both industry and academia, or the ecosystem of data loading libraries described in this paper.
Comparison of different DNN architectures and hardware: ParaDNN (Wang et al., 2020) generates parameterized end-to-end models to run on target platforms, for example varying the batch size to challenge the bounds of the underlying hardware, but it focuses on comparing specialized platforms (TPU v2/v3) and device architectures (TPU, GPU, CPU). Related to ParaDNN is the work of (Bianco et al., 2018), which provides a comprehensive tool for selecting the architecture that best meets the resource constraints of practical deployments and applications, based on an analysis of hardware systems with diverse computational resources. However, it concentrates more on the design of deep learning models than on the deep learning frameworks they are implemented in. While Fathom (Adolf et al., 2016) and TBD Suite (Zhu et al., 2018) both focus on evaluating full model architectures across a broad variety of tasks and diverse workloads, they are limited to these and lack benchmarks for state-of-the-art training innovations.
Other Devices: AI Benchmark (Ignatov et al., 2018) is arguably the first mobile-inference benchmark suite. However, its results cover only Android smartphones, measure only latency, and provide a summary score without specifying quality targets. (Hadidi et al., 2019) investigates in-the-edge inference of DNNs from the perspectives of execution time, energy consumption, and temperature. (Tao et al., 2018) covers configurations with diverse hardware behaviors, such as branch prediction rates and data reuse distances, and evaluates the accuracy, performance, and energy of intelligence processors and hardware platforms. Both of these works focus on a different range of devices, such as edge devices and intelligence processors, which are outside the scope of this work.
This paper is available on arXiv under a CC 4.0 license.