PolyThrottle: Energy-efficient Neural Network Inference on Edge Devices: Abstract & Introduction

by Bayesian Inference (@bayesianinference), April 2nd, 2024

This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.

Authors:

(1) Minghao Yan, University of Wisconsin-Madison;

(2) Hongyi Wang, Carnegie Mellon University;

(3) Shivaram Venkataraman, [email protected].

- Abstract & Introduction
- Motivation
- Opportunities
- Architecture Overview
- Problem Formulation: Two-Phase Tuning
- Modeling Workload Interference
- Experiments
- Conclusion & References
- A. Hardware Details
- B. Experimental Results
- C. Arithmetic Intensity
- D. Predictor Analysis

As neural networks (NN) are deployed across diverse sectors, their energy demand correspondingly grows. While several prior works have focused on reducing energy consumption during training, the continuous operation of ML-powered systems leads to significant energy use during inference. This paper investigates how the configuration of on-device hardware (elements such as GPU, memory, and CPU frequency, often neglected in prior studies) affects energy consumption for NN inference with regular fine-tuning. We propose PolyThrottle, a solution that optimizes configurations across individual hardware components using Constrained Bayesian Optimization in an energy-conserving manner. Our empirical evaluation uncovers novel facets of the energy-performance equilibrium, showing that we can save up to 36 percent of energy for popular models. We also validate that PolyThrottle can quickly converge towards near-optimal settings while satisfying application constraints.

The rapid advancements in neural networks and their deployment across various industries have revolutionized multiple aspects of our lives. However, this sophisticated technology carries a drawback: high energy consumption, which poses serious sustainability and environmental challenges (Anderson et al., 2022; Gupta et al., 2022; Cao et al., 2020; Anthony et al., 2020). Emerging applications such as autonomous driving systems and smart home assistants require real-time decision-making capabilities (jet), and as we integrate NNs into an ever-growing number of devices, their collective energy footprint poses a considerable burden to our environment (Wu et al., 2022; Schwartz et al., 2020; Lacoste et al., 2019). Moreover, considering that many devices operate on battery power, curbing energy consumption not only alleviates environmental concerns but also prolongs battery life, making low-energy NN models highly desirable for numerous use cases.

In prior literature, strategies for reducing energy consumption revolve around designing more efficient neural network architectures (Howard et al., 2017; Tan & Le, 2019), quantization (Kim et al., 2021; Banner et al., 2018; Courbariaux et al., 2015; 2014; Gholami et al., 2021), or optimizing maximum GPU frequency (You et al., 2022; Gu et al., 2023). From our experiments, we make new observations about the tradeoffs between energy consumption, inference latency, and various other hardware configurations. Memory frequency, for example, emerges as a significant contributor to energy consumption (as shown in Figure 3), beyond the commonly investigated relationship between maximum GPU compute frequency and energy consumption. Table 2 shows that even with optimal maximum GPU frequency, we can save up to 25% energy by further tuning memory frequency. In addition, minimum GPU frequency also proves to be of importance in certain cases, as shown in Figure 4 and Table 3.

We also observe that a simple linear relationship falls short of capturing the tradeoff between energy consumption, neural network inference latency, and hardware configurations. The complexity of this tradeoff is illustrated by the Pareto Frontier in Figure 2. This nuanced interplay between energy consumption and latency poses a challenging question: How can we find a near-optimal configuration that closely aligns with this boundary?
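To make the notion of a Pareto frontier concrete, the following minimal sketch filters a set of (latency, energy) measurements down to the non-dominated ones. The function name, the tuple layout, and the sample numbers are assumptions made for illustration; they are not taken from the paper or its figures.

```python
from typing import List, Tuple

# (latency_ms, energy_mj); lower is better on both axes.
Point = Tuple[float, float]


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return the points that no other point beats on both latency and energy."""
    frontier: List[Point] = []
    best_energy = float("inf")
    # Sweep in order of increasing latency (ties broken by energy) and keep
    # a point only if it strictly improves on the best energy seen so far.
    for latency, energy in sorted(points):
        if energy < best_energy:
            frontier.append((latency, energy))
            best_energy = energy
    return frontier


# Made-up measurements, for illustration only.
measurements = [(10.0, 50.0), (12.0, 40.0), (12.0, 45.0), (15.0, 42.0), (20.0, 30.0)]
print(pareto_frontier(measurements))
```

Each hardware configuration contributes one point; the configurations on the frontier are the only ones worth considering for any given latency budget.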

Designing an efficient framework to answer the above question is challenging due to the large configuration space, the need to re-tune for each model and hardware platform, and frequent fine-tuning operations. A naive approach, such as grid search, is inefficient and can take hours to find the optimal solution for a given model and desired batch size on a given hardware. The uncertainty in inference latency, especially at smaller batch sizes (Gujarati et al., 2020), further exacerbates the challenge. Furthermore, given that distinct hardware platforms and NN models display unique energy consumption patterns (Section 3), relying on a universally applicable pre-computed optimal configuration is not feasible. Every deployed device must be equipped to quickly identify its best configuration tailored to its specific workload. Finally, in production environments, daily fine-tuning is often necessary to adapt to a dynamic external environment and integrate new data (Cai et al., 2019; 2020). This demands a mechanism that can quickly adjust configurations to complete fine-tuning requests in time while ensuring the online inference workloads meet Service Level Objectives (SLOs).
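The cost of exhaustive grid search grows with the product of the number of settings per knob, which a back-of-the-envelope sketch makes visible. The level counts and the per-trial measurement time below are placeholders chosen for illustration, not values reported in the paper or specifications of any particular device.

```python
from math import prod
from typing import Dict


def grid_size(levels: Dict[str, int]) -> int:
    """Number of configurations in a full grid over all knobs."""
    return prod(levels.values())


def grid_search_hours(levels: Dict[str, int], seconds_per_trial: float) -> float:
    """Wall-clock hours needed to measure every configuration once."""
    return grid_size(levels) * seconds_per_trial / 3600.0


# Placeholder level counts and trial time; NOT measured values from the paper.
knob_levels = {"gpu_freq": 13, "mem_freq": 4, "cpu_freq": 12, "batch_size": 6}
print(grid_size(knob_levels), "configurations")
print(round(grid_search_hours(knob_levels, seconds_per_trial=5.0), 1), "hours")
```

Under these assumed numbers, a single model on a single device already requires thousands of trials, and the search would have to be repeated for every new model or platform.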

*Figure 1.* Overall workflow of PolyThrottle. The optimizer first identifies the optimal hardware configuration for a given model. When new data arrives, the inference server handles the inference requests. Upon receiving a fine-tuning request, our performance predictor estimates whether time-sharing inference and fine-tuning workloads would result in SLO violations. Then the predictor searches for feasible adjustments to meet the SLO constraints. If such adjustments are identified, the system implements the changes and schedules fine-tuning requests until completion.
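The decision step described in the Figure 1 caption can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the system's actual scheduler: the function `plan_fine_tuning`, the `Config` dictionary layout, and the toy predictor are all hypothetical stand-ins for the paper's performance predictor.

```python
from typing import Callable, Dict, Iterable, Optional

Config = Dict[str, int]  # hypothetical layout, e.g. {"gpu_mhz": 900}


def plan_fine_tuning(
    current: Config,
    candidates: Iterable[Config],
    predict_latency_ms: Callable[[Config], float],
    slo_ms: float,
) -> Optional[Config]:
    """Pick a configuration under which time-sharing inference with
    fine-tuning is predicted to meet the latency SLO, or None if none does."""
    # Keep the current configuration if no SLO violation is predicted.
    if predict_latency_ms(current) <= slo_ms:
        return current
    # Otherwise search the candidate adjustments for a feasible one.
    for cfg in candidates:
        if predict_latency_ms(cfg) <= slo_ms:
            return cfg
    return None  # no feasible adjustment: do not schedule fine-tuning now


def toy_predictor(cfg: Config) -> float:
    """Made-up model: shared-GPU latency shrinks as GPU frequency rises."""
    return 1.5 * 12000.0 / cfg["gpu_mhz"]


choice = plan_fine_tuning(
    {"gpu_mhz": 600}, [{"gpu_mhz": 900}, {"gpu_mhz": 1200}], toy_predictor, slo_ms=25.0
)
print(choice)
```

Returning `None` corresponds to the case where no adjustment is identified and fine-tuning is deferred; a real predictor would be fit to measurements rather than a closed-form toy.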

In this paper, we explore the interplay between inference latency, energy consumption, and hardware frequency and propose PolyThrottle as our solution. PolyThrottle takes a holistic approach, optimizing various hardware components and batch sizes concurrently to identify near-optimal hardware configurations under a predefined latency SLO. PolyThrottle complements existing efforts to reduce inference latency, including pruning, quantization, and knowledge distillation. We use Constrained Bayesian Optimization with GPU, memory, CPU frequencies, and batch size as features, and latency SLO as a constraint to design an efficient framework that automatically adjusts configurations, enabling convergence towards near-optimal settings. Furthermore, PolyThrottle uses a performance prediction model to schedule fine-tuning operations without disrupting ongoing online inference requests. We integrate PolyThrottle into Nvidia Triton on Jetson TX2 and Orin and evaluate on state-of-the-art CV and NLP models, including EfficientNet and BERT (Tan & Le, 2019; Devlin et al., 2018).
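One way to read the optimization described above is as a constrained minimization over the configuration vector. The notation below is an assumption introduced here for illustration (the paper's own formulation may use different symbols): $x$ is a configuration, $E(x)$ is the energy consumed per query, $L(x)$ is the inference latency, and $L_{\mathrm{SLO}}$ is the latency target.

```latex
\min_{x \in \mathcal{X}} \; E(x)
\quad \text{subject to} \quad L(x) \le L_{\mathrm{SLO}},
\qquad x = \left(f_{\mathrm{GPU}},\, f_{\mathrm{mem}},\, f_{\mathrm{CPU}},\, b\right)
```

Here $f_{\mathrm{GPU}}$, $f_{\mathrm{mem}}$, and $f_{\mathrm{CPU}}$ denote the GPU, memory, and CPU frequencies, $b$ is the batch size, and $\mathcal{X}$ is the set of settings the device supports. Both $E$ and $L$ are only observable by measurement, which is what makes a sample-efficient method such as Bayesian Optimization a natural fit.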

To summarize, our key contributions include:

1. We examine the influence of hardware components beyond GPUs on energy consumption, delineate new tradeoffs between energy consumption and inference performance, and reveal new possibilities for optimization.

2. We construct an adaptive framework that efficiently finds energy-optimal hardware configurations. To accomplish this, we employ Constrained Bayesian Optimization.

3. We develop a performance model to capture the interaction between inference and fine-tuning processes. We use this model to schedule fine-tuning requests and carry out real-time modifications to meet inference SLOs.

4. We implement and evaluate PolyThrottle on a state-of-the-art inference server on Jetson TX2 and Orin. With minimal overheads, PolyThrottle reduces energy consumption per query by up to 36%.
