PolyThrottle: Energy-efficient Neural Network Inference on Edge Devices: Opportunities


by Bayesian Inference, April 2nd, 2024

Too Long; Didn't Read

This paper investigates how the configuration of on-device hardware affects energy consumption for neural network inference with regular fine-tuning.

This paper is available on arxiv under CC BY-NC-ND 4.0 DEED license.


(1) Minghao Yan, University of Wisconsin-Madison;

(2) Hongyi Wang, Carnegie Mellon University;

(3) Shivaram Venkataraman, [email protected].


In this section, we perform empirical experiments to uncover new opportunities for optimizing energy use in NN inference. As discussed in Section 2, prior work did not study how memory frequency, minimum GPU frequency, and CPU frequency affect energy consumption. This gap is partly due to hardware constraints: specialized power rails must be built into the device during manufacturing to enable accurate measurement of the energy consumed by each component. We leverage two Jetson developer kits, TX2 and Orin, which offer native support for component-wise energy measurement and frequency tuning, to study how these frequencies impact inference latency and energy consumption in modern deep learning workloads. We find that the default frequencies are much higher than optimal, and that throttling all of these frequency knobs reduces energy consumption with minimal impact on inference latency.

Figure 3 illustrates the energy optimization landscape when varying GPU and memory frequencies, without imposing any latency SLO constraint. The plot reveals that, absent other constraints, the energy optimization landscape generally exhibits a bowl shape. However, this shape varies with the model, device, and other hyperparameters such as batch size (see Appendix B for more results). Next, we examine how each hardware component affects inference energy consumption.
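The search over this bowl-shaped landscape can be sketched as a simple grid search. The frequency lists and the `measure_energy_per_query` callback below are placeholders, not values from the paper; on a real Jetson the measurement would come from the device's power rails:

```python
from itertools import product

def find_min_energy_config(gpu_freqs, mem_freqs, measure_energy_per_query):
    """Grid-search all (GPU, memory) frequency pairs and return the
    configuration with the lowest per-query energy cost."""
    best = None
    for gpu_f, mem_f in product(gpu_freqs, mem_freqs):
        energy = measure_energy_per_query(gpu_f, mem_f)  # device-specific
        if best is None or energy < best[0]:
            best = (energy, gpu_f, mem_f)
    return best  # (energy, gpu_freq, mem_freq)

# Illustrative stand-in for a real measurement: a bowl-shaped landscape
# whose minimum lies below the maximum frequencies, mimicking Figure 3.
def mock_energy(gpu_f, mem_f):
    return (gpu_f - 900) ** 2 + (mem_f - 1200) ** 2 + 50

gpu_freqs = [300, 600, 900, 1200, 1300]
mem_freqs = [800, 1200, 1600, 1866]
print(find_min_energy_config(gpu_freqs, mem_freqs, mock_energy))
# → (50, 900, 1200): neither knob at its maximum
```

With the mock landscape, the optimum sits at neither knob's maximum, which is the qualitative finding above: default (maximum) frequencies are higher than optimal.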

Setup: Experiments in this section are performed with 16-bit floating-point precision, as it has been demonstrated to have minimal impact on model accuracy in practice. We use Bert and EfficientNet models and vary the EfficientNet model size between B0, B4, and B7 (Table 4).

Figure 3. Per-query energy cost as we vary the GPU frequency and memory frequency for EfficientNet B4 on Jetson TX2, with batch size fixed at 1.

Memory frequency experiment: For each model, we fix the GPU frequency at the optimal frequency determined by a grid search over all possible frequency configurations. We then examine the tradeoff between inference latency and energy consumption as we progressively throttle the memory frequency. The range of available memory frequencies can be found in Table 1.

Results: Table 2 reveals that memory frequency plays a vital role in reducing energy consumption. The savings from memory frequency tuning are consistent across models on both hardware platforms, ranging from approximately 12% to 25%. This indicates that the default memory frequency is higher than optimal for modern deep learning workloads. For heavy workloads such as Bert, memory tuning can account for the majority of the energy reduction, which can be partly attributed to the memory-bound nature of Transformer-based models (Ivanov et al., 2021). Our results demonstrate that systems aiming to optimize energy use in neural network inference need to take memory frequency into account.

CPU frequency experiment: CPUs are used only for data preprocessing. Thus, we first measure the time spent in the data-processing stage of the inference pipeline. Next, we measure the energy saved by throttling the CPU frequency and assess the resulting inference latency slowdown. The preprocessing we perform is standard in almost all image-processing and object-detection pipelines: we read the raw image file, convert it to RGB, resize it, and reorient it to the desired input resolution and data layout.
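The preprocessing steps above can be sketched in a few lines of NumPy. The 224x224 target resolution, the nearest-neighbor resize, and the CHW output layout are illustrative assumptions, not details from the paper:

```python
import numpy as np

def preprocess(raw_image, out_h=224, out_w=224):
    """Sketch of a standard image-preprocessing stage: ensure 3-channel
    RGB, nearest-neighbor resize, then reorder HWC -> CHW (a common
    input layout for inference). Resolution and resize method are
    assumptions for illustration."""
    img = np.asarray(raw_image)
    if img.ndim == 2:                # grayscale -> 3-channel RGB
        img = np.stack([img] * 3, axis=-1)
    h, w = img.shape[:2]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    img = img[rows][:, cols]         # nearest-neighbor resize
    return img.transpose(2, 0, 1)    # HWC -> CHW

frame = np.random.randint(0, 256, (480, 640), dtype=np.uint8)  # grayscale frame
print(preprocess(frame).shape)  # → (3, 224, 224)
```

Because these operations are identical regardless of which EfficientNet variant runs afterward, preprocessing time stays constant as the model grows, which drives the result below.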

Results: The preprocessing time across different EfficientNet models remains constant since the operations performed are identical. As a result, the relative impact of CPU tuning on overall energy consumption depends on the ratio between preprocessing time and inference time. As the model size and inference duration increase, the influence of CPU tuning on overall energy consumption decreases. We observe that on both Jetson TX2 and Orin, CPU tuning can decrease preprocessing energy consumption by approximately 30%. Depending on the model, quantization level, and batch size, this results in up to a 6% reduction in overall energy consumption.
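The relationship between the ~30% preprocessing saving and the overall reduction follows directly from the preprocessing share of total time. A minimal sketch, assuming energy is roughly proportional to time spent in each stage (the stage times below are illustrative, not measured values from the paper):

```python
def overall_saving(preprocess_time, inference_time, preprocess_energy_cut=0.30):
    """Overall energy reduction from CPU throttling, assuming stage energy
    scales with stage time. The ~30% cut to preprocessing energy is the
    figure reported above; the times passed in are hypothetical."""
    share = preprocess_time / (preprocess_time + inference_time)
    return share * preprocess_energy_cut

# Preprocessing share shrinks as the model grows, so the overall benefit drops:
print(round(overall_saving(2.0, 8.0), 3))   # → 0.06  (6%, small model)
print(round(overall_saving(2.0, 48.0), 3))  # → 0.012 (1.2%, large model)
```

This reproduces the trend above: a fixed preprocessing cost yields up to ~6% overall savings for small models and progressively less as inference dominates.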

Minimum GPU frequency experiment: We maintain the default hardware configuration and adjust only the minimum GPU frequency on Jetson Orin. Increasing the minimum GPU frequency forces the GPU DVFS mechanism to operate within a narrower range. We scale the model from EfficientNet B0 to EfficientNet B7 to illustrate the effect of the minimum GPU frequency on inference latency.

Results: Table 3 indicates that tuning the minimum GPU frequency can significantly reduce energy consumption when the workload cannot fully utilize the hardware's computational power. Notably, both energy consumption and inference latency are reduced by forcing the GPU to operate at a higher frequency. This differs from the tradeoff observed in our other experiments, where we exchange higher inference latency for lower energy consumption. Tuning the minimum GPU frequency can nearly halve energy consumption for small models. As computational power becomes saturated with increasing model size, the return on tuning the minimum GPU frequency diminishes.

Figure 4 shows the per-query energy cost as we vary the minimum and maximum GPU frequency. It shows that increasing the minimum GPU frequency above the default leads to lower energy cost and lower inference latency.
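Constraining the DVFS range in this way typically amounts to writing the bounds to the GPU's frequency-control nodes. A minimal sketch, assuming a Linux devfreq-style interface; the node names, path, and frequency values are assumptions for illustration (actual Jetson Orin paths depend on the device tree), and the base path is parameterized so the function can be exercised against any directory:

```python
import os

def set_gpu_freq_range(min_hz, max_hz, base="/sys/class/devfreq/gpu"):
    """Constrain the GPU DVFS range by writing its min/max frequency
    nodes. Path and node names are hypothetical, modeled on the Linux
    devfreq sysfs interface; real devices may differ and require root."""
    if min_hz > max_hz:
        raise ValueError("min frequency must not exceed max frequency")
    with open(os.path.join(base, "min_freq"), "w") as f:
        f.write(str(min_hz))
    with open(os.path.join(base, "max_freq"), "w") as f:
        f.write(str(max_hz))

# Raising min_freq while keeping max_freq narrows the DVFS range upward,
# mirroring the minimum-GPU-frequency experiment above.
```

Raising only the lower bound leaves the governor free to scale within the narrower range, which is what produces the simultaneous latency and energy reduction reported for small models.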