
PolyThrottle: Energy-efficient Neural Network Inference on Edge Devices: Hardware Details

by Bayesian Inference, April 2nd, 2024

Too Long; Didn't Read

This paper investigates how the configuration of on-device hardware affects energy consumption for neural network inference with regular fine-tuning.

This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.

Authors:

(1) Minghao Yan, University of Wisconsin-Madison;

(2) Hongyi Wang, Carnegie Mellon University;

(3) Shivaram Venkataraman, [email protected].

A HARDWARE DETAILS

A.1 Jetson platform details

The Jetson TX2 Developer Kit features a 256-core NVIDIA Pascal GPU, a dual-core NVIDIA Denver 2 64-bit CPU, a quad-core ARM Cortex-A57 MPCore CPU, and 8GB of 128-bit LPDDR4 memory with 59.7 GB/s bandwidth. The kit’s maximum power consumption is 15W. The Jetson Orin Developer Kit includes a 2,048-core NVIDIA Ampere GPU with 64 Tensor Cores and a 12-core Arm CPU. It comes with 32GB of 256-bit LPDDR5 memory with 204.8 GB/s bandwidth and has a maximum power consumption of 60W.

A.2 Power consumption measurement

The NVIDIA Jetson TX2 Developer Kit allows separate measurement of GPU, CPU, DDR, and total energy consumption, while the Jetson Orin uses the built-in tegrastats module to measure power usage across hardware components. Due to limitations of the power rail design, GPU power usage on the Jetson Orin can only be measured together with SoC power usage.
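
On the TX2 side, the per-rail readings come from INA3221 power monitors exposed through sysfs. The sketch below shows one way to poll them; the device paths and channel-to-rail mapping are assumptions that vary across carrier boards and L4T releases, so they should be verified against the rail_name_* files on the target device.

```python
# Sketch: sampling Jetson TX2 power rails through the INA3221 sysfs interface.
# ASSUMPTION: the paths and channel-to-rail mapping below are illustrative;
# verify them against the rail_name_<n> files on your board / L4T release.
import time

RAILS = {
    "VDD_IN":  "/sys/bus/i2c/drivers/ina3221x/0-0041/iio:device1/in_power0_input",
    "VDD_GPU": "/sys/bus/i2c/drivers/ina3221x/0-0040/iio:device0/in_power0_input",
    "VDD_DDR": "/sys/bus/i2c/drivers/ina3221x/0-0041/iio:device1/in_power2_input",
}

def read_power_mw(path: str) -> int:
    """Read one rail's instantaneous power draw in milliwatts."""
    with open(path) as f:
        return int(f.read().strip())

def sample_rails(interval_s: float = 0.05, duration_s: float = 5.0) -> dict:
    """Poll every rail at a fixed interval and return per-rail power traces."""
    traces = {name: [] for name in RAILS}
    end = time.monotonic() + duration_s
    while time.monotonic() < end:
        for name, path in RAILS.items():
            traces[name].append(read_power_mw(path))
        time.sleep(interval_s)
    return traces
```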


On the Jetson TX2, we measure power usage by querying the total power input, average the peak power consumption to obtain the power draw during inference, and compute the energy cost of each inference request by multiplying this power by the inference time. On the Jetson Orin, we leverage the existing tegrastats tool, querying it repeatedly at a fixed interval (50ms). We sum each component’s power consumption to obtain the overall power draw, then multiply power by inference time to obtain the energy cost for each inference request. To obtain a steady reading, we send 1,000 inference requests for every hardware configuration and model that we test.
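
The following sketch illustrates the Orin-side loop: spawn tegrastats at a 50 ms interval, sum the per-rail readings on each output line, and multiply average power by per-request latency to get per-request energy. The regex is an assumption matching the "RAIL 1234mW/5678mW" fields that recent JetPack releases print; the exact output format differs across versions.

```python
# Sketch of the Orin measurement loop described above. ASSUMPTION: the regex
# matches tegrastats power fields such as "VDD_GPU_SOC 3169mW/3169mW"; the
# exact output format varies across JetPack releases.
import re
import subprocess

POWER_RE = re.compile(r"(\w+)\s+(\d+)mW/(\d+)mW")

def total_power_trace_mw(duration_s: float, interval_ms: int = 50) -> list[int]:
    """Collect one total-power sample (mW) per tegrastats output line."""
    proc = subprocess.Popen(
        ["tegrastats", "--interval", str(interval_ms)],
        stdout=subprocess.PIPE, text=True,
    )
    samples = []
    try:
        for _ in range(int(duration_s * 1000 / interval_ms)):
            line = proc.stdout.readline()
            # Sum the instantaneous (first) reading of every rail on the line.
            samples.append(sum(int(m.group(2)) for m in POWER_RE.finditer(line)))
    finally:
        proc.terminate()
    return samples

def energy_per_request_mj(samples_mw: list[int], latency_s: float) -> float:
    """Energy per request = average power (mW) x inference time (s) -> mJ."""
    return sum(samples_mw) / len(samples_mw) * latency_s
```

In this setting, the trace would be collected while the 1,000-request batch runs, and latency_s is the measured per-request inference time.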


We cross-validate our measurements using a USB digital multimeter that transmits data to computer software in real time via Bluetooth. The measurements obtained from the multimeter generally align with those from the internal power rails on the Jetson kits, although the external readings are consistently around 10% higher than the internal ones. This discrepancy may be attributable to factors unaccounted for in the power rail design. We opted to use internal measurements since they provide component-specific readings, whereas the multimeter can only measure overall energy consumption. Moreover, the multimeter supports one measurement per second, while the Jetson tools allow millisecond-scale measurements, which are better suited to inference workloads.

A.3 Measurement Overhead

Since we repeatedly query the power input or the built-in power management tool, we want to understand whether these queries themselves affect total energy consumption. Using the same USB digital multimeter described above, we run our inference program both with and without querying the power input or the power management tool. We find that the power consumption reported by the multimeter increases by roughly 5-10%, depending on the base power consumption. This increment is nearly constant across different models and runs, so we believe that using the internal measurements described in the section above does not affect our findings. The multimeter alone cannot provide the precision and flexibility needed to measure the energy cost of inference, which often completes at millisecond scale.
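
The experiment itself reduces to running an identical inference workload with the polling thread switched on and off while the multimeter logs whole-device power. A minimal sketch, where run_inference_batch and poll_power are hypothetical placeholders for the workload and for the query loop from A.2:

```python
# Sketch of the overhead experiment: run the same inference batch with power
# polling enabled and disabled; the external multimeter logs total energy in
# both runs, and the difference is the overhead of the queries themselves.
# run_inference_batch and poll_power are placeholders (not from the paper).
import threading

def run_batch(run_inference_batch, poll_power, polling: bool) -> None:
    stop = threading.Event()
    poller = threading.Thread(target=poll_power, args=(stop,), daemon=True)
    if polling:
        poller.start()       # the only difference between the two runs
    run_inference_batch()    # e.g. the 1,000-request batch from A.2
    stop.set()
    if polling:
        poller.join()
```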