PolyThrottle: Energy-efficient Neural Network Inference on Edge Devices: Architecture Overview

by Bayesian Inference (@bayesianinference)

At BayesianInference.Tech, as more evidence becomes available, we make predictions and refine beliefs.

April 2nd, 2024
Too Long; Didn't Read

This paper investigates how the configuration of on-device hardware affects energy consumption for neural network inference with regular fine-tuning.

STORY’S CREDIBILITY: Academic Research Paper

Part of HackerNoon's growing list of open-source research papers, promoting free access to academic material.

This paper is available on arxiv under CC BY-NC-ND 4.0 DEED license.

Authors:

(1) Minghao Yan, University of Wisconsin-Madison;

(2) Hongyi Wang, Carnegie Mellon University;

(3) Shivaram Venkataraman, University of Wisconsin-Madison.

4 ARCHITECTURE OVERVIEW

To take advantage of the opportunities described in the previous section, we design PolyThrottle, a system that navigates the tradeoff between latency SLO, batch size, and energy. PolyThrottle optimizes for the most energy-efficient hardware configuration under performance constraints and handles the scheduling of on-device fine-tuning.


Figure 1 shows a high-level overview of PolyThrottle’s workflow. In a production environment, sensors on the edge devices continuously collect data and send it to the deployed model for inference. Meanwhile, to adapt to a changing environment and shifting data patterns, the same data is saved for later fine-tuning. Due to the limited computational resources on these edge devices, fine-tuning workloads are often scheduled in conjunction with the continuously running inference requests. To address the challenges of model deployment on edge devices, PolyThrottle consists of two key components:


1. An optimization framework that finds the optimal hardware configuration for a given model under predetermined SLOs using only a few samples.


2. A performance predictor and scheduler that dynamically schedules fine-tuning requests and adjusts the hardware configuration while continuing to satisfy the SLO. Minimal sketches of both components follow below.
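
To make the division of labor concrete, here is a structural sketch of the two components as Python interfaces. The class and method names are illustrative assumptions for exposition, not PolyThrottle's actual API.

```python
from dataclasses import dataclass
from typing import Optional, Protocol

@dataclass
class HardwareConfig:
    # A candidate operating point: per-component frequencies plus batch size.
    cpu_freq_hz: int
    gpu_freq_hz: int
    mem_freq_hz: int
    batch_size: int

class ConfigOptimizer(Protocol):
    def find_best_config(self, latency_slo_ms: float) -> Optional[HardwareConfig]:
        """Search for the config minimizing per-query energy under the SLO."""
        ...

class FinetuneScheduler(Protocol):
    def predict_latency_ms(self, cfg: HardwareConfig,
                           concurrent_finetune: bool) -> float:
        """Predict inference latency, optionally with fine-tuning running."""
        ...

    def maybe_schedule_finetune(self, latency_slo_ms: float) -> bool:
        """Run one fine-tuning iteration if the SLO can still be met."""
        ...
```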


PolyThrottle tackles these challenges separately. Offline, we automatically find the best CPU frequency, GPU frequency, memory frequency, and recommended batch size for inference, satisfying the latency constraint while minimizing per-query energy consumption. We discuss the details of the optimization procedure in Section 5. We also show that our formulation can find near-optimal energy configurations in a few minutes using just a handful of samples. Compared to the lifespan of long-running inference workloads, this overhead is negligible.
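
For intuition, the sketch below shows a naïve exhaustive version of this offline search: measure each candidate configuration, discard those that violate the latency SLO, and keep the one with the lowest per-query energy. PolyThrottle's actual procedure (Section 5) reaches near-optimal configurations with far fewer samples; the candidate frequency lists and the `measure_latency_ms`/`measure_energy_mj` helpers here are hypothetical placeholders.

```python
import itertools

# Hypothetical candidate values; real devices expose a discrete set of
# supported frequencies per component.
CPU_FREQS = [345_600, 1_113_600, 2_035_200]            # example values
GPU_FREQS = [114_750_000, 675_750_000, 1_300_500_000]  # example values
MEM_FREQS = [204_000_000, 1_600_000_000]               # example values
BATCH_SIZES = [1, 2, 4, 8]

def find_best_config(latency_slo_ms, measure_latency_ms, measure_energy_mj):
    """Return the configuration minimizing per-query energy under the SLO.

    `measure_latency_ms` and `measure_energy_mj` are hypothetical callables
    that run a few inference requests at a given configuration and report
    per-query latency (ms) and energy (mJ).
    """
    best_cfg, best_energy = None, float("inf")
    for cpu, gpu, mem, bs in itertools.product(
            CPU_FREQS, GPU_FREQS, MEM_FREQS, BATCH_SIZES):
        cfg = {"cpu": cpu, "gpu": gpu, "mem": mem, "batch_size": bs}
        if measure_latency_ms(cfg) > latency_slo_ms:
            continue  # this configuration violates the latency SLO
        energy = measure_energy_mj(cfg)
        if energy < best_energy:
            best_cfg, best_energy = cfg, energy
    return best_cfg
```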


The optimal configuration is then installed on the inference server. At runtime, the client program processes the input and sends inference requests to the inference server. Meanwhile, if there are pending fine-tuning requests, the performance predictor estimates the inference latency under concurrent fine-tuning and decides whether the latency SLO can still be satisfied if fine-tuning is scheduled concurrently. A detailed discussion of performance prediction can be found in Section 6. The scheduler then determines a new configuration that satisfies the latency SLO while minimizing per-query energy consumption. If such a configuration is attainable, it schedules fine-tuning requests iteration-by-iteration until all pending requests are finished.
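
The runtime decision described above can be summarized in a short control loop. This is a minimal sketch under assumed interfaces: `predictor`, `optimizer`, `apply_config`, and `run_one_iteration` are hypothetical stand-ins for PolyThrottle's performance predictor (Section 6), configuration search (Section 5), and runtime hooks.

```python
def schedule_finetuning(pending_iters, predictor, optimizer,
                        apply_config, run_one_iteration, latency_slo_ms):
    """Run pending fine-tuning iteration-by-iteration without violating the SLO."""
    while pending_iters:
        # Find the most energy-efficient configuration that could still
        # meet the SLO with fine-tuning running alongside inference.
        cfg = optimizer.find_best_config(latency_slo_ms)
        if cfg is None:
            break  # no feasible configuration; defer fine-tuning
        if predictor.predict_latency_ms(cfg, concurrent_finetune=True) > latency_slo_ms:
            break  # concurrent fine-tuning would violate the SLO
        apply_config(cfg)                        # adjust hardware frequencies
        run_one_iteration(pending_iters.pop(0))  # one fine-tuning iteration
```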


Online vs. Offline: Adjusting the frequency of each hardware component entails writing to one or more hardware configuration files, a process that takes approximately 17ms per file. On Jetson TX2 and Orin, each CPU core, the GPU, and the memory each have a separate configuration file that determines the operating frequency. As a result, setting the operating frequencies for the CPUs, GPU, and memory can require up to 150ms. This duration alone could exceed the latency SLO of many applications, without even accounting for the additional overhead of completing the frequency changes. Since the latency SLO for a given workload does not change frequently, PolyThrottle determines the optimal hardware configuration before deployment and only performs online adjustments to accommodate fine-tuning workloads.
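
As a concrete illustration of why online reconfiguration is expensive, the sketch below writes frequencies through sysfs-style configuration files, one per component. The specific paths are assumptions for illustration; the actual nodes vary across Jetson models and JetPack/L4T versions, and writing them requires root privileges.

```python
import time
from pathlib import Path

# Illustrative paths only; actual sysfs nodes differ across Jetson models
# and L4T versions (e.g. the GPU devfreq node is named after the SoC's
# GPU block). Root privileges are required to write them.
CPU_FREQ = Path("/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq")
GPU_FREQ = Path("/sys/devices/gpu.0/devfreq/57000000.gpu/max_freq")

def set_freq(node: Path, hz: int) -> float:
    """Write a frequency and return the elapsed time for the write."""
    start = time.perf_counter()
    node.write_text(str(hz))  # ~17ms per file on TX2/Orin (Section 4)
    return time.perf_counter() - start
```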
