This paper is available on arXiv under CC BY-NC-ND 4.0 DEED license.

Authors:

(1) Minghao Yan, University of Wisconsin-Madison;
(2) Hongyi Wang, Carnegie Mellon University;
(3) Shivaram Venkataraman, myan@cs.wisc.edu.

Table of Links

Abstract & Introduction
Motivation
Opportunities
Architecture Overview
Problem Formulation: Two-Phase Tuning
Modeling Workload Interference
Experiments
Conclusion & References
A. Hardware Details
B. Experimental Results
C. Arithmetic Intensity
D. Predictor Analysis

4 ARCHITECTURE OVERVIEW

To take advantage of the opportunities described in the previous section, we design PolyThrottle, a system that navigates the tradeoff between latency SLO, batch size, and energy. PolyThrottle optimizes for the most energy-efficient hardware configurations under performance constraints and handles scheduling of on-device fine-tuning.

Figure 1 shows a high-level overview of PolyThrottle’s workflow. In a production environment, sensors on the edge devices continuously collect data and send the data to the deployed model for inference. In the meantime, to adapt to a changing environment and changing data patterns, these data are also saved for later fine-tuning. Because computation resources on these edge devices are limited, fine-tuning workloads are often scheduled alongside the continuously running inference requests.

To address the challenges in model deployment on edge devices, PolyThrottle consists of two key components:

1. An optimization framework that finds optimal hardware configurations for a given model under predetermined SLOs using only a few samples.
2. A performance predictor and scheduler that dynamically schedule fine-tuning requests and adjust to the optimal hardware configuration while satisfying the SLO.

PolyThrottle tackles these challenges separately.
Offline, we automatically find the CPU frequency, GPU frequency, memory frequency, and recommended batch size for inference requests that satisfy the latency constraints while minimizing per-query energy consumption. We discuss the details of the optimization procedure in Section 5. We also show that our formulation can find near-optimal energy configurations in a few minutes using just a handful of samples. Compared to the lifespan of long-running inference workloads, this overhead is negligible. The optimal configuration is then installed on the inference server.

At runtime, the client program processes the input and sends inference requests to the inference server. Meanwhile, if there are pending fine-tuning requests, the performance predictor predicts the inference latency under concurrent fine-tuning and decides whether the latency SLO can still be satisfied if fine-tuning is scheduled concurrently. A detailed discussion of performance prediction can be found in Section 6. The scheduler then determines a new configuration that satisfies the latency SLO while minimizing per-query energy consumption. If such a configuration is attainable, it schedules fine-tuning requests iteration by iteration until all pending requests are finished.

Online vs. Offline: Adjusting the frequency of each hardware component entails writing to one or more hardware configuration files, and each write takes approximately 17 ms. On Jetson TX2 and Orin, each CPU core, the GPU, and the memory each have a separate configuration file that determines the operating frequency. As a result, setting the operating frequencies for the CPUs, GPU, and memory could require up to 150 ms. This duration could exceed the latency SLO for many applications, and it does not account for the additional time the frequency changes take to complete. Since the latency SLO for a specific workload does not change frequently, PolyThrottle determines the optimal hardware configuration before deployment and only performs online adjustments to accommodate fine-tuning workloads.
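The offline selection rule and the reconfiguration cost described above can be illustrated with a minimal Python sketch. It is not PolyThrottle's optimizer (the actual procedure is described in Section 5); it simply filters a list of already-measured configurations by the latency SLO and keeps the one with the lowest per-query energy. The `Config` type, the function names, and all frequency, latency, and energy values are hypothetical and chosen only for illustration. The 17 ms per-write cost is the approximate figure quoted in the text.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

WRITE_MS = 17.0  # approximate cost of one configuration-file write (from the text)


@dataclass(frozen=True)
class Config:
    """One measured hardware configuration (hypothetical, illustrative values)."""
    cpu_mhz: int
    gpu_mhz: int
    mem_mhz: int
    batch_size: int
    latency_ms: float           # measured or predicted inference latency
    energy_mj_per_query: float  # measured per-query energy


def reconfig_overhead_ms(num_files: int, write_ms: float = WRITE_MS) -> float:
    """Time spent rewriting `num_files` configuration files, one write each."""
    return num_files * write_ms


def pick_config(candidates: Iterable[Config], slo_ms: float) -> Optional[Config]:
    """Return the lowest-energy candidate that meets the SLO, or None if none does."""
    feasible = [c for c in candidates if c.latency_ms <= slo_ms]
    return min(feasible, key=lambda c: c.energy_mj_per_query, default=None)


# Made-up measurements: lower frequencies save energy but increase latency.
candidates = [
    Config(2000, 1300, 1800, 8, latency_ms=40.0, energy_mj_per_query=120.0),
    Config(1200, 800, 1600, 8, latency_ms=55.0, energy_mj_per_query=90.0),
    Config(800, 400, 800, 8, latency_ms=95.0, energy_mj_per_query=70.0),
]

best = pick_config(candidates, slo_ms=60.0)
print(best)                     # the 1200/800/1600 MHz configuration
print(reconfig_overhead_ms(8))  # 136.0
```

With a hypothetical device exposing eight configuration files, rewriting all of them costs about 136 ms at 17 ms per write, which is in line with the "up to 150 ms" bound above and larger than the 60 ms SLO used in this example. This is why the full search is run before deployment rather than on the request path.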


PolyThrottle: Energy-efficient Neural Network Inference on Edge Devices: Architecture Overview
