
Modeling Workload Interference

by Bayesian Inference

April 2nd, 2024

Too Long; Didn't Read

This paper investigates how the configuration of on-device hardware affects energy consumption for neural network inference with regular fine-tuning.

Academic Research Paper


Part of HackerNoon's growing list of open-source research papers, promoting free access to academic material.

This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.

Authors:

(1) Minghao Yan, University of Wisconsin-Madison;

(2) Hongyi Wang, Carnegie Mellon University;

(3) Shivaram Venkataraman, University of Wisconsin-Madison.

6 MODELING WORKLOAD INTERFERENCE

Consider the case where we run an inference workload and aim to support fine-tuning without interfering with the online inference process. When a fine-tuning request arrives, we need to decide whether it can be executed without violating inference SLOs. Time-sharing has been the default method for sharing GPU workloads: workloads run in separate time slices and alternate use of the GPU. Recently, CUDA streams, Multi-Process Service (MPS) (NVIDIA, 2023a), and MIG (NVIDIA, 2023b) have been proposed to perform space-sharing on GPUs. However, these approaches are not supported on edge GPU devices (Bai et al., 2020; Zhao et al., 2023; Yu & Chowdhury, 2020; Wu et al., 2021). Given this setup, we propose building a performance model that predicts inference latency in the presence of fine-tuning requests and executes a fine-tuning request only if the predicted latency satisfies the SLO.
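As a minimal sketch of this admission decision (the predictor and parameter names are placeholders, not PolyThrottle's actual interface), the check amounts to comparing the model's predicted latency against the SLO:

```python
# Minimal sketch: admit a fine-tuning request only if the performance model
# predicts that inference latency will stay within the SLO.
# `predict_latency` stands in for the fitted model described later in this section.

def can_admit_finetune(predict_latency, features, slo_ms):
    predicted_ms = predict_latency(features)  # predicted inference latency under co-location
    return predicted_ms <= slo_ms
```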


Feature selection: To build the performance model, we leverage the following insights to select features:

1. In convolutional neural networks, the performance of the 2D convolution layers largely determines the overall performance of the network, and their latency is correlated with the number of floating-point operations (FLOPs) required during forward/backward propagation.

2. The ratio between the number of FLOPs and the number of memory accesses, known as arithmetic intensity, together with the total FLOPs, captures whether a neural network is compute-bound or memory-bound.

Using these insights, we add the following features to our model: Inference FLOPs, Inference Arithmetic Intensity, Fine-tuning FLOPs, Fine-tuning Arithmetic Intensity, and Batch Size.
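For concreteness, a feature vector along these lines might be assembled as follows (a sketch with assumed names; arithmetic intensity is computed here as FLOPs per byte of memory traffic):

```python
# Sketch of the five-dimensional feature vector used by the interference model.
# All argument names are illustrative, not taken from the paper's code.

def interference_features(inf_flops, inf_mem_bytes, ft_flops, ft_mem_bytes, batch_size):
    inf_ai = inf_flops / inf_mem_bytes  # inference arithmetic intensity (FLOPs per byte)
    ft_ai = ft_flops / ft_mem_bytes     # fine-tuning arithmetic intensity (FLOPs per byte)
    return [inf_flops, inf_ai, ft_flops, ft_ai, batch_size]
```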


Model selection: We propose using a linear model to predict inference latency when a fine-tuning workload is running concurrently on the same device. The model aims to capture how the selected features affect resource contention between the inference workload and the fine-tuning workload, and therefore the inference latency. The proposed model can be summarized as follows:


[Equation: linear model of inference latency under concurrent fine-tuning]
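Since the equation itself is not reproduced in this excerpt, purely as an illustration of the kind of form a linear model over the listed features could take (coefficients fitted from profiling samples; the paper's exact formulation may differ):

```latex
% Illustrative form only -- not the paper's exact equation.
% AI = arithmetic intensity, B = batch size, coefficients c_i >= 0 fitted via NNLS.
T_{\mathrm{inf}} \approx c_0
  + c_1 \,\mathrm{FLOPs}_{\mathrm{inf}}
  + c_2 \,\mathrm{AI}_{\mathrm{inf}}
  + c_3 \,\mathrm{FLOPs}_{\mathrm{ft}}
  + c_4 \,\mathrm{AI}_{\mathrm{ft}}
  + c_5 \, B
```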


Given the above performance model, we use a Non-negative Least Squares (NNLS) solver to find the model that best fits the training data. An advantage of NNLS for linear models is that it can be solved with very few training data points (Venkataraman et al., 2016). We collect a small number of samples on the given model by varying the inference and fine-tuning batch sizes and the output dimension, which captures a range of fine-tuning settings. This model is used as part of the workload scheduler during deployment to predict whether it is possible to schedule a fine-tuning request.
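A minimal sketch of this fitting step, assuming SciPy's NNLS solver; the feature values and latencies below are made-up placeholders, not measurements from the paper:

```python
# Fit the linear interference model with Non-negative Least Squares (SciPy's nnls).
import numpy as np
from scipy.optimize import nnls

# Rows: [Inference FLOPs, Inference AI, Fine-tuning FLOPs, Fine-tuning AI, Batch size],
# gathered by varying inference/fine-tuning batch sizes and the output dimension.
X = np.array([
    [1.2e9, 40.0, 3.5e9, 55.0, 1.0],
    [2.4e9, 45.0, 3.5e9, 55.0, 2.0],
    [4.8e9, 50.0, 7.0e9, 60.0, 4.0],
    [9.6e9, 55.0, 7.0e9, 60.0, 8.0],
    [1.2e9, 40.0, 1.4e10, 65.0, 1.0],
    [4.8e9, 50.0, 1.4e10, 65.0, 4.0],
])
y = np.array([8.3, 9.1, 12.4, 17.8, 10.2, 15.6])  # measured inference latency (ms)

coeffs, _ = nnls(X, y)  # one non-negative coefficient per feature

def predict_latency(features):
    """Predicted inference latency (ms) for a new feature vector."""
    return float(np.dot(coeffs, features))
```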




Fine-tuning scheduler: During inference, when there are outstanding fine-tuning requests, PolyThrottle uses the model to decide whether it is possible to schedule the request online without violating the SLO. When the model finds a feasible configuration, it adjusts accordingly until either all pending requests are finished or a new latency constraint is imposed.
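A rough sketch of this scheduling loop (the structure and names are assumptions for illustration, not PolyThrottle's implementation):

```python
# Illustrative scheduler loop: drain pending fine-tuning requests only while the
# fitted model predicts the inference SLO will still be met.
from collections import deque

def schedule_finetuning(pending, predict_latency, features_for, slo_ms, run_one_step):
    """pending: fine-tuning requests; features_for(req) builds the model's feature
    vector for a request; run_one_step(req) runs one step and returns True when done."""
    queue = deque(pending)
    while queue:
        req = queue[0]
        if predict_latency(features_for(req)) > slo_ms:
            break  # infeasible right now: defer fine-tuning to protect the inference SLO
        if run_one_step(req):
            queue.popleft()  # request finished; move on to the next one
```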
