
Modeling Workload Interference

by Bayesian Inference

April 2nd, 2024

Too Long; Didn't Read

This paper investigates how the configuration of on-device hardware affects energy consumption for neural network inference with regular fine-tuning.

Academic Research Paper


Part of HackerNoon's growing list of open-source research papers, promoting free access to academic material.

This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.

Authors:

(1) Minghao Yan, University of Wisconsin-Madison;

(2) Hongyi Wang, Carnegie Mellon University;

(3) Shivaram Venkataraman, University of Wisconsin-Madison.

6 MODELING WORKLOAD INTERFERENCE

Consider the case where we run an inference workload and aim to support fine-tuning without interfering with the online inference process. When a fine-tuning request arrives, we need to decide whether it can be executed without violating inference SLOs. Time-sharing has been the default method for sharing GPU workloads: workloads run in separate time slices and alternate use of the GPU. Recently, CUDA streams, Multi-Process Service (MPS) (NVIDIA, 2023a), and MIG (NVIDIA, 2023b) have been proposed to perform space-sharing on GPUs. However, these approaches are not supported on edge GPU devices (Bai et al., 2020; Zhao et al., 2023; Yu & Chowdhury, 2020; Wu et al., 2021). Given this setup, we propose building a performance model that predicts inference latency in the presence of fine-tuning requests and executes a fine-tuning request only if the predicted latency satisfies the SLO.
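As a minimal sketch of this admission decision (the predictor and parameter names are placeholders, not PolyThrottle's actual interface), the check amounts to comparing the model's predicted latency against the SLO:

```python
# Minimal sketch: admit a fine-tuning request only if the performance model
# predicts that inference latency will stay within the SLO.
# `predict_latency` stands in for the fitted model described later in this section.

def can_admit_finetune(predict_latency, features, slo_ms):
    predicted_ms = predict_latency(features)  # predicted inference latency under co-location
    return predicted_ms <= slo_ms
```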


Feature selection: To build the performance model, we leverage the following insights to select features:

1. In convolutional neural networks, the performance of the 2D convolution layers largely determines the overall performance of the network, and their latency is correlated with the number of floating-point operations (FLOPs) required during forward/backward propagation.

2. The ratio between the number of FLOPs and the number of memory accesses, known as arithmetic intensity, together with the total FLOPs, captures whether a neural network is compute-bound or memory-bound.

Using these insights, we add the following features to our model: Inference FLOPs, Inference Arithmetic Intensity, Fine-tuning FLOPs, Fine-tuning Arithmetic Intensity, and Batch Size.
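For concreteness, a feature vector along these lines might be assembled as follows (a sketch with assumed names; arithmetic intensity is computed here as FLOPs per byte of memory traffic):

```python
# Sketch of the five-dimensional feature vector used by the interference model.
# All argument names are illustrative, not taken from the paper's code.

def interference_features(inf_flops, inf_mem_bytes, ft_flops, ft_mem_bytes, batch_size):
    inf_ai = inf_flops / inf_mem_bytes  # inference arithmetic intensity (FLOPs per byte)
    ft_ai = ft_flops / ft_mem_bytes     # fine-tuning arithmetic intensity (FLOPs per byte)
    return [inf_flops, inf_ai, ft_flops, ft_ai, batch_size]
```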


Model selection: We propose using a linear model to predict inference latency when a fine-tuning workload is running concurrently on the same device. The model aims to capture how the selected features affect resource contention between the inference workload and the fine-tuning workload, and therefore the inference latency. The proposed model can be summarized as follows:


[Equation: linear model of inference latency under concurrent fine-tuning]
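Since the equation itself is not reproduced in this excerpt, purely as an illustration of the kind of form a linear model over the listed features could take (coefficients fitted from profiling samples; the paper's exact formulation may differ):

```latex
% Illustrative form only -- not the paper's exact equation.
% AI = arithmetic intensity, B = batch size, coefficients c_i >= 0 fitted via NNLS.
T_{\mathrm{inf}} \approx c_0
  + c_1 \,\mathrm{FLOPs}_{\mathrm{inf}}
  + c_2 \,\mathrm{AI}_{\mathrm{inf}}
  + c_3 \,\mathrm{FLOPs}_{\mathrm{ft}}
  + c_4 \,\mathrm{AI}_{\mathrm{ft}}
  + c_5 \, B
```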


Given the above performance model, we use a Non-negative Least Squares (NNLS) solver to find the model that best fits the training data. An advantage of NNLS for linear models is that it can be solved with very few training data points (Venkataraman et al., 2016). We collect a small number of samples on the given model by varying the inference and fine-tuning batch sizes and the output dimension, which captures a range of fine-tuning settings. This model is used as part of the workload scheduler during deployment to predict whether it is possible to schedule a fine-tuning request.
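A minimal sketch of this fitting step, assuming SciPy's NNLS solver; the feature values and latencies below are made-up placeholders, not measurements from the paper:

```python
# Fit the linear interference model with Non-negative Least Squares (SciPy's nnls).
import numpy as np
from scipy.optimize import nnls

# Rows: [Inference FLOPs, Inference AI, Fine-tuning FLOPs, Fine-tuning AI, Batch size],
# gathered by varying inference/fine-tuning batch sizes and the output dimension.
X = np.array([
    [1.2e9, 40.0, 3.5e9, 55.0, 1.0],
    [2.4e9, 45.0, 3.5e9, 55.0, 2.0],
    [4.8e9, 50.0, 7.0e9, 60.0, 4.0],
    [9.6e9, 55.0, 7.0e9, 60.0, 8.0],
    [1.2e9, 40.0, 1.4e10, 65.0, 1.0],
    [4.8e9, 50.0, 1.4e10, 65.0, 4.0],
])
y = np.array([8.3, 9.1, 12.4, 17.8, 10.2, 15.6])  # measured inference latency (ms)

coeffs, _ = nnls(X, y)  # one non-negative coefficient per feature

def predict_latency(features):
    """Predicted inference latency (ms) for a new feature vector."""
    return float(np.dot(coeffs, features))
```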




Fine-tuning scheduler: During inference, when there are outstanding fine-tuning requests, PolyThrottle uses the model to decide whether it is possible to schedule the request online without violating the SLO. When the model finds a feasible configuration, it adjusts accordingly until either all pending requests are finished or a new latency constraint is imposed.
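A rough sketch of this scheduling loop (the structure and names are assumptions for illustration, not PolyThrottle's implementation):

```python
# Illustrative scheduler loop: drain pending fine-tuning requests only while the
# fitted model predicts the inference SLO will still be met.
from collections import deque

def schedule_finetuning(pending, predict_latency, features_for, slo_ms, run_one_step):
    """pending: fine-tuning requests; features_for(req) builds the model's feature
    vector for a request; run_one_step(req) runs one step and returns True when done."""
    queue = deque(pending)
    while queue:
        req = queue[0]
        if predict_latency(features_for(req)) > slo_ms:
            break  # infeasible right now: defer fine-tuning to protect the inference SLO
        if run_one_step(req):
            queue.popleft()  # request finished; move on to the next one
```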
