This paper is available on arxiv under CC BY-NC-ND 4.0 DEED license.
Authors:
(1) Minghao Yan, University of Wisconsin-Madison;
(2) Hongyi Wang, Carnegie Mellon University;
(3) Shivaram Venkataraman, University of Wisconsin-Madison.
Consider the case where we run an inference workload and aim to support fine-tuning without interfering with the online inference process. When a fine-tuning request arrives, we need to decide whether it can be executed without violating inference SLOs. Time-sharing has been the default method for sharing a GPU among workloads: each workload runs in its own time slice, and the workloads alternate use of the GPU. Recently, CUDA streams, Multi-Process Service (MPS) (NVIDIA, 2023a), and MIG (NVIDIA, 2023b) have been proposed to perform space-sharing on GPUs. However, these approaches are not supported on edge GPU devices (Bai et al., 2020; Zhao et al., 2023; Yu & Chowdhury, 2020; Wu et al., 2021). Given this setup, we propose building a performance model that predicts inference latency in the presence of fine-tuning requests and executing a fine-tuning request only if the predicted latency satisfies the SLO.
Feature selection: To build the performance model, we leverage the following insights to select features: 1. In convolutional neural networks, the 2D convolution layers largely determine the overall performance of the network, and their latency is correlated with the number of floating point operations (FLOPs) required during forward/backward propagation. 2. The ratio between the number of FLOPs and the number of memory accesses, also known as arithmetic intensity, together with total FLOPs, captures whether a neural network is compute-bound or memory-bound. Using these insights, we add the following features to our model: Inference FLOPs, Inference Arithmetic Intensity, Fine-tuning FLOPs, Fine-tuning Arithmetic Intensity, and Batch Size.
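As a concrete illustration of the two per-workload quantities behind these features, the sketch below computes convolution FLOPs and arithmetic intensity; the helper names and example numbers are ours, not taken from the paper.

```python
# Illustrative helpers for the per-workload features used by the model;
# names and example numbers are placeholders, not values from the paper.

def conv2d_flops(batch, in_channels, out_channels, out_h, out_w, k_h, k_w):
    """Forward-pass FLOPs of a standard 2D convolution (multiply + add)."""
    return 2 * batch * in_channels * out_channels * out_h * out_w * k_h * k_w


def arithmetic_intensity(total_flops, total_bytes_accessed):
    """FLOPs per byte of memory traffic; low values indicate a memory-bound
    layer, high values a compute-bound one."""
    return total_flops / total_bytes_accessed


# Example: a 7x7 convolution over a 112x112 output at batch size 8.
flops = conv2d_flops(batch=8, in_channels=3, out_channels=64,
                     out_h=112, out_w=112, k_h=7, k_w=7)
```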
Model selection: We propose using a linear model to predict inference latency when a fine-tuning workload is running concurrently on the same device. The model aims to capture how the proposed variables affect the resource contention between the inference workload and the fine-tuning workload and, therefore, the inference latency. The proposed model can be summarized as follows:
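A minimal sketch of such a model, assuming a plain linear combination of the five features above (the symbols here are illustrative, and the fitted form may include additional terms):

$$T_{\mathrm{inf}} = w_1 F_{\mathrm{inf}} + w_2 A_{\mathrm{inf}} + w_3 F_{\mathrm{ft}} + w_4 A_{\mathrm{ft}} + w_5 B + b$$

where $F$ and $A$ denote the FLOPs and arithmetic intensity of the inference and fine-tuning workloads, $B$ is the batch size, and the non-negative coefficients $w_1, \dots, w_5$ and intercept $b$ are fit from profiled samples.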
Given the above performance model, we use a Non-negative Least Squares (NNLS) solver to find the model that best fits the training data. An advantage of NNLS for linear models is that it can be solved with very few training data points (Venkataraman et al., 2016). We collect a small number of samples for the given model by varying the inference and fine-tuning batch sizes and the output dimension, which captures various fine-tuning settings. This model is used as part of the workload scheduler during deployment to predict whether it is possible to schedule a fine-tuning request.
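A minimal sketch of this fitting step, assuming the linear form above and using SciPy's NNLS solver; the profiled samples below are placeholder numbers, not measurements from the paper.

```python
# Fit the latency model with non-negative least squares (NNLS).
# The profiled samples are placeholders, not measurements from the paper.
import numpy as np
from scipy.optimize import nnls

# Columns: inference GFLOPs, inference arithmetic intensity,
#          fine-tuning GFLOPs, fine-tuning arithmetic intensity, batch size.
samples = np.array([
    [1.2, 45.0,  3.5, 60.0,  1],
    [1.2, 45.0,  3.5, 60.0,  4],
    [2.4, 52.0,  7.0, 58.0,  4],
    [2.4, 52.0, 14.0, 61.0,  8],
    [4.8, 55.0,  7.0, 58.0,  8],
    [4.8, 55.0, 14.0, 61.0, 16],
])
measured_latency_ms = np.array([8.1, 9.4, 12.6, 16.8, 19.3, 27.5])

# Append a constant column so the model can also learn an intercept.
design = np.hstack([samples, np.ones((len(samples), 1))])
weights, _residual = nnls(design, measured_latency_ms)

def predict_latency_ms(features):
    """Predicted inference latency for one candidate co-located configuration."""
    return float(np.append(features, 1.0) @ weights)
```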
Fine-tuning scheduler: During inference, when there are outstanding fine-tuning requests, PolyThrottle uses the model to decide whether a request can be scheduled online without violating the SLO. When the model identifies a feasible configuration, PolyThrottle adjusts accordingly until either all pending requests are finished or a new latency constraint is imposed.
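A minimal sketch of that admission check, assuming the fitted predictor from the previous sketch; the candidate list and feature layout are our own illustrative assumptions.

```python
# Admission check for a pending fine-tuning request. `predict_latency_ms` is
# the helper fitted in the previous sketch; candidates and feature layout are
# illustrative assumptions, not the paper's exact interface.

def find_feasible_config(predict_latency_ms, inference, ft_candidates, slo_ms):
    """Return the first fine-tuning candidate whose predicted co-located
    inference latency meets the SLO, or None if no candidate qualifies."""
    for ft in ft_candidates:
        # Feature order matches the fitted model: inference GFLOPs/intensity,
        # fine-tuning GFLOPs/intensity, inference batch size.
        features = [inference["gflops"], inference["intensity"],
                    ft["gflops"], ft["intensity"], inference["batch_size"]]
        if predict_latency_ms(features) <= slo_ms:
            return ft
    return None  # no feasible configuration; the request stays queued


# Example: try three fine-tuning configurations against a 15 ms SLO.
candidates = [{"gflops": 3.5, "intensity": 60.0},
              {"gflops": 7.0, "intensity": 58.0},
              {"gflops": 14.0, "intensity": 61.0}]
chosen = find_feasible_config(
    predict_latency_ms,
    {"gflops": 2.4, "intensity": 52.0, "batch_size": 4},
    candidates, slo_ms=15.0)
```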