This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.
Authors:
(1) Minghao Yan, University of Wisconsin-Madison;
(2) Hongyi Wang, Carnegie Mellon University;
(3) Shivaram Venkataraman, [email protected].
We vary the latency SLO to assess how the predictor schedules fine-tuning requests. We replay a 60-second stream in which the latency SLO is set to 250 ms for the first 30 seconds and then raised to 700 ms for the remaining 30 seconds. As shown in Figure 14, under the stringent SLO the predictor determines that no fine-tuning request can be scheduled without violating the latency SLO, so none is scheduled. Under the relaxed SLO, the predictor determines that fine-tuning is feasible and schedules the requests sequentially, issuing each one only after the preceding request has completed and sent a completion signal.
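The gating behavior described above can be illustrated with a minimal simulation sketch. It is not the predictor used in the paper: the function names, the constant predicted latencies (150 ms without and 400 ms with a co-located fine-tuning request), the 5-second job duration, and the 1-second decision tick are all hypothetical values chosen only so that the 250 ms SLO rejects fine-tuning and the 700 ms SLO admits it.

```python
from dataclasses import dataclass


@dataclass
class FineTuneJob:
    job_id: int
    duration_s: float  # hypothetical time the job takes once started


def slo_at(t_s: float) -> float:
    """Latency SLO (ms) of the replayed stream: 250 ms for 30 s, then 700 ms."""
    return 250.0 if t_s < 30.0 else 700.0


def predicted_latency_ms(with_finetune: bool) -> float:
    """Stand-in for the predictor; the constants are assumptions, not measurements."""
    return 400.0 if with_finetune else 150.0


def replay(jobs, horizon_s=60.0, tick_s=1.0):
    """Return (start_time_s, job_id) pairs for every job that gets scheduled."""
    schedule = []
    pending = list(jobs)
    busy_until = 0.0  # time at which the running job signals completion
    t = 0.0
    while t < horizon_s and pending:
        previous_done = t >= busy_until
        fits_slo = predicted_latency_ms(with_finetune=True) <= slo_at(t)
        # Start the next job only if the previous one has completed and the
        # predicted inference latency stays within the current SLO.
        if previous_done and fits_slo:
            job = pending.pop(0)
            schedule.append((t, job.job_id))
            busy_until = t + job.duration_s
        t += tick_s
    return schedule


jobs = [FineTuneJob(job_id=i, duration_s=5.0) for i in range(4)]
schedule = replay(jobs)
for start, job_id in schedule:
    print(f"t={start:4.1f}s  start fine-tuning job {job_id}")
```

Under these assumptions nothing is scheduled during the first 30 seconds, and the four jobs then start back to back at 30, 35, 40, and 45 seconds. The sketch does not model preemption of a running job when the SLO tightens, nor any dependence of the predicted latency on load.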