This paper is available on arxiv under CC BY-NC-ND 4.0 DEED license.
Authors:
(1) Minghao Yan, University of Wisconsin-Madison;
(2) Hongyi Wang, Carnegie Mellon University;
(3) Shivaram Venkataraman, University of Wisconsin-Madison.
Hardware Platform: Our experiments are conducted on the Jetson TX2 Developer Kit and Jetson Orin Developer Kit. To assess the energy consumption of our program, we employ the built-in power monitors on the Jetson TX2 and Jetson Orin Developer Kits. We also cross-validate our measurements with an external digital multimeter (See Appendix A for more details on hardware and energy measurement).
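The built-in monitors are exposed through sysfs, so energy can be estimated by polling the instantaneous power reading while the workload runs and integrating over time. A minimal sketch of this procedure is shown below; the sysfs path is an assumption that varies across Jetson models and JetPack versions and would need to be adjusted for a specific board.

```python
import threading
import time

# Hypothetical sysfs node for the built-in INA3221 power monitor; the actual
# path differs between the TX2 and the Orin and across JetPack versions.
POWER_NODE = "/sys/bus/i2c/drivers/ina3221x/0-0040/iio:device0/in_power0_input"

def read_power_mw(node: str = POWER_NODE) -> float:
    """Return the instantaneous board power in milliwatts."""
    with open(node) as f:
        return float(f.read().strip())

def measure_energy_joules(run_workload, interval_s: float = 0.05) -> float:
    """Estimate the energy consumed while run_workload() executes by sampling
    the power monitor every interval_s seconds and integrating over time."""
    samples, stop = [], threading.Event()

    def sampler():
        while not stop.is_set():
            samples.append(read_power_mw())
            time.sleep(interval_s)

    thread = threading.Thread(target=sampler)
    thread.start()
    start = time.time()
    run_workload()
    elapsed = time.time() - start
    stop.set()
    thread.join()
    average_power_w = sum(samples) / max(len(samples), 1) / 1000.0
    return average_power_w * elapsed
```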
Workload Selection: We base our experiments on the EfficientNet family and Bert models (Tan & Le, 2019; Devlin et al., 2018). EfficientNet is chosen not only for its status as a state-of-the-art convolutional network in on-device and mobile settings but also for its principled approach to scaling the width, depth, and resolution of convolution layers. Table 4 summarizes the scaling pattern of EfficientNet from the smallest B0 to the largest B7. We select Bert to investigate energy usage patterns in a Transformer-based model (Wolf et al., 2020), whose workload is more memory-bound compared to convolution-based neural networks. Bert and its variants (Kim et al., 2021; Devlin et al., 2018; Sanh et al., 2019; Tambe et al., 2021) are widely used for question answering tasks (Rajpurkar et al., 2016), making them applicable to numerous edge devices, such as smart home assistants and smart speakers.
Dataset: We evaluate PolyThrottle on real-world traffic stream data (Shen et al., 2019), sampling frames uniformly to feed into EfficientNet. For Bert, we evaluate on SQuAD (Rajpurkar et al., 2016) for question answering. Note that the choice of dataset does not affect PolyThrottle’s performance, since inference latency does not change significantly across datasets once the model is chosen.
Implementation: PolyThrottle is built on the Nvidia Triton inference server. To maximize performance, we use TensorRT to profile various data layouts and tiling strategies and generate kernels that yield the fastest execution graph for a given hardware platform. Our modules include a Bayesian optimizer for determining the best configuration, an inference client responsible for preprocessing and submitting requests to the inference server, and a performance predictor module integrated into the inference client for scheduling fine-tuning requests. We maintain separate queues for inference and fine-tuning requests.
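As an illustration of the inference-client side, the sketch below submits one preprocessed batch to a Triton server over HTTP. The server URL, model name, and tensor names are hypothetical placeholders rather than the names used in our deployment.

```python
import numpy as np
import tritonclient.http as httpclient

# Server URL, model name, and tensor names below are illustrative placeholders.
client = httpclient.InferenceServerClient(url="localhost:8000")

def submit_batch(batch: np.ndarray) -> np.ndarray:
    """Send one preprocessed batch (NCHW, FP16) to the inference server."""
    inp = httpclient.InferInput("input", list(batch.shape), "FP16")
    inp.set_data_from_numpy(batch)
    out = httpclient.InferRequestedOutput("output")
    result = client.infer(model_name="efficientnet_b0_trt", inputs=[inp], outputs=[out])
    return result.as_numpy("output")

# Example: a single 224x224 frame submitted as a batch of one.
frame = np.random.rand(1, 3, 224, 224).astype(np.float16)
logits = submit_batch(frame)
```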
In this experiment, we carry out an extensive empirical analysis of tuning various models across different hardware configurations while also adjusting the quantization level. We perform a grid search on EfficientNet B0, B4, B7, and Bert Base to examine the potential energy savings and identify the optimal GPU and memory frequencies for each model. We also adjust the quantization level for each tested model, evaluating 16-bit and 32-bit floating point (FP16/FP32) precision. The baseline and optimal energy consumption and configurations referenced later in this section are taken from these grid-search results. Having obtained the optimal frequencies via grid search, we next evaluate the average number of attempts PolyThrottle needs to find a solution within 5% of the optimal one. We compare our Constrained Bayesian Optimization (CBO) formulation against Random Search (RS).
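For reference, such a grid search can be driven by pinning candidate clocks through the devfreq interface that Jetson boards expose and measuring energy per query at each grid point. The sketch below is illustrative only: the sysfs paths and frequency lists are assumptions that differ across boards, and measure_energy_per_query stands in for running the model at a given precision and recording its energy (e.g., with the power-monitor sketch above).

```python
import itertools

# Hypothetical devfreq nodes; the actual sysfs paths differ between the TX2
# and the Orin and across JetPack versions.
GPU_FREQ_NODE = "/sys/devices/gpu.0/devfreq/gpu.0/max_freq"
MEM_FREQ_NODE = "/sys/kernel/debug/bpmp/debug/clk/emc/rate"

def set_frequency(node: str, freq_hz: int) -> None:
    """Pin a clock by writing the target frequency to its devfreq node."""
    with open(node, "w") as f:
        f.write(str(freq_hz))

def grid_search(gpu_freqs, mem_freqs, precisions, measure_energy_per_query):
    """Exhaustively evaluate every (GPU freq, memory freq, precision) point.

    measure_energy_per_query(precision) is supplied by the caller; it should
    run the model at the current clocks and return joules per inference.
    """
    best = None
    for gpu_f, mem_f, prec in itertools.product(gpu_freqs, mem_freqs, precisions):
        set_frequency(GPU_FREQ_NODE, gpu_f)
        set_frequency(MEM_FREQ_NODE, mem_f)
        energy = measure_energy_per_query(prec)
        if best is None or energy < best[0]:
            best = (energy, gpu_f, mem_f, prec)
    return best
```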
Experiment Settings: We measure the average number of attempts needed to find a near-optimal configuration. For Random Search, we calculate the expected number of trials needed to find a near-optimal configuration by computing the fraction of near-optimal configurations in the grid and taking its reciprocal. For CBO, we set the ξ parameter of the Expected Improvement acquisition function to 0.1 and the number of initial random samples to 5, which we find to work well across different models and hardware platforms. We conduct two experiments with different inference latency constraints; the results can be found in Figure 5, and a sketch of the search loop follows the two settings:
1. We restrict inference latency to within 20% of the optimal latency. In this setting, the tight latency constraint makes it impossible to batch inference queries, effectively reducing the search space for the optimal configuration.
2. In the second benchmark, we relax the inference latency constraint to include the configurations that provide the lowest energy-per-query in Figure 2. In this setting, we need to explore the batch size dimension to find the configuration that minimizes energy. We test on EfficientNet B0, B4, and B7, as well as Bert Base on both Jetson TX2 and Jetson Orin.
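For concreteness, the following sketch shows how such a search loop could be set up with scikit-optimize, using Expected Improvement with ξ = 0.1 and 5 initial random samples. The search space, the latency SLO, and the evaluate helper are illustrative placeholders, and folding the latency constraint into the objective as a penalty is a simplification of our constrained formulation.

```python
from skopt import gp_minimize
from skopt.space import Categorical, Integer

# Placeholder search space: indices into the supported GPU/memory frequency
# tables plus the inference batch size.
space = [
    Integer(0, 11, name="gpu_freq_idx"),
    Integer(0, 3, name="mem_freq_idx"),
    Categorical([1, 2, 4, 8, 16], name="batch_size"),
]

LATENCY_SLO_S = 0.7  # example latency constraint

def objective(x, evaluate):
    """Energy per query, heavily penalized when the latency SLO is violated.
    evaluate(*x) runs the model at configuration x and returns (energy, latency)."""
    energy, latency = evaluate(*x)
    if latency > LATENCY_SLO_S:
        return energy + 1e3  # penalty keeps infeasible points unattractive
    return energy

def search(evaluate):
    result = gp_minimize(
        lambda x: objective(x, evaluate),
        space,
        acq_func="EI",       # Expected Improvement acquisition
        xi=0.1,              # exploration parameter used in the paper
        n_initial_points=5,  # initial random samples
        n_calls=20,
    )
    return result.x, result.fun
```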
Results: Figure 5 shows that CBO outperforms RS in both scenarios. Since CBO models the relationship between hardware configuration and latency, it can find a near-optimal solution with only 5 to 15 samples. In the second scenario, the performance of RS deteriorates because it cannot exploit the relationship between latency and batch size when dealing with a multiplicatively larger search space. Overall, CBO requires 3-10x fewer samples in the second setting. The overhead of performing CBO is also minimal: as shown in Figure 5, CBO requires only around 15 samples to find a near-optimal solution, and the optimization procedure can be completed in a few minutes. When a new model is deployed, only a few minutes of overhead are therefore needed to find optimal configurations for it.
It is important to note that although RS might achieve performance comparable to CBO under certain conditions, this figure is merely an expected value and the variance of RS is large. For instance, if 10 out of 200 configurations are near-optimal, the number of trials needed to reach a near-optimal configuration follows a geometric distribution with success probability p = 0.05, giving an expected value of 1/p = 20 trials and a standard deviation of √(1−p)/p ≈ 19.49. Consequently, it is plausible that even after 40 trials, RS might still fail to identify a near-optimal configuration. On the other hand, the standard deviation of CBO is smaller: in all experiments, CBO’s standard deviations are less than 3.
Next, we evaluate how well PolyThrottle handles fine-tuning requests alongside inference. The central question we aim to address is whether our performance predictor can effectively identify when the SLO requirement is at risk of being violated and adjust accordingly, and whether reducing the inference batch size and trading off throughput can satisfy the latency SLO. To simulate this scenario, we generate two distinct synthetic inference arrival patterns (Uniform and Poisson), use the publicly available Twitter trace (twi, 2018), and compare our adaptive scheduling approach to greedy scheduling, where a fine-tuning request is scheduled as soon as it arrives. These three arrival patterns range from highly controlled to bursty. In this context, we contrast PolyThrottle’s adaptive scheduling mechanism with the greedy scheduling approach to assess PolyThrottle’s efficacy in meeting the desired SLO requirement.
Experiment Settings: We evaluate on both synthetic and real workloads. For synthetic workloads, we generate a stream of inference requests using both Uniform and Poisson distributions. For the real-world workload, we first uniformly sample a day of Twitter streaming traces and compute the variance of requests during each minute. We then pick the segment with the highest variance to test PolyThrottle’s capability in handling request bursts (twi, 2018; Romero et al., 2021). On Jetson Orin, we replay the stream for 30 seconds using EfficientNet B7 and measure the SLO violation rate during the replay. Since each burst lasts only a few seconds, this suffices to capture many bursts in the workload; we find that running the experiment for longer durations produces similar results. We set the fine-tuning batch size to 64, the number of fine-tuning iterations to 10, the SLO to 0.7s, the output dimension to 1000, and the average arrival rate to 8 inference requests per second. On Jetson TX2, we run the same experiment on EfficientNet B4. Due to memory constraints, we set the fine-tuning batch size to 8, the SLO to 1s, the output dimension to 100, and the average arrival rate to 4 inference requests per second. We select a smaller model on TX2 to meet a reasonable SLO target (under 1s). The number of fine-tuning iterations is chosen based on the duration of the replay. We then measure the energy costs when deploying PolyThrottle at the default and optimal hardware frequencies, respectively, to quantify how much energy we save during this period. The optimal hardware frequency is obtained from the results in Section 7.2.
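The synthetic streams are straightforward to reproduce; the sketch below generates Uniform and Poisson arrival timestamps at a given average rate (8 requests per second on Orin, 4 on TX2 in our setup), while the bursty workload is replayed directly from the Twitter trace.

```python
import numpy as np

def uniform_arrivals(rate_per_s: float, duration_s: float) -> np.ndarray:
    """Evenly spaced arrival timestamps at the given average rate."""
    n = int(rate_per_s * duration_s)
    return np.linspace(0.0, duration_s, n, endpoint=False)

def poisson_arrivals(rate_per_s: float, duration_s: float, seed: int = 0) -> np.ndarray:
    """Poisson-process arrivals: exponential inter-arrival times at the given rate."""
    rng = np.random.default_rng(seed)
    gaps = rng.exponential(1.0 / rate_per_s, size=int(rate_per_s * duration_s * 2))
    times = np.cumsum(gaps)
    return times[times < duration_s]

# Example: the 30-second replay on Jetson Orin at 8 requests per second.
orin_uniform = uniform_arrivals(8, 30)
orin_poisson = poisson_arrivals(8, 30)
```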
For greedy scheduling, we employ a standard drop policy (Crankshaw et al., 2017; Shen et al., 2019), whereby a request is dropped if it has already exceeded its deadline. In the adaptive setting, we use the predictor to determine whether to drop an inference request. We also replay the inference request stream without fine-tuning requests to serve as a baseline.
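The sketch below illustrates the distinction between the two policies. Here predict_latency stands in for PolyThrottle's performance predictor, the request and deadline bookkeeping is simplified, and the actual scheduler logic is more involved.

```python
import time

def greedy_policy(queue, run_batch, max_batch):
    """Greedy scheduling with the standard drop policy: requests that have
    already missed their deadline are dropped, everything else is batched."""
    now = time.time()
    live = [r for r in queue if r["deadline"] > now]
    if live:
        run_batch(live[:max_batch])

def adaptive_policy(queue, run_batch, predict_latency, max_batch):
    """Adaptive scheduling: consult the performance predictor and shrink the
    batch until its predicted completion time meets every deadline, dropping
    requests that cannot be served in time even on their own."""
    now = time.time()
    live = [r for r in queue if r["deadline"] > now + predict_latency(1)]
    batch_size = min(max_batch, len(live))
    while batch_size > 0:
        batch = live[:batch_size]
        if now + predict_latency(batch_size) <= min(r["deadline"] for r in batch):
            run_batch(batch)
            return
        batch_size -= 1  # trade throughput for latency
```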
Results: Tables 5 and 6 show the SLO violation rates under various workloads and latency targets. The findings indicate that greedy scheduling may lead to significant SLO violations owing to the interference introduced by the fine-tuning workload. In contrast, PolyThrottle’s adaptive scheduling mechanism achieves low SLO violation rates by dynamically adjusting configurations. The baseline figures in the tables represent SLO violation rates in the absence of interference from fine-tuning requests.
Inherent variance in neural network inference results in a 1% SLO violation rate in the case of the Uniform distribution. However, bursts in the Poisson distribution and the Twitter workload generate more SLO violations. PolyThrottle’s adaptive scheduling mechanism significantly reduces the SLO violation rate, meeting the SLO requirements while concurrently handling fine-tuning requests. Nevertheless, in several instances, we were unable to achieve near-zero SLO violation rates. This limitation can be attributed to the granularity of scheduling: the current batch of requests is processed over an extended timespan due to interference from the fine-tuning workload.
We also reduce energy consumption by 14% on EfficientNet B7 on Jetson Orin and by 23% on EfficientNet B4 on Jetson TX2 across the workloads. We show in Appendix D how PolyThrottle reacts to changing SLOs when there are outstanding fine-tuning requests.