This paper is available on arxiv under CC BY-NC-ND 4.0 DEED license.
Authors:
(1) Minghao Yan, University of Wisconsin-Madison;
(2) Hongyi Wang, Carnegie Mellon University;
(3) Shivaram Venkataraman, University of Wisconsin-Madison.
In this work, we examine the unique characteristics of energy consumption in neural network inference, especially on edge devices. We identify tradeoffs between energy consumption and inference latency SLOs, and empirically demonstrate that overlooked hardware components play a hidden role in energy optimization. Building on these observations, we propose PolyThrottle, an optimization framework that automatically and holistically tunes these hardware components to find configurations on the energy-latency Pareto frontier, and we empirically verify its effectiveness and efficiency. PolyThrottle also accommodates on-device fine-tuning: a simple performance prediction model adaptively schedules fine-tuning requests while keeping the online inference workload within its latency SLO whenever possible. We hope our study sheds more light on this hidden dimension of neural network energy optimization.
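To make the tuning problem concrete, the sketch below enumerates a small grid of hypothetical hardware configurations (GPU frequency, memory frequency, and batch size), keeps only the configurations on the energy-latency Pareto frontier, and selects the lowest-energy one that satisfies a latency SLO. The knob values, the measure() stand-in, and the 50 ms SLO are illustrative assumptions, not PolyThrottle's actual implementation, which relies on on-device measurements and a more sample-efficient search than the exhaustive enumeration shown here.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical knob grids and SLO; real values are device- and model-specific.
GPU_FREQS_MHZ = [420, 630, 900, 1300]
MEM_FREQS_MHZ = [800, 1600, 2133]
BATCH_SIZES = [1, 2, 4, 8]
LATENCY_SLO_MS = 50.0

@dataclass(frozen=True)
class Config:
    gpu_freq: int   # MHz
    mem_freq: int   # MHz
    batch: int

def measure(cfg):
    """Stand-in for an on-device measurement returning (energy_mJ, latency_ms)."""
    # Toy model: lower frequencies save energy but lengthen each inference.
    latency = 2000.0 * cfg.batch / (cfg.gpu_freq + 0.5 * cfg.mem_freq)
    energy = 0.02 * cfg.gpu_freq + 0.01 * cfg.mem_freq + 0.3 * latency / cfg.batch
    return energy, latency

def pareto_frontier(points):
    """Keep (config, (energy, latency)) pairs not dominated on both metrics."""
    return [
        (cfg, (e, l)) for cfg, (e, l) in points
        if not any(e2 <= e and l2 <= l and (e2, l2) != (e, l)
                   for _, (e2, l2) in points)
    ]

candidates = [Config(g, m, b)
              for g, m, b in product(GPU_FREQS_MHZ, MEM_FREQS_MHZ, BATCH_SIZES)]
measured = [(cfg, measure(cfg)) for cfg in candidates]
frontier = pareto_frontier(measured)

# Among Pareto-optimal configurations, pick the lowest-energy one under the SLO.
feasible = [(cfg, (e, l)) for cfg, (e, l) in frontier if l <= LATENCY_SLO_MS]
best = min(feasible, key=lambda item: item[1][0], default=None)
print(best)
```

In practice each evaluation of measure() requires running inference on the device, which is why a sample-efficient search over this configuration space matters.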
References

Jetson partner solutions ebook. URL https://resources.nvidia.com/en-us-jetson-success-stories/jetson-partner-solutions-ebook?lx=XRDs_y.
Twitter streaming traces, 2018. URL https://archive.org/details/archiveteam-twitter-stream-201804.
Alipourfard, O., Liu, H. H., Chen, J., Venkataraman, S., Yu, M., and Zhang, M. Cherrypick: Adaptively unearthing the best cloud configurations for big data analytics. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), 2017.
Anderson, T., Belay, A., Chowdhury, M., Cidon, A., and Zhang, I. Treehouse: A case for carbon-aware datacenter software. 2022.
Anthony, L. F. W., Kanding, B., and Selvan, R. Carbontracker: Tracking and predicting the carbon footprint of training deep learning models. 2020.
Arafa, Y., ElWazir, A., ElKanishy, A., Aly, Y., Elsayed, A., Badawy, A., Chennupati, G., Eidenbenz, S., and Santhi, N. Verified instruction-level energy consumption measurement for nvidia gpus. 2020.
Bai, Z., Zhang, Z., Zhu, Y., and Jin, X. Pipeswitch: Fast pipelined context switching for deep learning applications. In Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation, pp. 499–514, 2020.
Banner, R., Hubara, I., Hoffer, E., and Soudry, D. Scalable methods for 8-bit training of neural networks. Advances in neural information processing systems, 31, 2018.
Brochu, E., Cora, V. M., and De Freitas, N. A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599, 2010.
Cai, H., Wang, T., Wu, Z., Wang, K., Lin, J., and Han, S. On-device image classification with proxyless neural architecture search and quantization-aware fine-tuning. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pp. 0–0, 2019.
Cai, H., Gan, C., Zhu, L., and Han, S. Tinytl: Reduce memory, not parameters for efficient on-device learning. Advances in Neural Information Processing Systems, 33: 11285–11297, 2020.
Cao, Q., Balasubramanian, A., and Balasubramanian, N. Towards accurate and reliable energy measurement of nlp models. In Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing, 2020.
Censor, Y. Pareto optimality in multiobjective problems. Applied Mathematics and Optimization, 4(1):41–59, 1977.
Courbariaux, M., Bengio, Y., and David, J.-P. Training deep neural networks with low precision multiplications. arXiv preprint arXiv:1412.7024, 2014.
Courbariaux, M., Bengio, Y., and David, J.-P. Binaryconnect: Training deep neural networks with binary weights during propagations. Advances in neural information processing systems, 28, 2015.
Crankshaw, D., Wang, X., Zhou, G., Franklin, M. J., Gonzalez, J. E., and Stoica, I. Clipper: A low latency online prediction serving system. In NSDI, volume 17, pp. 613–627, 2017.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
Gardner, J. R., Kusner, M. J., Xu, Z. E., Weinberger, K. Q., and Cunningham, J. P. Bayesian optimization with inequality constraints. In ICML, volume 2014, pp. 937–945, 2014.
Gholami, A., Kim, S., Dong, Z., Yao, Z., Mahoney, M. W., and Keutzer, K. A survey of quantization methods for efficient neural network inference. arXiv preprint arXiv:2103.13630, 2021.
Gog, I., Kalra, S., Schafhalter, P., Gonzalez, J. E., and Stoica, I. D3: a dynamic deadline-driven approach for building autonomous vehicles. In Proceedings of the Seventeenth European Conference on Computer Systems, pp. 453–471, 2022.
Gu, D., Xie, X., Huang, G., Jin, X., and Liu, X. Energyefficient gpu clusters scheduling for deep learning. arXiv preprint arXiv:2304.06381, 2023.
Gujarati, A., Karimi, R., Alzayat, S., Hao, W., Kaufmann, A., Vigfusson, Y., and Mace, J. Serving DNNs like clockwork: Performance predictability from the bottom up. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pp. 443–462, 2020.
Gupta, U., Kim, Y. G., Lee, S., Tse, J., Lee, H.-H. S., Wei, G.-Y., Brooks, D., and Wu, C.-J. Chasing carbon: The elusive environmental footprint of computing. IEEE Micro, 42(4):37–47, 2022.
He, C., Li, S., So, J., Zeng, X., Zhang, M., Wang, H., Wang, X., Vepakomma, P., Singh, A., Qiu, H., et al. Fedml: A research library and benchmark for federated machine learning. arXiv preprint arXiv:2007.13518, 2020.
Hodak, M., Gorkovenko, M., and Dholakia, A. Towards power efficiency in deep learning on data center hardware. In IEEE International Conference on Big Data, 2019.
Hong, S. and Kim, H. An integrated gpu power and performance model. In ISCA, 2010.
Howard, A., Sandler, M., Chu, G., Chen, L.-C., Chen, B., Tan, M., Wang, W., Zhu, Y., Pang, R., Vasudevan, V., et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1314–1324, 2019.
Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
Ivanov, A., Dryden, N., Ben-Nun, T., Li, S., and Hoefler, T. Data movement is all you need: A case study on optimizing transformers. Proceedings of Machine Learning and Systems, 3:711–732, 2021.
Kandasamy, K., Vysyaraju, K. R., Neiswanger, W., Paria, B., Collins, C. R., Schneider, J., Poczos, B., and Xing, E. P. Tuning hyperparameters without grad students: Scalable and robust bayesian optimisation with dragonfly. The Journal of Machine Learning Research, 21(1):3098–3124, 2020.
Kandiah, V., Peverelle, S., Khairy, M., Pan, J., Manjunath, A., Rogers, T. G., Aamodt, T. M., and Hardavellas, N. Accelwattch: A power modeling framework for modern gpus. In MICRO, 2021.
Kang, D.-K., Lee, K.-B., and Kim, Y.-C. Cost efficient gpu cluster management for training and inference of deep learning. Energies, 15(2):474, 2022.
Kim, S., Gholami, A., Yao, Z., Mahoney, M. W., and Keutzer, K. I-bert: Integer-only bert quantization. In International Conference on Machine Learning, pp. 5506–5518. PMLR, 2021.
Klein, A., Bartels, S., Falkner, S., Hennig, P., and Hutter, F. Towards efficient bayesian optimization for big data. In NIPS 2015 Bayesian Optimization Workshop, 2015.
Komoda, T., Hayashi, S., Nakada, T., Miwa, S., and Nakamura, H. Power capping of cpu-gpu heterogeneous systems through coordinating dvfs and task mapping. In 2013 IEEE 31st International Conference on Computer Design (ICCD). IEEE, 2013.
Lacoste, A., Luccioni, A., Schmidt, V., and Dandres, T. Quantifying the carbon emissions of machine learning. arXiv preprint arXiv:1910.09700, 2019.
Lane, N. D. and Georgiev, P. Can deep learning revolutionize mobile sensing? In Proceedings of the 16th international workshop on mobile computing systems and applications, pp. 117–122, 2015.
Lee, J., Chirkov, N., Ignasheva, E., Pisarchyk, Y., Shieh, M., Riccardi, F., Sarokin, R., Kulik, A., and Grundmann, M. On-device neural net inference with mobile gpus. arXiv preprint arXiv:1907.01989, 2019.
Lowe-Power, J., Ahmad, A. M., Akram, A., Alian, M., Amslinger, R., Andreozzi, M., Armejach, A., Asmussen, N., Beckmann, B., Bharadwaj, S., et al. The gem5 simulator: Version 20.0+. arXiv preprint arXiv:2007.03152, 2020.
Mei, X., Wang, Q., and Chu, X. A survey and measurement study of gpu dvfs on energy conservation. Digital Communications and Networks, 3(2):89–100, 2017.
Nabavinejad, S. M., Reda, S., and Ebrahimi, M. Batchsizer: Power-performance tradeoff for dnn inference. In Proceedings of the 26th Asia and South Pacific Design Automation Conference, 2021.
NVIDIA. Stream management, 2023a. URL https://docs.nvidia.com/cuda/cuda-runtime-api.
NVIDIA. Nvidia multi-instance gpu, 2023b. URL https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html.
Peng, Y., Zhu, Y., Chen, Y., Bao, Y., Yi, B., Lan, C., Wu, C., and Guo, C. A generic communication scheduler for distributed dnn training acceleration. In SOSP, 2019.
Qiao, A., Choe, S. K., Subramanya, S. J., Neiswanger, W., Ho, Q., Zhang, H., Ganger, G. R., and Xing, E. P. Pollux: Coadaptive cluster scheduling for goodput-optimized deep learning. In OSDI, 2021.
Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. Squad: 100,000+ questions for machine comprehension of text. In EMNLP, 2016.
Romero, F., Li, Q., Yadwadkar, N. J., and Kozyrakis, C. INFaaS: Automated model-less inference serving. In 2021 USENIX Annual Technical Conference (USENIX ATC 21), pp. 397–411, 2021.
Sanh, V., Debut, L., Chaumond, J., and Wolf, T. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.
Schwartz, R., Dodge, J., Smith, N. A., and Etzioni, O. Green ai. Commun. ACM, 63(12):54–63, 2020.
Shen, H., Chen, L., Jin, Y., Zhao, L., Kong, B., Philipose, M., Krishnamurthy, A., and Sundaram, R. Nexus: A gpu cluster engine for accelerating dnn-based video analysis. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pp. 322–337, 2019.
Snoek, J., Larochelle, H., and Adams, R. P. Practical bayesian optimization of machine learning algorithms. Advances in neural information processing systems, 25, 2012.
Strubell, E., Ganesh, A., and McCallum, A. Energy and policy considerations for deep learning in nlp. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.
Tambe, T., Hooper, C., Pentecost, L., Jia, T., Yang, E.-Y., Donato, M., Sanh, V., Whatmough, P., Rush, A. M., Brooks, D., et al. Edgebert: Sentence-level energy optimizations for latency-aware multi-task nlp inference. In MICRO, 2021.
Tan, M. and Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pp. 6105–6114. PMLR, 2019.
Tang, Z., Wang, Y., Wang, Q., and Chu, X. The impact of gpu dvfs on the energy and performance of deep learning: An empirical study. In Proceedings of the Tenth ACM International Conference on Future Energy Systems, 2019.
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and Jégou, H. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pp. 10347–10357. PMLR, 2021.
Venkataraman, S., Yang, Z., Franklin, M., Recht, B., and Stoica, I. Ernest: Efficient performance prediction for large-scale advanced analytics. In 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16), pp. 363–378, 2016.
Wan, C., Santriaji, M., Rogers, E., Hoffmann, H., Maire, M., and Lu, S. Alert: Accurate learning for energy and timeliness. In ATC, 2020.
Wang, F., Zhang, W., Lai, S., Hao, M., and Wang, Z. Dynamic gpu energy optimization for machine learning training workloads. IEEE Transactions on Parallel and Distributed Systems, 2021.
Wang, G., Venkataraman, S., Phanishayee, A., Devanur, N., Thelin, J., and Stoica, I. Blink: Fast and generic collectives for distributed ml. In Proceedings of Machine Learning and Systems, 2020a.
Wang, Y., Wang, Q., Shi, S., He, X., Tang, Z., Zhao, K., and Chu, X. Benchmarking the performance and energy efficiency of ai accelerators for ai training. In 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID), 2020b.
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Scao, T. L., Gugger, S., Drame, M., Lhoest, Q., and Rush, A. Transformers: State-of-the-art natural language processing. In EMNLP, 2020.
Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., Gschwind, M., Gupta, A., Ott, M., Melnikov, A., Candido, S., Brooks, D., Chauhan, G., Lee, B., Lee, H.-H., Akyildiz, B., Balandat, M., Spisak, J., Jain, R., Rabbat, M., and Hazelwood, K. Sustainable ai: Environmental implications, challenges and opportunities. In Proceedings of Machine Learning and Systems, 2022.
Wu, X., Rao, J., Chen, W., Huang, H., Ding, C., and Huang, H. Switchflow: preemptive multitasking for deep learning. In Proceedings of the 22nd International Middleware Conference, 2021.
Xu, M., Liu, J., Liu, Y., Lin, F. X., Liu, Y., and Liu, X. A first look at deep learning apps on smartphones. In The World Wide Web Conference, WWW ’19, pp. 2125–2136, 2019.
You, J., Chung, J.-W., and Chowdhury, M. Zeus: Understanding and optimizing gpu energy consumption of dnn training. arXiv preprint arXiv:2208.06102, 2022.
Yu, P. and Chowdhury, M. Fine-grained gpu sharing primitives for deep learning applications. Proceedings of Machine Learning and Systems, 2:98–111, 2020.
Zhao, Y., Liu, X., Liu, S., Li, X., Zhu, Y., Huang, G., Liu, X., and Jin, X. Muxflow: Efficient and safe gpu sharing in large-scale production deep learning clusters. arXiv preprint arXiv:2303.13803, 2023.