Table of Links
- Background and Motivation
- PowerInfer-2 Overview
- Neuron-Aware Runtime Inference
- Execution Plan Generation
- Implementation
- Evaluation
- Related Work
- Conclusion and References
8 Related Work
Resource-Efficient LLM. Deploying LLMs on resource-restricted devices has become increasingly popular [37]. A representative framework is MLC-LLM [33], which enables native deployment of many large language models on mobile devices with GPU acceleration. However, it is limited to in-memory computation and fails when the model is too large to fit in memory. Other approaches, such as network pruning [15, 22], knowledge distillation [16], and quantization [8, 20], reduce model memory footprints. These techniques are orthogonal to PowerInfer-2 and can be combined with it to further improve the efficiency of deploying LLMs on mobile devices.
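To make the footprint reduction concrete, the sketch below (illustrative only, not taken from any of the cited systems) applies symmetric per-tensor int8 quantization to a weight matrix, shrinking it roughly 4x relative to float32 at the cost of a bounded rounding error:

```python
# Minimal sketch (not from the paper): symmetric per-tensor int8 weight
# quantization, one of the orthogonal footprint-reduction techniques cited.
import numpy as np

def quantize_int8(w: np.ndarray):
    """Map float32 weights to int8 values plus a single float scale."""
    scale = np.abs(w).max() / 127.0          # largest magnitude maps to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)
q, s = quantize_int8(w)
print(w.nbytes / q.nbytes)                   # ~4x smaller in memory
print(np.abs(w - dequantize(q, s)).max())    # bounded quantization error
```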
Speculative Decoding. Speculative decoding can also be used to accelerate inference [7, 12, 18]. This technique uses a smaller model (e.g., 1B parameters) to quickly generate multiple candidate tokens, which are then validated by a larger model (e.g., 13B parameters) in a single batch. Only tokens accepted by the larger model are shown to users. By verifying multiple tokens at a time, SpecInfer [23] reduces the number of decoding steps. In the offloading scenario, however, the heavy I/O of loading weights from flash storage becomes the bottleneck of speculative decoding, especially for MoE models, which must load all experts for a single batch and thus lose the benefit of sparse expert activation.
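The following sketch illustrates the draft-then-verify loop described above under greedy decoding; the `draft`/`target` callables and the `speculative_decode` helper are hypothetical stand-ins for illustration, not SpecInfer's actual interface:

```python
# Minimal sketch of greedy speculative decoding (hypothetical models):
# a small draft model proposes k tokens; the large target model checks
# them (in practice, in one batched forward pass) and keeps the longest
# accepted prefix, so several tokens can be emitted per decoding step.
from typing import Callable, List

Model = Callable[[List[int]], int]  # context -> greedy next token

def speculative_decode(draft: Model, target: Model,
                       prompt: List[int], k: int, n_new: int) -> List[int]:
    tokens = list(prompt)
    while len(tokens) < len(prompt) + n_new:
        # 1. Draft model cheaply proposes k candidate tokens.
        proposal, ctx = [], list(tokens)
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2. Target model verifies each proposed position; on the first
        #    mismatch, its own token replaces the rejected draft token,
        #    so every iteration makes progress.
        ctx = list(tokens)
        for t in proposal:
            t_target = target(ctx)
            if t_target != t:
                ctx.append(t_target)   # correction token from the target
                break
            ctx.append(t)              # draft token accepted
        tokens = ctx
    return tokens[:len(prompt) + n_new]
```

Under this scheme, a well-aligned draft model lets the target model emit up to k tokens per verification step instead of one; in the flash-offloading setting, however, each verification step still pays the full weight-loading cost, which is why the benefit erodes for MoE models.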
Authors:
(1) Zhenliang Xue, Co-first author from Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;
(2) Yixin Song, Co-first author from Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;
(3) Zeyu Mi, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University ([email protected]);
(4) Le Chen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;
(5) Yubin Xia, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;
(6) Haibo Chen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University.
This paper is available on arXiv.