Table of Links
- Background and Motivation
- PowerInfer-2 Overview
- Neuron-Aware Runtime Inference
- Execution Plan Generation
- Implementation
- Evaluation
- Related Work
- Conclusion and References
8 Related Work
Resource-Efficient LLM. Deploying LLMs on resource-restricted devices has become increasingly popular [37]. A representative framework is MLC-LLM [33], which enables native deployment of many large language models on mobile devices with GPU acceleration. However, it is limited to in-memory computation and fails when the model is too large to fit in memory. Other approaches, such as network pruning [15, 22], knowledge distillation [16], and quantization [8, 20], reduce model memory footprints. These techniques are orthogonal to PowerInfer-2 and can be combined with it to further improve the efficiency of deploying LLMs on mobile devices.
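To make the footprint reduction concrete, the sketch below (illustrative only, not taken from any of the cited systems) applies symmetric per-tensor int8 quantization to a weight matrix, shrinking it roughly 4x relative to float32 at the cost of a bounded rounding error:

```python
# Minimal sketch (not from the paper): symmetric per-tensor int8 weight
# quantization, one of the orthogonal footprint-reduction techniques cited.
import numpy as np

def quantize_int8(w: np.ndarray):
    """Map float32 weights to int8 values plus a single float scale."""
    scale = np.abs(w).max() / 127.0          # largest magnitude maps to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)
q, s = quantize_int8(w)
print(w.nbytes / q.nbytes)                   # ~4x smaller in memory
print(np.abs(w - dequantize(q, s)).max())    # bounded quantization error
```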
Speculative Decoding. Speculative decoding can also be used to accelerate inference [7, 12, 18]. This technique uses a smaller model (e.g., 1B parameters) to quickly generate multiple candidate tokens, which are then validated by a larger model (e.g., 13B parameters) in a single batch. Only tokens accepted by the larger model are shown to users. By verifying multiple tokens at a time, SpecInfer [23] reduces the number of decoding steps. In the offloading scenario, however, the heavy I/O of loading weights from flash storage becomes the bottleneck of speculative decoding, especially for MoE models, which must load all experts for a single batch and thus lose the benefit of sparse expert activation.
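The following sketch illustrates the draft-then-verify loop described above under greedy decoding; the `draft`/`target` callables and the `speculative_decode` helper are hypothetical stand-ins for illustration, not SpecInfer's actual interface:

```python
# Minimal sketch of greedy speculative decoding (hypothetical models):
# a small draft model proposes k tokens; the large target model checks
# them (in practice, in one batched forward pass) and keeps the longest
# accepted prefix, so several tokens can be emitted per decoding step.
from typing import Callable, List

Model = Callable[[List[int]], int]  # context -> greedy next token

def speculative_decode(draft: Model, target: Model,
                       prompt: List[int], k: int, n_new: int) -> List[int]:
    tokens = list(prompt)
    while len(tokens) < len(prompt) + n_new:
        # 1. Draft model cheaply proposes k candidate tokens.
        proposal, ctx = [], list(tokens)
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2. Target model verifies each proposed position; on the first
        #    mismatch, its own token replaces the rejected draft token,
        #    so every iteration makes progress.
        ctx = list(tokens)
        for t in proposal:
            t_target = target(ctx)
            if t_target != t:
                ctx.append(t_target)   # correction token from the target
                break
            ctx.append(t)              # draft token accepted
        tokens = ctx
    return tokens[:len(prompt) + n_new]
```

Under this scheme, a well-aligned draft model lets the target model emit up to k tokens per verification step instead of one; in the flash-offloading setting, however, each verification step still pays the full weight-loading cost, which is why the benefit erodes for MoE models.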
Authors:
(1) Zhenliang Xue, Co-first author from Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;
(2) Yixin Song, Co-first author from Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;
(3) Zeyu Mi, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University ([email protected]);
(4) Le Chen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;
(5) Yubin Xia, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;
(6) Haibo Chen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University.
This paper is available on arXiv.