
PowerInfer-2: Fast Large Language Model Inference on a Smartphone: Related Work


Table of Links

Abstract and 1. Introduction

  1. Background and Motivation
  2. PowerInfer-2 Overview
  3. Neuron-Aware Runtime Inference
  4. Execution Plan Generation
  5. Implementation
  6. Evaluation
  7. Related Work
  8. Conclusion and References

8 Related Work

Resource-Efficient LLM. Deploying LLMs on resource-restricted devices has become increasingly popular [37]. A representative framework is MLC-LLM [33], which enables native deployment of many large language models on mobile devices with GPU acceleration. However, it is limited to in-memory computation scenarios and fails to run when the model is too large to fit in memory. Other approaches, such as network pruning [15, 22], knowledge distillation [16], and quantization [8, 20], reduce model memory footprints. These approaches are orthogonal to PowerInfer-2 and can be combined with it to further improve the efficiency of deploying LLMs on mobile devices.
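To make the footprint-reduction idea concrete, the sketch below shows generic per-tensor symmetric int8 quantization, one of the techniques the paragraph cites. This is an illustrative assumption about how such schemes typically work, not the specific method of references [8, 20] or of PowerInfer-2.

```python
# Sketch: symmetric per-tensor int8 quantization.
# Each float weight is mapped to an integer in [-127, 127] using a
# single scale factor, shrinking storage roughly 4x versus float32.

def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Quantize float weights to int8 codes plus a per-tensor scale."""
    scale = max(abs(w) for w in weights) / 127.0
    codes = [round(w / scale) for w in weights]
    return codes, scale

def dequantize(codes: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from int8 codes."""
    return [c * scale for c in codes]

weights = [0.5, -1.27, 0.02]
codes, scale = quantize_int8(weights)
approx = dequantize(codes, scale)
print(codes)
print(approx)
```

The reconstruction error is bounded by half the scale per weight, which is why quantization trades a small accuracy loss for a large memory saving.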


Speculative Decoding. Speculative decoding can also be utilized to enhance inference speed [7, 12, 18]. This technique uses a smaller model (e.g., 1B parameters) to quickly generate multiple candidate tokens, then validates them with a larger model (e.g., 13B parameters) in a single batch. Only tokens accepted by the larger model are displayed to users. By verifying multiple tokens at a time, SpecInfer [23] reduces the number of decoding steps. In the offloading scenario, however, the large amount of I/O from flash storage becomes the bottleneck of speculative decoding, especially for MoE models, which must load all experts for one batch, negating the benefit of sparse expert activation.
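The draft-then-verify loop described above can be sketched as follows. The two "models" here are stand-in Python functions (toy next-token rules invented for illustration), not real LLMs; the acceptance logic mirrors the greedy variant of speculative decoding, where drafts are kept only while the target model agrees.

```python
# Sketch: one step of (greedy) speculative decoding with toy models.

def draft_next(token: int) -> int:
    # Small, fast draft model (toy rule, assumption for illustration).
    return (token * 3 + 1) % 10

def target_next(token: int) -> int:
    # Large, authoritative target model (toy rule, assumption).
    return (token * 3 + 1) % 10 if token % 2 == 0 else (token + 7) % 10

def speculative_step(prefix: list[int], k: int = 4) -> list[int]:
    """Draft k candidate tokens, then verify them with the target model."""
    # 1. Draft phase: the small model proposes k tokens autoregressively.
    drafts, cur = [], prefix[-1]
    for _ in range(k):
        cur = draft_next(cur)
        drafts.append(cur)

    # 2. Verification phase: the target model checks all drafts at once.
    #    Drafts are accepted while the target agrees; on the first
    #    mismatch the target's own token is emitted and the rest dropped.
    accepted, cur = [], prefix[-1]
    for d in drafts:
        t = target_next(cur)
        if t == d:
            accepted.append(d)
            cur = d
        else:
            accepted.append(t)
            break
    return prefix + accepted

print(speculative_step([2]))
```

When the draft model agrees often, one verification pass emits several tokens, cutting the number of expensive target-model decoding steps. The offloading caveat in the paragraph follows directly: for an MoE target, verifying a batch of drafts can touch many experts at once, so flash I/O dominates.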


Authors:

(1) Zhenliang Xue, Co-first author from Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(2) Yixin Song, Co-first author from Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(3) Zeyu Mi, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University ([email protected]);

(4) Le Chen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(5) Yubin Xia, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(6) Haibo Chen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University.


This paper is available on arxiv under CC BY 4.0 license.

