Table of Links
- Background and Motivation
- PowerInfer-2 Overview
- Neuron-Aware Runtime Inference
- Execution Plan Generation
- Implementation
- Evaluation
- Related Work
- Conclusion and References
6 Implementation
PowerInfer-2 is developed on top of PowerInfer [30], a state-of-the-art serving framework designed for sparsely-activated LLMs, by integrating an additional 12K lines of C++ code. These enhancements cover several key areas, including the polymorphic neuron engine, the neuron cache, flexible neuron loading, and the neuron-cluster-level I/O pipeline.
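The paper does not include code for these subsystems, so the following C++ sketch is only a rough illustration of one plausible way they could fit together. Every type and member name below (NeuronCache, NeuronLoader, NeuronEngine, pipeline_step, and so on) is hypothetical, not PowerInfer-2's actual interface, and asynchronous flash I/O is reduced to a synchronous stub.

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// All names below are hypothetical; PowerInfer-2's real interfaces are not public.
using NeuronId = uint32_t;

// Neuron cache: keeps hot neuron weights resident in memory.
class NeuronCache {
public:
    const std::vector<float>* lookup(NeuronId id) const {
        auto it = cache_.find(id);
        return it == cache_.end() ? nullptr : &it->second;
    }
    void insert(NeuronId id, std::vector<float> weights) {
        cache_[id] = std::move(weights);
    }
private:
    std::unordered_map<NeuronId, std::vector<float>> cache_;
};

// Flexible neuron loading: fetches neurons missing from the cache. Here a
// synchronous stub stands in for asynchronous reads from flash storage.
class NeuronLoader {
public:
    void load_cluster(const std::vector<NeuronId>& cluster, NeuronCache& cache) {
        for (NeuronId id : cluster)
            if (!cache.lookup(id))
                cache.insert(id, std::vector<float>(128, 0.0f));  // placeholder weights
    }
};

// Polymorphic neuron engine: in the real system this would pick a compute
// path (e.g., CPU cores vs. NPU) suited to the current inference phase;
// reduced here to a placeholder traversal.
class NeuronEngine {
public:
    void compute(const std::vector<NeuronId>& activated, const NeuronCache& cache) {
        for (NeuronId id : activated) (void)cache.lookup(id);
    }
};

// Neuron-cluster-level I/O pipeline: overlap loading of the next cluster with
// computation on the current one, so storage latency hides behind compute.
void pipeline_step(NeuronEngine& engine, NeuronCache& cache, NeuronLoader& loader,
                   const std::vector<NeuronId>& compute_now,
                   const std::vector<NeuronId>& needed_next) {
    loader.load_cluster(needed_next, cache);  // in the real system: async I/O
    engine.compute(compute_now, cache);
}
```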
Since PowerInfer-2 depends on privileged system APIs (e.g., mlock, which locks pages in memory) that require root permission, we built it on the Android [5] platform. Although no kernel modifications are needed, a rooted Android system still gives us considerable flexibility for developing and debugging. Furthermore, because PowerInfer-2 is designed to require no kernel changes, it is easily portable to other operating systems, including the iOS [14] platform.
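To make the mlock dependency concrete, here is a minimal sketch of pinning a buffer in physical memory via the POSIX mlock call; it shows illustrative use of the system call, not PowerInfer-2's actual cache code, and the buffer size is an arbitrary example. On Android (a Linux kernel), locking large regions typically fails without root or a raised RLIMIT_MEMLOCK.

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <cstdio>
#include <cstdlib>

int main() {
    // Illustrative size only; a real neuron cache would be far larger.
    const std::size_t kBytes = 16 * 1024 * 1024;
    void* buf = std::malloc(kBytes);
    if (!buf) return 1;

    // Pin the buffer so the kernel cannot page it out. On Android this
    // usually needs root (or an increased RLIMIT_MEMLOCK limit).
    if (mlock(buf, kBytes) != 0) {
        std::perror("mlock");  // e.g., EPERM without root, ENOMEM over the limit
        std::free(buf);
        return 1;
    }

    // ... use the pinned region, e.g., as backing memory for cached neurons ...

    munlock(buf, kBytes);  // release the pin before freeing
    std::free(buf);
    return 0;
}
```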
The current implementation of PowerInfer-2 supports a diverse array of LLMs of varying sizes, including the Llama-2 family [27] (7B, 13B), TurboSparse-Mistral [31] (7B), and TurboSparse-Mixtral [31] (47B).
Authors:
(1) Zhenliang Xue, Co-first author from Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;
(2) Yixin Song, Co-first author from Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;
(3) Zeyu Mi, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University ([email protected]);
(4) Le Chen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;
(5) Yubin Xia, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;
(6) Haibo Chen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University.