Table of Links
- Analysis
- Experiments Results
- Practical Inference Speedup Evaluation
- A. Appendix / supplemental material
7.1 Experiment Settings
Baselines. We take llama.cpp [20] as our baseline for comparison, as it is the most representative LLM inference framework.
Models. For PowerInfer and PowerInfer-2 [62], we deployed our sparsified models; for llama.cpp, we used the original models in the speed comparison.
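For context, decoding throughput on llama.cpp is commonly measured with its bundled llama-bench tool; the sketch below shows one way such a baseline run could be scripted. The model path, token counts, and thread count are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch of collecting a llama.cpp speed baseline with its bundled
# `llama-bench` tool. All parameter values here are hypothetical placeholders.
import subprocess

def run_llama_bench(model_path: str, threads: int = 8, gpu_layers: int = 0) -> str:
    """Run llama-bench once and return its textual report."""
    cmd = [
        "./llama-bench",
        "-m", model_path,          # GGUF model file to benchmark
        "-p", "512",               # prompt (prefill) length in tokens
        "-n", "128",               # tokens to generate (decode phase)
        "-t", str(threads),        # CPU threads
        "-ngl", str(gpu_layers),   # layers offloaded to the GPU
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

if __name__ == "__main__":
    # Hypothetical CPU-only run of a quantized 7B model.
    print(run_llama_bench("models/llama-7b.Q4_K_M.gguf", threads=8, gpu_layers=0))
```

llama-bench reports the prefill (-p) and decode (-n) phases separately, which matters here: the decode phase is the bandwidth-bound one that sparsification targets.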
Hardware Configurations. All experiments were conducted on three distinct configurations (a rough bandwidth-bound throughput sketch follows the list):
• PC-Laptop: Intel i9-14900HX processor, 32 GB host memory (67.2 GB/s bandwidth), an NVIDIA RTX 4090 GPU (16 GB), and a PCIe 4.0 interface (64 GB/s bandwidth).
• PC-2080Ti: Intel i7-12700K processor (eight 4.9 GHz cores), 64 GB host memory (38.4 GB/s bandwidth), an NVIDIA RTX 2080Ti GPU (11 GB), and a PCIe 3.0 interface (32 GB/s bandwidth).
• OnePlus-12: Equipped with a Snapdragon 8 Gen 3 SoC, 24 GB DRAM, and UFS 4.0 storage.
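These bandwidth figures directly bound decoding speed: each generated token must stream the active weights through the slowest link on its path, so tokens/s ≤ bandwidth / bytes read per token. The sketch below works this out for the PC configurations; the 4 GB-per-token working set is an illustrative assumption (roughly a 7B model at 4-bit quantization), not a measurement from the paper.

```python
# Back-of-the-envelope upper bound on decoding throughput, assuming decoding
# is bandwidth-bound: every generated token reads the active weights once,
# so tokens/s <= bandwidth / bytes_per_token. The 4 GB working set below is
# an illustrative assumption, not a number taken from the paper.

BYTES_PER_TOKEN = 4e9  # ~4 GB of weights touched per token (assumed)

configs = {
    "PC-Laptop host DRAM": 67.2e9,  # bytes/s
    "PC-Laptop PCIe 4.0":  64e9,
    "PC-2080Ti host DRAM": 38.4e9,
    "PC-2080Ti PCIe 3.0":  32e9,
}

for name, bandwidth in configs.items():
    print(f"{name}: <= {bandwidth / BYTES_PER_TOKEN:.1f} tokens/s")
```

Activation sparsity shrinks the effective bytes read per token, which is exactly the lever the sparsified models pull to raise these ceilings.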
Authors:
(1) Yixin Song, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;
(2) Haotong Xie, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;
(3) Zhengyan Zhang, Department of Computer Science and Technology, Tsinghua University;
(4) Bo Wen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;
(5) Li Ma, Shanghai Artificial Intelligence Laboratory;
(6) Zeyu Mi, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University ([email protected]);
(7) Haibo Chen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University.
