TurboSparse Inference: 4.6x Faster LLM Decoding via Hybrid GPU-CPU Computing

by Language Models (dot tech)
March 4th, 2026

About Author

Language Models (dot tech)

Large Language Models (LLMs) ushered in a technological revolution. We break down how the most important models work.
