TurboSparse Inference: 4.6x Faster LLM Decoding via Hybrid GPU-CPU Computing

Written by languagemodels | Published 2026/03/04
Tech Story Tags: language-models | pure-cpu-inference | hybrid-gpu-cpu-computing | decoding-speed-benchmarks | relufied-model-acceleration | llama.cpp-comparison | gpu-optimization | dram-constrained-evaluation

TL;DR: Accelerate LLM inference with TurboSparse. It achieves up to a 2.28× speedup on pure CPU and up to 4.64× in hybrid GPU-CPU environments compared to llama.cpp baselines.

Abstract and 1. Introduction

  2. Related Work and Background

  3. Analysis

    3.1 Limitations of Existing ReLUfication

    3.2 dReLU

  4. Are Neurons in Experts Still Sparsely Activated?

  5. dReLU Sparsification

  6. Experiment Results

    6.1 Downstream Tasks Performance

    6.2 Sparsity of Sparsified Models

  7. Practical Inference Speedup Evaluation

    7.1 Experiment Settings

    7.2 Pure CPU Inference and 7.3 Hybrid GPU-CPU Inference

    7.4 Deploying LLMs on Mobile Phones

  8. Conclusion and References

A. Appendix / supplemental material

B. Limitation

C. Broader Impact

7.2 Pure CPU Inference

In this subsection, we evaluate our models using only the CPU for inference. This setting reflects DRAM-constrained deployments, so our evaluation is limited to CPU performance. Table 7 presents the decoding speeds achieved with CPU-only processing for different models and settings.

The table compares decoding speeds (in tokens per second) across models and settings under CPU-only inference. Overall, our ReLUfied models achieve a 2.08-2.28× speedup over the original models.
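The paper reports these numbers as raw throughput and relative speedup. As a rough illustration of how such benchmarks are typically computed, here is a minimal sketch; the `generate_fn` callable and the sample figures in the comments are hypothetical, not taken from Table 7:

```python
import time

def measure_decode_speed(generate_fn, prompt, n_tokens):
    """Measure decoding throughput in tokens per second.

    `generate_fn` is a hypothetical callable that autoregressively
    decodes `n_tokens` tokens from `prompt` (e.g. a wrapper around
    a llama.cpp-style inference loop).
    """
    start = time.perf_counter()
    generate_fn(prompt, n_tokens)
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

def speedup(sparse_tps, baseline_tps):
    """Relative speedup: sparse-model throughput over baseline throughput."""
    return sparse_tps / baseline_tps

# Illustrative only: 15.2 tok/s vs. 6.7 tok/s gives roughly a 2.27x speedup.
print(round(speedup(15.2, 6.7), 2))
```

In practice, decoding benchmarks usually discard the prompt-processing (prefill) phase and time only token-by-token generation, since the two phases have very different compute profiles.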

7.3 Hybrid GPU-CPU Inference

In this subsection, we shift our focus to evaluating our models in a hybrid GPU-CPU computing environment, considering that most PCs are equipped with consumer-grade GPUs. Table 8 presents the decoding speed results achieved with hybrid GPU-CPU computing for different models and settings.

The table compares decoding speeds (in tokens per second) across models and settings that combine GPU and CPU for inference. Overall, our models achieve speedups of 2.52-4.64× over the llama.cpp baseline.
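A common strategy in hybrid GPU-CPU sparse inference engines (PowerInfer-style systems, for example) is to place frequently activated "hot" neurons in limited GPU memory and leave rarely activated "cold" neurons in CPU DRAM. The sketch below shows a minimal greedy version of such a placement policy; the input format (a neuron-to-frequency map from offline profiling) and the per-neuron size parameter are assumptions for illustration, not the paper's implementation:

```python
def plan_neuron_placement(activation_freq, neuron_bytes, gpu_budget_bytes):
    """Greedy hot/cold neuron placement for hybrid GPU-CPU inference.

    activation_freq: dict mapping neuron_id -> observed activation
    frequency (a hypothetical offline profiling statistic).
    neuron_bytes: memory footprint of one neuron's weights.
    gpu_budget_bytes: GPU memory available for neuron weights.

    Returns (gpu_neurons, cpu_neurons): the hottest neurons that fit
    in the GPU budget, and the remainder kept in CPU DRAM.
    """
    gpu, cpu = [], []
    used = 0
    # Visit neurons from most to least frequently activated.
    for nid, _freq in sorted(activation_freq.items(), key=lambda kv: -kv[1]):
        if used + neuron_bytes <= gpu_budget_bytes:
            gpu.append(nid)
            used += neuron_bytes
        else:
            cpu.append(nid)
    return gpu, cpu

# Toy example: three neurons, budget for two on the GPU.
hot, cold = plan_neuron_placement({0: 0.9, 1: 0.1, 2: 0.5},
                                  neuron_bytes=4, gpu_budget_bytes=8)
print(hot, cold)  # the two hottest neurons land on the GPU
```

The payoff of high activation sparsity is that most decoding steps touch only the hot set, so the slow CPU path is exercised rarely; this is what makes the large hybrid speedups over a dense llama.cpp baseline possible.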

Authors:

(1) Yixin Song, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(2) Haotong Xie, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(3) Zhengyan Zhang, Department of Computer Science and Technology, Tsinghua University;

(4) Bo Wen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(5) Li Ma, Shanghai Artificial Intelligence Laboratory;

(6) Zeyu Mi, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University ([email protected]);

(7) Haibo Chen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University.


This paper is available on arxiv under CC BY 4.0 license.

