TurboSparse: Elite Inference Speed via dReLU Sparsity

Written by languagemodels | Published 2026/03/03
Tech Story Tags: language-models | turbosparse-inference | drelu-activation | moe-relufication | neuron-predictors | mobile-llm-speed | intrinsic-sparsity | parameter-activation

TL;DR: TurboSparse achieves 2-5x faster LLM decoding on an RTX 4090 and on mobile devices, reaching up to 97% parameter sparsity without performance loss.

Abstract and 1. Introduction

  2. Related Work and Background

  3. Analysis

    3.1 Limitations of Existing ReLUfication

    3.2 dReLU

  4. Are Neurons in Expert Still Sparsely Activated?

  5. dReLU Sparsification

  6. Experiment Results

    6.1 Downstream Tasks Performance

    6.2 Sparsity of Sparsified Models

  7. Practical Inference Speedup Evaluation

    7.1 Experiment Setting

    7.2 Pure CPU Inference and 7.3 Hybrid GPU-CPU Inference

    7.4 Deploying LLMs on Mobile Phones

  8. Conclusion and References

A. Appendix / supplemental material

B. Limitation

C. Broader Impact
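As context for the dReLU entries in the outline above: per the paper's Section 3.2, dReLU applies ReLU to both the gate and up projections of a GLU-style feed-forward block, so a hidden neuron is exactly zero whenever either branch is negative. A minimal sketch follows; the weights, dimensions, and random inputs are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 16, 64

# Illustrative random weights for one GLU feed-forward block.
x = rng.standard_normal((1, d_model))
w_gate = rng.standard_normal((d_model, d_ff))
w_up = rng.standard_normal((d_model, d_ff))
w_down = rng.standard_normal((d_ff, d_model))

gate = np.maximum(x @ w_gate, 0.0)  # ReLU replaces SiLU on the gate branch
up = np.maximum(x @ w_up, 0.0)      # ReLU replaces identity on the up branch
h = gate * up                       # dReLU hidden activation
y = h @ w_down

# Rows of w_down whose input neuron is zero contribute nothing to y,
# so a sparse inference engine can skip loading and multiplying them.
sparsity = float(np.mean(h == 0.0))
print(f"hidden sparsity: {sparsity:.0%}")
```

With independent zero-mean inputs, each branch is zero about half the time, so the product is zero roughly three quarters of the time even before any training; the paper's sparsified models push this much further.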

7.1 Experiment Setting

Baselines. We take llama.cpp [20] as our baseline for comparison; it is the most representative inference framework.

Models. For PowerInfer and PowerInfer-2 [62], we deployed our sparsified models, while for llama.cpp, we employed the original models for speed comparison.

Hardware Configurations. All experiments were conducted on three distinct configurations:

• PC-Laptop: Intel i9-14900HX processor, 32GB host memory (67.2 GB/s bandwidth), an NVIDIA RTX 4090 GPU (16GB), and PCIe 4.0 interface (64GB/s bandwidth).

• PC-2080Ti: Intel i7-12700K processor (eight 4.9GHz cores), 64GB host memory (38.4 GB/s bandwidth), an NVIDIA RTX 2080Ti GPU (11GB), and PCIe 3.0 interface (32GB/s bandwidth).

• OnePlus-12: Equipped with a Snapdragon 8 Gen 3 SoC, 24 GB DRAM, and UFS 4.0 storage.
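Single-batch decoding is typically memory-bandwidth-bound, so the bandwidth figures above bound the attainable token rate: each token must stream the active weights from memory once. A hedged back-of-envelope sketch, where the model size, quantization width, and active parameter fraction are all illustrative assumptions rather than the paper's measurements:

```python
def decode_tokens_per_sec(params_billion: float,
                          bytes_per_param: float,
                          bandwidth_gb_s: float,
                          active_fraction: float = 1.0) -> float:
    """Roofline upper bound on decode speed when each token streams
    the active weights from memory exactly once."""
    bytes_per_token = params_billion * 1e9 * bytes_per_param * active_fraction
    return bandwidth_gb_s * 1e9 / bytes_per_token

# A dense 7B model quantized to ~4 bits (0.5 bytes/param) on the
# PC-Laptop's 67.2 GB/s host memory:
dense = decode_tokens_per_sec(7, 0.5, 67.2)

# The same model if only ~30% of parameters are activated per token
# (attention weights plus a sparse slice of the FFN -- an assumed figure):
sparse = decode_tokens_per_sec(7, 0.5, 67.2, active_fraction=0.3)

print(f"dense:  {dense:.1f} tok/s")   # ~19 tok/s upper bound
print(f"sparse: {sparse:.1f} tok/s")  # ~64 tok/s upper bound
```

The point of the sketch is only that shrinking the active parameter fraction raises the bandwidth-limited ceiling proportionally; real systems also pay for predictors, cache misses, and attention KV traffic.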

Authors:

(1) Yixin Song, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(2) Haotong Xie, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(3) Zhengyan Zhang, Department of Computer Science and Technology, Tsinghua University;

(4) Bo Wen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(5) Li Ma, Shanghai Artificial Intelligence Laboratory;

(6) Zeyu Mi, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University ([email protected]);

(7) Haibo Chen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University.


This paper is available on arxiv under CC BY 4.0 license.

