TurboSparse: Faster LLMs via dReLU Activation

Written by languagemodels | Published 2026/03/05
Tech Story Tags: llm | drelu-activation | fast-llm-inference | 97percent-moe-sparsity | turbosparse-mixtral-47b | hybrid-gpu-cpu-speed | powerinfer-integration | neuron-level-predictors

TL;DR: Boost LLM inference speeds by 2–5x with TurboSparse. Use dReLU to reach roughly 90% activation sparsity in Mistral and Mixtral models without losing performance.

Abstract and 1. Introduction

  2. Related Work and Background

  3. Analysis

    3.1 Limitations of Existing ReLUfication

    3.2 dReLU

  4. Are Neurons in Experts still Sparsely Activated?

  5. dReLU Sparsification

  6. Experiments Results

    6.1 Downstream Tasks Performance

    6.2 Sparsity of Sparsified Models

  7. Practical Inference Speedup Evaluation

    7.1 Experiments Setting

    7.2 Pure CPU Inference and 7.3 Hybrid GPU-CPU Inference

    7.4 Deploying LLMs on Mobile Phones

  8. Conclusion and References

A. Appendix / supplemental material

B. Limitation

C. Broader Impact

A Appendix / supplemental material

A.1 Training Details of 300M models

In this subsection, we describe the details of training the 300M models, including the model architecture, the types of data used, and the hyperparameters. The evaluation results of the final 300M models are shown in Table 10.

A.1.1 Architecture

We adopt a similar model architecture to Llama 2 [60] with the following details:

Activation Function and Intermediate Hidden Size. We focus on dReLU and SwiGLU [52] activation functions.

Multi-Head Attention. For the attention block, we adopt Llama-2-7B’s architecture, applying pre-normalization with RMSNorm [66] and using RoPE [57] for positional embeddings.
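The difference between the two activation functions can be sketched as follows. SwiGLU gates the up projection with a SiLU-activated gate projection, while dReLU applies ReLU to both the gate and up projections, so a neuron's output is zero whenever either branch is negative. The function names, weight shapes, and random inputs below are illustrative, not taken from the paper:

```python
import numpy as np

def swiglu(x, w_gate, w_up):
    """SwiGLU: SiLU(x @ W_gate) * (x @ W_up)  (Shazeer, 2020)."""
    g = x @ w_gate
    return (g / (1.0 + np.exp(-g))) * (x @ w_up)  # SiLU(g) elementwise-gates up

def drelu(x, w_gate, w_up):
    """dReLU: ReLU applied to BOTH the gate and the up projection."""
    return np.maximum(x @ w_gate, 0.0) * np.maximum(x @ w_up, 0.0)

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 32))       # (tokens, hidden)
w_gate = rng.standard_normal((32, 64))  # (hidden, intermediate)
w_up = rng.standard_normal((32, 64))

h = drelu(x, w_gate, w_up)
# Output is zero unless BOTH projections are positive, so for roughly
# symmetric pre-activations about 3/4 of the entries are exactly zero.
sparsity = (h == 0).mean()
```

Because a dReLU output is exactly zero whenever either branch is negative, the exact-zero rate is much higher than with SwiGLU, which is what makes neuron-level sparse inference practical.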

A.1.2 Training Hyperparameters

We use LLaMA-Factory [70] as our training framework. Our models are trained with the AdamW optimizer [38], using β1 = 0.9 and β2 = 0.95. We adopt a cosine learning rate schedule, set weight decay to 0.01, and apply gradient clipping (see Table 12 for more details).
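A cosine learning rate schedule like the one described above can be sketched in a few lines. The β and weight-decay values come from the text; the peak learning rate, step counts, and warmup are illustrative placeholders, not values reported in the paper:

```python
import math

# From the text: AdamW betas and weight decay.
BETA1, BETA2 = 0.9, 0.95
WEIGHT_DECAY = 0.01

def cosine_lr(step, total_steps, peak_lr, warmup_steps=0, min_lr=0.0):
    """Cosine learning-rate schedule with optional linear warmup.

    Rises linearly to peak_lr over warmup_steps, then decays along a
    half cosine from peak_lr down to min_lr at total_steps.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Illustrative usage: peak 3e-4 over 1000 steps, no warmup.
lr_start = cosine_lr(0, 1000, 3e-4)    # == peak
lr_mid = cosine_lr(500, 1000, 3e-4)    # half of peak
lr_end = cosine_lr(1000, 1000, 3e-4)   # decays to min_lr (0.0)
```

In a framework such as LLaMA-Factory this schedule would be selected via configuration rather than implemented by hand; the function above only shows the shape of the decay.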

A.2 Activation Distribution Analysis of MoE Models

Figure 7 shows the activation distribution of Mistral and Mixtral. The FFNs in the MoE model exhibit an activation distribution similar to that of the dense Mistral model.

A.3 Detailed Performance of ReLUfied Models

In this subsection, we present the detailed performance metrics of our ReLUfied models across various commonsense benchmarks, as shown in Table 13.

Authors:

(1) Yixin Song, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(2) Haotong Xie, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(3) Zhengyan Zhang, Department of Computer Science and Technology, Tsinghua University;

(4) Bo Wen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(5) Li Ma, Shanghai Artificial Intelligence Laboratory;

(6) Zeyu Mi, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University ([email protected]);

(7) Haibo Chen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University.


This paper is available on arxiv under CC BY 4.0 license.

