TurboSparse Efficiency: Achieving 97% Parameter Sparsity in Mixtral-47B

Written by languagemodels | Published 2026/03/03

TL;DR: Discover how TurboSparse-Mistral-7B and TurboSparse-Mixtral-47B leverage ReLUfication to reach up to 90% neuron inactivity, reducing active parameters to about 3% per MoE layer.

Abstract and 1. Introduction

  2. Related Work and Background

  3. Analysis

    3.1 Limitations about Existing ReLUfication

    3.2 dReLU

  4. Are Neurons in Expert still Sparsely Activated?

  5. dReLU Sparsification

  6. Experiments Results

    6.1 Downstream Tasks Performance

    6.2 Sparsity of Sparsified Models

  7. Practical Inference Speedup Evaluation

    7.1 Experiments Setting

    7.2 Pure CPU Inference and 7.3 Hybrid GPU-CPU Inference

    7.4 Deploy LLMs on mobile phones

  8. Conclusion and References

A. Appendix / supplemental material

B. Limitation

C. Broader Impact

6.2 Sparsity of Sparsified Models

In this subsection, we report the sparsity of our models. We first profile the proportion of zero-valued activations in every layer on a general dataset (FineWeb), as shown in Figure 6. Counting only activations that are exactly zero, we find that TurboSparse-Mistral-7B has, on average, 90% of its neurons inactive in each layer. For TurboSparse-Mixtral-47B, this percentage is slightly lower, at 85% on average for each expert FFN. The original Mixtral-47B already activates only 2 of its 8 experts in each layer, which introduces 75% sparsity, so only 25% of the FLOPs need to be computed. After ReLUfication, each activated expert in turn activates only 15% of its neurons. Combining the two effects, only about 3% of the parameters in each MoE layer are activated during inference.
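The arithmetic behind the combined figure can be sketched in a few lines of Python. This is a minimal illustration, not the authors' profiling code: the helper names `zero_fraction` and `active_fraction_per_moe_layer` are hypothetical, the toy pre-activations are random numbers rather than model data, and a plain ReLU stands in for the paper's activation function. With the rounded averages quoted above (2 of 8 experts, 85% neuron sparsity), the product comes to 3.75%, in the neighborhood of the roughly 3% reported; the exact value depends on the unrounded per-layer sparsity.

```python
import random


def relu(x: float) -> float:
    """Plain ReLU, used here only as a stand-in activation (assumption)."""
    return x if x > 0.0 else 0.0


def zero_fraction(activations: list[float]) -> float:
    """Fraction of activations that are exactly zero, i.e. inactive neurons."""
    if not activations:
        raise ValueError("activations must be non-empty")
    return sum(1 for a in activations if a == 0.0) / len(activations)


def active_fraction_per_moe_layer(
    experts_routed: int, experts_total: int, neuron_sparsity: float
) -> float:
    """Combine expert-level routing sparsity with neuron-level sparsity
    inside each routed expert."""
    expert_fraction = experts_routed / experts_total
    return expert_fraction * (1.0 - neuron_sparsity)


# Toy profiling run on hypothetical pre-activations (not real model data).
rng = random.Random(0)
pre_acts = [rng.gauss(-1.0, 1.0) for _ in range(10_000)]
post_acts = [relu(x) for x in pre_acts]
print(f"toy zero fraction: {zero_fraction(post_acts):.2%}")

# Figures quoted in the text: 2 of 8 experts routed, 85% of neurons inactive.
mixtral = active_fraction_per_moe_layer(2, 8, 0.85)
print(f"active fraction per MoE layer: {mixtral:.4f}")  # 0.0375
```

In a real measurement, `zero_fraction` would be applied to the post-activation tensors captured from each FFN layer while running the model on the profiling dataset, and the results averaged per layer.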

Authors:

(1) Yixin Song, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(2) Haotong Xie, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(3) Zhengyan Zhang, Department of Computer Science and Technology, Tsinghua University;

(4) Bo Wen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(5) Li Ma, Shanghai Artificial Intelligence Laboratory;

(6) Zeyu Mi, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(7) Haibo Chen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University.


This paper is available on arXiv under the CC BY 4.0 license.


Published by HackerNoon on 2026/03/03