TurboSparse Limitations: The Impact of 150B Token Recovery Training

Written by languagemodels | Published 2026/03/05
Tech Story Tags: language-models | training-token-constraints | llama-3-comparison | model-capability-deficiencies | performance-mitigation | efficient-recovery-training | llm-scalability | future-model-optimization

TL;DR: Explore the current limitations of dReLU sparsification. While achieving 90% sparsity, TurboSparse models have so far been trained on only 1% of the tokens used by Llama-3, and further training is expected to enhance their capabilities.

Abstract and 1. Introduction

  2. Related Work and Background

  3. Analysis

    3.1 Limitations of Existing ReLUfication

    3.2 dReLU

  4. Are Neurons in Expert still Sparsely Activated?

  5. dReLU Sparsification

  6. Experiments Results

    6.1 Downstream Tasks Performance

    6.2 Sparsity of Sparsified Models

  7. Practical Inference Speedup Evaluation

    7.1 Experiments Setting

    7.2 Pure CPU Inference and 7.3 Hybrid GPU-CPU Inference

    7.4 Deploy LLMs on mobile phones

  8. Conclusion and References

A. Appendix / supplemental material

B. Limitation

C. Broader Impact

B Limitation

Our models have only undergone continued training on 150B tokens. Compared with the 15T tokens used to pre-train Llama-3 [60], this limited token budget still leaves some deficiencies in the models' capabilities. We are optimistic that further training can help mitigate these shortcomings.
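As a quick sanity check of that gap, the token budgets quoted above work out to a 1% ratio. A minimal sketch in Python; the only inputs are the 150B and 15T figures stated in this section:

```python
# Token budgets quoted in this section (Llama-3 figure from [60]).
llama3_pretrain_tokens = 15e12     # 15T tokens of Llama-3 pre-training
continued_training_tokens = 150e9  # 150B tokens of continued training

ratio = continued_training_tokens / llama3_pretrain_tokens
print(f"Continued training covers {ratio:.0%} of Llama-3's token budget")
# -> Continued training covers 1% of Llama-3's token budget
```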

Authors:

(1) Yixin Song, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(2) Haotong Xie, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(3) Zhengyan Zhang, Department of Computer Science and Technology, Tsinghua University;

(4) Bo Wen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(5) Li Ma, Shanghai Artificial Intelligence Laboratory;

(6) Zeyu Mi, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University ([email protected]);

(7) Haibo Chen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University.


This paper is available on arxiv under CC BY 4.0 license.


Written by languagemodels | Large Language Models (LLMs) ushered in a technological revolution. We break down how the most important models work.
Published by HackerNoon on 2026/03/05