Table of Links

Abstract and 1. Introduction
2. Method
3. Experiments on real data
3.1. Benefits scale with model size and 3.2. Faster inference
3.3. Learning global patterns with multi-byte prediction and 3.4. Searching for the optimal n
3.5. Training for multiple epochs and 3.6. Finetuning multi-token predictors
3.7. Multi-token prediction on natural language
4. Ablations on synthetic data and 4.1. Induction capability
4.2. Algorithmic reasoning
5. Why does it work? Some speculation and 5.1. Lookahead reinforces choice points
5.2. Information-theoretic argument
6. Related work
7. Conclusion, Impact statement, Environmental impact, Acknowledgements, and References
A. Additional results on self-speculative decoding
B. Alternative architectures
C. Training speeds
D. Finetuning
E. Additional results on model scaling behavior
F. Details on CodeContests finetuning
G. Additional results on natural language benchmarks
H. Additional results on abstractive text summarization
I. Additional results on mathematical reasoning in natural language
J. Additional results on induction learning
K. Additional results on algorithmic reasoning
L. Additional intuitions on multi-token prediction
M. Training hyperparameters

A. Additional results on self-speculative decoding

This paper is available on arxiv under CC BY 4.0 DEED license.

Authors:
(1) Fabian Gloeckle, FAIR at Meta, CERMICS Ecole des Ponts ParisTech (equal contribution);
(2) Badr Youbi Idrissi, FAIR at Meta, LISN Université Paris-Saclay (equal contribution);
(3) Baptiste Rozière, FAIR at Meta;
(4) David Lopez-Paz, FAIR at Meta (last author);
(5) Gabriel Synnaeve, FAIR at Meta (last author).

Self-Speculative Decoding Speeds for Multi-Token LLMs
