Optimizing LLM Learning: Multi-Token Cross-Entropy Loss Explained

Table of Links

Abstract and 1. Introduction
2. Method
3. Experiments on real data
4. Ablations on synthetic data
5. Why does it work? Some speculation
6. Related work
7. Conclusion, Impact statement, Environmental impact, Acknowledgements and References

A. Additional results on self-speculative decoding
B. Alternative architectures
C. Training speeds
D. Finetuning
E. Additional results on model scaling behavior
F. Details on CodeContests finetuning
G. Additional results on natural language benchmarks
H. Additional results on abstractive text summarization
I. Additional results on mathematical reasoning in natural language
J. Additional results on induction learning
K. Additional results on algorithmic reasoning
L. Additional intuitions on multi-token prediction
M. Training hyperparameters

2. Method

Standard language modeling learns about a large text corpus $x_1, \ldots, x_T$ by implementing a next-token prediction task. Formally, the learning objective is to minimize the cross-entropy loss

$$L_1 = -\sum_t \log P_\theta(x_{t+1} \mid x_{t:1}),$$

where $P_\theta$ is the language model under training and $x_{t:1} = x_t, \ldots, x_1$ is the history of past tokens.

In this work, we generalize the above by implementing a multi-token prediction task, where at each position of the training corpus, the model is instructed to predict $n$ future tokens at once.
This translates into the cross-entropy loss

$$L_n = -\sum_t \log P_\theta(x_{t+n:t+1} \mid x_{t:1}),$$

where $x_{t+n:t+1}$ denotes the $n$ future tokens $x_{t+n}, \ldots, x_{t+1}$, predicted given the history $x_{t:1}$.

Authors:

(1) Fabian Gloeckle, FAIR at Meta and CERMICS, Ecole des Ponts ParisTech (equal contribution);
(2) Badr Youbi Idrissi, FAIR at Meta and LISN, Université Paris-Saclay (equal contribution);
(3) Baptiste Rozière, FAIR at Meta;
(4) David Lopez-Paz, FAIR at Meta (last author);
(5) Gabriel Synnaeve, FAIR at Meta (last author).

This paper is available on arxiv under CC BY 4.0 DEED license.
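
As an illustration of the multi-token cross-entropy loss from the Method section, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that the joint prediction of the n future tokens is scored by n separate output heads, one per offset, so the loss becomes a sum of per-offset negative log-likelihoods; the section above only states the joint form, so this factorization is an assumption made for the example. The names `log_softmax`, `multi_token_cross_entropy`, and `head_logits` are hypothetical, and targets that would fall past the end of the sequence are simply skipped.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def multi_token_cross_entropy(
    head_logits: Sequence[Sequence[Sequence[float]]],
    tokens: Sequence[int],
) -> float:
    """Sum of -log P(x_{t+i} | x_{t:1}) over positions t and offsets i = 1..n.

    head_logits[k][t] holds the (hypothetical) logits produced at position t
    by the head responsible for the token k + 1 steps ahead.
    Assumption: one independent head per offset.
    """
    n = len(head_logits)
    total = 0.0
    for t in range(len(tokens)):
        for k in range(n):
            target_pos = t + k + 1
            if target_pos >= len(tokens):
                continue  # no target token exists this far ahead
            log_probs = log_softmax(head_logits[k][t])
            total -= log_probs[tokens[target_pos]]
    return total


# Usage: a 4-token sequence, vocabulary of 3, n = 2 heads with uniform logits.
tokens = [0, 2, 1, 0]
vocab_size = 3
uniform = [[[0.0] * vocab_size for _ in tokens] for _ in range(2)]
loss = multi_token_cross_entropy(uniform, tokens)
# Head 1 has 3 valid targets and head 2 has 2, so loss equals 5 * log(3).
```

With a single head (n = 1) the function reduces to the standard next-token loss, which makes the generalization easy to check on small inputs.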