Authors:
(1) Soham De, Google DeepMind, with equal contributions;
(2) Samuel L. Smith, Google DeepMind, with equal contributions;
(3) Anushan Fernando, Google DeepMind, with equal contributions;
(4) Aleksandar Botev, Google DeepMind, with equal contributions;
(5) George Cristian-Muraru, Google DeepMind, with equal contributions;
(6) Albert Gu, Work done while at Google DeepMind;
(7) Ruba Haroun, Google DeepMind;
(8) Leonard Berrada, Google DeepMind;
(9) Yutian Chen, Google DeepMind;
(10) Srivatsan Srinivasan, Google DeepMind;
(11) Guillaume Desjardins, Google DeepMind;
(12) Arnaud Doucet, Google DeepMind;
(13) David Budden, Google DeepMind;
(14) Yee Whye Teh, Google DeepMind;
(15) Razvan Pascanu, Google DeepMind;
(16) Nando De Freitas, Google DeepMind;
(17) Caglar Gulcehre, Google DeepMind.
3. Recurrent Models Scale as Efficiently as Transformers
3.2. Evaluation on downstream tasks
4.2. Efficient linear recurrences on device
4.3. Training speed on longer sequences
5.1. A simple model of the decode step
6. Long Context Modeling and 6.1. Improving next token prediction with longer contexts
6.2. Copy and retrieval capabilities
8. Conclusion, Acknowledgements, and References
B. Complex-Gated Linear Recurrent Unit (CG-LRU)
C. Model Scale Hyper-Parameters
D. Efficient Linear Recurrences on Device
E. The Local Attention Window Size of Griffin
G. Improving Next Token Prediction with Longer Contexts: Additional Results
H. Additional Details of the Copy and Retrieval Tasks
Griffin uses both recurrent blocks and local attention layers in its temporal mixing blocks. For all experiments shown previously with a training sequence length of 2048, we use a local attention window size of 1024. We now investigate how the performance of different local attention window sizes varies with the training sequence length.
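To make the role of the window size concrete, the sketch below builds a causal local attention mask in which each query position can attend only to the most recent `window` tokens. This is an illustrative NumPy sketch, not the paper's implementation; the function name `local_attention_mask` and the boolean-mask formulation are our own simplifications.

```python
# Illustrative sketch (not the paper's code): a causal local-attention mask in which
# each query attends only to the most recent `window` tokens, including itself.
import numpy as np

def local_attention_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask of shape [seq_len, seq_len]; True means query i may attend to key j."""
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    causal = j <= i                   # never attend to future tokens
    within_window = (i - j) < window  # only the most recent `window` tokens
    return causal & within_window

# Example matching the setting above: sequence length 2048, window size 1024.
mask = local_attention_mask(seq_len=2048, window=1024)
print(mask.shape)      # (2048, 2048)
print(mask[-1].sum())  # 1024 -> the last query sees exactly the previous 1024 tokens
```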
We consider 400M parameter models trained on sequence lengths of 2048, 4096 and 8192 tokens, keeping the total number of training tokens fixed. For each sequence length, we train Griffin models with different local attention window sizes. As baselines, we train MQA Transformers using global attention layers, as well as MQA Transformers using local attention layers with different window sizes. The results are shown in Figure 9, where the window size used is shown above each bar (MQA Transformer bars whose window size equals the training sequence length are the global attention MQA Transformer baseline).
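As a quick check of the parenthetical above, the snippet below (reusing the `local_attention_mask` sketch from earlier; again an illustration rather than the paper's code) confirms that when the window size equals the training sequence length, the local mask reduces to a standard global causal mask, which is why those bars correspond to the global attention baseline.

```python
# Sanity check (illustrative, reusing local_attention_mask from the earlier sketch):
# when the window equals the training sequence length, the local mask is identical to
# a standard global causal mask, so that configuration is exactly the global attention
# MQA Transformer baseline.
import numpy as np

seq_len = 2048
global_causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
assert np.array_equal(local_attention_mask(seq_len, window=seq_len), global_causal)
```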
From Figure 9, we see that, remarkably, even with a fixed window size of 1024 for its local attention layers, Griffin outperforms the global attention MQA Transformer baseline across all sequence lengths tested. However, the performance gap between Griffin with a local attention window of 1024 and the global attention MQA Transformer narrows as the sequence length grows. Therefore, if the sequence length grows further, it is likely important to also gradually grow the local attention window size. In practice, the hardware used will also heavily influence the optimal local attention window size in terms of training and inference speed. Finally, we note that MQA Transformers using purely local attention (window sizes smaller than the training sequence length) perform significantly worse than both the global attention MQA Transformer and Griffin.
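The following back-of-the-envelope sketch (our illustration, with a hypothetical helper `mean_attended_keys`; it is not an analysis from the paper) shows how the trade-off shifts with sequence length: the average number of key/value positions each query attends to is bounded by the window size for local attention but grows linearly with sequence length for global attention, which affects both attention FLOPs and the KV cache needed at decode time.

```python
# Rough cost sketch (our illustration, not an analysis from the paper). A query at
# position t attends to min(window, t + 1) keys under local attention and to t + 1 keys
# under global attention, so local attention's per-query cost stays bounded as the
# training sequence length grows.
from typing import Optional

def mean_attended_keys(seq_len: int, window: Optional[int]) -> float:
    """Average number of keys each query attends to; window=None means global attention."""
    total = sum((t + 1) if window is None else min(window, t + 1) for t in range(seq_len))
    return total / seq_len

for seq_len in (2048, 4096, 8192):
    local = mean_attended_keys(seq_len, window=1024)
    global_ = mean_attended_keys(seq_len, window=None)
    print(f"seq_len={seq_len:5d}  local(window=1024) ~ {local:7.1f}  global ~ {global_:7.1f}")
```

With a fixed window of 1024, the local cost saturates near 1024 keys per query while the global cost keeps growing, which is what makes local attention attractive for speed; at the same time, a fixed window covers a shrinking fraction of the available context as sequences get longer, which is one plausible reading of why the quality gap narrows.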
This paper is available on arXiv under a CC BY 4.0 DEED license.