Authors:
(1) Soham De, Google DeepMind and with Equal contributions;
(2) Samuel L. Smith, Google DeepMind and with Equal contributions;
(3) Anushan Fernando, Google DeepMind and with Equal contributions;
(4) Aleksandar Botev, Google DeepMind and with Equal contributions;
(5) George Cristian-Muraru, Google DeepMind and with Equal contributions;
(6) Albert Gu, Work done while at Google DeepMind;
(7) Ruba Haroun, Google DeepMind;
(8) Leonard Berrada, Google DeepMind;
(9) Yutian Chen, Google DeepMind;
(10) Srivatsan Srinivasan, Google DeepMind;
(11) Guillaume Desjardins, Google DeepMind;
(12) Arnaud Doucet, Google DeepMind;
(13) David Budden, Google DeepMind;
(14) Yee Whye Teh, Google DeepMind;
(15) David Budden, Google DeepMind;
(16) Razvan Pascanu, Google DeepMind;
(17) Nando De Freitas, Google DeepMind;
(18) Caglar Gulcehre, Google DeepMind.
Table of Links
3 Recurrent Models Scale as Efficiently as Transformers
3.2. Evaluation on downstream tasks
4.2. Efficient linear recurrences on device
4.3. Training speed on longer sequences
5.1. A simple model of the decode step
6. Long Context Modeling and 6.1. Improving next token prediction with longer contexts
6.2. Copy and retrieval capabilities
8. Conclusion, Acknowledgements, and References
B. Complex-Gated Linear Recurrent Unit (CG-LRU)
C. Model Scale Hyper-Parameters
D. Efficient Linear Recurrences on Device
E. The Local Attention Window Size of Griffin
G. Improving Next Token Prediction with Longer Contexts: Additional Results
H. Additional Details of the Copy and Retrieval Tasks
6.2. Copy and retrieval capabilities
Recent work (Jelassi et al., 2024) has shown that Transformers can be significantly more efficient than state space models (SSMs), a popular new family of RNNs, at learning synthetic tasks such as copying the context or retrieving relevant tokens from the context. Additionally, Jelassi et al. (2024) showed that pre-trained Transformers such as Pythia (Biderman et al., 2023) are much better at copying and retrieval tasks at evaluation time compared to pre-trained SSM models such as Mamba (Gu and Dao, 2023). In this section, we investigate the efficiency of Griffin and Hawk in learning how to copy and retrieve tokens from the context. Additionally, we evaluate pre-trained Hawk and Griffin models on a phone number lookup task designed to test both copying and retrieval capabilities.
Training on synthetic tasks To investigate the efficiency of learning how to copy and retrieve relevant tokens from the context, we train on two synthetic tasks: Selective Copying and Induction Heads. To be able to compare Transformers with Hawk and Griffin, we consider 5-block deep networks with model dimension 64, totalling roughly 250K parameters, where Griffin uses a single local attention in the middle of the network, in the third block.
• Selective copying task: In this task, the model needs to learn to copy data tokens from a sequence while ignoring noise tokens from the context. See Appendix H for more details on the setup for this task. This task is inspired by Gu and Dao (2023), where the authors showed that Mamba was able to solve this task better than previously proposed SSMs. We use a vocabulary size of 16, and train on sequences of length 1024, containing 16 data tokens (randomly sampled from the vocabulary and at random locations), with the rest of the tokens set to the noise token. Griffin uses a local attention window size of 512.
• Induction heads: In this task, the model needs to learn to recall the token immediately following a special token. This requires the model to learn the special token, and retrieve the token immediately following it in the context. If the model is able to learn the task, it should be able to extrapolate to significantly longer sequences than it was trained for. We use a vocabulary size of 16 and train on sequences of length 256 where the tokens are sampled randomly, and we randomly sample the location of the special token in the sequence. Griffin uses a local attention window of size 128.
We show our results in Figure 6. On the Selective Copying task, we find that all 3 models are able to solve the task perfectly. When comparing speed of learning on this task, we find Hawk to be significantly slower than Transformers, similar to the observation made by Jelassi et al. (2024), where the authors showed that Mamba was significantly slower to learn on similar tasks. Interestingly though, Griffin shows almost no slowdown, effectively matching the speed of learning of Transformers, despite using only a single local attention layer.
On the Induction Heads task, while all 3 models can solve the task perfectly up to the training sequence length, our Transformer baseline is not able to extrapolate to longer sequences during evaluation. While our MQA baseline uses RoPE, Gu and Dao (2023) had similar observation for Transformers with a range of positional encodings. We find that Hawk is able to perfectly extrapolate on this task to evaluation sequences several orders of magnitude longer than the training sequence length. Notably, Griffin, with its local attention, also demonstrated exceptional ability to extrapolate on this task.
Evaluating pre-trained models We now evaluate whether copying and retrieval capabilities naturally emerge in our pre-trained models. We consider our 7B Hawk and Griffin models and our 6B MQA Transformer baseline, all trained on 300B tokens on the MassiveText dataset. We consider the same phonebook lookup task introduced in Jelassi et al. (2024), where we provide to the model a synthetic phonebook containing names and numbers, and the model is asked to retrieve the correct phone number given a name. The prompt to the model is a phonebook consisting of randomly sampled list of names and numbers of a certain length, followed by two randomly sampled examples of the task, followed by a randomly sampled name from the phonebook for which the model needs to retrieve the correct phone number.
From Figure 6(c), we see that while Hawk can do reasonably well on the task for very short phonebook lengths, it fails to memorize and retrieve the correct phone number when the phonebook length grows, similar to the observation made by Jelassi et al. (2024) on the Mamba model’s performance on this task. This is not particularly surprising since Hawk uses a small fixed-size state. Our Transformer baseline can almost perfectly solve this task up to the training sequence length, but fails to retrieve the correct phone number for context lengths longer than the training sequence length. Interestingly, Griffin can perfectly solve this task up to a context length that matches its local attention window size of 1024, in spite of using only a single local attention layer. Once the context length is long enough such that the local attention window does not cover the whole phonebook, performance starts to degrade. Griffin is also able to extrapolate better to longer sequence lengths compared to Transformers. While the performance of Griffin is promising for the ability of models with fixed-size state to solve copying and retrieval tasks, our results suggest more work is needed to improve these capabilities for such models.
This paper is available on arxiv under CC BY 4.0 DEED license.