Recurrent Models: Enhancing Latency and Throughput Efficiency

Here, cache size refers to either the size of the KV cache at batch size 1 (for Transformers), or to the size of the recurrent state at batch size 1 (for RNNs).

Cache sizes. The difference in cache size relative to model parameters has important implications for sampling efficiency. In recurrent and local attention blocks, parameter loading is the primary bottleneck, because the cache is substantially smaller than the model parameters. In contrast, global attention's KV cache scales with the sequence length 𝑇 and can be comparable to, or even exceed, the size of the model parameters. This introduces considerable overhead when 𝑇 is large enough (as shown in F.4). Consequently, an equally sized recurrent model can exhibit substantially lower latency than a Transformer when 𝑇 is large. Note, however, that as the model size grows, the sequence length at which latency benefits appear (where the KV cache size is comparable to the parameter size) also increases. Besides improving latency, a small recurrent state can also increase the largest batch size that fits in memory on a single device, leading to higher throughput.
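The comparison above can be made concrete with a back-of-the-envelope calculation. The sketch below is illustrative only: the layer count, head dimension, state width, 16-bit storage, and 16 GiB device are hypothetical values chosen for the example, not the configurations evaluated in the paper, and the batch-size estimate ignores activation memory.

```python
"""Rough cache sizes at batch size 1 (illustrative assumptions throughout)."""

BYTES_PER_ELEM = 2  # assumption: 16-bit storage for parameters and caches


def param_bytes(num_params):
    return num_params * BYTES_PER_ELEM


def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim):
    # Factor 2: one key and one value vector per cached token, layer, and KV head.
    return 2 * num_layers * seq_len * num_kv_heads * head_dim * BYTES_PER_ELEM


def local_cache_bytes(seq_len, window, num_layers, num_kv_heads, head_dim):
    # Local attention keeps at most the most recent `window` tokens.
    return kv_cache_bytes(min(seq_len, window), num_layers, num_kv_heads, head_dim)


def recurrent_state_bytes(num_layers, state_dim):
    # Fixed-size state per layer, independent of the sequence length.
    return num_layers * state_dim * BYTES_PER_ELEM


def crossover_seq_len(num_params, num_layers, num_kv_heads, head_dim):
    # Smallest T at which the global-attention KV cache is at least
    # as large as the parameters (bytes per element cancels out).
    per_token_elems = 2 * num_layers * num_kv_heads * head_dim
    return -(-num_params // per_token_elems)  # ceiling division


def max_batch_size(device_bytes, num_params, cache_bytes_per_sequence):
    # Sequences that fit once the parameters are resident (activations ignored).
    free = device_bytes - param_bytes(num_params)
    return max(free // cache_bytes_per_sequence, 0)


# Hypothetical 1B-parameter configuration with a single KV head per layer.
NUM_PARAMS = 1_000_000_000
CFG = dict(num_layers=24, num_kv_heads=1, head_dim=128)
STATE_DIM = 4096            # hypothetical recurrent state width per layer
WINDOW = 1024               # hypothetical local attention window
DEVICE_BYTES = 16 * 1024**3  # hypothetical 16 GiB accelerator
T = 16_384

global_cache = kv_cache_bytes(T, **CFG)
local_cache = local_cache_bytes(T, WINDOW, **CFG)
recurrent_cache = recurrent_state_bytes(CFG["num_layers"], STATE_DIM)

print("parameters (bytes):      ", param_bytes(NUM_PARAMS))
print("global KV cache (bytes): ", global_cache)
print("local KV cache (bytes):  ", local_cache)
print("recurrent state (bytes): ", recurrent_cache)
print("crossover T:             ", crossover_seq_len(NUM_PARAMS, **CFG))
print("max batch, global attn:  ", max_batch_size(DEVICE_BYTES, NUM_PARAMS, global_cache))
print("max batch, recurrent:    ", max_batch_size(DEVICE_BYTES, NUM_PARAMS, recurrent_cache))
```

Under these assumptions the global KV cache grows linearly with 𝑇, while the local attention cache and the recurrent state stay fixed once 𝑇 exceeds the window. Holding the architecture fixed, doubling `NUM_PARAMS` doubles the crossover sequence length, which mirrors the observation that larger models need longer sequences before the latency benefit appears.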

Figure 4 | Latency of different 1B parameter models for a range of sequence lengths for (a) sampling from an empty prefill and (b) sampling from a prefill of 4k tokens.

This paper is available on arXiv under the CC BY 4.0 license.
