Understanding Multi-Group Attention in AI Models

Written by batching | Published 2025/02/26
Tech Story Tags: ai-code-generation | ai-inference | bifurcated-attention | memory-io-optimization | low-latency-ai | llm-batch-sampling | transformer-model-efficiency | multi-query-attention

TLDR: Multi-group attention optimizes AI model efficiency by reducing memory IO costs. FLOPs remain proportional to parameters, ensuring scalability across architectures.

Authors:

(1) Ben Athiwaratkun, AWS AI Labs;

(2) Sujan Kumar Gonugondla, AWS AI Labs;

(3) Sanjay Krishna Gouda, AWS AI Labs;

(4) Haifeng Qian, AWS AI Labs;

(5) Hantian Ding, AWS AI Labs;

(6) Qing Sun, AWS AI Labs;

(7) Jun Wang, AWS AI Labs;

(8) Jiacheng Guo, AWS AI Labs;

(9) Liangfu Chen, AWS AI Labs;

(10) Parminder Bhatia, GE HealthCare (work done at AWS);

(11) Ramesh Nallapati, Amazon AGI (work done at AWS);

(12) Sudipta Sengupta, AWS AI Labs;

(13) Bing Xiang, Goldman Sachs (work done at AWS).

Table of Links

Abstract and 1 Introduction

2. Related Work

3. Background

3.1. Notation and 3.2. Language Model Inference

3.3. Multi-Query, Multi-Head and the Generalized Multi-Query Attention

4. Context-Aware Bifurcated Attention and 4.1. Motivation

4.2. Formulation and 4.3. Memory IO Complexity

5. Experiments

5.1. Comparing Capabilities of Multi-Head, Multi-Query, and Multi-Group Attention

5.2. Latencies of Capabilities-Equivalent Models

5.3. Applications

6. Conclusion and References

A. FAQs

B. Related Work

C. Setup

D. Multi-Group Attention Family

E. Context-Aware Bifurcated Attention

F. Applications: Additional Results

G. Compatibility with Speculative Decoding and Fast Decoding techniques

D. Multi-Group Attention Family

D.1. Detailed Analysis on Memory Access

We show in Table 4 that the memory IO cost for ⟨q, K⟩ is dominated by the loading of K, which costs bmhk in the multi-head case where g = h. This cost is particularly high due to the coupling of batch size b, context length m, and the entire hidden dimension d. Compared to the number of computations, which has complexity bmd, this attention module requires one memory IO per tensor operation (memory-IO bound). In contrast, other operations such as the feedforward layers have a much lower ratio of memory IO per compute (compute bound). These attention computations can therefore be the main bottleneck for incremental decoding, and our paper aims to tackle this problem.
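To make the bound concrete, here is a back-of-the-envelope sketch (not from the paper; the cost model and numeric values are illustrative assumptions) that counts the K-cache elements loaded versus the dot-product operations performed in the ⟨q, K⟩ step of one decoding iteration, using the paper's notation b, m, h, k, g:

```python
# Illustrative sketch: memory IO vs. compute for the <q, K> step of one
# incremental-decoding iteration. Notation follows the paper:
# b = batch size, m = context length, h = query heads, k = head dimension,
# g = number of key/value groups (g = h is multi-head, g = 1 is multi-query).

def attention_qk_cost(b, m, h, k, g):
    d = h * k                 # hidden dimension
    kv_io = b * m * g * k     # elements of the K cache loaded from memory
    ops = b * m * d           # dot-product operations between q and the cached K
    return kv_io, ops

# Hypothetical sizes, chosen only for illustration.
for name, g in [("multi-head ", 32), ("multi-query", 1)]:
    io, ops = attention_qk_cost(b=32, m=2048, h=32, k=128, g=g)
    print(f"{name}: K-cache IO = {io:.3e}, ops = {ops:.3e}, ops per IO = {ops / io:.0f}")
```

With g = h the two quantities coincide, i.e., roughly one memory access per operation, which is exactly the memory-IO-bound regime described above; shrinking g raises the arithmetic intensity of this step.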

D.2. Model FLOPs

The scaling laws of Kaplan et al. (2020) show that the model-related FLOPs during the forward pass are 2N, where N is the number of parameters (excluding the embeddings). We show that this holds for a general multi-group model as well. The only difference between the multi-group and the multi-head case is the projections PK and PV, which are of size dgk instead of dhk. Since these are linear layers, the forward-pass FLOPs for any input remain proportional to the projection size. Therefore, it follows that for any multi-group attention, including multi-head, the forward FLOPs are 2N, where N is the respective number of parameters.
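As a rough illustration of this point (a sketch with assumed layer shapes, not the paper's implementation), the per-layer parameter count below changes only through the dgk-sized key/value projections, and the forward FLOPs per token track 2N in every case:

```python
# Sketch with assumed layer shapes: parameters of one multi-group transformer
# layer and the resulting forward FLOPs per token (~2N, Kaplan et al., 2020).
# d = hidden dim, h = query heads, k = head dim, g = K/V groups, d_ff = FFN dim.

def layer_params(d, h, k, g, d_ff):
    p_q = d * h * k        # query projection
    p_kv = 2 * d * g * k   # key and value projections (d*g*k each)
    p_out = h * k * d      # attention output projection
    p_ffn = 2 * d * d_ff   # feed-forward up and down projections
    return p_q + p_kv + p_out + p_ffn

d, h, k, d_ff = 4096, 32, 128, 16384   # hypothetical sizes
for name, g in [("multi-head ", h), ("multi-query", 1)]:
    n = layer_params(d, h, k, g, d_ff)
    print(f"{name}: N = {n:,} params/layer, forward FLOPs/token ~ {2 * n:,}")
```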

D.3. Comparing Capabilities-Equivalent Models

This section analyzes how latency changes when we switch from a multi-head (MH) model to a multi-group (MG) model that is F times larger.
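As a rough preview of the kind of comparison developed in the subsections below (an assumed cost model with hypothetical sizes, not the paper's derivation), the per-step memory IO during incremental decoding can be split into weight loading, which grows by the factor F, and KV-cache loading, which shrinks with g:

```python
# Assumed cost model, for illustration only: per-step memory IO during
# incremental decoding = weights read once per step + K/V caches read for attention.

def decode_step_io(n_params, b, m, g, k):
    weight_io = n_params             # every parameter is loaded once per step
    kv_cache_io = 2 * b * m * g * k  # K and V caches for b sequences of length m
    return weight_io + kv_cache_io

n_mh = 7e9   # hypothetical MH model size
F = 1.1      # MG model is F times larger
b, m, h, k = 32, 4096, 32, 128

io_mh = decode_step_io(n_mh, b, m, g=h, k=k)       # MH baseline (g = h)
io_mg = decode_step_io(F * n_mh, b, m, g=1, k=k)   # larger multi-query model (g = 1)

# Ratio below 1 means the KV-cache savings outweigh the extra weight IO.
print(f"per-step IO ratio MG/MH = {io_mg / io_mh:.2f}")
```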

D.3.1. CONTEXT ENCODING

D.3.2. INCREMENTAL DECODING

This paper is available on arXiv under a CC BY 4.0 DEED license.

