Reducing Memory Overhead in AI Models

Here, we outline the proof that the proposed bifurcated attention in Equations 3 and 4 recovers the same attention as the operations in Equations 1 and 2 for the case of single-context batch sampling. We use the fact that, for the part of the KV cache corresponding to the context length, all the batch indices correspond to the same tensors.
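The argument can be written out index by index. The sketch below is not the paper's own derivation; it assumes einsum-style shapes q ∈ R^{b×g×p×n×k}, a shared context cache K_c, V_c ∈ R^{g×m_c×k}, and per-sample decoded caches K_d, V_d ∈ R^{b×g×m_d×k}, with m = m_c + m_d. The index names j, t, and v are introduced here for illustration only.

```latex
% Assumption: the full cache is the shared context part followed by the per-sample part.
\begin{aligned}
K_{b,g,j,:} &=
  \begin{cases}
    (K_c)_{g,j,:}       & 1 \le j \le m_c, \\
    (K_d)_{b,g,j-m_c,:} & m_c < j \le m_c + m_d,
  \end{cases}
\\[6pt]
\langle q, K \rangle_{b,g,p,n,j}
  &= \sum_{t=1}^{k} q_{b,g,p,n,t}\, K_{b,g,j,t}
   = \begin{cases}
       \langle q, K_c \rangle_{b,g,p,n,j}     & 1 \le j \le m_c, \\
       \langle q, K_d \rangle_{b,g,p,n,j-m_c} & m_c < j \le m_c + m_d,
     \end{cases}
\\[6pt]
\text{so}\quad \langle q, K \rangle &= \langle q, K_c \rangle \oplus \langle q, K_d \rangle
  \quad \text{(concatenation along } j\text{)}.
\\[6pt]
w &= \operatorname{softmax}_j \langle q, K \rangle = w_c \oplus w_d,
\\[6pt]
\langle w, V \rangle_{b,g,p,n,v}
  &= \sum_{j=1}^{m_c} (w_c)_{b,g,p,n,j}\, (V_c)_{g,j,v}
   + \sum_{j=1}^{m_d} (w_d)_{b,g,p,n,j}\, (V_d)_{b,g,j,v}
   = \langle w_c, V_c \rangle + \langle w_d, V_d \rangle .
\end{aligned}
```

The only step that uses the single-context assumption is the first line: because the context rows of K and V do not depend on the batch index b, the batch axis can be dropped from K_c and V_c without changing any term of the sums.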

E.2. Detailed Memory I/O Analysis

Overall, the memory I/O complexity changes as follows:

• Original memory I/O cost: bhnk + bgmk + bhnm (for ⟨q, K⟩) + bhnm + bgmk + bnd (for ⟨w, V⟩)

• Bifurcated attention memory I/O cost: bhnk + g·m_c·k + b·g·m_d·k + bhnm (for ⟨q, K⟩) + bhnm + g·m_c·k + b·g·m_d·k + bnd (for ⟨w, V⟩)

There is an additional memory I/O cost because the outputs ⟨w, V_c⟩ and ⟨w, V_d⟩ are each written, so the output is written twice. However, this cost (bnd) is typically very small compared to the I/O of the KV cache component bgmk, since m ≫ n = 1.
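To make the comparison concrete, the two cost expressions above can be evaluated for a given configuration. The following is a minimal sketch in plain Python; the function names and the example sizes are hypothetical, and the per-term comments reflect one reading of the formulas (sizes of the tensors involved), not a statement from the paper.

```python
def original_io(b, h, n, k, g, m, d):
    """Element count moved for standard attention with a per-sample KV cache."""
    qk = b * h * n * k + b * g * m * k + b * h * n * m  # q, K cache, logits
    wv = b * h * n * m + b * g * m * k + b * n * d      # weights, V cache, output
    return qk + wv


def bifurcated_io(b, h, n, k, g, m_c, m_d, d):
    """Element count moved when the context part of the cache is shared."""
    m = m_c + m_d
    # The context cache (g * m_c * k) is counted once, not once per batch index.
    qk = b * h * n * k + g * m_c * k + b * g * m_d * k + b * h * n * m
    wv = b * h * n * m + g * m_c * k + b * g * m_d * k + b * n * d
    return qk + wv


if __name__ == "__main__":
    # Hypothetical multi-query setup: 32 samples, 32 heads, 1 KV group,
    # head size 128, model width 4096, 2000 context tokens, 48 decoded tokens.
    cfg = dict(b=32, h=32, n=1, k=128, g=1, d=4096)
    base = original_io(m=2048, **cfg)
    bif = bifurcated_io(m_c=2000, m_d=48, **cfg)
    print(f"original:   {base} elements")
    print(f"bifurcated: {bif} elements")
    print(f"reduction:  {base / bif:.2f}x")
```

With b = 1 the two expressions coincide, which matches the intuition that sharing the context cache only helps when several samples are decoded from the same context.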

E.3. Implementation of Bifurcated Attention

Although bifurcated attention yields a dramatic gain in inference efficiency, its implementation is simple, involving 20 lines of code using PyTorch (Paszke et al., 2019).
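As a rough illustration of why the implementation can be this short, here is a minimal sketch written in NumPy rather than PyTorch. It is not the authors' code: the function names, the tensor layout (b, g, p, n, k), the 1/sqrt(k) scaling, and the absence of a causal mask (reasonable only for incremental decoding with n = 1) are all assumptions made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def standard_attention(q, K, V):
    """Reference attention. q: (b, g, p, n, k); K, V: (b, g, m, k)."""
    scale = q.shape[-1] ** -0.5
    logits = np.einsum("bgpnk,bgmk->bgpnm", q, K) * scale
    w = softmax(logits, axis=-1)
    return np.einsum("bgpnm,bgmv->bgpnv", w, V)


def bifurcated_attention(q, K_c, V_c, K_d, V_d):
    """Split attention. K_c, V_c: (g, m_c, k) shared; K_d, V_d: (b, g, m_d, k)."""
    scale = q.shape[-1] ** -0.5
    # The context cache has no batch axis, so it is read once for all samples.
    logits_c = np.einsum("bgpnk,gmk->bgpnm", q, K_c) * scale
    logits_d = np.einsum("bgpnk,bgmk->bgpnm", q, K_d) * scale
    # Softmax is taken jointly over context and decoded positions.
    w = softmax(np.concatenate([logits_c, logits_d], axis=-1), axis=-1)
    m_c = K_c.shape[1]
    w_c, w_d = w[..., :m_c], w[..., m_c:]
    out_c = np.einsum("bgpnm,gmv->bgpnv", w_c, V_c)
    out_d = np.einsum("bgpnm,bgmv->bgpnv", w_d, V_d)
    return out_c + out_d


# Small random check that both paths agree (sizes are arbitrary).
rng = np.random.default_rng(0)
b, g, p, n, k, m_c, m_d = 4, 2, 3, 1, 8, 16, 5
q = rng.standard_normal((b, g, p, n, k))
K_c = rng.standard_normal((g, m_c, k))
V_c = rng.standard_normal((g, m_c, k))
K_d = rng.standard_normal((b, g, m_d, k))
V_d = rng.standard_normal((b, g, m_d, k))

# Build the full per-sample cache by repeating the shared context over the batch.
K = np.concatenate([np.broadcast_to(K_c, (b, g, m_c, k)), K_d], axis=2)
V = np.concatenate([np.broadcast_to(V_c, (b, g, m_c, k)), V_d], axis=2)

out_std = standard_attention(q, K, V)
out_bif = bifurcated_attention(q, K_c, V_c, K_d, V_d)
print("max abs difference:", np.abs(out_std - out_bif).max())
```

The two einsum calls on K_c and V_c omit the batch index from the cache operand, which is where the memory I/O saving comes from; everything else mirrors ordinary attention.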

This paper is available on arXiv under the CC BY 4.0 DEED license.