Authors:
(1) Ben Athiwaratkun, AWS AI Labs;
(2) Sujan Kumar Gonugondla, AWS AI Labs;
(3) Sanjay Krishna Gouda, AWS AI Labs;
(4) Haifeng Qian, AWS AI Labs;
(5) Sanjay Krishna Gouda, AWS AI Labs;
(6) Hantian Ding, AWS AI Labs;
(7) Qing Sun, AWS AI Labs;
(8) Jun Wang, AWS AI Labs;
(9) Jiacheng Guo, AWS AI Labs;
(10) Liangfu Chen, AWS AI Labs;
(11) Parminder Bhatia, GE HealthCare (work done at AWS);
(12) Ramesh Nallapati, Amazon AGI (work done at AWS);
(13) Sudipta Sengupta, AWS AI Labs;
(14) Bing Xiang, Goldman Sachs (work done at AWS).

Table of Links

Abstract and 1 Introduction
2. Related Work
3. Background
3.1. Notation and 3.2. Language Model Inference
3.3. Multi-Query, Multi-Head and the Generalized Multi-Query Attention
4. Context-Aware Bifurcated Attention and 4.1. Motivation
4.2. Formulation and 4.3. Memory IO Complexity
5. Experiments
5.1. Comparing Capabilities of Multi-Head, Multi-Query, and Multi-Group Attention
5.2. Latencies of Capabilities-Equivalent Models
5.3. Applications
6. Conclusion and References
A. FAQs
B. Related Work
C. Setup
D. Multi-Group Attention Family
E. Context-Aware Bifurcated Attention
F. Applications: Additional Results
G. Compatibility with Speculative Decoding and Fast Decoding techniques

4. Context-Aware Bifurcated Attention

In this section, we present a novel context-aware bifurcated attention method that aims to reduce the memory IO cost during incremental decoding by efficiently handling the computation of attention for shared context across samples, as shown in Figure 2.

4.1. Motivation

4.2. Formulation

The proposed operations yield exactly the same results ⟨w, V⟩ as the original attention in Equations 1 and 2, but can significantly reduce memory IO during incremental decoding (proof in Appendix E.1).

4.3. Memory IO Complexity

The memory IO complexity corresponding to loading KV changes from

$$\text{memory IO w/o bifurcated attention} = gk \cdot bm = gk \cdot b(m_c + m_d) \tag{5}$$

to

$$\text{memory IO w. bifurcated attention} = gk \cdot (m_c + b\,m_d) \tag{6}$$

This paper is available on arXiv under CC BY 4.0 DEED license.
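To make Equations 5 and 6 and the exactness claim of Section 4.2 concrete, here is a minimal pure-Python sketch. It is not the authors' implementation. It assumes the paper's notation (g attention groups, k head dimension, b batch size, m_c shared context length, m_d decoded length per sample), and it assumes the bifurcation works by computing logits separately against the context keys and the decoded keys, applying one softmax over their concatenation, and adding the two partial weighted sums of values. All function names are hypothetical, and the code only checks numerical equivalence and counts elements; it does not reproduce the memory-access pattern that yields the actual speedup.

```python
import math
import random


def kv_memory_io(g, k, b, m_c, m_d, bifurcated):
    """KV elements loaded per decoding step, following Equations 5 and 6.

    Assumed notation: g attention groups, k head dimension, b batch size,
    m_c shared context length, m_d decoded length per sample.
    """
    if bifurcated:
        # The shared context KV is read once; only decoded KV scales with b.
        return g * k * (m_c + b * m_d)
    # Each of the b samples reads its full cache of length m = m_c + m_d.
    return g * k * b * (m_c + m_d)


def softmax(xs):
    peak = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * c for a, c in zip(u, v))


def weighted_sum(weights, values):
    """Return sum_i weights[i] * values[i] for equal-length vectors."""
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]


def attend(q, keys, values):
    """Reference single-head attention over one concatenated KV cache."""
    return weighted_sum(softmax([dot(q, key) for key in keys]), values)


def attend_bifurcated(q, k_ctx, v_ctx, k_dec, v_dec):
    """Assumed split: separate logits for context and decoded keys, one
    softmax over their concatenation, then two partial weighted sums."""
    logits = [dot(q, key) for key in k_ctx] + [dot(q, key) for key in k_dec]
    w = softmax(logits)
    out_ctx = weighted_sum(w[:len(k_ctx)], v_ctx)
    out_dec = weighted_sum(w[len(k_ctx):], v_dec)
    return [a + c for a, c in zip(out_ctx, out_dec)]


def rand_vectors(rng, count, dim):
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


def max_abs_gap(b=4, m_c=16, m_d=3, dim=8, seed=0):
    """Largest elementwise difference between the two paths over b samples."""
    rng = random.Random(seed)
    # One context KV cache, shared by every sample in the batch.
    k_ctx, v_ctx = rand_vectors(rng, m_c, dim), rand_vectors(rng, m_c, dim)
    gap = 0.0
    for _ in range(b):
        q = rand_vectors(rng, 1, dim)[0]
        k_dec, v_dec = rand_vectors(rng, m_d, dim), rand_vectors(rng, m_d, dim)
        full = attend(q, k_ctx + k_dec, v_ctx + v_dec)
        split = attend_bifurcated(q, k_ctx, v_ctx, k_dec, v_dec)
        gap = max(gap, max(abs(x - y) for x, y in zip(full, split)))
    return gap


baseline = kv_memory_io(g=1, k=128, b=8, m_c=1000, m_d=10, bifurcated=False)
shared = kv_memory_io(g=1, k=128, b=8, m_c=1000, m_d=10, bifurcated=True)
print(baseline, shared)      # 1034240 138240
print(max_abs_gap() < 1e-9)  # the two paths agree up to rounding
```

With these illustrative numbers (chosen for the example, not taken from the paper), the KV load drops by roughly 7.5x, because the long shared context is counted once instead of b times. The gain shrinks as m_d grows relative to m_c, and it vanishes at b = 1, where Equations 5 and 6 coincide.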
