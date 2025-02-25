Authors: (1) Ben Athiwaratkun, AWS AI Labs; (2) Sujan Kumar Gonugondla, AWS AI Labs; (3) Sanjay Krishna Gouda, AWS AI Labs; (4) Haifeng Qian, AWS AI Labs; (5) Hantian Ding, AWS AI Labs; (6) Qing Sun, AWS AI Labs; (7) Jun Wang, AWS AI Labs; (8) Jiacheng Guo, AWS AI Labs; (9) Liangfu Chen, AWS AI Labs; (10) Parminder Bhatia, GE HealthCare (work done at AWS); (11) Ramesh Nallapati, Amazon AGI (work done at AWS); (12) Sudipta Sengupta, AWS AI Labs; (13) Bing Xiang, Goldman Sachs (work done at AWS).

B.1. Applications of Single-Context Batch Sampling

The latency reduction we observe can have a profound impact on many applications, including:





• Code Generation: In software development, AI-assisted code generation can benefit greatly from reduced latency, especially when generating multiple code snippets or suggestions for a given context. This can lead to a more responsive and efficient user experience for developers using AI-powered Integrated Development Environments (IDEs) or code completion tools (Nijkamp et al., 2023; 2022; Chen et al., 2021; Le et al., 2022; Fried et al., 2022; Li et al., 2022; Allal et al., 2023; Li et al., 2023; Ahmad et al., 2021).





• Machine Translation: In situations where multiple translations are needed for a single input, such as generating translations with varying degrees of formality or for different dialects, context-aware bifurcated attention can provide more efficient computation, resulting in faster and more scalable machine translation services (Costa-jussà et al., 2022; Farhad et al., 2021; Tran et al., 2021; Yee et al., 2019).





• Chatbots and Conversational AI: Conversational agents often need to generate multiple responses to handle different interpretations of a user’s input or to provide multiple suggestions. The reduced latency offered by the proposed method can significantly improve the responsiveness of chatbots, leading to a more natural and fluid conversation with users (Google, 2023).





• Creative Content Generation: In applications like poetry, story, or advertisement generation, the ability to generate multiple variations for a given prompt is crucial. The proposed method enables more efficient generation of diverse content, making it more feasible for real-time or large-scale applications (Lin and Riedl, 2021; Mirowski et al., 2023; Team, 2023; Yuan et al., 2022).





• Data Augmentation: In the context of data augmentation for machine learning, generating multiple alternative examples for a given input can help improve model robustness and generalization. With the reduced latency provided by context-aware bifurcated attention, the process of generating augmented data can be made faster, enabling more efficient use of computational resources during training.





• General Large-Scale Evaluation: In addition to the aforementioned use cases, there are many niche use cases where LLMs and other open-ended generation models are explored for toxicity (Dathathri et al., 2019; Gehman et al., 2020; Nadeem et al., 2020), detection of vulnerable code in generations (Pearce et al., 2022), performance-improving code edit generation (Madaan et al., 2023), programming language translation (Roziere et al., 2020), and many others. In all of these scenarios, many generations per prompt are gathered for a deeper understanding of the models, and bifurcated attention can drastically speed up the generation process.





In conclusion, the proposed context-aware bifurcated attention method can significantly reduce memory I/O cost and improve latency in various applications, leading to increased efficiency and scalability. This method has the potential to enable new use cases and enhance the user experience in numerous AI-powered systems, making them more practical for real-world deployment.
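To make the memory I/O claim concrete, the following is a minimal back-of-the-envelope sketch, not the paper's implementation. It assumes that in single-context batch sampling all samples share one context, so the context key/value cache can be read once per decoding step instead of once per sample. The function name `kv_positions_read` and the sizes used (16 samples, an 8192-token context, 256 decoded tokens) are hypothetical choices for illustration; per-head and per-layer factors are omitted because they are assumed identical in both cases.

```python
def kv_positions_read(batch_size, context_len, decoded_len, shared_context):
    """Count key/value cache positions read in one incremental decoding step.

    Illustrative assumption: every sample attends to the full context plus
    its own previously decoded tokens. Constant factors (heads, layers,
    head dimension, bytes per value) are left out.
    """
    if shared_context:
        # The context cache is read once for the whole batch; only the
        # per-sample decoded tokens scale with the batch size.
        return context_len + batch_size * decoded_len
    # Without sharing, each sample reads its own copy of the context cache.
    return batch_size * (context_len + decoded_len)


# Hypothetical workload: 16 samples drawn from one 8192-token context,
# each with 256 tokens decoded so far.
baseline = kv_positions_read(16, 8192, 256, shared_context=False)
bifurcated = kv_positions_read(16, 8192, 256, shared_context=True)
ratio = baseline / bifurcated

print(baseline, bifurcated, ratio)
```

Under these assumptions the shared-context count grows with the batch size only through the decoded tokens, which is why the benefit is largest when the context is long relative to the generated text and many samples are drawn per prompt.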

B.2. Supporting Long Context Requires IO-Efficient Attention

As language models become general purpose and highly capable, the demand for them to handle longer context sequences has grown significantly. There is an ongoing focus on models that can handle ever longer context sequences (Bulatov et al., 2023; OpenAI, 2023; Team, 2023). As of today, GPT-4 (OpenAI, 2023) supports a context length of 32k tokens, MPT-7B (Team, 2023) extends it to 64k, and Anthropic’s Claude [3] supports input lengths as long as 100k. Most recently, Bulatov et al. (2023) proposed a 1M-token input context length for transformers. These models push the boundaries of context understanding and generation capabilities, enabling more comprehensive discourse understanding and contextually informed responses.





This trend is driven by the need for comprehensive discourse understanding in applications like Retrieval-Augmented Generation (RAG), as well as many complex prompting methods. Applications such as RAG (Guu et al., 2020; Izacard et al., 2022; Menick et al., 2022; Zhen et al., 2022) retrieve extensive passages or documents from external corpora, providing rich and grounded context for generating responses. Additionally, models like Toolformer (Schick et al., 2023) and WebGPT (Nakano et al., 2021) leverage external tools, such as APIs and search engines, to expand the context and enhance generation.





Long context is disproportionately expensive for transformer-family models because, for vanilla self-attention, both memory and time complexity are quadratic in the sequence length. To handle longer context sequences effectively, optimizing memory I/O and reducing computational overhead are critical. Currently, the dominant approaches to this challenge make the attention computation less expensive. Beltagy et al. (2020) proposed sparsifying self-attention using various attention patterns. Wang et al. (2020) explore low-rank approximation of self-attention. In addition to these compute-bound improvements, advancements in memory-efficient attention mechanisms and techniques for reducing memory I/O will continue to propel the field forward, facilitating the handling of longer context sequences in language models. FlashAttention (Dao et al., 2022) was proposed to speed up self-attention and reduce the memory footprint without any approximation. It leverages a fused kernel for the matrix multiplication and softmax operations, which greatly reduces memory I/O during training.
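The quadratic growth mentioned above can be illustrated with a small arithmetic sketch. It estimates the size of the attention score matrix for a single layer if that matrix were fully materialized, as in a naive implementation of vanilla self-attention. The defaults of 32 heads and 2 bytes per value (16-bit floats) are assumptions chosen for illustration, not figures from the paper, and the helper name `score_matrix_bytes` is hypothetical.

```python
def score_matrix_bytes(seq_len, num_heads=32, bytes_per_value=2):
    """Bytes needed to materialize the attention scores of one layer.

    Vanilla self-attention forms a seq_len x seq_len score matrix per head,
    so the size grows with the square of the sequence length.
    num_heads and bytes_per_value are illustrative assumptions.
    """
    return seq_len * seq_len * num_heads * bytes_per_value


GIB = 2 ** 30

short_ctx = score_matrix_bytes(4096)    # 4k-token sequence
long_ctx = score_matrix_bytes(32768)    # 32k-token sequence

print(short_ctx / GIB, long_ctx / GIB, long_ctx / short_ctx)
```

Under these assumptions, an 8x longer sequence needs 64x more memory for the scores, which is the kind of cost that sparse, low-rank, and fused-kernel approaches aim to avoid.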





