vAttention: Contiguous KV-Cache for Faster, Simpler LLM Inference

June 11th, 2025