Table of Links Abstract and 1 Introduction 2 Background and 2.1 Transformer-Based Large Language Models 2.2 LLM Service & Autoregressive Generation 2.3 Batching Techniques for LLMs 3 Memory Challenges in LLM Serving 3.1 Memory Management in Existing Systems 4 Method and 4.1 PagedAttention 4.2 KV Cache Manager 4.3 Decoding with PagedAttention and vLLM 4.4 Application to Other Decoding Scenarios 4.5 Scheduling and Preemption 4.6 Distributed Execution 5 Implementation 6 Evaluation and 6.1 Experimental Setup 6.2 Basic Sampling 6.3 Parallel Sampling and Beam Search 6.4 Shared prefix 6.5 Chatbot 7 Ablation Studies 8 Discussion 9 Related Work 10 Conclusion, Acknowledgement and References 4.3 Decoding with PagedAttention and vLLM Next, we walk through an example, as in Fig. 6, to demonstrate how vLLM executes PagedAttention and manages the memory during the decoding process of a single input sequence: 1 As in OS’s virtual memory, vLLM does not require reserving the memory for the maximum possible generated sequence length initially. Instead, it reserves only the necessary KV blocks to accommodate the KV cache generated during prompt computation. In this case, The prompt has 7 tokens, so vLLM maps the first 2 logical KV blocks (0 and 1) to 2 physical KV blocks (7 and 1, respectively). In the prefill step, vLLM generates the KV cache of the prompts and the first output token with a conventional self-attention algorithm (e.g., [13]). vLLM then stores the KV cache of the first 4 tokens in logical block 0 and the following 3 tokens in logical block 1. The remaining slot is reserved for the subsequent autoregressive generation phase. 2 In the first autoregressive decoding step, vLLM generates the new token with the PagedAttention algorithm on physical blocks 7 and 1. Since one slot remains available in the last logical block, the newly generated KV cache is stored there, and the block table’s #filled record is updated. 3 At the second decoding step, as the last logical block is full, vLLM stores the newly generated KV cache in a new logical block; vLLM allocates a new physical block (physical block 3) for it and stores this mapping in the block table. Globally, for each decoding iteration, vLLM first selects a set of candidate sequences for batching (more in §4.5), and allocates the physical blocks for the newly required logical blocks. Then, vLLM concatenates all the input tokens of the current iteration (i.e., all tokens for prompt phase requests and the latest tokens for generation phase requests) as one sequence and feeds it into the LLM. During LLM’s computation, vLLM uses the PagedAttention kernel to access the previous KV cache stored in the form of logical KV blocks and saves the newly generated KV cache into the physical KV blocks. Storing multiple tokens within a KV block (block size > 1) enables the PagedAttention kernel to process the KV cache across more positions in parallel, thus increasing the hardware utilization and reducing latency. However, a larger block size also increases memory fragmentation. We study the effect of block size in §7.2. Again, vLLM dynamically assigns new physical blocks to logical blocks as more tokens and their KV cache are generated. As all the blocks are filled from left to right and a new physical block is only allocated when all previous blocks are full, vLLM limits all the memory wastes for a request within one block, so it can effectively utilize all the memory, as shown in Fig. 2. This allows more requests to fit into memory for batching—hence improving the throughput. Once a request finishes its generation, its KV blocks can be freed to store the KV cache of other requests. In Fig. 7, we show an example of vLLM managing the memory for two sequences. The logical blocks of the two sequences are mapped to different physical blocks within the space reserved by the block engine in GPU workers. The neighboring logical blocks of both sequences do not need to be contiguous in physical GPU memory and the space of physical blocks can be effectively utilized by both sequences. This paper is available on arxiv under CC BY 4.0 DEED license. Authors:
(1) Woosuk Kwon, UC Berkeley with Equal contribution;
(2) Zhuohan Li, UC Berkeley with Equal contribution;
(3) Siyuan Zhuang, UC Berkeley;
(4) Ying Sheng, UC Berkeley and Stanford University;
(5) Lianmin Zheng, UC Berkeley;
(6) Cody Hao Yu, Independent Researcher;
(7) Cody Hao Yu, Independent Researcher;
(8) Joseph E. Gonzalez, UC Berkeley;
(9) Hao Zhang, UC San Diego;
(10) Ion Stoica, UC Berkeley. Table of Links Abstract and 1 Introduction Abstract and 1 Introduction 2 Background and 2.1 Transformer-Based Large Language Models 2 Background and 2.1 Transformer-Based Large Language Models 2.2 LLM Service & Autoregressive Generation 2.2 LLM Service & Autoregressive Generation 2.3 Batching Techniques for LLMs 2.3 Batching Techniques for LLMs 3 Memory Challenges in LLM Serving 3 Memory Challenges in LLM Serving 3.1 Memory Management in Existing Systems 3.1 Memory Management in Existing Systems 4 Method and 4.1 PagedAttention 4 Method and 4.1 PagedAttention 4.2 KV Cache Manager 4.2 KV Cache Manager 4.3 Decoding with PagedAttention and vLLM 4.3 Decoding with PagedAttention and vLLM 4.4 Application to Other Decoding Scenarios 4.4 Application to Other Decoding Scenarios 4.5 Scheduling and Preemption 4.5 Scheduling and Preemption 4.6 Distributed Execution 4.6 Distributed Execution 5 Implementation 5 Implementation 6 Evaluation and 6.1 Experimental Setup 6 Evaluation and 6.1 Experimental Setup 6.2 Basic Sampling 6.2 Basic Sampling 6.3 Parallel Sampling and Beam Search 6.3 Parallel Sampling and Beam Search 6.4 Shared prefix 6.4 Shared prefix 6.5 Chatbot 6.5 Chatbot 7 Ablation Studies 7 Ablation Studies 8 Discussion 8 Discussion 9 Related Work 9 Related Work 10 Conclusion, Acknowledgement and References 10 Conclusion, Acknowledgement and References 4.3 Decoding with PagedAttention and vLLM Next, we walk through an example, as in Fig. 6, to demonstrate how vLLM executes PagedAttention and manages the memory during the decoding process of a single input sequence: 1 As in OS’s virtual memory, vLLM does not require reserving the memory for the maximum possible generated sequence length initially. Instead, it reserves only the necessary KV blocks to accommodate the KV cache generated during prompt computation. In this case, The prompt has 7 tokens, so vLLM maps the first 2 logical KV blocks (0 and 1) to 2 physical KV blocks (7 and 1, respectively). In the prefill step, vLLM generates the KV cache of the prompts and the first output token with a conventional self-attention algorithm (e.g., [13]). vLLM then stores the KV cache of the first 4 tokens in logical block 0 and the following 3 tokens in logical block 1. The remaining slot is reserved for the subsequent autoregressive generation phase. 2 In the first autoregressive decoding step, vLLM generates the new token with the PagedAttention algorithm on physical blocks 7 and 1. Since one slot remains available in the last logical block, the newly generated KV cache is stored there, and the block table’s #filled record is updated. 3 At the second decoding step, as the last logical block is full, vLLM stores the newly generated KV cache in a new logical block; vLLM allocates a new physical block (physical block 3) for it and stores this mapping in the block table. Globally, for each decoding iteration, vLLM first selects a set of candidate sequences for batching (more in §4.5), and allocates the physical blocks for the newly required logical blocks. Then, vLLM concatenates all the input tokens of the current iteration (i.e., all tokens for prompt phase requests and the latest tokens for generation phase requests) as one sequence and feeds it into the LLM. During LLM’s computation, vLLM uses the PagedAttention kernel to access the previous KV cache stored in the form of logical KV blocks and saves the newly generated KV cache into the physical KV blocks. Storing multiple tokens within a KV block (block size > 1) enables the PagedAttention kernel to process the KV cache across more positions in parallel, thus increasing the hardware utilization and reducing latency. However, a larger block size also increases memory fragmentation. We study the effect of block size in §7.2. Again, vLLM dynamically assigns new physical blocks to logical blocks as more tokens and their KV cache are generated. As all the blocks are filled from left to right and a new physical block is only allocated when all previous blocks are full, vLLM limits all the memory wastes for a request within one block, so it can effectively utilize all the memory, as shown in Fig. 2. This allows more requests to fit into memory for batching—hence improving the throughput. Once a request finishes its generation, its KV blocks can be freed to store the KV cache of other requests. In Fig. 7, we show an example of vLLM managing the memory for two sequences. The logical blocks of the two sequences are mapped to different physical blocks within the space reserved by the block engine in GPU workers. The neighboring logical blocks of both sequences do not need to be contiguous in physical GPU memory and the space of physical blocks can be effectively utilized by both sequences. This paper is available on arxiv under CC BY 4.0 DEED license. This paper is available on arxiv under CC BY 4.0 DEED license. available on arxiv Authors: (1) Woosuk Kwon, UC Berkeley with Equal contribution; (2) Zhuohan Li, UC Berkeley with Equal contribution; (3) Siyuan Zhuang, UC Berkeley; (4) Ying Sheng, UC Berkeley and Stanford University; (5) Lianmin Zheng, UC Berkeley; (6) Cody Hao Yu, Independent Researcher; (7) Cody Hao Yu, Independent Researcher; (8) Joseph E. Gonzalez, UC Berkeley; (9) Hao Zhang, UC San Diego; (10) Ion Stoica, UC Berkeley. Authors: Authors: (1) Woosuk Kwon, UC Berkeley with Equal contribution; (2) Zhuohan Li, UC Berkeley with Equal contribution; (3) Siyuan Zhuang, UC Berkeley; (4) Ying Sheng, UC Berkeley and Stanford University; (5) Lianmin Zheng, UC Berkeley; (6) Cody Hao Yu, Independent Researcher; (7) Cody Hao Yu, Independent Researcher; (8) Joseph E. Gonzalez, UC Berkeley; (9) Hao Zhang, UC San Diego; (10) Ion Stoica, UC Berkeley.

Part of HackerNoon's growing list of open-source research papers, promoting free access to academic material.

Decoding With PagedAttention and vLLM

About Author

Comments

TOPICS

THIS ARTICLE WAS FEATURED IN

Related Stories

102 Languages, One Model: The Multimodal AI Breakthrough You Need to Know

How to Convert Base64 Data to Files on Linux and Mac OS

WTF is a Codec?

How to Convert a String to Byte Array in C#

Memory Challenges in LLM Serving: The Obstacles to Overcome

Why Decoding Uniswap V4 is Harder Than V2 & V3

102 Languages, One Model: The Multimodal AI Breakthrough You Need to Know

How to Convert Base64 Data to Files on Linux and Mac OS

WTF is a Codec?

How to Convert a String to Byte Array in C#

Memory Challenges in LLM Serving: The Obstacles to Overcome

Why Decoding Uniswap V4 is Harder Than V2 & V3

Light-Mode

Classic

Newspaper

Minty

Dark-Mode

Neon Noir

Minty

HN StartUps