2 Background and 2.1 Transformer-Based Large Language Models
2.2 LLM Service & Autoregressive Generation
2.3 Batching Techniques for LLMs
3 Memory Challenges in LLM Serving
3.1 Memory Management in Existing Systems
4 Method and 4.1 PagedAttention
4.3 Decoding with PagedAttention and vLLM
4.4 Application to Other Decoding Scenarios
6 Evaluation and 6.1 Experimental Setup
6.3 Parallel Sampling and Beam Search
10 Conclusion, Acknowledgement and References
In this work, we develop a new attention algorithm, PagedAttention, and build an LLM serving engine, vLLM, to tackle the challenges outlined in §3. The architecture of vLLM is shown in Fig. 4. vLLM adopts a centralized scheduler to coordinate the execution of distributed GPU workers. The KV cache manager effectively manages the KV cache in a paged fashion, enabled by PagedAttention. Specifically, the KV cache manager manages the physical KV cache memory on the GPU workers through the instructions sent by the centralized scheduler.
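As a rough sketch of this control flow (the class and method names below are hypothetical, not vLLM's actual API), a centralized scheduler can drive a KV cache manager that hands out physical blocks, and the resulting block tables are what the distributed GPU workers consult when running PagedAttention:

```python
# Minimal sketch, assuming a single node and a fixed-size pool of physical KV blocks.
# Names (KVCacheManager, Scheduler, allocate, step) are illustrative, not vLLM's API.
from dataclasses import dataclass, field

@dataclass
class KVCacheManager:
    free_blocks: list = field(default_factory=lambda: list(range(1024)))
    block_tables: dict = field(default_factory=dict)  # seq_id -> [physical block ids]

    def allocate(self, seq_id: int, num_blocks: int) -> list:
        # Hand out arbitrary free physical blocks; they need not be contiguous.
        blocks = [self.free_blocks.pop() for _ in range(num_blocks)]
        self.block_tables.setdefault(seq_id, []).extend(blocks)
        return self.block_tables[seq_id]

@dataclass
class Scheduler:
    manager: KVCacheManager

    def step(self, requests):
        # requests: list of (seq_id, new_blocks_needed) for this decoding iteration.
        # The returned block tables are shipped to the GPU workers as instructions.
        return {seq_id: self.manager.allocate(seq_id, n) for seq_id, n in requests}

# Two sequences each need one new KV block for the next decoding step.
scheduler = Scheduler(KVCacheManager())
print(scheduler.step([(0, 1), (1, 1)]))
```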
Next, we describe the PagedAttention algorithm in §4.1. With that, we show the design of the KV cache manager in §4.2 and how it facilitates PagedAttention in §4.3. Then, we show how this design enables effective memory management for various decoding methods (§4.4) and handles variable-length input and output sequences (§4.5). Finally, we show how the system design of vLLM works in a distributed setting (§4.6).
To address the memory challenges in §3, we introduce PagedAttention, an attention algorithm inspired by the classic idea of paging [25] in operating systems. Unlike the traditional attention algorithms, PagedAttention allows storing continuous keys and values in non-contiguous memory space. Specifically, PagedAttention partitions the KV cache of each sequence into KV blocks. Each block contains the key and value vectors for a fixed number of tokens,[1] which we denote as the KV block size ($B$). Denote the key block $K_j = (k_{(j-1)B+1}, \ldots, k_{jB})$ and the value block $V_j = (v_{(j-1)B+1}, \ldots, v_{jB})$. The attention computation in Eq. 4 can be transformed into the following block-wise computation:

$$A_{ij} = \frac{\exp(q_i^\top K_j / \sqrt{d})}{\sum_{t=1}^{\lceil i/B \rceil} \exp(q_i^\top K_t / \sqrt{d})\,\mathbf{1}}, \qquad o_i = \sum_{j=1}^{\lceil i/B \rceil} V_j A_{ij}^\top,$$

where $A_{ij} = (a_{i,(j-1)B+1}, \ldots, a_{i,jB})$ is the row vector of attention scores on the $j$-th KV block.
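To make the block-wise formulation concrete, here is a minimal NumPy sketch (not vLLM's CUDA kernel; the `blockwise_attention` helper, shapes, and block layout are illustrative assumptions). It computes attention one KV block at a time and checks that the result matches ordinary attention over the concatenated KV cache; each block is stored as a (B, d) array, and only full, causally valid blocks are passed in.

```python
import numpy as np

def blockwise_attention(q, key_blocks, value_blocks):
    """q: (d,) query vector; key_blocks/value_blocks: lists of (B, d) arrays."""
    d = q.shape[0]
    # Unnormalized scores exp(q . k / sqrt(d)), one (B,) vector per KV block.
    exp_scores = [np.exp(K @ q / np.sqrt(d)) for K in key_blocks]
    denom = sum(s.sum() for s in exp_scores)  # shared softmax denominator
    # o_i = sum over blocks of V_j^T A_ij, with A_ij the normalized scores of block j.
    return sum(V.T @ (s / denom) for s, V in zip(exp_scores, value_blocks))

# Equivalence check against ordinary attention over the concatenated KV cache.
rng = np.random.default_rng(0)
B, d, n_blocks = 4, 8, 3
Ks = [rng.normal(size=(B, d)) for _ in range(n_blocks)]
Vs = [rng.normal(size=(B, d)) for _ in range(n_blocks)]
q = rng.normal(size=d)

K, V = np.vstack(Ks), np.vstack(Vs)
scores = np.exp(K @ q / np.sqrt(d))
scores /= scores.sum()
assert np.allclose(blockwise_attention(q, Ks, Vs), V.T @ scores)
```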
During the attention computation, the PagedAttention kernel identifies and fetches different KV blocks separately. We show an example of PagedAttention in Fig. 5: the key and value vectors are spread across three blocks, and the three blocks are not contiguous in physical memory. Each time, the kernel multiplies the query vector $q_i$ of the query token ("forth") with the key vectors $K_j$ in a block (e.g., the key vectors of "Four score and seven" for block 0) to compute the attention score $A_{ij}$, and later multiplies $A_{ij}$ with the value vectors $V_j$ in the block to derive the final attention output $o_i$.
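The fetch step can be pictured with a short sketch: a block table maps a sequence's logical block numbers to arbitrary slots in a preallocated physical pool, and each block is gathered independently before the block-wise computation above. The pool layout and names here are assumptions for illustration, not vLLM's actual data structures.

```python
import numpy as np

B, d, num_physical_blocks = 4, 8, 16
key_pool = np.zeros((num_physical_blocks, B, d))    # preallocated physical key blocks
value_pool = np.zeros((num_physical_blocks, B, d))  # preallocated physical value blocks

# Logical blocks 0, 1, 2 of one sequence happen to live in physical blocks 7, 1, 3.
block_table = [7, 1, 3]

def gather_blocks(pool, table):
    # Each block is fetched separately; no contiguous copy of the KV cache is needed.
    return [pool[p] for p in table]

key_blocks = gather_blocks(key_pool, block_table)
value_blocks = gather_blocks(value_pool, block_table)
# ...then attention is computed block by block, as in the previous sketch.
```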
In summary, the PagedAttention algorithm allows the KV blocks to be stored in non-contiguous physical memory, which enables more flexible paged memory management in vLLM.
This paper is available on arXiv under a CC BY 4.0 DEED license.
[1] In the Transformer, each token has a set of key and value vectors across layers and attention heads within a layer. All the key and value vectors can be managed together within a single KV block, or the key and value vectors at different heads and layers can each have a separate block and be managed in separate block tables. The two designs have no performance difference, and we choose the second one for ease of implementation.
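The two layouts described in this footnote can be summarized with a couple of illustrative shapes (the dimensions and names below are assumptions, not vLLM's actual tensor layout):

```python
# Hypothetical model configuration for illustration only.
num_layers, num_heads, head_dim, B = 32, 32, 128, 16

# Design 1: one KV block packs the keys and values of B tokens for every layer
# and head, so a single block table per sequence is enough.
shared_block_shape = (num_layers, 2, num_heads, B, head_dim)  # 2 = key / value

# Design 2 (chosen in the paper for ease of implementation): every (layer, head)
# pair has its own blocks, each tracked by its own block table.
per_head_block_shape = (2, B, head_dim)
num_block_tables_per_seq = num_layers * num_heads
```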
Authors:
(1) Woosuk Kwon, UC Berkeley (equal contribution);
(2) Zhuohan Li, UC Berkeley (equal contribution);
(3) Siyuan Zhuang, UC Berkeley;
(4) Ying Sheng, UC Berkeley and Stanford University;
(5) Lianmin Zheng, UC Berkeley;
(6) Cody Hao Yu, Independent Researcher;
(7) Joseph E. Gonzalez, UC Berkeley;
(8) Hao Zhang, UC San Diego;
(9) Ion Stoica, UC Berkeley.