2 Background and 2.1 Transformer-Based Large Language Models
2.2 LLM Service & Autoregressive Generation
2.3 Batching Techniques for LLMs
3 Memory Challenges in LLM Serving
3.1 Memory Management in Existing Systems
4 Method and 4.1 PagedAttention
4.3 Decoding with PagedAttention and vLLM
4.4 Application to Other Decoding Scenarios
6 Evaluation and 6.1 Experimental Setup
6.3 Parallel Sampling and Beam Search
10 Conclusion, Acknowledgement and References
Once trained, LLMs are often deployed as a conditional generation service (e.g., completion API [34] or chatbot [19, 35]). A request to an LLM service provides a list of input prompt tokens (๐ฅ1, . . . , ๐ฅ๐), and the LLM service generates a list of output tokens (๐ฅ๐+1, . . . , ๐ฅ๐+๐ ) according to Eq. 1. We refer to the concatenation of the prompt and output lists as sequence.
Due to the decomposition in Eq. 1, the LLM can only sample and generate new tokens one by one, and the generation process of each new token depends on all the previous tokens in that sequence, specifically their key and value vectors. In this sequential generation process, the key and value vectors of existing tokens are often cached for generating future tokens, known as KV cache. Note that the KV cache of one token depends on all its previous tokens. This means that the KV cache of the same token appearing at different positions in a sequence will be different.
Given a request prompt, the generation computation in the LLM service can be decomposed into two phases:
The prompt phase takes the whole user prompt (๐ฅ1, . . . , ๐ฅ๐) as input and computes the probability of the first new token ๐ (๐ฅ๐+1 | ๐ฅ1, . . . , ๐ฅ๐). During this process, also generates the key vectors ๐1, . . . , ๐๐ and value vectors ๐ฃ1, . . . , ๐ฃ๐. Since prompt tokens ๐ฅ1, . . . , ๐ฅ๐ are all known, the computation of the prompt phase can be parallelized using matrixmatrix multiplication operations. Therefore, this phase can efficiently use the parallelism inherent in GPUs.
The autoregressive generation phase generates the remaining new tokens sequentially. At iteration ๐ก, the model takes one token ๐ฅ๐+๐ก as input and computes the probability ๐ (๐ฅ๐+๐ก+1 | ๐ฅ1, . . . , ๐ฅ๐+๐ก) with the key vectors ๐1, . . . , ๐๐+๐ก and value vectors๐ฃ1, . . . , ๐ฃ๐+๐ก . Note that the key and value vectors at positions 1 to ๐ + ๐ก โ 1 are cached at previous iterations, only the new key and value vector ๐๐+๐ก and ๐ฃ๐+๐ก are computed at this iteration. This phase completes either when the sequence reaches a maximum length (specified by users or limited by LLMs) or when an end-of-sequence (<eos>) token is emitted. The computation at different iterations cannot be parallelized due to the data dependency and often uses matrix-vector multiplication, which is less efficient. As a result, this phase severely underutilizes GPU computation and becomes memory-bound, being responsible for most portion of the latency of a single request.
This paper is available on arxiv under CC BY 4.0 DEED license.
Authors:
(1) Woosuk Kwon, UC Berkeley with Equal contribution;
(2) Zhuohan Li, UC Berkeley with Equal contribution;
(3) Siyuan Zhuang, UC Berkeley;
(4) Ying Sheng, UC Berkeley and Stanford University;
(5) Lianmin Zheng, UC Berkeley;
(6) Cody Hao Yu, Independent Researcher;
(7) Cody Hao Yu, Independent Researcher;
(8) Joseph E. Gonzalez, UC Berkeley;
(9) Hao Zhang, UC San Diego;
(10) Ion Stoica, UC Berkeley.