250 reads

LLM Service & Autoregressive Generation: What This Means

by Writings, Papers and Blogs on Text ModelsDecember 14th, 2024

Too Long; Didn't Read

Once trained, LLMs are often deployed as a conditional generation service (e.g., completion API [34] or chatbot.

featured image - LLM Service & Autoregressive Generation: What This Means

‘time going backwards’ Image created by HackerNoon AI Image Generator

Table of Links

Abstract and 1 Introduction

2 Background and 2.1 Transformer-Based Large Language Models

2.2 LLM Service & Autoregressive Generation

2.3 Batching Techniques for LLMs

3 Memory Challenges in LLM Serving

3.1 Memory Management in Existing Systems

4 Method and 4.1 PagedAttention

4.2 KV Cache Manager

4.3 Decoding with PagedAttention and vLLM

4.4 Application to Other Decoding Scenarios

4.5 Scheduling and Preemption

4.6 Distributed Execution

5 Implementation

6 Evaluation and 6.1 Experimental Setup

6.2 Basic Sampling

6.3 Parallel Sampling and Beam Search

2.2 LLM Service & Autoregressive Generation

Once trained, LLMs are often deployed as a conditional generation service (e.g., completion API [34] or chatbot [19, 35]). A request to an LLM service provides a list of input prompt tokens (𝑥1, . . . , 𝑥𝑛), and the LLM service generates a list of output tokens (𝑥𝑛+1, . . . , 𝑥𝑛+𝑇 ) according to Eq. 1. We refer to the concatenation of the prompt and output lists as sequence.

Due to the decomposition in Eq. 1, the LLM can only sample and generate new tokens one by one, and the generation process of each new token depends on all the previous tokens in that sequence, specifically their key and value vectors. In this sequential generation process, the key and value vectors of existing tokens are often cached for generating future tokens, known as KV cache. Note that the KV cache of one token depends on all its previous tokens. This means that the KV cache of the same token appearing at different positions in a sequence will be different.

Given a request prompt, the generation computation in the LLM service can be decomposed into two phases:

The prompt phase takes the whole user prompt (𝑥1, . . . , 𝑥𝑛) as input and computes the probability of the first new token 𝑃 (𝑥𝑛+1 | 𝑥1, . . . , 𝑥𝑛). During this process, also generates the key vectors 𝑘1, . . . , 𝑘𝑛 and value vectors 𝑣1, . . . , 𝑣𝑛. Since prompt tokens 𝑥1, . . . , 𝑥𝑛 are all known, the computation of the prompt phase can be parallelized using matrixmatrix multiplication operations. Therefore, this phase can efficiently use the parallelism inherent in GPUs.

The autoregressive generation phase generates the remaining new tokens sequentially. At iteration 𝑡, the model takes one token 𝑥𝑛+𝑡 as input and computes the probability 𝑃 (𝑥𝑛+𝑡+1 | 𝑥1, . . . , 𝑥𝑛+𝑡) with the key vectors 𝑘1, . . . , 𝑘𝑛+𝑡 and value vectors𝑣1, . . . , 𝑣𝑛+𝑡 . Note that the key and value vectors at positions 1 to 𝑛 + 𝑡 − 1 are cached at previous iterations, only the new key and value vector 𝑘𝑛+𝑡 and 𝑣𝑛+𝑡 are computed at this iteration. This phase completes either when the sequence reaches a maximum length (specified by users or limited by LLMs) or when an end-of-sequence (<eos>) token is emitted. The computation at different iterations cannot be parallelized due to the data dependency and often uses matrix-vector multiplication, which is less efficient. As a result, this phase severely underutilizes GPU computation and becomes memory-bound, being responsible for most portion of the latency of a single request.

This paper is available on arxiv under CC BY 4.0 DEED license.

Authors:

(1) Woosuk Kwon, UC Berkeley with Equal contribution;

(2) Zhuohan Li, UC Berkeley with Equal contribution;

(3) Siyuan Zhuang, UC Berkeley;

(4) Ying Sheng, UC Berkeley and Stanford University;

(5) Lianmin Zheng, UC Berkeley;

(6) Cody Hao Yu, Independent Researcher;

(7) Cody Hao Yu, Independent Researcher;

(8) Joseph E. Gonzalez, UC Berkeley;

(9) Hao Zhang, UC San Diego;

(10) Ion Stoica, UC Berkeley.

L O A D I N G
. . . comments & more!

About Author

Writings, Papers and Blogs on Text Models@textmodels

We publish the best academic papers on rule-based techniques, LLMs, & the generation of text that resembles human text.

Read my stories About @textmodels

TOPICS

machine-learning #llms #llm-service #autoregressive-generation #end-of-sequence #matrix-multiplication #pagedattention #generation-computation #gpu-computation

THIS ARTICLE WAS FEATURED IN...

Join HackerNoon

Latest technology trends. Customized Experience. Curated Stories. Publish Your Ideas

LLM Service & Autoregressive Generation: What This Means

Too Long; Didn't Read

Table of Links

2.2 LLM Service & Autoregressive Generation

About Author

TOPICS

THIS ARTICLE WAS FEATURED IN...

RELATED STORIES