6 Evaluation and 6.1 Experimental Setup
In this section, we evaluate the performance of vLLM under a variety of workloads.
Model and server configurations. We use OPT [62] models with 13B, 66B, and 175B parameters and LLaMA [52] with 13B parameters for our evaluation. 13B and 66B are popular sizes for LLMs, as shown on an LLM leaderboard [38], while 175B is the size of the famous GPT-3 [5] model. For all of our experiments, we use A2 instances with NVIDIA A100 GPUs on Google Cloud Platform. The detailed model sizes and server configurations are shown in Table 1.
Workloads. We synthesize workloads based on the ShareGPT [51] and Alpaca [50] datasets, which contain input and output texts of real LLM services. The ShareGPT dataset is a collection of user-shared conversations with ChatGPT [35]. The Alpaca dataset is an instruction dataset generated by GPT-3.5 with self-instruct [57]. We tokenize the datasets and use their input and output lengths to synthesize client requests. As shown in Fig. 11, the ShareGPT dataset has 8.4× longer input prompts and 5.8× longer outputs on average than the Alpaca dataset, with higher variance. Since these datasets do not include timestamps, we generate request arrival times using a Poisson distribution with different request rates.
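As a minimal sketch of this arrival-time synthesis (not the authors' code; the function name and NumPy usage are our own illustrative choices), Poisson arrivals can be generated by drawing exponential inter-arrival gaps with mean 1/rate and accumulating them:

```python
import numpy as np

def synthesize_arrivals(num_requests: int, request_rate: float, seed: int = 0):
    """Generate request arrival timestamps (seconds) for a Poisson process.

    For a Poisson arrival process, inter-arrival gaps are i.i.d.
    exponential with mean 1 / request_rate.
    """
    rng = np.random.default_rng(seed)
    gaps = rng.exponential(scale=1.0 / request_rate, size=num_requests)
    return np.cumsum(gaps)  # absolute arrival times

# Example: 100 requests at 2 requests per second.
arrival_times = synthesize_arrivals(num_requests=100, request_rate=2.0)
```

Sweeping the request rate then produces the family of workloads used to stress each serving system.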
Baseline 1: FasterTransformer. FasterTransformer [31] is a distributed inference engine highly optimized for latency. As FasterTransformer does not have its own scheduler, we implement a custom scheduler with a dynamic batching mechanism similar to that of existing serving systems such as Triton [30]. Specifically, we set the maximum batch size 𝐵 as large as the GPU memory capacity allows for each experiment. The scheduler takes up to 𝐵 of the earliest-arrived requests and sends the batch to FasterTransformer for processing.
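The batching behavior described above might look roughly like the following sketch; the `engine.generate` call is a hypothetical stand-in for an invocation of FasterTransformer, not its real API:

```python
from collections import deque

def run_dynamic_batching(requests, engine, max_batch_size):
    """Sketch of a Triton-style dynamic batcher.

    Repeatedly take up to `max_batch_size` of the earliest-arrived waiting
    requests and run them as one batch to completion before forming the
    next batch. `requests` is assumed to be ordered by arrival time.
    """
    waiting = deque(requests)
    while waiting:
        batch = [waiting.popleft()
                 for _ in range(min(max_batch_size, len(waiting)))]
        engine.generate(batch)  # blocks until every request in the batch finishes
```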
Baseline 2: Orca. Orca [60] is a state-of-the-art LLM serving system optimized for throughput. Since Orca is not publicly available, we implement our own version. We assume Orca uses the buddy allocation algorithm to determine the memory addresses at which to store the KV cache. We implement three versions of Orca based on how much space each over-reserves for request outputs (a sketch of the three reservation policies follows the list):
• Orca (Oracle). We assume the system knows in advance the lengths of the outputs that will actually be generated for the requests. This gives the upper-bound performance of Orca, which is infeasible to achieve in practice.
• Orca (Pow2). We assume the system over-reserves the space for outputs by at most 2×. For example, if the true output length is 25, it reserves 32 positions for outputs.
• Orca (Max). We assume the system always reserves the space up to the maximum sequence length of the model, i.e., 2048 tokens.
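Under our reading of the three variants, the number of output slots reserved per request could be sketched as follows; the function and its signature are illustrative, not Orca's actual interface, and Orca (Max) is simplified here to ignore the prompt length:

```python
def reserved_output_slots(policy: str, true_output_len: int,
                          max_seq_len: int = 2048) -> int:
    """How many output positions each Orca variant reserves per request (sketch)."""
    if policy == "oracle":
        # Assumes the true output length is known in advance (infeasible upper bound).
        return true_output_len
    if policy == "pow2":
        # Round up to the next power of two, e.g. 25 -> 32 (at most 2x over-reservation).
        return 1 << (true_output_len - 1).bit_length()
    if policy == "max":
        # Always reserve up to the model's maximum sequence length.
        return max_seq_len
    raise ValueError(f"unknown policy: {policy}")

# reserved_output_slots("pow2", 25) == 32
```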
Key metrics. We focus on serving throughput. Specifically, using the workloads with different request rates, we measure the normalized latency of each system: the mean of every request’s end-to-end latency divided by its output length, as in Orca [60]. A high-throughput serving system should sustain low normalized latency at high request rates. For most experiments, we evaluate the systems with 1-hour traces. As an exception, we use 15-minute traces for the OPT-175B model due to cost constraints.
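Concretely, the metric can be computed as in the following sketch (variable names are our own):

```python
def normalized_latency(e2e_latencies_s, output_lens):
    """Mean over requests of (end-to-end latency / number of generated tokens),
    i.e., average seconds per output token."""
    per_request = [lat / n_tokens
                   for lat, n_tokens in zip(e2e_latencies_s, output_lens)]
    return sum(per_request) / len(per_request)

# Example: two requests, 10 s for 100 tokens and 30 s for 200 tokens
# -> mean(0.10, 0.15) = 0.125 s per output token.
print(normalized_latency([10.0, 30.0], [100, 200]))  # 0.125
```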
This paper is available on arXiv under a CC BY 4.0 DEED license.
Authors:
(1) Woosuk Kwon, UC Berkeley (equal contribution);
(2) Zhuohan Li, UC Berkeley (equal contribution);
(3) Siyuan Zhuang, UC Berkeley;
(4) Ying Sheng, UC Berkeley and Stanford University;
(5) Lianmin Zheng, UC Berkeley;
(6) Cody Hao Yu, Independent Researcher;
(7) Joseph E. Gonzalez, UC Berkeley;
(8) Hao Zhang, UC San Diego;
(9) Ion Stoica, UC Berkeley.