2 Background and 2.1 Transformer-Based Large Language Models
2.2 LLM Service & Autoregressive Generation
2.3 Batching Techniques for LLMs
3 Memory Challenges in LLM Serving
3.1 Memory Management in Existing Systems
4 Method and 4.1 PagedAttention
4.3 Decoding with PagedAttention and vLLM
4.4 Application to Other Decoding Scenarios
6 Evaluation and 6.1 Experimental Setup
6.3 Parallel Sampling and Beam Search
10 Conclusion, Acknowledgement and References
General model serving systems. Model serving has been an active area of research in recent years, with numerous systems proposed to tackle diverse aspects of deep learning model deployment. Clipper [11], TensorFlow Serving [33], Nexus [45], InferLine [10], and Clockwork [20] are some earlier general model serving systems. They study batching, caching, placement, and scheduling for serving single or multiple models. More recently, DVABatch [12] introduces multi-entry multi-exit batching. REEF [21] and Shepherd [61] propose preemption for serving. AlpaServe [28] utilizes model parallelism for statistical multiplexing. However, these general systems fail to take into account the autoregressive property and token state of LLM inference, resulting in missed opportunities for optimization.
Specialized serving systems for transformers. Due to the significance of the transformer architecture, numerous specialized serving systems for it have been developed. These systems utilize GPU kernel optimizations [1, 29, 31, 56], advanced batching mechanisms [14, 60], model parallelism [1, 41, 60], and parameter sharing [64] for efficient serving. Among them, Orca [60] is most relevant to our approach.
Comparison to Orca. The iteration-level scheduling in Orca [60] and PagedAttention in vLLM are complementary techniques: while both systems aim to increase GPU utilization and hence the throughput of LLM serving, Orca achieves this by scheduling and interleaving requests so that more of them can be processed in parallel, whereas vLLM does so by increasing memory utilization so that the working sets of more requests fit into memory. By reducing memory fragmentation and enabling sharing, vLLM runs more requests in a batch in parallel and achieves a 2-4× speedup over Orca. Indeed, fine-grained scheduling and interleaving of requests, as in Orca, makes memory management more challenging, which makes the techniques proposed in vLLM all the more crucial. A toy sketch of this interplay follows.
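To illustrate the interplay, below is a minimal, deliberately simplified sketch of an iteration-level serving loop in which admission into the running batch is gated by free KV-cache capacity, so any memory reclaimed (for example by avoiding fragmentation) directly lets more requests decode in parallel. The Request class, the serve function, and the accounting scheme are illustrative assumptions, not Orca's or vLLM's actual schedulers.

```python
from collections import deque
from dataclasses import dataclass

# Toy iteration-level serving loop (illustrative assumption, not Orca's or
# vLLM's scheduler). For simplicity it reserves each request's worst-case KV
# footprint up front; vLLM instead allocates KV blocks on demand, which is
# precisely why it fits more requests into the same memory.

@dataclass
class Request:
    prompt_len: int
    max_new_tokens: int
    generated: int = 0

    @property
    def kv_budget(self) -> int:
        # Worst-case number of KV-cache token slots this request may need.
        return self.prompt_len + self.max_new_tokens

def serve(waiting: deque, kv_slots_free: int) -> int:
    running, iterations = [], 0
    while waiting or running:
        # Admit waiting requests while their KV budget still fits in memory.
        while waiting and waiting[0].kv_budget <= kv_slots_free:
            req = waiting.popleft()
            kv_slots_free -= req.kv_budget
            running.append(req)
        # One decoding iteration over the whole batch (Orca-style interleaving).
        for req in running:
            req.generated += 1
        # Finished requests leave the batch and release their KV memory.
        for req in [r for r in running if r.generated == r.max_new_tokens]:
            running.remove(req)
            kv_slots_free += req.kv_budget
        iterations += 1
    return iterations

# With a larger effective KV budget, the same workload finishes in fewer
# iterations because more requests run in the batch at once.
make = lambda: deque(Request(prompt_len=48, max_new_tokens=16) for _ in range(8))
print(serve(make(), kv_slots_free=128), serve(make(), kv_slots_free=512))  # 64 16
```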
Memory optimizations. The widening gap between the compute capability and memory capacity of accelerators has made memory a bottleneck for both training and inference. Swapping [23, 42, 55], recomputation [7, 24], and their combination [40] have been used to reduce the peak memory of training. Notably, FlexGen [46] studies how to swap weights and token states for LLM inference with limited GPU memory, but it does not target the online serving setting. OLLA [48] optimizes the lifetime and location of tensors to reduce fragmentation, but it does not perform fine-grained block-level management or target online serving. FlashAttention [13] applies tiling and kernel optimizations to reduce the peak memory of attention computation and its I/O costs. This paper introduces the new idea of block-level memory management in the context of online serving.
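To make "block-level memory management" concrete, here is a minimal sketch in the spirit of PagedAttention: a request's logically contiguous KV cache is split into fixed-size blocks that are mapped, on demand, to physical blocks drawn from a shared free pool, and reclaimed as soon as the request finishes. The class and method names (BlockManager, grow, release, and so on) are assumptions for illustration, not vLLM's actual interfaces.

```python
# Minimal sketch of block-level KV-cache management (illustrative names,
# not vLLM's code): each request's KV cache grows one fixed-size block at a
# time, with physical blocks taken from and returned to a shared free pool.

class BlockManager:
    def __init__(self, num_physical_blocks: int, block_size: int):
        self.block_size = block_size                        # tokens per block
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables: dict[str, list[int]] = {}        # request -> physical block ids

    def _blocks_for(self, num_tokens: int) -> int:
        return -(-num_tokens // self.block_size)             # ceiling division

    def can_grow(self, request_id: str, new_total_tokens: int) -> bool:
        """True if enough free blocks exist to hold the request's new length."""
        have = len(self.block_tables.get(request_id, []))
        return self._blocks_for(new_total_tokens) - have <= len(self.free_blocks)

    def grow(self, request_id: str, new_total_tokens: int) -> None:
        """Map additional physical blocks so the request can store new tokens."""
        table = self.block_tables.setdefault(request_id, [])
        while len(table) < self._blocks_for(new_total_tokens):
            table.append(self.free_blocks.pop())

    def release(self, request_id: str) -> None:
        """Return all of a finished request's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))

# Usage: blocks are claimed only as the sequence actually grows, so at most
# block_size - 1 token slots per request are wasted to internal fragmentation.
mgr = BlockManager(num_physical_blocks=64, block_size=16)
mgr.grow("req-0", new_total_tokens=40)   # prefill: ceil(40/16) = 3 blocks
mgr.grow("req-0", new_total_tokens=41)   # decode: still 3 blocks until a boundary is crossed
mgr.grow("req-0", new_total_tokens=49)   # 49th token crosses a boundary: 4th block allocated
print(len(mgr.block_tables["req-0"]), len(mgr.free_blocks))   # 4 60
mgr.release("req-0")
print(len(mgr.free_blocks))                                   # 64
```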
This paper is available on arXiv under the CC BY 4.0 DEED license.
Authors:
(1) Woosuk Kwon, UC Berkeley, with equal contribution;
(2) Zhuohan Li, UC Berkeley, with equal contribution;
(3) Siyuan Zhuang, UC Berkeley;
(4) Ying Sheng, UC Berkeley and Stanford University;
(5) Lianmin Zheng, UC Berkeley;
(6) Cody Hao Yu, Independent Researcher;
(7) Joseph E. Gonzalez, UC Berkeley;
(8) Hao Zhang, UC San Diego;
(9) Ion Stoica, UC Berkeley.