Table of Links:
2 Background and 2.1 Transformer-Based Large Language Models
2.2 LLM Service & Autoregressive Generation
2.3 Batching Techniques for LLMs
3 Memory Challenges in LLM Serving
3.1 Memory Management in Existing Systems
4 Method and 4.1 PagedAttention
4.3 Decoding with PagedAttention and vLLM
4.4 Application to Other Decoding Scenarios
6 Evaluation and 6.1 Experimental Setup
6.3 Parallel Sampling and Beam Search
10 Conclusion, Acknowledgement and References
A chatbot [8, 19, 35] is one of the most important applications of LLMs. To implement a chatbot, we let the model generate a response by concatenating the chat history and the last user query into a prompt. We synthesize the chat history and user query using the ShareGPT dataset. Due to the limited context length of the OPT-13B model, we truncate the prompt to the last 1024 tokens and let the model generate at most 1024 tokens. We do not store the KV cache across conversation rounds, as doing so would occupy space needed by other requests between rounds.
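To make the setup concrete, the sketch below shows one way such a chatbot request could be assembled from a ShareGPT conversation under the assumptions above (prompt truncated to the last 1024 tokens, at most 1024 generated tokens). The helper names (`build_chatbot_request`, `tokenize`) and the request dictionary are illustrative assumptions, not the paper's or vLLM's API.

```python
# Illustrative sketch (not from the paper): synthesizing one chatbot request
# from a ShareGPT conversation. Helper names and the request format are
# assumptions for exposition only.

MAX_PROMPT_TOKENS = 1024   # prompt truncated to fit OPT-13B's context length
MAX_OUTPUT_TOKENS = 1024   # cap on the number of generated tokens


def build_chatbot_request(conversation, tokenize):
    """conversation: list of utterances; the last entry is the user query."""
    # Concatenate the chat history and the last user query into one prompt.
    prompt_text = "\n".join(conversation)
    token_ids = tokenize(prompt_text)

    # Keep only the last 1024 tokens so the prompt fits the context window.
    token_ids = token_ids[-MAX_PROMPT_TOKENS:]

    # The KV cache is not kept across conversation rounds: each round is
    # submitted as a fresh request built from the truncated prompt.
    return {"prompt_token_ids": token_ids, "max_tokens": MAX_OUTPUT_TOKENS}
```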
Fig. 17 shows that vLLM can sustain request rates 2× higher than the three Orca baselines. Since the ShareGPT dataset contains many long conversations, the input prompts for most requests have 1024 tokens. Because of the buddy allocation algorithm, the Orca baselines reserve space for 1024 output tokens per request, regardless of how they predict the output lengths; for this reason, the three Orca baselines behave similarly. In contrast, vLLM handles the long prompts effectively, as PagedAttention resolves the problem of memory fragmentation and reservation.
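The contrast between the reservation-based baselines and PagedAttention can be illustrated with a small back-of-the-envelope calculation. The numbers below are assumptions chosen for illustration (a 1024-token prompt, a 50-token reply, 16-token KV-cache blocks, vLLM's default block size); they are not measurements from the paper, but they show why reserving the maximum output length wastes far more memory than block-by-block allocation.

```python
# Rough sketch (not from the paper): KV-cache slots allocated per request
# under a reservation-based scheme vs. a paged scheme. All numbers below
# are illustrative assumptions.

import math

PROMPT_LEN = 1024   # tokens in the (truncated) ShareGPT prompt
OUTPUT_LEN = 50     # tokens actually generated in this round
MAX_OUTPUT = 1024   # output space the Orca baselines reserve up front
BLOCK_SIZE = 16     # tokens per KV-cache block in the paged scheme


def reserved_slots(prompt_len, max_output):
    # Reservation-based allocation sets aside contiguous space for the prompt
    # plus the maximum possible output, regardless of the actual output length.
    return prompt_len + max_output


def paged_slots(prompt_len, output_len, block_size):
    # Paged allocation grows block by block as tokens are produced, so only
    # the last block can contain unused slots.
    num_blocks = math.ceil((prompt_len + output_len) / block_size)
    return num_blocks * block_size


needed = PROMPT_LEN + OUTPUT_LEN
print("slots actually needed:   ", needed)                                           # 1074
print("reservation-based scheme:", reserved_slots(PROMPT_LEN, MAX_OUTPUT))           # 2048
print("paged scheme:            ", paged_slots(PROMPT_LEN, OUTPUT_LEN, BLOCK_SIZE))  # 1088
```

Under these assumptions, the reservation-based scheme wastes 974 of its 2048 slots on output space that is never used, while the paged scheme wastes at most one partially filled block (14 slots here).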
This paper is available on arXiv under the CC BY 4.0 DEED license.
Authors:
(1) Woosuk Kwon, UC Berkeley (equal contribution);
(2) Zhuohan Li, UC Berkeley (equal contribution);
(3) Siyuan Zhuang, UC Berkeley;
(4) Ying Sheng, UC Berkeley and Stanford University;
(5) Lianmin Zheng, UC Berkeley;
(6) Cody Hao Yu, Independent Researcher;
(7) Joseph E. Gonzalez, UC Berkeley;
(8) Hao Zhang, UC San Diego;
(9) Ion Stoica, UC Berkeley.