How Effective is vLLM When a Prefix Is Thrown Into the Mix?

Table of Links

6.4 Shared prefix

We explore the effectiveness of vLLM for the case a prefix is shared among different input prompts, as illustrated in

Fig. 10. For the model, we use LLaMA-13B [52], which is multilingual. For the workload, we use the WMT16 [4] Englishto-German translation dataset and synthesize two prefixes that include an instruction and a few translation examples. The first prefix includes a single example (i.e., one-shot) while the other prefix includes 5 examples (i.e., few-shot). As shown in Fig. 16 (a), vLLM achieves 1.67× higher throughput than Orca (Oracle) when the one-shot prefix is shared. Furthermore, when more examples are shared (Fig. 16 (b)), vLLM achieves 3.58× higher throughput than Orca (Oracle).

This paper is available on arxiv under CC BY 4.0 DEED license.

Authors:

(1) Woosuk Kwon, UC Berkeley with Equal contribution;

(2) Zhuohan Li, UC Berkeley with Equal contribution;

(3) Siyuan Zhuang, UC Berkeley;

(4) Ying Sheng, UC Berkeley and Stanford University;

(5) Lianmin Zheng, UC Berkeley;

(6) Cody Hao Yu, Independent Researcher;

(7) Cody Hao Yu, Independent Researcher;

(8) Joseph E. Gonzalez, UC Berkeley;

(9) Hao Zhang, UC San Diego;

(10) Ion Stoica, UC Berkeley.

How Effective is vLLM When a Prefix Is Thrown Into the Mix?

Too Long; Didn't Read

Table of Links

6.4 Shared prefix

About Author

TOPICS

THIS ARTICLE WAS FEATURED IN...

How Effective is vLLM When a Prefix Is Thrown Into the Mix?

Too Long; Didn't Read

Table of Links

6.4 Shared prefix

About Author

TOPICS

THIS ARTICLE WAS FEATURED IN...

RELATED STORIES