    FlashDecoding++: Faster Large Language Model Inference on GPUs: Related Works

    February 15th, 2024
    Due to the versatility of its optimizations, FlashDecoding++ achieves speedups of up to 4.86× and 2.18× on NVIDIA and AMD GPUs, respectively, compared to Hugging Face implementations.
    This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED license.

    Authors:

    (1) Ke Hong, Tsinghua University & Infinigence-AI;

    (2) Guohao Dai, Shanghai Jiao Tong University & Infinigence-AI;

    (3) Jiaming Xu, Shanghai Jiao Tong University & Infinigence-AI;

    (4) Qiuli Mao, Tsinghua University & Infinigence-AI;

    (5) Xiuhong Li, Peking University;

    (6) Jun Liu, Shanghai Jiao Tong University & Infinigence-AI;

    (7) Kangdi Chen, Infinigence-AI;

    (8) Yuhan Dong, Tsinghua University;

    (9) Yu Wang, Tsinghua University.

    Figure 11: Speedup of the prefill phase on NVIDIA GPUs.

    Figure 12: Speedup of the decode phase on AMD RX7900XTX.

    Figure 13: Speedup of the decode phase on AMD MI210.

    Large language model inference acceleration has gained significant attention in recent research, with several notable approaches and techniques emerging in the field. DeepSpeed [9] is a comprehensive engine that optimizes both the training and inference phases for LLMs. It achieves robust inference performance through kernel fusion and efficient GPU memory management, with a particular focus on optimizing memory usage for the KV cache. vLLM [11] improves GPU memory utilization through efficient memory management techniques and the PagedAttention method, which increases the maximum batch size and raises the upper limit of inference performance. FlashAttention [18, 19] optimizes the self-attention computation during the prefill phase through improved parallelism and workload distribution. FlashDecoding [13] extends FlashAttention and enhances parallelism by splitting K and V, supporting efficient self-attention computation for long sequences during the decode phase. FasterTransformer [33] and OpenPPL [12] implement large model inference engines in C++ to reduce the kernel scheduling overhead incurred by Python implementations. They also employ memory management techniques and kernel fusion to achieve efficient LLM inference. TensorRT-LLM [14] is built upon TensorRT [38] and the FasterTransformer [33] engine (C++) and incorporates cutting-edge open-source technologies such as FlashAttention [18, 19]. Additionally, it improves ease of use by providing a Python API.
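
The idea of splitting K and V during the decode phase, as attributed to FlashDecoding [13] above, can be illustrated with a small CPU-only sketch. This is not the FlashDecoding kernel: the function names (`attention_single_query`, `partial_attention`, `split_kv_attention`) are hypothetical, the chunks are processed sequentially here where a GPU would handle them in parallel, and only the general pattern is shown, namely per-chunk partial softmax results that are rescaled and merged in a final reduction.

```python
import math


def attention_single_query(q, keys, values):
    """Reference result: softmax(q . K^T / sqrt(d)) . V for one query vector."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim_v = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim_v)]


def partial_attention(q, keys, values):
    """Attention over one K/V chunk.

    Returns (chunk max score, sum of exponentials, unnormalized output),
    which is enough information to merge chunks later.
    """
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    exp_sum = sum(weights)
    dim_v = len(values[0])
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)]
    return m, exp_sum, out


def split_kv_attention(q, keys, values, num_splits):
    """Split K and V along the sequence axis, then merge the partial results."""
    n = len(keys)
    chunk = math.ceil(n / num_splits)
    # Each chunk is independent; on a GPU these would run concurrently.
    partials = [partial_attention(q, keys[i:i + chunk], values[i:i + chunk])
                for i in range(0, n, chunk)]

    # Reduction: rescale every chunk to a common (global) max before summing.
    m_global = max(m for m, _, _ in partials)
    total = 0.0
    merged = [0.0] * len(values[0])
    for m, exp_sum, out in partials:
        scale = math.exp(m - m_global)
        total += exp_sum * scale
        merged = [acc + scale * x for acc, x in zip(merged, out)]
    return [x / total for x in merged]
```

Because the merge only rescales each chunk by the exponential of the difference between its local max and the global max, the result matches the unsplit computation up to floating-point rounding, regardless of how many chunks are used.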

