
CLLMs: Consistency Large Language Models: Comparison with Baseline Algorithms


Authors:

(1) Siqi Kou, Shanghai Jiao Tong University, with equal contribution;

(2) Lanxiang Hu, University of California, San Diego, with equal contribution;

(3) Zhezhi He, Shanghai Jiao Tong University;

(4) Zhijie Deng, Shanghai Jiao Tong University;

(5) Hao Zhang, University of California, San Diego.

Table of Links

Abstract and 1 Introduction

2. Related Work

3. Methodology and 3.1. Preliminary: Jacobi Decoding

3.2. Consistency Large Language Models (CLLMs)

3.3. Acceleration Mechanisms in CLLMs

4. Experiments

4.1. Evaluations

4.2. Acceleration Mechanisms in CLLMs

4.3. Ablation Studies

4.4. Limitations and Discussion

5. Conclusion, Impact Statement, and References

A. Illustration of Consistency Loss Learning Objectives

B. Comparison with Baseline Algorithms

C. Pseudo Code for Jacobi Decoding with KV Cache

B. Comparison with Baseline Algorithms

In this section, we present a comparative analysis of baseline algorithms for efficient LLM inference. The key features considered are listed below, and a minimal Jacobi decoding sketch follows the list to make two of them concrete. Table 7 highlights that our proposed method, CLLMs, stands out for its memory efficiency and adaptability, requiring no modifications to the existing model architecture while achieving up to 3.4× inference speedup.


• Lossless: whether the method generates exactly the same output distribution as AR decoding does in the backbone model.


• Training-free: whether the method requires training.


• Architecture-design-free: whether the method requires modifications to pre-trained LLMs or the addition of auxiliary components (such as extra MLP layers or LM heads (Cai et al., 2024), autoregressive heads (Li et al., 2024), etc.).


• Attention-modification-free: whether the method requires modifications to the existing attention mechanism in transformers, such as the tree token verification used in Cai et al. (2024).


• Extra-memory-free: whether the method requires extra memory in the system to accommodate a speculative model or extra parameters.


• Speedup: whether the method can effectively deliver inference speedup in practical use cases.
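
To make the "lossless" and "architecture-design-free" criteria concrete, below is a minimal sketch of greedy Jacobi decoding on an off-the-shelf HuggingFace causal LM. This is an illustrative sketch, not the paper's implementation: the "gpt2" checkpoint, the helper name jacobi_decode, and its parameters are placeholders chosen here for brevity. The whole n-token block is refined in parallel each iteration, and at the fixed point it coincides with greedy AR decoding, without auxiliary heads, attention changes, or a separate draft model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any off-the-shelf causal LM works; "gpt2" is just a small placeholder.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def jacobi_decode(prompt_ids, n_tokens=16, max_iters=32):
    """Greedy Jacobi decoding of one n-token block (illustrative sketch).

    The whole block is refined in parallel at every iteration; once a fixed
    point is reached, the block matches greedy AR decoding token-for-token,
    which is why this style of decoding is lossless under greedy sampling.
    """
    # Initialize the block with an arbitrary guess (here: the last prompt token).
    block = prompt_ids[:, -1:].repeat(1, n_tokens)
    for _ in range(max_iters):
        logits = model(torch.cat([prompt_ids, block], dim=1)).logits
        # The prediction for block position i lives at sequence position
        # (prompt_len - 1 + i) because of the causal-LM one-token shift.
        start = prompt_ids.shape[1] - 1
        new_block = logits[:, start:start + n_tokens, :].argmax(dim=-1)
        if torch.equal(new_block, block):  # fixed point: same as greedy AR output
            break
        block = new_block
    return block

prompt_ids = tok("The quick brown fox", return_tensors="pt").input_ids
print(tok.decode(jacobi_decode(prompt_ids)[0]))
```

Note that this sketch re-encodes the full prefix at every iteration for clarity; pairing Jacobi decoding with a KV cache to avoid that redundant work is covered in Appendix C.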



This paper is available on arXiv under a CC0 1.0 Universal license.

