
CLLMs: Consistency Large Language Models: Limitations and Discussion


Authors:

(1) Siqi Kou, Shanghai Jiao Tong University (equal contribution);

(2) Lanxiang Hu, University of California, San Diego (equal contribution);

(3) Zhezhi He, Shanghai Jiao Tong University;

(4) Zhijie Deng, Shanghai Jiao Tong University;

(5) Hao Zhang, University of California, San Diego.

Table of Links

Abstract and 1 Introduction

2. Related Work

3. Methodology and 3.1. Preliminary: Jacobi Decoding

3.2. Consistency Large Language Models (CLLMs)

3.3. Acceleration Mechanisms in CLLMs

4. Experiments

4.1. Evaluations

4.2. Acceleration Mechanisms in CLLMs

4.3. Ablation Studies

4.4. Limitations and Discussion

5. Conclusion, Impact Statement, and References

A. Illustration of Consistency Loss Learning Objectives

B. Comparison with Baseline Algorithms

C. Pseudo Code for Jacobi Decoding with KV Cache

4.4. Limitations and Discussion

In our experiments, we observe that achieving a significant speedup while maintaining good generation quality with a CLLM relies strongly on having a high-quality Jacobi trajectory dataset. Data cleaning is therefore crucial, as discussed in Section 3.2.1. Dataset size also plays a role, as described in Section 4.3 and shown in Table 4, although to a lesser extent. For instance, Jacobi trajectories generated with only 10% of the Code-Search-Net Python dataset are able to yield a 2.9× speedup, as demonstrated in Table 2. However, for open-domain datasets like ShareGPT, more data is necessary for improved efficiency.


Figure 3. Accuracy and speedup of models trained with different n-token sequence lengths on the GSM8K dataset. The sequence length used for generation matches the training setting. Speedup is measured as the ratio of wall-clock generation throughput with Jacobi decoding to that of baseline AR decoding.
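To make the role of the Jacobi trajectory dataset concrete, the following is a minimal sketch, assuming a Hugging Face-style causal LM and greedy decoding, of how a single Jacobi trajectory can be collected for one prompt. The function name, the n_tokens default, and the pad-token initialization are illustrative choices, not the authors' reference implementation (see Appendix C for the paper's pseudo code with KV cache).

```python
import torch

@torch.no_grad()
def collect_jacobi_trajectory(model, tokenizer, prompt, n_tokens=16, max_iters=64):
    """Sketch: iterate an n-token block in parallel until it stops changing."""
    device = next(model.parameters()).device
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    prompt_len = prompt_ids.shape[1]

    # Initialize the n-token block with an arbitrary guess (here: pad/eos tokens).
    fill_id = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id
    guess = torch.full((1, n_tokens), fill_id, dtype=torch.long, device=device)

    trajectory = [guess.clone()]
    for _ in range(max_iters):
        logits = model(input_ids=torch.cat([prompt_ids, guess], dim=1)).logits
        # The prediction for position prompt_len + i is read from logits at prompt_len + i - 1.
        new_guess = logits[:, prompt_len - 1 : prompt_len + n_tokens - 1].argmax(dim=-1)
        trajectory.append(new_guess.clone())
        if torch.equal(new_guess, guess):  # fixed point reached; equals the greedy AR output
            break
        guess = new_guess
    return trajectory  # intermediate states plus the fixed point form one Jacobi trajectory
```

Each intermediate state, paired with the fixed point of its trajectory, becomes raw material for consistency training; the data cleaning described in Section 3.2.1 is then applied on top of such raw trajectories.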


In our proposed method and experiments, we primarily use output sequences from the teacher (Kim & Rush, 2016) to collect Jacobi trajectories and train a CLLM. This introduces some additional overhead compared with conventional model training. On-policy GKD (Agarwal et al., 2023) suggests that LLM distillation with a mixture of teacher and student samples, or even with student samples alone, can yield high-performing models. One mitigation is therefore to use n-token sequences generated by the trained model itself as training samples, as sketched below. This would remove the Jacobi trajectory collection overhead, making our proposed method potentially feasible for pre-training.


Table 4. Comparison of the performance of CLLMs trained with Jacobi trajectory datasets of different sizes on ShareGPT.


Table 5. CLLMs’ performance versus the fine-tuned baseline on language modeling tasks.
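As an illustration of that mitigation, the sketch below mixes student-generated and teacher-generated n-token sequences in the spirit of on-policy GKD (Agarwal et al., 2023). The student_ratio value and the use of greedy generation are illustrative assumptions, not settings from the paper.

```python
import random
import torch

@torch.no_grad()
def sample_training_sequences(student, teacher, tokenizer, prompts,
                              n_tokens=16, student_ratio=0.5):
    """Sketch: draw n-token continuations from a mixture of student and teacher."""
    samples = []
    for prompt in prompts:
        # With probability student_ratio, use the (partially trained) student itself,
        # which removes the need to pre-collect teacher Jacobi trajectories.
        source = student if random.random() < student_ratio else teacher
        device = next(source.parameters()).device
        prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
        out = source.generate(prompt_ids, max_new_tokens=n_tokens, do_sample=False)
        samples.append((prompt_ids, out[:, prompt_ids.shape[1]:]))  # (prompt, n-token sequence)
    return samples
```

Trajectories for consistency training could then be produced on the fly from these sequences, for example with the collection routine sketched earlier, instead of being collected offline from the teacher.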


Results from our language modeling experiments, as detailed in Table 5, demonstrate the robustness of CLLMs trained on pre-training-style jobs, together with a notable speedup. By incorporating on-policy GKD, it is conceivable that a modified version of our proposed method could be employed for LLM pre-training. Such a modification would equip the pre-trained model both with the strong language modeling capability that existing models possess and with a high generation speed when Jacobi decoding is used for inference. We leave adapting CLLMs to pre-training jobs for future work.
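Purely as a speculative illustration of such a modification, the sketch below adds a consistency-style term, here a plain cross-entropy toward the fixed-point tokens, to the standard next-token loss of a pre-training step. The weighting w and the cross-entropy formulation are assumptions for illustration; they are not the objective defined in Section 3.2 or Appendix A.

```python
import torch
import torch.nn.functional as F

def pretraining_step_loss(model, text_ids, intermediate_state, fixed_point, prompt_len, w=1.0):
    """Sketch: standard LM loss on pre-training text plus a consistency-style term."""
    # 1) Ordinary next-token (AR) loss on a batch of pre-training text.
    ar_loss = model(input_ids=text_ids, labels=text_ids).loss

    # 2) Consistency-style term: conditioned on an intermediate Jacobi state, the model
    #    should predict the fixed point's tokens at the corresponding positions.
    n = fixed_point.shape[1] - prompt_len
    logits = model(input_ids=intermediate_state).logits
    pred = logits[:, prompt_len - 1 : prompt_len + n - 1]  # (batch, n, vocab)
    target = fixed_point[:, prompt_len:]                   # (batch, n)
    consistency_loss = F.cross_entropy(pred.transpose(1, 2), target)

    return ar_loss + w * consistency_loss
```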


This paper is available on arXiv under the CC0 1.0 Universal license.

