
CLLMs: Consistency Large Language Models: Ablation Studies

Authors:

(1) Siqi Kou, Shanghai Jiao Tong University (equal contribution);

(2) Lanxiang Hu, University of California, San Diego (equal contribution);

(3) Zhezhi He, Shanghai Jiao Tong University;

(4) Zhijie Deng, Shanghai Jiao Tong University;

(5) Hao Zhang, University of California, San Diego.

Table of Links

Abstract and 1 Introduction

2. Related Work

3. Methodology and 3.1. Preliminary: Jacobi Decoding

3.2. Consistency Large Language Models (CLLMs)

3.3. Acceleration Mechanisms in CLLMs

4. Experiments

4.1. Evaluations

4.2. Acceleration Mechanisms in CLLMs

4.3. Ablation Studies

4.4. Limitations and Discussion

5. Conclusion, Impact Statement, and References

A. Illustration of Consistency Loss Learning Objectives

B. Comparison with Baseline Algorithms

C. Pseudo Code for Jacobi Decoding with KV Cache

4.3. Ablation Studies

Dataset sizes and generalizability. In Section 3.2.1, Jacobi trajectory datasets are collected to train models for efficient Jacobi decoding. Table 4 demonstrates that larger Jacobi trajectory datasets yield more significant speedup, and that the speedup gradually saturates as the dataset size scales. Moreover, CLLMs trained with more data perform well even at n-token sequence lengths they were not trained on, offering greater deployment-time robustness.
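For intuition, below is a minimal sketch (not the authors' released code) of how a single Jacobi trajectory could be collected from a target model by iterating greedy decoding on an n-token block until it reaches a fixed point; `model`, `prompt_ids`, `n_tokens`, and `max_iters` are illustrative placeholders.

```python
import torch

@torch.no_grad()
def collect_jacobi_trajectory(model, prompt_ids, n_tokens, max_iters=100):
    """Run Jacobi (fixed-point) iteration on an n-token block and record
    every intermediate state; the recorded states form one trajectory."""
    device = prompt_ids.device
    # Initialize the n-token block with arbitrary vocabulary ids (any guess works).
    block = torch.randint(0, model.config.vocab_size, (1, n_tokens), device=device)
    trajectory = [block.clone()]
    for _ in range(max_iters):
        inputs = torch.cat([prompt_ids, block], dim=1)
        logits = model(inputs).logits
        # Greedy next-token prediction for every position in the block,
        # conditioned on the (possibly wrong) previous guesses.
        new_block = logits[:, prompt_ids.shape[1] - 1:-1, :].argmax(dim=-1)
        trajectory.append(new_block.clone())
        if torch.equal(new_block, block):  # fixed point reached
            break
        block = new_block
    return trajectory  # the last state matches the autoregressive output
```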


Different lengths of n-token sequence. We investigate how different n-token sequence lengths in the Jacobi trajectory dataset affect CLLMs' performance on GSM8K. We employ varying lengths to generate the Jacobi trajectory dataset and train the CLLMs accordingly. Figure 3 illustrates that CLLMs consistently maintain generation quality when trained with different lengths. In practice, however, longer sequence lengths come at the cost of increased computational overhead during inference. In Figure 3, a significant degradation in inference speed can thus be observed when the n-token sequence length exceeds 64.
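To see why this overhead grows with the block size, the rough sketch below times one Jacobi iteration for a given n-token length; each iteration is a forward pass over the prompt plus the full block, so wider blocks cost more per step. `model` and `prompt_ids` are assumed placeholders, and the actual end-to-end speedup also depends on how many tokens converge per iteration.

```python
import time
import torch

@torch.no_grad()
def time_one_jacobi_step(model, prompt_ids, n_tokens, repeats=10):
    """Rough wall-clock cost of a single Jacobi iteration for a given
    n-token block size; larger blocks mean wider forward passes."""
    block = torch.randint(0, model.config.vocab_size,
                          (1, n_tokens), device=prompt_ids.device)
    inputs = torch.cat([prompt_ids, block], dim=1)
    model(inputs)  # warm-up pass
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        model(inputs).logits.argmax(dim=-1)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / repeats

# Example: latencies = {n: time_one_jacobi_step(model, prompt_ids, n)
#                       for n in (16, 32, 64, 128)}
```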


Loss design. We adjust the ratio of consistency loss to autoregressive loss described in Section 3.2.2 and evaluate the different loss ratios' performance on GSM8K. As illustrated in Table 6, increasing the emphasis on the autoregressive loss does indeed enhance accuracy, though it slightly compromises the speedup gains. Additionally, we compare the efficacy of CLLMs trained with the global consistency loss versus the local consistency loss. Table 6 demonstrates that the global loss is more effective for training CLLMs.
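One plausible reading of how the two objectives combine is sketched below: a consistency term that pulls the model's prediction on an intermediate Jacobi state toward its (detached) prediction on the trajectory's fixed point, plus a standard autoregressive cross-entropy term whose weight plays the role of the ratio varied in Table 6. All names and the exact distance measure (KL here) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def cllm_training_loss(logits_on_intermediate, logits_on_fixed_point,
                       ar_logits, ar_labels, ar_weight=1.0):
    """Global-consistency-style objective: match predictions made from an
    intermediate Jacobi state to those made from the converged fixed point,
    plus an ordinary autoregressive (next-token) loss on ground-truth data."""
    # Consistency term: KL between the distribution on the intermediate
    # state and the detached distribution on the fixed point.
    p_fixed = F.softmax(logits_on_fixed_point.detach(), dim=-1)
    log_p_inter = F.log_softmax(logits_on_intermediate, dim=-1)
    consistency = F.kl_div(log_p_inter, p_fixed, reduction="batchmean")

    # Autoregressive term: cross-entropy on the original data, which
    # Table 6 suggests helps preserve generation quality.
    ar = F.cross_entropy(ar_logits.view(-1, ar_logits.size(-1)),
                         ar_labels.view(-1), ignore_index=-100)
    return consistency + ar_weight * ar
```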


This paper is available on arxiv under CC0 1.0 Universal license.

