Table of Links
- Proposed Method: Quantized DyLoRA
- Experiments and Evaluation
- On the semi-sorted behavior of QDyLoRA
- Conclusion, Limitations, and References
- A. Supplementary Material
3 Experiments and Evaluation
This section evaluates the efficiency and efficacy of QDyLoRA through several instruction fine-tuning tasks. The first experiment compares QDyLoRA with QLoRA on the Massive Multitask Language Understanding (MMLU) benchmark (Hendrycks et al., 2020), which consists of more than 50 different tasks spanning subjects from fundamental mathematics and U.S. history to computer science and law. As shown in Table 1[1], we fine-tune LLaMA-7b, LLaMA-13b, LLaMA2-13b, and Falcon-40b on different instruction datasets, Alpaca (Taori et al., 2023), OASST1 (Köpf et al., 2023), Self-Instruct (Wang et al., 2022), and FLANv2 (Chung et al., 2022), using the QLoRA and QDyLoRA techniques. We use the same training budget and the same maximum LoRA rank[2] for both techniques. The results consistently show that QDyLoRA achieves superior performance by finding the optimal rank.
The second experiment provides a more in-depth comparison between QLoRA and QDyLoRA. In particular, we fine-tuned Falcon-40b on the WebGLM (Liu et al., 2023) and GSM8k (Cobbe et al., 2021) benchmarks under identical settings and compared their test performance across different ranks. As described in Table 2, QDyLoRA attains superior performance, notably when employing its optimal ranks (rank 2 for WebGLM and rank 8 for GSM8k). Furthermore, QDyLoRA exhibits consistent superiority over QLoRA, particularly at lower ranks. These findings emphasize the adaptive nature of QDyLoRA, which dynamically adjusts its focus across ranks during fine-tuning, leading to better efficiency and efficacy than its static counterpart, QLoRA.

The third experiment compares the performance of DyLoRA, QDyLoRA, and QLoRA on GSM8k and TriviaQA (Joshi et al., 2017), adopting LLaMA2-13b and LLaMA-7b as the underlying LLMs. Table 3 reports the results. As the table illustrates, for the smaller model, i.e., LLaMA-7b, both DyLoRA and QDyLoRA outperform QLoRA. For the larger model, i.e., LLaMA2-13b, DyLoRA fails with an out-of-memory (OOM) error, whereas QDyLoRA performs best in this setting.
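For readers unfamiliar with the mechanism these comparisons rest on, the sketch below illustrates, in simplified PyTorch, what a dynamic-rank LoRA layer of the kind QDyLoRA trains could look like. It is a minimal illustration under stated assumptions, not the authors' implementation: the class name `DynamicRankLoRALinear` is hypothetical, and the 4-bit quantization of the frozen base weights used by QLoRA/QDyLoRA is omitted for brevity.

```python
# Minimal sketch (not the authors' code) of the dynamic-rank idea behind QDyLoRA:
# a rank r is sampled at every training step and only the first r rows/columns of
# the LoRA factors are used, so one fine-tuning run covers all ranks up to max_rank.
import random

import torch
import torch.nn as nn


class DynamicRankLoRALinear(nn.Module):
    def __init__(self, base_linear: nn.Linear, max_rank: int = 64, alpha: float = 16.0):
        super().__init__()
        self.base = base_linear                      # frozen base projection (4-bit in QDyLoRA)
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.max_rank = max_rank
        self.scaling = alpha / max_rank
        self.inference_rank = None                   # fixed rank chosen after training
        in_f, out_f = base_linear.in_features, base_linear.out_features
        self.lora_A = nn.Parameter(torch.randn(max_rank, in_f) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_f, max_rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            # Dynamic rank: sample r uniformly from [1, max_rank] at every step.
            r = random.randint(1, self.max_rank)
        else:
            # At evaluation time, use the rank that performed best on held-out data.
            r = self.inference_rank or self.max_rank
        A, B = self.lora_A[:r, :], self.lora_B[:, :r]
        return self.base(x) + self.scaling * (x @ A.t() @ B.t())
```

Because a single run of this kind visits every rank up to the maximum, the same checkpoint can, in principle, be evaluated at any rank afterwards, which is what makes per-rank comparisons such as those in Table 2 possible without retraining.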
Authors:
(1) Hossein Rajabzadeh, University of Waterloo and Huawei Noah’s Ark Lab ([email protected]);
(2) Mojtaba Valipour, University of Waterloo ([email protected]);
(3) Tianshu Zhu, Huawei Noah’s Ark Lab ([email protected]);
(4) Marzieh Tahaei, Huawei Noah’s Ark Lab ([email protected]);
(5) Hyock Ju Kwon ([email protected]);
(6) Ali Ghodsi ([email protected]);
(7) Boxing Chen, Huawei Noah’s Ark Lab ([email protected]);
(8) Mehdi Rezagholizadeh, Huawei Noah’s Ark Lab ([email protected]).
[1] The same settings as the original QLoRA work are applied here.
[2] The maximum LoRA rank is fixed to 64. While QLoRA's rank is always fixed, QDyLoRA splits its training across ranks in the range 1 to 64.
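As a purely illustrative reading of footnote [2], the snippet below sketches how a single QDyLoRA checkpoint could be swept over candidate ranks to find the best-performing one, as in the per-rank comparison of Table 2. It assumes the hypothetical `DynamicRankLoRALinear` layer sketched earlier and a user-supplied `eval_fn`; neither is part of the authors' released code.

```python
# Illustrative rank sweep over one trained QDyLoRA checkpoint (assumes the
# DynamicRankLoRALinear sketch above). eval_fn is a placeholder for any scoring
# function, e.g. exact-match accuracy on a GSM8k or WebGLM validation split.
def pick_best_rank(model, eval_fn, ranks=(1, 2, 4, 8, 16, 32, 64)):
    scores = {}
    for r in ranks:
        for module in model.modules():
            if isinstance(module, DynamicRankLoRALinear):
                module.inference_rank = r            # truncate adapters to rank r
        model.eval()
        scores[r] = eval_fn(model)
    best = max(scores, key=scores.get)
    return best, scores
```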