4.2 Analysis of Individual PELT Methods
In Table 1, we show the performance of different methods on the GLUE benchmark with various sizes of training data. The results on the development sets are generally consistent with those on the test sets and are provided in App. D. Although the average performance of different methods over 8 tasks is sometimes similar, the differences between tasks can be quite significant under certain setups, as large as 5-9 points on a specific task (e.g., STS-B and MNLI, K = 500), even when excluding cases where some methods fail to learn effectively (e.g., prefix-tuning on QQP, K = 100).
Next, we examine each individual PELT method more closely.
Analysis of Adapter. There are certain tasks (e.g., STS-B) where adapter largely outperforms competitive methods such as prefix-tuning and LoRA regardless of the size of training data, suggesting that one should favor adapter over other PELT methods in certain scenarios as well.
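For reference, below is a minimal sketch of the bottleneck adapter design evaluated here (a down-projection, a nonlinearity, an up-projection, and a residual connection); the hidden and bottleneck sizes are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual."""
    def __init__(self, hidden_size: int = 768, bottleneck: int = 48):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Only the adapter parameters are trained; the surrounding
        # Transformer layer stays frozen.
        return x + self.up(self.act(self.down(x)))
```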
Analysis of Prefix-tuning. Prefix-tuning performs poorly with K = {100, 500} and becomes on par with fine-tuning when K reaches 1000. We also observe that prefix-tuning fails to learn effectively on certain tasks when the training data is limited (e.g., K = 100 on SST-2 and K = 500 on QQP), leading to unsatisfactory performance and/or large variance across different runs. Similar phenomena have been observed in a concurrent study (Gu et al., 2021) on few-shot prompt-tuning.
To ensure that the poor performance of prefix-tuning is not due to its fewer trainable parameters (under its default setting), we further increase the prefix length to L = 50 such that its trainable parameters are comparable to adapter, and re-evaluate prefix-tuning on all 8 tasks with K = 100. For the 4 tasks where prefix-tuning (L = 10) performs poorly (SST-2, CoLA, STS-B, and QQP), its performance improves significantly on 3 of them but degrades significantly on the remaining task (STS-B), which suggests that training instability in the low-resource regime remains an issue for prefix-tuning even with more trainable parameters.[5] Moreover, prefix-tuning (L = 50) still lags behind adapter or UNIPELT (AP) on 3 of the 4 tasks. Furthermore, the average performance of prefix-tuning (L = 50) on 8 tasks is even slightly worse than with L = 10, which indicates that increasing the prefix length is not a panacea for all scenarios. A larger L also leads to a significant training/inference slowdown due to the costly multi-head attention. More broadly, such results suggest that using more trainable parameters does not guarantee better performance.
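As a back-of-envelope check of why L = 50 makes the parameter budgets comparable, the counts below assume BERT-base dimensions (12 layers, hidden size 768); the adapter bottleneck size and one-adapter-per-layer placement are assumptions for this estimate, and the prefix-tuning count ignores any reparameterization network used during training.

```python
# Rough trainable-parameter counts, assuming BERT-base (12 layers, hidden 768).
layers, hidden, bottleneck = 12, 768, 48

def prefix_params(prefix_len: int) -> int:
    # One key and one value vector per prefix position, per layer.
    return layers * 2 * prefix_len * hidden

def adapter_params() -> int:
    # One bottleneck adapter per layer: down- and up-projection plus biases.
    # (Placement and count per layer are assumptions for this estimate.)
    return layers * (2 * hidden * bottleneck + bottleneck + hidden)

print(prefix_params(10))  # 184,320  (default L = 10)
print(prefix_params(50))  # 921,600  (L = 50)
print(adapter_params())   # 894,528  (comparable to prefix-tuning with L = 50)
```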
On the bright side, prefix-tuning performs well on certain tasks such as natural language inference (RTE and MNLI) with various sizes of training data, which suggests that one should also prefer prefix-tuning in certain cases.
Analysis of BitFit & LoRA. Tuning only the bias terms of the model does not lead to very satisfactory results in our experiments: BitFit never performs the best and generally performs the worst across different data and task setups. Therefore, we do not consider BitFit in the following experiments and exclude it as a submodule of UNIPELT. As for LoRA, there are a few setups where it also fails to learn effectively, such as STS-B and QQP (K = {100, 500}), leading to high variance across runs. Apart from that, LoRA performs quite competitively despite using fewer trainable parameters than methods like adapter, especially when K = 1000, achieving the best or 2nd best performance on 4 of 8 tasks.
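For concreteness, BitFit's bias-only tuning amounts to freezing every parameter except the bias terms; a minimal sketch follows (the backbone name and the choice to keep the task-specific classification head trainable are assumptions).

```python
from transformers import AutoModelForSequenceClassification

# BitFit: freeze everything except bias terms. Keeping the classification
# head trainable is an assumption, as a new head must be learned per task.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
for name, param in model.named_parameters():
    param.requires_grad = name.endswith(".bias") or "classifier" in name
```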
As LoRA has a scaling factor α that can be seen as a static gating function under our formulation, we further investigate its importance by evaluating LoRA with different α. As shown in Fig. 3, LoRA is quite sensitive to the scaling factor and there seems to be no single optimal value that works well across multiple task and data setups. Such findings suggest that gating is critical and motivate us to use more fine-grained and dynamic control for UNIPELT. Besides, we observe that increasing α consistently results in faster convergence, possibly because the trainable parameters would receive larger gradient updates with a larger α.
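To make the role of α concrete, here is a minimal sketch of a LoRA linear layer: the frozen weight is augmented with a trainable low-rank update B A scaled by α/r, so α acts as a static, input-independent gate on the update. The dimensions, rank, and initialization constants are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update scaled by alpha/r."""
    def __init__(self, in_dim: int, out_dim: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        # Frozen weight standing in for a pretrained projection matrix.
        self.weight = nn.Parameter(torch.empty(out_dim, in_dim), requires_grad=False)
        nn.init.normal_(self.weight, std=0.02)
        # Low-rank factors: A is Gaussian, B starts at zero so the update is 0.
        self.A = nn.Parameter(torch.zeros(r, in_dim))
        self.B = nn.Parameter(torch.zeros(out_dim, r))
        nn.init.normal_(self.A, std=0.02)
        # Static gate: a larger alpha scales up the update and its gradients.
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.weight.T + self.scaling * (x @ self.A.T @ self.B.T)
```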
Authors:
(1) Yuning Mao, University of Illinois Urbana-Champaign; work done during an internship at Meta AI ([email protected]);
(2) Lambert Mathias, Meta AI ([email protected]);
(3) Rui Hou, Meta AI ([email protected]);
(4) Amjad Almahairi, Meta AI ([email protected]);
(5) Hao Ma, Meta AI ([email protected]);
(6) Jiawei Han, University of Illinois Urbana-Champaign ([email protected]);
(7) Wen-tau Yih, Meta AI ([email protected]);
(8) Madian Khabsa, Meta AI ([email protected]).
This paper is available on arXiv.
[5] Tuning other hyperparameters like learning rate does not appear to alleviate the issue either.