Table of Links
A. The Connection Between Prefix-tuning and Hypernetwork
B. Number of Tunable Parameters
6. Results and Analysis
We design a series of experiments on pure language and V&L tasks, in both multi-task and few-shot scenarios, to verify the effectiveness of our proposed framework compared to existing methods.
6.1. Results on the GLUE Benchmark
The original implementations of Prefix-tuning (Li & Liang, 2021) and MAMAdapter (He et al., 2021) use single-task training on BART (Lewis et al., 2020). For a fair comparison with the other baselines, we apply their methods to T5 in a multi-task training setting. [4] For each model, we share the parameters of both the prefix vectors and the adapter weights across tasks.
Overall, our HyperPELT method obtains the best performance with fewer trainable parameters. Compared to single-task Adapters, which finetune all the parameters introduced in the adapters, our method yields a significant improvement of 2.21% with far fewer trainable parameters, which illustrates the effectiveness of our proposed multi-task training framework.
In multi-task training, the proposed hypernetwork-based prefix-tuning strategy, i.e., HyperPrefix, decreases the number of trainable parameters (1.01× for HyperPrefix vs. 1.14× for Prefix-tuning) while achieving better performance (86.65% for HyperPrefix vs. 86.09% for Prefix-tuning). Notably, the number of trainable parameters per task is 11× smaller than that of Prefix-tuning.
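To make the parameter saving concrete, the following is a minimal sketch of how a shared hypernetwork can map a small per-task embedding to per-layer prefix key/value vectors, so that only the task embedding is task-specific. The module names, dimensions, and MLP design are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class PrefixHypernetwork(nn.Module):
    """Shared hypernetwork: maps a per-task embedding to per-layer prefix
    key/value vectors. Only the task embeddings are task-specific."""
    def __init__(self, task_emb_dim=64, n_layers=12, n_heads=12,
                 head_dim=64, prefix_len=10, hidden_dim=128):
        super().__init__()
        self.shape = (n_layers, 2, prefix_len, n_heads, head_dim)  # 2 = key, value
        out_dim = n_layers * 2 * prefix_len * n_heads * head_dim
        self.mlp = nn.Sequential(
            nn.Linear(task_emb_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, task_emb):
        # task_emb: (task_emb_dim,) -> (n_layers, 2, prefix_len, n_heads, head_dim)
        return self.mlp(task_emb).view(self.shape)

# Per-task storage is just one small embedding vector; the hypernetwork is shared.
task_embeddings = nn.Embedding(8, 64)                  # e.g., 8 GLUE tasks
hypernet = PrefixHypernetwork()
prefixes = hypernet(task_embeddings(torch.tensor(0)))  # prefixes for task 0
```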
HyperPELT outperforms HyperPrefix, mainly because we further add hypernetwork-based adapters to the feedforward layers in a parallel manner. In this way, the average performance is further improved (+0.44%) with only a small number of additional parameters (0.09% per task). The comparison with MAMAdapter shows that using a hypernetwork to tune each transformer block and learn shared knowledge across tasks leads to an improvement.
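As an illustration of this parallel adapter placement, the sketch below lets a shared linear generator produce the adapter's down/up projection weights from the task embedding, and adds the adapter branch to the frozen feedforward output. All names and dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class HyperAdapter(nn.Module):
    """Adapter whose down/up projections are generated by a shared hypernetwork
    from a task embedding (illustrative dimensions)."""
    def __init__(self, d_model=768, bottleneck=24, task_emb_dim=64):
        super().__init__()
        self.d_model, self.bottleneck = d_model, bottleneck
        self.generator = nn.Linear(task_emb_dim, 2 * d_model * bottleneck)

    def forward(self, hidden, task_emb):
        w = self.generator(task_emb)
        split = self.d_model * self.bottleneck
        w_down = w[:split].view(self.d_model, self.bottleneck)
        w_up = w[split:].view(self.bottleneck, self.d_model)
        return torch.relu(hidden @ w_down) @ w_up

def ffn_with_parallel_adapter(ffn, hidden, adapter, task_emb):
    # Parallel placement: the adapter branch reads the same input as the (frozen)
    # feedforward sublayer, and its output is added to the feedforward output.
    return ffn(hidden) + adapter(hidden, task_emb)
```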
6.2. Few-shot Domain Transfer
We take the models trained on GLUE as reported in Table 1 and evaluate them on the test sets of five different tasks after few-shot finetuning on each target training set. Following Mahabadi et al. (2021), we initialize each target task embedding with the task embedding trained on the most similar GLUE task, i.e., MNLI for CB, QNLI for QA, SST2 for sentiment analysis, and QQP for paraphrase detection.
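A minimal sketch of this initialization step, assuming the task embeddings live in an nn.Embedding table; the target task names and the mapping dictionary below are hypothetical placeholders in the spirit of the pairing described above.

```python
import torch
import torch.nn as nn

# Hypothetical mapping from a target task to its most similar GLUE source task.
SOURCE_TASK = {"cb": "mnli", "boolq": "qnli", "imdb": "sst2", "paws": "qqp"}

def init_target_embedding(task_embeddings: nn.Embedding, task_to_idx: dict,
                          target: str, target_idx: int) -> None:
    """Copy the trained source-task embedding into the new target-task slot
    before few-shot finetuning."""
    src_idx = task_to_idx[SOURCE_TASK[target]]
    with torch.no_grad():
        task_embeddings.weight[target_idx].copy_(task_embeddings.weight[src_idx])
```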
As suggested by Perez et al. (2021) and Zhao & Schütze (2021), we randomly select the same number of samples from the training and validation sets, making it a reasonable few-shot scenario. Checkpoints are selected via early stopping on the selected validation set, and the stopping metric is the default metric for each task.
In the first three columns of Table 2, we show the results of fully finetuned T5-Base, HYPERFORMER++ (finetuning both hypernetworks and task embeddings), and our proposed HyperPELT. Overall, our method achieves the best performance on the few-shot tasks.
For the CB and BoolQ tasks from SuperGLUE, even though the backbone T5 was previously trained on the training sets of these two tasks, the performance of the methods differs considerably. The two baselines still perform poorly with very few samples, e.g., 4 and 16, while our method is significantly better. We therefore assume that the two baselines suffer from catastrophic forgetting to some degree during multi-task training. In contrast, our proposed HyperPELT works effectively on these two tasks. We speculate that the reason might be the use of hypernetworks for both the prefix-tuning and adapter-tuning modules of the transformer. We leave this exploration to future work.
6.3. Results on the Vision-Language Benchmarks
We now move to experiments applying the proposed hypernetwork-based parameter-efficient training framework to V&L tasks. We compare against the pre-trained, fully finetuned VL-T5 (Cho et al., 2021), as well as CLIP-T5 and the adapter-based VL-Adapter (Sung et al., 2021), both built on top of T5, in the multi-task training setting.
To the best of our knowledge, we are the first to employ the visual modality to tune a very small number of parameters in the different transformer blocks, instead of inserting image patch tokens into the input sequence as is usually done. Experimental results demonstrate the effectiveness of this novel approach and provide a new perspective on extending multimodal capability on top of PLMs: features from different modalities serve as the input of a hypernetwork that generates parameters for modules in the PLM, rather than as part of the input sequence. One advantage of our approach is that it keeps the original maximum text input length, since no visual or audio features occupy it. This is promising for document-level and text-heavy tasks such as multimodal summarization (Zhang et al., 2022).
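As a simplified illustration of this idea, the sketch below feeds image features from a frozen visual encoder into a hypernetwork that emits per-layer parameters, leaving the text input sequence untouched. The encoder, shapes, and names are assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class VisualHypernetwork(nn.Module):
    """Sketch: image features (e.g., from a frozen CLIP image encoder) are mapped
    to per-layer visual prefix vectors instead of being prepended to the text."""
    def __init__(self, img_feat_dim=512, n_layers=12, prefix_len=10, d_model=768):
        super().__init__()
        self.n_layers, self.prefix_len, self.d_model = n_layers, prefix_len, d_model
        self.proj = nn.Linear(img_feat_dim, n_layers * prefix_len * d_model)

    def forward(self, img_feat):                       # (batch, img_feat_dim)
        out = self.proj(img_feat)
        # The text input length is unchanged; visual information enters only
        # through these generated per-layer parameters.
        return out.view(-1, self.n_layers, self.prefix_len, self.d_model)
```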
We believe the performance might be further improved with a more sophisticated combination of task-specific and visual-specific parameter tuning in PLMs, but we leave this exploration to future work.
6.4. Multimodal Few-shot Learning
We further take the models trained on the V&L tasks as reported in Table 4 and evaluate them on the test sets after few-shot finetuning on OKVQA (Marino et al., 2019) and SNLI-VE (Xie et al., 2018). Since OKVQA has no test set, we split its original validation set into two halves, one for validation and the other for testing. For SNLI-VE, we use its validation set for validation and its test-P set for testing and reporting results. We follow the procedure in Section 6.2 to select samples and report results in Table 4.
Compared with full-parameter finetuning, i.e., CLIP-T5, and the other baseline, VL-Adapter, our method achieves the best performance with smaller variance in this multimodal few-shot learning setting. We find that VL-Adapter is inferior to CLIP-T5 with fewer samples (e.g., fewer than 500) on the OKVQA dataset. The reason may be that OKVQA involves a lot of out-of-domain knowledge and complex image content, which makes accurate prediction more challenging for the parameter-efficient VL-Adapter. In other words, the small number of samples is not enough to train the randomly initialized parameters introduced by VL-Adapter.
Our approach, however, can still cope with fewer samples. We use the hypernetwork to generate the trainable parameters of the adapters and multi-head attention, and we directly integrate image features into the attention modules in the form of prefix-tuning vectors. We believe such a method, though training fewer parameters, can still capture knowledge across tasks and transfer it in a few-shot setting. It is also worth noting that, across the five random seeds used, the variance of our method is generally smaller than that of VL-Adapter, which indicates that our method is more robust in this few-shot learning scenario.
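For illustration, the following sketch shows how generated visual prefixes could be prepended to the keys and values of a (frozen) attention layer; the single-head formulation, shapes, and names are assumptions.

```python
import torch

def attention_with_visual_prefix(q, k, v, prefix_k, prefix_v):
    """Single-head attention sketch with hypernetwork-generated visual prefixes
    prepended to the keys/values. All tensors: (batch, length, d_model)."""
    k = torch.cat([prefix_k, k], dim=1)                # extend keys with prefix
    v = torch.cat([prefix_v, v], dim=1)                # extend values with prefix
    scores = torch.softmax(q @ k.transpose(1, 2) / q.size(-1) ** 0.5, dim=-1)
    return scores @ v
```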
Authors:
(1) Zhengkun Zhang, equal contribution; work done during an internship at Noah’s Ark Lab, Huawei Technologies;
(2) Wenya Guo, TKLNDST, CS, Nankai University, China ([email protected]);
(3) Xiaojun Meng, equal contribution; Noah’s Ark Lab, Huawei Technologies;
(4) Yasheng Wang, Noah’s Ark Lab, Huawei Technologies;
(5) Yadao Wang, Noah’s Ark Lab, Huawei Technologies;
(6) Xin Jiang, Noah’s Ark Lab, Huawei Technologies;
(7) Qun Liu, Noah’s Ark Lab, Huawei Technologies;
(8) Zhenglu Yang, TKLNDST, CS, Nankai University, China.
[4] When adapting Prefix-tuning from BART to T5, note that the two models use different position embeddings, i.e., absolute position embeddings for BART and relative position embeddings for T5; it is therefore necessary to manually concatenate all-zero vectors to the relative position bias of each layer in T5.
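A minimal sketch of this adjustment, assuming a T5-style relative position bias of shape (batch, n_heads, query_len, key_len) and prefix vectors prepended on the key/value side; the function name is illustrative.

```python
import torch

def pad_position_bias_for_prefix(position_bias, prefix_len):
    """Extend a T5-style relative position bias with zeros for the prepended
    prefix slots, so its key dimension matches the extended keys/values."""
    batch, n_heads, q_len, k_len = position_bias.shape
    zeros = torch.zeros(batch, n_heads, q_len, prefix_len,
                        dtype=position_bias.dtype, device=position_bias.device)
    return torch.cat([zeros, position_bias], dim=-1)   # key_len -> prefix_len + k_len
```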