Table of Links
A. The Connection Between Prefix-tuning and Hypernetwork
B. Number of Tunable Parameters
6. Results and Analysis
We design a series of experiments on pure language and V&L tasks, in both multi-task and few-shot scenarios, to verify the effectiveness of our proposed framework compared to existing methods.
6.1. Results on the GLUE Benchmark
The original implementations of Prefix-tuning (Li & Liang, 2021) and MAMAdapter (He et al., 2021) use single-task training on BART (Lewis et al., 2020). For a fair comparison with the other baselines, we apply their methods to T5 in a multi-task training setting. [4] For each model, we share the parameters of both the prefix vectors and the adapter weights across tasks.
Overall, our HyperPELT method obtains the best performance with fewer trainable parameters. Compared to single-task Adapters, which finetune all the parameters introduced in the adapters, our method yields a significant improvement of 2.21% with far fewer trainable parameters, which illustrates the effectiveness of our proposed multi-task training framework.
In multi-task training, the proposed hypernetwork-based prefix-tuning strategy, i.e., HyperPrefix, decreases the number of trainable parameters (1.01× for HyperPrefix vs. 1.14× for Prefix-tuning) while achieving better performance (86.65% for HyperPrefix vs. 86.09% for Prefix-tuning). Notably, the number of trainable parameters per task is 11× smaller than that of Prefix-tuning.
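To make the parameter saving concrete, the following is a minimal sketch of how a shared hypernetwork can map a small per-task embedding to per-layer prefix key/value vectors, so that only the task embedding is task-specific. The module names, dimensions, and MLP design are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class PrefixHypernetwork(nn.Module):
    """Shared hypernetwork: maps a per-task embedding to per-layer prefix
    key/value vectors. Only the task embeddings are task-specific."""
    def __init__(self, task_emb_dim=64, n_layers=12, n_heads=12,
                 head_dim=64, prefix_len=10, hidden_dim=128):
        super().__init__()
        self.shape = (n_layers, 2, prefix_len, n_heads, head_dim)  # 2 = key, value
        out_dim = n_layers * 2 * prefix_len * n_heads * head_dim
        self.mlp = nn.Sequential(
            nn.Linear(task_emb_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, task_emb):
        # task_emb: (task_emb_dim,) -> (n_layers, 2, prefix_len, n_heads, head_dim)
        return self.mlp(task_emb).view(self.shape)

# Per-task storage is just one small embedding vector; the hypernetwork is shared.
task_embeddings = nn.Embedding(8, 64)                  # e.g., 8 GLUE tasks
hypernet = PrefixHypernetwork()
prefixes = hypernet(task_embeddings(torch.tensor(0)))  # prefixes for task 0
```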
HyperPELT outperforms HyperPrefix, mainly because we further add hypernetwork-based adapters to the feedforward layers in a parallel manner. In this way, the average performance is further improved (+0.44%) with only a small number of additional parameters (0.09% per task). The comparison with MAMAdapter shows that using a hypernetwork to tune each transformer block and learn shared knowledge across tasks leads to an improvement.
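As an illustration of this parallel adapter placement, the sketch below lets a shared linear generator produce the adapter's down/up projection weights from the task embedding, and adds the adapter branch to the frozen feedforward output. All names and dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class HyperAdapter(nn.Module):
    """Adapter whose down/up projections are generated by a shared hypernetwork
    from a task embedding (illustrative dimensions)."""
    def __init__(self, d_model=768, bottleneck=24, task_emb_dim=64):
        super().__init__()
        self.d_model, self.bottleneck = d_model, bottleneck
        self.generator = nn.Linear(task_emb_dim, 2 * d_model * bottleneck)

    def forward(self, hidden, task_emb):
        w = self.generator(task_emb)
        split = self.d_model * self.bottleneck
        w_down = w[:split].view(self.d_model, self.bottleneck)
        w_up = w[split:].view(self.bottleneck, self.d_model)
        return torch.relu(hidden @ w_down) @ w_up

def ffn_with_parallel_adapter(ffn, hidden, adapter, task_emb):
    # Parallel placement: the adapter branch reads the same input as the (frozen)
    # feedforward sublayer, and its output is added to the feedforward output.
    return ffn(hidden) + adapter(hidden, task_emb)
```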
6.2. Few-shot Domain Transfer
We take the models trained on GLUE as reported in Table 1 and evaluate them on the test sets of five different tasks after few-shot finetuning on each target training set. Following Mahabadi et al. (2021), we initialize each target task embedding with the task embedding trained on the most similar GLUE task, i.e., MNLI for CB, QNLI for QA, SST2 for sentiment analysis, and QQP for paraphrase detection.
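A minimal sketch of this initialization step, assuming the task embeddings live in an nn.Embedding table; the target task names and the mapping dictionary below are hypothetical placeholders in the spirit of the pairing described above.

```python
import torch
import torch.nn as nn

# Hypothetical mapping from a target task to its most similar GLUE source task.
SOURCE_TASK = {"cb": "mnli", "boolq": "qnli", "imdb": "sst2", "paws": "qqp"}

def init_target_embedding(task_embeddings: nn.Embedding, task_to_idx: dict,
                          target: str, target_idx: int) -> None:
    """Copy the trained source-task embedding into the new target-task slot
    before few-shot finetuning."""
    src_idx = task_to_idx[SOURCE_TASK[target]]
    with torch.no_grad():
        task_embeddings.weight[target_idx].copy_(task_embeddings.weight[src_idx])
```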
As suggested by Perez et al. (2021) and Zhao & Schütze (2021), we randomly select the same number of samples from the training and validation sets, making it a reasonable few-shot scenario. Checkpoints are selected via early stopping on the selected validation set, and the stopping metric is the default metric for each task.
In the first three columns of Table 2, we show the results of fully finetuned T5-Base, HYPERFORMER++ (finetuning both hypernetworks and task embeddings), and our proposed HyperPELT. Overall, our method achieves the best performance on the few-shot tasks.
For the CB and BoolQ tasks from SuperGLUE, even though the backbone T5 was previously trained on the training sets of these two tasks, the performance of the methods differs considerably. The two baselines still perform poorly with very few samples, e.g., 4 and 16, while our method is significantly better. We therefore assume that the two baselines suffer from catastrophic forgetting to some degree during multi-task training. In contrast, our proposed HyperPELT works effectively on these two tasks. We speculate that the reason might be the use of hypernetworks for both the prefix-tuning and adapter-tuning modules of the transformer. We leave this exploration to future work.
6.3. Results on the Vision-Language Benchmarks
We now move to experiments applying the proposed hypernetwork-based parameter-efficient training framework to V&L tasks. We compare against the pre-trained, fully finetuned VL-T5 (Cho et al., 2021), as well as CLIP-T5 and the adapter-based VL-Adapter (Sung et al., 2021), both built on top of T5, in the multi-task training setting.
To the best of our knowledge, we are the first to employ the visual modality to tune a very small number of parameters in the different transformer blocks, instead of inserting image patch tokens into the input sequence as is usually done. Experimental results demonstrate the effectiveness of this novel approach and provide a new perspective on extending multimodal capability on top of PLMs: features from different modalities serve as the input of a hypernetwork that generates parameters for modules in the PLM, rather than as part of the input sequence. One advantage of our approach is that it keeps the original maximum text input length, since no visual or audio features occupy it. This is promising for document-level and text-heavy tasks such as multimodal summarization (Zhang et al., 2022).
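As a simplified illustration of this idea, the sketch below feeds image features from a frozen visual encoder into a hypernetwork that emits per-layer parameters, leaving the text input sequence untouched. The encoder, shapes, and names are assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class VisualHypernetwork(nn.Module):
    """Sketch: image features (e.g., from a frozen CLIP image encoder) are mapped
    to per-layer visual prefix vectors instead of being prepended to the text."""
    def __init__(self, img_feat_dim=512, n_layers=12, prefix_len=10, d_model=768):
        super().__init__()
        self.n_layers, self.prefix_len, self.d_model = n_layers, prefix_len, d_model
        self.proj = nn.Linear(img_feat_dim, n_layers * prefix_len * d_model)

    def forward(self, img_feat):                       # (batch, img_feat_dim)
        out = self.proj(img_feat)
        # The text input length is unchanged; visual information enters only
        # through these generated per-layer parameters.
        return out.view(-1, self.n_layers, self.prefix_len, self.d_model)
```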
We believe the performance might be further improved with a more sophisticated combination of task-specific and visual-specific parameter tuning in PLMs, but we leave this exploration to future work.
6.4. Multimodal Few-shot Learning
We further take the models trained on the V&L tasks as reported in Table 4 and evaluate them on the test sets after few-shot finetuning on OKVQA (Marino et al., 2019) and SNLI-VE (Xie et al., 2018). Since OKVQA has no test set, we split its original validation set into two halves, one for validation and the other for testing. For SNLI-VE, we use its validation set for validation and its test-P set for testing and reporting results. We follow the procedure in Section 6.2 to select samples and report results in Table 4.
Compared with full-parameter finetuning, i.e., CLIP-T5, and the other baseline, VL-Adapter, our method achieves the best performance with smaller variance in this multimodal few-shot learning setting. We find that VL-Adapter is inferior to CLIP-T5 with fewer samples (e.g., fewer than 500) on the OKVQA dataset. The reason may be that OKVQA involves a lot of out-of-domain knowledge and complex image content, which makes accurate prediction more challenging for the parameter-efficient VL-Adapter. In other words, the small number of samples is not enough to train the randomly initialized parameters introduced by VL-Adapter.
Our approach, however, can still cope with fewer samples. We use the hypernetwork to generate the trainable parameters of the adapters and multi-head attention, and we directly integrate image features into the attention modules in the form of prefix-tuning vectors. We believe such a method, though training fewer parameters, can still capture knowledge across tasks and transfer it in a few-shot setting. It is also worth noting that, across the five random seeds used, the variance of our method is generally smaller than that of VL-Adapter, which indicates that our method is more robust in this few-shot learning scenario.
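For illustration, the following sketch shows how generated visual prefixes could be prepended to the keys and values of a (frozen) attention layer; the single-head formulation, shapes, and names are assumptions.

```python
import torch

def attention_with_visual_prefix(q, k, v, prefix_k, prefix_v):
    """Single-head attention sketch with hypernetwork-generated visual prefixes
    prepended to the keys/values. All tensors: (batch, length, d_model)."""
    k = torch.cat([prefix_k, k], dim=1)                # extend keys with prefix
    v = torch.cat([prefix_v, v], dim=1)                # extend values with prefix
    scores = torch.softmax(q @ k.transpose(1, 2) / q.size(-1) ** 0.5, dim=-1)
    return scores @ v
```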
Authors:
(1) Zhengkun Zhang, equal contribution; work done during an internship at Noah’s Ark Lab, Huawei Technologies;
(2) Wenya Guo, TKLNDST, CS, Nankai University, China ([email protected]);
(3) Xiaojun Meng, equal contribution; Noah’s Ark Lab, Huawei Technologies;
(4) Yasheng Wang, Noah’s Ark Lab, Huawei Technologies;
(5) Yadao Wang, Noah’s Ark Lab, Huawei Technologies;
(6) Xin Jiang, Noah’s Ark Lab, Huawei Technologies;
(7) Qun Liu, Noah’s Ark Lab, Huawei Technologies;
(8) Zhenglu Yang, TKLNDST, CS, Nankai University, China.
[4] When adapting Prefix-tuning from BART to T5, note that the two models use different position embeddings, i.e., absolute position embeddings for BART and relative position embeddings for T5; it is therefore necessary to manually concatenate all-zero vectors to the relative position bias of each layer in T5.
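A minimal sketch of this adjustment, assuming a T5-style relative position bias of shape (batch, n_heads, query_len, key_len) and prefix vectors prepended on the key/value side; the function name is illustrative.

```python
import torch

def pad_position_bias_for_prefix(position_bias, prefix_len):
    """Extend a T5-style relative position bias with zeros for the prepended
    prefix slots, so its key dimension matches the extended keys/values."""
    batch, n_heads, q_len, k_len = position_bias.shape
    zeros = torch.zeros(batch, n_heads, q_len, prefix_len,
                        dtype=position_bias.dtype, device=position_bias.device)
    return torch.cat([zeros, position_bias], dim=-1)   # key_len -> prefix_len + k_len
```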