2. Related Work
The early-bird ticket hypothesis was first introduced by Frankle et al. [5] in the context of CNNs. They discovered that subnetworks capable of matching the performance of fully trained networks can be identified early in the training process. This finding has led to a range of techniques for identifying and exploiting early-bird tickets in CNNs [1, 13].

In the domain of Transformers, explorations of the early-bird ticket hypothesis have been limited. One notable work is EarlyBERT by Kovaleva et al. [2], which investigated the applicability of the hypothesis to BERT and found that early-bird tickets exist in BERT and can be used to optimize the fine-tuning process. However, that work focused solely on BERT and did not provide a comparative analysis across different Transformer architectures.

Other works have explored techniques to optimize the training and inference of Transformer models. Michel et al. [8] proposed pruning attention heads in Transformers, reducing computational requirements while maintaining performance. Sanh et al. [9] introduced DistilBERT, a distilled version of BERT that achieves comparable performance with fewer parameters and faster inference.

Despite these efforts, the speedup and resource savings achievable through the early-bird ticket hypothesis in Transformers have not been fully explored. Many existing approaches rely on the train-prune-retrain methodology [6], which is time-consuming and resource-intensive. In this research, we address these limitations by investigating the early-bird ticket hypothesis across different Transformer architectures, including vision Transformers and language models. We explore efficient methods to identify early-bird tickets and evaluate their performance against fully trained models. Our goal is to provide insight into the applicability of the early-bird ticket hypothesis in Transformers and to contribute to the development of more efficient training strategies for these models.
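To make the contrast with train-prune-retrain concrete, the sketch below illustrates one detection criterion commonly used in the early-bird ticket literature: compute a magnitude-pruning mask at each epoch and declare a ticket "drawn" once the mask stops changing much between consecutive epochs, so pruning can happen long before full convergence. This is a minimal illustration under stated assumptions, not the paper's implementation; the function names and the prune_ratio and distance_threshold parameters are hypothetical choices made for the example.

```python
# Illustrative sketch (not the paper's code): detect an "early-bird" ticket by
# tracking how much the magnitude-pruning mask changes from epoch to epoch.
import numpy as np

def magnitude_mask(weights: np.ndarray, prune_ratio: float) -> np.ndarray:
    """Binary mask that keeps roughly the largest-magnitude (1 - prune_ratio) fraction of weights."""
    k = int(weights.size * prune_ratio)
    if k == 0:
        return np.ones_like(weights, dtype=bool)
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    return np.abs(weights) > threshold

def mask_distance(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Normalized Hamming distance between two pruning masks."""
    return float(np.mean(mask_a != mask_b))

def early_bird_epoch(weight_snapshots, prune_ratio=0.5, distance_threshold=0.1):
    """Return the first epoch whose mask differs from the previous epoch's mask
    by less than distance_threshold, i.e. the sparse structure has stabilized."""
    prev_mask = None
    for epoch, weights in enumerate(weight_snapshots):
        mask = magnitude_mask(weights, prune_ratio)
        if prev_mask is not None and mask_distance(mask, prev_mask) < distance_threshold:
            return epoch
        prev_mask = mask
    return None  # ticket never stabilized within the observed epochs

# Toy usage: synthetic weight snapshots that drift less over time, so the mask stabilizes early.
rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256))
snapshots = [w + rng.normal(scale=0.5 / (t + 1), size=w.shape) for t in range(10)]
print("early-bird epoch:", early_bird_epoch(snapshots))
```

In practice such a criterion would typically be applied per layer, or to structured units such as attention heads or channels, rather than to a single dense matrix as in this toy example.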
Author:
(1) Shravan Cheekati, Georgia Institute of Technology ([email protected]).