Date: 2025-07-01

5 Related Works

5.1 Pre-training and Fine-tuning

In the standard pre-training and fine-tuning paradigm [13, 17, 60, 70], models are first pre-trained on large datasets such as ImageNet-21K, BookCorpus, and Common Crawl [46, 51, 79]. They are then fine-tuned to improve convergence and performance on specific tasks [12].





Various approaches to parameter-efficient fine-tuning [78] have been proposed. LoRA [16] fine-tunes low-rank matrices at each layer to represent weight updates. The adapter approach [15] inserts small modules between layers and reduces the number of trainable parameters by tuning only these adapters [3, 19, 28, 74]. Visual prompt tuning (VPT) [18, 58] introduces a limited number of learnable parameters for optimization while keeping the backbone frozen. SSF [30] scales and shifts the deep features extracted by a pre-trained model.
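As a rough illustration of two of these ideas, the sketch below applies a LoRA-style low-rank update and an SSF-style scale-and-shift to a toy linear layer in plain Python. The shapes, values, and helper names (`matmul`, `lora_forward`, `ssf`) are assumptions chosen for illustration and are not taken from the cited papers' code.

```python
# Toy illustration of two parameter-efficient fine-tuning ideas.
# All shapes, values, and function names here are hypothetical.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def lora_forward(x, w, lora_b, lora_a):
    """Compute x (W + B A): W is frozen, only B (d_in x r) and A (r x d_out) train."""
    return add(matmul(x, w), matmul(matmul(x, lora_b), lora_a))

def ssf(features, scale, shift):
    """Scale and shift each feature channel produced by a frozen model."""
    return [[f * s + t for f, s, t in zip(row, scale, shift)] for row in features]

d_in, d_out, rank = 4, 3, 1
w = [[1.0 if i == j else 0.0 for j in range(d_out)] for i in range(d_in)]  # frozen
lora_b = [[0.0] for _ in range(d_in)]  # zero init, so the update starts at zero
lora_a = [[0.5, -0.5, 1.0]]
x = [[1.0, 2.0, 3.0, 4.0]]

y_before = lora_forward(x, w, lora_b, lora_a)  # identical to x W while B is zero
lora_b[0][0] = 2.0                             # stand-in for one training step
y_after = lora_forward(x, w, lora_b, lora_a)

y_ssf = ssf(y_after, scale=[2.0, 1.0, 0.5], shift=[0.0, 1.0, 0.0])

full_params = d_in * d_out            # parameters updated by full fine-tuning
lora_params = rank * (d_in + d_out)   # parameters updated by the low-rank update
ssf_params = 2 * d_out                # one scale and one shift per channel
```

In this toy setting the low-rank update trains 7 values and the scale-and-shift trains 6, against 12 for the full weight matrix; the gap widens quickly as the layer dimensions grow while the rank stays small.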

5.2 Model Architectures

Convolution has long served as the main module for extracting image features in computer vision tasks, predating transformer-based models [5, 31, 61, 73]. Thanks to their inductive prior, convolution-based models require fewer training images and computational resources to achieve good generalization. Convolution-based architectures have been widely studied [13, 32, 57] and have found many applications, such as feature extraction [48], image generation [20, 59], and super-resolution [68]. Numerous studies explore integrating convolutional techniques into vision transformers to enhance their performance [10, 47]. Parameter-efficient fine-tuning of pre-trained large-scale convolution-based models on downstream tasks is crucial and requires further examination.

5.3 Discriminative and Generative Tasks

Discriminative and generative tasks represent fundamental concepts in machine learning. Discriminative models [11, 13, 39, 80] are designed to distinguish between data instances, while generative models [20, 48, 59, 68] are used to create new data instances. Discriminative models have been applied to image classification [13, 32, 57], object detection [39, 80], and semantic segmentation [11]. Generative models have been extensively studied for image synthesis, including variational autoencoders [22, 48, 63, 65], diffusion models [4, 49, 59], and autoregressive models [37, 42, 64].





In this study, we focus on parameter-efficient fine-tuning techniques for two tasks: image classification using ConvNeXt [32] and image synthesis using Stable Diffusion [49].





6 Conclusion

In this work, we propose a parameter-efficient fine-tuning method for large convolutional models by formulating the convolutional layers over the filter subspace. Fine-tuning only the filter atoms, which comprise a small number of parameters, while keeping the atom coefficients unchanged is notably parameter-efficient. It maintains the capabilities of pre-trained models while avoiding overfitting to downstream tasks. We also formulate a simple yet effective way to obtain an overcomplete filter subspace by decomposing each filter atom over another set of filter atoms, thereby expanding the parameter space available for fine-tuning as needed. Our approach proves effective across different configurations on both discriminative and generative tasks.
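To make the filter-subspace idea concrete, the following minimal sketch builds each convolutional filter as a linear combination of a few shared filter atoms, then changes one atom while leaving the atom coefficients frozen. It is an illustration under stated assumptions rather than the implementation used in this work: the function name `compose_filters`, the number of atoms `m`, the toy shapes, and the choice to share atoms across all channel pairs of a layer are all assumed here, and the overcomplete variant is not covered.

```python
# Sketch: convolutional filters expressed over a small set of shared filter atoms.
# Shapes, values, and names (m atoms, compose_filters) are illustrative assumptions.

def compose_filters(coeffs, atoms):
    """Return filters[o][i] = sum_j coeffs[o][i][j] * atoms[j] (each atom is k x k)."""
    k = len(atoms[0])
    return [
        [
            [
                [sum(c * atom[r][s] for c, atom in zip(pair_coeffs, atoms))
                 for s in range(k)]
                for r in range(k)
            ]
            for pair_coeffs in row
        ]
        for row in coeffs
    ]

c_out, c_in, k, m = 2, 2, 3, 2

# Two 3 x 3 atoms: a centre tap and a horizontal difference.
atoms = [
    [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.0], [-1.0, 0.0, 1.0], [0.0, 0.0, 0.0]],
]
# Frozen atom coefficients: one length-m vector per (output, input) channel pair.
coeffs = [[[1.0, 0.0], [0.5, 2.0]], [[0.0, 1.0], [1.0, 1.0]]]

filters = compose_filters(coeffs, atoms)

# Fine-tuning touches only the atoms; every filter built from them changes.
atoms[0][1][1] = 2.0
tuned = compose_filters(coeffs, atoms)

atom_params = m * k * k             # trainable under this scheme
coeff_params = c_out * c_in * m     # kept frozen
full_params = c_out * c_in * k * k  # what full fine-tuning would update
```

The toy sizes understate the saving. The atom count `m * k * k` does not depend on the channel dimensions, so under the same assumptions a layer with 256 input and 256 output channels, 3 x 3 kernels, and 9 atoms would train 81 atom values against 589,824 filter weights.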





Limitations. Our method, which concentrates on tuning models within the filter subspace, is particularly advantageous for ConvNets. While it can be naturally extended to linear layers through appropriate mathematical formulations, the full potential of our approach when applied to linear layers remains underexplored.





References

This paper is available on arXiv under the CC BY 4.0 DEED license.



