Authors:
(1) Jialin Zhao, Center for Complex Network Intelligence (CCNI), Tsinghua Laboratory of Brain and Intelligence (THBI) and Department of Computer Science;
(2) Yingtao Zhang, Center for Complex Network Intelligence (CCNI), Tsinghua Laboratory of Brain and Intelligence (THBI) and Department of Computer Science;
(3) Xinghang Li, Department of Computer Science;
(4) Huaping Liu, Department of Computer Science;
(5) Carlo Vittorio Cannistraci, Center for Complex Network Intelligence (CCNI), Tsinghua Laboratory of Brain and Intelligence (THBI), Department of Computer Science, and Department of Biomedical Engineering, Tsinghua University, Beijing, China.
Table of Links
- Low Rank Adaptation
- Sparse Spectral Training
  - 4.1 Preliminaries and 4.2 Gradient Update of U, VT with Σ
  - 4.3 Why SVD Initialization is Important
  - 4.4 SST Balances Exploitation and Exploration
  - 4.5 Memory-Efficient Implementation for SST and 4.6 Sparsity of SST
- Supplementary Information
  - A. Algorithm of Sparse Spectral Training
  - B. Proof of Gradient of Sparse Spectral Layer
  - C. Proof of Decomposition of Gradient of Weight
  - D. Proof of Advantage of Enhanced Gradient over Default Gradient
  - E. Proof of Zero Distortion with SVD Initialization
  - H. Evaluating SST and GaLore: Complementary Approaches to Memory Efficiency
Abstract
The growing computational demands posed by the increasing number of neural network parameters necessitate low-memory-consumption training approaches. Previous memory-reduction techniques, such as Low-Rank Adaptation (LoRA) and ReLoRA, suffer from the limitation of low rank and from saddle-point issues, particularly during intensive tasks like pre-training. In this paper, we propose Sparse Spectral Training (SST), an advanced training methodology that updates all singular values and selectively updates singular vectors of network weights, thereby optimizing resource usage while closely approximating full-rank training. SST refines the training process by employing a targeted updating strategy for singular vectors, determined by multinomial sampling weighted by the significance of the singular values, ensuring both high performance and memory reduction. Through comprehensive testing on both Euclidean and hyperbolic neural networks across various tasks, including natural language generation, machine translation, node classification and link prediction, SST demonstrates its capability to outperform existing memory-reduction training methods and is comparable with full-rank training in some cases. On OPT-125M, with a rank equal to 8.3% of the embedding dimension, SST reduces the perplexity gap to full-rank training by 67.6%, demonstrating a significant reduction of the performance loss incurred by prevalent low-rank methods. This approach offers a strong alternative to traditional training techniques, paving the way for more efficient and scalable neural network training solutions.
1 Introduction
The development and growing scale of large language models [1–3] pose great challenges to the feasibility of training such models from scratch. Standard training methods that update all model parameters become extremely expensive due to their extensive memory requirements.
Recent developments in parameter-efficient fine-tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA) [4], have sought to mitigate the memory requirements of fine-tuning by introducing trainable low-rank matrices that efficiently reduce the memory footprint. However, the constraint of a predetermined rank can severely limit the ability of a model to capture and represent complex data patterns, leading to suboptimal performance, especially in the pre-training stage. Recent improvements such as ReLoRA [5] and Chain of LoRA [6] break the limitation of the low-dimensional search space. However, they still suffer from saddle-point issues. Saddle points are locations where the gradient is zero but which are not true minima, potentially leading to slower and less effective convergence than full-rank models during pre-training.
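To make the rank constraint concrete, the following minimal PyTorch-style sketch shows how a LoRA-style layer augments a frozen weight with a trainable low-rank update of fixed rank r. The class name, initialization, and scaling here are illustrative assumptions, not the implementation of [4].

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Illustrative LoRA-style layer: a frozen weight W is augmented with a
    trainable low-rank update B @ A of fixed rank r (hypothetical sketch)."""

    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        # The full-rank weight stays frozen during fine-tuning.
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        # Only the two low-rank factors (r * (in + out) parameters) receive gradients.
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = x W^T + scaling * x A^T B^T; the effective update Delta W = B A has rank <= r.
        return x @ self.weight.T + self.scaling * (x @ self.A.T) @ self.B.T
```

Because the update Delta W = B A can never exceed rank r, the search space during training is confined to a fixed low-dimensional subspace, which is precisely the limitation discussed above.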
In response to these challenges, we introduce Sparse Spectral Training (SST), a new training framework designed to optimize memory consumption while closely approximating the overall learning dynamics and performance of full-rank training. Unlike previous methods [4, 5, 7, 8] that primarily focus on updating only a subset of the parameters, SST adopts a more effective approach by updating all singular values. SST also capitalizes on the intrinsic spectral properties of the weight matrices, focusing updates on the components that are most influential to the model's learning process, as measured by their singular values. Additionally, SST uses singular value decomposition to initialize the low-rank parameters, minimizing distortion relative to full-rank training.
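The schematic Python sketch below illustrates the two ingredients just described: SVD initialization of the spectral factors and multinomial sampling of the singular vectors to update, weighted by the singular values. The function names and the exact weighting are our own simplifying assumptions; the precise SST update rules are given in Section 4.

```python
import torch


def svd_init(weight: torch.Tensor):
    """Initialize spectral factors U, S, Vt from the full weight via SVD, so the
    spectral parameterization starts with zero distortion (cf. Section 4.3)."""
    U, S, Vt = torch.linalg.svd(weight, full_matrices=False)
    return U, S, Vt


def sample_active_vectors(S: torch.Tensor, r: int) -> torch.Tensor:
    """Multinomial sampling of r singular-vector indices, weighted by singular
    value magnitude (schematic; the weighting used by SST is in Section 4.4)."""
    probs = S / S.sum()
    return torch.multinomial(probs, num_samples=r, replacement=False)


# Usage sketch: all singular values in S remain trainable at every step, while
# only the sampled columns of U and rows of Vt receive gradient updates.
W = torch.randn(512, 512)
U, S, Vt = svd_init(W)
active = sample_active_vectors(S, r=32)
U_active, Vt_active = U[:, active], Vt[active, :]
```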
We conduct comprehensive evaluations across different tasks, including pre-training large language models from the OPT model family (125M to 1.3B) [9], training Transformers [10] on machine translation tasks, and training hyperbolic graph neural networks [11, 12] on node classification and link prediction tasks. The empirical results show that, with a rank equal to 6.25% of the model dimension, SST outperforms full-rank training on machine translation tasks and achieves state-of-the-art performance among prevalent parameter-efficient training methods. Furthermore, we are the first to embed the parameter-efficient training process in hyperbolic space, which demonstrates that SST is a general technique applicable across various data structures and models, effectively enhancing the adaptability and scalability of neural network training in resource-constrained environments.
2 Related Work
Low-Rank Adaptation Low-rank adaptation has become a key strategy for reducing the computational and memory requirements of training large-scale neural networks. Hu et al. [4] introduced Low-Rank Adaptation (LoRA), a technique that fine-tunes pre-trained models by integrating low-rank matrices to significantly reduce the number of parameters updated during training. Various enhancements to LoRA have since been developed to improve its efficiency and broaden its application [7, 13–15]. Lialin et al. [5] introduced ReLoRA specifically for the pre-training phase; it requires a full-rank warm-up to achieve performance similar to full-rank training. A similar approach is found in COLA [6]. Additionally, Zhao et al. [16] introduced GaLore, which projects gradients onto a low-rank subspace. These advancements highlight the versatility and ongoing evolution of low-rank adaptation techniques in response to the growing complexity of neural network models.
Other Parameter-Efficient Training Methods Apart from low-rank adaptation, researchers have developed a variety of parameter-efficient training techniques to optimize resource consumption while preserving learning effectiveness. Prompt tuning is an effective method that integrates tunable prefixes or soft prompts into the input embeddings of models, enabling lightweight task-specific adaptation with minimal impact on the model's overall architecture [17, 18]. Dynamic sparse training (DST), through methods like SET [19], RIGL [20], MEST [21], and CHT [22], employs a dynamic prune-and-grow strategy that adjusts network topology during training. This approach optimizes training efficiency and can improve generalization by continuously adapting the network's sparse structure, presenting a significant shift from static training methods.
Hyperbolic Neural Networks Hyperbolic neural networks are an emerging field in deep learning, exploiting the unique properties of hyperbolic space that make it ideal for processing hierarchical and graph-structured data [23, 24]. Innovations in this area have adapted fundamental neural network mechanisms to function within hyperbolic geometries, as demonstrated by Muscoloni et al. [23] and Ganea et al. [25]. Further developments by Chen et al. [12] explore manifold-specific properties to enrich both theoretical understanding and practical deployment. The employment of hyperbolic spaces has been shown to significantly improve data representation and generalization across various tasks, marking a notable advancement in managing complex, non-Euclidean data structures [26–28].
