Hardware-Aware Algorithm for Selective State Space Models


Authors:

(1) Albert Gu, Machine Learning Department, Carnegie Mellon University (equal contribution);

(2) Tri Dao, Department of Computer Science, Princeton University (equal contribution).

Table of Links

Abstract and 1 Introduction

2 State Space Models

3 Selective State Space Models and 3.1 Motivation: Selection as a Means of Compression

3.2 Improving SSMs with Selection

3.3 Efficient Implementation of Selective SSMs

3.4 A Simplified SSM Architecture

3.5 Properties of Selection Mechanisms

3.6 Additional Model Details

4 Empirical Evaluation and 4.1 Synthetic Tasks

4.2 Language Modeling

4.3 DNA Modeling

4.4 Audio Modeling and Generation

4.5 Speed and Memory Benchmarks

4.6 Model Ablations

5 Discussion

6 Conclusion and References


A Discussion: Selection Mechanism

B Related Work

C Mechanics of Selective SSMs

D Hardware-aware Algorithm For Selective SSMs

E Experimental Details and Additional Results

D Hardware-aware Algorithm For Selective SSMs

Without input-dependent selectivity, SSMs can be efficiently implemented as a convolution (Dao, Fu, Saab, et al. 2023; Gu, Goel, and Ré 2022), which leverages the fast Fourier transform (FFT) as a primitive. With selectivity, SSMs are no longer equivalent to convolution, but we leverage the parallel associative scan. While SSM scans are theoretically efficient (O(BLDN) FLOPs, scaling linearly in L), training foundation models with selective SSMs requires them to be efficient on modern hardware (GPUs) as well. We describe how we use kernel fusion and recomputation to make the SSM scan fast and memory-efficient. We evaluate the speed of our scan implementation compared to convolution and attention in Section 4.5, showing that it is up to 7× faster than attention at sequence length 32K and is as memory-efficient as the best attention implementation (FlashAttention).
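To make the role of the parallel associative scan concrete, the following reference sketch (in JAX, not the fused CUDA kernel described in this appendix) expresses the selective recurrence as an associative scan. It assumes a diagonal A, a single batch element, and a simplified Euler-style discretization of B; the function name and shapes are illustrative.

```python
import jax
import jax.numpy as jnp

def selective_scan_ref(x, delta, A, B, C):
    # Illustrative shapes: x, delta are (L, D); A is (D, N); B, C are (L, N).
    # Discretization (zero-order hold for A, simplified Euler step for B):
    A_bar = jnp.exp(jnp.einsum('ld,dn->ldn', delta, A))   # (L, D, N)
    Bx = jnp.einsum('ld,ln,ld->ldn', delta, B, x)         # (L, D, N): delta * B * x

    # The recurrence h_t = A_bar_t * h_{t-1} + Bx_t composes associatively,
    # so it can be evaluated with a parallel scan instead of a sequential loop.
    def combine(e1, e2):
        a1, b1 = e1
        a2, b2 = e2
        return a2 * a1, a2 * b1 + b2

    _, h = jax.lax.associative_scan(combine, (A_bar, Bx), axis=0)  # h: (L, D, N)
    return jnp.einsum('ldn,ln->ld', h, C)                          # y_t = C_t h_t
```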


Speed. On modern hardware accelerators (GPUs), most operations (except matrix multiply) are bounded by memory bandwidth (Dao, Fu, Ermon, et al. 2022; Ivanov et al. 2021; Williams, Waterman, and Patterson 2009). This is the case with our scan operation, so we use kernel fusion to reduce the amount of memory IOs, leading to a significant speedup compared to a standard implementation.
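As a rough illustration of why the scan is bandwidth-bound, the back-of-the-envelope roofline estimate below compares the scan's arithmetic intensity to an assumed GPU balance point; all hardware numbers are assumptions for a modern accelerator, not measurements from the paper.

```python
# Roofline-style estimate; every number below is an illustrative assumption.
flops_per_state_element = 3        # ~ a*h + b plus an output accumulation
bytes_per_state_element = 3 * 4    # read a and b, write h, in FP32 (unfused)

arithmetic_intensity = flops_per_state_element / bytes_per_state_element  # ~0.25 FLOP/byte

# Assumed accelerator: ~20 TFLOP/s (FP32) and ~2 TB/s HBM bandwidth.
gpu_balance_point = 20e12 / 2e12   # ~10 FLOP/byte

print(arithmetic_intensity < gpu_balance_point)  # True: bounded by memory bandwidth
```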



Concretely, instead of preparing the scan inputs of size (B, L, D, N) in GPU HBM, we load the SSM parameters ∆, A, B, C directly from slow HBM to fast SRAM, perform the discretization and recurrence in SRAM, and then write the final outputs of size (B, L, D) back to HBM. This way, we reduce IOs by a factor of O(N) (the state dimension), which in practice speeds up the operation by 20-40 times (Section 4.5).
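A back-of-the-envelope count of HBM traffic makes the O(N) factor explicit; the shapes below are illustrative and constant factors (e.g., element size) are ignored.

```python
# Illustrative sizes: (batch, sequence length, channels, state dimension).
B, L, D, N = 8, 2048, 2048, 16

# Standard implementation: materialize the discretized scan inputs (A_bar and
# B_bar*x), each of size (B, L, D, N), in HBM (written once, read once each).
standard_io = 2 * 2 * B * L * D * N

# Fused kernel: read Delta (B, L, D), A (D, N), and the B and C parameters
# (B, L, N each) from HBM, and write only the output of size (B, L, D) back.
fused_io = B * L * D + D * N + 2 * B * L * N + B * L * D

print(standard_io / fused_io)  # ~2N when D >> N, i.e. an O(N) reduction in IOs
```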



When the sequence length L is too long for the sequence to fit in SRAM (which is much smaller than HBM), we split the sequence into chunks and perform the fused scan on each chunk. As long as we keep the intermediate scan states, we can continue the scan with the next chunk.
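The sketch below illustrates the chunking idea, reusing the associative-scan formulation from the first sketch. The real kernel performs each chunk's scan in SRAM; this Python loop only demonstrates how carrying the final state of one chunk lets the scan continue on the next. The chunk length and shapes are illustrative.

```python
import jax
import jax.numpy as jnp

def _combine(e1, e2):
    # Same binary operator as in the first sketch: composes h -> a*h + b maps.
    a1, b1 = e1
    a2, b2 = e2
    return a2 * a1, a2 * b1 + b2

def scan_in_chunks(A_bar, Bx, chunk_len):
    # A_bar, Bx: discretized scan inputs of illustrative shape (L, D, N).
    L = A_bar.shape[0]
    h = jnp.zeros_like(A_bar[0])   # (D, N) state carried between chunks
    outputs = []
    for start in range(0, L, chunk_len):
        a = A_bar[start:start + chunk_len]
        b = Bx[start:start + chunk_len]
        # Fold the carried state into the chunk's first step: h_0 = a_0*h + b_0.
        b = b.at[0].set(a[0] * h + b[0])
        _, h_states = jax.lax.associative_scan(_combine, (a, b), axis=0)
        outputs.append(h_states)
        h = h_states[-1]           # intermediate scan state used by the next chunk
    return jnp.concatenate(outputs, axis=0)
```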


Memory. We describe how we use the classical technique of recomputation to reduce the total amount of memory required to train selective SSM layers.


Because of the way we fuse the forward pass, we do not save the intermediate states of size (B, L, D, N), in order to avoid blowing up the memory. However, these intermediate states are necessary for the backward pass to compute gradients, so we recompute them in the backward pass instead. Since the inputs ∆, A, B, C and the output gradient read from HBM to SRAM are of size O(BLN + DN), and the input gradients are also of size O(BLN + DN), recomputation avoids the cost of reading O(BLND) elements from HBM. This means that recomputing the SSM states in the backward pass actually speeds up the computation compared to storing them and reading them back from HBM.
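As a rough analogy (not the paper's CUDA implementation), JAX's rematerialization primitive expresses the same trade-off: wrapping the scan from the first sketch in jax.checkpoint keeps only its inputs for the backward pass and recomputes the intermediate states when gradients are taken.

```python
import jax
import jax.numpy as jnp

# Reuse `selective_scan_ref` from the first sketch. jax.checkpoint (a.k.a.
# remat) marks it so that its internals, including the (L, D, N) states,
# are recomputed during the backward pass rather than stored.
selective_scan_remat = jax.checkpoint(selective_scan_ref)

def loss(x, delta, A, B, C):
    y = selective_scan_remat(x, delta, A, B, C)
    return jnp.sum(y * y)

# Gradients with respect to delta, A, B, C re-run the scan in the backward
# pass; only the small O(BLN + DN)-sized inputs are kept in memory.
grad_fn = jax.grad(loss, argnums=(1, 2, 3, 4))
```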


Beyond optimizing the memory requirement of just the scan operation, we also use recomputation to optimize the memory requirement of the entire selective SSM block (input projection, convolution, activation, scan, output projection). In particular, we do not save intermediate activations that take a lot of memory but are fast to recompute (e.g., the output of the activation function or of the short convolution). As a result, the selective SSM layer has the same memory requirement as an optimized Transformer implementation with FlashAttention. In particular, each attention layer (FlashAttention) stores around 12 bytes of activations per token and each MLP layer stores around 20 bytes of activations per token, for a total of 32 bytes (assuming mixed-precision training in FP16 or BF16). Each selective SSM layer stores around 16 bytes of activations per token. Hence two layers of selective SSMs have around the same activation memory as an attention layer plus an MLP layer.
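A small sanity check of the activation-memory comparison, using only the per-token byte counts quoted above (assuming FP16/BF16 mixed-precision training):

```python
# Per-token activation memory; byte counts are taken from the text.
attn_bytes_per_token = 12   # attention layer with FlashAttention
mlp_bytes_per_token = 20    # MLP layer
ssm_bytes_per_token = 16    # selective SSM layer

transformer_pair = attn_bytes_per_token + mlp_bytes_per_token  # 32 bytes/token
two_ssm_layers = 2 * ssm_bytes_per_token                       # 32 bytes/token
assert two_ssm_layers == transformer_pair
```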


This paper is available on arXiv under a CC BY 4.0 DEED license.