
Mamba: Linear-Time Sequence Modeling with Selective State Spaces: Model Ablations


Authors:

(1) Albert Gu, Machine Learning Department, Carnegie Mellon University (equal contribution) ([email protected]);

(2) Tri Dao, Department of Computer Science, Princeton University (equal contribution) ([email protected]).

Table of Links

Abstract and 1. Introduction

2 State Space Models

3 Selective State Space Models and 3.1 Motivation: Selection as a Means of Compression

3.2 Improving SSMs with Selection

3.3 Efficient Implementation of Selective SSMs

3.4 A Simplified SSM Architecture

3.5 Properties of Selection Mechanisms

3.6 Additional Model Details

4 Empirical Evaluation and 4.1 Synthetic Tasks

4.2 Language Modeling

4.3 DNA Modeling

4.4 Audio Modeling and Generation

4.5 Speed and Memory Benchmarks

4.6 Model Ablations

5 Discussion

6 Conclusion, Acknowledgments and References

A Discussion: Selection Mechanism

B Related Work and B.1 S4 Variants and Derivatives

B.2 SSM Architectures

B.3 Relationship to RNNs

B.4 Linear Attention and B.5 Long Context Models

C Mechanics of Selective SSMs

D Hardware-aware Algorithm For Selective SSMs

E Experimental Details and Additional Results and E.1 Synthetic Tasks

E.2 Language Modeling

E.3 DNA Modeling

E.4 Audio Details

E.5 Efficiency Benchmark

4.6 Model Ablations

4.6.1 Architecture

Table 6 investigates the effects of the architecture (block) and its inner SSM layer (Figure 3). We find that:


• Among previous non-selective (LTI) SSMs, which are equivalent to global convolutions, performance is very similar (see the sketch after this list).


• Replacing the complex-valued S4 variant from previous work with a real-valued one does not affect performance much, suggesting that (at least for LM) real-valued SSMs may be a better choice when accounting for hardware efficiency.


• Replacing any of these with a selective SSM (S6) significantly improves performance, validating the motivation of Section 3.


• The Mamba architecture performs similarly to the H3 architecture (and seems slightly better when using a selective layer).
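
To make the contrast in the first and third bullets concrete, here is a minimal NumPy sketch (shapes and names are illustrative assumptions, not the paper's implementation). With fixed (A, B, C), unrolling the recurrence gives y_t = Σ_k C A^k B u_{t−k}, a global causal convolution; once (A, B, C) depend on the input, no fixed kernel exists and the model must run as a recurrence.

```python
# Sketch: LTI SSM = global convolution; selective SSM = forced recurrence.
import numpy as np

def lti_ssm_as_conv(u, A, B, C):
    """LTI SSM with fixed (A, B, C): y is a causal convolution of u."""
    L = len(u)
    # Unrolled recurrence collapses to a fixed kernel K[k] = C A^k B.
    K = np.array([C @ np.linalg.matrix_power(A, k) @ B for k in range(L)])
    return np.convolve(u, K)[:L]

def selective_ssm_scan(u, A_fn, B_fn, C_fn, N):
    """Selective SSM: (A, B, C) are functions of the input token, so
    there is no fixed kernel and the model is run step by step."""
    x, ys = np.zeros(N), []
    for u_t in u:
        x = A_fn(u_t) @ x + B_fn(u_t) * u_t  # input-dependent transition
        ys.append(C_fn(u_t) @ x)             # input-dependent readout
    return np.array(ys)
```

This is exactly the trade the ablation measures: all LTI variants share the convolutional form (and perform alike), while S6 gives up the fixed kernel in exchange for selectivity.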


We also investigate interleaving the Mamba block with other blocks such as MLP (a traditional architecture) and MHA (a hybrid attention architecture) in Appendix E.2.2.

4.6.2 Selective SSM

Table 7 ablates the selective SSM layer by considering different combinations of selective ∆, B, and C parameters (Algorithm 2), showing that ∆ is the most important parameter due to its connection to RNN gating (Theorem 1).
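
The finding that ∆ matters most is easiest to see in the recurrence itself. Below is a hedged single-channel sketch of the S6 step (the parameter names and the Euler-style handling of the ∆B term are our simplifications, not the paper's fused kernel): after discretization, ∆ interpolates between keeping and resetting the state, which is the RNN-gate behavior behind Theorem 1.

```python
# Single-channel sketch of the S6 recurrence (cf. Algorithm 2).
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def s6_single_channel(u, A, w_dt, b_dt, W_B, W_C):
    """u: (L,) scalar inputs; A: (N,) negative diagonal state matrix;
    w_dt, b_dt: scalars; W_B, W_C: (N,) selective projection weights."""
    x, ys = np.zeros_like(A), []
    for u_t in u:
        dt = softplus(w_dt * u_t + b_dt)  # Delta_t > 0, input-dependent
        B_t, C_t = W_B * u_t, W_C * u_t   # selective B and C
        Abar = np.exp(dt * A)             # ZOH discretization of A
        # Gating view (Theorem 1): dt -> 0 gives Abar -> 1 (keep the
        # state, ignore the token); large dt gives Abar -> 0 (reset the
        # state, focus on the token). This is why Delta dominates.
        x = Abar * x + dt * B_t * u_t     # simplified Delta*B input term
        ys.append(C_t @ x)
    return np.array(ys)
```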


Table 8 considers different initializations of the SSM, which have been shown to make a large difference in some data modalities and settings (Gu, Goel, and Ré 2022; Gu, Gupta, et al. 2022). On language modeling, we find that a simpler real-valued diagonal initialization (S4D-Real, row 3) performs better than the more standard complex-valued parameterization (S4D-Lin, row 1). Random initializations also work well, consistent with findings from prior work (Mehta et al. 2023).
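
For concreteness, here is a sketch of the diagonal initializations being compared, following the S4D definitions (Gu, Gupta, et al. 2022); the exact distribution of the random variant below is an assumption for illustration.

```python
# Diagonal state-matrix initializations compared in Table 8.
import numpy as np

def s4d_lin(N):
    """S4D-Lin (complex): A_n = -1/2 + i*pi*n (row 1)."""
    return -0.5 + 1j * np.pi * np.arange(N)

def s4d_real(N):
    """S4D-Real (real): A_n = -(n + 1) (row 3; best here for LM)."""
    return -(np.arange(N) + 1.0)

def random_real(N, seed=0):
    """An illustrative random negative-real initialization; kept
    negative so the discretized dynamics stay stable."""
    return -np.exp(np.random.default_rng(seed).normal(size=N))
```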


Table 9 and Table 10 consider varying the dimension of the ∆ and (B, C) projections respectively. Changing them from static to selective provides the most benefit, while increasing the dimensions further generally improves performance modestly with a small increase in parameter count.
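
A hedged sketch of where these two dimensions live in the layer (module and attribute names are illustrative, not the released Mamba code): ∆ is produced by a low-rank projection whose rank Table 9 varies, while B and C are projections to the state dimension N that Table 10 varies.

```python
# Sketch (PyTorch) of the selective projections and their sizes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveProjections(nn.Module):
    def __init__(self, d_model: int, dt_rank: int, d_state: int):
        super().__init__()
        # Delta: low-rank projection d_model -> dt_rank -> d_model
        # (Table 9 varies dt_rank; larger rank helps modestly and
        # adds few parameters).
        self.dt_down = nn.Linear(d_model, dt_rank, bias=False)
        self.dt_up = nn.Linear(dt_rank, d_model, bias=True)
        # B, C: projections d_model -> d_state = N (Table 10 varies N;
        # parameters grow only linearly in N).
        self.B_proj = nn.Linear(d_model, d_state, bias=False)
        self.C_proj = nn.Linear(d_model, d_state, bias=False)

    def forward(self, u):  # u: (batch, length, d_model)
        dt = F.softplus(self.dt_up(self.dt_down(u)))
        return dt, self.B_proj(u), self.C_proj(u)
```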


Of particular note is the dramatic improvement of the selective SSM when the state size N is increased, with over a 1.0 perplexity improvement for a cost of only 1% additional parameters. This validates our core motivation in Sections 3.1 and 3.3.
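
A back-of-envelope estimate (an assumption-laden sketch, not the paper's exact accounting) shows why growing N is so cheap: the block's parameter count is dominated by its d_model-sized projections, while N enters only linearly through A and the B/C projection rows.

```python
# Rough estimate of the parameter cost of increasing the state size N.
def extra_params_fraction(d_inner: int, n_old: int, n_new: int) -> float:
    # N-dependent params per block: A (d_inner * N) plus the B and C
    # projection rows (2 * d_inner * N) => 3 * d_inner per unit of N.
    extra = 3 * d_inner * (n_new - n_old)
    # Dominant N-independent params: the block's in/out projections,
    # roughly 6 * d_model^2 with expand factor 2 (d_inner = 2 * d_model).
    d_model = d_inner // 2
    return extra / (6 * d_model ** 2)

# Example: d_model = 768 (d_inner = 1536), growing N from 1 to 16 adds
# about 2% parameters -- the same order as the ~1% reported.
print(f"{extra_params_fraction(1536, 1, 16):.1%}")
```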


Table 6: (Ablations: Architecture and SSM layer.) The Mamba block performs similarly to H3 while being simpler. In the inner layer, there is little difference among different parameterizations of LTI models, while selective SSMs (S6) provide a large improvement. More specifically, the S4 (real) variant is S4D-Real and the S4 (complex) variant is S4D-Lin.


This paper is available on arxiv under CC BY 4.0 DEED license.

