Mamba: Linear-Time Sequence Modeling with Selective State Spaces: DNA Modeling

Authors:

(1) Albert Gu, Machine Learning Department, Carnegie Mellon University with Equal contribution ([email protected]);

(2) Tri Dao, Department of Computer Science, Princeton University with Equal contribution ([email protected]).

Table of Links

Abstract and 1. Introduction

2 State Space Models

3 Selective State Space Models and 3.1 Motivation: Selection as a Means of Compression

3.2 Improving SSMs with Selection

3.3 Efficient Implementation of Selective SSMs

3.4 A Simplified SSM Architecture

3.5 Properties of Selection Mechanisms

3.6 Additional Model Details

4 Empirical Evaluation and 4.1 Synthetic Tasks

4.2 Language Modeling

4.3 DNA Modeling

4.4 Audio Modeling and Generation

4.5 Speed and Memory Benchmarks

4.6 Model Ablations

5 Discussion

6 Conclusion, Acknowledgments and References

A Discussion: Selection Mechanism

B Related Work and B.1 S4 Variants and Derivatives

B.2 SSM Architectures

B.3 Relationship to RNNs

B.4 Linear Attention and B.5 Long Context Models

C Mechanics of Selective SSMs

D Hardware-aware Algorithm For Selective SSMs

E Experimental Details and Additional Results and E.1 Synthetic Tasks

E.2 Language Modeling

E.3 DNA Modeling

E.4 Audio Details

E.5 Efficiency Benchmark

E.3 DNA Modeling

E.3.1 Pretraining Details

We describe the dataset and training procedure of the HG38 pretraining task in more detail.

E.3.2 Scaling: Model Size Details

Models. The models we consider are:


• Transformer++: a Transformer with improved architecture, notably the use of RoPE positional encodings (Su et al. 2021). Informally, we found these to be noticeably better than the vanilla positional encodings of Vaswani et al. (2017); a minimal RoPE sketch is included after this list.


• HyenaDNA: the Hyena model from Nguyen, Poli, et al. (2023) and Poli et al. (2023), which is roughly a Transformer with the MHA block replaced by an H3 block using a global convolution parameterized by an MLP.


• Mamba: the standard Mamba architecture.
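
For reference, the sketch below shows rotary position embeddings (RoPE) in the common split-half convention, applied identically to query and key tensors before the attention dot product. It is illustrative only; the function name, tensor layout, and base are assumptions, not the implementation used in these experiments.

import torch

def rope(x, base=10000.0):
    # x: (batch, seq_len, n_heads, d_head) query or key tensor; d_head must be even.
    b, t, h, d = x.shape
    half = d // 2
    # One frequency per feature pair, decaying geometrically with index.
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)        # (half,)
    angles = torch.arange(t, dtype=torch.float32)[:, None] * freqs[None, :]  # (t, half)
    cos = angles.cos()[None, :, None, :]
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) feature pair by its position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)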


Model Sizes. We use the following model sizes.

Note that the number of blocks for Mamba is doubled, because one Transformer “layer” includes both the MHA and MLP blocks (and similarly for Hyena), which requires two Mamba blocks to match parameters (Section 3.4).
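
As a sanity check on this convention, the back-of-the-envelope count below assumes the standard MLP expansion of 4 and the Mamba expansion factor E = 2 from Section 3.4; it ignores biases, norms, and the SSM's small inner parameters, so it is a sketch rather than an exact count.

def transformer_layer_params(d_model, mlp_expand=4):
    mha = 4 * d_model ** 2               # Q, K, V, and output projections
    mlp = 2 * mlp_expand * d_model ** 2  # up- and down-projections
    return mha + mlp                     # = 12 * d_model**2 for expand = 4

def mamba_block_params(d_model, expand=2):
    return 3 * expand * d_model ** 2     # ~3*E*D^2 per block (Section 3.4)

d = 256
# One Transformer layer (MHA + MLP) matches two Mamba blocks: 12 * D^2 each.
assert transformer_layer_params(d) == 2 * mamba_block_params(d)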

Note that, in contrast to standard LM scaling laws (Table 12), our learning rate is held constant across model sizes for simplicity. The optimal LR should decrease for larger models, but we did not find a noticeable effect at the small model sizes (at most a few million parameters) considered here.

E.3.3 Scaling: Context Length Details

Remark E.1. We also note that the schedule was not tuned, and we never experimented with turning off sequence length warmup (SLW) for these pretraining experiments. We later found that SLW did not help noticeably for audio pretraining at similar lengths (Section 4.4), and it is possible that it is not necessary for DNA pretraining either.
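
For concreteness, a minimal sketch of a sequence length warmup schedule is given below: the context length starts short and doubles at fixed intervals until the target length is reached. The starting length, target, and stage size are placeholders, not the values used in these experiments.

def warmup_seq_len(step, start_len=1024, target_len=1_048_576, steps_per_stage=1000):
    # Double the sequence length once per stage, capped at the target length.
    stage = step // steps_per_stage
    return min(start_len * (2 ** stage), target_len)

# In practice the batch size is typically scaled down as the sequence grows,
# keeping the number of tokens per batch roughly constant.
for step in (0, 1000, 5000, 20000):
    print(step, warmup_seq_len(step))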

E.3.4 Species (Great Apes) Classification

Training consists of 10 epochs, each of which has 1024 gradient steps. Each gradient step uses a batch of size 64, where each example is drawn independently by uniformly picking a species, uniformly picking a chromosome, and then uniformly picking a contiguous segment of DNA.
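
A minimal sketch of this sampling procedure, assuming the genomes are held as a nested mapping from species to chromosome to sequence string (the container layout, helper names, and segment length are illustrative):

import random

def sample_segment(genomes, seg_len):
    # genomes: {species: {chromosome: sequence_string}}; assumes each
    # chromosome sequence is at least seg_len long.
    species = random.choice(list(genomes))             # uniform over species
    chrom = random.choice(list(genomes[species]))      # uniform over chromosomes
    seq = genomes[species][chrom]
    start = random.randint(0, len(seq) - seg_len)      # uniform contiguous segment
    return species, seq[start:start + seg_len]

def sample_batch(genomes, seg_len, batch_size=64):
    return [sample_segment(genomes, seg_len) for _ in range(batch_size)]

Each of the 1024 gradient steps per epoch would draw one such batch of 64 examples.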

Results for the Species classification task are in Table 13.


This paper is available on arXiv under the CC BY 4.0 DEED license.

