
Mamba’s Performance in DNA, Audio, and Speed Benchmarks

by Rendering Technology Breakthroughs

March 14th, 2025

Too Long; Didn't Read

Mamba proves its strength in long-range dependencies, outperforming HyenaDNA in DNA sequence modeling and surpassing state-of-the-art speech generation models. Selective SSMs also demonstrate superior efficiency in AI training benchmarks.

STORY’S CREDIBILITY

Academic Research Paper

Part of HackerNoon's growing list of open-source research papers, promoting free access to academic material.

Authors:

(1) Albert Gu, Machine Learning Department, Carnegie Mellon University (equal contribution; agu@cs.cmu.edu);

(2) Tri Dao, Department of Computer Science, Princeton University (equal contribution; tri@tridao.me).

Abstract and 1. Introduction

2 State Space Models

3 Selective State Space Models and 3.1 Motivation: Selection as a Means of Compression

3.2 Improving SSMs with Selection

3.3 Efficient Implementation of Selective SSMs

3.4 A Simplified SSM Architecture

3.5 Properties of Selection Mechanisms

3.6 Additional Model Details

4 Empirical Evaluation and 4.1 Synthetic Tasks

4.2 Language Modeling

4.3 DNA Modeling

4.4 Audio Modeling and Generation

4.5 Speed and Memory Benchmarks

4.6 Model Ablations

5 Discussion

6 Conclusion, Acknowledgments and References

A Discussion: Selection Mechanism

B Related Work and B.1 S4 Variants and Derivatives

B.2 SSM Architectures

B.3 Relationship to RNNs

B.4 Linear Attention and B.5 Long Context Models

C Mechanics of Selective SSMs

D Hardware-aware Algorithm For Selective SSMs

E Experimental Details and Additional Results and E.1 Synthetic Tasks

E.2 Language Modeling

E.3 DNA Modeling

E.4 Audio Details

E.5 Efficiency Benchmark

4.3 DNA Modeling

Motivated by the success of large language models, there has been recent exploration into using the foundation model paradigm for genomics. DNA has been likened to language in that it consists of sequences of discrete tokens with a finite vocabulary. It is also known for requiring long-range dependencies to model (Avsec et al. 2021). We investigate Mamba as an FM backbone for pretraining and fine-tuning in the same setting as recent works on long-sequence models for DNA (Nguyen, Poli, et al. 2023). In particular, we focus on two explorations of scaling laws across model size and sequence length (Figure 5), and a difficult downstream synthetic classification task requiring long context (Figure 6).


For pretraining, we largely follow a standard causal language modeling (next-token prediction) setup for the training and model details (see also Appendix E.2). For the dataset, we largely follow the setup of HyenaDNA (Nguyen, Poli, et al. 2023), which uses the HG38 dataset for pretraining; it consists of a single human genome with about 4.5 billion tokens (DNA base pairs) in the training split.
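To make this setup concrete, the sketch below shows what character-level causal language modeling over DNA base pairs looks like in broad strokes. It is an illustrative stand-in, not the authors' code: the vocabulary, the tiny GRU backbone (used here in place of Mamba), and all hyperparameters are assumptions.

```python
# Minimal sketch of character-level causal LM pretraining on DNA (illustrative only;
# vocabulary, model, and hyperparameters are assumptions, not the paper's exact setup).
import torch
import torch.nn as nn

VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}  # base pairs plus an unknown token

def encode(seq: str) -> torch.Tensor:
    return torch.tensor([VOCAB.get(ch, VOCAB["N"]) for ch in seq], dtype=torch.long)

class TinyCausalLM(nn.Module):
    """Stand-in backbone; in the paper this slot is filled by Mamba or a baseline."""
    def __init__(self, vocab_size=5, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)  # placeholder sequence mixer
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        h, _ = self.rnn(self.embed(x))
        return self.head(h)

model = TinyCausalLM()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

tokens = encode("ACGT" * 256).unsqueeze(0)        # (1, 1024) toy "genome" chunk
inputs, targets = tokens[:, :-1], tokens[:, 1:]   # shift by one for next-token prediction
logits = model(inputs)
loss = loss_fn(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
loss.backward()
opt.step()
```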

4.3.1 Scaling: Model Size

In this experiment, we investigate the scaling properties of genomics foundation models with various model backbones (Figure 5 Left).


Training. To advantage the baselines, we train on a short sequence length of 1024; as shown in Section 4.3.2, we expect results to favor Mamba even more at longer sequence lengths. We fix a global batch size of 1024 (roughly 2^20 ≈ 1M tokens per batch at this sequence length).

Table 3: (Zero-shot Evaluations.) Best results for each size in bold. We compare against open source LMs with various tokenizers, trained for up to 300B tokens. Pile refers to the validation split, comparing only against models trained on the same dataset and tokenizer (GPT-NeoX-20B). For each model size, Mamba is best-in-class on every single evaluation result, and generally matches baselines at twice the model size.


4.3.2 Scaling: Context Length


Results. Figure 5 (Right) shows that Mamba is able to make use of longer context even up to extremely long sequences of length 1M, and its pretraining perplexity improves as the context increases. On the other hand, the HyenaDNA model gets worse with sequence length. This is intuitive from the discussion in Section 3.5 on the properties of the selection mechanism. In particular, LTI models cannot selectively ignore information; from a convolutional perspective, a very long convolution kernel is aggregating all information across a long sequence, which may be very noisy. Note that while HyenaDNA claims to improve with longer context, their results do not control for computation time.
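A toy contrast makes the intuition concrete. The sketch below (illustrative only, not the actual S6 layer) compares an LTI global convolution, whose fixed kernel weights every past position regardless of content, with an input-dependent gate that can shrink the contribution of positions it deems irrelevant; the gated recurrence mirrors the form referenced in Theorem 1.

```python
# Toy contrast between an LTI global convolution and input-dependent selection
# (illustrative sketch only, not the actual S6 layer).
import torch

L, D = 8, 1
x = torch.randn(L, D)

# LTI: one fixed kernel applied regardless of content, so every past token contributes.
kernel = torch.softmax(torch.randn(L), dim=0)  # fixed, input-independent positive weights
lti_out = torch.stack([(kernel[:t + 1].flip(0) * x[:t + 1, 0]).sum() for t in range(L)])

# Selective: per-token gates computed *from the input* decide what enters the state.
gate = torch.sigmoid(x @ torch.randn(D, 1)).squeeze(-1)  # g_t in (0, 1), depends on x_t
h, selective_out = torch.zeros(()), []
for t in range(L):
    h = (1 - gate[t]) * h + gate[t] * x[t, 0]  # gated recurrence (cf. Theorem 1)
    selective_out.append(h)
selective_out = torch.stack(selective_out)

print(lti_out, selective_out)
```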

4.3.3 Synthetic Species Classification

We evaluate models on a downstream task of classifying between 5 different species by randomly sampling a contiguous segment of their DNA. This task is adapted from HyenaDNA, which used the species {human, lemur, mouse, pig, hippo}. We modify the task to be significantly more challenging by classifying between the five great ape species {human, chimpanzee, gorilla, orangutan, bonobo}, which are known to share 99% of their DNA.
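A hypothetical sketch of how such a dataset could be assembled is shown below; the segment length, sample counts, and toy genome strings are placeholders rather than the paper's protocol. The idea is simply to sample a random contiguous window from each species' genome and label it with the species index.

```python
# Hypothetical sketch of building the 5-way great-ape classification data;
# genome strings, window length, and counts are placeholders, not the paper's settings.
import random

SPECIES = ["human", "chimpanzee", "gorilla", "orangutan", "bonobo"]

def random_segment(genome: str, length: int) -> str:
    """Sample one contiguous window of `length` base pairs from a genome string."""
    start = random.randint(0, len(genome) - length)
    return genome[start:start + length]

def make_examples(genomes: dict, length: int = 1024, per_species: int = 4):
    examples = []
    for label, name in enumerate(SPECIES):
        for _ in range(per_species):
            examples.append((random_segment(genomes[name], length), label))
    random.shuffle(examples)
    return examples

# Toy genomes just to make the sketch runnable.
toy_genomes = {name: "".join(random.choice("ACGT") for _ in range(10_000)) for name in SPECIES}
data = make_examples(toy_genomes)
print(data[0][1], data[0][0][:32])
```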


4.4 Audio Modeling and Generation

For the audio waveform modality, we compare primarily to the SaShiMi architecture and training protocols (Goel et al. 2022). This model comprises a U-Net-style backbone whose stages alternate S4 and MLP blocks.




We consider replacing the S4+MLP blocks with Mamba blocks. Experiment details are in Appendix E.4.
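As a rough picture of what that substitution looks like structurally, the sketch below swaps the per-stage residual blocks while leaving the rest of the backbone untouched. The block classes are trivial stand-ins (not real S4, MLP, or Mamba implementations), and the stage-building function is hypothetical.

```python
# Hypothetical sketch of swapping S4+MLP residual blocks for Mamba blocks inside a
# SaShiMi-style backbone; the block class below is a trivial stand-in.
import torch.nn as nn

class StandInBlock(nn.Module):
    """Placeholder residual block; a real S4, MLP, or Mamba block would go here."""
    def __init__(self, d_model: int, name: str):
        super().__init__()
        self.name = name
        self.mix = nn.Linear(d_model, d_model)

    def forward(self, x):
        return x + self.mix(x)

def build_stage(d_model: int, n_blocks: int, use_mamba: bool) -> nn.Sequential:
    blocks = []
    for _ in range(n_blocks):
        if use_mamba:
            blocks.append(StandInBlock(d_model, "mamba"))  # one Mamba block replaces...
        else:
            blocks.append(StandInBlock(d_model, "s4"))     # ...an S4 block
            blocks.append(StandInBlock(d_model, "mlp"))    # ...plus its following MLP block
    return nn.Sequential(*blocks)

stage = build_stage(d_model=64, n_blocks=4, use_mamba=True)
print(stage)
```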

4.4.1 Long-Context Autoregressive Pretraining




4.4.2 Autoregressive Speech Generation

SC09 is a benchmark speech generation dataset (Donahue, McAuley, and Puckette 2019; Warden 2018), consisting of 1-second clips sampled at 16000 Hz of the digits “zero” through “nine” with highly variable characteristics. We largely follow the autoregressive training setup and generation protocol of Goel et al. (2022).
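For scale: a 1-second clip at 16000 Hz is a sequence of 16000 samples, so each SC09 example becomes a 16k-step autoregressive modeling problem. The quantization in the sketch below (8-bit mu-law, giving 256 discrete classes) is a common choice for autoregressive audio models and is an assumption here, not a detail stated in this excerpt.

```python
# Sketch: turning a 1-second, 16 kHz waveform into a discrete sequence for
# autoregressive modeling (8-bit mu-law quantization is an assumed choice).
import torch

def mu_law_encode(wave: torch.Tensor, mu: int = 255) -> torch.Tensor:
    """Map a waveform in [-1, 1] to integer tokens in [0, mu]."""
    compressed = torch.sign(wave) * torch.log1p(mu * wave.abs()) / torch.log1p(torch.tensor(float(mu)))
    return ((compressed + 1) / 2 * mu + 0.5).long()

wave = torch.rand(16000) * 2 - 1           # stand-in for one SC09 clip (1 s at 16 kHz)
tokens = mu_law_encode(wave)               # sequence of 16000 tokens in {0, ..., 255}
inputs, targets = tokens[:-1], tokens[1:]  # next-sample prediction targets
print(tokens.shape, tokens.min().item(), tokens.max().item())
```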


Table 4 shows automated metrics of the Mamba-UNet model compared to a variety of baselines from Goel et al. (2022): WaveNet (Oord et al. 2016), SampleRNN (Mehri et al. 2017), WaveGAN (Donahue, McAuley, and Puckette 2019), DiffWave (Z. Kong et al. 2021), and SaShiMi. A small Mamba model outperforms the state-of-the-art (and much larger) GAN- and diffusion-based models. A larger model, parameter-matched to the baselines, further improves on fidelity metrics dramatically.


[Table 4: automated metrics comparing Mamba-UNet to baselines on SC09 speech generation.]


4.5 Speed and Memory Benchmarks





Figure 8: (Efficiency Benchmarks.) (Left) Training: our efficient scan is 40× faster than a standard implementation. (Right) Inference: as a recurrent model, Mamba can achieve 5× higher throughput than Transformers.

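The inference gap follows from the recurrent formulation: generation carries a fixed-size state per layer rather than a key-value cache that grows with the generated length. The sketch below is a schematic illustration of that asymmetry, not a benchmark of the real kernels; all shapes and the toy state update are assumptions.

```python
# Schematic comparison of per-token generation cost (illustrative shapes only).
import torch

d_model, d_state, n_heads, head_dim = 1024, 16, 16, 64

# Recurrent (Mamba-style): a constant-size state per layer, O(1) memory per new token.
state = torch.zeros(d_model, d_state)
def recurrent_step(state, x_t):
    # Stand-in update: the real selective SSM uses input-dependent (A, B, C) here.
    return 0.9 * state + x_t.unsqueeze(-1)

# Attention (Transformer): the key/value cache grows linearly with generated length.
k_cache, v_cache = [], []
def attention_step(x_t):
    k_cache.append(torch.randn(n_heads, head_dim))
    v_cache.append(torch.randn(n_heads, head_dim))
    return len(k_cache)  # memory and per-step compute scale with this length

for t in range(4):
    x_t = torch.randn(d_model)
    state = recurrent_step(state, x_t)
    cache_len = attention_step(x_t)

print(state.shape, cache_len)  # the state stays (1024, 16); the cache keeps growing
```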


4.6 Model Ablations




4.6.1 Architecture

Table 6 investigates the effects of the architecture (block) and its inner SSM layer (Figure 3). We find that


• Among previous non-selective (LTI) SSMs, which are equivalent to global convolutions, performance is very similar.


• Replacing the complex-valued S4 variant from previous work with a real-valued one does not affect performance much, suggesting that (at least for LM) real-valued SSMs may be a better choice when accounting for hardware efficiency.


• Replacing any of these with a selective SSM (S6) significantly improves performance, validating the motivation of Section 3.


• The Mamba architecture performs similarly to the H3 architecture (and seems slightly better when using a selective layer).


We also investigate interleaving the Mamba block with other blocks, such as MLP (a traditional architecture) and MHA (a hybrid attention architecture), in Appendix E.2.2.

4.6.2 Selective SSM

Table 7 ablates the selective SSM layer by considering different combinations of selective ∆, B, and C parameters (Algorithm 2), showing that ∆ is the most important parameter due to its connection to RNN gating (Theorem 1).
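That connection can be checked numerically in the simplest scalar case. The sketch below (a simplification, not the full theorem statement) assumes A = -1, B = 1, and ∆ = softplus(z); zero-order-hold discretization then reproduces a standard sigmoid-gated RNN update exactly.

```python
# Numeric sketch of the ∆ <-> gating connection (scalar case: A = -1, B = 1).
import torch
import torch.nn.functional as F

z = torch.randn(5)             # pre-activation, e.g. a linear function of x_t
delta = F.softplus(z)          # ∆ = softplus(z) > 0

A_bar = torch.exp(-delta)      # ZOH: exp(∆·A) with A = -1
B_bar = 1 - torch.exp(-delta)  # ZOH: (exp(∆·A) - 1)/A · B with A = -1, B = 1
gate = torch.sigmoid(z)        # standard RNN gate

# h_t = A_bar * h_{t-1} + B_bar * x_t  ==  (1 - g) * h_{t-1} + g * x_t, with g = sigmoid(z)
print(torch.allclose(B_bar, gate), torch.allclose(A_bar, 1 - gate))
```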


Table 8 considers different initializations of the SSM, which have been shown to make a large difference in some data modalities and settings (Gu, Goel, and Ré 2022; Gu, Gupta, et al. 2022). On language modeling, we find that simpler real-valued diagonal initializations (S4D-Real, row 3) perform better than the more standard complex-valued parameterizations (S4D-Lin, row 1). Random initializations also work well, consistent with findings from prior work (Mehta et al. 2023).


Table 9 and Table 10 consider varying the dimension of the ∆ and (B, C) projections respectively. Changing them from static to selective provides the most benefit, while increasing the dimensions further generally improves performance modestly with a small increase in parameter count.


Of particular note is the dramatic improvement of the selective SSM when the state size N is increased, with over a 1.0 perplexity improvement for a cost of only 1% additional parameters. This validates our core motivation in Sections 3.1 and 3.3.
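A back-of-the-envelope count illustrates why the cost is so small: the state size N enters only through the comparatively tiny A matrix and the projections producing B and C, while the dominant parameters live in the d_model-sized input and output projections. The layer shapes below are assumptions for illustration, not the exact implementation.

```python
# Back-of-the-envelope parameter count for growing the SSM state size N.
# Layer shapes here are assumptions for illustration, not the exact implementation.
d_model = 1024
d_inner = 2 * d_model  # assumed expansion factor of 2

def block_params(N: int) -> int:
    in_out_proj = d_model * (2 * d_inner) + d_inner * d_model  # dominant projections
    a_matrix = d_inner * N                                     # A is (d_inner, N)
    bc_proj = d_inner * (2 * N)                                # projections emitting B and C
    return in_out_proj + a_matrix + bc_proj

small, large = block_params(N=1), block_params(N=16)
print(f"N=1: {small:,} params; N=16: {large:,} params "
      f"(+{100 * (large - small) / small:.1f}%)")
```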


Table 6: (Ablations: Architecture and SSM layer.) The Mamba block performs similarly to H3 while being simpler. In the inner layer, there is little difference among different parameterizations of LTI models, while selective SSMs (S6) provide a large improvement. More specifically, the S4 (real) variant is S4D-Real and the S4 (complex) variant is S4D-Lin.




This paper is available on arXiv under a CC BY 4.0 DEED license.



About Author

Rendering Technology Breakthroughs (@rendering)
Research and publications on cutting-edge rendering technologies, shaping 2d & 3d visual experiences across industries.
