How AI is Learning to Read DNA and Sound

by Rendering Technology Breakthroughs
March 16th, 2025
Too Long; Didn't Read

AI models like Transformer++, HyenaDNA, and Mamba are pushing boundaries in DNA and audio modeling. Mamba shows competitive efficiency, but real-valued SSMs underperform complex-valued ones in audio tasks.


Part of HackerNoon's growing list of open-source research papers, promoting free access to academic material.

Authors:

(1) Albert Gu, Machine Learning Department, Carnegie Mellon University with Equal contribution (agu@cs.cmu.edu);

(2) Tri Dao, Department of Computer Science, Princeton University with Equal contribution (tri@tridao.me).

Abstract and 1. Introduction

2 State Space Models

3 Selective State Space Models and 3.1 Motivation: Selection as a Means of Compression

3.2 Improving SSMs with Selection

3.3 Efficient Implementation of Selective SSMs

3.4 A Simplified SSM Architecture

3.5 Properties of Selection Mechanisms

3.6 Additional Model Details

4 Empirical Evaluation and 4.1 Synthetic Tasks

4.2 Language Modeling

4.3 DNA Modeling

4.4 Audio Modeling and Generation

4.5 Speed and Memory Benchmarks

4.6 Model Ablations

5 Discussion

6 Conclusion, Acknowledgments and References

A Discussion: Selection Mechanism

B Related Work and B.1 S4 Variants and Derivatives

B.2 SSM Architectures

B.3 Relationship to RNNs

B.4 Linear Attention and B.5 Long Context Models

C Mechanics of Selective SSMs

D Hardware-aware Algorithm For Selective SSMs

E Experimental Details and Additional Results and E.1 Synthetic Tasks

E.2 Language Modeling

E.3 DNA Modeling

E.4 Audio Details

E.5 Efficiency Benchmark


E.3 DNA Modeling

E.3.1 Pretraining Details

We describe the dataset and training procedure of the HG38 pretraining task in more detail.




E.3.2 Scaling: Model Size Details

Models. The models we consider are:


• Transformer++: a Transformer with an improved architecture, notably the use of RoPE positional encodings (Su et al. 2021). Informally, we found these to be noticeably better than the vanilla positional encodings of Vaswani et al. (2017).


• HyenaDNA: the Hyena model from Nguyen, Poli, et al. (2023) and Poli et al. (2023), which is roughly a Transformer with the MHA block replaced by an H3 block using a global convolution parameterized by an MLP.


• Mamba: the standard Mamba architecture.


Model Sizes. We use the following model sizes.


Note that the number of blocks for Mamba is doubled, because one Transformer “layer” includes both the MHA and MLP blocks (and similarly for Hyena), which requires two Mamba blocks to match parameters (Section 3.4).

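As a rough illustration of this parameter accounting (our own back-of-the-envelope sketch, counting only the dominant projection matrices and assuming the standard MLP expansion of 4 and Mamba expansion E = 2), two Mamba blocks match one Transformer (or Hyena) layer at the same width:

```python
def approx_params(d_model, n_blocks, arch):
    """Back-of-the-envelope count of the dominant projection parameters
    (illustration only, not the paper's exact accounting). A Transformer layer
    is roughly 4*d^2 (MHA) + 8*d^2 (MLP with expansion 4) = 12*d^2; a Mamba
    block with expansion E = 2 is roughly 3*E*d^2 = 6*d^2."""
    per_block = {"transformer_layer": 12 * d_model**2,
                 "mamba_block": 6 * d_model**2}[arch]
    return n_blocks * per_block

# e.g. a 4-layer Transformer vs. an 8-block Mamba at width 256:
assert approx_params(256, 4, "transformer_layer") == approx_params(256, 8, "mamba_block")
```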



Note that, in contrast to standard LM scaling laws (Table 12), our LR was held constant across model sizes for simplicity. The optimal LR should go down for larger models, but we did not find a noticeable effect at the small model sizes (at most a few million parameters) we considered.



E.3.3 Scaling: Context Length Details





Remark E.1. We also note that the schedule was not tuned, and we never experimented with turning off sequence length warmup (SLW) for these pretraining experiments. We later found that SLW did not help noticeably for audio pretraining at similar lengths (Section 4.4), and it is possible that it is not necessary for DNA pretraining either.



E.3.4 Species (Great Apes) Classification




Training consists of 10 epochs, each of which has 1024 gradient steps. Each gradient step uses a batch of size 64, whose examples are all drawn independently at random by uniformly picking a species, uniformly picking a chromosome, and then uniformly picking a contiguous segment of DNA (as sketched below).
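A minimal sketch of this sampling procedure (the function name and data layout are ours, assuming each genome is stored as one DNA string per chromosome):

```python
import random

def sample_batch(genomes, species_list, seq_len, batch_size=64):
    """Sketch of the batch construction described above (names are ours).
    `genomes[species][chrom]` is assumed to hold one DNA string per chromosome."""
    batch = []
    for _ in range(batch_size):
        species = random.choice(species_list)             # uniform over species
        chrom = random.choice(list(genomes[species]))     # uniform over that species' chromosomes
        seq = genomes[species][chrom]
        start = random.randrange(len(seq) - seq_len + 1)  # uniform contiguous segment
        batch.append((seq[start:start + seq_len], species))
    return batch
```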



Results for the Species classification task are in Table 13.

E.4 Audio Details

E.4.1 YouTubeMix Audio Pretraining

Model. We use a model with 3 blocks per stage (3 × 5 = 15 total Mamba blocks), pooling factor p = 16, and outer dimension D = 64, for about 3.5M parameters.


Dataset. The data is mu-law encoded at 8 bits, so the model is modeling discrete tokens with a vocab size of 256.
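For reference, a minimal sketch of 8-bit mu-law companding and quantization (illustrative only; the paper does not show its exact preprocessing code):

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """8-bit mu-law companding and quantization (illustrative sketch).
    `x` is a waveform in [-1, 1]; returns integer tokens in [0, 255]."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)            # compand to [-1, 1]
    return np.clip(((y + 1) / 2 * mu + 0.5).astype(np.int64), 0, mu)    # 256 discrete levels
```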


The dataset consists of clips up to 1 minute long (length 960000), which are subsampled and divided into segments of any desired sequence length. Since the architecture involves two stages of pooling by a factor of 16, and we want the resulting sequence length to be a multiple of 8 for hardware efficiency, the longest possible sequence is 468 × 2048 = 958464. The rest of our sequence lengths are defined by successively halving this and rounding up to the nearest multiple of 2048.
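A short helper reproducing this length schedule (the helper name and the number of lengths generated are ours; Table 14 lists the actual specifications used in Figure 7):

```python
def seq_length_schedule(n_lengths=8, chunk=2048, longest=468 * 2048):
    """Successively halve the longest length (958464) and round up to the
    nearest multiple of 2048, as described above."""
    lengths = [longest]
    for _ in range(n_lengths - 1):
        halved = lengths[-1] // 2
        lengths.append(-(-halved // chunk) * chunk)  # ceil to a multiple of 2048
    return lengths

# seq_length_schedule() -> [958464, 479232, 239616, 120832, 61440, 30720, 16384, 8192]
```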


Table 14 lists the specifications used in Figure 7. Beyond the varying batch sizes, the number of valid segments in the training set varied between different sequence lengths (e.g. the number of training steps per epoch was not constant for different points in the graph), which may have contributed to kinks in the scaling curves.


Training. Models were trained for 200K training steps with a maximum learning rate of 0.002, 20K (10%) warmup steps, and weight decay 0.1 (similar to our general pretraining recipe across domains).
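As a hedged sketch only, a warmup-plus-cosine-decay schedule consistent with those numbers might look like the following; the cosine shape and the learning-rate floor are assumptions drawn from the general recipe, not stated in this section:

```python
import math

def lr_at_step(step, max_lr=2e-3, total_steps=200_000, warmup_steps=20_000, min_lr=1e-5):
    """Warmup + cosine-decay learning-rate schedule (sketch; decay shape and
    min_lr floor are assumptions)."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps                 # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```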


Figure 10: (Audio Pretraining (YouTubeMix) Ablations.) As a uniformly-sampled "continuous" signal modality, audio waveforms actually benefit from LTI models, which have a matching inductive bias. (Left) Homogeneous models (all blocks have the same parameterization). (Right) Only the center U-Net blocks are ablated; the outer blocks are Mamba-S4. The purple line is the same as in the figure on the left.

Figure 10 shows that the change from S4 → S6 (i.e. the selection mechanism) is not always beneficial. On long-form audio waveforms, it in fact significantly hampers performance, which may be intuitive from the point of view that audio is uniformly sampled and very smooth, and therefore benefits from continuous linear time-invariant (LTI) methods. Note that after ablating away the selection mechanism, the resulting model is the S4 layer inside the Mamba block. To disambiguate, we call this Mamba-S4, as opposed to the default Mamba architecture, Mamba-S6.


However, on the right side, we keep the outer layers of the U-Net as Mamba-S4 and ablate only the inner layers. The performance differences shrink dramatically; this reinforces the hypothesis that layers closer to the raw audio signal should be LTI, but once they are "tokenized" and compressed by the outer layers, the inner layers no longer need to be LTI. In this setting, however, the real-valued SSM still underperforms the complex-valued one.

E.4.2 SC09 Speech Generation

Autoregressive training largely followed the autoregressive language modeling protocol. We used a learning rate of 0.002 and 200,000 training steps at a batch size of 16.


The large Mamba model in Table 4 has 15 layers per stage with an outer dimension of D = 96 and pooling factor 4. We note that this dataset is small (training went through 100 epochs) and for this large model there was significant overfitting of the BPB or NLL. However, automated metrics of generated samples continued improving throughout training.



E.5 Efficiency Benchmark

Scan Operation. We compare the core operation of selective SSMs, which is the parallel scan (Section 3.3), against convolution and attention, measured on an A100 80GB PCIe GPU. Note that these do not include the cost of other operations outside of this core operation, such as computing the convolutional kernel in global-convolution models, or computing the QKV projections in attention.




Our scan implementation fuses the discretization step and the parallel scan, avoiding the cost of materializing all the large parameters in HBM.
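For intuition, a naive (unfused, sequential) reference of the recurrence that the fused kernel computes might look like the sketch below; the shapes and the simplified discretization follow the selective SSM description in Section 3.3, the function name is ours, and a real implementation uses a parallel scan and never materializes the (batch, L, D, N) tensors in HBM:

```python
import torch

def selective_scan_ref(x, delta, A, B, C):
    """Naive sequential reference for the selective scan (illustration only).
    Assumed shapes: x, delta: (batch, L, D); A: (D, N); B, C: (batch, L, N)."""
    batch, L, D = x.shape
    N = A.shape[-1]
    deltaA = torch.exp(delta.unsqueeze(-1) * A)                       # discretized A: (batch, L, D, N)
    deltaBx = delta.unsqueeze(-1) * B.unsqueeze(2) * x.unsqueeze(-1)  # discretized B times input
    h = torch.zeros(batch, D, N, dtype=x.dtype, device=x.device)      # hidden state
    ys = []
    for t in range(L):                                                # sequential recurrence over time
        h = deltaA[:, t] * h + deltaBx[:, t]
        ys.append((h * C[:, t].unsqueeze(1)).sum(dim=-1))             # contract over state dim N -> (batch, D)
    return torch.stack(ys, dim=1)                                     # (batch, L, D)
```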


For convolution, we use the standard implementation in PyTorch, which separately performs FFTs on the inputs and the filters, multiplies them in the frequency domain, then performs an inverse FFT to obtain the result. The theoretical complexity is O(L log(L)) for sequence length L.
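A minimal sketch of this FFT-based long convolution (ours, assuming one filter per channel and zero-padding to length 2L so the circular convolution acts as a causal linear one):

```python
import torch

def fft_long_conv(u, k):
    """FFT-based long convolution sketch. `u`: (batch, D, L) inputs,
    `k`: (D, L) one filter per channel. Cost is O(L log L)."""
    L = u.shape[-1]
    n = 2 * L
    u_f = torch.fft.rfft(u, n=n)                   # FFT of the inputs
    k_f = torch.fft.rfft(k, n=n)                   # FFT of the filters
    return torch.fft.irfft(u_f * k_f, n=n)[..., :L]  # multiply in frequency domain, invert, truncate
```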


For attention, we compare against the fastest implementation that we are aware of (FlashAttention-2 (Dao 2023)), with causal mask. Note that FlashAttention-2 with causal mask is about 1.7× faster than without causal mask, since approximately only half of the attention entries are computed.




End-to-end Inference. We measure the inference throughput of a Mamba 1.4B model and an untrained Mamba 6.9B model, against a standard Transformer (GPT-3 architecture) at 1.3B and 6.7B parameters. We use the standard Transformer implementation in the Hugging Face transformers library.


We set the prompt length to 2048 and the generation length to 128. We vary the batch size over 1, 2, 4, 8, 16, 32, 64, and 128, and measure the time taken to generate 128 tokens. We then calculate the throughput (tokens/s) as batch size × 128 / time taken. We repeat the measurements 3 times and take the average. Measurements are done on an A100 80GB PCIe GPU.
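A sketch of this measurement (our helper, not the paper's benchmarking script; it assumes a Hugging Face-style `model.generate(..., max_new_tokens=...)` API):

```python
import time
import torch

@torch.no_grad()
def generation_throughput(model, prompt_ids, gen_len=128, n_repeats=3):
    """Time the generation of `gen_len` tokens and report
    tokens/s = batch size * gen_len / average time."""
    batch_size = prompt_ids.shape[0]
    times = []
    for _ in range(n_repeats):
        torch.cuda.synchronize()
        start = time.time()
        model.generate(prompt_ids, max_new_tokens=gen_len)
        torch.cuda.synchronize()
        times.append(time.time() - start)
    avg_time = sum(times) / len(times)
    return batch_size * gen_len / avg_time        # tokens per second
```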


Memory Benchmark. The memory usage simply scales proportionally to the size of the activation tensors, as with most deep sequence models. We report measurements of the training memory requirements of 125M models on one A100 80GB GPU. Each batch consists of sequences of length 2048. We compare to the most memory-efficient Transformer implementation we are aware of (with kernel fusion from torch.compile and with FlashAttention-2). Table 15 shows that Mamba's memory requirement is comparable to that of a similar-sized Transformer with an extremely optimized implementation, and we expect further improvement in Mamba's memory footprint in the future.

Table 15: (Memory benchmark.) Mamba's memory footprint is comparable to the most optimized Transformer. Results for 125M models.
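A rough sketch of how such a peak-memory number can be read off (our helper, not the paper's script), using PyTorch's CUDA allocator statistics:

```python
import torch

def peak_training_memory_gib(model, batch, targets, loss_fn):
    """Run one forward/backward pass and read the CUDA allocator's
    high-water mark, in GiB."""
    torch.cuda.reset_peak_memory_stats()
    loss = loss_fn(model(batch), targets)
    loss.backward()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 1024**3
```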


This paper is available on arxiv under CC BY 4.0 DEED license.

