
Study Demonstrates Mamba’s Breakthrough Performance in Autoregressive Speech Generation

by The Serialization Publication, December 17th, 2024

Too Long; Didn't Read

Mamba outperforms SaShiMi and other strong baselines in long-context autoregressive audio pretraining and in speech generation on the SC09 dataset. A small Mamba model beats state-of-the-art GAN and diffusion models on fidelity metrics, and a larger, parameter-matched model improves the results further.

Authors:

(1) Albert Gu, Machine Learning Department, Carnegie Mellon University (equal contribution);

(2) Tri Dao, Department of Computer Science, Princeton University (equal contribution).

Abstract and 1 Introduction

2 State Space Models

3 Selective State Space Models and 3.1 Motivation: Selection as a Means of Compression

3.2 Improving SSMs with Selection

3.3 Efficient Implementation of Selective SSMs

3.4 A Simplified SSM Architecture

3.5 Properties of Selection Mechanisms

3.6 Additional Model Details

4 Empirical Evaluation and 4.1 Synthetic Tasks

4.2 Language Modeling

4.3 DNA Modeling

4.4 Audio Modeling and Generation

4.5 Speed and Memory Benchmarks

4.6 Model Ablations

5 Discussion

6 Conclusion and References


A Discussion: Selection Mechanism

B Related Work

C Mechanics of Selective SSMs

D Hardware-aware Algorithm For Selective SSMs

E Experimental Details and Additional Results

4.4 Audio Modeling and Generation

For the audio waveform modality, we compare primarily to the SaShiMi architecture and training protocols (Goel et al. 2022). This model comprises


  1. a U-Net backbone with two stages of pooling by a factor p that doubles the model dimension D per stage,


  2. alternating S4 and MLP blocks in each stage.


We consider replacing the S4+MLP blocks with Mamba blocks. Experiment details are in Appendix E.4.
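The sketch below (PyTorch) is a minimal, illustrative rendering of this layout, not the authors' code: two pooling stages that each shrink the sequence by a factor p and double the model dimension D, with a stack of sequence-mixing blocks at every resolution on both the down and up paths. The `DownPool`/`UpPool` classes, the `make_block`/`make_center_block` hooks, and all hyperparameter names are assumptions introduced for illustration; in the paper's experiments the block slot holds either S4+MLP pairs or Mamba blocks, and the real pooling is causal (omitted here for brevity).

```python
import torch
import torch.nn as nn


class DownPool(nn.Module):
    """Pool the sequence by a factor p and double the channel dimension."""
    def __init__(self, d_model: int, p: int):
        super().__init__()
        self.p = p
        self.proj = nn.Linear(d_model * p, d_model * 2)

    def forward(self, x):                          # x: (B, L, D)
        B, L, D = x.shape
        x = x.reshape(B, L // self.p, D * self.p)  # group p consecutive steps
        return self.proj(x)                        # (B, L/p, 2D)


class UpPool(nn.Module):
    """Inverse of DownPool: halve the channels, expand the sequence by p."""
    def __init__(self, d_model: int, p: int):
        super().__init__()
        self.p = p
        self.proj = nn.Linear(d_model, (d_model // 2) * p)

    def forward(self, x):                          # x: (B, L, D)
        B, L, D = x.shape
        x = self.proj(x)                           # (B, L, D/2 * p)
        return x.reshape(B, L * self.p, D // 2)    # (B, L*p, D/2)


class UNetBackbone(nn.Module):
    """Two pooling stages; blocks run at every resolution on both paths.

    With n_blocks=8 this matches the 40-block layout described in Table 5.
    Causal shifting inside the pooling layers (needed for autoregressive
    generation) is omitted for brevity.
    """
    def __init__(self, d_model=64, p=4, n_blocks=8,
                 make_block=None, make_center_block=None):
        super().__init__()
        make_center_block = make_center_block or make_block
        d1, d2, d3 = d_model, 2 * d_model, 4 * d_model
        stack = lambda d, f: nn.ModuleList([f(d) for _ in range(n_blocks)])
        self.down_outer, self.up_outer = stack(d1, make_block), stack(d1, make_block)
        self.down_mid, self.up_mid = stack(d2, make_block), stack(d2, make_block)
        self.center = stack(d3, make_center_block)
        self.pool1, self.pool2 = DownPool(d1, p), DownPool(d2, p)
        self.unpool2, self.unpool1 = UpPool(d3, p), UpPool(d2, p)

    def forward(self, x):                          # x: (B, 16000, d_model), e.g. embedded SC09 samples
        for blk in self.down_outer:
            x = x + blk(x)                         # residual around each block
        skip1 = x
        x = self.pool1(x)                          # (B, 4000, 2*d_model)
        for blk in self.down_mid:
            x = x + blk(x)
        skip2 = x
        x = self.pool2(x)                          # (B, 1000, 4*d_model)
        for blk in self.center:                    # the "center blocks" of Table 5
            x = x + blk(x)
        x = self.unpool2(x) + skip2                # back to (B, 4000, 2*d_model)
        for blk in self.up_mid:
            x = x + blk(x)
        x = self.unpool1(x) + skip1                # back to (B, 16000, d_model)
        for blk in self.up_outer:
            x = x + blk(x)
        return x


# Stand-in block for the sketch; the paper's experiments put S4+MLP pairs or
# Mamba blocks (e.g. the Mamba module from the mamba_ssm package) in this slot.
def dummy_block(d):
    return nn.Sequential(nn.LayerNorm(d), nn.Linear(d, 2 * d), nn.GELU(), nn.Linear(2 * d, d))


model = UNetBackbone(d_model=64, p=4, n_blocks=8, make_block=dummy_block)
x = torch.randn(2, 16000, 64)                      # a batch of SC09-length feature sequences
print(model(x).shape)                              # torch.Size([2, 16000, 64])
```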


4.4.1 Long-Context Autoregressive Pretraining



This experiment evaluates autoregressive pretraining quality as a function of training context length. Both Mamba and the SaShiMi (S4+MLP) baseline improve consistently with longer context; Mamba is better throughout, and the gap widens at longer lengths. The main metric is bits per byte (BPB), which differs from the standard negative log-likelihood (NLL) loss used for pretraining other modalities only by a constant factor of log(2).
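For concreteness, the conversion is a single division: assuming the reported loss is an average negative log-likelihood in nats per 8-bit audio sample, dividing by ln 2 expresses it in bits, which is exactly the constant-factor relationship noted above. The helper below is illustrative, not from the paper.

```python
import math


def bits_per_byte(nll_nats: float) -> float:
    """Convert an average NLL in nats per (8-bit) sample to bits per byte."""
    return nll_nats / math.log(2)


print(bits_per_byte(0.90))   # ~1.30 BPB for an average NLL of 0.90 nats
```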


We note one important detail: this is the only experiment in this paper in which we switched from the real parameterization to complex (Section 3.6). We show additional ablations in Appendix E.4.


4.4.2 Autoregressive Speech Generation


SC09 is a benchmark speech generation dataset (Donahue, McAuley, and Puckette 2019; Warden 2018), consisting of 1-second clips sampled at 16000 Hz of the digits “zero” through “nine” with highly variable characteristics. We largely follow the autoregressive training setup and generation protocol of Goel et al. (2022).
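A rough sketch of this training objective is shown below: quantize each 1-second, 16 kHz clip to 8-bit codes and train the model to predict sample t+1 from samples up to t with a categorical cross-entropy loss. The 8-bit mu-law quantizer here is an assumption borrowed from common WaveNet-style practice; the exact preprocessing follows Goel et al. (2022) and is specified in Appendix E.4, and `model` stands for any sequence model mapping (B, L) integer codes to (B, L, 256) logits.

```python
import math
import torch
import torch.nn.functional as F


def mu_law_encode(wave: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Map waveforms in [-1, 1] to integer codes in [0, 2**bits - 1]."""
    mu = 2 ** bits - 1
    comp = torch.sign(wave) * torch.log1p(mu * wave.abs()) / math.log1p(mu)
    return ((comp + 1) / 2 * mu + 0.5).long().clamp(0, mu)


def next_sample_loss(model, wave: torch.Tensor) -> torch.Tensor:
    """Next-sample cross-entropy on a batch of 1-second clips, wave: (B, 16000)."""
    codes = mu_law_encode(wave)                    # (B, L) integer targets
    logits = model(codes[:, :-1])                  # (B, L-1, 256): predict from the prefix
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           codes[:, 1:].reshape(-1))
```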


Table 4 shows automated metrics of the Mamba-UNet model compared to a variety of baselines from Goel et al. (2022): WaveNet (Oord et al. 2016), SampleRNN (Mehri et al. 2017), WaveGAN (Donahue, McAuley, and Puckette 2019), DiffWave (Z. Kong et al. 2021), and SaShiMi. A small Mamba model outperforms the state-of-the-art (and much larger) GAN- and diffusion-based models. A larger Mamba model, parameter-matched to the baselines, further improves the fidelity metrics dramatically.


Table 5 takes the small Mamba model and investigates combinations of different architectures for the outer stages and center stage. It shows that Mamba is consistently better than S4+MLP in the outer blocks, and Mamba > S4+MLP > MHA+MLP in the center blocks.
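The sweep below sketches how that ablation grid could be instantiated, reusing the `UNetBackbone` sketch from earlier in this section via its per-stage hooks. The block factories are identical placeholders standing in for real S4+MLP, MHA+MLP, and Mamba implementations, and the outer loop skips MHA+MLP in line with the note in Table 5's caption.

```python
# Placeholder factories; in the real ablation each name would map to a
# different block implementation (S4+MLP, MHA+MLP, or Mamba).
BLOCK_FACTORIES = {"S4+MLP": dummy_block, "MHA+MLP": dummy_block, "Mamba": dummy_block}

for outer in ["S4+MLP", "Mamba"]:                  # Transformers not tried in the outer stages
    for center in ["S4+MLP", "MHA+MLP", "Mamba"]:
        net = UNetBackbone(d_model=64, p=4, n_blocks=8,
                           make_block=BLOCK_FACTORIES[outer],
                           make_center_block=BLOCK_FACTORIES[center])
        n_params = sum(t.numel() for t in net.parameters())
        print(f"outer={outer:7s} center={center:7s} params={n_params / 1e6:.1f}M")
```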


Table 4: (SC09) Automated metrics for unconditional generation on a challenging dataset of fixed-length speech clips. (Top to Bottom) Autoregressive baselines, non-autoregressive baselines, Mamba, and dataset metrics.


Table 5: (SC09 Model Ablations) Models with 6M parameters. In SaShiMi's U-Net backbone, there are 8 center blocks operating on sequence length 1000, sandwiched on each side by 8 blocks on sequence length 4000, which are in turn sandwiched by 8 outer blocks on sequence length 16000 (40 blocks total). The architecture of the 8 center blocks is ablated independently of the rest. Note that Transformers (MHA+MLP) were not tested in the more important outer blocks because of efficiency constraints.


This paper is available on arXiv under the CC BY 4.0 DEED license.