Authors:
(1) Albert Gu, Machine Learning Department, Carnegie Mellon University (equal contribution);
(2) Tri Dao, Department of Computer Science, Princeton University (equal contribution).
3 Selective State Space Models and 3.1 Motivation: Selection as a Means of Compression
3.2 Improving SSMs with Selection
3.3 Efficient Implementation of Selective SSMs
3.4 A Simplified SSM Architecture
3.5 Properties of Selection Mechanisms
4 Empirical Evaluation and 4.1 Synthetic Tasks
4.4 Audio Modeling and Generation
4.5 Speed and Memory Benchmarks
A Discussion: Selection Mechanism
D Hardware-aware Algorithm For Selective SSMs
E Experimental Details and Additional Results
Our selection mechanism is inspired by and related to concepts such as gating, hypernetworks, and data-dependence. It can also be viewed as related to “fast weights” (J. Ba et al. 2016), which connects classical RNNs with the mechanism of linear attention (Schlag, Irie, and Schmidhuber 2021). However, we believe that it is a distinct concept that is worth clarifying.
Gating. Gating originally referred to the gating mechanisms of RNNs such as the LSTM (Hochreiter and Schmidhuber 1997) and GRU (J. Chung et al. 2014), or the gated equation (5) in Theorem 1. This was interpreted as a particular mechanism for controlling whether an input is let into the hidden state of an RNN. In particular, this affects the propagation of signal through time and causes inputs to interact along the sequence length dimension.
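For concreteness, a gated recurrence of this kind can be sketched as follows (a minimal sketch in our own notation; the precise form of equation (5) and Theorem 1 is given in the main paper), with the gate g_t deciding how much of the current input is written into the hidden state:

```latex
g_t = \sigma(\mathrm{Linear}(x_t)), \qquad
h_t = (1 - g_t) \odot h_{t-1} + g_t \odot x_t
```

Because h_t carries h_{t-1} forward, the gate governs how inputs at different time steps interact, which is precisely the sequence-length interaction described above.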
However, the concept of gating has since been relaxed in popular usage to simply mean any multiplicative interaction (often with an activation function). For example, elementwise multiplicative components of neural network architectures (that do not interact along the sequence length) are now commonly referred to as gated architectures (Hua et al. 2022; Mehta et al. 2023), despite carrying a very different meaning than the original RNN sense. We therefore consider the original concept of RNN gating and the popular usage of multiplicative gating to be semantically distinct.
Hypernetworks. Hypernetworks refer to neural networks whose parameters are themselves generated by smaller neural networks. The original idea (Ha, Dai, and Quoc V. Le 2017) used it in a narrow sense to define a large RNN whose recurrent parameters are generated by a smaller RNN.
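Schematically (our notation, not the paper's), a hypernetwork g with its own parameters φ emits the parameters θ of a primary network f, which is then applied to the input (here z stands for whatever conditioning signal the hypernetwork consumes, possibly x itself):

```latex
\theta = g_{\phi}(z), \qquad y = f_{\theta}(x)
```

In the narrow original sense, g is a smaller RNN and θ are the recurrent weights of a larger RNN; under the broad reading, any layer whose parameters are produced by another module would qualify.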
Data-dependence. Similar to hypernetworks, data-dependence can refer to any notion where some parameters of the model depend on the data (Poli et al. 2023).
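As a concrete instance of such a construction (a sketch in our notation; the example in the original appendix may differ in its exact details), consider a layer whose output is its own linear projection scaled elementwise by a data-dependent factor D:

```latex
D = \sigma(\mathrm{Linear}_1(x)), \qquad y = D \odot \mathrm{Linear}_2(x)
```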
This is a rather trivial transformation, yet it technically satisfies the common meanings of gating (since it has a multiplicative “branch”), hypernetworks (since the parameter D is generated by another layer), and data-dependence (since D depends on the data x). However, it in fact simply defines a GLU function, which is so simple that it is often considered just an activation function (Dauphin et al. 2017; Shazeer 2020) rather than a meaningful layer.
Selection. Thus, while selection mechanisms could be considered a special case of ideas such as architectural gating, hypernetworks, or data-dependence, so can an enormous range of other constructions—essentially anything with a multiplication, including standard attention mechanisms (Bahdanau, Cho, and Bengio 2015; Vaswani et al. 2017) as well—and we find it uninformative to think of them as such.
Instead, we view it as most closely related to the gating mechanism of traditional RNNs, which is a special case (Theorem 1) and which also has a deeper history of connections to SSMs through variable (input-dependent) discretization of ∆ (Funahashi and Nakamura 1993; Gu, Dao, et al. 2020; Tallec and Ollivier 2018). We also eschew the term “gating” in favor of selection to clarify the overloaded use of the former. More narrowly, we use selection to refer to the mechanistic action of a model to select or ignore inputs and facilitate data interaction along the sequence length (Section 3.1). Beyond selective SSMs and gated RNNs, other examples may include input-dependent convolutions (Kosma, Nikolentzos, and Vazirgiannis 2023; Lioutas and Guo 2020; Lutati, Zimerman, and Wolf 2023; Yang et al. 2019) and even attention.
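To make the connection to variable (input-dependent) discretization of ∆ mentioned above concrete, here is a minimal sketch in simplified (diagonal, Euler-style) notation rather than the paper's exact parameterization (see Section 3.2): letting the step size ∆ depend on the current input turns a fixed recurrence into one that chooses how strongly to admit each token:

```latex
\Delta_t = \mathrm{softplus}(\mathrm{Linear}(x_t)), \qquad
h_t = \exp(\Delta_t A)\, h_{t-1} + \Delta_t B\, x_t
```

A small ∆_t keeps exp(∆_t A) close to the identity and leaves the state nearly unchanged (the input is effectively ignored), while a large ∆_t (with decaying A) washes out the previous state in favor of the current input, which is the select-or-ignore behavior described in Section 3.1.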
This paper is available on arxiv under CC BY 4.0 DEED license.