Mamba Architecture: What Is It and Can It Beat Transformers?

by Ksenia Se (@kseniase), March 26th, 2024

Too Long; Didn't Read

Mamba is a new architecture built on State-Space Models (SSMs), particularly Structured State Space (S4) models. It processes long sequences efficiently: its cost scales linearly with sequence length, whereas attention in traditional Transformer-based models scales quadratically. This makes tasks like genomic analysis and long-form content generation feasible without memory or compute bottlenecks. Recent papers introduce extensions such as EfficientVMamba for resource-constrained deployment, Cobra for multi-modal reasoning, and SiMBA for stability in scaling, showcasing Mamba's architectural flexibility and potential in various domains.

While everyone is focusing on the hot news about Microsoft's thinly disguised acqui-hire of Inflection AI and the shakeup at Stability AI, we'd like to concentrate on the exciting developments unfolding in the world of model architectures. For the hot news, check the Usual Suspects © section below.

Now, let's talk about Mamba – a new architecture that rivals the famous Transformer-based models. Mamba's innovations address significant challenges in processing long sequences, a problem that has limited traditional models.

So what is it? Mamba builds on state-space models (SSMs)*, in particular Structured State Space (S4) models, and brings them into a large language model (LLM) framework. This lets Mamba's cost scale linearly with sequence length, a significant advance over the quadratic scaling of attention in traditional Transformer-based models.
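To make the linear-versus-quadratic contrast concrete, here is a rough back-of-the-envelope sketch. It counts only the dominant term for each approach and ignores constants, channels, and heads. The function names and the state size of 16 are assumptions chosen for illustration, not figures from the Mamba paper.

```python
def attention_pairwise_ops(seq_len: int) -> int:
    # Self-attention scores every token against every token: L * L.
    return seq_len * seq_len


def ssm_scan_ops(seq_len: int, state_size: int = 16) -> int:
    # A recurrent scan updates a fixed-size state once per token: L * N.
    # state_size=16 is an assumed value for illustration.
    return seq_len * state_size


if __name__ == "__main__":
    for seq_len in (1_000, 10_000, 100_000):
        ratio = attention_pairwise_ops(seq_len) / ssm_scan_ops(seq_len)
        print(f"L={seq_len}: attention/scan op ratio = {ratio:.1f}")
```

Doubling the sequence length doubles the scan's count but quadruples the attention count, which is the whole argument in miniature.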

Its streamlined architecture incorporates selective SSM layers, enhancing both efficiency and flexibility.
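What "selective" means can be sketched in a few lines: the SSM parameters (the step size and the input/output projections) are computed from the current input instead of being fixed, so the model can decide per token what to keep in its state. The toy below is a single-channel, pure-Python illustration under simplifying assumptions. The names `w_b`, `w_c`, and `w_delta` are hypothetical, and the real implementation is a fused, parallel GPU kernel rather than a Python loop.

```python
import math


def softplus(z: float) -> float:
    # Numerically stable softplus; keeps the step size delta positive.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def selective_scan(xs, a, w_b, w_c, w_delta):
    """Toy single-channel selective SSM (illustrative, not Mamba's kernel).

    xs:      list of floats, the input sequence
    a:       list of N negative floats, diagonal of the state matrix A
    w_b:     list of N floats, hypothetical weights making B depend on x_t
    w_c:     list of N floats, hypothetical weights making C depend on x_t
    w_delta: float, hypothetical weight producing the step size from x_t
    """
    n = len(a)
    h = [0.0] * n  # fixed-size hidden state, independent of len(xs)
    ys = []
    for x in xs:  # one pass over the sequence: O(L * N) work
        delta = softplus(w_delta * x)  # input-dependent step size
        y = 0.0
        for i in range(n):
            a_bar = math.exp(delta * a[i])  # discretized A
            b_bar = delta * (w_b[i] * x)    # discretized B (simplified Euler step)
            h[i] = a_bar * h[i] + b_bar * x
            y += (w_c[i] * x) * h[i]        # input-dependent readout
        ys.append(y)
    return ys


if __name__ == "__main__":
    out = selective_scan([0.5, -1.0, 2.0, 0.0, 1.5],
                         a=[-1.0, -0.5], w_b=[0.3, -0.2],
                         w_c=[1.0, 0.5], w_delta=0.7)
    print(out)
```

Because the state `h` has a fixed size, memory does not grow with sequence length, and each output depends only on inputs seen so far.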

As a result, Mamba can process extremely long sequences efficiently while matching or surpassing earlier sequence models in performance. It also relies on a hardware-aware implementation designed to make good use of contemporary GPU architectures.

This means you can process much longer sequences without hitting memory or compute bottlenecks. Think about applications like genomic analysis, long-form content generation, and complex multi-modal data processing, all becoming more feasible with Mamba's power.

*State-space models are mathematical frameworks that describe a system's dynamics in terms of its state variables and observations, capturing the evolution and uncertainty of processes over time.  SSMs are known for efficiency with long sequences.
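For readers who want the footnote in symbols, the linear state-space form commonly used in the S4 and Mamba literature can be written as below. The notation is an assumption for illustration: x is the input, h the hidden state, y the output, and Δ the discretization step size. The first line is the continuous-time system, the second is its discrete recurrence, and the third is the zero-order-hold discretization that links the two.

```latex
\begin{aligned}
h'(t) &= A\,h(t) + B\,x(t), & y(t) &= C\,h(t) \\
h_t &= \bar{A}\,h_{t-1} + \bar{B}\,x_t, & y_t &= C\,h_t \\
\bar{A} &= \exp(\Delta A), & \bar{B} &= (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr)\,\Delta B
\end{aligned}
```

The recurrence in the second line touches each input once and carries a fixed-size state, which is where the efficiency on long sequences comes from.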

Mamba's ability to efficiently process long sequences while maintaining competitive performance has fueled research interest in adapting and extending the architecture for various domains. It seems Mamba’s architecture is getting more attention (Attention is all you need ;) – last week, three papers showcased exciting developments.

The paper, EfficientVMamba: Atrous Selective Scan for Light Weight Visual Mamba, makes Mamba more suitable for deployment on resource-constrained devices by introducing an efficient 2D scanning method and a dual-pathway module for balanced global-local feature extraction. Results show a significant reduction in FLOPs while maintaining strong accuracy.

The paper, Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference, extends Mamba to be a multi-modal large language model capable of jointly reasoning over vision and language. Experiments demonstrate competitive performance on vision-language tasks with faster inference speeds compared to Transformer-based models.

The paper, SiMBA: Simplified Mamba-based Architecture for Vision and Multivariate Time series, presents a simplified Mamba-based architecture that addresses stability issues when scaling Mamba to larger sizes. The key innovation is EinFFT, a novel channel mixing technique that ensures stable optimization. SiMBA shows strong results on vision tasks and multivariate time series forecasting, closing the gap with state-of-the-art Transformers.

These three papers highlight the architectural flexibility and potential of the Mamba model, which is promising for future advancements in context window size and data type support. If you want to move beyond Transformers, that might be the way to go. You can find the Mamba repository here:

News From The Usual Suspects ©

Microsoft Is Hungry

  • Microsoft basically owns OpenAI (“Indeed, as the November 2023 drama was unfolding, Microsoft’s CEO boasted that it would not matter “[i]f OpenAI disappeared tomorrow.” He explained that “[w]e have all the IP rights and all the capability.” “We have the people, we have the compute, we have the data, we have everything.” “We are below them, above them, around them.” — from Elon Musk’s lawsuit against OpenAI).

  • In February, Microsoft invested in Mistral and brought its newest AI model, Mistral Large, to Azure.

  • Last week, Microsoft basically acqui-hired Inflection AI’s team, including two of its co-founders: Mustafa Suleyman and Karén Simonyan. Eric Newcomer points to Microsoft’s creativity in non-acquisition strategies, such as partnering with and investing in companies like Inflection and OpenAI without formal acquisitions, sidestepping antitrust reviews. Stratechery explains why many aspects of Inflection and its rapid ascent to unicorn status were odd from the very beginning. Soma Somasegar from Madrona VC thinks the move is about innovating in AI for consumers, with Microsoft aiming to transform its presence in the consumer market.

Stability AI Evokes Many Jokes About Instability

  • This January, we published a profile of Stability AI with the subtitle: “Investigating the Thin Line Between Genuine Innovation and Strategic Exaggeration in AI”. On March 23, after too many exaggerations, Emad Mostaque resigned from his position as CEO of Stability AI. According to him, he wants to concentrate on decentralized AI. What that means exactly, no one knows.


Hugging Face

  • announces the release of Common Corpus, the largest public domain dataset designed for training LLMs. It is a multilingual collection of 500 billion words in languages including English, French, Dutch, Spanish, German, and Italian, sourced from diverse cultural heritage initiatives. This release aims to demonstrate the feasibility of developing open LLMs using copyright-free materials, supported by an international collaboration with organizations dedicated to open science in AI.

Sam Altman Is Open-Sourcing His Orb

  • The Worldcoin Foundation has made the Orb’s software (Orb is a device that scans irises to create unique digital IDs) open-source, aiming to enhance privacy and security in proving humanness online. The software, available under MIT/Apache 2.0 licenses on GitHub, supports World ID verification by processing images locally and ensuring data privacy through secure transfers.

Enjoyed This Story?

I write a weekly analysis of the AI world in the Turing Post newsletter. We aim to equip you with comprehensive knowledge and historical insights so you can make informed decisions about AI and ML.