We are only talking about time complexity in this article - deliberately.
For space complexity, refer to my article on 1-bit transformers, available here:
We are racing into the future as far as generative AI technology is concerned, and the algorithms behind large language models are no exception. In this article, we are going to cover three of the most exciting recent developments in the field of generative AI and talk about them in detail. One of them has also achieved the optimal time complexity for running a large language model. In other words, a recent development has become the asymptotically fastest possible LLM transformer algorithm - under our current models of computation, it is not possible to go faster, except by constant-factor optimizations. Because we are dealing with hundreds of billions of parameters, those constant-factor speed-ups can still be rather large! I hope you’re as excited as I am, because this will be an exciting ride!
Everyone is familiar with the seminal 2017 paper Attention Is All You Need, but I am going to summarize it anyway so that newcomers have a clearer picture of what we’re talking about.
This is the link to the research paper:
From the paper introduction:
Recurrent neural networks, long short-term memory and gated recurrent neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation.
Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures.
Recurrent models typically factor computation along the symbol positions of the input and output sequences.
Aligning the positions to steps in computation time, they generate a sequence of hidden states h_t, as a function of the previous hidden state h_{t-1} and the input for position t.
This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples.
Recent work has achieved significant improvements in computational efficiency through factorization tricks and conditional computation, while also improving model performance in case of the latter.
The fundamental constraint of sequential computation, however, remains.
Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences.
In all but a few cases, however, such attention mechanisms are used in conjunction with a recurrent network.
In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output.
The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.
And as we know, the GPT-1, GPT-2, GPT-3, and GPT-3.5 transformers soon revolutionized generative AI forever. Suddenly, machines could produce seemingly human English.
This was the classic diagram which dominated articles and research news bulletins for the next two years:
Then GPT-4 came out - and life would never be the same again.
We had crossed a tipping point.
But these transformers were expensive, slow to train, and difficult to deploy because of their extremely high operating costs.
The time complexity of the Transformer algorithm is quadratic, or O(n*n), where n is the number of input tokens (the sequence length).
For a standard transformer model with L layers, the time complexity of the inference algorithm is O(L*n*n*d), where L is the number of layers, n the number of input tokens, and d the model (embedding) dimension.
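To make the quadratic term concrete, here is a minimal NumPy sketch of a single self-attention head (an illustration only, not any library’s actual implementation); the n-by-n score matrix is where the O(n*n*d) per-layer cost comes from:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Naive single-head self-attention over n tokens of dimension d.

    Q @ K.T builds an (n, n) score matrix, so time (and memory)
    grow quadratically with the sequence length n.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv               # each (n, d)
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # (n, n): the O(n*n*d) term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax
    return weights @ V                             # (n, d): another O(n*n*d) term

n, d = 1024, 64                                    # toy sizes for illustration
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)                                   # (1024, 64)
```

Stacking L such layers gives the O(L*n*n*d) figure above; doubling the context length quadruples the attention cost.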
This seemed to be the state of the art - for a while.
Quantization was introduced in another paper as early as 2021, and that seemed to be the next state-of-the-art mechanism (See the Prologue section).
But soon we were about to have another contender.
This was the relevant research paper:
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
From the research paper abstract:
Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module.
Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs) have been developed to address Transformers' computational inefficiency on long sequences, but they have not performed as well as attention on important modalities such as language.
We identify that a key weakness of such models is their inability to perform content-based reasoning, and make several improvements.
First, simply letting the SSM parameters be functions of the input addresses their weakness with discrete modalities, allowing the model to selectively propagate or forget information along the sequence length dimension depending on the current token.
Second, even though this change prevents the use of efficient convolutions, we design a hardware-aware parallel algorithm in recurrent mode.
We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba).
Mamba enjoys fast inference (5× higher throughput than Transformers) and linear scaling in sequence length, and its performance improves on real data up to million-length sequences.
As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics.
On language modeling, our Mamba-3B model outperforms Transformers of the same size and matches Transformers twice its size, both in pretraining and downstream evaluation.
Suddenly we had a new competitor in town!
The main advantages of the Mamba architecture, straight from the abstract, were fast inference (about 5× higher throughput than Transformers), linear rather than quadratic scaling in sequence length, performance that keeps improving on real data up to million-length sequences, and state-of-the-art results across modalities such as language, audio, and genomics.
The Mamba architecture represented a significant advancement in large language models, aiming to combine the modeling quality of Transformers with the linear-time efficiency of SSMs.
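To give a feel for the selective-scan idea described in the abstract, here is a heavily simplified sketch of a selective state-space recurrence (my own toy formulation with made-up shapes and parameter names, not the actual hardware-aware Mamba kernel): each token updates a fixed-size state with input-dependent parameters, so one pass over the sequence costs O(n).

```python
import numpy as np

def selective_ssm(x, W_B, W_C, W_delta, A):
    """Toy selective state-space scan (simplified, not the real Mamba kernel).

    x: (n, d) input sequence. A: (d, s) fixed state-decay parameters.
    B, C and the step size delta are functions of the current input,
    which is the 'selective' part: the model chooses per token what to
    write into and read out of its fixed-size state.
    """
    n, d = x.shape
    s = A.shape[1]                       # state size per channel
    h = np.zeros((d, s))                 # recurrent state: size independent of n
    y = np.zeros((n, d))
    for t in range(n):                   # one pass over the sequence: O(n)
        delta = np.log1p(np.exp(x[t] @ W_delta))[:, None]        # (d, 1), softplus
        A_bar = np.exp(delta * A)        # input-dependent decay, (d, s)
        B_t = x[t] @ W_B                 # (s,) input-dependent "write" vector
        C_t = x[t] @ W_C                 # (s,) input-dependent "read" vector
        h = A_bar * h + delta * (x[t][:, None] * B_t[None, :])   # state update
        y[t] = h @ C_t                   # (d,) output for this token
    return y

n, d, s = 256, 16, 8
rng = np.random.default_rng(0)
x = rng.standard_normal((n, d))
W_B, W_C = rng.standard_normal((d, s)), rng.standard_normal((d, s))
W_delta = rng.standard_normal((d, d))
A = -np.exp(rng.standard_normal((d, s)))   # negative decay for stability
print(selective_ssm(x, W_B, W_C, W_delta, A).shape)   # (256, 16)
```

The real implementation fuses this scan into a hardware-aware parallel kernel, but the asymptotics are the same: linear in sequence length, with a constant-size state.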
However, as testing continued, it was found that the Mamba algorithm was not a suitable contender for all use cases.
In particular, Mamba struggled badly when presented with the IMDb dataset.
The architecture was still state-of-the-art, though, and it was found to be hugely useful for vision use cases.
You can see an implementation in Python here:
And this is an excellent explanation of the Mamba algorithm with the theory also provided.
And here is the standard Mamba implementation in PyPI:
The Mamba algorithm had its day and is still a highly active area of research. A successor soon came out, but we’ll save the best for last.
For now, we’ll move on to the next contender: the xLSTM algorithm.
You can refer to the research paper here:
xLSTM: Extended Long Short-Term Memory - arXiv.
From the research paper abstract:
In the 1990s, the constant error carousel and gating were introduced as the central ideas of the Long Short-Term Memory (LSTM).
Since then, LSTMs have stood the test of time and contributed to numerous deep learning success stories, in particular they constituted the first Large Language Models (LLMs).
However, the advent of the Transformer technology with parallelizable self-attention at its core marked the dawn of a new era, outpacing LSTMs at scale.
We now raise a simple question: How far do we get in language modeling when scaling LSTMs to billions of parameters, leveraging the latest techniques from modern LLMs, but mitigating known limitations of LSTMs?
Firstly, we introduce exponential gating with appropriate normalization and stabilization techniques.
Secondly, we modify the LSTM memory structure, obtaining:
(i) sLSTM with a scalar memory, a scalar update, and new memory mixing,
(ii) mLSTM that is fully parallelizable with a matrix memory and a covariance update rule.
Integrating these LSTM extensions into residual block backbones yields xLSTM blocks that are then residually stacked into xLSTM architectures.
Exponential gating and modified memory structures boost xLSTM capabilities to perform favorably when compared to state-of-the-art Transformers and State Space Models, both in performance and scaling.
The Long Short-Term Memory (LSTM) algorithm was highly useful in its day and had its fair share of successes.
The xLSTM kept the core LSTM ideas but embedded them in an entirely new architecture.
This was the main innovation, summarized in this diagram in the research paper:
The main advantages of the xLSTM were exponential gating with proper normalization and stabilization, the fully parallelizable mLSTM variant with its matrix memory, and performance and scaling that compare favorably with state-of-the-art Transformers and state-space models.
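To give a flavor of the “exponential gating with appropriate normalization and stabilization” mentioned in the abstract, here is a heavily simplified sketch of an sLSTM-style cell step (my own simplification for intuition; the weight names, shapes, and omitted bias and recurrent-gate terms are assumptions, not the paper’s exact equations):

```python
import numpy as np

def slstm_step(x_t, h_prev, c_prev, n_prev, m_prev, params):
    """One simplified sLSTM-style step with exponential gating.

    The stabilizer state m keeps the exponentials of the input and forget
    gate pre-activations from overflowing, and the normalizer state n
    rescales the cell state before it is read out.
    """
    Wz, Wi, Wf, Wo, R = params                     # input and recurrent weights
    z_t = np.tanh(Wz @ x_t + R @ h_prev)           # candidate cell input
    i_tilde = Wi @ x_t                             # input gate pre-activation (log space)
    f_tilde = Wf @ x_t                             # forget gate pre-activation (log space)
    o_t = 1.0 / (1.0 + np.exp(-(Wo @ x_t)))        # ordinary sigmoid output gate

    m_t = np.maximum(f_tilde + m_prev, i_tilde)    # stabilizer: running max in log space
    i_t = np.exp(i_tilde - m_t)                    # stabilized exponential input gate
    f_t = np.exp(f_tilde + m_prev - m_t)           # stabilized exponential forget gate

    c_t = f_t * c_prev + i_t * z_t                 # cell state update
    n_t = f_t * n_prev + i_t                       # normalizer state update
    h_t = o_t * (c_t / n_t)                        # normalized hidden state
    return h_t, c_t, n_t, m_t

d = 8
rng = np.random.default_rng(0)
params = tuple(rng.standard_normal((d, d)) * 0.1 for _ in range(5))
h = c = n = m = np.zeros(d)
for x_t in rng.standard_normal((20, d)):           # run a short toy sequence
    h, c, n, m = slstm_step(x_t, h, c, n, m, params)
print(h.shape)                                      # (8,)
```

The stabilizer state m_t is what keeps the exponential gates numerically safe, and the normalizer state n_t plays the role of the “appropriate normalization” the abstract refers to.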
However, it was important to note that Transformers had their own set of advantages, such as better parallelization capabilities, superior performance on large datasets, and state-of-the-art results in many NLP tasks.
The choice between xLSTM and Transformer architectures has to be based on the specific requirements and constraints of the task at hand.
You can see an implementation of xLSTM in PyTorch here:
You can see a detailed explanation of xLSTM here:
This is a good summary of its current status:
But there was a successor to Mamba that hit the Holy Grail - the optimal time complexity for an LLM algorithm.
The research paper can be found here:
Jamba: A Hybrid Transformer-Mamba Language Model
From the abstract of the Research Paper:
We present Jamba, a new base large language model based on a novel hybrid Transformer-Mamba mixture-of-experts (MoE) architecture.
Specifically, Jamba interleaves blocks of Transformer and Mamba layers, enjoying the benefits of both model families.
MoE is added in some of these layers to increase model capacity while keeping active parameter usage manageable.
This flexible architecture allows resource- and objective-specific configurations.
In the particular configuration we have implemented, we end up with a powerful model that fits in a single 80GB GPU.
Built at large scale, Jamba provides high throughput and small memory footprint compared to vanilla Transformers, and at the same time state-of-the-art performance on standard language model benchmarks and long-context evaluations.
Remarkably, the model presents strong results for up to 256K tokens context length.
We study various architectural decisions, such as how to combine Transformer and Mamba layers, and how to mix experts, and show that some of them are crucial in large scale modeling.
We also describe several interesting properties of these architectures which the training and evaluation of Jamba have revealed, and plan to release checkpoints from various ablation runs, to encourage further exploration of this novel architecture.
We make the weights of our implementation of Jamba publicly available under a permissive license.
The implementation is available on the HuggingFace repository here:
Model: https://huggingface.co/ai21labs/Jamba-v0.1
In summary, Jamba’s hybrid architecture combines the strengths of Transformers and Mamba layers, resulting in impressive performance and scalability.
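As a mental model of the interleaving described in the abstract, the sketch below lays out a hypothetical hybrid stack in Python; the layer names, the attention-to-Mamba ratio, and the MoE placement here are illustrative placeholders, not AI21’s actual configuration:

```python
from dataclasses import dataclass

@dataclass
class LayerSpec:
    mixer: str   # "attention" or "mamba": how tokens exchange information
    ffn: str     # "mlp" or "moe": how each token is then transformed

def build_hybrid_stack(n_layers, attention_every=8, moe_every=2):
    """Plan a stack that interleaves Mamba and attention layers and
    swaps some of the MLP blocks for mixture-of-experts blocks."""
    stack = []
    for i in range(n_layers):
        mixer = "attention" if i % attention_every == 0 else "mamba"
        ffn = "moe" if i % moe_every == 1 else "mlp"
        stack.append(LayerSpec(mixer, ffn))
    return stack

for idx, spec in enumerate(build_hybrid_stack(8)):
    print(idx, spec.mixer, spec.ffn)
```

The intuition is that most layers are linear-time Mamba layers, the occasional attention layer restores the in-context precision of Transformers, and MoE adds capacity without increasing the number of parameters active per token.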
The key diagram to remember is the one presented in this research paper above:
The interleaving of Mamba and Transformer layers leads to an incredible improvement in time complexity, which is beautifully summarized in the article below:
You can read the full article here:
Mamba and Jamba — Simply Explained, by Nimrita Koul, on Medium.com.
The key point to note here is that, for training, the algorithm has to look at every input token at least once, giving a lower bound - and, for the Mamba layers, an actual running time - of O(n).
Also, the fastest possible speed at which inference can run for any LLM is O(1) per generated token - constant time, independent of the number of tokens seen so far - which a fixed-size recurrent state makes possible (an incredible achievement)!
Therefore, up to constant factors - which may still be very large, since these models have hundreds of billions of parameters:
Jamba has reached the optimal bound for time complexity for a transformer-style algorithm!
Under the given system conditions, unless fundamentally new technology is introduced (quantum computing, anyone?), we simply cannot have a faster asymptotic time complexity!
Which is a very significant result!
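To see where the O(n)-for-training and O(1)-per-token claims come from, here is a toy comparison (a sketch with made-up shapes, not a real decoder) of the work a cached attention layer does per generated token versus the work a fixed-state recurrent layer does:

```python
import numpy as np

d, s = 64, 16
rng = np.random.default_rng(0)

# Transformer decoding with a KV cache: each new token still attends over
# every cached key/value pair, so the per-token cost grows with the number
# of tokens generated so far (O(n) per token, O(n*n) for the whole sequence).
def transformer_step(q, kv_cache):
    K, V = kv_cache
    scores = K @ q / np.sqrt(d)                 # cost proportional to len(K)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

# Recurrent (Mamba-style) decoding: the state has a fixed size, so each new
# token costs the same no matter how long the sequence already is
# (O(1) per token, O(n) for the whole sequence).
def ssm_step(x, h, A, B, C):
    h = A * h + np.outer(x, B)                  # fixed-size state update
    return h @ C, h

for n_generated in (1_000, 10_000):
    K, V = rng.standard_normal((n_generated, d)), rng.standard_normal((n_generated, d))
    _ = transformer_step(rng.standard_normal(d), (K, V))   # work scales with n_generated
    h = np.zeros((d, s))
    A = np.full((d, s), 0.9)
    B, C = rng.standard_normal(s), rng.standard_normal(s)
    _, h = ssm_step(rng.standard_normal(d), h, A, B, C)    # same work either way
```

The attention step touches every cached key and value, so its cost grows with the sequence generated so far; the recurrent step touches only a fixed-size state, so its cost stays constant no matter how long the sequence becomes.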
The official announcement by AI21 Labs:
Another good article on Medium on Jamba:
One of the best implementations of Jamba available at the moment:
Once again, the HuggingFace Hub’s Jamba model:
Thus, Jamba reaches the best time complexity achievable by a current transformer-style algorithm under the existing model of computation, up to constant factors. To repeat: the constants may be very large, because these models have on the order of hundreds of billions of parameters! Even so, this is a substantial achievement. And there are no limits to where the research on this can go, especially when combined with DPO (Direct Preference Optimization) and quantization - see the Epilogue for more.
There is one side to this that no one seems to be working on openly.
Can Mamba, the xLSTM, and the Jamba models be quantized to 1-bit precision?
Of course!
I cannot wait to see Mamba’s and Jamba’s performance improvements once quantized to one bit - or to 1.58 bits, using the ternary set {-1, 0, 1}.
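For intuition, quantizing a weight matrix to the ternary set {-1, 0, 1} can be sketched in a few lines (a rough illustration in the spirit of the 1.58-bit work, using a simple mean-absolute-value scale; actual schemes and their training procedures differ):

```python
import numpy as np

def ternary_quantize(W, eps=1e-8):
    """Quantize a weight matrix to {-1, 0, 1} with a per-matrix scale.

    Dividing by the mean absolute value and rounding keeps most of the
    signal while each weight now needs only ~1.58 bits (log2 of 3 states).
    The scale is kept so outputs can be rescaled after the matmul.
    """
    scale = np.abs(W).mean() + eps
    W_q = np.clip(np.round(W / scale), -1, 1)
    return W_q.astype(np.int8), scale

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256)) * 0.02
W_q, scale = ternary_quantize(W)
x = rng.standard_normal(256)
y_full = W @ x                       # full-precision result
y_quant = (W_q @ x) * scale          # ternary matmul + rescale
print(np.corrcoef(y_full, y_quant)[0, 1])   # correlation should be close to 1
```

Each weight then needs only log2(3) ≈ 1.58 bits of storage, and the matrix multiply reduces to additions and subtractions; whether the SSM recurrences in Mamba and Jamba tolerate this as gracefully as attention does is exactly the open question raised above.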
Once again, see this article for more details:
https://hackernoon.com/why-1-bit-transformers-will-change-the-world
The future of this technology is going to be incredibly exciting!
May the joy and the thrill of working in this field always remain with you!
Cheers!
For quantization, this paper is definitely worth a read:
And the model on HuggingFace: