The following is a summary of some thoughts I had after listening to the RWKV podcast found here:
https://www.latent.space/p/rwkv#details
Transformer architectures are riding high on the AI revolution of 2023. However, in the rush to adopt them on the back of their recent success, it's easy to overlook alternatives we could learn from.
As engineers, we should not take a one-size-fits-all approach, using the same hammer for every nail; we should weigh the trade-offs of every solution. Failing to do so traps us within the limitations of one particular platform while remaining blissfully unaware of the alternatives, which could hold back development for years.
This problem is not unique to AI. It's a historical pattern that has repeated itself since ancient times up to the modern day.
A notable recent example in software development was the trend towards NoSQL when SQL servers started hitting physical limitations. Startups everywhere made the move to NoSQL for "scale" reasons despite being nowhere near those scales.
However, over time, as eventual consistency and NoSQL management overhead emerged, and hardware capabilities made giant leaps in SSD speed and capacity, we've seen a recent trend back to SQL servers for their simplicity and now sufficient scalability for 90%+ of startups.
Does this mean SQL is better than NoSQL or vice versa? No, it simply means each technology has preferred use cases with pros/cons and learning points that can cross-pollinate among similar technologies.
I hope you view this article not as an attack on transformers, but as a healthy appreciation of alternatives. With that, let's get on to the main topic.
Broadly, the challenges facing current models cover compute, context length, datasets, and alignment. For this discussion, we'll focus on compute and context length:
So, how do we solve this?
The following summarises the RWKV paper, found here.
For a deeper technical dive, see the wiki here.
RWKV, along with Microsoft's RetNet, is among the first of a new class of architectures called "Linear Transformers."
It tackles these limitations directly by supporting the following:
As we continue to scale AI models to large context sizes, from 100k tokens and beyond, the quadratic computational cost of attention quickly becomes prohibitive.
Linear Transformers, instead of abandoning the Recurrent Neural Network architecture over the bottlenecks that led to its replacement by transformers, take the lessons learned from making transformers scalable, re-engineer RNNs to work similarly to transformers, and remove those bottlenecks.
This brings RNNs back into the race with transformers in terms of training speed, allowing them to operate at O(N) cost while scaling past a billion parameters in training, all while maintaining comparable performance.
Because quadratic cost grows much faster than linear cost, the gap exceeds 10x at a 2k token count and 100x at a 100k token length.
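To make the contrast concrete, here is a minimal sketch in plain NumPy. It is not RWKV's actual time-mix formulation; the decay-weighted running average is a hypothetical stand-in for the real recurrence. The point is only that full pairwise attention builds an N x N matrix (quadratic work), while an RNN-style pass carries a fixed-size state and touches each token once (linear work).

```python
# Sketch only: contrasting the cost shape of full attention vs. a recurrent pass.
import numpy as np

def full_attention(q, k, v):
    """Classic attention: every token attends to every other token.
    For a sequence of length N this materialises an N x N score matrix,
    so compute and memory grow quadratically with N."""
    scores = q @ k.T / np.sqrt(q.shape[-1])            # (N, N)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                                  # (N, d)

def recurrent_mixing(k, v):
    """RNN-style pass: a single fixed-size state carried across tokens.
    Each step touches only the current token, so total cost grows linearly
    with N. (Illustrative decay-weighted average, not the RWKV equations.)"""
    n, d = v.shape
    decay = 0.9                                         # hypothetical decay factor
    state_num = np.zeros(d)
    state_den = 1e-9
    out = np.zeros_like(v)
    for t in range(n):
        w = np.exp(k[t].mean())                         # toy per-token weight
        state_num = decay * state_num + w * v[t]
        state_den = decay * state_den + w
        out[t] = state_num / state_den
    return out

rng = np.random.default_rng(0)
n, d = 8, 4
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
print(full_attention(q, k, v).shape)    # (8, 4), but O(N^2) work
print(recurrent_mixing(k, v).shape)     # (8, 4), O(N) work
```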
At 14B parameters, RWKV is the largest open-source linear transformer, and it performs on par with GPT-NeoX and other models trained on similar datasets (e.g. The Pile).
But what does this mean, in simpler terms?
Pros
Cons
So, while RWKV isn't yet at the 60B+ parameter scale of LLaMA 2, with the right support and resources it has the potential to get there at lower cost and with larger contexts, especially as expert models trend towards smaller, more efficient sizes.
Consider it if efficiency matters for your use case. But it's not the final solution - healthy alternatives are still key.
Also, one last thing I forgot to mention:
Linear Transformers typically do not use classic Transformer attention, but an approximation of it, meaning they trade the exact, all-pairs attention of transformers for efficiency.
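As a rough illustration, here is a generic kernelized linear attention, assumed for the example rather than taken from the RWKV paper. Replacing the softmax with a feature map lets the matrix products be re-associated so the N x N score matrix is never built:

```python
# Sketch only: generic (non-causal) linear attention vs. exact softmax attention.
import numpy as np

def softmax_attention(q, k, v):
    """Exact attention: materialises the N x N score matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def linear_attention(q, k, v, phi=lambda x: np.maximum(x, 0) + 1.0):
    """Approximate attention: with a positive feature map phi, the product is
    re-associated as phi(Q) @ (phi(K).T @ V), costing O(N * d^2) rather than
    O(N^2 * d). The result only approximates softmax attention."""
    qp, kp = phi(q), phi(k)
    kv = kp.T @ v                        # (d, d) summary of keys and values
    z = kp.sum(axis=0)                   # (d,) normaliser
    return (qp @ kv) / (qp @ z)[:, None]

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
exact = softmax_attention(q, k, v)
approx = linear_attention(q, k, v)
print(np.abs(exact - approx).mean())     # non-zero: it is an approximation
```

The printed difference is non-zero by design: efficiency is bought by giving up exact softmax attention, which is precisely the trade-off described above.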
Disclosure: I am a code contributor to the RWKV project and an active member of the community. However, that does not mean I view RWKV as the final answer; even if "Linear Transformers" eventually beat out traditional "Transformer" networks, research into alternatives is still important for healthy ecosystem development.
I strongly believe there is no such thing as "one architecture to rule them all." Additionally, I am aware there have since been developments like Transformer-XL that scale transformers more efficiently, somewhere between quadratic and linear cost. However, with very few known public implementations of this method, it is hard for me to evaluate its pros and cons.
Also published here.