I build Turing Post, a newsletter about AI and ML equipping you with in-depth knowledge. http://www.turingpost.com/
This week, a largely unknown company, Groq, demonstrated unprecedented speed running open-source LLMs such as Llama-2 (70 billion parameters) at more than 100 tokens per second, and Mixtral at nearly 500 tokens per second per user on Groq’s Language Processing Unit (LPU).
So: what is the LPU, how does it work, and where did Groq (such an unfortunate name, given that Musk’s Grok is all over the media) come from?
Remember the game of Go in 2016, when AlphaGo played world champion Lee Sedol and won? About a month before that match, AlphaGo lost a test game. The researchers at DeepMind then ported AlphaGo to the Tensor Processing Unit (TPU), and the program went on to win by a wide margin.
The realization that computational power was a bottleneck for AI’s potential led to the inception of Groq and the creation of the LPU. It struck Jonathan Ross, who had initially begun what became the TPU project at Google; he founded Groq in 2016.
The LPU is a special kind of computer brain designed to handle language tasks very quickly. Unlike other computer chips that do many things at once (parallel processing), the LPU works on tasks one after the other (sequential processing), which is perfect for understanding and generating language.
Imagine it like a relay race where each runner (chip) passes the baton (data) to the next, making everything run super fast. The LPU is designed to overcome the two LLM bottlenecks: compute density and memory bandwidth.
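That relay-race picture can be sketched in a few lines of Python. This is a toy illustration, not Groq’s actual design: the point is that autoregressive generation is inherently sequential, because token t must exist before token t+1 can be computed, so per-token latency (not raw parallelism) sets the pace. The `demo_model` below is a hypothetical stand-in for a real model’s forward pass.

```python
def generate(next_token, prompt, n_tokens):
    """Autoregressive decoding: each new token depends on all the
    tokens before it, like a runner waiting for the baton."""
    tokens = list(prompt)
    for _ in range(n_tokens):
        tokens.append(next_token(tokens))  # strictly one after another
    return tokens

# Hypothetical stand-in for a model's forward pass: next token = last + 1.
demo_model = lambda toks: toks[-1] + 1

print(generate(demo_model, [0], 5))  # -> [0, 1, 2, 3, 4, 5]
```

No amount of batch parallelism removes the dependency inside that loop; hardware built to stream each step with minimal latency wins on exactly this workload.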
Groq took a novel approach right from the start, focusing on software and compiler development before even thinking about the hardware. They made sure the software could guide how the chips talk to each other, ensuring they work together seamlessly like a team in a factory.
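A hedged sketch of that “software first” idea (the names and ops here are hypothetical, not Groq’s compiler): instead of chips negotiating at runtime, the compiler emits a fixed schedule that says exactly which chip runs which operation and in what order, so execution is deterministic.

```python
# A static schedule, produced "at compile time" by the software,
# not negotiated by the hardware at runtime.
schedule = [
    ("chip0", "matmul"),
    ("chip1", "activation"),
    ("chip2", "matmul"),
]

def run(schedule, x):
    """Execute the fixed schedule step by step, passing the result
    from one chip to the next like stations on a factory line."""
    ops = {"matmul": lambda v: v * 2, "activation": lambda v: max(v, 0)}
    trace = []
    for chip, op in schedule:      # always the same order, same timing
        x = ops[op](x)
        trace.append((chip, op, x))
    return x, trace

result, trace = run(schedule, 3)
print(result)  # 3 -> 6 -> 6 -> 12
```

Because nothing is decided at runtime, latency is predictable, which is exactly the property you want when every token in a sequence is waiting on the previous one.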
This makes the LPU really good at processing language efficiently and at high speed, ideal for AI tasks that involve understanding or creating text.
This led to a highly optimized system that not only runs circles around traditional setups in terms of speed but does so with greater cost efficiency and lower energy consumption. This is big news for industries like finance, government, and tech, where quick and accurate data processing is key.
Now, don’t go tossing out your GPUs just yet! While the LPU is a beast when it comes to inference, making light work of applying trained models to new data, GPUs still reign supreme in the training arena. The LPU and GPU might become the dynamic duo of AI hardware, each excelling in their respective roles.
As Elvis Saravia put it: “With breakthroughs in inference and long context understanding, we are officially entering a new era in LLMs.”
To better understand the architecture, Groq offers two papers: one from 2020 (Think Fast: A Tensor Streaming Processor (TSP) for Accelerating Deep Learning Workloads) and one from 2022 (A Software-defined Tensor Streaming Multiprocessor for Large-scale Machine Learning). The term “LPU” must be a recent addition to Groq’s narrative, since it is never mentioned in either paper.
Additional read:
Introducing Aya: Aya’s dataset, an open-access multilingual instruction collection from Cohere For AI →read the paper: https://arxiv.org/pdf/2402.06619.pdf
Introducing Sora: This paper introduces Sora, a breakthrough in video generation technology by OpenAI, capable of producing high-fidelity videos. It leverages spacetime patches to handle videos of varying durations and resolutions, making strides toward simulating the physical world with impressive 3D consistency and long-range coherence.
It represents a leap in the ability to create detailed simulations that could be used for a myriad of applications, from entertainment to virtual testing environments →read the paper.
Additional read:
Introducing V-JEPA (Yann LeCun’s vision of advanced machine intelligence, AMI): Meta’s V-JEPA model revolutionizes unsupervised learning from videos by using feature prediction as its sole objective. This approach bypasses the need for pre-trained image encoders or text annotations, relying instead on the intrinsic dynamics of video data to learn versatile visual representations.
It’s a significant contribution to the field of unsupervised visual learning, promising advancements in how machines understand motion and appearance without explicit guidance →read the paper.
Introducing Gemini 1.5: Google DeepMind’s Gemini 1.5 introduces a Mixture-of-Experts architecture, enhancing the model’s performance across a broader array of tasks. Notably, it expands the context window to 1 million tokens, enabling deep analysis over large datasets.
Gemini 1.5 represents a significant step forward in AI’s capability to process and understand extensive contexts, marking a milestone in the development of multimodal models →read the paper.
Introducing Stable Cascade: Stable Cascade from Stability AI introduces a novel text-to-image generation framework that prioritizes efficiency, ease of training, and fine-tuning on consumer-grade hardware.
The model’s hierarchical compression technique represents a significant reduction in the resources required for training high-quality generative models, providing a pathway for wider accessibility and experimentation in the AI community →read the paper.