AI Chips in 2024: Is Nvidia Poised to Lead The Race?

by George Anadiotis, November 11th, 2023

What we learned about the future of AI Chips by keeping track of NVIDIA’s latest announcements, talking to industry experts, and scanning news and analyses.


Exploring AI chips has been a pastime, as well as a popular theme, in Orchestrate all the Things articles. In 2023, we felt like we fell somewhat behind on that… but then again, does that matter? Doesn’t NVIDIA still reign supreme – $1 trillion valuation, more than 80% market share, H100s selling like hot cakes and breaking all records and all? Well, yes, but… not so fast.


After having the chance to pick the brain of Evan Sparks, CPO of AI at HPE, at the AI Chips episode of our “What’s New in AI” series with O’Reilly, sit in on a couple of NVIDIA’s press conferences, and scan a ton of news and analyses so you don’t have to, we have a more nuanced view to share on AI Chips in 2024. Here’s what’s going on and how it’s likely to affect AI going forward.

NVIDIA breaks MLPerf benchmark records

Let’s start with the news. Yesterday, NVIDIA announced its results from the latest MLPerf submissions. MLPerf is the de facto standard in AI workload benchmarks, and as more AI workloads emerge, MLPerf keeps adding to its suite. With Generative AI taking off over the last year, MLPerf has added Gen AI workloads to its arsenal.


Having previously added a benchmark that uses a portion of the full GPT-3 data set to train a Large Language Model (LLM), the latest addition to MLPerf is a training benchmark based on the Stable Diffusion text-to-image model. NVIDIA aced both of these, as well as a few more. Intel and Google also boast big AI training gains.


NVIDIA Eos — an AI supercomputer powered by a whopping 10,752 NVIDIA H100 Tensor Core GPUs and NVIDIA Quantum-2 InfiniBand networking — completed a training benchmark based on a GPT-3 model with 175 billion parameters trained on one billion tokens in just 3.9 minutes.


NVIDIA broke a few records in MLPerf 3.1 benchmarks


That’s a nearly 3x gain from 10.9 minutes, the record NVIDIA set when the test was introduced less than six months ago. By extrapolation, Eos could now train that LLM in just eight days, 73x faster than a prior state-of-the-art system using 512 A100 GPUs. As for the Stable Diffusion benchmark, it took 1,024 NVIDIA Hopper architecture GPUs 2.5 minutes to complete it.
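
A quick back-of-the-envelope check of that headline gain, using only the figures quoted above (the eight-day and 73x figures are NVIDIA’s own extrapolations, which we won’t attempt to reproduce here):

```python
# Sanity check of the GPT-3 training benchmark numbers quoted above.
previous_record_minutes = 10.9  # NVIDIA's record when the test was introduced
latest_record_minutes = 3.9     # the latest Eos result on the same benchmark

speedup = previous_record_minutes / latest_record_minutes
print(f"Round-over-round speedup: {speedup:.1f}x")  # ~2.8x, i.e. "nearly 3x"
```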


But that’s not all. As NVIDIA notes, the company was the only one to run all MLPerf tests, demonstrating the fastest performance and the greatest scaling in each of the nine benchmarks. In MLPerf HPC, a separate benchmark for AI-assisted simulations on supercomputers, H100 GPUs delivered up to twice the performance of NVIDIA A100 Tensor Core GPUs in the last HPC round.

Options for training AI models

Now, let’s start unpacking these results. The first thing to note is the various dimensions of scale. When Eos was first announced, it featured 4,608 H100s. Today, it features 10,752. But NVIDIA is not the only one to leverage Eos-level scale and performance.


As the company notes, a full-stack platform of innovations in accelerators, systems and software was used by both Eos and Microsoft Azure in the latest round. Azure did not submit in all categories, but in the GPT-3 benchmark where both submitted, results were practically identical. And Azure’s instance is commercially available too.


What’s more, the scaling efficiency for Eos was north of 80%. Ideally, doubling the number of GPUs would yield twice the performance. Getting 80% of that, at this scale, is quite a feat. NVIDIA attributed this to its stack – the combination of hardware, software, and networking.
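
To put “north of 80%” in context: scaling efficiency is typically computed as the measured speedup divided by the ideal linear speedup you would get from adding GPUs. A minimal sketch, with made-up throughput numbers purely for illustration rather than NVIDIA’s published figures:

```python
def scaling_efficiency(base_gpus: int, base_throughput: float,
                       scaled_gpus: int, scaled_throughput: float) -> float:
    """Measured speedup divided by the ideal (linear) speedup."""
    ideal_speedup = scaled_gpus / base_gpus
    measured_speedup = scaled_throughput / base_throughput
    return measured_speedup / ideal_speedup

# Hypothetical example: doubling GPU count lifts throughput by 1.7x instead
# of the ideal 2x, which works out to 85% scaling efficiency.
print(scaling_efficiency(5_376, 1.0, 10_752, 1.7))  # 0.85
```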


One takeaway here is that “Jensen’s Law”, the moniker used to describe the performance and scale-up that NVIDIA GPUs achieve, still seems to be in effect. But perhaps the real question is who should care, and why.


That kind of scale is not something anyone but the hyperscalers could normally handle, even if they wanted to. NVIDIA H100 GPUs are in short supply despite costing around $30K each. As the State of AI in 2023 report notes, organizations are in a stockpiling race. But there is good news as well.


Organizations are in a stockpiling race for NVIDIA GPUs


First off, NVIDIA chips have remarkably long lifetime value: 5 years from launch to peak popularity. NVIDIA V100, released in 2017, is still the most commonly used chip in AI research. This suggests A100s, released in 2020, could peak in 2026 when the V100 is likely to hit its trough.


Plus, it’s questionable whether training a new Gen AI model from scratch is something most organizations will need to do. The majority of organizations will probably either only use pre-trained Gen AI models packaged under the hood to power applications, or choose to use something like ChatGPT over an API. Both of these options require exactly zero GPUs.


The flip side, of course, is that both of these options also provide zero autonomy and safety. But even for organizations that choose to develop in-house Gen AI, training something from scratch is probably not what makes the most sense for most. Taking an off-the-shelf open source Gen AI model and customizing it via fine-tuning or RAG (Retrieval Augmented Generation) is way faster and easier, and only requires a fraction of the compute.
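
To make the compute gap concrete, here is a minimal sketch of the retrieval half of RAG. It assumes the open source sentence-transformers library and an off-the-shelf embedding model; the toy documents and the final generation step (a prompt sent to whatever pre-trained LLM an organization already uses, hosted or local) are placeholders. Crucially, nothing here trains or updates model weights.

```python
from sentence_transformers import SentenceTransformer, util

# Off-the-shelf embedding model; no GPU cluster or training run required.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Toy knowledge base standing in for an organization's own documents.
documents = [
    "Our return policy allows refunds within 30 days of purchase.",
    "Support is available weekdays from 9am to 5pm CET.",
    "Enterprise plans include a dedicated account manager.",
]
doc_embeddings = embedder.encode(documents, convert_to_tensor=True)

def retrieve(question: str, top_k: int = 2) -> list[str]:
    """Return the documents most similar to the question (the 'R' in RAG)."""
    query_embedding = embedder.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, doc_embeddings, top_k=top_k)[0]
    return [documents[hit["corpus_id"]] for hit in hits]

question = "How long do customers have to ask for a refund?"
context = "\n".join(retrieve(question))

# The retrieved context is prepended to the prompt and sent to any pre-trained
# LLM (hosted API or local open source model); no model weights are touched.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)
```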

How NVIDIA competitors may catch up

Either way, the long view here is that scaling up the way NVIDIA does makes more powerful AI models possible in a shorter time. We can expect results to trickle down, whether that means more powerful GPT-like models, open source models, or derivative applications.


But there’s another set of questions to consider here. Is NVIDIA’s dominance a good thing for the industry? Can, and should, it last? What is the competition up to? And why should the rest of the world care?


As I and others have been noting, NVIDIA’s dominance is based not just on its hardware, but on the entirety of its stack. Furthermore, as noted by analyst Dylan Patel, NVIDIA also leverages a set of business tactics with regard to supply chain management, sales strategies and bundling which few others are able to replicate. But that does not mean the competition is idling either.


As far as supercomputers and scaling up go, NVIDIA’s Eos is definitely not the only game in town. As Sparks mentioned, Intel’s Aurora featuring 60,000 of its own Ponte Vecchio GPUs is about to go online. Plus there are many other supercomputers in the world featuring a range of chips and architectures from different makers, and they are all capable of doing high-performance floating point arithmetic.


Intel's Gaudi 2 AI Chip Gains 2x Performance Leap on MLPerf 3.1 GPT-3 with FP8 Software


NVIDIA has an edge due to the fact that it was the first to focus on AI workloads, but each of its aspiring competitors has a roadmap to catch up. Until recently, CUDA, NVIDIA’s software layer, was widely considered the company’s biggest moat.


As Patel notes, many machine learning frameworks have come and gone, but most have relied heavily on leveraging NVIDIA’s CUDA and performed best on NVIDIA GPUs. However, with the arrival of PyTorch 2.0 and OpenAI’s Triton, NVIDIA’s dominant position in this field, mainly due to its software moat, is being disrupted. These frameworks make it easier for NVIDIA’s competition to build their own stack.
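
A concrete glimpse of what that shift looks like: with PyTorch 2.x, the model code contains nothing CUDA-specific, and torch.compile hands kernel generation to a compiler backend (Triton on supported GPUs) instead of relying solely on hand-tuned vendor libraries. A minimal sketch, assuming a PyTorch 2.x install:

```python
import torch

# Plain PyTorch model definition; nothing here is tied to a specific vendor.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 10),
)

# torch.compile (introduced in PyTorch 2.0) traces the model and delegates
# code generation to a backend such as TorchInductor, which emits Triton
# kernels on supported GPUs and C++ kernels on CPU.
compiled_model = torch.compile(model)

x = torch.randn(32, 512)
print(compiled_model(x).shape)  # torch.Size([32, 10])
```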


Of course, as Patel adds in a different note outlining NVIDIA’s own plan to stay ahead of the pack, NVIDIA isn’t sitting on its hands. While NVIDIA is extremely successful, it is also one of the most paranoid firms in the industry, with CEO Jensen Huang embodying the spirit of Andy Grove. It’s no accident that NVIDIA highlighted that its team currently employs twice as many software engineers as hardware engineers.

Success breeds complacency. Complacency breeds failure. Only the paranoid survive.

Andy Grove

Competition, scale, performance, and TCO

Patel goes as far as to question some of NVIDIA’s tactics, which is something we don’t have an opinion on. What we can say is that even though NVIDIA’s relentlessness keeps it from growing complacent, having any single vendor own over 80% market share for very long is not healthy. It will probably be a good thing for everyone to see the competition catch up.


At this point, hyperscalers, incumbent competitors such as AMD and Intel, as well as a flock of upstarts, are all working on their own custom AI chips for 2024 and beyond. It’s estimated that NVIDIA has a 1,000% margin on H100s, which are also in short supply. No wonder everyone wants a piece of the action and/or to grow their autonomy. For consumers, more competition will mean more choice and autonomy, as well as better performance and prices.


For the time being, however, NVIDIA is still the undisputed leader – albeit with a footnote or two. When asked to directly compare NVIDIA’s MLPerf results with Intel’s Gaudi, for example, Dave Salvator, director of product marketing in NVIDIA’s Accelerated Computing Group, pointed out two things. First, Gaudi submissions were nowhere near the 10K-GPU scale. Second, NVIDIA’s results were about 2x better when compared on a normalized basis. Others, such as analyst Karl Freund, however, consider Gaudi2 a credible alternative.


GPU Vendor Market Share Over Time (via JPR)


Footnote #1: MLPerf is a widely acclaimed benchmark in the industry. Like all benchmarks, however, it’s not perfect. As Sparks noted, one crucial element missing from MLPerf is pricing. While it’s understandable that incorporating pricing in any benchmark is tricky for a number of reasons, it also means that results need to be put in context. For example, as per Patrick Kennedy’s analysis, Intel’s Gaudi2 has 4x better performance per dollar than NVIDIA’s H100.


Footnote #2: Performance alone is rarely the only metric that matters to prospective buyers. More often than not, what matters most is the performance to cost ratio: how much does it cost to perform a certain operation within a certain timeframe. To arrive at that metric, the total cost of ownership (TCO) for AI chips should be factored in. That is a complex exercise that requires deep expertise.


A big part of the TCO for AI chips is inference, i.e. the use of trained AI models in production. Training an AI model is typically a costly and complex endeavor. Inference may be simpler in comparison, but it typically constitutes the bulk of a model’s lifetime and operational cost.
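
As a toy illustration of why inference economics can reshuffle the ranking, consider the cost of serving a fixed amount of work. All numbers below are placeholders rather than vendor pricing; the point is that a slower but cheaper accelerator can come out ahead on cost per query once inference dominates a model’s lifetime.

```python
def cost_per_million_queries(price_per_hour: float, queries_per_second: float) -> float:
    """Cost of serving one million inference queries on a single accelerator."""
    seconds_needed = 1_000_000 / queries_per_second
    return (seconds_needed / 3600) * price_per_hour

# Placeholder figures for two hypothetical accelerators (not vendor numbers):
fast_but_pricey = cost_per_million_queries(price_per_hour=4.00, queries_per_second=200)
slower_but_cheap = cost_per_million_queries(price_per_hour=1.50, queries_per_second=120)

print(f"Fast but pricey chip:    ${fast_but_pricey:.2f} per million queries")   # ~$5.56
print(f"Slower but cheaper chip: ${slower_but_cheap:.2f} per million queries")  # ~$3.47
```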


Training and inference workloads have different characteristics. This means that a system that does well at training does not necessarily do equally well at inference. Case in point – when Salvator was asked to comment on Eos performance on inference, he referred attendees to future briefings. Meanwhile, people are building new systems focused on inference, while others are trying to make the most of existing ones.

Conclusion

NVIDIA just demonstrated that its leadership shows no signs of waning in the immediate future. However, that’s not necessarily a good thing for the rest of the world. The competition is there, and the chance to catch up is there too, distant as it may seem at this point. AI chips in 2024 will be something to keep an eye on. In any case, how benchmark highlights translate to actual impact, usability and TCO for organizations aspiring to develop and use AI is not linear.

Join the Orchestrate all the Things Newsletter


Stories about how Technology, Data, AI and Media flow into each other shaping our lives. Analysis, Essays, Interviews and News. Mid-to-long form, 1-3 times per month.


Also published here.