Building an H.264 Decoder with Nvidia CUDA

Written by hacker84199059 | Published 2025/09/23
Tech Story Tags: cuda | gpu-video-encoding | nvidia | c++ | nvidia-cuda | video-codec-development | parallel-video-processing | fpga-entropy-decoding

TL;DR: More than a decade after first experimenting with H.264 encoding, the author revisits the challenge of building a performant video decoder. After weighing options like third-party chips and Intel’s Media SDK, Nvidia CUDA emerged as the most flexible solution. A year of development produced a working H.264 decoder, but entropy decoding revealed CUDA’s limits, pointing to FPGA and even full DVB-S/DVB-S2 decoding with SDR as the next frontier.

More than 10 years have passed since I wrote my last post about developing an H.264 video encoder. Technology has advanced significantly since then, and I haven't stood still either. The main problem with my encoder was performance: it could only encode video at 352x288 pixels, while the industry had long since moved on to HD, 4K, and beyond. Modern processors and SIMD instructions in C++ improved the situation, but it clearly wasn't enough, especially for processing multiple video streams in parallel. On top of that, I needed not only an encoder but also a decoder. So I started looking at other solutions:


1. Third-party microchips.

2. Intel Media SDK.

3. Nvidia CUDA.


I faced this choice about 10 years ago, standing at a crossroads: which path to take? Let's break it down point by point.


1. Third-Party Microchips

Pros: The chips themselves are cheap.


Cons: Lots of hidden pitfalls. It wasn’t clear how to integrate them with a PC, full documentation was required, and I would have had to design a custom PCIe expansion board. Datasheets were another issue: they used to be publicly available, but nowadays they’re often provided only on request, with strict requirements to explain the project, the use case, and so on. For a hobby project, this was simply too expensive and complicated. This option was rejected.

2. Intel Media SDK

Pros: Everything is inside the CPU, and all you need to do is learn the SDK.


Cons: Not all CPUs were supported, and those that were cost quite a lot at the time. My workstation had no Intel Media SDK support. Another limitation: it’s just an SDK, so the choice of algorithms is restricted to what Intel provides. It wasn’t even clear whether live streaming was supported; that question would have needed separate study. Overall, not a bad option, but I moved on to the next one.

3. Nvidia CUDA

Pros: GPUs are relatively cheap and can be used with an existing workstation. CUDA is essentially C++, so encoding and decoding algorithms can be implemented however you like. From there, you have two paths: use Nvidia’s Video Codec SDK or write everything directly in CUDA C++. At the time, my main task was decoding live H.264 video streams from satellite feeds, and the Video Codec SDK had a major drawback: it wasn’t designed for live streams, only for files. That left pure CUDA as the only option.


Cons: It requires a lot of mental effort to implement, and it wasn’t clear whether I’d even achieve the necessary performance. But you never know until you try (spoiler: it worked). In the end, the pros outweighed the cons, and I chose Nvidia CUDA.


So, I got to work. After about a year of focused development, I had a working prototype of an H.264 decoder. It took another year to polish it for practical use, improve performance for multi-stream processing, and fix bugs.

4. Why an H.264 Decoder Can't Be Built with CUDA Alone

The H.264 decoding process has two stages. The first is CABAC entropy decoding; the second reconstructs the picture at the macroblock level. Macroblock decoding is highly parallelizable, but entropy decoding is inherently sequential: each step depends on the result of the previous one, which makes it completely unsuitable for parallelization.
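To make the contrast concrete, below is a minimal CUDA sketch of the parallel half of the problem: the residual-plus-prediction reconstruction step. The kernel and buffer names are hypothetical rather than taken from my decoder, but the shape is the point: each macroblock, and each pixel inside it, gets its own thread, with no dependency on its neighbours at this step.

```cpp
#include <cstdint>

// Hypothetical sketch (not the actual decoder): reconstruct samples as
// clip(prediction + residual). One CUDA block handles one 16x16 macroblock,
// one thread handles one pixel. At this stage no macroblock depends on
// another, so an entire frame's worth of macroblocks can run concurrently.
__global__ void reconstruct_macroblocks(const uint8_t* pred,
                                        const int16_t* residual,
                                        uint8_t* out,
                                        int stride)
{
    // blockIdx selects the macroblock, threadIdx selects the pixel inside it.
    int x = blockIdx.x * 16 + threadIdx.x;
    int y = blockIdx.y * 16 + threadIdx.y;
    int i = y * stride + x;

    int v = pred[i] + residual[i];
    out[i] = static_cast<uint8_t>(min(max(v, 0), 255));
}

// Launch one block per macroblock, 16x16 threads per block:
//   dim3 grid(width / 16, height / 16), block(16, 16);
//   reconstruct_macroblocks<<<grid, block>>>(d_pred, d_residual, d_out, width);
```

This kind of work is exactly what a GPU is built for, which is why the macroblock stage maps onto CUDA so well.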


I tried implementing CABAC decoding in CUDA, but performance was terrible. With the Nvidia Video Codec SDK, entropy decoding is handled by a dedicated hardware block on the GPU, but that block can’t be accessed independently, and it isn’t usable from plain CUDA code at all. The only realistic option for the entropy part of the decoder would be to move it onto an FPGA.
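For contrast, here is roughly what the inner loop of a CABAC decoder looks like. This is a simplified sketch, not my actual implementation: the table names mirror the spec’s probability tables (declared extern to keep it short), while Bitstream and the surrounding structs are placeholders. The point is the dependency chain: every bin rewrites the shared range, offset, and context state, so bin N+1 cannot start until bin N has finished.

```cpp
#include <cstdint>

// Probability and state-transition tables as defined in the H.264 spec;
// declared extern here to keep the sketch short.
extern const uint8_t rangeTabLPS[64][4];
extern const uint8_t transIdxLPS[64];
extern const uint8_t transIdxMPS[64];

struct Bitstream { uint32_t read_bit(); /* hypothetical bit reader */ };

struct CabacState { uint32_t range; uint32_t offset; };  // codIRange / codIOffset
struct Context    { uint8_t stateIdx; uint8_t valMPS; }; // per-syntax-element model

// Decode one regular (context-coded) bin. The arithmetic-coder state and the
// context model are read and rewritten on every single call.
int decode_bin(CabacState& s, Context& ctx, Bitstream& bs)
{
    uint32_t rangeLPS = rangeTabLPS[ctx.stateIdx][(s.range >> 6) & 3];
    s.range -= rangeLPS;

    int bin;
    if (s.offset >= s.range) {                 // least probable symbol path
        bin = 1 - ctx.valMPS;
        s.offset -= s.range;
        s.range = rangeLPS;
        if (ctx.stateIdx == 0) ctx.valMPS = 1 - ctx.valMPS;
        ctx.stateIdx = transIdxLPS[ctx.stateIdx];
    } else {                                   // most probable symbol path
        bin = ctx.valMPS;
        ctx.stateIdx = transIdxMPS[ctx.stateIdx];
    }

    while (s.range < 256) {                    // renormalize, pulling in new bits
        s.range <<= 1;
        s.offset = (s.offset << 1) | bs.read_bit();
    }
    return bin;
}
```

A loop like this occupies a single thread while the rest of the GPU sits idle, which is why my CUDA attempt performed so poorly.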


So, if an FPGA is unavoidable, why not go deeper into digital signal processing and build a full DVB-S/DVB-S2 decoder on top of SDR (Software Defined Radio)? That’s exactly where I’m headed: stay tuned for my next post on building a DVB-S decoder with SDR.


Written by hacker84199059 | Brings deep expertise in hardware design, parallel computing, and video solutions.
Published by HackerNoon on 2025/09/23