Building an H.264 Decoder with Nvidia CUDA

Written by hacker84199059 | Published 2025/09/23
Tech Story Tags: cuda | gpu-video-encoding | nvidia | c++ | nvidia-cuda | video-codec-development | parallel-video-processing | fpga-entropy-decoding

TL;DR: More than a decade after first experimenting with H.264 encoding, the author revisits the challenge of building a performant video decoder. After weighing options like third-party chips and Intel’s Media SDK, Nvidia CUDA emerged as the most flexible solution. A year of development produced a working H.264 decoder, but entropy decoding revealed CUDA’s limits, pointing to FPGA and even full DVB-S/DVB-S2 decoding with SDR as the next frontier.

More than 10 years have passed since I wrote my last post about developing an H.264 video encoder. Technology has advanced significantly since then, and I haven't stood still either. The main problem with my encoder was performance: it could only encode video at 352x288 pixels, while the industry had long since moved on to HD, 4K, and beyond. Modern processors and SIMD instructions in C++ improved the situation, but it clearly wasn't enough, especially for processing multiple video streams in parallel. On top of that, I needed not only an encoder but also a decoder. So I started looking at other solutions:


1. Third-party microchips.

2. Intel Media SDK.

3. Nvidia CUDA.


I faced this choice about 10 years ago, standing at a crossroads: which path to take? Let's break it down point by point.


1. Third-Party Microchips

Pros: The chips themselves are cheap.


Cons: Lots of hidden pitfalls. It wasn’t clear how to integrate them with a PC, full documentation was required, and I would have had to design a custom PCIe expansion board. Datasheets were another issue: they used to be publicly available, but nowadays they’re often provided only on request, with strict requirements to explain the project, the use case, and so on. For a hobby project, this was simply too expensive and complicated. This option was rejected.

2. Intel Media SDK

Pros: Everything is inside the CPU, and all you need to do is learn the SDK.


Cons: Not all CPUs were supported, and those that were cost quite a lot at the time. My workstation had no Intel Media SDK support. Another limitation: it’s just an SDK, so the choice of algorithms is restricted to what Intel provides. It wasn’t even clear whether live streaming was supported; that question would have needed separate study. Overall, not a bad option, but I moved on to the next one.

3. Nvidia CUDA

Pros: GPUs are relatively cheap and can be used with an existing workstation. CUDA is essentially C++, so encoding and decoding algorithms can be implemented however you like. From there, you have two paths: use Nvidia’s Video Codec SDK or write everything directly in CUDA C++. At the time, my main task was decoding live H.264 video streams from satellite feeds, and the Video Codec SDK had a major drawback: it wasn’t designed for live streams, only for files. That left pure CUDA as the only option.


Cons: It requires a lot of mental effort to implement, and it wasn’t clear whether I’d even achieve the necessary performance. But you never know until you try (spoiler: it worked). In the end, the pros outweighed the cons, and I chose Nvidia CUDA.


So, I got to work. After about a year of focused development, I had a working prototype of an H.264 decoder. It took another year to polish it for practical use, improve performance for multi-stream processing, and fix bugs.

4. Why an H.264 Decoder Can't Be Built with CUDA Alone

The H.264 decoding process has two stages. The first is CABAC entropy decoding; the second reconstructs the picture at the macroblock level. Macroblock decoding is highly parallelizable, but entropy decoding is inherently sequential: each step depends on the result of the previous one, which makes it completely unsuitable for parallelization.
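To make the contrast concrete, below is a minimal CUDA sketch of the parallel half of the problem: the residual-plus-prediction reconstruction step. The kernel and buffer names are hypothetical rather than taken from my decoder, but the shape is the point: each macroblock, and each pixel inside it, gets its own thread, with no dependency on its neighbours at this step.

```cpp
#include <cstdint>

// Hypothetical sketch (not the actual decoder): reconstruct samples as
// clip(prediction + residual). One CUDA block handles one 16x16 macroblock,
// one thread handles one pixel. At this stage no macroblock depends on
// another, so an entire frame's worth of macroblocks can run concurrently.
__global__ void reconstruct_macroblocks(const uint8_t* pred,
                                        const int16_t* residual,
                                        uint8_t* out,
                                        int stride)
{
    // blockIdx selects the macroblock, threadIdx selects the pixel inside it.
    int x = blockIdx.x * 16 + threadIdx.x;
    int y = blockIdx.y * 16 + threadIdx.y;
    int i = y * stride + x;

    int v = pred[i] + residual[i];
    out[i] = static_cast<uint8_t>(min(max(v, 0), 255));
}

// Launch one block per macroblock, 16x16 threads per block:
//   dim3 grid(width / 16, height / 16), block(16, 16);
//   reconstruct_macroblocks<<<grid, block>>>(d_pred, d_residual, d_out, width);
```

This kind of work is exactly what a GPU is built for, which is why the macroblock stage maps onto CUDA so well.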


I tried implementing CABAC decoding in CUDA, but performance was terrible. With the Nvidia Video Codec SDK, entropy decoding is handled by a dedicated hardware block on the GPU, but that block can’t be accessed independently, and it isn’t usable from plain CUDA code at all. The only realistic option for the entropy part of the decoder would be to move it onto an FPGA.
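For contrast, here is roughly what the inner loop of a CABAC decoder looks like. This is a simplified sketch, not my actual implementation: the table names mirror the spec’s probability tables (declared extern to keep it short), while Bitstream and the surrounding structs are placeholders. The point is the dependency chain: every bin rewrites the shared range, offset, and context state, so bin N+1 cannot start until bin N has finished.

```cpp
#include <cstdint>

// Probability and state-transition tables as defined in the H.264 spec;
// declared extern here to keep the sketch short.
extern const uint8_t rangeTabLPS[64][4];
extern const uint8_t transIdxLPS[64];
extern const uint8_t transIdxMPS[64];

struct Bitstream { uint32_t read_bit(); /* hypothetical bit reader */ };

struct CabacState { uint32_t range; uint32_t offset; };  // codIRange / codIOffset
struct Context    { uint8_t stateIdx; uint8_t valMPS; }; // per-syntax-element model

// Decode one regular (context-coded) bin. The arithmetic-coder state and the
// context model are read and rewritten on every single call.
int decode_bin(CabacState& s, Context& ctx, Bitstream& bs)
{
    uint32_t rangeLPS = rangeTabLPS[ctx.stateIdx][(s.range >> 6) & 3];
    s.range -= rangeLPS;

    int bin;
    if (s.offset >= s.range) {                 // least probable symbol path
        bin = 1 - ctx.valMPS;
        s.offset -= s.range;
        s.range = rangeLPS;
        if (ctx.stateIdx == 0) ctx.valMPS = 1 - ctx.valMPS;
        ctx.stateIdx = transIdxLPS[ctx.stateIdx];
    } else {                                   // most probable symbol path
        bin = ctx.valMPS;
        ctx.stateIdx = transIdxMPS[ctx.stateIdx];
    }

    while (s.range < 256) {                    // renormalize, pulling in new bits
        s.range <<= 1;
        s.offset = (s.offset << 1) | bs.read_bit();
    }
    return bin;
}
```

A loop like this occupies a single thread while the rest of the GPU sits idle, which is why my CUDA attempt performed so poorly.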


So, if an FPGA is unavoidable, why not go deeper into digital signal processing and build a full DVB-S/DVB-S2 decoder on top of SDR (Software Defined Radio)? That’s exactly where I’m headed: stay tuned for my next post on building a DVB-S decoder with SDR.


Written by hacker84199059 | Brings deep expertise in hardware design, parallel computing, and video solutions.
Published by HackerNoon on 2025/09/23