In this article, we'll go through some fundamental low-level details to understand why GPUs are good at graphics, neural network, and deep learning tasks, while CPUs are good at a wide range of sequential, complex, general-purpose computing tasks. I had to research several topics to get a more granular understanding for this post, and some of them I will only mention in passing. That is deliberate: the focus is on the absolute basics of CPU & GPU processing.
Early computers were dedicated devices. Hardware circuits and logic gates were wired to do a specific set of things. If something new had to be done, the circuits needed to be rewired. "Something new" could be as simple as doing mathematical calculations for two different equations. Alan Turing had described a general-purpose programmable machine in his 1936 "Turing Machine" paper, and during WWII he worked on machines to break the Enigma cipher. Around the same time, John von Neumann and other researchers were also working on an idea which fundamentally proposed:
We know that everything in our computer is binary. Strings, images, video, audio, the OS, application programs, etc. are all represented as 1s & 0s. CPU architecture (RISC, CISC, etc.) specifications define instruction sets (x86, x86-64, ARM, etc.), which CPU manufacturers must comply with and which the OS uses to interface with the hardware.
The OS and application programs, including their data, are translated into instructions from the instruction set and binary data for processing in the CPU. At the chip level, processing is done by transistors and logic gates. If you execute a program to add two numbers, the addition (the "processing") is done by logic gates in the processor.
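To make "addition is done by logic gates" concrete, here is a minimal sketch of a ripple-carry adder written with nothing but bitwise AND, OR, and XOR. The function names and the 8-bit default width are assumptions chosen for illustration; real ALUs use faster adder designs, so treat this as a model of the idea rather than of any specific chip.

```python
def full_adder(a, b, carry_in):
    """One-bit full adder built only from XOR, AND, and OR gates."""
    partial = a ^ b                              # XOR gate
    sum_bit = partial ^ carry_in                 # XOR gate
    carry_out = (a & b) | (partial & carry_in)   # two AND gates feeding an OR gate
    return sum_bit, carry_out


def ripple_carry_add(x, y, width=8):
    """Add two unsigned integers bit by bit, the way a chain of full adders would."""
    result = 0
    carry = 0
    for i in range(width):
        bit_x = (x >> i) & 1
        bit_y = (y >> i) & 1
        sum_bit, carry = full_adder(bit_x, bit_y, carry)
        result |= sum_bit << i
    return result  # the final carry is dropped, so the sum wraps at 2**width


total = ripple_carry_add(5, 3)
```

Each loop iteration stands in for one physical full-adder circuit; in hardware all of them exist side by side and the carry "ripples" from the lowest bit to the highest.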
In a CPU that follows the Von Neumann architecture, when we add two numbers, a single add instruction runs on two numbers in the circuit. For that tiny fraction of a second, only the add instruction is being executed in the (execution) core of the processing unit! This detail has always fascinated me.
The components in the above diagram are self-evident. For a more detailed explanation, refer to this excellent article. In modern CPUs, a single physical core can contain more than one integer ALU, floating-point ALU, etc. Again, these units are physical logic gates.
We need to understand the 'hardware thread' in a CPU core to better appreciate GPUs. A hardware thread is a unit of computing that can be done in the execution units of a CPU core in every single CPU clock cycle. It represents the smallest unit of work that can be executed in a core.
The above diagram illustrates the CPU instruction cycle (machine cycle). It is the series of steps that the CPU performs to execute a single instruction (e.g., c=a+b).
Fetch: The program counter (a special register in the CPU core) keeps track of which instruction must be fetched. The instruction is fetched and stored in the instruction register. For simple operations, the corresponding data is also fetched.
Decode: The instruction is decoded to identify its operators and operands.
Execute: Based on the operation specified, the appropriate processing unit is chosen and the operation is executed.
Memory Access: If an instruction is complex or additional data is needed (several factors can cause this), memory access is done before execution. (Ignored in the above diagram for simplicity.) For a complex instruction, the initial data will be available in the data register of the compute unit, but complete execution of the instruction requires data access from the L1 & L2 caches. This means there could be a small wait before the compute unit executes, and the hardware thread still holds the compute unit during that wait.
Write Back: If execution produces output (e.g., c=a+b), the output is written back to a register, cache, or memory. (Ignored in the above diagram and everywhere later in the post for simplicity.)
In the above diagram, only at t2 is compute being done. The rest of the time, the core is just idle (we are not getting any work done).
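The unpipelined cycle can be sketched as a toy interpreter. This is only an illustration under stated assumptions: the tuple instruction format, the `run_unpipelined` name, and the "one clock cycle per stage" timing are all made up for this example and do not describe a real core.

```python
def run_unpipelined(program, registers):
    """Run each instruction through fetch, decode, execute, one stage per cycle."""
    timeline = []   # what the core did in each clock cycle
    pc = 0          # program counter: index of the next instruction
    while pc < len(program):
        instruction = program[pc]             # fetch
        timeline.append("fetch")
        op, dest, src1, src2 = instruction    # decode into operator and operands
        timeline.append("decode")
        if op == "add":                       # execute on the matching unit
            registers[dest] = registers[src1] + registers[src2]
        elif op == "mul":
            registers[dest] = registers[src1] * registers[src2]
        else:
            raise ValueError(f"unknown op: {op}")
        timeline.append("execute")
        pc += 1
    return timeline


registers = {"a": 2, "b": 3, "c": 0, "d": 0}
program = [("add", "c", "a", "b"), ("mul", "d", "c", "c")]
timeline = run_unpipelined(program, registers)
busy_fraction = timeline.count("execute") / len(timeline)
```

In this run only 2 of the 6 cycles perform arithmetic, so the execution unit is busy a third of the time, which is the idle time the text describes.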
Modern CPUs have hardware components that let the fetch, decode, and execute steps of different instructions occur concurrently in the same clock cycle.
A single hardware thread can now do computation in every clock cycle. This is called instruction pipelining.
Fetch, Decode, Memory Access, and Write Back are done by other components in a CPU. For lack of a better word, these are called "pipeline threads". The pipeline thread becomes a hardware thread when it is in the execute stage of an instruction cycle.
As you can see, we get compute output every cycle from t2 onward. Previously, we got compute output once every 3 cycles. Pipelining improves compute throughput. This is one of the techniques to manage processing bottlenecks in the Von Neumann architecture. There are also other optimizations like out-of-order execution, branch prediction, speculative execution, etc.
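The throughput gain from pipelining can be checked with a small schedule generator. This is a sketch that assumes an ideal three-stage pipeline with no stalls or hazards; `pipeline_schedule` and its output format are hypothetical names used only for this illustration.

```python
def pipeline_schedule(n_instructions, stages=("fetch", "decode", "execute")):
    """Return, for each clock cycle, which instruction index occupies each stage."""
    if n_instructions == 0:
        return []
    total_cycles = len(stages) + n_instructions - 1
    schedule = []
    for cycle in range(total_cycles):
        row = {}
        for s, stage in enumerate(stages):
            instr = cycle - s   # instruction i reaches stage s at cycle i + s
            if 0 <= instr < n_instructions:
                row[stage] = instr
        schedule.append(row)
    return schedule


schedule = pipeline_schedule(4)
executed = [row["execute"] for row in schedule if "execute" in row]
pipelined_cycles = len(schedule)   # 3 + (4 - 1)
unpipelined_cycles = 4 * 3         # every instruction takes all 3 stages alone
```

Under these assumptions four instructions finish in 6 cycles instead of 12, and once the pipeline is full an instruction completes its execute stage every cycle.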
This is the last CPU concept I want to discuss before we conclude and move on to GPUs. As clock speeds increased, processors got faster and more efficient. As application (instruction set) complexity grew, CPU compute cores became underutilized and spent more time waiting on memory access.
So, we see a memory bottleneck. The compute unit spends its time waiting on memory access instead of doing useful work. Memory is several orders of magnitude slower than the CPU, and the gap is not going to close anytime soon. The idea was to increase memory bandwidth in some units of a single CPU core and keep data ready, so that the compute units can be used while another instruction stream is awaiting memory access.
Hyper-threading was made available in 2002 by Intel in Xeon & Pentium 4 processors. Prior to hyper-threading, there was only one hardware thread per core. With hyper-threading, there are 2 hardware threads per core. What does that mean? Duplicated circuitry for some registers, the program counter, fetch unit, decode unit, etc.
The above diagram shows only the new circuit elements in a CPU core with hyper-threading. This is how a single physical core is visible as 2 cores to the operating system. If you have a 4-core processor with hyper-threading enabled, the OS sees it as 8 cores. L1 - L3 cache size will increase to accommodate additional registers. Note that the execution units are shared.
Assume we have processes P1 and P2 doing a=b+c and d=e+f. These can be executed concurrently in a single clock cycle because of HW threads 1 and 2. With a single HW thread, as we saw earlier, this would not be possible. Here we are increasing memory bandwidth within a core by adding a hardware thread so that the processing unit can be utilized efficiently. This improves compute concurrency.
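The benefit of a second hardware thread can be illustrated with a toy cycle-level simulation. It assumes a deliberately simplified core: one shared execution unit that issues at most one operation per cycle, and each operation preceded by a fixed memory wait. The `simulate` function and its input format are hypothetical and exist only for this sketch.

```python
def simulate(threads):
    """Simulate hardware threads sharing one execution unit.

    threads: one list per hardware thread; each entry is the memory wait
    (in cycles) before one compute operation. Returns (total_cycles, busy_cycles).
    """
    remaining = [list(t) for t in threads]
    wait = [t[0] if t else 0 for t in remaining]   # cycles until next op is ready
    cycle = 0
    busy = 0
    while any(remaining):
        issued = False
        for i, ops in enumerate(remaining):
            if not ops:
                continue
            if wait[i] > 0:
                wait[i] -= 1                 # this thread's memory access is in flight
            elif not issued:
                ops.pop(0)                   # operands are ready: use the execution unit
                issued = True
                wait[i] = ops[0] if ops else 0
        if issued:
            busy += 1
        cycle += 1
    return cycle, busy


p1 = [2, 2]   # two operations, each waiting 2 cycles on memory first
p2 = [2, 2]
one_hw_thread = simulate([p1 + p2])    # P1 then P2 on a single hardware thread
two_hw_threads = simulate([p1, p2])    # P1 and P2 on separate hardware threads
```

With these inputs the same four operations take 12 cycles on one hardware thread and 7 cycles on two, because one thread's memory wait overlaps with the other's compute. The execution unit itself was not duplicated, only the state needed to keep a second instruction stream ready.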
Some interesting scenarios:
Check this article & also try the Colab notebook. It shows how matrix multiplication is a parallelizable task and how parallel compute cores can speed up the calculation.
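To see why matrix multiplication parallelizes so well, note that every output row depends on only one row of the left matrix and on the (read-only) right matrix. The sketch below makes that independence explicit by handing rows to separate workers. The function names are hypothetical, and Python threads are used only to show the structure: because of the interpreter's global lock this will not run faster than the serial version, unlike real parallel cores.

```python
from concurrent.futures import ThreadPoolExecutor


def row_times_matrix(row, b):
    """Compute one output row; it needs only one row of A plus all of B."""
    n_cols = len(b[0])
    return [sum(row[k] * b[k][j] for k in range(len(b))) for j in range(n_cols)]


def matmul_serial(a, b):
    return [row_times_matrix(row, b) for row in a]


def matmul_parallel(a, b, workers=4):
    # Rows are independent tasks, so they can be computed in any order or at once.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda row: row_times_matrix(row, b), a))


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
serial_result = matmul_serial(a, b)
parallel_result = matmul_parallel(a, b)
```

No worker ever writes to data another worker reads, so no locking is needed. That property is what lets thousands of GPU cores work on one multiplication at the same time.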
As computing power increased, so did the demand for graphics processing. Tasks like UI rendering and gaming require parallel operations, driving the need for numerous ALUs and FPUs at the circuit level. CPUs, designed for sequential tasks, couldn't handle these parallel workloads effectively. Thus, GPUs were developed to fulfill the demand for parallel processing in graphics tasks, later paving the way for their adoption in accelerating deep learning algorithms.
I would highly recommend:
Cores, hardware threads, clock speed, memory bandwidth, and on-chip memory of CPUs & GPUs differ significantly. Example:
This number is used for comparison with the GPU, since the peak performance of general-purpose computing is very subjective. It is a theoretical maximum, which assumes the FP64 circuits are being used to their fullest.
The terminology we saw for CPUs doesn't always translate directly to GPUs. Here we'll look at the components and cores of the NVIDIA A100 GPU. One thing that surprised me while researching this article was that CPU vendors don't publish how many ALUs, FPUs, etc. are available in the execution units of a core. NVIDIA is very transparent about the number of cores, and the CUDA framework gives complete flexibility & access at the circuit level.
In the above GPU diagram, we can see that there is no L3 cache, a smaller L2 cache, smaller but far more numerous control units & L1 caches, and a large number of processing units.
Here are the GPU components in the above diagrams and their CPU equivalent for our initial understanding. I haven't done CUDA programming, so comparing it with CPU equivalents helps with initial understanding. CUDA programmers understand this very well.
Graphics and deep learning tasks demand SIM(D/T) [single instruction, multiple data/threads] type execution, i.e., reading and working on large amounts of data with a single instruction.
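The SIMT idea can be sketched in plain Python: one small function (the "kernel") is the single instruction stream, and every thread runs it on a different element selected by its thread index. `saxpy_kernel` and `launch` are hypothetical stand-ins, not CUDA APIs, and the loop runs the threads one after another where a GPU would run them in parallel.

```python
def saxpy_kernel(thread_id, alpha, x, y, out):
    """The single instruction stream every thread runs; only thread_id differs."""
    if thread_id < len(x):   # guard for threads that fall beyond the data
        out[thread_id] = alpha * x[thread_id] + y[thread_id]


def launch(kernel, n_threads, *args):
    """Stand-in for a GPU launch; a real GPU would run these threads in parallel."""
    for thread_id in range(n_threads):
        kernel(thread_id, *args)


x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * len(x)
launch(saxpy_kernel, 8, 2.0, x, y, out)   # more threads than elements: extras do nothing
```

The program never says "loop over the array". It describes the work for one element, and the hardware supplies as many threads as there is data.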
We discussed instruction pipelining and hyper-threading in CPUs, and GPUs have these capabilities too. How they are implemented and how they work is slightly different, but the principles are the same.
Unlike CPUs, GPUs (via CUDA) provide direct access to pipeline threads (fetching data from memory and utilizing the memory bandwidth). GPU schedulers work by first trying to fill the compute units (including the associated shared L1 cache & registers for storing compute operands), then the "pipeline threads" which fetch data into registers and HBM. Again, I want to emphasize that CPU app programmers don't think about this, and specs about "pipeline threads" & the number of compute units per core are not published. NVIDIA not only publishes these but also provides complete control to programmers.
I will go into more detail about this in a dedicated post about the CUDA programming model & the "batching" optimization technique in model serving, where we can see how beneficial this is.
The above diagram depicts hardware thread execution in the CPU & GPU core. Refer to the "memory access" step we discussed earlier in CPU pipelining; this diagram shows it. The CPU's complex memory management keeps this wait small (a few clock cycles) when data is fetched from the L1 cache to registers. When data needs to be fetched from L3 or main memory, the other thread, whose data is already in registers (we saw this in the hyper-threading section), gets control of the execution units.
In GPUs, because of oversubscription (a high number of pipeline threads & registers) & a simple instruction set, a large amount of data is already available in registers pending execution. Because pipeline threads in GPUs are lightweight, those waiting for execution become hardware threads and execute as often as every clock cycle.
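A back-of-the-envelope model shows why oversubscription keeps the execution units busy. It assumes every thread repeats one cycle of compute followed by a fixed memory wait, and that the unit can run one ready thread per cycle. Both function names and the 99-cycle latency are illustrative assumptions, not measured GPU figures.

```python
def execution_unit_utilization(n_threads, wait_cycles):
    """Steady-state share of cycles in which some thread has operands ready.

    Assumed model: each thread alternates 1 compute cycle with wait_cycles
    of memory latency, and the unit serves one ready thread per cycle.
    """
    return min(1.0, n_threads / (1 + wait_cycles))


def threads_to_hide_latency(wait_cycles):
    """Smallest thread count that keeps the unit busy every cycle in this model."""
    return 1 + wait_cycles


ASSUMED_WAIT = 99   # hypothetical memory latency in cycles
two_threads = execution_unit_utilization(2, ASSUMED_WAIT)
many_threads = execution_unit_utilization(128, ASSUMED_WAIT)
needed = threads_to_hide_latency(ASSUMED_WAIT)
```

In this model two threads (the hyper-threading case) keep the unit busy 2% of the time when every access goes to slow memory, while around a hundred resident threads hide the latency completely. That is the trade GPUs make: many lightweight threads with their operands in registers instead of large caches.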
What's our goal?
This is the main reason why the latency of multiplying small matrices is more or less the same on CPU & GPU. Try it out.
Tasks need to be parallel enough, and data needs to be large enough, to saturate compute FLOPs & memory bandwidth. If a single task is not big enough, multiple such tasks need to be packed together to saturate memory and compute and fully utilize the hardware.
Compute Intensity = FLOPs / Bandwidth, i.e., the ratio of the amount of work the compute units can do per second to the amount of data memory can provide per second.
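As a worked example of this ratio, the sketch below converts bandwidth from bytes per second into FP64 values per second (8 bytes each) and divides peak FLOPs by it. All numbers are round, hypothetical figures picked so the arithmetic is easy to follow; they are not vendor specifications, and `compute_intensity` is a helper name invented for this example.

```python
BYTES_PER_FP64 = 8


def compute_intensity(peak_flops, bandwidth_bytes_per_s, bytes_per_value=BYTES_PER_FP64):
    """Operations the compute units can do in the time memory delivers one value."""
    values_per_second = bandwidth_bytes_per_s / bytes_per_value
    return peak_flops / values_per_second


PEAK_FLOPS = 20e12   # hypothetical: 20 TFLOPS of FP64 compute

# Hypothetical bandwidths in bytes per second, fastest memory first.
BANDWIDTH = {"L1": 16000e9, "L2": 4000e9, "HBM": 1600e9}

intensity = {level: compute_intensity(PEAK_FLOPS, bw) for level, bw in BANDWIDTH.items()}
```

With these made-up figures the intensity is 10 for L1, 40 for L2, and 100 for HBM: the slower the memory level, the more operations each fetched value has to feed before the compute units stop waiting.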
In the above diagram, we see that compute intensity increases as we go to higher-latency, lower-bandwidth memory. We want this number to be as small as possible so that compute is fully utilized. For that, we need to keep as much data as possible in L1 / registers so that compute can happen quickly. If we fetch a single piece of data from HBM, we need to do about 100 operations on it to make the fetch worthwhile, and only a few workloads do that. If we don't do 100 operations, the compute units sit idle. This is where the high number of threads and registers in GPUs comes into play: to keep as much data as possible in L1 / registers, keep the compute intensity low, and keep the parallel cores busy.
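The "100 operations per fetch" argument can be turned into a simple utilization estimate. This is a simplified, roofline-style model and an assumption of this sketch: it ignores caching, latency, and overlap effects, and the intensity of 100 is taken from the discussion above rather than measured.

```python
def compute_utilization(ops_per_value, compute_intensity):
    """Fraction of peak compute reachable when each fetched value feeds ops_per_value operations."""
    return min(1.0, ops_per_value / compute_intensity)


HBM_INTENSITY = 100   # operations needed per fetched value to keep compute busy

one_op = compute_utilization(1, HBM_INTENSITY)          # e.g. an elementwise add
fifty_ops = compute_utilization(50, HBM_INTENSITY)
hundred_ops = compute_utilization(100, HBM_INTENSITY)
many_ops = compute_utilization(400, HBM_INTENSITY)      # beyond this, compute is the limit
```

In this model an operation that touches each value once leaves the compute units 99% idle, which is why reusing data from registers and L1 matters more than raw FLOPs.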
There is a 4X difference in compute intensity between CUDA & Tensor cores because CUDA cores can do only one 1x1 FP64 MMA, whereas Tensor cores can do a 4x4 FP64 MMA instruction per clock cycle.
A high number of compute units (CUDA & Tensor cores), a high number of threads and registers (oversubscription), a reduced instruction set, no L3 cache, HBM (stacked DRAM), and a simple, high-throughput memory access pattern (compared to the CPU's context switching, multi-layer caching, memory paging, TLB, etc.) are the principles that make GPUs so much better than CPUs at parallel computing (graphics rendering, deep learning, etc.).
GPUs were first created for handling graphics processing tasks. AI researchers started taking advantage of CUDA and its direct access to powerful parallel processing via CUDA cores. NVIDIA GPUs have Texture Processing, Ray Tracing, Raster, Polymorph engines, etc. (let's say graphics-specific instruction sets). With the increasing adoption of AI, Tensor cores, which are good at 4x4 matrix calculation (the MMA instruction) and are dedicated to deep learning, are being added.
Since 2017, NVIDIA has been increasing the number of Tensor cores in each architecture. But these GPUs are also good at graphics processing. Although the instruction set is much smaller and simpler in GPUs, they are not fully dedicated to deep learning (especially the Transformer architecture).
FlashAttention 2, a software-layer optimization (mechanical sympathy for the attention layer's memory access pattern) for the transformer architecture, provides a 2X speedup in tasks.
With our in-depth, first-principles understanding of CPUs & GPUs, we can understand the need for Transformer Accelerators: a dedicated chip (with circuits only for transformer operations), with an even larger number of compute units for parallelism, a reduced instruction set, no L1/L2 caches, massive on-chip SRAM (registers) replacing HBM, and memory units optimized for the memory access pattern of the transformer architecture. After all, LLMs are new companions for humans (after web and mobile), and they need dedicated chips for efficiency and performance.