Understanding the Roofline Model

by Durganshu Mishra, December 22nd, 2023

Too Long; Didn't Read

This blog discusses the fundamentals and complexities of the Roofline performance model. Apart from the theoretical concepts, a hands-on example based on matrix multiplication in C++ (with all the relevant calculations) is showcased for better understanding.



There’s no denying that the performance of our code matters. For countless applications, including HPC simulations, gaming, and graphics, performance does not just mean faster execution and shorter wait times. It also directly relates to cost, energy usage, and user experience. Better-performing code often saves substantial computational resources, potentially cutting costs, especially when running in supercomputers and data centers.


However, the journey to improve performance can be a formidable challenge, with complexities that demand a strategic approach. Before enhancing performance, a crucial first step is to gauge the theoretical limits of achievable performance. This ensures programmers know what they’re aiming for and how far they can push their optimizations.


This is where performance models like the roofline model emerge as a guiding beacon. By providing insights into the theoretical performance limits and pinpointing bottlenecks, the roofline model becomes an invaluable tool for computational excellence, allowing developers to make informed decisions about performance optimization strategies.


In this story, I’ll dig deeper into what the roofline model is all about. We’ll also go through a detailed example that shows how to identify whether an algorithm is compute-bound or memory-bound. The roofline model allows us to determine whether our algorithm runs at its peak potential or leaves room for improvement.




The rooflines

The theoretical limits (or the rooflines) come mainly from the machine's architecture and memory characteristics. As discussed earlier in my blog post on the necessity of parallelism, it is up to the programmer to exploit the hardware’s capability for the maximal performance of the algorithm.


Two major upper limits (or rooflines) for the achievable performance are:

  1. Peak performance
  2. Memory throughput/bandwidth


Peak performance [operations/s]

The applicable peak performance depends on the computational resources available on the hardware. It is affected by architectural factors such as the processor’s clock frequency and its pipelining implementation.


This roofline also depends on the kinds of operations our code performs. For instance, the rooflines may differ for code with fused multiply-add (FMA) operations and code with only basic arithmetic operations. Whether the algorithm can use vectorization (SIMD) also changes the achievable performance.


Memory throughput/bandwidth

Another ceiling comes from the fact that only a limited amount of data can be transferred between the memory and the CPU.


This roofline is dependent on two aspects: the code-dependent computational (or arithmetic or operational) intensity [operations/byte] and the hardware-dependent peak bandwidth [byte/s].


What is computational intensity?

Computational intensity measures the number of floating-point operations the algorithm performs for every byte transferred between the memory and the CPU. Put simply, it is the ratio of the total number of floating-point operations in your algorithm to the total number of bytes transferred. Thus, it is measured in operations/byte. For example, a kernel that performs 2 floating-point operations for every 16 bytes it moves has a computational intensity of 0.125 operations/byte.


What about the peak memory bandwidth?

Modern-day computers have an intricate memory hierarchy. A high-level representation of the memory hierarchy is shown below:



The memory hierarchy. Source: Dive Into Systems



There are several levels of data cache (typically L1 to L3), followed by DRAM. The memory bandwidth of each of these memories varies considerably. While the Level 1 cache (L1) has the highest memory bandwidth, DRAM has the lowest. However, the capacity of L1 is substantially smaller (typically on the order of tens of KBs in modern personal computer processors). In contrast, DRAM can be significantly larger (on the order of several GBs).


Benchmarks such as the STREAM benchmark can be used to identify the peak memory bandwidth.
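
To get a feel for what such a measurement looks like, here is a minimal, STREAM-triad-style sketch in C++. It is illustrative only: for real measurements, use the official STREAM benchmark, which carefully controls for compiler and caching effects.

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
  // Arrays far larger than any cache, so we measure main-memory bandwidth.
  const size_t n = 1 << 25; // ~32M doubles = 256 MB per array
  std::vector<double> a(n, 1.0), b(n, 2.0), c(n, 0.0);
  const double scalar = 3.0;

  const auto start = std::chrono::steady_clock::now();
  for (size_t i = 0; i < n; ++i)
    c[i] = a[i] + scalar * b[i]; // the "triad": 2 loads + 1 store per element
  const auto end = std::chrono::steady_clock::now();

  const double seconds = std::chrono::duration<double>(end - start).count();
  const double bytes = 3.0 * n * sizeof(double); // 24 bytes moved per element
  // Printing c[0] keeps the compiler from optimizing the loop away.
  std::printf("~%.1f GB/s (c[0] = %g)\n", bytes / seconds / 1e9, c[0]);
}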


The Naïve roofline model

Having discussed the basics, let’s proceed to how we utilize all this information to estimate our peak performance.


A simple roofline model (the so-called Naïve roofline model) can be represented as:


Pₘₐₓ = min(Pₚₑₐₖ, I · bₘₐₓ)


Pₘₐₓ: upper bound on the performance [operations/s]


Machine dependent parameters:

Pₚₑₐₖ: applicable peak performance [operations/s]

bₘₐₓ: applicable peak bandwidth [byte/s]


Code dependent parameter:

I: computational intensity [operations/byte]



As discussed above, the bottleneck can be the execution of work (Pₚₑₐₖ) or the data path (I*bₘₐₓ).
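
In code form, the naïve model is nothing more than a minimum over the two ceilings. A minimal sketch (the function and parameter names here are my own, purely illustrative):

#include <algorithm>

// Naïve roofline: attainable performance is capped by the slower of the
// compute path (p_peak) and the data path (intensity * b_max).
double roofline_pmax(double p_peak,     // [FLOP/s]
                     double b_max,      // [byte/s]
                     double intensity)  // [FLOP/byte]
{
  return std::min(p_peak, intensity * b_max);
}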



Assumptions behind the model

Several vital considerations contribute to the simplicity and effectiveness of the roofline model.

First and foremost, the model assumes an ideal scenario where data transfer and core execution are perfectly overlapped, emphasizing the importance of seamless coordination between these two aspects of computation.


The model operates on the principle that the slowest limiting factor, whether core execution or data transfer, dictates overall performance. In this competition, the “winner” is the factor with the most significant impact, while others are presumed to have negligible effects.


Notably, latency effects are disregarded, implying an assumption of a perfect streaming mode where data transfer is already in progress.


Finally, the model focuses on “steady-state” code execution, excluding wind-up and wind-down effects and assuming full pipelines, which adds to its simplicity.


Despite its simplicity, the roofline model proves highly effective by encapsulating this complexity, offering a practical tool for communication between High-Performance Computing (HPC) experts and application scientists.


Refined roofline model

The Naïve roofline model can be refined and improved in many ways. For starters, adding separate rooflines for different kinds of arithmetic operations makes the model more accurate.

Similarly, using different memory bandwidths for various available memories makes the model more expressive.


An example is shown below:

A roofline model with multiple ceilings. Source: Intel® Advisor Roofline


While doing theoretical calculations (as we’ll do very soon), the Naïve roofline model neglects latency effects. However, if we have sufficient information, we can include latency effects in our calculations, improving our predictions.


Roofline model in action

Let’s use what we learned above to apply the roofline model to a real problem.


This example is inspired by what I learned in one of my favorite courses at the Technical University of Munich: Advanced Programming by the TUM SCCS.


My machine has a single-core CPU with a 2 GHz clock frequency (pretty oldish, right?) and the following specifications:


  1. The CPU can perform 4 double-precision (8-byte) floating-point operations simultaneously (i.e., it is vectorized).
  2. The latency of basic arithmetic operations (+, -, *) is 2 cycles (one operation per two cycles).
  3. The latency of the fused multiply-add operation (a += b * c) is 3 cycles.
  4. These operations are not pipelined.
  5. The CPU has 2 load units and 1 store unit, and each unit can accommodate 32 bytes (4 doubles).
  6. Accessing the cache memory is pipelined and costs 5 cycles. The cache operates at 2 GHz.
  7. Accessing the main memory is not pipelined and costs 10 cycles. The main memory operates at 800 MHz.
  8. Both types of memory can either load 2 vector registers (2 × 32 bytes) or store 1 vector register (32 bytes) simultaneously.
  9. Assume that one load/store unit of the CPU can load/store one vector register.


Given all the architectural specifications, we can first calculate the hardware-dependent parameters of our model.


After doing so, we can look at our code and proceed with the calculations of computational intensity.


Calculating the applicable peak performance

The theoretical peak performance will depend on the kind of operations we are doing.


For basic arithmetic operations:

As the CPU can perform 4 double-precision floating point operations simultaneously and the latency of basic arithmetic operations is 2 cycles, the theoretical peak performance can be measured as:



Pₚₑₐₖ = (2 × 10⁹ cycles/s ÷ 2 cycles per vector operation) × 4 FLOP per vector operation = 4 GFLOP/s … (a)



For fused-multiplication-addition (FMA) operations:

For FMA operations, each FMA counts as 2 FLOP (one multiplication + one addition). The latency, however, is 3 cycles. Thus, the theoretical peak performance is:


Pₚₑₐₖ(FMA) = (2 × 10⁹ cycles/s ÷ 3 cycles per vector FMA) × 4 FMA per vector × 2 FLOP per FMA ≈ 5.33 GFLOP/s … (b)


Calculating the memory bandwidth

Unfortunately, calculating the applicable memory bandwidth is a bit more complicated. As discussed above, different types of memories can have different bandwidth values (depending on their size and proximity to the processor). So, to accurately model the rooflines, it is essential to consider as many of these as possible. For our example, we’ll focus on the cache and main memory.


Theoretical read bandwidth of the memory:

Neglecting the latency effects due to the connection with the CPU, the read bandwidth of the main memory and the cache depends on how much data each can transfer per cycle and on its clock frequency.


Both kinds of memory can either load 2 vector registers (2 × 32 bytes) or store 1 vector register (32 bytes) simultaneously (8th specification above).


For main memory (without latency effects):


b(main, read) = 2 × 32 bytes × 800 × 10⁶ cycles/s = 51.2 GB/s … (c)


For Cache (without latency effects):


b(cache, read) = 2 × 32 bytes × 2 × 10⁹ cycles/s = 128 GB/s … (d)


Similar calculations can be done for the theoretical write bandwidth.
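
Following the same reasoning, a store moves one vector register (32 bytes) per cycle, so the theoretical write bandwidths are 32 bytes × 800 × 10⁶ cycles/s = 25.6 GB/s for the main memory and 32 bytes × 2 × 10⁹ cycles/s = 64 GB/s for the cache.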


Applicable read/write bandwidth of the memory:

In any real case scenario, the connection between the CPU and the memory (both cache and main memory) will bring in some latency effects.


Because of these effects, the CPU can use only a fraction of the available bandwidth. So, it is essential to consider these effects.



For main memory read (with latency effects):


b(main, read) = 2 × 32 bytes × (800 × 10⁶ cycles/s ÷ 10 cycles per access) = 5.12 GB/s … (e)



For main memory write (with latency effects):


b(main, write) = 32 bytes × (800 × 10⁶ cycles/s ÷ 10 cycles per access) = 2.56 GB/s … (f)


As mentioned in the specifications, access to the cache memory is pipelined (6th specification above). If the pipeline is full, the latency effects can be ignored for our calculations.



For cache read (with pipelined access):


b(cache, read) = 2 × 32 bytes × 2 × 10⁹ accesses/s = 128 GB/s … (g)



For cache write (with pipelined access):


b(cache, write) = 32 bytes × 2 × 10⁹ accesses/s = 64 GB/s … (h)


Calculating the computational intensity

Let’s consider an application whose core computation involves a matrix-matrix multiplication function. Here, the innermost loop over i computes the (row, col) element of the matrix result by taking the dot product of the row-th row of left with the col-th column of right.


Here is the complete code:

// result[row, col] = dot_product(left[row, :], right[:, col])
#include <cstddef>

void matrix_mult(size_t rows, size_t columns, size_t k,
                 double *left, double *right, double *result) {
  for (size_t row = 0; row < rows; ++row) {
    for (size_t col = 0; col < columns; ++col) {
      for (size_t i = 0; i < k; ++i) {
        if (i == 0) {
          // First term initializes the accumulator: a plain multiplication.
          result[row * columns + col] = left[row * k + i] * right[i * columns + col];
        } else {
          // Every remaining term is a fused multiply-add (a += b * c).
          result[row * columns + col] += left[row * k + i] * right[i * columns + col];
        }
      }
    }
  }
}
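
As a quick sanity check, here is a hypothetical driver for the function above (it assumes the definition is in scope; the expected output can be verified by multiplying the 2×3 and 3×2 matrices by hand):

#include <cstdio>

int main() {
  double left[]  = {1, 2, 3,
                    4, 5, 6};   // 2x3, row-major
  double right[] = {7,  8,
                    9, 10,
                    11, 12};    // 3x2, row-major
  double result[4] = {};        // 2x2 output
  matrix_mult(2, 2, 3, left, right, result);
  std::printf("%g %g\n%g %g\n", result[0], result[1], result[2], result[3]);
  // Prints: 58 64
  //         139 154
}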


As discussed above,


Computational intensity measures the number of floating-point operations the algorithm performs for every byte transferred between the memory and the CPU. Put simply, it is the ratio of the total number of floating-point operations in your algorithm to the total number of bytes transferred. Thus, it is measured in operations/byte.


Let’s start with the number of floating-point operations.


Total number of floating-point operations:

In total, the loops run for rows*columns*k iterations.


However, in the iterations where i = 0, only a multiplication is performed. So, a total of rows*columns plain multiplications are performed.


For the remaining iterations (those that take the else branch), FMA (fused multiply-add) operations are performed. In total, there are (k-1)*rows*columns FMA operations. Each FMA operation consists of 2 floating-point operations. Thus, we have:


Total multiplications: rows*columns

Total FMA operations: (k-1)*rows*columns

Total floating-point operations:

(2k-2)*rows*columns + rows*columns = (2k-1)*rows*columns … (i)


Total memory transactions:

The next step is identifying the number of bytes transferred between the memory and the processor.


This can be calculated by determining the total number of memory transactions between the two.


Note: For simplification, we’ll assume that (a += b * c) does not require memory interaction for the intermediate result b * c.


The algorithm comprises several read-and-write operations over the left, right, and result matrices. To calculate the total number of transactions, we need to consider all these (and potentially more).


Let’s begin!


The left matrix is read in every iteration. So, the total reads of the left matrix sum up to rows*columns*k.

Similarly, the right matrix is read the same number of times.


Is that all?


No.


The result matrix is also read for the FMA operations (a = a + b * c). These reads amount to rows*columns*(k-1).


Finally, the result matrix is accessed a total of rows*columns*k times for writing operations.


To summarize:


Total read operations:

2*rows*columns*k + rows*columns*(k-1) = rows*columns*(3k-1) … (j)


Total write operations:

rows*columns*k … (k)


Thus, the computational intensity can be calculated as:


For reading data:

Using (i) and (j):

I(read) = (2k − 1)·rows·columns FLOP ÷ [(3k − 1)·rows·columns reads × 8 bytes] = (2k − 1) ÷ (8 · (3k − 1)) FLOP/byte ≈ 1/12 FLOP/byte for large k … (l)


For writing data:

Using (i) and (k):

I(write) = (2k − 1)·rows·columns FLOP ÷ [k·rows·columns writes × 8 bytes] = (2k − 1) ÷ (8k) FLOP/byte ≈ 1/4 FLOP/byte for large k … (m)



Putting it all together

Phew! We’re almost there.


Having calculated all the elements of our roofline model, let’s put everything together and see how our algorithm is performing and what the limiting component of the system is.


Again, we break everything down into two cases.


When the cache is not involved

The reading speed can be calculated using the computational intensity calculated in (l) and the applicable read bandwidth for the main memory in (e):


P(read) = I(read) × b(main, read) = (1/12) FLOP/byte × 5.12 GB/s ≈ 0.43 GFLOP/s


Similarly, the writing speed can be calculated using the computational intensity calculated in (m) and the applicable write bandwidth for the main memory in (f):


P(write) = I(write) × b(main, write) = (1/4) FLOP/byte × 2.56 GB/s = 0.64 GFLOP/s


Since most of the computations in the innermost loop use the fused-multiply-add instruction, we will take the FMA-peak as the peak floating-point performance of our CPU for both cases.


Thus, using (b), the peak performance of our algorithm can be predicted as:


Pₘₐₓ(read) = min(Pₚₑₐₖ(FMA), I(read) · b(main, read)) = min(5.33, 0.43) GFLOP/s ≈ 0.43 GFLOP/s

Pₘₐₓ(write) = min(Pₚₑₐₖ(FMA), I(write) · b(main, write)) = min(5.33, 0.64) GFLOP/s = 0.64 GFLOP/s


Looking at these numbers, it is clear that our algorithm is memory-bound. The CPU could perform far more floating-point operations, but the data cannot be read or written fast enough.



When the cache is involved

The reading speed can be calculated using the computational intensity calculated in (l) and the applicable read bandwidth for the cache in (g):


P(read) = I(read) × b(cache, read) = (1/12) FLOP/byte × 128 GB/s ≈ 10.67 GFLOP/s



Similarly, the writing speed can be calculated using the computational intensity calculated in (m) and the applicable write bandwidth for the cache in (h):


P(write) = I(write) × b(cache, write) = (1/4) FLOP/byte × 64 GB/s = 16 GFLOP/s



Again, using (b), the peak performance of our algorithm can be predicted as:


Pₘₐₓ(read) = min(Pₚₑₐₖ(FMA), I(read) · b(cache, read)) = min(5.33, 10.67) GFLOP/s ≈ 5.33 GFLOP/s

Pₘₐₓ(write) = min(Pₚₑₐₖ(FMA), I(write) · b(cache, write)) = min(5.33, 16) GFLOP/s ≈ 5.33 GFLOP/s


In this case, the algorithm is compute-bound. While it can access the data quickly, the CPU doesn’t have enough compute resources to keep up with the available bandwidth. Thus, maybe we should consider a faster CPU for our algorithm!


Final words

This is how our roofline model would look:


Roofline model for our problem: Advanced Programming by the TUM SCCS



This graph clearly lays out what the theoretical limits of our algorithm could be. Being aware of them, we can carefully identify the potential bottlenecks of our algorithm and try to reach the theoretical limits.


Similarly, this also helps us identify when to stop optimizing our code, as maybe our hardware won’t allow further optimizations.


As modern-day processors become more powerful, many algorithms tend to be memory-bound (because of the so-called “Memory Wall”). This is why developers invest so much effort in improving memory-access patterns.
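
As a concrete illustration (a sketch under the same setup as before, not a tuned kernel), simply interchanging the loops of our earlier matrix multiplication turns the strided, column-wise accesses to right into contiguous, row-wise ones:

#include <cstddef>

// i-k-j loop order: the innermost loop now walks both result and right
// row by row, which is far friendlier to caches than the original order.
void matrix_mult_ikj(size_t rows, size_t columns, size_t k,
                     double *left, double *right, double *result) {
  for (size_t row = 0; row < rows; ++row) {
    for (size_t col = 0; col < columns; ++col)
      result[row * columns + col] = 0.0;    // initialize the output row
    for (size_t i = 0; i < k; ++i) {
      const double l = left[row * k + i];   // reused across the whole row
      for (size_t col = 0; col < columns; ++col)
        result[row * columns + col] += l * right[i * columns + col];
    }
  }
}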


It is important to mention here that roofline analysis doesn’t always involve a lot of hand calculations. A range of open-source and commercial tools can help us create powerful roofline visualizations. These include the Empirical Roofline Tool (ERT) for collecting hardware-specific roofline ceilings, and tools like LIKWID, Intel® Advisor Roofline, and Intel® VTune™ for collecting application-specific performance data.


I hope this story helped you understand the basics and complexities of the roofline model!





Suggested resources

[1] A very short intro to the Roofline model — YouTube

[2] Roofline_MuCoSim-SS22.pdf (fau.de)

[3] Performance model — HPC Wiki (hpc-wiki.info)

[4] RooflineVyNoYellow.pdf (berkeley.edu)

[5] Roofline model (cornell.edu)

[6] Tutorial-ISC2019-Intro-v2.pdf (nersc.gov)

[7] Chair of Scientific Computing (tum.de)

[8] berkeleylab / cs-roofline-toolkit — Bitbucket



Also published here.

Featured Photo by Joshua Hoehne on Unsplash