Unlocking peak performance from your C++ code can be daunting, demanding meticulous profiling, intricate memory access adjustments, and cache optimization. Is there a trick to simplify this a bit? Fortunately, there is a shortcut to achieving remarkable performance gains with minimal effort, provided you have the right insights and know what you’re doing. Enter compiler optimizations, which can significantly elevate your code’s performance.

Modern compilers serve as indispensable allies in this journey toward optimal performance, particularly in automatic parallelization. These sophisticated tools can scrutinize intricate code patterns, especially within loops, and apply optimizations seamlessly.

This article aims to spotlight the potency of compiler optimizations, focusing on the Intel C++ compilers, renowned for their popularity and widespread usage. In this story, we unravel the layers of compiler magic that can transform your code into a high-performance masterpiece, requiring less manual intervention than you might think.

Highlights: What are compiler optimizations? | -On | Architecture targeted | Interprocedural Optimization | -fno-alias | Compiler Optimization reports

What are compiler optimizations?

Compiler optimizations encompass various techniques and transformations a compiler applies to the source code during compilation. But why? To enhance the performance, efficiency, and, in some instances, the size of the resulting machine code. These optimizations are pivotal in influencing various aspects of code execution, including speed, memory usage, and energy consumption.

Any compiler executes a series of steps to convert high-level source code into low-level machine code: lexical analysis, syntax analysis, semantic analysis, intermediate code generation (IR), optimization, and code generation.
During the optimization phase, the compiler looks for ways to transform a program into a semantically equivalent one that uses fewer resources or executes more rapidly. Techniques employed in this process include, but are not limited to, constant folding, loop optimization, function inlining, and dead code elimination.

I’m not going to discuss all the available options. Instead, I’ll focus on how we can instruct the compiler to perform specific optimizations that might improve code performance. So, the solution? Compiler flags. Developers can specify a set of compiler flags during compilation, a practice familiar to those using options like “-g” or “-pg” with GCC for debugging and profiling information. As we go ahead, we’ll discuss similar compiler flags we can use while compiling our application with the Intel C++ compiler. These might help you improve your code’s efficiency and performance.

So, what are we working with?

I won’t delve into dry theory or inundate you with tedious documentation listing every compiler flag. Instead, let’s try to understand why and how these flags work. How do we accomplish this? We’ll take an unoptimized C++ function responsible for calculating a Jacobi iteration, and step by step, we’ll unravel the impact of each compiler flag. Along the way, we’ll measure the speedup by systematically comparing each variant with the base version, which is compiled with no optimization (-O0).

The speedups (or execution times) were measured on an Intel® Xeon® Platinum 8174 Processor machine.

Here, the Jacobi method solves a 2D partial differential equation (Poisson equation) for modeling the heat distribution on a rectangular grid, where u(x,y,t) is the temperature at point (x,y) at time t. We solve for the steady state, when the distribution isn’t changing anymore. A set of Dirichlet boundary conditions has been applied at the boundary.
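For reference, the steady state and the update performed by one Jacobi sweep can be written out explicitly. This is only a sketch under two assumptions that the article does not spell out: the grid spacing h is the same in both directions, and the right-hand side (heat source) is zero, which matches the jacobi() function in this article since it has no source term.

```latex
\begin{aligned}
&\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0
  && \text{(steady state, zero source term)} \\
&\frac{u_{i,j-1} + u_{i,j+1} + u_{i-1,j} + u_{i+1,j} - 4\,u_{i,j}}{h^2} \approx 0
  && \text{(central differences, spacing } h\text{)} \\
&u^{\text{new}}_{i,j} = \tfrac{1}{4}\left(u_{i,j-1} + u_{i,j+1} + u_{i-1,j} + u_{i+1,j}\right)
  && \text{(one Jacobi update)}
\end{aligned}
```

Each new value is the average of its four neighbours (left, right, top, bottom), which is what the 0.25 factor in the function expresses.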
We essentially have C++ code performing the Jacobi iterations on grids of variable sizes (which we call resolutions). A grid size of 500 means working on a 500x500 grid, and so on. The function for performing one Jacobi iteration is as follows:
/*
 * One Jacobi iteration step
 */
void jacobi(double *u, double *unew, unsigned sizex, unsigned sizey) {
 int i, j;

 for (j = 1; j < sizex - 1; j++) {
  for (i = 1; i < sizey - 1; i++) {
   unew[i * sizex + j] = 0.25 * (u[i * sizex + (j - 1)] +  // left
      u[i * sizex + (j + 1)] +  // right
      u[(i - 1) * sizex + j] +  // top
      u[(i + 1) * sizex + j]); // bottom
  }
 }

 for (j = 1; j < sizex - 1; j++) {
  for (i = 1; i < sizey - 1; i++) {
   u[i * sizex + j] = unew[i * sizex + j];
  }
 }
}

We keep performing Jacobi iterations in a loop until the residual reaches a threshold value. The residual calculation and threshold evaluation are done outside this function and are not of concern here.

So, let’s talk about the elephant in the room now! How does the base code perform?

With no optimizations (-O0), we get our baseline results. We measure the performance in MFLOP/s, and this will be the basis of our comparison.

MFLOP/s stands for “Million Floating Point Operations Per Second.” It is a unit of measurement used to quantify the performance of a computer or processor in terms of floating-point operations, that is, mathematical calculations on real numbers represented in a floating-point format. MFLOP/s is often used as a benchmark or performance metric, especially in scientific and engineering applications where complex mathematical calculations are prevalent. The higher the MFLOP/s value, the faster the system or processor performs floating-point operations.

To provide a stable result, I run the executable 5 times for each resolution and take the average of the MFLOP/s values.

Note: The default optimization level of the Intel C++ compiler is -O2, so it is important to specify -O0 explicitly when compiling the base version.

Let’s go ahead and see how these run times will vary as we try different compiler flags!

The most common ones: -O1, -O2, -O3 and -Ofast

These are some of the most commonly used compiler flags when one begins with compiler optimizations. In an ideal case, the performance would follow Ofast > O3 > O2 > O1 > O0. However, this doesn’t necessarily happen. The critical points of these options are as follows:

-O1:
Goal: Optimize for speed while avoiding code size increase.
Key Features: Suitable for applications with large code sizes, many branches, and where execution time isn’t dominated by code within loops.

-O2:
Enhancements over -O1: Enables vectorization.
Allows inlining of intrinsics and intra-file interprocedural optimization.

-O3:
Enhancements over -O2: Enables more aggressive loop transformations (Fusion, Block-Unroll-and-Jam). These optimizations consistently outperform -O2 only if loop and memory access transformations actually take place. Otherwise, -O3 can even slow down the code.
Recommended For: Applications with loop-heavy floating-point calculations and large data sets.

-Ofast: Sets the following flags:
“-O3”
“-no-prec-div”: Enables optimizations that give faster but slightly less precise results than full IEEE division. For example, A/B is computed as A * (1/B) to improve the computation speed.
“-fp-model fast=2”: Enables more aggressive floating-point optimizations.

The official guide talks in detail about exactly which optimizations these options offer.

When using these options on our Jacobi code, we see that all of them are much faster than our base code (with “-O0”). The execution run time is 2–3x lower than the base case.

What about MFLOP/s? Well, that’s something! There is a big difference between the MFLOP/s of the base case and those with the optimizations. Overall, though only slightly, “-O3” performs the best. The extra flags used by “-Ofast” (“-no-prec-div -fp-model fast=2”) aren’t giving any additional speedup.

Architecture targeted (-xHost, -xCORE-AVX512)

The machine’s architecture stands out as a pivotal factor influencing compiler optimizations. Performance can improve significantly when the compiler knows the available instruction sets and the optimizations supported by the hardware (like vectorization and SIMD). For instance, my Skylake machine has 3 SIMD units: 1 AVX 512 and 2 AVX-2 units.

Can I really do something with this knowledge? The answer lies in strategic compiler flags.
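To make the idea of SIMD width concrete: a 512-bit register holds eight 64-bit doubles, so a vectorized loop conceptually processes eight elements per step and handles any leftovers in a scalar remainder loop. The sketch below only imitates that shape in plain C++. The names kLanes, scale_scalar, and scale_chunked are made up for illustration, no intrinsics are used, and which instructions the compiler actually emits depends on the flags and the hardware.

```cpp
#include <cstddef>
#include <vector>

// Assumption for illustration: one 512-bit register = 8 doubles (512 / 64).
constexpr std::size_t kLanes = 8;

// Plain scalar loop: one element per iteration.
// `out` must already have the same size as `in`.
void scale_scalar(const std::vector<double>& in, std::vector<double>& out, double factor) {
  for (std::size_t i = 0; i < in.size(); ++i) {
    out[i] = factor * in[i];
  }
}

// Same computation, written in the shape a vectorizer aims for:
// a main loop over full groups of kLanes elements plus a scalar remainder loop.
// `out` must already have the same size as `in`.
void scale_chunked(const std::vector<double>& in, std::vector<double>& out, double factor) {
  const std::size_t n = in.size();
  std::size_t i = 0;
  for (; i + kLanes <= n; i += kLanes) {
    // On SIMD hardware, these kLanes multiplications could map to one instruction.
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
      out[i + lane] = factor * in[i + lane];
    }
  }
  for (; i < n; ++i) {  // leftover elements that do not fill a whole group
    out[i] = factor * in[i];
  }
}
```

Both functions produce identical results. The point is only that the second form exposes groups of independent operations, which is what wider instruction sets can exploit when the compiler is allowed to target them.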
Experimenting with options such as “-xHost” and, more precisely, “-xCORE-AVX512” may allow us to harness the full potential of the machine’s capabilities and tailor optimizations for optimal performance. Here is a quick description of what these flags are all about:

-xHost:
Goal: Specifies that the compiler should generate code optimized for the host machine’s highest instruction set.
Key Features: Takes advantage of the latest features and capabilities available on the hardware. It can give an amazing speedup on the target system.
Considerations: While this flag optimizes for the host architecture, it might result in binaries that are not portable across machines with different instruction set architectures.

-xCORE-AVX512:
Goal: Explicitly instructs the compiler to generate code that utilizes the Intel Advanced Vector Extensions 512 (AVX-512) instruction set.
Key Features: AVX-512 is an advanced SIMD (Single Instruction, Multiple Data) instruction set that offers wider vector registers and additional operations compared to previous versions like AVX2. Enabling this flag allows the compiler to leverage these advanced features for optimized performance.
Considerations: Portability is again the culprit here. Binaries generated with AVX-512 instructions may not run optimally on processors that do not support this instruction set. They may not work at all!

AVX-512 instructions use zmm registers, a set of 512-bit wide registers that serve as the foundation for vector processing. By default, “-xCORE-AVX512” assumes that the program is unlikely to benefit from zmm register usage, and the compiler avoids using zmm registers unless a performance gain is guaranteed. If one plans to use the zmm registers without restrictions, “-qopt-zmm-usage” can be set to high. That’s what we’ll be doing as well.

Don’t forget to check the official guide for detailed instructions.

Let’s see how these flags work for our code: Woohoo!
We now cross the 1200 MFLOP/s mark for the smallest resolution. The MFLOP/s values for other resolutions have also increased. The remarkable part is that we achieved these results without any substantial manual intervention, simply by incorporating a handful of compiler flags when compiling the application.

However, it is essential to highlight that the compiled executable will only be compatible with machines supporting the same instruction set. The optimization-versus-portability trade-off is evident, as code optimized for a particular instruction set may sacrifice portability across different hardware configurations. So, make sure you know what you’re doing!

Note: Don’t worry if your hardware doesn’t support AVX-512. The Intel C++ Compiler supports optimizations for AVX, AVX-2 and even SSE. The documentation has everything you need to know!

Interprocedural Optimization (IPO)

Interprocedural Optimization involves analyzing and transforming code across multiple functions or procedures, looking beyond the scope of individual functions. IPO is a multi-step process focusing on the interactions between different functions or procedures within a program. It can include many different kinds of optimizations, including forward substitution, indirect call conversion, and inlining.

The Intel compiler supports two common types of IPO: single-file compilation and multi-file compilation (Whole Program Optimization) [3]. There is a compiler flag for each of them:

-ipo:
Goal: Enables interprocedural optimization, allowing the compiler to analyze and optimize the entire program, beyond individual source files, during compilation.
Key Features:
- Whole Program Optimization: “-ipo” performs analysis and optimization across all source files, considering the interactions between functions and procedures throughout the entire program.
- Cross-function and cross-module optimization: The flag facilitates inlining functions, synchronization of optimizations, and data flow analysis across different program parts.
Considerations: It requires a separate link step. After compiling with “-ipo”, a dedicated link step is needed to generate the final executable. The compiler performs additional optimizations based on the whole program view during linking.

-ip:
Goal: Enables interprocedural analysis and propagation, allowing the compiler to perform some interprocedural optimizations without requiring a separate link step.
Key Features:
- Analysis and propagation: “-ip” enables the compiler to perform analysis and data propagation across different functions within a single source file during compilation. However, it does not perform all optimizations that require the full program view.
- Faster compilation: Unlike “-ipo”, “-ip” doesn’t necessitate a separate linking step, resulting in speedier compilation times. This can be beneficial during development when quick feedback is essential.
Considerations: Only some limited interprocedural optimizations occur, including function inlining.

In short, -ipo generally provides more extensive interprocedural optimization capabilities as it involves a separate link step, but comes at the cost of longer compilation times [4]. -ip is a quicker alternative that performs some interprocedural optimizations without requiring a separate link step, making it suitable for development and testing phases [5].

Since we’re only talking about performance and different optimizations, and compile times or executable size are not our concern, we’ll focus on “-ipo”.

-fno-alias

All the above optimizations depend on how well you know your hardware and how much you are willing to experiment.
But that’s not all. If we look at our code the way the compiler sees it, we may identify other potential optimizations. Let’s have another look at our code:
/*
 * One Jacobi iteration step
 */
void jacobi(double *u, double *unew, unsigned sizex, unsigned sizey) {
 int i, j;

 for (j = 1; j < sizex - 1; j++) {
  for (i = 1; i < sizey - 1; i++) {
   unew[i * sizex + j] = 0.25 * (u[i * sizex + (j - 1)] +  // left
      u[i * sizex + (j + 1)] +  // right
      u[(i - 1) * sizex + j] +  // top
      u[(i + 1) * sizex + j]); // bottom
  }
 }

 for (j = 1; j < sizex - 1; j++) {
  for (i = 1; i < sizey - 1; i++) {
   u[i * sizex + j] = unew[i * sizex + j];
  }
 }
}

The jacobi() function takes a couple of pointers to double as parameters and then does something inside the nested for loops. When any compiler sees this function in the source file, it has to be very careful. Why? The expression to calculate unew using u involves the average of 4 neighboring values. What if both u and unew point to the same location? This would become the classical problem of aliased pointers [7].

Modern compilers are very smart, and to ensure safety, they assume that aliasing could be possible. For scenarios like this, they avoid any optimizations that may impact the semantics and the output of the code.

In our case, we know that u and unew are different memory locations and are meant to store different values. So, we can easily let the compiler know there won’t be any aliasing here. How do we do that? There are two methods. The first is the C “restrict” keyword, but it requires changing the code. We don’t want that for now. Anything simpler? Let’s try “-fno-alias”.

-fno-alias:
Goal: Instructs the compiler not to assume aliasing in the program.
Key Features: Assuming no aliasing, the compiler can more freely optimize the code, potentially improving the performance.
Considerations: The developer has to be careful when using this flag: if any aliasing does occur, the program may give unexpected outputs.

More details can be found in the official documentation.

How does this perform for our code? Well, now we have something! We’ve achieved a remarkable speedup here, nearly 3x that of the previous optimizations.

What’s the secret behind this boost? By instructing the compiler not to assume aliasing, we’ve given it the freedom to unleash powerful loop optimizations. A closer examination of the assembly code (though not shared here) and the generated compile optimization report (see below) reveals the compiler’s savvy application of loop interchange and loop unrolling. These transformations contribute to a highly optimized performance, showcasing the significant impact of compiler flags on code efficiency.

Final graphs

The final graphs show how all the optimizations perform against each other.

Compiler Optimization report (-qopt-report)

The Intel C++ compiler provides a valuable feature that allows users to generate an optimization report summarizing all the adjustments made for optimization purposes [8]. This comprehensive report is saved in the YAML file format, presenting a detailed list of optimizations applied by the compiler within the code. For a detailed description, see the official documentation on “-qopt-report”.

What next?

We discussed a handful of compiler flags that can drastically improve the performance of our code without us actually doing much. The only prerequisite: don’t do anything blindly; make sure you know what you’re doing!

There are hundreds of such compiler flags, and this story covers only a handful. So, it is worth looking at your preferred compiler’s official guide (especially the documentation related to optimization).

Apart from these compiler flags, there is a whole bunch of techniques like Vectorization, SIMD intrinsics, Profile Guided Optimization, and Guided Auto Parallelism, which can amazingly improve the performance of your code.

Similarly, Intel C++ compilers (and all the popular ones) also support pragma directives, which are very nice features. It’s worth checking some of the pragmas like ivdep, parallel, simd, vector, etc., in the Intel-Specific Pragma Reference.

Suggested reads

[1] Optimization and Programming (intel.com)
[2] High Performance Computing with “Elwetritsch” at the University of Kaiserslautern-Landau (rptu.de)
[3] Interprocedural Optimization (intel.com)
[4] ipo, Qipo (intel.com)
[5] ip, Qip (intel.com)
[6] Intel Compiler, Optimization and Other flags for use by SPEChpc
[7] Aliasing, IBM Documentation
[8] Intel® Compiler Optimization Reports

Featured Photo by Igor Omilaev on Unsplash.

Compiler Optimizations: Boosting Code Performance With Minimum Tweaks!
