Performance has always been a high priority for C++, yet there are many examples both in the language and the standard library where compilers produce code that is significantly slower than what a machine is capable of. In this blog post, I’m going to explore one such example from the standard math library. Suppose we’re tasked with computing the square roots of an array of floating point numbers. We might write a function like this to perform the operation:

```cpp
// sqrt1.cpp
#include <cmath>

void compute_sqrt1(const double* x, int n, double* y) noexcept {
  for (int i = 0; i < n; ++i) {
    y[i] = std::sqrt(x[i]);
  }
}
```

If we’re using gcc, we can compile the code with

```shell
g++ -c -O3 -march=native sqrt1.cpp
```

With `-O3`, gcc will optimize the code heavily but will still produce code that is standard compliant. The `-march=native` option tells gcc to produce code targeting the native architecture’s instruction set. The resulting binaries may not be portable even between different x86-64 CPUs.

Now, let’s benchmark the function.
We’ll use google benchmark to measure how long it takes to compute the square roots of 1,000,000 numbers:

```cpp
// benchmark.cpp
#include <random>
#include <memory>
#include <benchmark/benchmark.h>

void compute_sqrt1(const double* x, int n, double* y) noexcept;

static void generate_random_numbers(double* x, int n) {
  std::mt19937 rng{0};
  std::uniform_real_distribution<double> dist{0, 100};
  for (int i = 0; i < n; ++i) {
    x[i] = dist(rng);
  }
}

static void BM_Sqrt1(benchmark::State& state) {
  const int n = state.range(0);
  std::unique_ptr<double[]> xptr{new double[n]};
  generate_random_numbers(xptr.get(), n);
  for (auto _ : state) {
    std::unique_ptr<double[]> yptr{new double[n]};
    compute_sqrt1(xptr.get(), n, yptr.get());
    benchmark::DoNotOptimize(yptr);
  }
}

BENCHMARK(BM_Sqrt1)->Arg(1000000);

BENCHMARK_MAIN();
```

Compiling our benchmark and running

```shell
g++ -O3 -march=native -o benchmark benchmark.cpp sqrt1.o
./benchmark
```

we get

```
Running ./benchmark
Run on (6 X 2600 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 9216 KiB (x6)
Load Average: 0.17, 0.07, 0.05
-----------------------------------------------------------
Benchmark                 Time             CPU   Iterations
-----------------------------------------------------------
BM_Sqrt1/1000000    4984457 ns      4946631 ns          115
```

Can we do better? Let’s try this version:

```cpp
// sqrt2.cpp
#include <cmath>

void compute_sqrt2(const double* x, int n, double* y) noexcept {
  for (int i = 0; i < n; ++i) {
    y[i] = std::sqrt(x[i]);
  }
}
```

and compile with

```shell
g++ -c -O3 -march=native -fno-math-errno sqrt2.cpp
```

The only difference between `compute_sqrt1` and `compute_sqrt2` is the extra option `-fno-math-errno` added when compiling. I’ll explain later what `-fno-math-errno` does; for now, I’ll only point out that the produced code is no longer standard compliant.

Let’s benchmark `compute_sqrt2`.
```cpp
// benchmark.cpp
...
static void BM_Sqrt2(benchmark::State& state) {
  const int n = state.range(0);
  std::unique_ptr<double[]> xptr{new double[n]};
  generate_random_numbers(xptr.get(), n);
  for (auto _ : state) {
    std::unique_ptr<double[]> yptr{new double[n]};
    compute_sqrt2(xptr.get(), n, yptr.get());
    benchmark::DoNotOptimize(yptr);
  }
}

BENCHMARK(BM_Sqrt2)->Arg(1000000);
...
```

Running

```shell
g++ -O3 -march=native -o benchmark benchmark.cpp sqrt2.o
./benchmark
```

we get

```
Running ./benchmark
Run on (6 X 2600 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 9216 KiB (x6)
Load Average: 0.17, 0.07, 0.05
-----------------------------------------------------------
Benchmark                 Time             CPU   Iterations
-----------------------------------------------------------
BM_Sqrt2/1000000    1195070 ns      1192078 ns          553
```

Yikes! `compute_sqrt2` is more than 4 times faster than `compute_sqrt1`.

What’s different? Let’s drill down into the assembly to find out. We can produce the assembly for the code by running

```shell
g++ -S -c -O3 -march=native sqrt1.cpp
g++ -S -c -O3 -march=native -fno-math-errno sqrt2.cpp
```

The result will depend on what architecture you’re using, but looking at `sqrt1.s` on my architecture, we see this section

```asm
.L3:
        vmovsd   (%rdi), %xmm0
        vucomisd %xmm0, %xmm2
        vsqrtsd  %xmm0, %xmm1, %xmm1
        ja       .L12
        addq     $8, %rdi
        vmovsd   %xmm1, (%rdx)
        addq     $8, %rdx
        cmpq     %r12, %rdi
        jne      .L3
```

Let’s break down the first few instructions:

```asm
1: vmovsd (%rdi), %xmm0          # Load a value from memory into the register %xmm0
2: vucomisd %xmm0, %xmm2         # Compare the value of %xmm0 with the register %xmm2
                                 # and set EFLAGS with the result
3: vsqrtsd %xmm0, %xmm1, %xmm1   # Compute the square root of %xmm0 and store in %xmm1
4: ja .L12                       # Inspect EFLAGS and jump if %xmm2 is above %xmm0
```

What are instructions 2 and 4 for? Recall that for real numbers, sqrt is undefined on negative values.
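One way to read that scalar loop is the C++ rendering below. This is my own interpretation of the assembly, not compiler output; it assumes `%xmm2` holds 0.0 and that `.L12` is an out-of-line slow path, and the function name is hypothetical:

```cpp
#include <cmath>

// A hand-written interpretation of the loop in sqrt1.s. The hardware square
// root (vsqrtsd) runs unconditionally and already yields NaN for negative
// inputs; the compare-and-branch (vucomisd + ja .L12) diverts negative inputs
// to a slow path that makes the full library call.
void compute_sqrt1_as_written(const double* x, int n, double* y) {
  for (int i = 0; i < n; ++i) {
    double v = x[i];
    double r = __builtin_sqrt(v);  // vsqrtsd: fast path, NaN when v < 0
    if (0.0 > v) {                 // vucomisd %xmm0, %xmm2 + ja .L12
      r = std::sqrt(v);            // slow path: out-of-line library call
    }
    y[i] = r;
  }
}
```

So every iteration pays for a compare and a (usually not-taken) branch on top of the square root itself.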
When std::sqrt is passed a negative number, the C++ standard requires that it return the special floating point value NaN and that it set the global variable errno to EDOM. But that error handling ends up being really expensive. If we look at `sqrt2.s`, we see these instructions for the main loop:

```asm
.L6:
        addl         $1, %r8d
        vsqrtpd      (%r10,%rax), %ymm0
        vextractf128 $0x1, %ymm0, 16(%rcx,%rax)
        vmovups      %xmm0, (%rcx,%rax)
        addq         $32, %rax
        cmpl         %r8d, %r11d
        ja           .L6
```

Without the burden of having to do error handling, gcc can produce much faster code. vsqrtpd is what’s known as a Single Instruction Multiple Data (SIMD) instruction. It computes the square root of four double precision floating point numbers at a time. For computationally expensive functions like sqrt, vectorization helps a lot.

It’s unfortunate that the standard requires such error handling. The error checking is so much slower that many compilers, such as Intel’s icc and Apple’s default clang-based compiler, opt out of it by default. Even if we want std::sqrt to do error handling, we can’t portably rely on major compilers to provide it.

The complete benchmark can be found at rnburn/cmath-bechmark.

Previously published at https://ryanburn.com/2020/12/26/why-c-standard-math-functions-are-slow/