For quite some time, I’ve used my own complex number definition in C++, mostly out of a misplaced distrust of std::complex. A co-worker of mine has done the same. Recently, he wanted to move to the complex number support introduced in C99, for the sake of simplicity.
After a while he came back and reported that the built-in complex numbers were approximately ten times slower than the one he was already using, so he was going to stick with his own. He showed me the benchmarks, which I’ve reproduced here. They compare a plain C99 complex float and a std::complex<float> against my cfloat_t and a trivial complex class that merely defines addition and multiplication. Using this benchmark, we’ll try to pin down the exact cause of this performance difference.
I compiled with g++ 4.8.5 (the same g++ that ships with RHEL 7). The results are, indeed, disheartening. Note: from this point on, all benchmarks are run on my Core i7-2600K. I set isolcpus=2,3 at kernel boot and ran the benchmarks with taskset, pinned to the isolated CPUs. The benchmarks themselves are 1000 iterations of a 1-million-element vector operation, ‘vec = const*vec + const*vec + vec’.
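The hot loop looks roughly like the following (a sketch under my own naming; the original test.cc may differ in its details):

```cpp
#include <cstddef>
#include <vector>

// One benchmark iteration: vec = c1*vec + c2*vec + vec.
// T is each complex type under test (C99 complex float, std::complex<float>,
// cfloat_t, or the trivial class); the harness times 1000 such iterations
// over a 1-million-element vector.
template <typename T>
void iterate(std::vector<T>& vec, T c1, T c2) {
    for (std::size_t i = 0; i < vec.size(); ++i)
        vec[i] = c1 * vec[i] + c2 * vec[i] + vec[i];
}
```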
```
> g++ -I. test.cc -O3 -o test
> taskset -c 2 ./test
#           c99time       stltime       mytime        cftime
<snip>
# mean:     0.018774      0.018836      0.003292      0.003289
# stddev:   2.872722e-04  1.511633e-04  1.248804e-04  1.259614e-04
```
Ouch. The two built-in types perform similarly (expected, since GCC’s STL implements std::complex in terms of C99’s complex type), but they’re almost six times slower than the custom solutions.
We didn’t have an obvious reason or solution, so it was generally agreed he should continue using his custom complex class and move on. It wasn’t until about a month later that the dark truth was revealed.
If you’re like me, sometime in middle school you learned about complex numbers, and in particular the rules for multiplying them. You probably learned something similar to the following, which uses the FOIL rule to expand the product out, then simplifies to get back into canonical complex form:
(a + bi) * (c + di) = (ac - bd) + (ad + bc)i
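That textbook rule is essentially all the trivial complex class implements; something along these lines (a sketch, not the exact code from the benchmark):

```cpp
// Textbook (FOIL) complex multiplication: no NaN/inf special-casing at all.
struct trivial_complex { float re, im; };

inline trivial_complex operator*(trivial_complex x, trivial_complex y) {
    return { x.re * y.re - x.im * y.im,    // ac - bd
             x.re * y.im + x.im * y.re };  // ad + bc
}

inline trivial_complex operator+(trivial_complex x, trivial_complex y) {
    return { x.re + y.re, x.im + y.im };
}
```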
Do you know how C99 defines complex multiplication? Page 470 of the C99 standard (Annex G) provides the answer; I’ll reproduce it here:
<a href="https://medium.com/media/5f6090b67bf61339b10dfbd40b0d104e/href">https://medium.com/media/5f6090b67bf61339b10dfbd40b0d104e/href</a>
That … looks more complex. The goal here is rigorously correct behavior with regard to inf and NaN results. At the very least, every multiply now ends with a NaN check on the result, followed by a cascade of extra branches whenever that check fires. (In practice, GCC inlines the textbook formula and, if the result comes out NaN, falls back to a libgcc helper, __mulsc3 for complex float.) Those required checks, and the optimization opportunities the compiler has to forgo because of them, might explain the 6x slowdown.
So what can be done? After fiddling with some settings, we eventually decided to try -ffast-math (-Ofast works for gcc ≥ 4.7) in the hope that assuming finite math, and perhaps allowing re-ordering of operations, would give us some win. And lo and behold:
```
> g++ -I. test.cc -Ofast -o test
> taskset -c 2 ./test
#           c99time       stltime       mytime        cftime
<snip>
# mean:     0.003191      0.003109      0.003184      0.003270
# stddev:   4.600097e-05  2.530775e-05  1.377464e-05  9.412087e-06
```
The performance difference disappears!
After a little more digging through the flags that -ffast-math actually implies, we settled on -fcx-limited-range, which explicitly disables the NaN/inf recovery for complex multiply and divide. And sure enough, that’s sufficient to recover the performance:
```
> g++ -I. test.cc -O3 -fcx-limited-range -o test
> taskset -c 2 ./test
#           c99time       stltime       mytime        cftime
<snip>
# mean:     0.003269      0.003328      0.003277      0.003287
# stddev:   4.911105e-05  2.076989e-05  1.038577e-05  1.047159e-05
```
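For completeness: C99 also defines a per-scope way to grant exactly this license, though to my knowledge GCC has never honored it, so the command-line flag is the practical option:

```cpp
// The portable spelling from C99 (7.3.4); as far as I know GCC ignores it,
// so the -fcx-limited-range command-line flag is the practical option.
#pragma STDC CX_LIMITED_RANGE ON
```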
Finally, I wanted to make sure there weren’t any other regressions, so I installed a range of gcc compilers on my machine, and ran the benchmark for each (with -O3 and -fcx-limited-range). The results were as follows:
<a href="https://medium.com/media/320e780c29ff5da86963e76084c8f76e/href">https://medium.com/media/320e780c29ff5da86963e76084c8f76e/href</a>
These look good: everything is comparable performance-wise. When we hit GCC 5, however, the vectorizer finally handles the loop and gives us a ~33% boost in performance, with no regressions across complex number types.
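If you want to see the vectorizer’s decisions for yourself, newer GCCs can report them (the -fopt-info family arrived around GCC 4.9; the exact output wording varies by version):

```
> g++ -I. test.cc -O3 -fcx-limited-range -fopt-info-vec-optimized -o test
# test.cc:<line>: note: loop vectorized
```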
Except, what’s that hiding in the corner of g++ 4.4’s results? Did it… actually make my two custom classes slower than the built-ins? $#%$*, it did, and not a little slower either: 3.5x slower.
Well, let’s look at a little assembly from the inner loop for the C99 complex and my cfloat:
<a href="https://medium.com/media/fa26e54447525a5569a48e6833d68910/href">https://medium.com/media/fa26e54447525a5569a48e6833d68910/href</a>
I don’t know exactly what’s going on here, other than that the compiler clearly failed to optimize the cfloat.h code well, which leads to very poor performance.
The takeaway from all of this is that:

- The inf/NaN handling C99 mandates makes built-in complex multiplication expensive, and std::complex inherits the cost.
- If you don’t need Annex G semantics, -fcx-limited-range (or -ffast-math/-Ofast) recovers the performance.
- Compiler versions matter: benchmark your own abstractions on the compilers you actually ship with, because optimizer behavior varies wildly between releases.
And finally, don’t think newer compilers will save you either. Here’s g++ 7 without -fcx-limited-range:
```
> g++ -I. test.cc -O3 -o test
> taskset -c 2 ./test
#           c99time       stltime       mytime        cftime
<snip>
# mean:     0.017414      0.017436      0.002254      0.002277
# stddev:   2.660802e-04  1.467618e-04  2.559226e-05  2.129554e-05
```
Here, g++ happily vectorized the custom structs while abandoning the built-ins to their default performance. This time it’s an almost 8x difference in speed, even worse than where we started with g++ 4.8.5. Trust nothing, and stay frosty.