*This was originally a lecture from **Materialise Academy C++**. Since we don’t sell education but benefit from it, it is only logical to share our teaching materials with the rest of the world.*

This particular piece is about floating point numbers. It is not about the theory behind them, which you can read on Wikipedia just fine, but mostly about pitfalls and dark corners. It might seem negatively written to you, but it comes not from despair but from respect for complexity and all the challenges it brings.

If you work with floating point numbers long enough, you will inevitably develop a desire to run away and live a simpler life somewhere in the woods. It’s not that they are conceptually complex, quite the opposite. They look simple; they look reasonably manageable. That’s until you realize that basically every property you used to rely on with real numbers doesn’t stick to these fakes.

The variety of unpleasant inconsistencies is therefore staggering. If you want your computation to be unconditionally precise, then you can’t rely on basic algebra, you can’t rely on compiler optimizations, you can’t even rely on static computation. Sure, it is still fine most of the time to have a little error here and there, but if you work with floating point numbers long enough, you will realize that most of the time your problem is not what happens “most of the time”.

So why aren’t floating point numbers simple? Were they deliberately made unreliable? Well, the thing is, encoding real numbers in binary is not only non-trivial but impossible. Numbers are infinite; bits are not. There is not, and there cannot be, a bijection between anything real and anything digital, so there is no isomorphism between the algebra of real numbers and computational algorithms either. We can only afford a model, a pragmatic instrument not far from the abacus or counting sticks. Against all the odds, the people who developed computable floating point numbers did a very good job.

Let’s put ourselves in their shoes, shall we?

Consider the “one, two, many” counting system. It only has two numbers: one and two. Everything beyond that is many. It is the simplest counting system known to man, yet primitive societies such as the Walpiri from Australia or the Pirahã from South America are quite happy with it.

It has a very simple addition table. Almost everything becomes “many” when it is added up. But what about subtraction?

Subtraction is not obvious at all. We don’t have zero in “one, two, many”. In fact, until the 11th century, we didn’t have zero in Europe either. And even once it was adopted, it served as a filler symbol for the positional system rather than as a separate concept.

It’s just unnatural to have zero as a number. When you say that you have a number of apples you just imply that the number isn’t zero. Even mathematics’ natural numbers don’t include zero unless explicitly specified.

You can’t blame the Walpiri for not wishing to mess with this murky concept. So `1 - 1` is not a number in “one, two, many”. In the spirit of modern software jargon, let’s call it a `NaN`.

Now `1 - 2` is not a number either. Negative numbers are even more sinister for language. You can say “I have zero apples” meaning that you have none, but you can’t say “I have minus four apples,” although “It is minus four degrees outside” is perfectly fine. Well, the Pirahã don’t know about negative temperatures either, so `1 - 2` is `NaN`.

`“Many” - “many”` can be anything. It may be a number, or “many”, or `NaN`. Let us consider it undefined and move on.

But let’s not hurry to consider `“many” - 1` to be undefined as well.

The counting system is not a theoretical formalism. It is an instrument that serves people. And sometimes, just to be useful, an instrument has to be a little curved.

Take `“many” - 1`, for example. A shepherd may say: “When I take a sheep from my herd, and then I take another, rather sooner than later there will be only two left.” But the rice farmer would object: “When I take a single grain from a pile, there is more often than not still a pile.”

To resolve disputes such as this, people invented standardization. Standards may not always be right. They may not always be optimal. But it is still better to have one solution that works fairly well for everyone than countless solutions that barely work even for their inventors.

When standardizing, we always keep disputable questions off the table, even if this means compromise and sub-optimal solutions.

Speaking of which, floating point numbers are standardized under IEEE 754. It may very well be suboptimal from time to time, but it’s universal and it’s a huge thing! Yes, there are plenty of WAT-stories about JavaScript and floating point numbers, but these are the same WATs that C or Java folk know and like. It is consistently insane all over the world, and that is bliss!

Floating point numbers pretend to be real, but in fact, they only cover very specific grids of rational numbers. Since the nature of machine computation is always discrete, you can even imagine them as integers. A floating point number is an *integer number of halves, quarters, eighths* or whatever unit the exponent shows.
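
To make the grid tangible, here is a minimal sketch, assuming IEEE 754 single precision and round-to-nearest. The helper names `gap_above` and `absorbs_one` are made up for this illustration. The first measures the distance to the next representable `float`; the second checks whether adding `1` is lost entirely because the grid has become coarser than one.

```cpp
#include <cmath>
#include <limits>

// Distance from x to the next representable float above it (hypothetical helper).
inline float gap_above(float x) {
    return std::nextafter(x, std::numeric_limits<float>::infinity()) - x;
}

// True when the grid around x is so coarse that x + 1 rounds back to x.
inline bool absorbs_one(float x) {
    const float sum = x + 1.0f;
    return sum == x;
}
```

Around `1.0f` the gap equals `std::numeric_limits<float>::epsilon()`. At `16777216.0f` (that is, 2 to the 24th power) the gap is already `2`, so `absorbs_one(16777216.0f)` holds.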

Unlike common integers, they do handle overflows and underflows rather well. While the handling of an integer overflow is often left unspecified by a language or even by a hardware architecture, floating point overflows are specified by the standard. There are two infinities, positive and negative, and two zeros, also positive and negative. There is no real zero however, just like there is no real 1/3. Both “zeros” are just a model of something smaller than the smallest floating point number and not necessarily a naught.
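
A small sketch of both effects, assuming IEEE 754 arithmetic with the default rounding mode and no flush-to-zero setting. The function names are invented for the example, and `volatile` is only there to keep the compiler from folding the arithmetic away.

```cpp
#include <cmath>
#include <limits>

// Overflow does not wrap around: the result saturates to +infinity.
inline float overflow_to_infinity() {
    volatile float big = std::numeric_limits<float>::max();
    return big * 2.0f;
}

// Underflow below the smallest magnitude yields a zero that keeps its sign.
inline float underflow_to_negative_zero() {
    volatile float tiny = -std::numeric_limits<float>::denorm_min();
    return tiny / 2.0f;
}
```

The second result compares equal to `0.0f`, yet `std::signbit` reports that it is negative.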

Also unlike integers, floating point numbers have a special value for denoting “not a number”. You can get one by dividing zero by zero or by trying to take the square root of a negative number. (Dividing a non-zero number by zero gives an infinity instead.) In fact, you can even use it in your own computations deliberately to mark an error.
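
The three ways of obtaining a NaN mentioned above can be sketched as follows, assuming IEEE 754 doubles. The function names are hypothetical, and `volatile` only prevents compile-time folding. Note that it is zero divided by zero that produces a NaN; a non-zero value divided by zero gives an infinity.

```cpp
#include <cmath>
#include <limits>

// 0 / 0 has no meaningful value, so the result is NaN.
inline double zero_over_zero() {
    volatile double zero = 0.0;
    return zero / zero;
}

// The square root of a negative number is not a real number either.
inline double root_of_negative() {
    volatile double minus_one = -1.0;
    return std::sqrt(minus_one);
}

// A NaN produced on purpose, e.g. to mark "no result" in your own code.
inline double error_marker() {
    return std::numeric_limits<double>::quiet_NaN();
}
```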

There are some notes in small print here. To keep it short, overflows are computable; there are numbers without implicit leading digits; there are two kinds of NaNs and each one is represented by the whole class of numbers. We’ll get to some of that further on. For now let’s study some examples.

As you might already know, a floating point number consists of a sign bit, several bits for an exponent, and the rest is called a significand — it’s where all the digits are kept.

Zero is actually rather simple. Well, apart from the fact that it has a sign. But look at the “one”. Isn’t it sinister? How come its significand is empty?

Consider this. There is a normal scientific notation which involves decimal digits, a point, and an exponent. In this notation, you write “`123`” as “`1.23e2`”. You don’t write “`12.3e1`” and certainly not “`0.123e3`”. Now if all your possible digits are ones and zeros, then you might not bother writing the leading one at all. It will always be there.

Unless the number is `0`.

If you do have more than one significant digit though, all of them except the leading one are, of course, stored in the significand part of a floating point number.
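
To look at the fields yourself, here is a minimal sketch that assumes `float` is a 32-bit IEEE 754 single precision number. The `FloatFields` struct and the `decompose` function are names invented for this example.

```cpp
#include <cstdint>
#include <cstring>

struct FloatFields {
    std::uint32_t sign;         // 1 bit
    std::uint32_t exponent;     // 8 bits, stored with a bias of 127
    std::uint32_t significand;  // 23 bits, without the implied leading one
};

inline FloatFields decompose(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);  // well-defined way to read the bit pattern
    return { bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu };
}
```

For `1.0f` this gives a biased exponent of `127` and an all-zero significand, because the only significant digit is the implied one. For `1.5f` a single bit (`0x400000`) appears in the significand.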

A special value of the exponent is reserved for oddities. An exponent of all ones with all zeros in the significand stands for the infinities you get from overflows, and the rest are “not-a-numbers”. Note that you usually get only one type of NaN regardless of the operation. But be aware, there might be other NaNs, and some of them may even have context-specific meaning.

Now let’s see how big a single precision floating point range is.

Yes, there are exactly 128 different exponent values for numbers greater than or equal to `1`. So you can multiply `1` by `2` 127 times before you hit an overflow. You might expect the same range for division. But it’s actually even bigger.
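
The multiplication exercise is easy to reproduce. A minimal sketch, assuming IEEE 754 single precision; the function name is made up for the example:

```cpp
#include <cmath>

// Counts how many times 1.0f can be doubled while the result stays finite.
inline int doublings_before_overflow() {
    float x = 1.0f;
    int count = 0;
    for (;;) {
        const float next = x * 2.0f;
        if (std::isinf(next)) return count;  // the next doubling would overflow
        x = next;
        ++count;
    }
}
```

Under the stated assumption this returns `127`.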

If you remember correctly, the leading `1` is implied unless the number is a `0`. But this means that for the exponent value that denotes zero, the whole range of 23 perfectly fine bits is reserved exclusively for only one, well two, distinct numbers. Isn’t it a bit extravagant?

When we know that for one particular value of the exponent the leading one is not implied, then we can write it explicitly in the significand. Yes, this means that every next number in our division exercise will have fewer meaningful digits than the former, but this still extends the range of how small a number we can denote.

It is a brilliant idea. But not really helpful most of the time. In reality, it only makes the computation harder, so there are usually ways of turning the whole thing off.
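
The division exercise can be sketched the same way. This assumes IEEE 754 single precision with denormal (subnormal) numbers enabled, which is the default unless flush-to-zero is switched on. Both function names are invented for the illustration.

```cpp
#include <cmath>

// Counts halvings of 1.0f for which the result is still non-zero.
inline int halvings_before_zero() {
    float x = 1.0f;
    int count = 0;
    for (;;) {
        const float next = x / 2.0f;
        if (next == 0.0f) return count;
        x = next;
        ++count;
    }
}

// Counts halvings of 1.0f for which the result still has its implied leading one.
inline int halvings_staying_normal() {
    float x = 1.0f;
    int count = 0;
    while (std::isnormal(x / 2.0f)) {
        x /= 2.0f;
        ++count;
    }
    return count;
}
```

Under these assumptions the first function returns `149` and the second `126`: the extra 23 steps are the ones where the leading one is written explicitly and walks down through the significand.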

Now let’s take a look at an entirely fictional example, which I will try to present as a typical computational problem.

Let’s say we want to find a solution for this simple equation by brute force. We could traverse a range of `x` with some step named “*delta*” and find an `x` that is close enough to the solution, meaning `|x² - 5| < epsilon`. The “*epsilon*” would be some small enough error we let our algorithm have.

There is a huge problem with this approach. Usually, it is easy to pick a *delta*: it’s some small value that represents negligible imprecision of the computation.

But how do we pick *epsilon* then? Obviously, it depends on *delta*. If it’s too small, then there is a possibility that no suitable `x` can be found. If it’s too large, then the algorithm will find an array of suitable `x` and the error of the computation will become greater than *delta*.

Even more unpleasant, it depends on the computational task itself. We only have to restate it as `x² = y` and the task of finding a reasonable *epsilon* becomes impossible. For every given *delta* and *epsilon* there will always be a `y` where the *epsilon* would be too small for the *delta*.

Of course, the example is rather silly. But the same implication works in practice. If your computation has epsilons and it works just fine, you just probably haven’t found a corner case where it wouldn’t. Yet.
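
The epsilon trap can be sketched in a few lines. This is an illustration under assumptions, not the original listing: `count_hits` is a hypothetical helper that walks `x` from zero up to `x_max` in steps of `delta` and counts every `x` whose square lands within `epsilon` of `y`.

```cpp
#include <cmath>

// Counts grid points x = i * delta in [0, x_max] with |x*x - y| < epsilon.
inline int count_hits(double y, double x_max, double delta, double epsilon) {
    int hits = 0;
    for (long i = 0; i * delta <= x_max; ++i) {
        const double x = i * delta;
        if (std::fabs(x * x - y) < epsilon) ++hits;
    }
    return hits;
}
```

With `delta = 1e-3` and `epsilon = 1e-2`, the call `count_hits(5.0, 10.0, 1e-3, 1e-2)` finds several acceptable values of `x`, while `count_hits(5e6, 3000.0, 1e-3, 1e-2)` finds none at all, because near `x ≈ 2236` a single step of `delta` changes `x²` by far more than `epsilon`.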

Now the second approach would let us solve it without epsilon at all. Let’s find a solution as an interval `x..x+delta` on which `x² - y` makes a sign change.

Well, this is better because we wouldn’t have to solve a theoretically unsolvable problem to come up with an algorithm. But it is still vulnerable.

Of course, it wouldn’t work for `y = 0`.
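
A minimal sketch of the sign-change search, with the hypothetical names `Interval` and `bracket_sqrt`. It scans the same grid as before and reports the first step across which `x² - y` changes from negative to non-negative or back.

```cpp
struct Interval {
    double lo;
    double hi;
    bool found;
};

// Looks for a step [lo, hi] of width delta on which x*x - y changes sign.
inline Interval bracket_sqrt(double y, double x_max, double delta) {
    for (long i = 0; (i + 1) * delta <= x_max; ++i) {
        const double lo = i * delta;
        const double hi = (i + 1) * delta;
        const bool negative_at_lo = lo * lo - y < 0.0;
        const bool negative_at_hi = hi * hi - y < 0.0;
        if (negative_at_lo != negative_at_hi) return { lo, hi, true };
    }
    return { 0.0, 0.0, false };
}
```

No epsilon is needed, and the search succeeds for both `y = 5` and `y = 5e6`. For `y = 0` the expression `x² - y` never becomes negative on this grid, so no sign change is ever seen and `found` stays `false`.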

And of course, it gets worse in practice when the algorithms are more complex and the corner cases are less predictable.

Let’s say we have a plane in 3D space and a point on it. The plane will be as simple as `z = 0`, and the point then: `(1, 0, 0)`. Simple! But all it takes to make it complicated is a mere transformation. A simple 30-degree rotation around the `y` axis and the point becomes `(√3/2, 0, 1/2)`. Except we don’t have irrational numbers among floating point numbers, so it actually doesn’t become exactly that, but something “close enough”. Does it still belong to the plane? Is it under or over the plane now? If we take it as a vector, what will be its length?
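
You can pose these questions to the machine directly. The sketch below uses the invented names `Vec3`, `rotate_about_y` and `length`, and it picks the rotation direction so that the `z` coordinate comes out positive, as in the text; with the opposite convention the sign of `z` flips.

```cpp
#include <cmath>

struct Vec3 {
    double x;
    double y;
    double z;
};

// Rotates p around the y axis; the direction is chosen so that (1, 0, 0) gets a positive z.
inline Vec3 rotate_about_y(Vec3 p, double degrees) {
    const double pi = std::acos(-1.0);
    const double angle = degrees * pi / 180.0;
    const double c = std::cos(angle);
    const double s = std::sin(angle);
    return { c * p.x - s * p.z, p.y, s * p.x + c * p.z };
}

inline double length(Vec3 v) {
    return std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
}
```

Rotating `(1, 0, 0)` by 30 degrees and then by -30 degrees should bring the point back onto `z = 0` with length `1`. Whether the computed `z` is exactly `0` and the length exactly `1`, or only within a few units of rounding error, is precisely the kind of thing you should check rather than assume.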

In real numbers all those questions would have trivial answers. But we don’t have real numbers on a computer, we have approximations instead, and these approximations always turn certain things into uncertainty. The moral here is, you can’t just run away from complexity unless you agree to literally live in the woods. You have to embrace the complexity and know your tools better than your competitors.

It’s not as hard as it seems. General knowledge about floating point numbers is half folklore, half bitter experience.

Speaking of which, here is some of my bitter experience.

At Materialise we take regression testing very seriously, and floating point numbers bring us a world of pain now and then. As floating point addition is not associative, you simply can’t rely on getting a deterministic result from a parallel computation in which the order of operations is not explicitly defined. Of course, the deviation caused by that is often negligible, but if you comply with FDA requirements, you can’t just let your tests be “a little bit failed”.
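
Here is a deliberately extreme sketch of the missing associativity, assuming IEEE 754 doubles. The two hypothetical helpers add the same three numbers and differ only in where the parentheses go, which is exactly what changes when a parallel reduction groups its partial sums differently.

```cpp
// Adds left to right: (a + b) + c.
inline double sum_left(double a, double b, double c) {
    return (a + b) + c;
}

// Adds right to left: a + (b + c).
inline double sum_right(double a, double b, double c) {
    return a + (b + c);
}
```

For `a = 1e20`, `b = -1e20`, `c = 1.0` the first grouping gives `1` and the second gives `0`, because `1.0` is far below the grid spacing around `1e20` and disappears when added to `b` first.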

The other thing is the absence of a fair zero. You might be surprised how popular minus zero is as a number. Make an ASCII exporter, gather some data, write some tests, roll it all out to the client, and be assured that in a month or so you’ll get a red report saying that the expected value for “a parameter no one cares about” is `0`, not `-0`.
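
A minimal sketch of how this shows up in a text export, assuming IEEE 754 doubles and the standard `printf` family. `to_text` and `without_negative_zero` are hypothetical helpers, not part of any real exporter.

```cpp
#include <cstdio>
#include <string>

// Formats a value the way a simple ASCII exporter might.
inline std::string to_text(double value) {
    char buffer[64];
    std::snprintf(buffer, sizeof buffer, "%g", value);
    return buffer;
}

// Maps both zeros to +0.0 and leaves every other value untouched.
inline double without_negative_zero(double value) {
    return value == 0.0 ? 0.0 : value;
}
```

The two zeros compare equal, yet `to_text(-0.0)` yields `"-0"` while `to_text(0.0)` yields `"0"`. Normalizing before formatting is one possible way to keep such differences out of test baselines; it only covers exact zeros, not small negative values that round to zero during formatting.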

And of course, my favorite range of bugs comes from the simple fact that floating point numbers and integer numbers are not interchangeable. A template method that is supposed to work with both `long int` and `double` will break down constantly on both integer and floating point corner cases. An attempt to fix it while keeping the code parametric usually brings a few pleasant minutes of nervous laughter for whoever gets to meet it the next time it fails.

But this is just a sample. A small introduction to the unreliable properties. The sad fact is, algebraic properties have a great potential and it’s a shame to miss it completely. Different compilers exploit those properties in different ways and under different compiler flags. Agner Fog did a great job by combining it all in one place. I want you to see a small quotation from it.

As you might see, and as you might guess, not all of these properties are indeed reliable, and not all of them are necessarily exploited. In reality, this means that the result of a computation depends on the particular path of compilation. You can get a compiler update and a bunch of failed tests as a free supplement.

We once got this when the new compiler learned how to do static computation on trigonometric functions. Previously they were all computed at run time using SSE2 registers, each of which holds two ordinary 64-bit doubles. But for static computations the x86 extended precision 80-bit format was chosen, and that makes the result just a little bit more precise in a small number of cases.

Again, while the error may be negligible, the very fact that the results change without any good reason usually is not. At the very least it makes people nervous.

To sum things up — floating points are unreliable, unpredictable, counterintuitive, fast, compact and standardized. You can’t treat them lightly, but you can sure do amazing things with them with a bit of knowledge, experience, and respect.

At the end, I want to propose an exercise. Something you are better off doing in a laboratory rather than on the job. I suggest you break every algebraic compiler optimization you can. Not every one of them is, of course, breakable, some are quite solid. But learning which is which with your own hands might be a good lesson.

For instance, you might think that `x == x` holds for every floating point number `x`. Here is the code that shows that it isn’t true. Can you guess what `x` is? Well, even if you can, the beauty of it is that you don’t have to! Just run the code and see.
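
A minimal sketch of such a check, assuming IEEE 754 doubles and a compiler that has not been told to ignore NaNs; the function name `equals_itself` is made up for the illustration.

```cpp
#include <limits>

// Algebra says this is always true. IEEE 754 says: not for every x.
inline bool equals_itself(double x) {
    return x == x;
}
```

Feed it ordinary numbers, both zeros, both infinities and `std::numeric_limits<double>::quiet_NaN()`, and see which call returns `false`. Then try again with an option such as GCC’s `-ffast-math`, which permits the compiler to assume that NaNs never occur, and check whether the answer survives.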

**P. S.** Some might say that it is all hardly worth the effort and you would be much better off doing integer computations. Integers are predictable, aren’t they?

Well, no. Just the day before this lecture was finished, I found an integer overflow in our code. This is not a real piece, but just an example:

As you can see, the code looks perfectly legitimate. But of course, it isn’t. The expression after “return” is being computed in integer and only then converted to the double. In C++ there are pages of rules for integer promotions, conversions, and ranks, but there is no standard way to handle integer overflows.
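
A hypothetical example of the same kind of bug, with invented function names. It uses unsigned integers on purpose: unsigned arithmetic wraps around in a well-defined way, which keeps the demonstration reproducible, whereas overflowing a signed `int` is undefined behavior. The sketch assumes that `unsigned int` is 32 bits wide.

```cpp
#include <cstdint>

// Looks fine, but the product is computed in 32-bit integers and only then converted.
inline double pixel_count_wrong(std::uint32_t width, std::uint32_t height) {
    return width * height;
}

// Converting one operand first makes the multiplication happen in double.
inline double pixel_count_right(std::uint32_t width, std::uint32_t height) {
    return static_cast<double>(width) * height;
}
```

For a 65536 by 65536 image the first version returns `0`, while the second returns `4294967296`.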

Integer number computations, along with corresponding conversion rules, are way worse than floating point computations because while they are certainly counterintuitive, they are not even standardized. There is no international standard on integers, so the implementation differs from language to language, from machine to machine. You can detect integer overflow for free on 80386, but not on ARM7. There is simply no such flag.

There is a whole chapter on integer computation in Secure Coding in C and C++. Although it is a bit out of the scope of this lecture, I recommend reading it if only to debunk the myth that integer numbers are simple.