You probably have heard of the idea of out-of-order execution, that it breaks some common-sense assumptions about writing programs, and that software has to insert barriers to make things right. It was a baffling concept to me when I first heard of it. What's the point? Why would the hardware execute out of order, only to have the software correct the behavior back? Is the software taking care of all the quirks for me? And if not, where's the catch? Well, all of this conundrum comes from the evolution of processor design to create parallelism, and from the software's effort to harness that power while reining in the unwanted impact. Let's start with a tiny bit of history.

Sequential consistency on uniprocessor

Going back to the good old days when your CPU was just single-issue (meaning it executes one instruction at a time) and there was no multi-threaded programming to speak of, the programming model was straightforward -- there's memory, and a stream of instructions executing in program order, reading from and writing to that memory. For as long as the program lives, memory will reliably hold results written to it, and reproduce the same values when read afterwards. That is, a read after a write to the same memory location will faithfully return the value written. A write after a write to the same memory location will overwrite the old value with the new. To illustrate, look at this simplistic example1:

```cpp
 1  #include <iostream>
 2
 3  int main() {
 4      int i = 1;
 5      std::cout << i << ' ';
 6      int j = 5;
 7      std::cout << i << ' ';
 8      std::cout << j << ' ';
 9      i = 2;
10      std::cout << i << ' ';
11      std::cout << j << '\n';
12      return 0;
13  }
```

Easy to predict, the result will be 1 1 5 2 5. At first the variable i assumes the value 1. Up until line 9, where the value is updated, no matter how many times it's read the location will always return 1. Right after 2 is written to it, reading that variable will invariably return 2.

There's also the variable j. Since it's a different location, whatever value is written to it doesn't concern the output of i whatsoever.

Now let's look at a 2-threaded example to add a bit of nondeterminism. Keep in mind we're still executing on a single CPU. Imagine we have two functions func0 and func1 executed by two separate threads, in the following example2:

```cpp
int i = 0;
int j = 0;
int output[2];

void func0() {
    i = 1;
    output[1] = j;
}

void func1() {
    j = 1;
    output[0] = i;
}
```

After the program finishes, what are the valid outputs? There are roughly four instructions here: two reads and two writes. Depending on how they are mixed, 01 and 10 are both likely -- if one function happens to execute entirely before the other, these will ensue. 11 is possible too, so long as both reads are performed after both writes are done.

You probably have noticed 00 is not possible. No matter how the four operations interleave, the read that executes last runs after everything else, including the other thread's write of 1 to the variable it is reading, so at least one output element is always 1.

When you make this conclusion, you are of course intuitively assuming that the programming model is sequentially consistent. To summarize, this means that even though the instructions from different threads may be interleaved in a nondeterministic way, each thread's execution is in program order. Also, the operations are not overlapped: one memory access is done before another begins.
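If you want to double-check that intuition, here is a quick brute-force sketch (a hypothetical checker I wrote for illustration, not part of the test programs below) that enumerates all six sequentially consistent interleavings of example2 and collects the outputs they can produce:

```cpp
#include <iostream>
#include <set>
#include <string>

// A brute-force check of example2 under sequential consistency: enumerate
// every interleaving that keeps each thread's two operations in program
// order, simulate it, and record the resulting output.
int main() {
    std::set<std::string> results;
    // A 4-bit mask picks which of the four slots belong to thread 0;
    // exactly two bits set -> 6 valid interleavings.
    for (int mask = 0; mask < 16; ++mask) {
        if (__builtin_popcount(mask) != 2)
            continue;
        int i = 0, j = 0, output[2] = {0, 0};
        int done0 = 0, done1 = 0;  // ops already issued by each thread
        for (int slot = 0; slot < 4; ++slot) {
            if (mask & (1 << slot)) {  // thread 0: i = 1; output[1] = j;
                if (done0++ == 0) i = 1; else output[1] = j;
            } else {                   // thread 1: j = 1; output[0] = i;
                if (done1++ == 0) j = 1; else output[0] = i;
            }
        }
        results.insert(std::to_string(output[0]) + std::to_string(output[1]));
    }
    for (const std::string& r : results)
        std::cout << r << '\n';  // prints 01, 10, 11 -- never 00
}
```

It prints 01, 10 and 11; 00 never shows up, exactly as argued above.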
Over the years, processor design has improved drastically, and computer architects play all kinds of tricks to extract parallelism from executions. For example, in our example you may notice that in func0, the two operations can be performed in parallel, so long as they both execute before func1. Given a possible instruction sequence, we'll be free to reorder the instructions and still maintain the impression that the program runs with sequential consistency, as long as we maintain the order of memory operations to the same variable!

Computer architects actually went ahead and took advantage of this to implement superscalar processors that can execute multiple instructions at a time, tracking the operation dependencies with algorithms like Scoreboarding and Tomasulo. In the history of processor design you will see the architects playing nasty and nastier tricks, such as speculative execution, which has caused a ton of trouble now in the cloud business.

But I have digressed here. The point is that no matter how the system evolves, sequential consistency has been the de facto contract between programmers and system builders.

Consistency model on multi-processors

In the early to mid 2000s, the idea of SMP (symmetric multiprocessing) really caught on. The reason was that computer architects found it more and more difficult to dissipate the amount of heat generated by a single powerful processor. Using the same die area, they'd rather build multiple smaller "cores".

However, like in the uniprocessor days, within each core instructions may still be executed out of order. The read-write dependencies are still tracked by the same algorithms so as to maintain the illusion that the program is executed in order. However, this dependency tracking is not available across cores. Now looking back at our example2:

```cpp
int i = 0;
int j = 0;
int output[2];

void func0() {
    i = 1;
    output[1] = j;
}

void func1() {
    j = 1;
    output[0] = i;
}
```

Suppose the two functions are executed on two separate processors. Processor-0 will be free to think func0 is all it's executing, so it's possible for this processor to read j and write the output element before its own write to i becomes visible. Processor-1 is in the same boat! So under this paradigm, the output 00 is actually possible!

To illustrate this, let's write a full-fledged program that can reproduce it:

```cpp
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

int i;
int j;
int output[2];
std::atomic<bool> ready;

void func0() {
    while (!ready.load()) {
    }
    i = 1;
    output[1] = j;
}

void func1() {
    while (!ready.load()) {
    }
    j = 1;
    output[0] = i;
}

void run_test() {
    i = 0;
    j = 0;
    ready.store(false);
    std::thread t0(func0);
    std::thread t1(func1);
    std::this_thread::sleep_for(std::chrono::milliseconds(5));
    ready.store(true);
    t0.join();
    t1.join();
    std::cout << output[0] << output[1] << std::endl;
}

int main() {
    for (;;)
        run_test();
}
```

Don't mind the boilerplate for now. The sleep in run_test and the atomic ready are there to make the two threads start at roughly the same time.

On my x86 processor, 00 is not only possible, it's the dominant output (perhaps because reading is cheaper than writing with cache coherency protocols, which we will cover in later posts). In fact, 01 and 10 combined only account for about 25% of the output.

For some reason 11 is really rare, next to nonexistent, but that is just because of the timing of the execution. If you tell both threads to yield between the write and the read, then you will see 11 most of the time.
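For instance, a minimal sketch of that yield variant, assuming the rest of the test program stays exactly as above (only the two thread functions change):

```cpp
// Yield variant of the two thread functions: give the scheduler a chance to
// run the other thread between the write and the read, so that both writes
// usually land before either read executes.
void func0() {
    while (!ready.load()) {
    }
    i = 1;
    std::this_thread::yield();
    output[1] = j;
}

void func1() {
    while (!ready.load()) {
    }
    j = 1;
    std::this_thread::yield();
    output[0] = i;
}
```

With the yields in place, by the time each thread performs its read, the other thread has usually completed its write, so both reads tend to observe 1.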
Also, to illustrate that this won't be a problem on uniprocessors, we can pin the two threads onto the same core by adding the following snippet right after we create the threads. Afterwards you will see the 00 output completely disappear.

```cpp
// Linux-specific: pin both threads to core 0.
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(0, &cpuset);
pthread_setaffinity_np(t0.native_handle(), sizeof(cpu_set_t), &cpuset);
pthread_setaffinity_np(t1.native_handle(), sizeof(cpu_set_t), &cpuset);
```

How do we fix the broken consistency

I don't know about you, but when I first saw this result my mind was blown. I knew some RISC processors can go aggressive on relaxing memory models, but I had always thought my old buddy x86 was a dependable fellow.

x86, in fact, implements a memory model called TSO, or total store ordering. As the name suggests, the stores (writes) to all locations are globally ordered. More on the relaxed memory models later, but the tl;dr here is that you can imagine TSO placing write results in a buffer called the LSQ, or load-store queue, before they go out to the cache. Each read afterwards will in turn check this queue for an early hit before itself going out to the cache. The reason is that cache access takes much longer than register access, and it's valuable to keep a buffering structure on-core. But of course the consequence is that your read may bypass previous writes to different locations, as long as it can find a result in the LSQ, sometimes breaking sequential consistency.

The fix is to enforce sequential consistency by removing the reordering. C++ gives you a standard way of doing this. Instead of declaring i and j as plain integers, we'll wrap them in a std::atomic type, and perform the reads and writes through the load() and store() methods. By default these methods will enforce sequential consistency, even when these variables are accessed across different processors. So our new program will look like this:

```cpp
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

std::atomic<int> i;
std::atomic<int> j;
int output[2];
std::atomic<bool> ready;

void func0() {
    while (!ready.load()) {
    }
    i.store(1);
    output[1] = j.load();
}

void func1() {
    while (!ready.load()) {
    }
    j.store(1);
    output[0] = i.load();
}

void run_test() {
    i.store(0);
    j.store(0);
    ready.store(false);
    std::thread t0(func0);
    std::thread t1(func1);
    std::this_thread::sleep_for(std::chrono::milliseconds(5));
    ready.store(true);
    t0.join();
    t1.join();
    std::cout << output[0] << output[1] << std::endl;
}

int main() {
    for (;;)
        run_test();
}
```

As promised before, we have now converted i and j into atomics. And voilà, the 00 pattern is eliminated even if we're running on 2 cores, fully bringing us back to sequential-consistency land!

Under the hood, the compiler inserts memory barriers so that the processors coordinate to observe a global order on memory operations. Later on we'll get a chance to look at how the barriers are actually inserted at the assembly level. This enforcement is quite slow; if all variables were protected by memory barriers, it would largely defeat the purpose of having multiple cores. But for now, we see that it solves the problem of correctness.
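To get a rough feel for just how slow, here's a throwaway microbenchmark sketch (the variable x, the iteration count, and the timing scheme are mine, purely for illustration): it hammers a single atomic with relaxed stores, which on x86 typically compile to plain movs, versus the sequentially consistent stores that store() defaults to, which typically compile to xchg or mov plus mfence. Expect the seq_cst loop to run several times slower, though exact numbers vary by machine.

```cpp
#include <atomic>
#include <chrono>
#include <iostream>

std::atomic<int> x;

// Time `iters` calls of `do_store`, returning elapsed milliseconds.
template <typename StoreFn>
long long time_stores(StoreFn do_store, int iters) {
    auto start = std::chrono::steady_clock::now();
    for (int n = 0; n < iters; ++n)
        do_store(n);
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start)
        .count();
}

int main() {
    const int iters = 100'000'000;
    // Each lambda fixes the memory order at compile time, so the compiler
    // emits the intended instruction sequence for each loop.
    auto relaxed = [](int n) { x.store(n, std::memory_order_relaxed); };
    auto seq_cst = [](int n) { x.store(n, std::memory_order_seq_cst); };
    std::cout << "relaxed: " << time_stores(relaxed, iters) << " ms\n";
    std::cout << "seq_cst: " << time_stores(seq_cst, iters) << " ms\n";
}
```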
Recapitulation and what's to come

Here we have covered:

- Sequential consistency is the programming model computer systems strive to deliver, and it has always been maintained in uniprocessor programming.
- When programming on multiple processors, at times programmers need to explicitly enforce sequential consistency on their own.
- There are language constructs programmers can take advantage of to do so.

In the next episode, we'll dig deeper into some of the more relaxed memory models, and how they are supposed to be used to maintain the illusion of sequential consistency while producing as much parallelism between multiple cores as possible.

Disclaimer: Compilation in this article is done with clang++ 10, with compiler flags "-O2 -std=c++20 -lpthread". YMMV as to the execution results.