Everything You Need to Know About Multithreading: Sequential Consistency Concept [Part 1]

Written by xylophoon | Published 2020/10/19
Tech Story Tags: programming | learning | hardware | cplusplus | c++ | multithreading | debugging | hackernoon-top-story

TLDR: Out-of-order execution breaks some common-sense assumptions about writing programs, and software has to insert barriers to make things right. Modern processors can execute multiple instructions at a time, and software must harness that parallelism while reining in its unwanted effects. This first part explains sequential consistency: the contract that has always held on uniprocessors, why it breaks on multiprocessors, and what it takes to restore it.

You probably have heard of the idea of out-of-order execution, and that it breaks some common-sense assumptions about writing programs, and software has to insert barriers to make things right. That was a baffling concept to me at least when I heard of it. At the very least, why would the hardware want to execute out of order only to have the software correct the behavior back? What's the point? Is the software taking care of all the quirks for me? And if not, where's the catch?
Well, all this conundrum comes from the evolution of processor design to create parallelism and the software efforts to harness the power while reining in the unwanted impact. Let's start with a tiny bit of history.

Sequential consistency on uniprocessors

Going back to the good old days when your CPU was just single-issue (meaning it executed one instruction at a time) and there was no multithreaded programming to speak of, the programming model was straightforward -- there's memory and a stream of instructions executing in program order, reading from and writing to that memory.
For as long as the program lives, memory will reliably hold results written to it, and reproduce the same values when read afterwards. That is, a read after a write to the same memory location will faithfully return the value written. A write after a write to the same memory location will overwrite the old value with the new. To illustrate, look at this simplistic example 1:
  1 #include <iostream>
  2 
  3 int main() {
  4   int i = 1;
  5   std::cout << i << ' ';
  6   int j = 5;
  7   std::cout << i << ' ';
  8   std::cout << j << ' ';
  9   i = 2;
 10   std::cout << i << ' ';
 11   std::cout << j << '\n';
 12   return 0;
 13 }
Easy to predict, the result will be 1 1 5 2 5. At first, variable i assumes the value 1. Up until line 9, where the value is updated, no matter how many times it's read, the location will always return the same value. Right after 2 is written to it, reading that variable will invariably return 2.
There's also the variable j. Since it's a different location, whatever value is written to it doesn't concern the output of i whatsoever.
Now let's look at a two-threaded example to add a bit of nondeterminism. Keep in mind we're still executing on a single CPU. Imagine we have two functions, func0 and func1, executed by two separate threads, in the following example 2:
int i = 0;
int j = 0;
int output[2];

void func0() {
  i = 1;
  output[1] = j;
}

void func1() {
  j = 1;
  output[0] = i;
}
After the program finishes, what are the valid outputs? There are roughly four instructions here: two reads and two writes. Depending on how they are interleaved, 01 and 10 are both possible -- if one function happens to execute entirely before the other, these will ensue. 11 is possible too, so long as both reads are performed after both writes are done.
You probably have noticed that 00 is not possible. No matter how the execution is interleaved, the first instruction executed will write a 1 to either i or j, before anything is read.
When you reach this conclusion, you are of course intuitively assuming that the programming model is sequentially consistent. To summarize, this means that even though the instructions from different threads may be interleaved in a nondeterministic way, each thread's execution follows program order. Also, the operations do not overlap: one memory access finishes before another begins.
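To make that intuition concrete, here's a minimal sketch (my own illustration, not from the original article) that enumerates every sequentially consistent interleaving of example 2 and collects the outputs they can produce. It confirms that 01, 10, and 11 are all reachable, and 00 never is:
#include <iostream>
#include <set>
#include <string>

// Shared state of example 2, plus the two output slots.
struct State {
  int i = 0, j = 0;
  int out0 = 0, out1 = 0;
};

// Execute instruction `pc` (0 or 1) of thread `t` against the state.
void step(State& s, int t, int pc) {
  if (t == 0) {  // func0: i = 1; output[1] = j;
    if (pc == 0) s.i = 1; else s.out1 = s.j;
  } else {       // func1: j = 1; output[0] = i;
    if (pc == 0) s.j = 1; else s.out0 = s.i;
  }
}

// Recursively explore every merge of the two 2-instruction threads,
// always respecting each thread's program order.
void explore(State s, int pc0, int pc1, std::set<std::string>& results) {
  if (pc0 == 2 && pc1 == 2) {
    results.insert(std::to_string(s.out0) + std::to_string(s.out1));
    return;
  }
  if (pc0 < 2) { State n = s; step(n, 0, pc0); explore(n, pc0 + 1, pc1, results); }
  if (pc1 < 2) { State n = s; step(n, 1, pc1); explore(n, pc0, pc1 + 1, results); }
}

int main() {
  std::set<std::string> results;
  explore(State{}, 0, 0, results);
  for (const auto& r : results) std::cout << r << '\n';  // prints: 01 10 11
}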
Over the years, processor design has improved drastically, and computer architects play all kinds of tricks to extract parallelism from executions. For example, you may notice that in func0 the two operations can be performed in parallel, so long as they both execute before func1. Given a possible instruction sequence, we're free to reorder the instructions and still maintain the impression that the program runs with sequential consistency, as long as we maintain the order of memory operations to the same variable!
Computer architects actually went ahead and took advantage of this to implement superscalar processors that execute multiple instructions at a time, tracking operation dependencies with algorithms like scoreboarding and Tomasulo's algorithm. Throughout the history of processor design you'll see architects playing nastier and nastier tricks, such as speculative execution, which has caused a ton of trouble in the cloud business recently.
But I digress. The point is that no matter how the system evolves, sequential consistency has been the de facto contract between programmers and system builders.

Consistency model on multiprocessors

In the early-to-mid 2000s, the idea of SMP (symmetric multiprocessing) really caught on. The reason was that computer architects found it more and more difficult to dissipate the amount of heat generated by a single powerful processor. For the same die area, they'd rather build multiple smaller "cores".
However, as in the uniprocessor days, instructions within each core may still be executed out of order. Read-write dependencies are still tracked by the same algorithms, so as to maintain the illusion that each core's program executes in order. This dependency tracking, though, is not available across cores. Now, looking back at our example 2:
int i = 0;
int j = 0;
int output[2];

void func0() {
  i = 1;
  output[1] = j;
}

void func1() {
  j = 1;
  output[0] = i;
}
Suppose the two functions are executed on two separate processors. Processor 0 is free to assume that func0 is all it's executing, so it may read j and write the output first. Processor 1 is in the same boat! So under this paradigm, the output 00 is actually possible!
To illustrate this, let's write a full-fledged program that reproduces it:
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

int i;
int j;
int output[2];

std::atomic<bool> ready;

void func0() {
  while (!ready.load()) {
  }

  i = 1;
  output[1] = j;
}

void func1() {
  while (!ready.load()) {
  }

  j = 1;
  output[0] = i;
}

void run_test() {
  i = 0;
  j = 0;
  ready.store(false);

  std::thread t0(func0);
  std::thread t1(func1);

  std::this_thread::sleep_for(std::chrono::milliseconds(5));
  ready.store(true);

  t0.join();
  t1.join();
  std::cout << output[0] << output[1] << std::endl;
}

int main() {
  for (;;)
    run_test();
}
Don't mind the boilerplate for now. The sleep in run_test and the ready atomic are there to make the two threads start at roughly the same time.
On my x86 processor, 00 is not only possible, it's the dominant output (perhaps because reading is cheaper than writing under cache coherency protocols, which we will cover in later posts). In fact, 01 and 10 combined only account for about 25% of the output.
For some reason 11 is really rare, next to nonexistent, but that is just down to the timing of the execution. If you tell both threads to yield between the write and the read, you will see 11 most of the time.
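For instance (a variation of mine, not part of the original program), a yield between each thread's write and read gives the other thread a chance to complete its store first:
void func0() {
  while (!ready.load()) {
  }

  i = 1;
  std::this_thread::yield();  // let the other thread run its store before we read
  output[1] = j;
}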
Also, to illustrate that this isn't a problem on uniprocessors, we can pin the two threads onto the same core by adding the following snippet right after the threads are created (Linux-specific: pthread_setaffinity_np is a GNU extension and requires #include <pthread.h>). Afterwards you will see the output 00 completely disappear.
  cpu_set_t cpuset;
  CPU_ZERO(&cpuset);
  CPU_SET(0, &cpuset);  // restrict both threads to core 0
  pthread_setaffinity_np(t0.native_handle(), sizeof(cpu_set_t), &cpuset);
  pthread_setaffinity_np(t1.native_handle(), sizeof(cpu_set_t), &cpuset);

How do we fix the broken consistency?

I don't know about you, but when I first saw this result my mind was blown. I knew some RISC processors could be aggressive about relaxing their memory models, but I had always thought my old buddy x86 was a dependable fellow. x86, in fact, implements a memory model called TSO, or total store ordering. As the name suggests, stores (writes) to all locations are globally ordered.
More on relaxed memory models later, but the tl;dr here is that you can imagine TSO placing write results in a buffer called the LSQ, or load-store queue, before they go out to the cache. Each subsequent read will in turn check this queue for an early hit before itself going out to the cache. The reason is that cache access takes much longer than register access, so it's valuable to keep a buffering structure on-core.
But of course the consequence is that your read may bypass previous writes to different locations, as long as it can find a result in the LSQ, sometimes breaking sequential consistency.
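To see why buffered stores open the door to 00, here's a tiny model (a sketch of mine under the store-buffer framing above, not a description of real hardware) of our litmus test. Each CPU's store sits in a private buffer until it drains to shared memory, and a load of the other CPU's variable only sees what has already drained:
#include <cstdio>

int main() {
  // d0: did CPU 0's buffered store to i drain before CPU 1's load of i?
  // d1: did CPU 1's buffered store to j drain before CPU 0's load of j?
  // Under TSO, both answers can be "no" at the same time.
  for (int d0 = 0; d0 <= 1; ++d0) {
    for (int d1 = 0; d1 <= 1; ++d1) {
      int out0 = d0;  // func1 reads i: 1 only if CPU 0's store already drained
      int out1 = d1;  // func0 reads j: 1 only if CPU 1's store already drained
      std::printf("%d%d\n", out0, out1);  // d0 = d1 = 0 yields the 00 outcome
    }
  }
}
Sequential consistency forbids the d0 = d1 = 0 combination, because whichever load runs last must see the other thread's completed store; with store buffers, both loads can execute before either store becomes globally visible.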
The fix is to enforce sequential consistency by removing the reordering. C++ gives you a standard way of doing this. Instead of declaring i and j as plain integers, we wrap them in a std::atomic type, and the reads and writes go through the load() and store() methods.
By default these methods will enforce sequential consistency, even when these variables are accessed across different processors. So our new program will look like this:
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

std::atomic<int> i;
std::atomic<int> j;
int output[2];

std::atomic<bool> ready;

void func0() {
  while (!ready.load()) {
  }

  i.store(1);
  output[1] = j.load();
}

void func1() {
  while (!ready.load()) {
  }

  j.store(1);
  output[0] = i.load();
}

void run_test() {
  i.store(0);
  j.store(0);
  ready.store(false);

  std::thread t0(func0);
  std::thread t1(func1);

  std::this_thread::sleep_for(std::chrono::milliseconds(5));
  ready.store(true);

  t0.join();
  t1.join();
  std::cout << output[0] << output[1] << std::endl;
}

int main() {
  for (;;)
    run_test();
}
As promised, we have now converted i and j into atomics. And voilà, the pattern 00 is eliminated even when running on two cores, bringing us fully back to sequential-consistency land!
Under the hood, the compiler inserts memory barriers so that the processors coordinate to observe a global order on memory operations. Later on, we'll get a chance to look at how these barriers are actually inserted at the assembly level.
This enforcement is quite slow, though. If every variable were protected by memory barriers, it would largely defeat the purpose of having multiple cores. But for now, we see that it solves the problem of correctness.
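For reference, the calls in the program above are shorthand: in the standard library, load() and store() default to std::memory_order_seq_cst, the strongest ordering. The pairs in this sketch are equivalent, and the weaker orderings hinted at by the extra parameter are what the later posts will explore:
#include <atomic>

std::atomic<int> i;

void spelled_out() {
  i.store(1);                                  // same as the line below
  i.store(1, std::memory_order_seq_cst);

  (void)i.load();                              // same as the line below
  (void)i.load(std::memory_order_seq_cst);
}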

Recapitulation and what's to come

Here we have covered:
  1. Sequential consistency is the programming model computer systems strive to deliver. It has always been maintained in uniprocessor programming.
  2. When programming on multiple processors, programmers at times need to enforce sequential consistency explicitly. There are language constructs they can take advantage of.
In the next episode, we'll dig deeper into some of the more relaxed memory models, and how they can be used to maintain the illusion of sequential consistency while extracting as much parallelism from multiple cores as possible.
Disclaimer: Compilation in this article is done with clang++ 10, with compiler flags "-O2 -std=c++20 -lpthread". YMMV as to the execution results.

Written by xylophoon | Xoogler, Ph.D. in computer systems. Startuper in the San Francisco Bay Area.
Published by HackerNoon on 2020/10/19