Breaking Axioms in Program Execution

Written by nekto0n | Published 2023/10/24

TL;DR: The author, a seasoned software engineer, shares insights from their journey from sequential code to distributed systems. They emphasize that embracing non-serialized execution, multi-threading, and distributed computing can lead to improved performance and resilience. While it introduces complexity, it is a journey of discovery and enhanced capabilities in software development.

Making new mistakes

I’ve been a software engineer for about 15 years now. Throughout my career I have learned a lot and applied these learnings to design and implement (and occasionally phase out or leave as-is) many distributed systems. Along the way I made numerous mistakes and still keep making them. But since my main focus has been reliability, I’ve been looking back at my experience and at the community to find ways to minimize the frequency of errors. My motto is: we must absolutely try making new mistakes (less obvious, more sophisticated ones). Making a mistake is fine - that is how we learn; repeating one is sad and discouraging.

That’s probably what has always fascinated me about mathematics. Not only because it is elegant and concise, but because its logical rigor prevents mistakes. It forces you to think about your current context and about which postulates and theorems you can rely on. Following these rules proves fruitful: you get the correct result. It is true that computer science is a branch of mathematics. But what we usually practice is software engineering, a very distinct thing. We apply computer science achievements and discoveries to practice, accounting for time constraints and business needs. This blog is an attempt to apply semi-mathematical reasoning to the design and implementation of computer programs. We’ll put forward a model of different execution regimes that provides a framework for avoiding many programming errors.

From humble beginnings

When we learn to program and take our first tentative (or daring) steps, we usually start with something simple:

  • programming loops, doing basic arithmetic and printing the results in a terminal
  • solving math problems, probably in a specialized environment such as MathCAD or Mathematica

We acquire muscle memory, learn language syntax, and, most importantly, we change the way we think and reason. We learn to read code and make assumptions about how it is being executed. We almost never start by reading a language standard and closely studying its “Memory model” section - because we are not yet equipped to fully appreciate and make use of it. We practice trial and error: we do introduce logical and arithmetic bugs in our first programs. These mistakes teach us to check our assumptions: is this loop invariant correct, can we compare an array element’s index and the array length this way (where do you put that -1)? But if we don’t see certain kinds of errors, we often implicitly internalize invariants that the system enforces and provides for us.

Namely this one:

Lines of code are always evaluated in the same order (serialized).

This postulate allows us to assume that the following propositions are true (we are not going to prove them):

  • evaluation order does not change between executions
  • function calls always return

Mathematical axioms allow us to derive and build larger structures on a solid basis. In mathematics, we have Euclidean geometry with its 4+1 postulates. The last one says:

parallel lines stay parallel, they do not intersect or diverge

For millennia mathematicians tried to prove it by deriving it from the first four. It turns out that is not possible. We can replace this “parallel lines” postulate with alternatives and get different kinds of geometries (namely, hyperbolic and elliptic), which open up new prospects and turn out to be applicable and useful. After all, our own planet’s surface is not flat, and we have to account for that, e.g., in GPS software and airplane routes.

The need for change

But before that, let’s stop and ask the most engineering question of all: why bother? If the program does its job, and is easy to support, maintain, and evolve, why should we give up this cozy invariant of predictable sequential execution in the first place?

I see two answers. The first one is performance. If we can make our program run twice as fast or, equivalently, require half the hardware, that is an engineering achievement. If, using the same amount of computational resources, we can grind through 2x (or 3, 4, 5, 10x) the data, it may open up completely new applications of the same program. It may run on a mobile phone in your pocket instead of a server. Sometimes we can achieve speedups by applying clever algorithms or by rewriting in a more performant language. These are our first options to explore, yes. But they have a limit. Architecture almost always beats implementation. Moore’s law has not been doing that well lately; a single CPU’s performance is growing slowly, and RAM performance (latency, mainly) is lagging behind. So, naturally, engineers started looking for other options.

The second consideration is reliability. Nature is chaotic: the second law of thermodynamics constantly works against anything precise, sequential, and repeatable. Bits flip, materials degrade, power browns out, and wires get cut, preventing our programs’ execution. Keeping up the sequential and repeatable abstraction becomes a hard job. If our programs could outlive software and hardware failures, we could deliver services with a competitive business advantage - that’s another engineering task we can start addressing.

Equipped with these goals, we can start experimenting with non-serialized approaches.

Threads of execution

Let’s look at this chunk of pseudocode:

```
def fetch_coordinates(poi: str) -> Point:
    ...

def find_pois(center: Point, distance: int) -> List[str]:
    ...

def get_my_location() -> Point:
    ...

def main():
    me = get_my_location()
    for point in find_pois(me, 500):
        loc = fetch_coordinates(point)
        sys.stdout.write(f"Name: {point} is at x={loc.x} y={loc.y}\n")
```

We can read the code top-to-bottom and reasonably assume that the `find_pois` function will be called after `get_my_location`. And we’ll fetch and print the coordinates of one POI before fetching those of the next. These assumptions are correct and allow us to build a mental model and reason about the program.

Let’s imagine we can make our code execute non-sequentially. There are many ways we could do that syntactically. We’ll skip experiments with statement reordering (that’s what modern compilers and CPUs do) and extend our language so that we’re able to express a new function execution regime: concurrently, or even in parallel, with respect to other functions. Rephrasing: we need to introduce multiple threads of execution. Our program’s functions execute in a specific environment (crafted and maintained by the OS); right now we’re interested in addressable virtual memory and a thread - a scheduling unit, something that can be executed by a CPU.

Threads come in different flavours: a POSIX thread, a green thread, a coroutine, a goroutine. The details differ greatly, but it boils down to something that can be executed. If several functions can run concurrently, each needs its own scheduling unit. That is where multi-threading comes from: instead of one, we have several threads of execution. Some environments (MPI) and languages can create threads implicitly, but usually we have to do this explicitly, using `pthread_create` in C, the `threading` module classes in Python, or a simple `go` statement in Go. With some precautions we can make the same code run mostly in parallel:

```
def fetch_coordinates(poi: str, results: list, idx: int) -> None:
    ...
    results[idx] = loc  # store the fetched Point (from the elided code above) in this POI's slot

def main():
    me = get_my_location()
    points = find_pois(me, 500)
    results = [None] * len(points)  # reserve a slot for each result
    threads = []
    for i, point in enumerate(points):  # i - index of the result slot
        thr = threading.Thread(target=fetch_coordinates, args=(point, results, i))
        thr.start()
        threads.append(thr)
    for thr in threads:
        thr.join()  # wait for every fetch to finish
    for point, loc in zip(points, results):
        sys.stdout.write(f"Name: {point} is at x={loc.x} y={loc.y}\n")
```
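As an aside, modern Python would usually express the same fan-out with the standard `concurrent.futures` thread pool. Here is a minimal sketch under the same assumptions, reusing the single-argument `fetch_coordinates` from the first snippet:

```
from concurrent.futures import ThreadPoolExecutor

def main():
    me = get_my_location()
    points = find_pois(me, 500)
    # pool.map preserves input order, so results line up with points
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(fetch_coordinates, points))
    for point, loc in zip(points, results):
        sys.stdout.write(f"Name: {point} is at x={loc.x} y={loc.y}\n")
```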


We achieved our performance goal: our program can run on multiple CPUs, scales as the number of cores grows, and finishes sooner. The next engineering question we must ask is: at what cost?

We intentionally gave up on serialized and predictable execution, and with it an invariant we used to enjoy: in the sequential regime, at each point in time there was a single mapping between the running function and its data. Now there is no such bijection between a function plus a point in time and the data.

The next consequence is that one function can finish before another on this run, and the other way around on the next. This new regime of execution leads to data races: when concurrent functions work with the same data, the order of operations applied to that data is undefined. We start encountering data races and learn to deal with them using (see the sketch after this list):

  • critical sections: mutexes (and spinlocks)
  • lock-free algorithms (the simplest form is in the snippet above)
  • race detection tools
  • etc.
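To make the hazard concrete, here is a minimal sketch (a hypothetical counter example, not from the snippets above) of a data race and its classic fix with a mutex-guarded critical section:

```
import threading

counter = 0
lock = threading.Lock()

def unsafe_increment() -> None:
    global counter
    for _ in range(100_000):
        # read-modify-write: two threads may interleave between the
        # read and the write and silently lose updates
        counter += 1

def safe_increment() -> None:
    global counter
    for _ in range(100_000):
        with lock:  # critical section: one thread mutates at a time
            counter += 1

threads = [threading.Thread(target=safe_increment) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # always 400000 with the lock; may be less with unsafe_increment
```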

At this point, we discover at least two things. First, there are multiple ways to access data. Some data is local (e.g., function-scoped variables): only we can see and access it, and thus it is always in the state we left it in. However, some data is shared or remote. It still resides in our process memory, but we use special ways to access it, and it might get out of sync. In some cases, to work with it, we copy it into our local memory to avoid data races - that’s why `.clone()` is popular in Rust.

When we continue this line of reasoning, other techniques like thread-local storage come up naturally. We’ve just acquired a new gadget in our programming toolbelt, expanding what we can achieve by building software.
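As a minimal sketch of that gadget in Python (standard library only; the "connection" attribute here is just a hypothetical placeholder), `threading.local()` gives every thread its own private copy of an attribute, so no synchronization is needed:

```
import threading

tls = threading.local()  # each thread gets its own `tls.connection`

def handle_request(request_id: int) -> None:
    # Lazily create a per-thread "connection"; threads never share it.
    if not hasattr(tls, "connection"):
        tls.connection = f"conn-{threading.get_ident()}"
    print(f"request {request_id} served via {tls.connection}")

threads = [threading.Thread(target=handle_request, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```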

However, there’s an invariant we can still rely on. When we reach out for shared (remote) data from a thread, we always get it. There is no situation in which some memory chunk is unavailable. The OS will terminate all participants (threads) by killing the process if the backing physical memory region malfunctions. The same applies to “our” thread holding a locked mutex: there is no way we might lose the lock and have to stop what we’re doing immediately. We can rely on this invariant (enforced by the OS and modern hardware): all participants are either dead or alive. They all share the same fate: if the process (OOM), the OS (kernel bug), or the hardware encounters a problem, all our threads cease to exist together, without external leftover side effects.

Inventing a process

One important thing to note: how did we make this first step of introducing threads? We separated, we forked. Instead of having one scheduling unit, we introduced multiple. Let’s keep applying this unsharing approach and see how it goes. This time we copy the process’s virtual memory. That is called spawning a process. We can run another instance of our program or start another existing utility. This is a great approach to:

  • reuse other code with strict boundaries
  • run untrusted code, isolating it from our own memory

Almost all modern browsers work this way, so they are able to run untrusted JavaScript code downloaded from the Internet and terminate it reliably when you close a tab, without bringing down the whole application.
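Here is a minimal sketch of that unsharing step, using Python’s standard `multiprocessing` module (a hypothetical example, not from the original snippets). The child works on a copy of the parent’s memory, so its mutations never leak back:

```
import multiprocessing

def child(data: list) -> None:
    data.append("child-only")  # mutates the child's copy only
    print("child sees:", data)

if __name__ == "__main__":
    data = ["created-before-spawn"]
    p = multiprocessing.Process(target=child, args=(data,))
    p.start()
    p.join()
    print("parent sees:", data)  # unchanged: memory was not shared
```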

This is yet another regime of execution we discovered: by giving up the shared-fate invariant, unsharing virtual memory, and making a copy. Copies are not free:

  • the OS needs to manage memory-related data structures (to maintain the virtual -> physical mapping)
  • bits that could have been shared are duplicated, so the processes consume additional memory

Breaking free

Why stop here? Let’s explore what else we can copy and distribute our program among. But why go distributed in the first place? In many cases, the tasks at hand can be solved using a single machine.

We need to go distributed to escape the shared-fate tenet, so that our software stops depending on the inevitable problems the underlying layers encounter.

To name a few:

  • OS upgrades: from time to time we need to reboot our machines

  • Hardware failures: they happen more often than we’d like

  • External failures: power and network outages are a thing.

If we copy an OS, we call that a virtual machine; we can run customers’ programs on a physical one and build a huge cloud business on it. If we take two or more computers and run our program on each, it can outlive even a hardware failure, providing 24/7 service and gaining a competitive advantage. Large corporations went even further long ago: now Internet giants run copies in different datacenters and even on different continents, thus making a program resilient to a typhoon or a simple power outage.

But this independence comes at a price: the old invariants are no longer enforced; we’re on our own. No worries, though - we are not the first ones here. There are plenty of techniques, tools, and services to help us out.
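As one tiny illustration (a hypothetical sketch with made-up replica URLs, not a production pattern), even a naive client can outlive a single machine’s failure by retrying against another copy of the service:

```
import urllib.request

# Hypothetical replicas of the same service in two failure domains.
REPLICAS = [
    "http://replica-a.example.com",
    "http://replica-b.example.com",
]

def fetch_with_failover(path: str) -> bytes:
    last_error = None
    for base in REPLICAS:
        try:
            with urllib.request.urlopen(base + path, timeout=1.0) as resp:
                return resp.read()  # first healthy replica wins
        except OSError as err:  # connection refused, timeout, DNS failure
            last_error = err
    raise RuntimeError("all replicas failed") from last_error
```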


Takeaways

We’ve just gained the ability to reason about systems and their respective execution regimes. Inside every large scale-out system, most parts are familiar sequential and stateless code, many components are multi-threaded with their memory types and hierarchies, and it is all held together by a mix of some truly distributed parts.

The goal is to be able to distinguish where we currently are, which invariants hold, and to act (modify/design) accordingly. We highlighted the basic reasoning, transforming “unknown unknowns” into “known unknowns”. Don’t take it lightly - this is significant progress.

