Concurrency Black-Art: Excerpts from Real-world concurrency

Recently, I was reading a literature about concurrency. I thought of putting the excerpts here.

If you want to read, http://queue.acm.org/detail.cfm?id=1454462 . It’s only 10 pages(short and concise).

Tricks to write multi threading

Know your cold path from hot paths

The hot path is the code that can be executed in parallel and the cold paths is the code that can be executed sequentially without affecting performance

In cold paths, keep the locking as coarse-grained as possible.
In hot paths, locking should be simple and fine grained.
If you don’t know the path, assume it’s a cold path and use coarse-grained locking. Be prepared to be proven wrong.

Intuition is frequently wrong — be data intensive

Once the software(or the prototype) is functional, the time for intuition is ended. Now, data should be gathered to verify the correctness and improvement.

In order to test, you should know how to put load on the system. Consider load generation and simulation as a first class problem; so should handle this during development(the earlier is better). Although a test load should mimic its production equivalent as closely as possible, timeliness is more important than absolute accuracy.
The second step is to understand the scalability inhibitors on a production system which requires the ability to safely dynamically instrument its synchronization primitives. Dynamic instrumentation will provide the data needed to find those parts of the system that are inhibiting scalability. It will also help gather sufficient data to understand which techniques will be suitable for reducing contention.

Know when and when not to break up a lock

The Linux kernel used to have one global lock in the kernel to provide concurrency control required by symmetric multiprocessing (SMP) systems. Eventually it was removed in Linux 2.6. This improved performance a lot.

Python has a giant lock called Global Interpreter Lock which doesn’t allow multiple code to be executed in multiple native threads. It’s still there and developers use multiprocessing to achieve native multi threading. http://programmers.stackexchange.com/a/186909/91286

Global locks can naturally becomes scalability inhibitors, it is reasonable to break up the lock into per-CPU locks, a hash table of locks, etc.
Breaking up a lock is not the only way to reduce contention, and contention can be(and often is) more easily reduced by decreasing the hold time of the lock(by algorithmic optimization or finding activity that is needlessly protected by the lock).

Be wary of readers/writer locks

The mutex is basically a test-and-set operation. The requesting thread checks a flag and if not set, acquires the mutex. Reader/Writer lock has a number(number of writers) which can be updated atomically. The cost is due to atomic update of this number.

Reader/Writer lock doesn’t scale well when lock is not long, because the state associated with the Reader/Writer lock must itself be updated atomically.
In the absence of any sophisticated synchronization primitives, a reader/writer lock will use a single word of memory to store the number of readers. The number of readers must be updated atomically, acquiring the lock as a reader requires the same bus transaction, as acquiring a mutex, and contention on that line can hurt also.
Reader/writer lock should be carefully used when a writer is blocked, reader can’t recursively acquire a lock as a reader and when a writer blocks between initial acquisition as a reader and recursive acquisition as a reader, deadlock will occur.

Consider per-CPU locking

Per-CPU locking can be a convenient technique for diffracting contention, as a per-CPU lock is not likely to be contended.
Let’s say, there are two types of threads with operating modes of different coherence requirements. One has a shorter hold times(and occur frequently) than the other and it can acquire the a per-CPU lock. The other type can grab all the per-CPU locks to construct coherent states. This paper http://dl.acm.org/citation.cfm?id=236156.236174 implements distributed reader writer lock using per-CPU locking.
It should be employed only when the data indicates that it’s necessary, as it introduces substantial complexity in implementation. In cold path, all per-CPU locks should be acquired in a single order.

Know when to broadcast — and when to signal

All condition variable implementation allow threads waiting on the variable to be awakened either via a signal(one thread sleeping on the variable is awakened) or via a broadcast(all threads sleeping on the variable are awakened). Broadcast can cause a performance hit if a thread wakes up and finds the resource is not available and sleeps again. Every time a thread goes to sleep, a call to scheduler is made(expensive due to kernel call).

Broadcast should be used to indicate state change and signal should be used for resource availability.

Learn to debug postmortem

Debugging from a core dump is called postmortem. The running program is modified to record important system state. When the program fails, it generates a core file which is the copy the program state. The programmer can later understand what happened and why the software failed by analyzing this file.

Problems in highly parallel systems are not necessarily reproducible, and a single core dump is often one’s only chance to debug them.
One needs to have only a list of threads, their corresponding stack backtraces, and some knowledge of the system. The information is contained in a snapshot of state(core file).

Design your systems to be composable

One can make locking entirely internal to the subsystem. e.g. in concurrent OS, locking never returns to user level with in-kernel locks held. This model can work whenever a crisp interface exists between software components: as long as control flow is never returned to the caller with locks held, the subsystem will remain composable.
One can achieve concurrency and composability by having no locks whatever. In this case, there must be no global subsystem state — subsystem state must be captured in per-instance state, and it must be up to consumers of the subsystem to assure that they don’t access their instance in parallel. By leaving locking up to the client of the subsystem itself can be used concurrently by different subsystems and in different contexts. I don’t believe in this statement. It’s like keeping the boat in the harbor for safety.

Don’t use a semaphore where a mutex would suffice

A mutex has a notion of ownership: the lock is either owned or not, and if it is owned, it has a known owner. A semaphore has no notion of ownership: when sleeping on a semaphore, one has no way of knowing which thread one is blocking upon.

There is no way of propagating the blocking thread’s scheduling priority to the thread that is in critical section. This ability to propagate scheduling priority — priority inheritance — is critical in realtime system, and in the absence of other protocols, semaphore-based systems will always be vulnerable to priority inversion.
Lack of ownership deprives the system of the ability to make assertions about itself. When ownership is tracked, the machinery that implements thread blocking can detect deadlocks and recursive lock acquisition.

Consider using nonblocking synchronization routines to monitor contention

Many synchronization primitives have a different entry points to specify different behavior if the primitive is unavailable. The default entry point will typically block, whereas an alternate entry point will return an error code instead of blocking. e.g. mutex.lock() and mutex.try_lock() in C++11. Failure to acquire lock in the first case will put the thread in a queue. This is expensive.

The second variant can be used to monitor one’s own contention. The failure indicates there is a contention. This failure indication can be used to dynamically reduce contention.

When reacquiring locks, consider using generation counts to detect state change

When lock ordering become complicated, at times one will need to drop one lock, acquire another, and then reacquire the first. The state protected by the first lock may have changed since the lock was dropped. Verifying this may be problematic.

In this case, a generation count may be associated with the data structure, so every time the data is changed the count is incremented.
The logic that drops and reacquires the lock must cache the generation count before dropping the lock. After reacquisition, the count is compared and the state change logic may react accordingly.

Use wait- and lock-free structures only if absolutely must

Lock-free algorithms are not always practical.

Writing lock-free code is difficult.

Writing correct lock-free code is extremely difficult.

Use lock free structures if locks can’t be used for reasons of correctness.

Prepare for the thrill of victory — and the agony of defeat

A system will not scale until all impediments to scalability have been removed, but it is often impossible to know if the current impediment to scalability is the last one.

Removing the last impediment will give the best performance.
Sometimes after breaking up a complicated lock, you will realize it’s the last one, it’s merely hiding another impediment to scalability. Removing this increases a very little performance or not at all.

Thanks for reading !!!