Parallel Merge Sort in Go

Written by teivah | Published 2018/10/15
Tech Story Tags: golang | software-development | algorithms | coding | parallel-computing

This post is a follow-up to Parallel Merge Sort in Java. In that previous post, we saw a possible implementation of the merge sort algorithm using Java's ForkJoinPool.

In a nutshell, the solution was based on:

  • A thread pool to limit the number of threads
  • A work-stealing strategy in order to optimize the distribution of the workload

Interestingly enough, the parallel solution in Go is based on the same high-level concepts.

Sequential Implementation

The sequential implementation of the merge sort algorithm is the following:

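(The code itself is embedded as a gist in the original post and didn't make it into this text. The sketch below reconstructs a sequential merge sort consistent with the description that follows; the merge helper in particular is an assumption and may differ from the author's version.)

func mergesort(s []int) {
    if len(s) > 1 {
        middle := len(s) / 2
        mergesort(s[:middle])
        mergesort(s[middle:])
        merge(s, middle)
    }
}

// merge merges the two sorted halves s[:middle] and s[middle:] in place,
// using a temporary copy of s as a helper buffer.
func merge(s []int, middle int) {
    helper := make([]int, len(s))
    copy(helper, s)

    helperLeft := 0
    helperRight := middle
    current := 0
    high := len(s) - 1

    for helperLeft <= middle-1 && helperRight <= high {
        if helper[helperLeft] <= helper[helperRight] {
            s[current] = helper[helperLeft]
            helperLeft++
        } else {
            s[current] = helper[helperRight]
            helperRight++
        }
        current++
    }

    // Copy any elements left over in the first half; leftovers in the
    // second half are already in their final position in s.
    for helperLeft <= middle-1 {
        s[current] = helper[helperLeft]
        current++
        helperLeft++
    }
}
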
If you are not familiar with Go, please note that s is a slice (basically, a dynamically-sized version of an array). Each sub-call to mergesort is made by creating another slice which is just a view of the initial slice (and not a copy). The first call is done on the first half of the slice, whereas the second call is done on the second half. At the end, the merge function merges both halves and makes sure that s becomes sorted.

Let’s run a benchmark to give us a starting point. The test runs on an Intel Core i7-7700 at 3.60GHz with 8 virtual CPU cores (thanks to Hyper-Threading) on a slice composed of 1 million elements.

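(The benchmark code is not shown in the article. Here is a rough sketch of how such a benchmark could be written with Go's testing package; the random helper and the BenchmarkSequential name are hypothetical.)

// In a _test.go file, with imports "math/rand" and "testing";
// mergesort is the sequential implementation sketched above.
func random(n int) []int {
    s := make([]int, n)
    for i := range s {
        s[i] = rand.Int()
    }
    return s
}

func BenchmarkSequential(b *testing.B) {
    for i := 0; i < b.N; i++ {
        b.StopTimer()
        s := random(1000000) // build the input outside the timed section
        b.StartTimer()
        mergesort(s)
    }
}
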
benchSequential              avgt    129.8    ms/op

Parallel Implementation

Pseudocode

In pseudocode, the parallel solution could be something like that:

mergesort(s []int) {
    if len(s) > 1 {
        middle := len(s) / 2
        // Create two sub-tasks
        createTask -> mergesort(s[:middle])
        createTask -> mergesort(s[middle:])
        merge(s, middle)
    }
}

Instead of creating one thread per mergesort call, the main idea is to handle parallelism through a thread pool and to distribute the workload. A task wraps the intention of applying mergesort to a sub-part of the slice. Every task then becomes available to the threads of the pool.

Concurrency/Parallelism in Go

In Go, a developer cannot instantiate threads directly. The main primitive for concurrency/parallelism is the goroutine.

A goroutine can be pictured as an application-level thread. Instead of being context-switched on and off a CPU core, a goroutine is context-switched on and off an OS thread. The magic is handled by the Go scheduler, which runs in user space, just like our application.

The main task of any scheduler is to distribute a workload. In Go, the scheduler distributes the goroutines over multiple threads. There are two important points to mention though.

First, it is worth understanding that compared to the OS scheduler, which is preemptive (interrupting every x ms), the Go scheduler is cooperative. It means a goroutine is context-switched off an OS thread only on purpose (for example, in the case of a blocking I/O call). This prevents the multiple context switches that might otherwise occur before a given job completes.

The second point is about the Go scheduling strategy. The two main strategies are the following:

  • Work-sharing: When a processor generates new threads, it attempts to migrate some of them to the other processors with the hopes of them being utilized by the idle/underutilized processors.
  • Work-stealing: An underutilized processor actively looks for other processor’s threads and “steal” some.

— JBD, in Go’s work-stealing scheduler

Go’s scheduler is based on the second strategy, work-stealing. The main benefit is to reduce the frequency of goroutine migration (from one thread/CPU core to another). Migration is only considered when a thread is idle (no more goroutines to handle, or every goroutine is blocked on an I/O call, for example), which has a positive impact on performance.

As a summary:

  • Cooperative system: a goroutine is context-switched off an OS thread deliberately
  • The Go runtime is based on a work-stealing strategy to optimize the distribution of the goroutines

First version

To implement a parallel version of the merge sort algorithm, we need to delegate the management of the two halves to two separate goroutines. Running a goroutine with the go keyword is purely asynchronous, so we need to wait for both goroutines to complete before running merge.

Let’s write a v1:

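(As before, the actual code is a gist in the original post. Here is a sketch consistent with the explanation below; the name parallelMergesortV1 is mine, and it reuses the merge helper from the sequential sketch plus the sync package.)

func parallelMergesortV1(s []int) {
    if len(s) > 1 {
        middle := len(s) / 2

        var wg sync.WaitGroup
        wg.Add(2)

        // Sort each half in its own goroutine.
        go func() {
            defer wg.Done()
            parallelMergesortV1(s[:middle])
        }()
        go func() {
            defer wg.Done()
            parallelMergesortV1(s[middle:])
        }()

        // Wait for both halves to be sorted before merging them.
        wg.Wait()
        merge(s, middle)
    }
}
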
The wait operation is managed by var wg sync.WaitGroup. Before triggering the goroutines, we configure it (via wg.Add(2)) to wait for two wg.Done() calls.

Let’s run a benchmark of this implementation:

benchSequential              avgt    129.8    ms/op
benchParallelv1              avgt    438.8    ms/op

Our v1 is about 3 times slower than the sequential implementation. If you have read the previous post, you may have already noticed that we are probably facing the very same problem described there.

If we have an initial slice of 1024 elements, mergesort will be applied to 512 elements, then 256, then 128, and so on until reaching 1 element. It means that even to run mergesort on a single element, we are going to create a dedicated goroutine.

It’s true that goroutines are more lightweight and faster to start than threads. Yet, creating a goroutine for a small slice and having to wait for the Go runtime to schedule it is way more expensive than calling the sequential implementation in the same goroutine.

Let’s run go trace to check our assumption, looking at a one-millisecond time range:

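(The post doesn’t show how the trace was captured. A common way is the -trace flag of go test, e.g. go test -bench=. -trace=trace.out, then go tool trace trace.out; below is a programmatic sketch using the standard runtime/trace package, reusing the helpers defined above.)

package main

import (
    "log"
    "os"
    "runtime/trace"
)

func main() {
    // Write an execution trace to trace.out, then inspect it with `go tool trace trace.out`.
    f, err := os.Create("trace.out")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    if err := trace.Start(f); err != nil {
        log.Fatal(err)
    }
    defer trace.Stop()

    parallelMergesortV1(random(1000000))
}
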
go trace on a one millisecond time range (v1)

First, we see that all CPU cores are used during the execution, which is a good starting point. Yet, we also see a lot of sticks (the thin rectangles) with blank spaces between them, meaning the CPU cores are not fully dedicated to running our algorithm. The blank spaces are mainly due to the overhead introduced by the large number of context switches (caused by the large number of goroutines).

The tracer also gives us some interesting metrics per goroutine instance:

  • Scheduler wait time: the time a goroutine waits to be scheduled
  • GC pause time: the time spent performing GC operations

It’s worth mentioning that the goroutines analysis UI lists the following goroutines for the v1:

Goroutines analysis

Why is the same function listed twice (suffixed func1 and func2)? Because each call to mergesort triggers two new goroutines (two anonymous functions), so they are listed as two different functions.

Our implementation generates the following results:

Scheduler wait time (func1 and func2): avg 156.7 ms
GC pause time: avg 792.3 ms

Bear in mind that the figures above cannot be directly compared with the benchmarks we ran previously (the 438.8 ms/op). The main reason is that enabling tracing generates overhead (about 60% on my environment). Yet, it will still be interesting to compare those figures with the future versions.

Second version

Let’s define a threshold called max. If the length of the slice is greater than max, we run the parallel implementation; otherwise, we run the sequential one:

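(A sketch of this version, under the same assumptions as the previous snippets; the name parallelMergesortV2 is mine, and max is set to the value discussed just after.)

const max = 1 << 11 // threshold below which we fall back to the sequential sort

func parallelMergesortV2(s []int) {
    if len(s) > 1 {
        if len(s) <= max {
            // Small slice: not worth the cost of extra goroutines.
            mergesort(s)
            return
        }

        middle := len(s) / 2

        var wg sync.WaitGroup
        wg.Add(2)

        go func() {
            defer wg.Done()
            parallelMergesortV2(s[:middle])
        }()
        go func() {
            defer wg.Done()
            parallelMergesortV2(s[middle:])
        }()

        wg.Wait()
        merge(s, middle)
    }
}
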
After a few tests, I figured out that the optimal value for max was something close to 1<<11, which equals 2048. For comparison, Java’s Arrays.parallelSort() sets this threshold to 1<<13.

Let’s benchmark this new version:

benchSequential              avgt    129.8    ms/op
benchParallelv1              avgt    438.8    ms/op
benchParallelv2              avgt     47.3    ms/op

Far better this time: we are almost 3 times faster than the sequential implementation. Let’s check go trace again on a one-millisecond time range:

go trace on a one millisecond time range (v2)

Please note that in this picture we can now see two lines per CPU core:

  • The first line showing the goroutines execution
  • The second line showing the GC operations (the purple rectangles)

We no longer see the thin sticks representing the jobs that were not worth parallelizing. Meanwhile, we drastically reduced the number of blank spaces, which means our solution now uses the CPU cores more efficiently.

Regarding the figures:

// v1
Scheduler wait time (func1 and func2): avg 156.7 ms
GC pause time: avg 792.3 ms
// v2
Scheduler wait time (func1): avg 4 ms
Scheduler wait time (func2): avg 1.3 ms
GC pause time: avg 30.9 ms

A few observations:

  • The average scheduler wait time is drastically smaller, meaning the goroutines spend far less time waiting to be scheduled
  • Scheduling the second call (func2) is 3 times faster than the first call
  • The GC pause time is also drastically smaller, which is the direct consequence of creating far fewer goroutines

Third version

There is one last optimization that can be done regarding goroutine management. In this last version, instead of spawning the two mergesort calls in two separate goroutines, we can spawn only the first call and let the current goroutine handle the second one.

This way, we create one new goroutine per call instead of two, roughly halving the total number of goroutines, and the current goroutine does useful work instead of merely waiting.

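(A sketch of this last version, under the same assumptions; parallelMergesortV3 is again a name of my own.)

func parallelMergesortV3(s []int) {
    if len(s) > 1 {
        if len(s) <= max {
            mergesort(s)
            return
        }

        middle := len(s) / 2

        var wg sync.WaitGroup
        wg.Add(1)

        // Only the first half gets its own goroutine...
        go func() {
            defer wg.Done()
            parallelMergesortV3(s[:middle])
        }()

        // ...while the current goroutine sorts the second half itself.
        parallelMergesortV3(s[middle:])

        wg.Wait()
        merge(s, middle)
    }
}
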
Let’s check the impacts in terms of latency:

benchSequential              avgt    129.8    ms/op
benchParallelv1              avgt    438.8    ms/op
benchParallelv2              avgt     47.3    ms/op
benchParallelv3              avgt     45.7    ms/op

It is a slight improvement, but an improvement anyway.

What does go trace tell us?

go trace on a one millisecond time range (v3)

We can observe that the average size of the purple rectangles (representing the GC operations) is smaller. Let’s confirm this observation with the goroutines analysis:

// v1
Scheduler wait time (func1 and func2): avg 156.7 ms
GC pause time: avg 792.3 ms
// v2
Scheduler wait time (func1): avg 4 ms
Scheduler wait time (func2): avg 1.3 ms
GC pause time: avg 30.9 ms
// v3
Scheduler wait time (func1 only): avg 4 ms
GC pause time: avg 19.9 ms

The scheduler wait time remains equivalent to v2, but the GC pause time is about 35% lower (again a consequence of reducing the number of goroutines).

Conclusion

In Parallel Merge Sort in Java, we showed that the best implementation was based on a thread pool and a work-stealing scheduling strategy.

In Go, as developers, we don’t have direct access to the created threads. Yet, the Go runtime manages a limited number of threads to run our solution, and this number tends towards the number of CPU cores (the default value of GOMAXPROCS). Furthermore, the Go scheduler has been based on work-stealing since Go 1.1.

Despite two different implementations in two different languages, we can see that they both rely on the same high-level concepts (and that’s pretty cool 🙂).

