This post is a follow-up to Parallel Merge Sort in Java. In that previous post, we saw a possible implementation of the merge sort algorithm in Java using `ForkJoinPool`.

In a nutshell, the solution was based on:

- A thread pool to limit the number of threads
- A work-stealing strategy to optimize the distribution of the workload

Interestingly enough, the solution in Go leveraging parallelism is based on the same high-level concepts.

Sequential Implementation

The sequential implementation of the merge sort algorithm is the following:

If you are not familiar with Go, please note that `s` is a slice (basically, a dynamically-sized version of an array). Each sub-call to `mergesort` is made by creating another slice which is just a view of the initial slice (not a copy). The first call is done on the first half of the slice, whereas the second call is done on the second half. At the end, the `merge` function merges both halves and makes sure that `s` becomes sorted.

Let's run a benchmark to give us a starting point. The test is running on an Intel Core i7-7700 at 3.60GHz with 8 virtual CPU cores (because of Hyper-Threading technology) on a slice composed of 1 million elements.

benchSequential avgt 129.8 ms/op

Parallel Implementation

Pseudocode

In pseudocode, the parallel solution could be something like this:

```
mergesort(s []int) {
    if len(s) > 1 {
        middle := len(s) / 2
        // Create two sub-tasks
        createTask -> mergesort(s[:middle])
        createTask -> mergesort(s[middle:])
        merge(s, middle)
    }
}
```

Instead of creating one thread per `mergesort` call, the main idea is to handle parallelism through a thread pool and to distribute the workload. A task wraps the intention of applying `mergesort` on a sub-part of the slice. Every task then becomes available for the threads of the pool.

Concurrency/Parallelism in Go

In Go, a developer cannot instantiate a thread himself. The main primitive for concurrency/parallelism is the goroutine. A goroutine can be pictured as an application-level thread.
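The original embedded code is not reproduced in this extract, so here is a minimal sketch of what the sequential implementation described above could look like. The names `sequentialMergesort` and the in-place `merge` helper are chosen for illustration; they are not necessarily the identifiers used in the original repository.

```go
package main

import "fmt"

// sequentialMergesort sorts s in place. Each sub-call receives a sub-slice,
// which is a view over the same backing array, not a copy.
func sequentialMergesort(s []int) {
	if len(s) > 1 {
		middle := len(s) / 2
		sequentialMergesort(s[:middle])
		sequentialMergesort(s[middle:])
		merge(s, middle)
	}
}

// merge merges the two sorted halves s[:middle] and s[middle:] in place,
// working from a temporary copy of s.
func merge(s []int, middle int) {
	helper := make([]int, len(s))
	copy(helper, s)
	l, r := 0, middle
	for i := range s {
		switch {
		case l == middle: // left half exhausted
			s[i] = helper[r]
			r++
		case r == len(s): // right half exhausted
			s[i] = helper[l]
			l++
		case helper[l] <= helper[r]:
			s[i] = helper[l]
			l++
		default:
			s[i] = helper[r]
			r++
		}
	}
}

func main() {
	s := []int{5, 2, 8, 1, 9, 3}
	sequentialMergesort(s)
	fmt.Println(s) // [1 2 3 5 8 9]
}
```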
Instead of being context-switched on and off a CPU core, a goroutine is context-switched on and off an OS thread. The magic is handled by the Go scheduler, which runs in user space, just like our application.

The main task of any scheduler is to distribute a workload. In Go, the scheduler distributes the goroutines over multiple threads. There are two important points to mention, though.

First, it is worth understanding that compared to the OS scheduler, which is a preemptive system (interruption every x ms), the Go scheduler is cooperative. It means a goroutine will be context-switched off an OS thread only on purpose (for example, in the case of a blocking I/O call). This prevents the multiple context switches which could otherwise occur before a given job completes.

The second point is about the Go scheduling strategy. The two main strategies are the following:

- Work-sharing: when a processor generates new threads, it attempts to migrate some of them to the other processors, with the hope of them being utilized by the idle/underutilized processors.
- Work-stealing: an underutilized processor actively looks for another processor's threads and "steals" some.

Go's scheduler is based on the second strategy: work-stealing. The main benefit of this strategy is to reduce the frequency of goroutine migration (from one thread/CPU core to another). A migration is only considered when a thread is idle (no more goroutines to handle, or every goroutine is blocked due to an I/O call, for example), and this has a positive impact performance-wise.

As a summary:

- Cooperative system: a goroutine is context-switched off an OS thread purposely
- The Go runtime is based on a work-stealing strategy to optimize the distribution of the goroutines

First version

To implement a parallel version of the merge sort algorithm, we need to delegate the management of the two halves to two separate goroutines.
Running a goroutine through the `go` keyword is a purely asynchronous operation, so we need to wait for the completion of both goroutines before running `merge`.

Let's write a v1:

The wait operation is managed by a `var wg sync.WaitGroup`. Before triggering the goroutines, we configure it to wait for two `wg.Done()` calls.

Let's run a benchmark of this implementation:

benchSequential avgt 129.8 ms/op
benchParallelv1 avgt 438.8 ms/op

Our v1 is about 3 times slower than the sequential implementation. If you have read the previous post, you may have already noticed that we are probably facing the very same problem that we described there. If we have an initial slice of 1024 elements, `mergesort` will be applied to 512 elements, then 256, then 128, and so on, until reaching 1 element. It means that even for running `mergesort` on a single element, we are going to create a dedicated goroutine.

It's true that goroutines are more lightweight and faster to start than threads. Yet, creating a goroutine for a small slice and having to wait for the Go runtime to schedule it is way more expensive than calling the sequential implementation in the same goroutine.

Let's run go trace to check our assumption on a one-millisecond time range:

go trace on a one millisecond time range (v1)

First, we see that all CPU cores are used during the execution, which is a good starting point. Yet, we see a lot of thin sticks (the thin rectangles) and, between all of them, blank spaces, meaning the CPU cores are not fully dedicated to running our algorithm. The blank spaces are mainly due to the overhead introduced by a large number of context switches (caused by the large number of goroutines).
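For reference, the v1 described above could be sketched as follows. This is a hedged reconstruction, not the original gist; the function name `parallelMergesortV1` is illustrative, and `merge` is the same in-place merge helper used by the sequential version.

```go
package main

import (
	"fmt"
	"sync"
)

// parallelMergesortV1 spawns one goroutine per half, at every level of the
// recursion, and waits for both before merging.
func parallelMergesortV1(s []int) {
	if len(s) > 1 {
		middle := len(s) / 2

		var wg sync.WaitGroup
		wg.Add(2)
		go func() {
			defer wg.Done()
			parallelMergesortV1(s[:middle])
		}()
		go func() {
			defer wg.Done()
			parallelMergesortV1(s[middle:])
		}()
		wg.Wait()

		merge(s, middle)
	}
}

// merge merges the two sorted halves s[:middle] and s[middle:] in place.
func merge(s []int, middle int) {
	helper := make([]int, len(s))
	copy(helper, s)
	l, r := 0, middle
	for i := range s {
		switch {
		case l == middle:
			s[i] = helper[r]
			r++
		case r == len(s):
			s[i] = helper[l]
			l++
		case helper[l] <= helper[r]:
			s[i] = helper[l]
			l++
		default:
			s[i] = helper[r]
			r++
		}
	}
}

func main() {
	s := []int{5, 2, 8, 1, 9, 3}
	parallelMergesortV1(s)
	fmt.Println(s) // [1 2 3 5 8 9]
}
```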
The tracer also gives us some interesting metrics per goroutine instance:

- Scheduler wait time: the time a goroutine waited before being scheduled
- GC pause time: the time spent performing GC operations

It's worth mentioning that the goroutine analysis UI lists the following goroutines for the v1:

Goroutines analysis

Why do we have the same function listed twice (suffixed by func1 and func2)? Because each call to `mergesort` triggers two new goroutines, so they are listed as two different functions.

Our implementation generates the following results:

Scheduler wait time (func1 and func2): avg 156.7 ms
GC pause time: avg 792.3 ms

Bear in mind that the figures above cannot be directly compared with the benchmarks we ran previously (the 438.8 ms/op). The main reason is that enabling tracing generates an overhead (on my environment, about 60%). Yet, it will still be interesting to compare those figures with the future versions.

Second version

Let's define a threshold called max. If the length of the slice is greater than max, we run the parallel implementation; otherwise, we run the sequential one:

After a few tests, I figured out that the most optimal value for max was something close to 1<<11, which equals 2048. In Java, `Arrays.parallelSort()` sets this threshold to 1<<13.

Let's benchmark this new version:

benchSequential avgt 129.8 ms/op
benchParallelv1 avgt 438.8 ms/op
benchParallelv2 avgt 47.3 ms/op

Far better this time. We are almost 3 times faster than the sequential implementation. Let's check go trace again on a one-millisecond time range:

go trace on a one millisecond time range (v2)

Please note that on this picture we can now see two lines per CPU core:

- The first line showing the goroutine executions
- The second line showing the GC operations (the purple rectangles)

We no longer see the thin sticks representing the jobs which were not interesting to parallelize.
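The v2 guarded by the max threshold, described above, could be sketched like this (again a hedged reconstruction with illustrative names, reusing the same sequential version and merge helper):

```go
package main

import (
	"fmt"
	"sync"
)

// max is the threshold below which we fall back to the sequential version.
const max = 1 << 11 // 2048

// parallelMergesortV2 parallelizes only when the slice is large enough
// to be worth a dedicated goroutine.
func parallelMergesortV2(s []int) {
	if len(s) <= max {
		sequentialMergesort(s)
		return
	}
	middle := len(s) / 2

	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		parallelMergesortV2(s[:middle])
	}()
	go func() {
		defer wg.Done()
		parallelMergesortV2(s[middle:])
	}()
	wg.Wait()

	merge(s, middle)
}

// sequentialMergesort sorts s in place without spawning goroutines.
func sequentialMergesort(s []int) {
	if len(s) > 1 {
		middle := len(s) / 2
		sequentialMergesort(s[:middle])
		sequentialMergesort(s[middle:])
		merge(s, middle)
	}
}

// merge merges the two sorted halves s[:middle] and s[middle:] in place.
func merge(s []int, middle int) {
	helper := make([]int, len(s))
	copy(helper, s)
	l, r := 0, middle
	for i := range s {
		switch {
		case l == middle:
			s[i] = helper[r]
			r++
		case r == len(s):
			s[i] = helper[l]
			l++
		case helper[l] <= helper[r]:
			s[i] = helper[l]
			l++
		default:
			s[i] = helper[r]
			r++
		}
	}
}

func main() {
	// Use a slice larger than max so the parallel path is exercised.
	s := make([]int, 10000)
	for i := range s {
		s[i] = len(s) - i // reverse order
	}
	parallelMergesortV2(s)
	fmt.Println(s[0], s[len(s)-1]) // 1 10000
}
```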
Meanwhile, we drastically reduced the number of blank spaces (which means our solution is using the CPU cores more efficiently).

Regarding the figures:

// v1
Scheduler wait time (func1 and func2): avg 156.7 ms
GC pause time: avg 792.3 ms

// v2
Scheduler wait time (func1): avg 4 ms
Scheduler wait time (func2): avg 1.3 ms
GC pause time: avg 30.9 ms

A few observations:

- The average scheduler wait time is drastically smaller, meaning that the goroutines spend way less time waiting to be scheduled
- Scheduling the second call (func2) is 3 times faster than the first call
- The GC pause time is also drastically smaller, which is the direct consequence of creating way fewer goroutines

Third version

There is a last optimization which can be done regarding goroutine management. In this last version, instead of spawning the `mergesort` calls in two separate goroutines, we can spawn only the first call and let the current goroutine handle the second call.

This way, the number of goroutines triggered becomes linear instead of exponential. Let's check the impact in terms of latency:

benchSequential avgt 129.8 ms/op
benchParallelv1 avgt 438.8 ms/op
benchParallelv2 avgt 47.3 ms/op
benchParallelv3 avgt 45.7 ms/op

It is a slight improvement, but an improvement anyway. What does go trace tell us?

go trace on a one millisecond time range (v3)

We can observe that the average size of the purple rectangles (representing the GC operations) is smaller. Let's confirm this observation with the goroutine analysis:

// v1
Scheduler wait time (func1 and func2): avg 156.7 ms
GC pause time: avg 792.3 ms

// v2
Scheduler wait time (func1): avg 4 ms
Scheduler wait time (func2): avg 1.3 ms
GC pause time: avg 30.9 ms

// v3
Scheduler wait time (func1 only): avg 4 ms
GC pause time: avg 19.9 ms

The scheduler wait time remains equivalent to the v2, but the GC pause time is 35% smaller (which is again a consequence of reducing the number of goroutines).
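The v3 described above, which spawns only one goroutine per level and keeps the second half on the current goroutine, could be sketched as follows (a hedged reconstruction with illustrative names, reusing the same threshold, sequential version, and merge helper):

```go
package main

import (
	"fmt"
	"sync"
)

// max is the threshold below which we fall back to the sequential version.
const max = 1 << 11 // 2048

// parallelMergesortV3 spawns a goroutine only for the first half; the
// current goroutine handles the second half, so the number of goroutines
// grows linearly with the recursion instead of exponentially.
func parallelMergesortV3(s []int) {
	if len(s) <= max {
		sequentialMergesort(s)
		return
	}
	middle := len(s) / 2

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		parallelMergesortV3(s[:middle])
	}()
	parallelMergesortV3(s[middle:]) // second half: current goroutine
	wg.Wait()

	merge(s, middle)
}

// sequentialMergesort sorts s in place without spawning goroutines.
func sequentialMergesort(s []int) {
	if len(s) > 1 {
		middle := len(s) / 2
		sequentialMergesort(s[:middle])
		sequentialMergesort(s[middle:])
		merge(s, middle)
	}
}

// merge merges the two sorted halves s[:middle] and s[middle:] in place.
func merge(s []int, middle int) {
	helper := make([]int, len(s))
	copy(helper, s)
	l, r := 0, middle
	for i := range s {
		switch {
		case l == middle:
			s[i] = helper[r]
			r++
		case r == len(s):
			s[i] = helper[l]
			l++
		case helper[l] <= helper[r]:
			s[i] = helper[l]
			l++
		default:
			s[i] = helper[r]
			r++
		}
	}
}

func main() {
	s := make([]int, 10000)
	for i := range s {
		s[i] = len(s) - i // reverse order
	}
	parallelMergesortV3(s)
	fmt.Println(s[0], s[len(s)-1]) // 1 10000
}
```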
Conclusion

In Parallel Merge Sort in Java, we showed that the best implementation was based on a thread pool and a work-stealing scheduling strategy.

In Go, as developers, we don't have direct access to the created threads. Yet, the Go runtime manages a limited number of threads to run our solution, and this number tends towards the number of CPU cores. Furthermore, the Go runtime scheduling strategy has also been based on work-stealing since Go 1.1.

Despite two different implementations in two different languages, we can see that they both rely on the same high-level concepts (and that's pretty cool 🙂).

Further Reading

- teivah/golang-parallel-mergesort (github.com)
- Go's work-stealing scheduler (rakyll.org)
- Scheduling In Go - Part II (www.ardanlabs.com)
- Go code refactoring: the 23x performance hunt (medium.com)
- Parallel Merge Sort in Java (hackernoon.com)