In the previous posts (first, second) we implemented the Floyd-Warshall algorithm (in four variants) and a route reconstruction algorithm. In those posts we made our way through the basics of the all-pairs shortest path problem, data representation in memory, parallelism, vectorisation and how we can adapt algorithms to the specifics of the data.

In this post we will continue our journey and explore a new, more efficient way to solve the all-pairs shortest path problem. However, this time, besides employing the vector and parallel capabilities of the CPU, we will also employ the L1, L2 and L3 caches.

Sounds interesting? Then let's write some code 🙂

Well, not so fast - before exploring the theory and code behind the algorithm, I suggest spending a few moments revising basic information about "CPU caches".

If you have a solid understanding of the topic, you are free to skip it. If you decide to read it (because you are just curious or you want to revise the basics), please take into account that the following information is an oversimplified overview.

- Author's note

What is cache?

You have most probably heard something about the CPU cache, at least in advertisements for new Intel or AMD chips or during hardware discussions with friends or colleagues. Here, we aren't going to discuss what the CPU cache exactly is or who implements it better: Intel, AMD or Apple.

Instead, we will try to build a mental model of the cache and learn how to use it to:

… better understand the limits we can or can't push when writing high-performance code in C# and .NET.
… better understand application behaviour in high-performance scenarios on different platforms, with different parameters.
… brag in front of our friends when someone says "cache".

So, let's start from the basics… what the heck is a cache?

Imagine yourself sitting in an empty apartment in front of a PC. You are working on a new feature and you are doing great. You are blazing fast, making no errors and no mistakes. You are in a perfect flow.

You are the CPU.

Periodically, you need to consult documentation. All the documentation you need is available in the library across the road.

The library is the main memory.

You are free to enter the library anytime and read everything you need. However, every visit to the library takes time and prevents you from advancing your work, which doesn't sound productive. To improve the situation, you started to write down pieces of the documentation on sticky notes and put them on the display.

A sticky note on the display is the first level cache.

Now, when you need something from the documentation, you first look at the sticky notes, and only if the information isn't there do you have to visit the library.

When you find what you need on a sticky note - it is called a cache hit, otherwise - a cache miss.

Having sticky notes on the display reduced the time spent on "library visits" but didn't remove it completely, because you can't know in advance what documentation you will need. There is a natural limit on how much useful information you can bring back from the library.

The amount of useful information you can bring from the library is called a cache line.

You can't have millions of sticky notes on the display, so when there is not enough space you have to throw some of them into the trash.

When you throw away a sticky note because of space limitations - it is called an eviction from cache.

To stop throwing useful information away (and stop visiting the library when you need it again), you bought yourself a "small box".
Now, when you bring "new" sticky notes from the library and need to free up some space on the display, instead of throwing "old" sticky notes into the trash you put them into the "small box". This way, if you need this information once again, you will be able to pick it up from the "small box", put it back on the display and save yourself a library visit.

The "small box" is the second level cache.

The "small box" can fit many more sticky notes than the display, however, it is not unlimited. Therefore, when some sticky notes don't fit in the "small box", you still have to throw them in the trash. That is why you decided to buy a "huge box". Now, when you bring new sticky notes from the library, you first make copies of them, put these copies into the "huge box" and only then put the originals on the display. If you need to replace sticky notes on the display, you, as previously, put them into the "small box", and if the "small box" is overloaded you throw them into the trash. However, this time if you need this information once again - you just make a copy of the sticky note from the "huge box" and put it back on the display.

The "huge box" is the third level cache (aka LLC).

The "huge box" doesn't eliminate "library visits", however, if the size of the "huge box" is considerably large, then you will end up having a lot of related pieces of documentation in it, which in turn will allow you to do a significant amount of work without interruption.

That is basically it. The above model isn't full nor ideal but, at least I hope so, it clearly illustrates the main idea of why modern chips have multiple levels of cache - to reduce the number of round trips to main memory and keep the CPU busy as long as possible.

What might not be so obvious from the model is that all caches (L1 "sticky notes on a display", L2 "small box" and L3 "huge box") have different access times (aka latency) [1] and sizes. Here is an example for the Intel i7-6700 (Skylake) CPU:

Cache | Access time (aka Latency) | Size
L1    | 4 cycles                  | 64 KB
L2    | 12 cycles                 | 256 KB
L3    | 42 cycles                 | 8 MB

Table 1. L1, L2 and L3 cache access times and sizes of the Intel i7-6700 (Skylake) processor.

Please note, the above numbers are just one example. Different CPUs have different characteristics. However, the overall relation more or less remains the same - L1 is the smallest but also the fastest one, L2 is somewhere in between, and L3 is the largest but also the slowest one.

- Author's note

To truly understand these numbers, it is critical to know how much time it takes to access main memory. On the same Intel i7-6700 (Skylake) CPU, accessing main memory takes around 42 cycles + 51 nanoseconds. You can notice that, besides cycles, we also have a fixed interval of "51 nanoseconds" here. This isn't a mistake, because cycles can be converted (approximately) to nanoseconds using a very simple formula:

    time (ns) ≈ cycles / frequency (GHz)

In our example, assuming a 4 GHz clock, it would be:

    4 cycles / 4 GHz = 1 ns

Converting the access time values in the previous table from cycles to nanoseconds gives us the following:

Element | Access time (aka Latency)
L1      | 1 ns
L2      | 3 ns
L3      | 10.5 ns
Memory  | 10.5 ns + 51 ns = 61.5 ns

Table 2. L1, L2, L3 cache and main memory approximate access times in nanoseconds of the Intel i7-6700 (Skylake) processor.

Even taking into account that this is a very approximate conversion, which ignores a lot of details, it still looks stunning - accessing the L1 cache is about 60 times faster than accessing main memory.
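To make the conversion concrete, here is a tiny C# snippet (purely illustrative, not part of the benchmark code) that applies the formula above to the latencies from Table 1. The 4 GHz clock frequency is an assumption chosen only so the output matches Table 2:

using System;

static class CacheLatencyExample
{
    static void Main()
    {
        const double frequencyGhz = 4.0; // assumed clock frequency, for illustration only

        // Latencies from Table 1, in cycles
        var latencies = new (string Level, double Cycles)[]
        {
            ("L1", 4), ("L2", 12), ("L3", 42)
        };

        foreach (var (level, cycles) in latencies)
        {
            // time (ns) ≈ cycles / frequency (GHz)
            Console.WriteLine($"{level}: {cycles / frequencyGhz} ns");
        }

        // Main memory: L3 latency in cycles plus the fixed ~51 ns
        Console.WriteLine($"Memory: {42 / frequencyGhz + 51} ns");
    }
}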
Of course, in real applications it isn't so simple, and cache-related optimisations don't result in 60x performance improvements. However, as I hope we will see in this post, they still can significantly improve application performance.

A reader familiar with the topic will surely notice that I haven't mentioned anything about cache modes (exclusive and inclusive), nor anything related to instruction caches and other things like memory alignment, hardware and software prefetch, and so on. I did that intentionally for simplicity. However, if you just stumbled on this note and want to know more, you can start from this Wikipedia page, and if you want to start thinking about low-level performance, I can recommend a book by Denis Bakhvalov and a paper by Ulrich Drepper.

- Author's note

Blocked Floyd-Warshall algorithm

The blocked version of the Floyd-Warshall algorithm (aka Blocked Floyd-Warshall) was introduced in [2]. The algorithm works with a graph G of vertices V, represented as a weight matrix W. It splits the matrix into blocks of equal size (creating a new matrix of blocks B) and uses these blocks to calculate all pairs of shortest paths between all vertices in the graph.

This definition doesn't sound obvious, so here is an example. Imagine an 8×8 weight matrix W (see Picture 1a). We split the matrix into 2×2 blocks to create a new 4×4 matrix of blocks B (see Picture 1b).

Every block of the matrix B includes a part of all paths of the matrix W, for instance (see Picture 2):

B[0,0] includes paths from vertices 0 and 1 to vertices 0 and 1
B[0,3] includes paths from vertices 0 and 1 to vertices 6 and 7
B[2,0] includes paths from vertices 4 and 5 to vertices 0 and 1
… and B[2,3] includes paths from vertices 4 and 5 to vertices 6 and 7

All vertices represented by these four blocks are tightly related. Just by looking at the matrix B, it is noticeable that you can move from vertex 4 to 0, then from 0 to 7, and by doing so find a path from 4 to 7 through vertex 0, which can then be compared with the existing path from 4 to 7 and, if it is shorter, used to replace it (see Picture 3).

If you remember the Floyd-Warshall algorithm, these manipulations must sound extremely familiar because they resemble what it does:

algorithm FloydWarshall(W) do
  for k = 0 to N - 1 do
    for i = 0 to N - 1 do
      for j = 0 to N - 1 do
        // Select the shortest path between the existing path from i to j and
        // a new path from i to k to j, and replace the existing path
        // with the smaller one
        //
        W[i, j] = min(W[i, j], W[i, k] + W[k, j])
      end for
    end for
  end for
end algorithm

// where W - is a weight matrix of N x N size,
//       min() - is a function which returns the lesser of its arguments

However, the Floyd-Warshall algorithm works with the matrix W (not B). Fortunately, it can be easily rewritten into a Procedure which accepts three blocks B1, B2 and B3 instead of the matrix W:

function Procedure(B1, B2, B3) do
  for k = 0 to L - 1 do
    for i = 0 to L - 1 do
      for j = 0 to L - 1 do
        B1[i, j] = min(B1[i, j], B2[i, k] + B3[k, j])
      end for
    end for
  end for
end function

// where B1, B2 and B3 - are blocks from a block matrix B,
//       L - is the size of the blocks (B1, B2 and B3 are L x L in size),
//       min() - is a function which returns the lesser of its arguments

If we invoke the Procedure with B1 = B[2,3], B2 = B[2,0] and B3 = B[0,3], it will recalculate all paths from vertices 4, 5 to vertices 6, 7 through vertices 0, 1.
The important point here is that while the Procedure indeed recalculates paths from vertices 4, 5 to vertices 6, 7 (through 0 and 1), these paths aren't guaranteed to be THE SHORTEST because the recalculation relies on the existing paths stored in blocks B[2,0] and B[0,3].

Luckily, prior to recalculating the paths in block B[2,3], we can recalculate the paths in blocks B[2,0] and B[0,3] using … the same Procedure but with different input parameters (see Picture 4):

To recalculate B[2,0] we invoke the Procedure with B1 = B[2,0], B2 = B[2,0] and B3 = B[0,0]. This recalculates all paths from vertices 4, 5 to vertices 0, 1 through 0 and 1.
To recalculate B[0,3] we invoke the Procedure with B1 = B[0,3], B2 = B[0,0] and B3 = B[0,3]. This recalculates all paths from vertices 0, 1 to vertices 6, 7 through 0 and 1.

You might have already noticed - the recalculation of blocks B[2,0] and B[0,3] relies on the data from block B[0,0] in the same way as the recalculation of block B[2,3] relies on blocks B[2,0] and B[0,3].

Luckily (again), prior to recalculating the paths in blocks B[2,0] and B[0,3], we can recalculate the paths in B[0,0] using … the same Procedure but (again) with different input parameters (see Picture 5):

To recalculate B[0,0] we invoke the Procedure with B1 = B[0,0], B2 = B[0,0] and B3 = B[0,0]. This recalculates all paths from vertices 0, 1 to vertices 0, 1 through 0 and 1.

It might look a bit surprising (recalculating a block using the block itself) but if you remember, we inferred the code of the Procedure from the Floyd-Warshall algorithm, which means that when all input parameters are set to the same block, the Procedure completely resembles the Floyd-Warshall algorithm and recalculates the paths within the block:

function Procedure(B, B, B) do
  for k = 0 to L - 1 do
    for i = 0 to L - 1 do
      for j = 0 to L - 1 do
        B[i, j] = min(B[i, j], B[i, k] + B[k, j])
      end for
    end for
  end for
end function

In combination, the process of recalculating the paths from vertices 4, 5 to vertices 6, 7 through vertices 0 and 1 is the following:

Invoke the Procedure with B1 = B[0,0], B2 = B[0,0] and B3 = B[0,0] to calculate all paths from vertices 0, 1 to vertices 0, 1 (represented by block B[0,0]) through vertices 0 and 1 (see Picture 6a).
Invoke the Procedure with B1 = B[0,3], B2 = B[0,0] and B3 = B[0,3] to calculate all paths from vertices 0, 1 to vertices 6, 7 (represented by block B[0,3]) through vertices 0 and 1 (see Picture 6b).
Invoke the Procedure with B1 = B[2,0], B2 = B[2,0] and B3 = B[0,0] to calculate all paths from vertices 4, 5 to vertices 0, 1 (represented by block B[2,0]) through vertices 0 and 1 (see Picture 6c).
Invoke the Procedure with B1 = B[2,3], B2 = B[2,0] and B3 = B[0,3] to calculate all paths from vertices 4, 5 to vertices 6, 7 (represented by block B[2,3]) through vertices 0 and 1 (see Picture 6d).

In code, these steps can be represented by four sequential invocations of the Procedure:

Procedure(B[0,0], B[0,0], B[0,0])
Procedure(B[0,3], B[0,0], B[0,3])
Procedure(B[2,0], B[2,0], B[0,0])
Procedure(B[2,3], B[2,0], B[0,3])

Interestingly, by slightly adjusting the above code we can recalculate the paths from vertices 4, 5 to vertices 2, 3 through 0 and 1 (all we need to do is replace B[0,3] with B[0,1]):

Procedure(B[0,0], B[0,0], B[0,0])
Procedure(B[0,1], B[0,0], B[0,1])
Procedure(B[2,0], B[2,0], B[0,0])
Procedure(B[2,1], B[2,0], B[0,1])

… and if we replace B[0,1] with B[0,2] … I think you got the idea.
We can continue and calculate all paths between all vertices through vertices 0 and 1 (see Picture 7). In code, the implementation is just a repetition of the same steps for different blocks:

// Recalculate block B[0,0] (aka the "diagonal" block)
//
Procedure(B[0,0], B[0,0], B[0,0])
//
// Recalculate blocks B[0,1], B[0,2] and B[0,3] (aka "horizontal" blocks)
//
for i = 1 to 3 do
  Procedure(B[0,i], B[0,0], B[0,i])
end
//
// Recalculate blocks B[1,0], B[2,0] and B[3,0] (aka "vertical" blocks)
//
for i = 1 to 3 do
  Procedure(B[i,0], B[i,0], B[0,0])
end
//
// Recalculate blocks:
//   B[1,1], B[1,2], B[1,3],
//   B[2,1], B[2,2], B[2,3],
//   B[3,1], B[3,2], B[3,3]
// (aka "peripheral" blocks)
//
for i = 1 to 3 do
  for j = 1 to 3 do
    Procedure(B[i,j], B[i,0], B[0,j])
  end
end

Plain and simple. This code recalculates all paths between all vertices through vertices 0 and 1. To recalculate the remaining combinations of vertices we need to … repeat the above algorithm, but with blocks B[1,1], B[2,2] and B[3,3] as the "diagonal" blocks (see Picture 8).

In code, the complete Blocked Floyd-Warshall algorithm looks surprisingly simple:

algorithm BlockedFloydWarshall(B) do
  // Iterate over all "diagonal" blocks
  //
  for m = 0 to M - 1 do
    // Recalculate the "diagonal" block
    //
    Procedure(B[m,m], B[m,m], B[m,m])
    //
    // Recalculate "horizontal" blocks
    //
    for i = 0 to M - 1 do
      //
      // Here, we jump over the "diagonal" block
      //
      if i != m then
        Procedure(B[m,i], B[m,m], B[m,i])
      end
    end
    //
    // Recalculate "vertical" blocks
    //
    for i = 0 to M - 1 do
      //
      // Here, we jump over the "diagonal" block
      //
      if i != m then
        Procedure(B[i,m], B[i,m], B[m,m])
      end
    end
    //
    // Recalculate "peripheral" blocks
    //
    for i = 0 to M - 1 do
      //
      // Here, we jump over all the "horizontal" blocks
      //
      if i != m then
        for j = 0 to M - 1 do
          //
          // Here, we jump over all the "vertical" blocks
          //
          if j != m then
            Procedure(B[i,j], B[i,m], B[m,j])
          end
        end
      end
    end
  end
end algorithm

// where B - is a block matrix of M x M size

The algorithm iterates over all "diagonal" blocks and sequentially recalculates all paths between all vertices through the vertices of the "diagonal" block. Eventually, the algorithm recalculates all paths between all vertices through all vertices.

That is why both algorithms have the same space and time complexities of O(n²) and O(n³) respectively.

- Author's note

Now that we are done with the theory, it is time to implement it in C#.

Experimental Environment

All experiments in this post are executed on a single randomly generated complete graph of 4800 vertices. The experimental machine was equipped with two Intel Xeon CPU E5-2620 v4 processors (2.10 GHz, 16 physical and 32 logical cores in total) running the Windows Server 2019 operating system.

We use a block size of 120×120 because it has been found to demonstrate the best performance characteristics on the experimental hardware [3]. The code was compiled and executed using .NET SDK 8.0 and .NET Runtime 8.0 (x64, AVX2). You can find all the code in the GitHub repository.

All implementations of the Floyd-Warshall algorithm in this post are identical to the implementations from the previous post (except for minor variable and method name changes).
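For reference, since the results below compare against it, here is a minimal sketch of what the plain (non-blocked) Floyd-Warshall implementation over a row-major weight matrix looks like. This is only an illustration of its shape; the exact implementation used in the benchmarks lives in the previous post and in the repository:

static void FloydWarshall(int[] W, int count)
{
    // W is a row-major count x count weight matrix of a complete graph,
    // so every W[i * count + j] holds a finite weight and the addition
    // below cannot overflow
    for (var k = 0; k < count; ++k)
    {
        for (var i = 0; i < count; ++i)
        {
            for (var j = 0; j < count; ++j)
            {
                // Existing path "i -> j" vs. new path "i -> k -> j"
                var distance = W[i * count + k] + W[k * count + j];
                if (W[i * count + j] > distance)
                {
                    W[i * count + j] = distance;
                }
            }
        }
    }
}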
Basic Implementation (aka Baseline)

We start the C# implementation with the Procedure:

static void Procedure(
    Span<int> B1,
    Span<int> B2,
    Span<int> B3,
    int block_size)
{
    for (var k = 0; k < block_size; ++k)
    {
        for (var i = 0; i < block_size; ++i)
        {
            for (var j = 0; j < block_size; ++j)
            {
                var distance = B2[i * block_size + k] + B3[k * block_size + j];
                if (B1[i * block_size + j] > distance)
                {
                    B1[i * block_size + j] = distance;
                }
            }
        }
    }
}

The code is almost an exact translation of the previously demonstrated pseudo-code into C#. The only noticeable difference is the representation of every input block as a combination of a Span<T> and the size of the block.

If you are a bit confused about how this is possible and why we access an array by multiplying things, don't be, I will explain it in a moment.

- Author's note

The algorithm itself looks just a bit different from the pseudo-code (we denote this implementation as "Baseline"):

private static void Baseline(
    int[] B,
    int block_count,
    int block_size)
{
    var lineral_block_size = block_size * block_size;
    var lineral_block_row_size = block_count * lineral_block_size;

    for (var m = 0; m < block_count; ++m)
    {
        var offset_mm = m * lineral_block_row_size + m * lineral_block_size;

        var mm = new Span<int>(B, offset_mm, lineral_block_size);
        //
        // Calculate "diagonal" block
        //
        Procedure(mm, mm, mm, block_size);
        //
        // Calculate "horizontal" and "vertical" blocks in the same loop
        //
        for (var i = 0; i < block_count; ++i)
        {
            if (i != m)
            {
                var offset_im = i * lineral_block_row_size + m * lineral_block_size;
                var offset_mi = m * lineral_block_row_size + i * lineral_block_size;

                var im = new Span<int>(B, offset_im, lineral_block_size);
                var mi = new Span<int>(B, offset_mi, lineral_block_size);

                Procedure(im, im, mm, block_size);
                Procedure(mi, mm, mi, block_size);
            }
        }
        //
        // Calculate "peripheral" blocks
        //
        for (var i = 0; i < block_count; ++i)
        {
            if (i != m)
            {
                var offset_im = i * lineral_block_row_size + m * lineral_block_size;

                var im = new Span<int>(B, offset_im, lineral_block_size);

                for (var j = 0; j < block_count; ++j)
                {
                    if (j != m)
                    {
                        var offset_ij = i * lineral_block_row_size + j * lineral_block_size;
                        var offset_mj = m * lineral_block_row_size + j * lineral_block_size;

                        var ij = new Span<int>(B, offset_ij, lineral_block_size);
                        var mj = new Span<int>(B, offset_mj, lineral_block_size);

                        Procedure(ij, im, mj, block_size);
                    }
                }
            }
        }
    }
}

You have definitely noticed that Baseline and Procedure calculate a lot of offsets, which are then used to create Spans and address values within those Spans. Here is how it works.

Imagine an 8×8 square matrix W (see Picture 9a). In our heads, we see the matrix as a square, but in memory we can represent the "square" as an array of 8×8 = 64 values (this is called a "row-major" or sometimes "linear" representation), where every value can be accessed using a simple formula:

    index = i * 8 + j

This is exactly what we do in the Procedure to access block values (we use block_size instead of the constant 8 from the formula).

When it comes to the block matrix, we can apply exactly the same thinking. Imagine a 4×4 block matrix B where every block is a 2×2 square matrix (see Picture 9b). We can represent this matrix as an array of (2×2)×(4×4) = 4×16 = 64 values (again), where every block is put into the array, one by one, using the same logic we applied to the matrix W. This allows us to find the "beginning" of any block B[i,j] using the formula:

    offset = i * 16 + j * 4

You can see the 16 and 4 calculated inside the Baseline as lineral_block_row_size and lineral_block_size respectively.
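A natural question is how the matrix gets into this block-by-block layout in the first place. As an illustration (the name, signature and the very existence of such a helper are my assumptions; the actual code in the repository may organise this differently), a conversion from a row-major N×N matrix into the block-major array used by Baseline could look like this:

static int[] ConvertToBlockLayout(int[] W, int block_count, int block_size)
{
    var size = block_count * block_size;               // full matrix dimension (N)
    var lineral_block_size = block_size * block_size;  // values per block
    var lineral_block_row_size = block_count * lineral_block_size;

    var B = new int[size * size];

    for (var i = 0; i < size; ++i)
    {
        for (var j = 0; j < size; ++j)
        {
            // Which block the element belongs to...
            var bi = i / block_size;
            var bj = j / block_size;
            // ...and its position inside that block
            var ri = i % block_size;
            var rj = j % block_size;

            var offset = bi * lineral_block_row_size + bj * lineral_block_size;
            B[offset + ri * block_size + rj] = W[i * size + j];
        }
    }

    return B;
}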
These manipulations are essential for the algorithm's performance. The linear representation is extremely cache friendly (especially in our case, where we access elements sequentially most of the time). It also allows us to access any block of the matrix B (and any value in any block of the matrix B) in constant time, which is also good for performance.

Now let's see Baseline in action.

Experimental Results

Here are the experimental results of the Baseline implementation:

Algorithm (Baseline)   | Graph      | Mean (s)  | Error (s) | StdDev (s) | LLC Miss / Op
Floyd-Warshall         | 4800       | 172.881 s | 2.0421 s  | 1.9101 s   | 6,903,951,087
Blocked Floyd-Warshall | 4800 (120) | 200.075 s | 0.0578 s  | 0.0540 s   | 59,423,676

Table 3. Experimental results of the "Baseline" implementations of the Floyd-Warshall and Blocked Floyd-Warshall algorithms.

Surprisingly, Blocked Floyd-Warshall is about 15% slower than Floyd-Warshall. This is unexpected … or not?

Think about the code of both algorithms. It is trivial - read three values from memory, sum two values, compare two values, and write one value back. Modern CPUs are extremely optimised for single-threaded execution. They have multi-stage pipelines and can process multiple operations (with no data dependencies) simultaneously. They can detect linear memory access patterns (which is why we use them in both algorithms) and automatically pre-load data from main memory into cache (hardware prefetch).

All these factors combined produce a result where an algorithm with cache-friendly memory access (and if you take a look at the LLC Miss / Op column, which shows the number of Last-Level Cache misses, it is obvious that Blocked Floyd-Warshall is more cache friendly) has no performance benefits or, even worse, works slower because it does additional work to manipulate the data in the right way (for instance, instantiation of Spans, stack push-pop, extra loops and so on).

We can run the same code under a profiler (for instance Intel VTune) and clearly see that neither of the algorithms is "bound" (i.e. limited) by the memory:

Algorithm (Baseline)   | Clock Ticks     | Instructions Retired | CPI Rate | Memory Bound
Floyd-Warshall         | 364,956,900,000 | 1,771,270,200,000    | 0.206    | 3.0%
Blocked Floyd-Warshall | 449,603,700,000 | 1,925,290,500,000    | 0.234    | 0.0%

Table 4. Clock Ticks, Instructions Retired, CPI Rate and Memory Bound parameters from the output of the Intel VTune profiler for the "Baseline" implementations of the Floyd-Warshall and Blocked Floyd-Warshall algorithms.

I have extracted four (out of many) parameters:

Clock Ticks - the number of CPU clock ticks spent executing our algorithm.
Instructions Retired - the number of instructions the CPU processed.
CPI Rate - Clock Ticks divided by Instructions Retired.
Memory Bound - a high-level, aggregate metric which represents the impact of memory (i.e. the CPU waiting for data to arrive from memory) on the execution.

You can see that the number of Instructions Retired is slightly larger for the Blocked Floyd-Warshall algorithm than for Floyd-Warshall. This is expected because it does a bit more work to arrange the calculations. Both have a low CPI Rate, which means that on every "cycle" (i.e. clock tick) the CPU was able to process up to five instructions. These two observations, in combination with the low Memory Bound, basically tell us that the CPU worked at full capacity without stalling at all.

Let's see if we can make the Blocked Floyd-Warshall algorithm outperform Floyd-Warshall.
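The Mean, Error and StdDev columns are the kind of statistics produced by a benchmarking harness such as BenchmarkDotNet. Purely as an assumption about how such numbers can be collected (the actual benchmark project lives in the repository and may differ; the visibility of Baseline and the graph generation shown here are placeholders of my own), a minimal harness might look like this:

using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class BlockedFloydWarshallBenchmarks
{
    private const int BlockSize = 120;
    private const int BlockCount = 4800 / BlockSize;

    private int[] _matrix;

    [GlobalSetup]
    public void Setup()
    {
        // Placeholder: fill the block-major matrix with random edge weights.
        // The real benchmarks use a randomly generated complete graph of
        // 4800 vertices; a real harness would also reset the matrix between
        // iterations (e.g. with [IterationSetup]) since the algorithm mutates it.
        var random = new Random(42);
        _matrix = new int[4800 * 4800];
        for (var i = 0; i < _matrix.Length; ++i)
        {
            _matrix[i] = random.Next(1, 100);
        }
    }

    [Benchmark]
    public void BlockedBaseline() => Baseline(_matrix, BlockCount, BlockSize);
}

// Entry point (for example):
// BenchmarkRunner.Run<BlockedFloydWarshallBenchmarks>();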
Vectorised Implementation (aka "Vector")

The first thing we can do to improve the performance of a purely CPU-bound algorithm is to force the CPU to do even more operations simultaneously. Yes, I am talking about vectorisation.

In the previous post, you have already seen the vectorised implementation of the Floyd-Warshall algorithm, and if you haven't, don't worry because it is almost a complete copy of the Procedure:

static void Procedure(
    Span<int> B1,
    Span<int> B2,
    Span<int> B3,
    int block_size)
{
    for (var k = 0; k < block_size; ++k)
    {
        for (var i = 0; i < block_size; ++i)
        {
            // Read vector of values from B2
            // (to get existing paths from "i -> k")
            //
            var ik_vec = new Vector<int>(B2[i * block_size + k]);
            //
            // Vectorised loop
            //
            var j = 0;
            for (; j < block_size - Vector<int>.Count; j += Vector<int>.Count)
            {
                // Read vector of values from B1
                // (to get existing paths "i -> j")
                //
                var ij_vec = new Vector<int>(
                    B1.Slice(i * block_size + j, Vector<int>.Count));
                //
                // Read vector of values from B3
                // (to get existing paths "k -> j")
                // and sum them with values from B2
                // (to get new paths "i -> k -> j")
                //
                var ikj_vec = new Vector<int>(
                    B3.Slice(k * block_size + j, Vector<int>.Count)) + ik_vec;
                //
                // Compare the vector of existing paths from B1
                // with the vector of new paths
                //
                var lt_vec = Vector.LessThan(ij_vec, ikj_vec);
                if (lt_vec == new Vector<int>(-1))
                {
                    continue;
                }
                //
                // Copy all "i -> k -> j" paths which are smaller than
                // the existing "i -> j" paths back into the B1 block
                //
                var r_vec = Vector.ConditionalSelect(lt_vec, ij_vec, ikj_vec);
                r_vec.CopyTo(B1.Slice(i * block_size + j, Vector<int>.Count));
            }
            //
            // Non-vectorised loop
            //
            for (; j < block_size; ++j)
            {
                var distance = B2[i * block_size + k] + B3[k * block_size + j];
                if (B1[i * block_size + j] > distance)
                {
                    B1[i * block_size + j] = distance;
                }
            }
        }
    }
}

As previously, we implement vectorisation using the Vector and Vector<T> types. The implementation has two loops: one vectorised (to calculate the majority of the paths) and one non-vectorised (to calculate the remaining paths when the size of the block is not divisible by the size of the vector).

There is nothing to vectorise in the main algorithm, and because the interface of the Procedure doesn't change, we can plug in the new code and run the experiments.

Experimental Results

Here are the experimental results of the Vector implementation:

Algorithm (Vector)     | Graph      | Mean (s) | Error (s) | StdDev (s) | LLC Miss / Op
Floyd-Warshall         | 4800       | 73.251 s | 0.4993 s  | 0.4671 s   | 6,353,509,854
Blocked Floyd-Warshall | 4800 (120) | 64.792 s | 0.0318 s  | 0.0282 s   | 50,703,019

Table 5. Experimental results of the "Vector" implementations of the Floyd-Warshall and Blocked Floyd-Warshall algorithms.

This time Blocked Floyd-Warshall is around 12% faster than the Floyd-Warshall algorithm. Running the same code under the profiler also presents us with a different picture:

Algorithm (Vector)     | Clock Ticks     | Instructions Retired | CPI Rate | Memory Bound
Floyd-Warshall         | 161,668,500,000 | 490,026,600,000      | 0.330    | 6.8%
Blocked Floyd-Warshall | 144,314,100,000 | 488,193,300,000      | 0.296    | 0.3%

Table 6. Clock Ticks, Instructions Retired, CPI Rate and Memory Bound parameters from the output of the Intel VTune profiler for the "Vector" implementations of the Floyd-Warshall and Blocked Floyd-Warshall algorithms.

We can see a significant reduction of Instructions Retired in both algorithms (which is expected because vector instructions perform multiple operations at once and the CPU executes multiple vector instructions per clock tick).
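A simple way to gain confidence in a vectorised kernel like this is to compare its output with the scalar version on random blocks. The following check is my own addition (it assumes the scalar variant from the Baseline section is kept around under the hypothetical name ScalarProcedure):

static void VerifyVectorisedProcedure(int block_size, int seed = 42)
{
    var random = new Random(seed);

    int[] GenerateBlock()
    {
        var block = new int[block_size * block_size];
        for (var i = 0; i < block.Length; ++i)
        {
            block[i] = random.Next(1, 100);
        }
        return block;
    }

    var b1 = GenerateBlock();
    var b2 = GenerateBlock();
    var b3 = GenerateBlock();

    // Scalar reference result (ScalarProcedure is the non-vectorised Procedure
    // from the Baseline section, kept under a different, hypothetical name)
    var expected = (int[])b1.Clone();
    ScalarProcedure(expected, b2, b3, block_size);

    // Vectorised implementation under test
    Procedure(b1, b2, b3, block_size);

    for (var i = 0; i < b1.Length; ++i)
    {
        if (b1[i] != expected[i])
        {
            throw new InvalidOperationException(
                $"Mismatch at index {i}: {b1[i]} != {expected[i]}");
        }
    }
}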
We can also see the change in the Memory Bound - it has roughly doubled for Floyd-Warshall and is almost unchanged for the Blocked Floyd-Warshall algorithm. Still, the CPI Rate is low, which means (again) that the CPU is (almost) not stalling. Therefore, there is room for improvement.

Let's see if we can apply parallelism here.

Parallel-Vectorised Implementation (aka "Parallel Vector")

Parallelisation of the Blocked Floyd-Warshall algorithm is a pure form of "task parallelism", where a "task" is the recalculation of a single block (a single invocation of the Procedure). The simplest approach to the implementation is to use message passing (i.e. sending custom messages), CountdownEvent and ThreadPool.

Here is the code of the custom message:

private readonly struct ParallelMessage
{
    // A reference to a countdown event to signal
    // when the message is processed
    //
    public readonly CountdownEvent Event;
    public readonly int[] Matrix;
    public readonly int OffsetB1;
    public readonly int OffsetB2;
    public readonly int OffsetB3;
    public readonly int BlockSize;
    public readonly int SpanLength;

    public ParallelMessage(
        CountdownEvent Event,
        int[] Matrix,
        int OffsetB1,
        int OffsetB2,
        int OffsetB3,
        int BlockSize,
        int SpanLength)
    {
        this.Event = Event;
        this.Matrix = Matrix;
        this.OffsetB1 = OffsetB1;
        this.OffsetB2 = OffsetB2;
        this.OffsetB3 = OffsetB3;
        this.BlockSize = BlockSize;
        this.SpanLength = SpanLength;
    }
}

The message is a readonly struct (for performance reasons, to avoid allocations) which includes all information required to initialise three Spans for the B1, B2 and B3 blocks and invoke the Procedure.

The conversion of a ParallelMessage into an invocation of the Procedure is handled by the ParallelProcedure:

static void ParallelProcedure(
    ParallelMessage message)
{
    var B1 = new Span<int>(message.Matrix, message.OffsetB1, message.SpanLength);
    var B2 = new Span<int>(message.Matrix, message.OffsetB2, message.SpanLength);
    var B3 = new Span<int>(message.Matrix, message.OffsetB3, message.SpanLength);

    Procedure(B1, B2, B3, message.BlockSize);

    // Signal that the message has been processed
    //
    message.Event.Signal();
}

You can see that, besides information about the blocks, the ParallelMessage includes a reference to a CountdownEvent, which the ParallelProcedure signals when the calculations are complete.

Here is the code:

private static void ParallelVectorOptimisations(
    int[] matrix,
    int block_count,
    int block_size)
{
    var iteration_sync = new CountdownEvent(0);

    var lineral_block_size = block_size * block_size;
    var lineral_block_row_size = block_count * lineral_block_size;

    for (var m = 0; m < block_count; ++m)
    {
        var offset_mm = m * lineral_block_row_size + m * lineral_block_size;

        var mm = new Span<int>(matrix, offset_mm, lineral_block_size);

        Procedure(mm, mm, mm, block_size);

        // We calculate how many signals we expect to receive
        // (when all signals are received the event itself becomes signalled)
        //
        // We expect to have one row of blocks...
        // ("horizontal" blocks except the diagonal block)
        //
        // ...and one column of blocks
        // ("vertical" blocks except the diagonal block)
        //
        // Hence, 2 * block_count - 2 = 2 * (block_count - 1)
        //
        iteration_sync.Reset(2 * (block_count - 1));

        for (var i = 0; i < block_count; ++i)
        {
            if (i != m)
            {
                var offset_im = i * lineral_block_row_size + m * lineral_block_size;
                var offset_mi = m * lineral_block_row_size + i * lineral_block_size;

                var message_x = new ParallelMessage(
                    iteration_sync, matrix, offset_im, offset_im, offset_mm,
                    block_size, lineral_block_size);

                var message_y = new ParallelMessage(
                    iteration_sync, matrix, offset_mi, offset_mm, offset_mi,
                    block_size, lineral_block_size);

                ThreadPool.QueueUserWorkItem(ParallelProcedure, message_x, false);
                ThreadPool.QueueUserWorkItem(ParallelProcedure, message_y, false);
            }
        }

        // Here we are waiting for all "horizontal" and "vertical" blocks
        // to be recalculated
        //
        iteration_sync.Wait();

        // Now we expect all blocks except one column and one row,
        // which essentially means that if we have a 4x4 matrix of blocks
        // we expect to recalculate 3x3 blocks
        //
        // Hence, (block_count - 1) * (block_count - 1)
        //
        iteration_sync.Reset((block_count - 1) * (block_count - 1));

        for (var i = 0; i < block_count; ++i)
        {
            if (i != m)
            {
                var offset_im = i * lineral_block_row_size + m * lineral_block_size;

                for (var j = 0; j < block_count; ++j)
                {
                    if (j != m)
                    {
                        var offset_ij = i * lineral_block_row_size + j * lineral_block_size;
                        var offset_mj = m * lineral_block_row_size + j * lineral_block_size;

                        var message = new ParallelMessage(
                            iteration_sync, matrix, offset_ij, offset_im, offset_mj,
                            block_size, lineral_block_size);

                        ThreadPool.QueueUserWorkItem(ParallelProcedure, message, false);
                    }
                }
            }
        }

        // Waiting for all "peripheral" blocks to be recalculated
        // before moving to the next iteration, with the next "diagonal" block
        //
        iteration_sync.Wait();
    }
}

The implementation is straightforward (except, maybe, the usage of CountdownEvent and .NET concurrency primitives). Let's see how the combination of vectorisation and parallelism impacts the performance.

Experimental Results

Here are the execution results:

Algorithm (Parallel Vector) | Graph      | Mean (s) | Error (s) | StdDev (s) | LLC Miss / Op
Floyd-Warshall              | 4800       | 32.311 s | 0.0542 s  | 0.0480 s   | 4,277,045,385
Blocked Floyd-Warshall      | 4800 (120) | 3.396 s  | 0.0014 s  | 0.0013 s   | 90,435,311

Table 7. Experimental results of the "Parallel Vector" implementations of the Floyd-Warshall and Blocked Floyd-Warshall algorithms.

This isn't a mistake. Introducing parallelism into the Blocked Floyd-Warshall algorithm results in an almost 20x performance improvement compared to the vectorised version, while the Floyd-Warshall speedup is much more moderate - around 2x. Running the code under the profiler reveals the reasons:

Algorithm (Parallel Vector) | Clock Ticks       | Instructions Retired | CPI Rate | Memory Bound
Floyd-Warshall              | 2,038,598,100,000 | 606,769,800,000      | 3.360    | 78.9%
Blocked Floyd-Warshall      | 254,444,400,000   | 483,594,300,000      | 0.526    | 0.0%

Table 8. Clock Ticks, Instructions Retired, CPI Rate and Memory Bound parameters from the output of the Intel VTune profiler for the "Parallel Vector" implementations of the Floyd-Warshall and Blocked Floyd-Warshall algorithms.

The Memory Bound is a crushing 78.9% - which essentially means that, most of the time, the CPU was waiting for data from memory instead of doing any kind of calculations. This might seem unexpected because previously the stats of both algorithms were quite close.
But don't forget the important part - modern CPUs are extremely good at single-threaded execution, but things are very different when it comes to multi-threading. Just look at the results of the parallel (non-vectorised) version of the Floyd-Warshall algorithm:

Algorithm (Parallel) | Graph | Mean (s) | Error (s) | StdDev (s) | LLC Miss / Op
Floyd-Warshall       | 4800  | 27.629 s | 0.0101 s  | 0.0094 s   | 4,571,752,038

Table 9. Experimental results of the "Parallel" implementation of the Floyd-Warshall algorithm.

It is FASTER (27.629 seconds vs 32.311 seconds) than the vectorised version because it puts less pressure on memory:

Algorithm (Parallel) | Clock Ticks       | Instructions Retired | CPI Rate | Memory Bound
Floyd-Warshall       | 1,907,472,000,000 | 2,449,542,900,000    | 0.779    | 30.5%

Table 10. Clock Ticks, Instructions Retired, CPI Rate and Memory Bound parameters from the output of the Intel VTune profiler for the "Parallel" implementation of the Floyd-Warshall algorithm.

The Floyd-Warshall algorithm is bound by memory and its scaling is limited (even using more cores will result in just a tiny speedup). However, this doesn't apply to Blocked Floyd-Warshall, which actually CAN BE improved further because currently it is limited neither by CPU nor by memory. But this is a completely different story.

Conclusion

In this post we implemented one more algorithm for solving the all-pairs shortest path problem. Along the way, we learned the basics of caching, revisited linear memory access, vectorisation and the basic concurrency capabilities of .NET. We also saw with our own eyes how fast cache-aware (aka cache-friendly) algorithms can be, especially when it comes to parallel execution.

I hope you enjoyed the reading and THANK YOU for staying to the end.

References

[1] You can find the mentioned latency values and more in the 7-Zip LZMA Benchmarks.
[2] Venkataraman, G., Sahni, S., Mukhopadhyaya, S. A Blocked All-Pairs Shortest Paths Algorithm // Journal of Experimental Algorithmics (JEA). Vol. 8. 2003. P. 857-874.
[3] Karasik, O. N., Prihozhy, A. A. Tuning Block-Parallel All-Pairs Shortest Path Algorithm For Efficient Multi-Core Implementation // Системный анализ и прикладная информатика. No. 3. 2022. P. 57-65.
- Author’s note - Author’s note What is cache? Most probably you heard something about CPU cache, at least in advertisements of a new Intel or AMD chips or during any kind of hardware discussions with friends or colleagues. Here, we aren’t going to discuss what CPU cache exactly is or who is better in implementing it Intel, AMD or Apple. Instead, we will try to build a mental model of the cache and learn how to use it to: … better understand limits we can or can’t push when writing high-performance code in C# and .NET. … better understand application behaviour in high-performance scenarios on different platforms, with different parameters. … brag in front of our friend when someone says “cache”. … better understand limits we can or can’t push when writing high-performance code in C# and .NET. … better understand application behaviour in high-performance scenarios on different platforms, with different parameters. … brag in front of our friend when someone says “cache”. So, let’s start from the basics… what a heck is a cache? Imagine yourself sitting in an empty apartment in front of a PC. You are working on a new feature and you are doing great. You are blazing fast, no errors and no mistakes. You are in a perfect flow. You are the CPU. You are the CPU . CPU Periodically, you need to consult documentation. All documentation you need is available in the library across the road. Library is the main memory . Library is the main memory main memory You are free to enter the library anytime and read everything you need. However, every visit to the library takes time and prevents you from advancing your work, which doesn’t sound productive. To improve the situation, you started to write down pieces of the documentation on sticky notes and put them on the display. Sticky note on a display is the first level cache. Sticky note on a display is the first level cache . first level cache Now when you need something from the documentation you first look at sticky notes and only if information isn’t there you have to visit the library. When you found what you need on sticky note – it is called a cache hit, otherwise – a cache miss. When you found what you need on sticky note – it is called a cache hit , otherwise – a cache miss . cache hit cache miss Having sticky notes on the display reduced time spent on “library visits” but didn’t remove it completely because you can’t know what documentation you would need in advance. There is a natural limit of how much useful information you can bring back from the library. The amount of useful information you can bring from the library is called a cache line. The amount of useful information you can bring from the library is called a cache line . cache line You can’t have millions of sticky notes on the display, so when there is not enough space you have to throw some of them into trash. When you throw away sticky note because of space limitation – it is called an eviction from cache. When you throw away sticky note because of space limitation – it is called an eviction from cache . eviction from cache To stop throwing useful information away (and stop visiting library when you need it again) you bought yourself a “small box”. Now, when you bring “new” sticky notes from library and need to free up some space on the display, instead of throwing “old” sticky notes into trash you put them into a “small box”. 
This way, if you need this information once again, you will be able to pick it up from the “small box” and put it back on the display and save yourself a library visit. “Small box” is the second level cache. “Small box” is the second level cache . second level cache The “small box” can fit much more sticky notes than the display, however, it is not unlimited. Therefore, when some of sticky notes don’t fit in the “small box” you still had to throw them in the trash. That is why you decided to buy a “huge box”. Now, when you bring new sticky notes from the library you first make copies of them and put these copies into a “huge box” and only then you put originals on the display. If you need to replace sticky notes on the display, you as previously, put them into a “small box” and if “small box” is overloaded you throw them into trash. However, this time if you need this information once again – you just make a copy of the sticky note from the “huge box” and put it back on the display. “Huge box” is the third level cache (aka LLC) “Huge box” is the third level cache (aka LLC) third level cache (aka LLC) “Huge box” doesn’t eliminate “library visits”, however, if the size of a “huge box” is considerably large, then you will end up having a lot of related pieces of documentation in it, which in turn will allow you to do a significant amount of work without interruption. That is basically it. The above model isn’t full nor ideal but, at least I hope so, it clearly illustrates the main idea of why modern chips have multiple levels cache — to reduce number of round trips to main memory, and keep CPU busy as long as possible. What might not be so obvious, from the model, is that all caches (L1 “sticky notes on a display”, L2 “small box” and L3 “huge box”) have different access time (aka latency) [1] and size. Here is an example for Intel i7-6700 (Skylake) CPU: Intel i7-6700 (Skylake) Cache Access time (aka Latency) Size L1 4 cycles 64 KB L2 12 cycles 256 KB L3 42 cycles 8 MB Cache Access time (aka Latency) Size L1 4 cycles 64 KB L2 12 cycles 256 KB L3 42 cycles 8 MB Cache Access time (aka Latency) Size Cache Cache Access time (aka Latency) Access time (aka Latency) Size Size L1 4 cycles 64 KB L1 L1 4 cycles 4 cycles 64 KB 64 KB L2 12 cycles 256 KB L2 L2 12 cycles 12 cycles 256 KB 256 KB L3 42 cycles 8 MB L3 L3 42 cycles 42 cycles 8 MB 8 MB Table 1. L1, L2 and L3 caches access times and sizes of Intel i7-6700 (Skylake) processor. Intel i7-6700 (Skylake) Please note, the above numbers are just one example. Different CPUs have different characteristics. However, the overall relation more of less remains the same — L1 is the smallest but also the fastest one, L2 is somewhere in between, and L3 is the largest but also the slowest one. - Author’s note Please note, the above numbers are just one example. Different CPUs have different characteristics. However, the overall relation more of less remains the same — L1 is the smallest but also the fastest one, L2 is somewhere in between, and L3 is the largest but also the slowest one. - Author’s note - Author’s note To truly understand these numbers, it is critical to know how much time it takes to access main memory. On the same Intel i7-6700 (Skylake) CPU accessing main memory takes around 42 cycles + 51 nanosecond. You can notice, that besides cycles, we also have a fixed interval of “51 nanosecond” here. 
This isn’t a mistake because cycles can be converted (approximately) to nanoseconds using a very simple formula: In our example, it would be: Converting access time values in previous table from cycles to access time gives us the following: Element Access time (aka Latency) L1 1 ns L2 3 ns L3 10.5 ns Memory 10.5 ns + 51 ns = 60.5 ns Element Access time (aka Latency) L1 1 ns L2 3 ns L3 10.5 ns Memory 10.5 ns + 51 ns = 60.5 ns Element Access time (aka Latency) Element Element Access time (aka Latency) Access time (aka Latency) L1 1 ns L1 L1 1 ns 1 ns L2 3 ns L2 L2 3 ns 3 ns L3 10.5 ns L3 L3 10.5 ns 10.5 ns Memory 10.5 ns + 51 ns = 60.5 ns Memory Memory 10.5 ns + 51 ns = 60.5 ns 10.5 ns + 51 ns = 60.5 ns Table 2. L1, L2, L3 caches and main memory approximate access times in nanoseconds of Intel i7-6700 (Skylake) processor. Intel i7-6700 (Skylake) Even taking in account that this is a very approximate conversion, which ignores a lot of details, it still looks stunning — accessing L1 cache is 60 times faster than main memory. Of course in real application it isn’t so simple, and cache related optimisations doesn’t result in x60 performance improvements, however, as I hope we would be able to see in this post, they still can significantly improve application performance. A reader familiar with the topic for sure would notice that I haven’t mentioned anything about cache modes (exclusive and inclusive), as well as anything related to instruction caches and other things like memory alignment, hardware and software prefetch, and so on. I did that intentionally for simplicity purposes. However, if you just stumbled on this note and want to know more, you can start from this Wikipedia page and if you want to start thinking about low-level performance I can recommend you a book by Denis Bakhvalov and a paper by Ulrich Drepper - Author’s note A reader familiar with the topic for sure would notice that I haven’t mentioned anything about cache modes (exclusive and inclusive), as well as anything related to instruction caches and other things like memory alignment, hardware and software prefetch, and so on. I did that intentionally for simplicity purposes. However, if you just stumbled on this note and want to know more, you can start from this Wikipedia page and if you want to start thinking about low-level performance I can recommend you a book by Denis Bakhvalov and a paper by Ulrich Drepper page book paper - Author’s note - Author’s note Blocked Floyd-Warshall algorithm Blocked version of Floyd-Warshall algorithm (aka Blocked Floyd-Warshall) was introduced in [2]. The algorithm works with a graph G of vertices V , represented as weight matrix W . It splits the matrix into blocks of equal sizes (creating a new matrix of blocks B ) and uses these blocks to calculate all-pairs of shortest paths between all vertices in a graph. G V W B This definition doesn’t sound obvious, so here is an example. Imagine a 8×8 weight matrix W (see Picture 1a). We split the matrix into 2×2 blocks to create a new 4×4 matrix of blocks B (see Picture 1b). 
W B Every block of the matrix B includes a part of all paths of the matrix W , for instance (see Picture 2): B W B[0,0] includes paths from vertices 0 and 1 to 0 and 1 B[0,3] includes paths from vertices 0 and 1 to 6 and 7 B[2,0] includes paths from vertices 4 and 5 to 0 and 1 … and B[2,3] includes paths from vertices 4 and 5 to vertices 6 and 7 B[0,0] includes paths from vertices 0 and 1 to 0 and 1 B[0,0] B[0,3] includes paths from vertices 0 and 1 to 6 and 7 B[0,3] B[2,0] includes paths from vertices 4 and 5 to 0 and 1 B[2,0] … and B[2,3] includes paths from vertices 4 and 5 to vertices 6 and 7 B[2,3] All vertices, represented by these four blocks are tightly related. Just looking at the matrix B it is noticeable that you can move from vertex 4 to 0, then from 0 to 7 and by those find a path from 4 to 7 through vertex 0, which, then can be compared with the existing path from 4 to 7, which can be updated if necessary with the shortest one (see Picture 3). B If you remember the Floyd-Warshall algorithm, then these manipulations must sound extremely familiar because they resemble what it does: algorithm FloydWarshall(W) do for k = 0 to N - 1 do for i = 0 to N - 1 do for j = 0 to N - 1 do // Select the shortest path between existing path from i to j and // a new path from i to k to j, and replace the existing path // with the smallest // W[i, j] = min(W[i, j], W[i, k] + W[k, j]) end for end for end for end algorithm // where W - is a weight matrix of N x N size, // min() - is a function which returns lesser of it's arguments algorithm FloydWarshall(W) do for k = 0 to N - 1 do for i = 0 to N - 1 do for j = 0 to N - 1 do // Select the shortest path between existing path from i to j and // a new path from i to k to j, and replace the existing path // with the smallest // W[i, j] = min(W[i, j], W[i, k] + W[k, j]) end for end for end for end algorithm // where W - is a weight matrix of N x N size, // min() - is a function which returns lesser of it's arguments However, Floyd-Warshall algorithm works with matrix W (not B ). Fortunately, it can be easily rewritten into a Procedure which accepts three blocks B1 , B2 and B3 instead of matrix W : W B Procedure B1 B2 B3 W function Procedure(B1, B2, B3) do for k = 0 to L - 1 do for i = 0 to L - 1 do for j = 0 to L - 1 do B1[i, j] = min(B1[i, j], B2[i, k] + B3[k, j]) end for end for end for end function // where B1, B2 and B3 - are blocks from a block matrix B, // L - is a size of the blocks (B1, B2 and B3 are L x L size) // min() - is a function which returns lesser of it's arguments function Procedure(B1, B2, B3) do for k = 0 to L - 1 do for i = 0 to L - 1 do for j = 0 to L - 1 do B1[i, j] = min(B1[i, j], B2[i, k] + B3[k, j]) end for end for end for end function // where B1, B2 and B3 - are blocks from a block matrix B, // L - is a size of the blocks (B1, B2 and B3 are L x L size) // min() - is a function which returns lesser of it's arguments If we invoke the the Procedure with B1 = B[2,3] , B2 = B[2,0] and B3 = B[0,3] , it will recalculate all paths from vertices 4, 5 to vertices 6, 7 through vertices 0, 1. Procedure B1 = B[2,3] B2 = B[2,0] B3 = B[0,3] The important moment here is while Procedure indeed recalculates paths from vertices 4, 5 to vertices 6, 7 (through 0 and 1), these paths aren’t guaranteed to be THE SHORTEST because recalculation relies on the existing paths stored in blocks B[2,0] and B[0,3] . 
Procedure B[2,0] B[0,3] Luckily, prior to recalculating the paths in block B[2,3] , we can recalculate paths in blocks B[2,0] and B[0,3] using … the same Procedure but with different input parameters (see Picture 4): B[2,3] B[2,0] B[0,3] Procedure To recalculate B[2,0] we invoke the Procedure with B1 = B[2,0], B2 = B[2,0] and B3 = B[0,0]. This recalculates all paths from vertices 4, 5 into 0, 1 through 0 and 1. To recalculate B[0,3] we invoke the Procedure with B1 = B[0,3], B2 = B[0,0] and B3 = B[0,3]. This recalculates all paths from vertices 0, 1 into 6, 7 through 0 and 1. To recalculate B[2,0] we invoke the Procedure with B1 = B[2,0] , B2 = B[2,0] and B3 = B[0,0] . This recalculates all paths from vertices 4, 5 into 0, 1 through 0 and 1. B[2,0] Procedure B1 = B[2,0] B2 = B[2,0] B3 = B[0,0] To recalculate B[0,3] we invoke the Procedure with B1 = B[0,3] , B2 = B[0,0] and B3 = B[0,3] . This recalculates all paths from vertices 0, 1 into 6, 7 through 0 and 1. B[0,3] Procedure B1 = B[0,3] B2 = B[0,0] B3 = B[0,3] You might have already noticed — recalculation of blocks B[2,0] and B[0,3] relies on the data from block B[0,0] in the same way as recalculation of block B[2,3] relies on blocks B[2,0] and B[0,3] . B[2,0] B[0,3] B[0,0] B[2,3] B[2,0] B[0,3] Luckily (again), prior to recalculating the paths in blocks B[2,0] and B[0,3] , we can recalculate paths in B[0,0] using … the same Procedure but (again) with different input parameters (see Picture 5): B[2,0] B[0,3] B[0,0] Procedure To recalculate B[0,0] we invoke the Procedure with B1 = B[0,0], B2 = B[0,0] and B3 = B[0,0]. This recalculates all paths from vertices 0, 1 into 0, 1 through 0 and 1. To recalculate B[0,0] we invoke the Procedure with B1 = B[0,0] , B2 = B[0,0] and B3 = B[0,0] . This recalculates all paths from vertices 0, 1 into 0, 1 through 0 and 1. B[0,0] B1 = B[0,0] B2 = B[0,0] B3 = B[0,0] It might look surprising a bit (to recalculating the block by the block itself) but if you remember, we inferred the code of the Procedure from the Floyd-Warshall algorithm, which means, when all input parameters are set to the same block, the Procedure completely resembles the Floyd-Warshall algorithm and recalculates paths within the block: Procedure Procedure function Procedure(B, B, B) do for k = 0 to L - 1 do for i = 0 to L - 1 do for j = 0 to L - 1 do B[i, j] = min(B[i, j], B[i, k] + B[k, j]) end for end for end for end function function Procedure(B, B, B) do for k = 0 to L - 1 do for i = 0 to L - 1 do for j = 0 to L - 1 do B[i, j] = min(B[i, j], B[i, k] + B[k, j]) end for end for end for end function In combination, the process of recalculation of paths from vertices 4, 5 to vertices 6, 7 through vertices 0 and 1 is the following: Invoke Procedure with B1 = B[0,0], B2 = B[0,0] and B3 = B[0,0] to calculate all paths from vertices 0, 1 to vertices 0, 1 (represented by a block B[0,0]) through vertices 0 and 1 (see Picture 6a). Invoke Procedure with B1 = B[0,3], B2 = B[0,0] and B3 = B[0,3] to calculate all paths from vertices 0, 1 to vertices 6, 7 (represented by a block B[0,3]) through vertices 0 and 1 (see Picture 6b). Invoke Procedure with B1 = B[2,0], B2 = B[2,0] and B3 = B[0,0] to calculate all paths from vertices 4, 5 to vertices 0, 1 (represented by a block B[2,0]) through vertices 0 and 1 (see Picture 6c). Invoke Procedure with B1 = B[2,3], B2 = B[2,0] and B3 = B[0,3] to calculate all paths from vertices 4, 5 to vertices 6, 7 (represented by a block B[2,3]) through vertices 0 and 1 (see Picture 6d). 
Invoke Procedure with B1 = B[0,0] , B2 = B[0,0] and B3 = B[0,0] to calculate all paths from vertices 0, 1 to vertices 0, 1 (represented by a block B[0,0] ) through vertices 0 and 1 (see Picture 6a). Procedure B1 = B[0,0] B2 = B[0,0] B3 = B[0,0] B[0,0] Invoke Procedure with B1 = B[0,3], B2 = B[0,0] and B3 = B[0,3] to calculate all paths from vertices 0, 1 to vertices 6, 7 (represented by a block B[0,3] ) through vertices 0 and 1 (see Picture 6b). Procedure B[0,3] Invoke Procedure with B1 = B[2,0] , B2 = B[2,0] and B3 = B[0,0] to calculate all paths from vertices 4, 5 to vertices 0, 1 (represented by a block B[2,0] ) through vertices 0 and 1 (see Picture 6c). Procedure B1 = B[2,0] B2 = B[2,0] B3 = B[0,0] B[2,0] Invoke Procedure with B1 = B[2,3], B2 = B[2,0] and B3 = B[0,3] to calculate all paths from vertices 4, 5 to vertices 6, 7 (represented by a block B[2,3] ) through vertices 0 and 1 (see Picture 6d). Procedure B[2,3] In code, these steps can be represented by four sequential invocations of the Procedure : Procedure Procedure(B[0,0], B[0,0], B[0,0]) Procedure(B[0,3], B[0,0], B[0,3]) Procedure(B[2,0], B[2,0], B[0,0]) Procedure(B[2,3], B[2,0], B[0,3]) Procedure(B[0,0], B[0,0], B[0,0]) Procedure(B[0,3], B[0,0], B[0,3]) Procedure(B[2,0], B[2,0], B[0,0]) Procedure(B[2,3], B[2,0], B[0,3]) Interestingly, by slightly adjusting the above code we can recalculate paths from vertices 4,5 to 2,3 through 0 and 1 (all we need to do, is to replace B[0,3 ] with B[0,1] ): B[0,3 B[0,1] Procedure(B[0,0], B[0,0], B[0,0]) Procedure(B[0,1], B[0,0], B[0,1]) Procedure(B[2,0], B[2,0], B[0,0]) Procedure(B[2,1], B[2,0], B[0,1]) Procedure(B[0,0], B[0,0], B[0,0]) Procedure(B[0,1], B[0,0], B[0,1]) Procedure(B[2,0], B[2,0], B[0,0]) Procedure(B[2,1], B[2,0], B[0,1]) … and if we replace B[0,1] with B[0,2] … I think you got the idea. We can continue and calculate all paths between all vertices through vertices 0 and 1 (see Picture 7): B[0,1] B[0,2] In code, the implementation is just a repetition of the same steps but for different blocks: // Recalculate block B[0,0] (aka "diagonal" block) // Procedure(B[0,0], B[0,0], B[0,0]) // // Recalculate blocks B[0,1], B[0,2] and B[0,3] (aka "horizontal" blocks) // for i = 1 to 4 do Procedure(B[0,i], B[0,0], B[0,i]) end // // Recalculate blocks B[1,0], B[2,0] and B[3,0] (aka "vertical" blocks) // for i = 1 to 4 do Procedure(B[i,0], B[i,0], B[0,0]) end // // Recalculate blocks: // B[1,1], B[1,2], B[1,3], // B[2,1], B[2,2], B[2,3], // B[3,1], B[3,2], B[3,3] // (aka "peripheral" blocks) // for i = 1 to 4 do for j = 1 to 4 do Procedure(B[i,j], B[i,0], B[0,j]) end end // Recalculate block B[0,0] (aka "diagonal" block) // Procedure(B[0,0], B[0,0], B[0,0]) // // Recalculate blocks B[0,1], B[0,2] and B[0,3] (aka "horizontal" blocks) // for i = 1 to 4 do Procedure(B[0,i], B[0,0], B[0,i]) end // // Recalculate blocks B[1,0], B[2,0] and B[3,0] (aka "vertical" blocks) // for i = 1 to 4 do Procedure(B[i,0], B[i,0], B[0,0]) end // // Recalculate blocks: // B[1,1], B[1,2], B[1,3], // B[2,1], B[2,2], B[2,3], // B[3,1], B[3,2], B[3,3] // (aka "peripheral" blocks) // for i = 1 to 4 do for j = 1 to 4 do Procedure(B[i,j], B[i,0], B[0,j]) end end Plain and simple. This code recalculates all paths between all vertices through vertices 0 and 1. 
To recalculate the remaining combinations of vertices we need to … repeat the above algorithm, starting from blocks B[1,1], B[2,2] and B[3,3] as “diagonal” blocks (see Picture 8).

In code, the complete Blocked Floyd-Warshall algorithm looks surprisingly simple:

algorithm BlockedFloydWarshall(B) do
    // Iterate over all "diagonal" blocks
    //
    for m = 0 to M do
        // Recalculate "diagonal" block
        //
        Procedure(B[m,m], B[m,m], B[m,m])
        //
        // Recalculate "horizontal" blocks
        //
        for i = 0 to M do
            // Here, we jump over the "diagonal" block
            //
            if i != m then
                Procedure(B[m,i], B[m,m], B[m,i])
            end
        end
        //
        // Recalculate "vertical" blocks
        //
        for i = 0 to M do
            // Here, we jump over the "diagonal" block
            //
            if i != m then
                Procedure(B[i,m], B[i,m], B[m,m])
            end
        end
        //
        // Recalculate "peripheral" blocks
        //
        for i = 0 to M do
            // Here, we jump over all "horizontal" blocks
            //
            if i != m then
                for j = 0 to M do
                    // Here, we jump over all "vertical" blocks
                    //
                    if j != m then
                        Procedure(B[i,j], B[i,m], B[m,j])
                    end
                end
            end
        end
    end
end
// where B is a block matrix of M x M size

The algorithm iterates over all “diagonal” blocks and sequentially recalculates all paths between all vertices through the vertices of the “diagonal” block. Eventually, the algorithm recalculates all paths between all vertices through all vertices.

That is why both algorithms have the same space and time complexities of O(n²) and O(n³) respectively.
- Author’s note

Now, when we are done with the theory, it is time to implement it in C#.

Experimental Environment

All experiments in this post are executed on a single randomly generated complete graph of 4800 vertices.

The experimental machine was equipped with two Intel Xeon CPU E5-2620 v4 processors (2.10GHz, 16 physical and 32 logical cores in total) running the Windows Server 2019 operating system.

We use a block size of 120×120 because it has been found to demonstrate the best performance characteristics on the experimental hardware [3].

The code was compiled and executed using .NET SDK 8.0 and .NET Runtime 8.0 (x64, AVX2). You can find all the code in the GitHub repository.

All implementations of the Floyd-Warshall algorithm in this post are identical to the implementations from the previous post (except minor variable and method names).
Basic Implementation (aka Baseline)

We start the C# implementation from the implementation of the Procedure:

static void Procedure(
    Span<int> B1, Span<int> B2, Span<int> B3, int block_size)
{
    for (var k = 0; k < block_size; ++k)
    {
        for (var i = 0; i < block_size; ++i)
        {
            for (var j = 0; j < block_size; ++j)
            {
                var distance = B2[i * block_size + k] + B3[k * block_size + j];
                if (B1[i * block_size + j] > distance)
                {
                    B1[i * block_size + j] = distance;
                }
            }
        }
    }
}

The code is almost an exact translation of the previously demonstrated pseudo-code into C#. The only noticeable difference is the representation of all input blocks as a combination of a Span<T> and the size of the block.

If you are a bit confused how it is possible and why we access an array by multiplying things, don’t be, I will explain it in a moment.
- Author’s note

The algorithm itself looks just a bit different from the pseudo-code (we denote this implementation as “Baseline”):

private static void Baseline(
    int[] B, int block_count, int block_size)
{
    var lineral_block_size = block_size * block_size;
    var lineral_block_row_size = block_count * lineral_block_size;

    for (var m = 0; m < block_count; ++m)
    {
        var offset_mm = m * lineral_block_row_size + m * lineral_block_size;

        var mm = new Span<int>(B, offset_mm, lineral_block_size);

        // Calculate "diagonal" block
        //
        Procedure(mm, mm, mm, block_size);

        // Calculate "horizontal" and "vertical" blocks in the same loop
        //
        for (var i = 0; i < block_count; ++i)
        {
            if (i != m)
            {
                var offset_im = i * lineral_block_row_size + m * lineral_block_size;
                var offset_mi = m * lineral_block_row_size + i * lineral_block_size;

                var im = new Span<int>(B, offset_im, lineral_block_size);
                var mi = new Span<int>(B, offset_mi, lineral_block_size);

                Procedure(im, im, mm, block_size);
                Procedure(mi, mm, mi, block_size);
            }
        }

        // Calculate "peripheral" blocks
        //
        for (var i = 0; i < block_count; ++i)
        {
            if (i != m)
            {
                var offset_im = i * lineral_block_row_size + m * lineral_block_size;

                var im = new Span<int>(B, offset_im, lineral_block_size);

                for (var j = 0; j < block_count; ++j)
                {
                    if (j != m)
                    {
                        var offset_ij = i * lineral_block_row_size + j * lineral_block_size;
                        var offset_mj = m * lineral_block_row_size + j * lineral_block_size;

                        var ij = new Span<int>(B, offset_ij, lineral_block_size);
                        var mj = new Span<int>(B, offset_mj, lineral_block_size);

                        Procedure(ij, im, mj, block_size);
                    }
                }
            }
        }
    }
}

You definitely have noticed that Baseline and Procedure calculate a lot of offsets, which are then used to create Span's and address values within the Span's.

Here is how it works. Imagine an 8×8 square matrix W (see Picture 9a). In our heads, we see the matrix as a square but in memory we can represent the “square” as an array of 8×8 = 64 values (this is called a “row-major” or sometimes “linear” representation), where every value can be accessed using a simple formula:

index(i, j) = i × 8 + j

This is exactly what we do in the Procedure to access block values (we just use block_size instead of the constant 8 from the formula).

When it comes to the block matrix, we can apply absolutely the same thinking. Imagine a 4×4 block matrix B where every block is a 2×2 square matrix (see Picture 9b). We can represent this matrix as an array of (2×2)×(4×4) = 4×16 = 64 values (again), where every block is put into the array using the same logic we applied to matrix W, one by one. This allows us to find the “beginning” of any block using the formula:

offset(i, j) = i × 16 + j × 4

You can see the 16 and 4 calculated inside the Baseline as lineral_block_row_size and lineral_block_size respectively.

These manipulations are essential for the algorithm's performance. Linear representation is extremely cache friendly (especially in our case, where we access elements sequentially most of the time). It also allows us to access any block of the matrix B (and any value in any block of matrix B) in constant time, which is also healthy for performance.

Now let’s see Baseline in action.
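To make the offset arithmetic concrete, here is a small, self-contained sketch (it is not taken from the repository; the helper names BlockOffset and ToBlocked are hypothetical) showing how a plain row-major matrix could be rearranged into the blocked layout that Baseline expects, and how the two formulas above locate a block and a value inside it:

using System;

static class BlockedLayoutDemo
{
    // Offset of the beginning of block B[i, j] inside the blocked array
    // (i * lineral_block_row_size + j * lineral_block_size in Baseline's terms).
    static int BlockOffset(int i, int j, int block_count, int block_size)
        => i * (block_count * block_size * block_size) + j * (block_size * block_size);

    // Convert a plain row-major n x n matrix into the blocked layout.
    // Assumes n is divisible by block_size.
    static int[] ToBlocked(int[] matrix, int n, int block_size)
    {
        var block_count = n / block_size;
        var blocked = new int[n * n];

        for (var bi = 0; bi < block_count; ++bi)
        for (var bj = 0; bj < block_count; ++bj)
        {
            var offset = BlockOffset(bi, bj, block_count, block_size);
            for (var i = 0; i < block_size; ++i)
            for (var j = 0; j < block_size; ++j)
            {
                // index(i, j) = i * block_size + j inside the block
                blocked[offset + i * block_size + j] =
                    matrix[(bi * block_size + i) * n + (bj * block_size + j)];
            }
        }
        return blocked;
    }

    static void Main()
    {
        // A 4x4 matrix split into 2x2 blocks, mirroring Picture 9b on a smaller scale.
        const int n = 4;
        var matrix = new int[n * n];
        for (var v = 0; v < matrix.Length; ++v) matrix[v] = v;

        var blocked = ToBlocked(matrix, n, block_size: 2);

        // The beginning of block B[1, 1] is at offset 1 * 8 + 1 * 4 = 12.
        Console.WriteLine(BlockOffset(1, 1, block_count: 2, block_size: 2));
        Console.WriteLine(string.Join(", ", blocked));
    }
}

For the experimental setup of this post, the same layout corresponds to block_count = 4800 / 120 = 40 blocks per row, each block holding 120 × 120 values.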
Experimental Results

Here are the experimental results of the Baseline implementation:

Algorithm (Baseline)   | Graph      | Mean (s) | Error (s) | StdDev (s) | LLC Miss / Op
Floyd-Warshall         | 4800       | 172.881  | 2.0421    | 1.9101     | 6,903,951,087
Blocked Floyd-Warshall | 4800 (120) | 200.075  | 0.0578    | 0.0540     | 59,423,676

Table 3. Experimental results of “Baseline” implementations of Floyd-Warshall and Blocked Floyd-Warshall algorithms.

Surprisingly, the Blocked Floyd-Warshall is about 15% slower than Floyd-Warshall. This is unexpected … or not?

Think about the code of both algorithms. It is trivial – read three values from memory, sum two values, compare two values, and write one value back. Modern CPUs are extremely optimised for single-threaded execution. They have multi-stage pipelines and can process multiple operations (with no data dependencies) simultaneously. They can detect linear memory access patterns (which is why we use them in both algorithms) and automatically pre-load data from main memory into cache (hardware pre-fetch).

All these factors combined produce a result where the algorithm with cache-friendly memory access (and if you take a look at the LLC Miss / Op column, which shows the number of Last Level Cache misses, it is obvious that Blocked Floyd-Warshall is more cache friendly – it causes roughly 100× fewer misses) has no performance benefit, or even worse, works slower, because it does additional work to arrange the data in the right way (for instance, instantiation of Span's, stack push-pop, extra loops and so on).

We can run the same code under a profiler (for instance Intel VTune) and clearly see that neither of the algorithms is “bound” (i.e. limited) by the memory:

Algorithm (Baseline)   | Clock Ticks     | Instructions Retired | CPI Rate | Memory Bound
Floyd-Warshall         | 364,956,900,000 | 1,771,270,200,000    | 0.206    | 3.0%
Blocked Floyd-Warshall | 449,603,700,000 | 1,925,290,500,000    | 0.234    | 0.0%

Table 4.
Clock Ticks, Instructions Retired, CPI Rate and Memory Bound parameters from the output of the Intel VTune profiler for “Baseline” implementations of Floyd-Warshall and Blocked Floyd-Warshall algorithms.

I have extracted four (out of many) parameters:

Clock Ticks – the number of CPU clock ticks spent executing our algorithm.

Instructions Retired – the number of instructions the CPU processed.

CPI Rate – Clock Ticks divided by Instructions Retired (for Floyd-Warshall, for example, 364,956,900,000 / 1,771,270,200,000 ≈ 0.206).

Memory Bound – a high-level, aggregate metric which represents the impact of memory (i.e. the CPU waiting for data to arrive from memory) on the execution.

You can see that the number of Instructions Retired is slightly bigger for the Blocked Floyd-Warshall algorithm compared to Floyd-Warshall. This is expected because it does a bit more work to arrange the calculations. They both have a low CPI Rate, which means that every “cycle” (i.e. clock tick) the CPU was able to process up to five instructions.

These two observations, in combination with the low Memory Bound, basically tell us that the CPU worked at its full capacity without stalling at all.

Let’s see if we can make the Blocked Floyd-Warshall algorithm outperform Floyd-Warshall.

Vectorised Implementation (aka “Vector”)

The first thing we can do to improve performance of a purely CPU bound algorithm is to force the CPU to do even more operations simultaneously. Yes, I am talking about vectorisation.

In the previous post you have already seen the vectorised implementation of the Floyd-Warshall algorithm, and if you didn’t, then don’t worry because it is almost a complete copy of the Procedure:

static void Procedure(
    Span<int> B1, Span<int> B2, Span<int> B3, int block_size)
{
    for (var k = 0; k < block_size; ++k)
    {
        for (var i = 0; i < block_size; ++i)
        {
            // Read vector of values from B2
            // (to get existing paths from "i -> k")
            //
            var ik_vec = new Vector<int>(B2[i * block_size + k]);

            // Vectorized loop
            //
            var j = 0;
            for (; j < block_size - Vector<int>.Count; j += Vector<int>.Count)
            {
                // Read vector of values from B1
                // (to get existing paths "i -> j")
                //
                var ij_vec = new Vector<int>(
                    B1.Slice(i * block_size + j, Vector<int>.Count));

                // Read vector of values from B3
                // (to get existing paths "k -> j")
                // and sum them with values from B2
                // (to get new paths "i -> k -> j")
                //
                var ikj_vec = new Vector<int>(
                    B3.Slice(k * block_size + j, Vector<int>.Count)) + ik_vec;

                // Compare vector of existing paths from B1
                // with a vector of new paths
                //
                var lt_vec = Vector.LessThan(ij_vec, ikj_vec);
                if (lt_vec == new Vector<int>(-1))
                {
                    continue;
                }

                // Copy all "i -> k -> j" paths which are smaller than
                // existing "i -> j" paths back into B1 block
                //
                var r_vec = Vector.ConditionalSelect(lt_vec, ij_vec, ikj_vec);
                r_vec.CopyTo(B1.Slice(i * block_size + j, Vector<int>.Count));
            }

            // Non-vectorized loop
            //
            for (; j < block_size; ++j)
            {
                var distance = B2[i * block_size + k] + B3[k * block_size + j];
                if (B1[i * block_size + j] > distance)
                {
                    B1[i * block_size + j] = distance;
                }
            }
        }
    }
}

As previously, we implement vectorisation using the Vector and Vector<T> types. The implementation has two loops: one vectorised (to calculate the majority of paths) and one non-vectorised (to calculate the remaining paths when the size of the block is not divisible by the size of the vector).
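If the Vector.LessThan / Vector.ConditionalSelect pair looks unfamiliar, here is a tiny standalone sketch (not from the repository; the values are made up) showing how it computes an element-wise minimum of two vectors, which is exactly what the inner loop does for a batch of paths at once:

using System;
using System.Numerics;

class VectorMinDemo
{
    static void Main()
    {
        var n = Vector<int>.Count; // 8 for int on an AVX2 machine

        // Illustrative "existing" and "candidate" path lengths.
        var existing  = new int[n];
        var candidate = new int[n];
        var random = new Random(42);
        for (var i = 0; i < n; ++i)
        {
            existing[i]  = random.Next(1, 100);
            candidate[i] = random.Next(1, 100);
        }

        var ij_vec  = new Vector<int>(existing);   // existing "i -> j" paths
        var ikj_vec = new Vector<int>(candidate);  // new "i -> k -> j" paths

        // -1 (all bits set) in every lane where the existing path is already shorter
        var lt_vec = Vector.LessThan(ij_vec, ikj_vec);

        // Keep the existing value where it is smaller, take the candidate otherwise
        var r_vec = Vector.ConditionalSelect(lt_vec, ij_vec, ikj_vec);

        Console.WriteLine(ij_vec);
        Console.WriteLine(ikj_vec);
        Console.WriteLine(r_vec); // element-wise minimum of the two vectors
    }
}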
There is nothing to vectorise in the main algorithm and, because the interface of the Procedure doesn’t change, we can plug in the new code and run the experiments.

Experimental Results

Here are the experimental results of the Vector implementation:

Algorithm (Vector)     | Graph      | Mean (s) | Error (s) | StdDev (s) | LLC Miss / Op
Floyd-Warshall         | 4800       | 73.251   | 0.4993    | 0.4671     | 6,353,509,854
Blocked Floyd-Warshall | 4800 (120) | 64.792   | 0.0318    | 0.0282     | 50,703,019

Table 5. Experimental results of “Vector” implementations of Floyd-Warshall and Blocked Floyd-Warshall algorithms.

This time Blocked Floyd-Warshall is around 12% faster than the Floyd-Warshall algorithm.
Running the same code under the profiler also presents us with a different picture:

Algorithm (Vector)     | Clock Ticks     | Instructions Retired | CPI Rate | Memory Bound
Floyd-Warshall         | 161,668,500,000 | 490,026,600,000      | 0.330    | 6.8%
Blocked Floyd-Warshall | 144,314,100,000 | 488,193,300,000      | 0.296    | 0.3%

Table 6. Clock Ticks, Instructions Retired, CPI Rate and Memory Bound parameters from the output of the Intel VTune profiler for “Vector” implementations of Floyd-Warshall and Blocked Floyd-Warshall algorithms.

We can see a significant reduction of the Instructions Retired in both algorithms (which is expected because vector instructions do multiple operations at once and the CPU executes multiple vector instructions per clock tick).

We can also see the change in Memory Bound – it has doubled for Floyd-Warshall and is almost unchanged for the Blocked Floyd-Warshall algorithm.

Still, the CPI Rate is low, which means (again) that the CPU is (almost) not stalling. Therefore, there is room for improvement.

Let’s see if we can apply parallelism here.

Parallel-Vectorised Implementation (aka “Parallel Vector”)

Parallelisation of the Blocked Floyd-Warshall algorithm is a pure form of “task parallelism”, where a “task” is a recalculation of a single block (a single invocation of the Procedure).

The simplest approach to implementation is to use message passing (i.e. sending custom messages), CountdownEvent and ThreadPool.
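If you have not used CountdownEvent before, here is a minimal, self-contained sketch (not the post's implementation, just the fan-out/fan-in pattern it relies on): queue a batch of work items on the ThreadPool, have each of them Signal the event, and Wait until the whole batch has completed.

using System;
using System.Threading;

class FanOutDemo
{
    static void Main()
    {
        // One signal is expected per queued work item.
        const int tasks = 8;
        using var done = new CountdownEvent(tasks);

        for (var i = 0; i < tasks; ++i)
        {
            ThreadPool.QueueUserWorkItem(static state =>
            {
                var (evt, id) = state;

                // ... recalculate a block here ...
                Console.WriteLine($"work item {id} finished");

                evt.Signal(); // decrement the counter
            }, (done, i), preferLocal: false);
        }

        // Blocks the current thread until all 8 signals have arrived,
        // i.e. until the whole batch is processed.
        done.Wait();
        Console.WriteLine("batch complete");
    }
}

The algorithm below uses exactly this pattern twice per iteration: once for the “horizontal” and “vertical” blocks, and once for the “peripheral” blocks, resetting the counter in between.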
Here is the code of the custom message:

private readonly struct ParallelMessage
{
    // A reference to a countdown event to signal
    // when the message is processed
    //
    public readonly CountdownEvent Event;

    public readonly int[] Matrix;
    public readonly int OffsetB1;
    public readonly int OffsetB2;
    public readonly int OffsetB3;
    public readonly int BlockSize;
    public readonly int SpanLength;

    public ParallelMessage(
        CountdownEvent Event,
        int[] Matrix,
        int OffsetB1,
        int OffsetB2,
        int OffsetB3,
        int BlockSize,
        int SpanLength)
    {
        this.Event = Event;
        this.Matrix = Matrix;
        this.OffsetB1 = OffsetB1;
        this.OffsetB2 = OffsetB2;
        this.OffsetB3 = OffsetB3;
        this.BlockSize = BlockSize;
        this.SpanLength = SpanLength;
    }
}

The message is a readonly struct (for performance reasons, to avoid allocations) which includes all information required to initialise three Span's for the B1, B2 and B3 blocks and invoke the Procedure.

The conversion of a ParallelMessage into an invocation of the Procedure is handled by ParallelProcedure:

static void ParallelProcedure(
    ParallelMessage message)
{
    var B1 = new Span<int>(message.Matrix, message.OffsetB1, message.SpanLength);
    var B2 = new Span<int>(message.Matrix, message.OffsetB2, message.SpanLength);
    var B3 = new Span<int>(message.Matrix, message.OffsetB3, message.SpanLength);

    Procedure(B1, B2, B3, message.BlockSize);

    // Signal the message has been processed
    //
    message.Event.Signal();
}

You can see that, besides information about the blocks, the ParallelMessage includes a reference to a CountdownEvent, which ParallelProcedure signals when the calculations are complete.

Here is the code:

private static void ParallelVectorOptimisations(
    int[] matrix, int block_count, int block_size)
{
    var iteration_sync = new CountdownEvent(0);

    var lineral_block_size = block_size * block_size;
    var lineral_block_row_size = block_count * lineral_block_size;

    for (var m = 0; m < block_count; ++m)
    {
        var offset_mm = m * lineral_block_row_size + m * lineral_block_size;

        var mm = new Span<int>(matrix, offset_mm, lineral_block_size);

        Procedure(mm, mm, mm, block_size);

        // We calculate how many signals we expect to receive
        // (when all signals are received the event itself becomes signalled)
        //
        // We expect to have one row of blocks...
// ("horizontal" blocks except diagonal block) // // ...and one column of blocks // ("vertical" blocks except diagonal block) // // Hence, 2 * block_count - 2 = 2 * (block_count - 1) // iteration_sync.Reset(2 * (block_count - 1)); for (var i = 0; i < block_count; ++i) { if (i != m) { var offset_im = i * lineral_block_row_size + m * lineral_block_size; var offset_mi = m * lineral_block_row_size + i * lineral_block_size; var message_x = new ParallelMessage( iteration_sync, matrix, offset_im, offset_im, offset_mm, block_size, lineral_block_size); var message_y = new ParallelMessage( iteration_sync, matrix, offset_mi, offset_mm, offset_mi, block_size, lineral_block_size); ThreadPool.QueueUserWorkItem(ParallelProcedure, message_x, false); ThreadPool.QueueUserWorkItem(ParallelProcedure, message_y, false); } } // Here we are waiting for all "horizontal" and "vertical" blocks // to be recalculated // iteration_sync.Wait(); // Now we expect all blocks except one column and one row // which essentially mean if we have a 4x4 matrix of blocks // we expect to recalculate 3x3 blocks // // Hence, (block_count - 1) * (block_count - 1) // iteration_sync.Reset((block_count - 1) * (block_count - 1)); for (var i = 0; i < block_count; ++i) { if (i != m) { var offset_im = i * lineral_block_row_size + m * lineral_block_size; for (var j = 0; j < block_count; ++j) { if (j != m) { var offset_ij = i * lineral_block_row_size + j * lineral_block_size; var offset_mj = m * lineral_block_row_size + j * lineral_block_size; var message = new ParallelMessage( iteration_sync, matrix, offset_ij, offset_im, offset_mj, block_size, lineral_block_size); ThreadPool.QueueUserWorkItem(ParallelProcedure, message, false); } } } } // Waiting for all "peripheral" blocks to be recalculated // before moving to the next iteration, with next "diagonal" block // iteration_sync.Wait(); } } private static void ParallelVectorOptimisations( int[] matrix, int block_count, int block_size) { var iteration_sync = new CountdownEvent(0); var lineral_block_size = block_size * block_size; var lineral_block_row_size = block_count * lineral_block_size; for (var m = 0; m < block_count; ++m) { var offset_mm = m * lineral_block_row_size + m * lineral_block_size; var mm = new Span<int>(matrix, offset_mm, lineral_block_size); Procedure(mm, mm, mm, block_size); // We calculate how many signals we expect to receive // (when all signals are received the event itself becomes signalled) // // We expect to have one row of blocks... 
// ("horizontal" blocks except diagonal block) // // ...and one column of blocks // ("vertical" blocks except diagonal block) // // Hence, 2 * block_count - 2 = 2 * (block_count - 1) // iteration_sync.Reset(2 * (block_count - 1)); for (var i = 0; i < block_count; ++i) { if (i != m) { var offset_im = i * lineral_block_row_size + m * lineral_block_size; var offset_mi = m * lineral_block_row_size + i * lineral_block_size; var message_x = new ParallelMessage( iteration_sync, matrix, offset_im, offset_im, offset_mm, block_size, lineral_block_size); var message_y = new ParallelMessage( iteration_sync, matrix, offset_mi, offset_mm, offset_mi, block_size, lineral_block_size); ThreadPool.QueueUserWorkItem(ParallelProcedure, message_x, false); ThreadPool.QueueUserWorkItem(ParallelProcedure, message_y, false); } } // Here we are waiting for all "horizontal" and "vertical" blocks // to be recalculated // iteration_sync.Wait(); // Now we expect all blocks except one column and one row // which essentially mean if we have a 4x4 matrix of blocks // we expect to recalculate 3x3 blocks // // Hence, (block_count - 1) * (block_count - 1) // iteration_sync.Reset((block_count - 1) * (block_count - 1)); for (var i = 0; i < block_count; ++i) { if (i != m) { var offset_im = i * lineral_block_row_size + m * lineral_block_size; for (var j = 0; j < block_count; ++j) { if (j != m) { var offset_ij = i * lineral_block_row_size + j * lineral_block_size; var offset_mj = m * lineral_block_row_size + j * lineral_block_size; var message = new ParallelMessage( iteration_sync, matrix, offset_ij, offset_im, offset_mj, block_size, lineral_block_size); ThreadPool.QueueUserWorkItem(ParallelProcedure, message, false); } } } } // Waiting for all "peripheral" blocks to be recalculated // before moving to the next iteration, with next "diagonal" block // iteration_sync.Wait(); } } The implementation is straightforward (except, maybe the usage of CountdownEvent and .NET concurrency primitives). CountdownEvent Let’s see how combination of vectorisation and parallelism impacts the performance. Experimental Results Here are the execution results: Algorithm (Parallel Vector) Graph Mean (s) Error (s) StdDev (s) LLC Miss / Op Floyd-Warshall 4800 32.311 s 0.0542 s 0.0480 s 4,277,045,385 Blocked Floyd-Warshall 4800 (120) 3.396 s 0.0014 s 0.0013 s 90,435,311 Algorithm (Parallel Vector) Graph Mean (s) Error (s) StdDev (s) LLC Miss / Op Floyd-Warshall 4800 32.311 s 0.0542 s 0.0480 s 4,277,045,385 Blocked Floyd-Warshall 4800 (120) 3.396 s 0.0014 s 0.0013 s 90,435,311 Algorithm (Parallel Vector) Graph Mean (s) Error (s) StdDev (s) LLC Miss / Op Algorithm (Parallel Vector) Algorithm (Parallel Vector) Graph Graph Mean (s) Mean (s) Error (s) Error (s) StdDev (s) StdDev (s) LLC Miss / Op LLC Miss / Op Floyd-Warshall 4800 32.311 s 0.0542 s 0.0480 s 4,277,045,385 Floyd-Warshall Floyd-Warshall 4800 4800 32.311 s 32.311 s 0.0542 s 0.0542 s 0.0480 s 0.0480 s 4,277,045,385 4,277,045,385 Blocked Floyd-Warshall 4800 (120) 3.396 s 0.0014 s 0.0013 s 90,435,311 Blocked Floyd-Warshall Blocked Floyd-Warshall 4800 (120) 4800 (120) 3.396 s 3.396 s 0.0014 s 0.0014 s 0.0013 s 0.0013 s 90,435,311 90,435,311 Table 7. Experimental results of “Parallel Vector” implementation of Floyd-Warshall and Blocked Floyd-Warshall algorithms. This isn’t a mistake. Introduction of the parallelism into Blocked Floyd-Warshall algorithm results in almost x20 performance improvement compared to vectorised version, while Floyd-Warshall speedup is much more moderate – around x2 . 
Running the code under the profiler reveals the reasons:

Algorithm (Parallel Vector) | Clock Ticks       | Instructions Retired | CPI Rate | Memory Bound
Floyd-Warshall              | 2,038,598,100,000 | 606,769,800,000      | 3.360    | 78.9%
Blocked Floyd-Warshall      | 254,444,400,000   | 483,594,300,000      | 0.526    | 0.0%

Table 8. Clock Ticks, Instructions Retired, CPI Rate and Memory Bound parameters from the output of the Intel VTune profiler for “Parallel Vector” implementations of Floyd-Warshall and Blocked Floyd-Warshall algorithms.

The Memory Bound of Floyd-Warshall is a crushing 78.9% – which essentially means that most of the time the CPU was waiting for data from memory instead of doing any kind of calculations.

This might seem unexpected, because previously the stats of both algorithms were quite close. But don’t forget the important part – modern CPUs are extremely good at single-threaded execution, but the picture is very different when it comes to multi-threading. Just look at the results of the parallel (non-vectorised) version of the Floyd-Warshall algorithm:

Algorithm (Parallel) | Graph | Mean (s) | Error (s) | StdDev (s) | LLC Miss / Op
Floyd-Warshall       | 4800  | 27.629   | 0.0101    | 0.0094     | 4,571,752,038

Table 9. Experimental results of the “Parallel” implementation of the Floyd-Warshall algorithm.

It is FASTER (27.629 seconds vs 32.311 seconds) than the vectorised version because it puts less pressure on memory:

Algorithm (Parallel) | Clock Ticks       | Instructions Retired | CPI Rate | Memory Bound
Floyd-Warshall       | 1,907,472,000,000 | 2,449,542,900,000    | 0.779    | 30.5%

Table 10. Clock Ticks, Instructions Retired, CPI Rate and Memory Bound parameters from the output of the Intel VTune profiler for the “Parallel” implementation of the Floyd-Warshall algorithm.
The Floyd-Warshall algorithm is bound by memory and its scaling is limited (even using more cores results in only a tiny speedup). However, this doesn’t apply to Blocked Floyd-Warshall, which actually CAN BE improved further because currently it is limited neither by CPU nor by memory. But this is a completely different story.

Conclusion

In this post we implemented one more algorithm for solving the all-pairs shortest path problem. Along the way, we learned the basics of caching and reiterated on linear memory access, vectorisation and the basic concurrency capabilities of .NET.

We also saw with our own eyes how fast cache-aware (aka cache-friendly) algorithms can be, especially when it comes to parallel execution.

I hope you enjoyed the reading and THANK YOU for staying to the end.

References

1. You can find the mentioned latency values and more in the 7-Zip LZMA Benchmarks.
2. Venkataraman, G., Sahni, S., Mukhopadhyaya, S. A Blocked All-Pairs Shortest Paths Algorithm // Journal of Experimental Algorithmics (JEA). Vol 8. 2003. P. 857–874.
3. Karasik, O. N., and A. A. Prihozhy. “Tuning Block-Parallel All-Pairs Shortest Path Algorithm For Efficient Multi-Core Implementation.” Системный анализ и прикладная информатика 3 (2022): 57–65.