Effective Parallel Computing


By Mukul Malik (@mukulmalik)

“For over a decade prophets have voiced …… single computer has reached its limits and that truly significant advances can be made only by interconnection of a multiplicity of computers.” — Gene Amdahl in 1967

Now that statement was made 49 years ago! So for almost half a century people have realised that serial computing is going to lead us nowhere.

So serial code is like 70’s clothing: yes, it was nice back then, but not anymore, for God’s sake!

Until recently there weren’t many applications that used parallel processing; in fact, they had no need for it. It was video games that inspired the GPU (in your face, parents!).

Now, with the introduction of machine learning, there couldn’t be a more useful application for GPUs. Why?

Well, machine learning effectively means applying the same computational steps over and over again, millions or billions of times! These tasks are not very complex in nature, but they still take some time, and minimising that time is very hard. Luckily, these computations are largely independent of each other.

Now compare a typical CPU, which has 4 physical cores or 8 virtual cores (you may go for E7 Xeons, but you have to be a millionaire for that!), with a GPU like the Nvidia GTX 1080, which has 2,560 cores! That’s 320 times as many as those 8 virtual cores, at a comparable price!

So what do we do to speed things up? We run a few thousand computations in parallel!


Imagine a boy who is punished by having to write “I won’t sleep during class anymore” a million times, i.e. 1,000,000 times (a scary teacher, this one!).

It normally takes him 14 secs to write one of those lines; if he speeds things up he might be able to do it in 10 secs. To write it 1,000,000 times he would need:

10 * 1,000,000 secs = 166,667 mins ≈ 2,778 hours ≈ 116 days ≈ 4 months

So he would be writing for almost 4 months straight! That does not make sense!

Now suppose he had 1,000 people helping him write the same thing, each of them taking 20 secs to write one sentence:

(20 * 1,000,000) / 1,000 secs = 20,000 secs ≈ 333 mins ≈ 5.6 hours

Despite everyone writing at half the speed, the speedup is insane!

A workload of almost 4 months has been done in under 6 hours!!
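
If you want to check the arithmetic of this analogy yourself, here is a minimal Python sketch. The function name writing_time_seconds is made up for this post, and it assumes the lines split perfectly evenly among the writers with no coordination overhead, which real parallel hardware never quite gives you.

```python
def writing_time_seconds(lines, seconds_per_line, writers=1):
    """Wall-clock seconds when `lines` are split evenly among `writers`."""
    return lines * seconds_per_line / writers


LINES = 1_000_000

# One fast writer (10 secs per line) versus 1,000 slow helpers (20 secs per line).
alone = writing_time_seconds(LINES, seconds_per_line=10)
helped = writing_time_seconds(LINES, seconds_per_line=20, writers=1_000)

print(f"alone:   {alone / 3600:.0f} hours")   # alone:   2778 hours
print(f"helped:  {helped / 3600:.2f} hours")  # helped:  5.56 hours
print(f"speedup: {alone / helped:.0f}x")      # speedup: 500x
```

Each helper is twice as slow, so 1,000 helpers give a 500x speedup rather than 1,000x.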

How does it happen, in depth!

First, GPU vs CPU

GPU

  1. High Throughput
  2. Less complex computations
  3. Less flexible in terms of working
  4. Clocked at significantly lower clock speed

CPU

  1. Low Latency
  2. Very complex computations
  3. Very flexible in terms of working
  4. Clocked at significantly higher clock speeds

Now a GPU is like a bunch of hard-working interns: they can’t take very critical decisions, but give them a job and they will work on it with dedication.

So the GPU acts as a co-processor and not as the primary processor. In terms of master-slave computing, the GPU is slave to the CPU’s commands.

Now let’s have a deeper look at the steps involved:

  1. CPU allocates memory on GPU
  2. CPU copies input data to GPU
  3. CPU launches kernels on GPU to process the input data (a kernel is a small part of the original code, written as serial code for one thread, which the device runs across many threads)
  4. CPU copies output results back to itself

So how do we define parallelism on a GPU?

GPU Parallelism = No. of blocks * Threads per block

How does a typical kernel call look?

kernel<<< Grid of Blocks , Block of Threads , Shared Memory per Block >>>(arguments)

When __syncthreads() is called inside a kernel, each thread waits until all threads within the same block have reached that point. This instruction to wait for the other threads is called a barrier.
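
To make the "blocks times threads" idea concrete, here is a small Python sketch that imitates a launch over a 1D grid. It runs sequentially, so it only shows the indexing, not the speed. The names launch and add_one are invented for illustration; the index formula (block index * block size + thread index) is the usual way a CUDA thread finds its element.

```python
def launch(kernel, num_blocks, threads_per_block, *args):
    """Sequential stand-in for a kernel launch over a 1D grid of 1D blocks."""
    for block_idx in range(num_blocks):
        for thread_idx in range(threads_per_block):
            kernel(block_idx, threads_per_block, thread_idx, *args)


def add_one(block_idx, block_dim, thread_idx, data, out):
    """Kernel body: each (block, thread) pair handles at most one element."""
    i = block_idx * block_dim + thread_idx  # global element index
    if i < len(data):  # surplus threads in the last block do nothing
        out[i] = data[i] + 1


data = list(range(10))
out = [None] * len(data)

threads_per_block = 4
# Round up so that every element gets a thread.
num_blocks = (len(data) + threads_per_block - 1) // threads_per_block

launch(add_one, num_blocks, threads_per_block, data, out)
print(num_blocks * threads_per_block)  # 12 threads for 10 elements
print(out)                             # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

The bounds check matters because the grid is rounded up: 3 blocks of 4 threads cover 12 slots, but only 10 of them hold data.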


What are the fundamental instructions that make computing possible and how are they implemented?

Well, starting off with the basics:

Vector Operations

Vector operations execute element by element operations on corresponding elements of vectors (Single or Multi-dimensional). If the operands have the same size, then each element in the first operand gets matched up with the element in the same location in the second operand. If the operands have compatible sizes, then each input is implicitly expanded as needed to match the size of the other.

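Example: combining two arrays of different sizes

[ [1,2] , [3,4] ] + [5,6] <- The shapes differ (2x2 vs 1x2), so without implicit expansion this will give an error

So we expand the second operand into a 2x2 matrix, here by padding it with zeros (the implicit expansion described above would instead repeat the row [5,6])

[ [1 , 2] , [3 , 4] ] + [ [5 , 6] , [0 , 0] ]

Let’s try element-wise multiplication on this pair

[ [1*5 , 2*6] , [3*0 , 4*0] ] = [ [5 , 12] , [0 , 0] ]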
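
The element-wise example above can be reproduced with a few lines of plain Python. This is only a sketch: the helper name elementwise is made up, and it deliberately refuses operands of different shapes instead of expanding them.

```python
def elementwise(op, a, b):
    """Apply `op` to corresponding elements of two equally shaped 2D lists."""
    same_shape = len(a) == len(b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(a, b)
    )
    if not same_shape:
        raise ValueError("operands must have the same shape")
    return [[op(x, y) for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


a = [[1, 2], [3, 4]]
b = [[5, 6], [0, 0]]  # [5, 6] padded with a row of zeros

print(elementwise(lambda x, y: x * y, a, b))  # [[5, 12], [0, 0]]
print(elementwise(lambda x, y: x + y, a, b))  # [[6, 8], [3, 4]]
```

Every output element depends on exactly one pair of inputs, which is why this kind of operation parallelises so easily.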

Matrix Operations

Matrix operations follow the rules of linear algebra and are not compatible with multidimensional arrays. The required size and shape of the inputs in relation to one another depends on the operation. For non-scalar inputs, the matrix operators generally calculate different answers than their array operator counterparts.


Multiplying the same two matrices using matrix operations results in

[ [(1*5)+(2*0) , (1*6)+(2*0)] , [(3*5)+(4*0) , (3*6)+(4*0)] ]

Which equals:

[ [5 , 6] , [15 , 18] ]
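
For comparison with the element-wise result, here is a minimal Python sketch of the matrix product (rows of the first matrix times columns of the second). The function name matmul is chosen for this post; this is the naive textbook triple loop, not how an optimised library does it.

```python
def matmul(a, b):
    """Naive matrix product of an (m x k) and a (k x n) list-of-lists matrix."""
    inner = len(b)
    if any(len(row) != inner for row in a):
        raise ValueError("columns of a must match rows of b")
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


a = [[1, 2], [3, 4]]
b = [[5, 6], [0, 0]]

print(matmul(a, b))  # [[5, 6], [15, 18]]
```

Same inputs, different answer from the element-wise product: that is the difference between array operators and matrix operators.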

Memops: memory operations, i.e. reading data from or writing data to memory.
Flops: floating-point operations, i.e. the arithmetic computations themselves.

BLAS (Basic Linear Algebra Subprograms)

Dense linear algebra operations are often at the heart of scientific computations that stress even the fastest computers available. As a result, it is important that routines that compute these operations attain high performance in the sense that they perform near the minimal number of operations and achieve near the highest possible rate of execution.

Level 1 : AXPY

Denotes Vector-Vector operations of the form

y <- αx + y

Or: y equals αx plus y, which is where the name axpy (‘a x plus y’) comes from. Here

α ∈ R
x, y ∈ R^n

Here ‘α’, ‘x’ and ‘y’ rest in the main memory. An efficient implementation of axpy will load ‘α’ from memory to a register and will then compute with ‘x’ and ‘y’, which must be fetched from memory as well. The updated result, ‘y’, must also be stored, for a total of about 3n memops for the 2n flops that are executed. Thus, three memops are required for every two flops. If memops are more expensive than flops (as is usually the case), it is the memops that limit the performance that can be attained for axpy.

i.e. the speed of moving data to and from memory forms the bottleneck.
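
The 3n memops versus 2n flops count can be seen by instrumenting a toy axpy in Python. This is a sketch, not a real BLAS routine: the counters are bookkeeping I added, and the scalar is assumed to sit in a register, so it is not counted.

```python
def axpy(alpha, x, y):
    """Update y in place to alpha * x + y; return (memops, flops) counted."""
    memops = flops = 0
    for i in range(len(x)):
        xi = x[i]              # load x[i]
        yi = y[i]              # load y[i]
        memops += 2
        y[i] = alpha * xi + yi # one multiply, one add, one store
        flops += 2
        memops += 1
    return memops, flops


x = [1, 2, 3]
y = [10, 20, 30]
memops, flops = axpy(2, x, y)

print(y)              # [12, 24, 36]
print(memops, flops)  # 9 6
```

For n = 3 that is 9 memops for 6 flops, the three-to-two ratio from the text.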

Level 2 : GEMV

Denotes General Matrix-Vector operations of the form

y <- Ax + y

Here ‘x’ and ‘y’ are vectors but ‘A’ is a matrix.

x, y ∈ R^n
A ∈ R^(n×n)

This operation involves roughly n^2 data (for the matrix), initially stored in memory, and 2n^2 flops. Thus, an optimal implementation will fetch every element of ‘A’ exactly once, yielding a ratio of one memop for every two flops. Although this is better than the ratio for the axpy, memops still dominate the cost of the algorithm if they are much slower than flops.
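
The same bookkeeping trick works for gemv. In this Python sketch (the function name and counters are my own, and it assumes x is loaded once and then stays in fast memory), every element of A is fetched exactly once and used for one multiply and one add.

```python
def gemv(A, x, y):
    """Update y in place to A x + y for square A; return (memops, flops)."""
    n = len(x)
    memops = n  # load x once and keep it close (an assumption of this sketch)
    flops = 0
    for i in range(n):
        acc = y[i]                  # load y[i]
        memops += 1
        for j in range(n):
            acc += A[i][j] * x[j]   # fetch A[i][j] once: one multiply, one add
            memops += 1
            flops += 2
        y[i] = acc                  # store y[i]
        memops += 1
    return memops, flops


A = [[1, 2], [3, 4]]
x = [1, 1]
y = [0, 0]
memops, flops = gemv(A, x, y)

print(y)              # [3, 7]
print(memops, flops)  # 10 8
```

The count is n^2 + 3n memops for 2n^2 flops, so for large n the matrix dominates and the ratio approaches one memop per two flops.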

Level 3 : GEMM

Denotes General Matrix Matrix operations of the form

C <- α op(A) op(B) + βC


α, β ∈ R
op(A) ∈ R^(m×k)
op(B) ∈ R^(k×n)
C ∈ R^(m×n)

The term ‘op(X)’ here signifies that the matrix can either be X or X^T (transpose).

So the equation above can take any of the following forms:

C <- α A B + βC
C <- α A^T B + βC
C <- α A B^T + βC
C <- α A^T B^T + βC

Consider the product C := AB + C where all three matrices are square of order n. This operation involves 4n^2 memops (A and B must be fetched from memory while C must be both fetched and stored) and requires 2n^3 flops for a ratio of 4n^2 /2n^3 = 2/n memops/flops. Thus, if n is large enough, the cost of performing memops is small relative to that of performing useful computations with the data, and there is an opportunity to amortize the cost of fetching data into the cache over many computations.
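
To see how differently the three BLAS levels behave as n grows, here is a tiny Python sketch that just evaluates the memops/flops ratios quoted in the text (3n/2n for axpy, n^2/2n^2 for gemv, 4n^2/2n^3 for gemm). The function name memops_per_flop is hypothetical.

```python
def memops_per_flop(level, n):
    """Memops-to-flops ratio for square problems of order n, as counted above."""
    if level == "axpy":
        return (3 * n) / (2 * n)
    if level == "gemv":
        return (n * n) / (2 * n * n)
    if level == "gemm":
        return (4 * n * n) / (2 * n ** 3)
    raise ValueError(f"unknown level: {level}")


for n in (10, 100, 1000):
    ratios = {level: memops_per_flop(level, n) for level in ("axpy", "gemv", "gemm")}
    print(n, ratios)
```

Only the gemm ratio shrinks with n (it is 2/n), which is why matrix-matrix products are the operations that can keep the arithmetic units busy.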

Now there are many implementations of GEMM, depending on m, n and k (the dimensions of the matrices)


C <- A^(m×k) B^(k×n) + C^(m×n)

Following are the ways to perform it

Algorithms for GEMM

GEMM is still one of the most optimisable ways of computing; the only issue is that it loads a large amount of data into memory. So we break the matrices into smaller matrices and multiply them in batches.
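
Here is a minimal Python sketch of that "break it into smaller matrices" idea, usually called blocking or tiling. The function name blocked_matmul and the block size of 2 are arbitrary choices for illustration; a real GEMM picks tile sizes to match the cache or shared memory of the hardware.

```python
def blocked_matmul(a, b, block=2):
    """Compute a @ b one (block x block) tile at a time."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    for i0 in range(0, m, block):          # tile row of C
        for j0 in range(0, n, block):      # tile column of C
            for k0 in range(0, k, block):  # tiles along the shared dimension
                for i in range(i0, min(i0 + block, m)):
                    for j in range(j0, min(j0 + block, n)):
                        acc = c[i][j]
                        for p in range(k0, min(k0 + block, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[1, 0, 2], [0, 1, 0], [3, 0, 1]]

print(blocked_matmul(a, b))  # [[10, 2, 5], [22, 5, 14], [34, 8, 23]]
```

The result is identical to the plain triple loop; only the order of the work changes, so that each small tile can be reused many times while it is in fast memory.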

Gaussian elimination based methods like LU factorisation are commonly built on top of these matrix routines to solve linear systems. When applicable (symmetric positive-definite matrices), Cholesky decomposition needs roughly half the flops of LU factorisation.

If you are not familiar with Gaussian elimination then this video will be very helpful

Example :

So some deep learning models, like CNNs (Convolutional Neural Networks), can inherently be parallelised to a much greater extent than others, like RNNs (Recurrent Neural Networks).

In the case of a CNN, let’s take the example of a 4096*4096 image being analysed. That means there are 16,777,216 pixels to be analysed. In a pooling layer, each block can be viewed as an independent computation. Hence each block can be solved independently and just the results combined. So heavy parallelism can be implemented.

This also means that the image can easily be broken into batches if not all 16 million pixels fit in memory. So we can load small blocks of these 16 million pixels into memory and still optimise them and apply GEMM.
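
As a toy version of the pooling argument, here is a Python sketch on a 4x4 "image". I am assuming max pooling with non-overlapping 2x2 blocks and image sides divisible by the block size; the post itself does not say which kind of pooling is meant.

```python
def max_pool(image, size=2):
    """Max-pool a 2D list with non-overlapping size x size blocks."""
    rows, cols = len(image), len(image[0])

    def pool_block(r0, c0):
        # Reads only its own block, so every call is independent of the others.
        return max(
            image[r][c]
            for r in range(r0, r0 + size)
            for c in range(c0, c0 + size)
        )

    return [
        [pool_block(r0, c0) for c0 in range(0, cols, size)]
        for r0 in range(0, rows, size)
    ]


image = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]

print(max_pool(image))  # [[4, 2], [2, 8]]
```

Because pool_block never touches another block's pixels, the four calls could run in any order, or all at once, and the blocks could just as well be loaded into memory one batch at a time.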

Let’s assume this is an image


This block will be broken into smaller matrices


GPU Computation

What is the architecture of GPU Computation?


GPGPU stands for General-purpose computing on graphics processing units, which basically means using the GPU for computations.

Essentially, a GPGPU pipeline is a kind of parallel processing between one or more GPUs and CPUs that analyzes data as if it were in image or other graphic form. Migrating data into graphical form and then using the GPU to scan and analyze it can result in profound speedup.
— Wikipedia

GPU Based Routines

How is computation on a GPU done at the fundamental level? And why is it so effective?


Map

The map operation simply applies the given function (the kernel) to every element in the stream. A simple example is multiplying each value in the stream by a constant. On the GPU, a fragment is generated for each pixel on screen and a fragment program is applied to each one. The result stream of the same size is stored in the output buffer.


Reduce

Some computations require calculating a smaller stream (possibly a stream of only 1 element) from a larger stream. This is called a reduction of the stream. Generally, a reduction can be performed in multiple steps. The results from the prior step are used as the input for the current step and the range over which the operation is applied is reduced until only one stream element remains.
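
The multi-step reduction described above can be sketched in Python as a pairwise "tree" reduction. The function name tree_reduce is invented here, and the sketch assumes the operator is associative (like + or max).

```python
from operator import add


def tree_reduce(op, stream):
    """Reduce a non-empty stream by combining neighbouring pairs in rounds."""
    values = list(stream)
    if not values:
        raise ValueError("cannot reduce an empty stream")
    while len(values) > 1:
        # All pairs within one round are independent of each other.
        paired = [op(values[i], values[i + 1]) for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:
            paired.append(values[-1])  # odd element out moves to the next round
        values = paired
    return values[0]


print(tree_reduce(add, [1, 2, 3, 4, 5, 6, 7, 8]))  # 36
print(tree_reduce(max, [3, 9, 2, 7, 5]))           # 9
```

Eight elements need only three rounds (8 -> 4 -> 2 -> 1), and within a round every pair could be handled by a different thread.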

Stream filtering

Stream filtering is essentially a non-uniform reduction. Filtering involves removing items from the stream based on some criteria.


Scan

The scan operation, also termed parallel prefix sum, takes in a vector (stream) of data elements and an (arbitrary) associative binary function ‘+’ with an identity element ‘i’.

If the input is [a0, a1, a2, a3, …], an exclusive scan produces the output [i, a0, a0 + a1, a0 + a1 + a2, …], while an inclusive scan produces the output [a0, a0 + a1, a0 + a1 + a2, a0 + a1 + a2 + a3, …]. While at first glance the operation may seem inherently serial, efficient parallel scan algorithms are possible and have been implemented on graphics processing units. The scan operation has uses in e.g., quicksort and sparse matrix-vector multiplication.
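
To show that scan is not as serial as it looks, here is a Python sketch of one well-known parallel formulation, the Hillis-Steele scan. That algorithm is my choice for illustration, not something named in the text, and the function names are made up. Each pass of the loop could update all elements simultaneously.

```python
from operator import add


def inclusive_scan(op, stream):
    """Hillis-Steele style inclusive scan: about log2(n) passes over the data."""
    values = list(stream)
    step = 1
    while step < len(values):
        # Build a new list so every element reads the previous pass's values.
        values = [
            values[i] if i < step else op(values[i - step], values[i])
            for i in range(len(values))
        ]
        step *= 2
    return values


def exclusive_scan(op, identity, stream):
    """Shift the inclusive result right by one and start from the identity."""
    inclusive = inclusive_scan(op, stream)
    return ([identity] + inclusive)[: len(inclusive)]


print(inclusive_scan(add, [1, 2, 3, 4, 5]))     # [1, 3, 6, 10, 15]
print(exclusive_scan(add, 0, [1, 2, 3, 4, 5]))  # [0, 1, 3, 6, 10]
```

Five elements take three passes (steps of 1, 2 and 4) instead of four dependent additions in a row.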


Scatter

The scatter operation is most naturally defined on the vertex processor. The vertex processor is able to adjust the position of the vertex, which allows the programmer to control where information is deposited on the grid.

Vertex Processor: Graphics system component that receives as input a set of 3D vertices and processes them to obtain 2D screen positions. Present GPUs have multiple vertex processors working in parallel and can be programmed using vertex programs.

The fragment processor (non-vertex processor) cannot perform a direct scatter operation because the location of each fragment on the grid is fixed at the time of the fragment’s creation and cannot be altered by the programmer. However, a logical scatter operation may sometimes be recast or implemented with another gather step. A scatter implementation would first emit both an output value and an output address. An immediately following gather operation uses address comparisons to see whether the output value maps to the current output slot.


Gather

Gather is the reverse of scatter. After scatter reorders elements according to a map, gather can restore the order of the elements according to the map scatter used.
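
In array terms, scatter writes to computed addresses and gather reads from them. Here is a minimal Python sketch with made-up function names; it assumes the index map is a permutation, so no two elements are scattered to the same slot.

```python
def scatter(values, index_map):
    """out[index_map[i]] = values[i]: each element chooses where it is written."""
    out = [None] * len(values)
    for src, dst in enumerate(index_map):
        out[dst] = values[src]
    return out


def gather(values, index_map):
    """out[i] = values[index_map[i]]: each output slot chooses what it reads."""
    return [values[src] for src in index_map]


values = ["a", "b", "c", "d"]
index_map = [2, 0, 3, 1]

scattered = scatter(values, index_map)
print(scattered)                     # ['b', 'd', 'a', 'c']
print(gather(scattered, index_map))  # ['a', 'b', 'c', 'd']
```

Gathering with the same map undoes the scatter, which is the "reverse" relationship described above.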


Sort

The sort operation transforms an unordered set of elements into an ordered set of elements. The most common implementation on GPUs is using radix sort for integer and floating point data and coarse-grained merge sort and fine-grained sorting networks for general comparable data.
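
For a feel of how radix sort works, here is a sequential Python sketch of the least-significant-digit variant for non-negative integers. The function name and the 4-bits-per-pass default are my own choices; GPU implementations build each pass out of parallel scans and scatters rather than Python lists.

```python
def radix_sort(values, bits_per_pass=4):
    """LSD radix sort for non-negative integers, one group of bits per pass."""
    if any(v < 0 for v in values):
        raise ValueError("this sketch handles non-negative integers only")
    out = list(values)
    mask = (1 << bits_per_pass) - 1
    largest = max(out, default=0)
    shift = 0
    while (largest >> shift) > 0:
        buckets = [[] for _ in range(mask + 1)]
        for v in out:
            buckets[(v >> shift) & mask].append(v)  # stable: keeps arrival order
        out = [v for bucket in buckets for v in bucket]
        shift += bits_per_pass
    return out


print(radix_sort([170, 45, 75, 90, 802, 24, 2, 66]))
# [2, 24, 45, 66, 75, 90, 170, 802]
```

There are no comparisons between elements at all, only digit extraction and bucket placement, which maps well onto data-parallel hardware.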


Search

The search operation allows the programmer to find a given element within the stream, or possibly find neighbours of a specified element. The GPU is not used to speed up the search for an individual element, but instead is used to run multiple searches in parallel. Mostly the search method used is binary search on sorted elements.
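
Here is a small Python sketch of "many binary searches at once", using bisect_left from the standard library. The function name search_many is made up, and -1 as the "not found" marker is just a convention picked for this example.

```python
from bisect import bisect_left


def search_many(sorted_stream, queries):
    """Binary-search every query independently; -1 means not found."""

    def find(query):
        i = bisect_left(sorted_stream, query)
        if i < len(sorted_stream) and sorted_stream[i] == query:
            return i
        return -1

    # One independent search per query: on a GPU each could be its own thread.
    return [find(query) for query in queries]


print(search_many([1, 3, 5, 7, 9, 11], [7, 4, 1, 11]))  # [3, -1, 0, 5]
```

A single search stays as slow as ever; the gain comes from running thousands of them side by side.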

Stream Computing

So what is stream computing? And why is it so effective?

The key to using the GPU for purposes other than real-time rendering is to view it as a streaming, data-parallel computer.

Streaming processors such as GPUs are programmed in a fundamentally different way than serial processors like today’s CPUs. A CPU can write to any location in memory at any point in its program. A streaming processor, in contrast, can access memory only in a much more structured manner. In the stream model, programs are expressed as series of operations on data streams. The elements in a stream (that is, an ordered array of data) are processed by the instructions in a kernel. A kernel operates on each element of a stream and writes the results to an output stream.

Working : The stream programming model restrictions allow GPUs to execute kernels in parallel and therefore process many data elements simultaneously. This data parallelism is made possible by ensuring that the computation on one stream element cannot affect the computation on another element in the same stream. Consequently, the only values that can be used in the computation of a kernel are the inputs to that kernel and global memory reads. In addition, GPUs require that the outputs of kernels be independent: kernels cannot perform random writes into global memory (in other words, they may write only to a single stream element position of the output stream). The data parallelism afforded by this model is fundamental to the speedup offered by GPUs over serial processors.

Stream of Data

Additionally : Current GPU fragment processors are single-instruction, multiple-data (SIMD) parallel processors. Current vertex processors are multiple-instruction, multiple-data (MIMD) machines.

Stream Flow

GPU Data Structure

What data structures are implemented for GPU computing? Why and when are they used?


A variety of data structures can be represented on the GPU:

  • Dense arrays
  • Sparse matrices (sparse array) [static or dynamic]
  • Adaptive structures (union type)

NVIDIA GPU Architecture

1. Multidimensional Arrays

Current GPUs provide only 2D rasterization and 2D frame buffers. In addition, current GPUs do not support 1D textures with more than 4,096 elements.

One-dimensional arrays are therefore packed into 2D textures, which lets current GPUs represent 1D arrays containing up to 16,777,216 (4,096x4,096) elements. Each time this packed array is accessed from a fragment or vertex program, the 1D address is converted to a 2D coordinate.
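
The address conversion itself is just a division and a remainder. Here is a Python sketch assuming a row-major layout and a texture width of 4,096; the names WIDTH, to_2d and to_1d are hypothetical.

```python
WIDTH = 4096  # assumed texture width, matching the 4,096 limit in the text


def to_2d(address, width=WIDTH):
    """Convert a 1D array address into an (x, y) texture coordinate."""
    return address % width, address // width


def to_1d(x, y, width=WIDTH):
    """Convert an (x, y) texture coordinate back into the 1D address."""
    return y * width + x


print(to_2d(5000))                # (904, 1)
print(to_1d(904, 1))              # 5000
print(to_2d(4096 * 4096 - 1))     # (4095, 4095)
```

The last address of a 16,777,216-element array lands exactly in the far corner of the 4,096x4,096 texture.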

Three-dimensional arrays, on the other hand, may be stored in one of two ways:

  • 3D texture with each slice stored in a separate 2D texture
Storing a 3D Texture with Separate 2D Slices
  • Packed into a single 2D texture (Harris et al. 2003, Lefohn et al. 2003, Goodnight et al. 2003)
3D Arrays Flattened into a Single 2D Texture

Higher-dimensional arrays can be packed into 2D textures using a generalised form of this packing (Buck et al. 2004).

2. Structures

A “stream of structures” must be defined instead as a “structure of streams”. In this construct, a separate stream is created for each structure member. In addition, the structures may not contain more data than can be output per fragment by the GPU. These restrictions are due to the inability of fragment programs to specify the address to which their frame-buffer result is written (that is, they cannot perform a scatter operation). By specifying structures as a “structure of streams,” each structure member has the same stream index, and all members can therefore be updated by a single fragment program.
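
Outside of graphics this is often called the "array of structures" versus "structure of arrays" choice. Here is a Python sketch with an invented particle example (position x, velocity v): the same stream index picks out the matching members in every stream, and one kernel updates one member stream.

```python
# "Stream of structures": one record per element (not what the GPU wants).
particles = [
    {"x": 0.0, "v": 1.0},
    {"x": 1.0, "v": 0.5},
    {"x": 2.0, "v": -1.0},
]

# "Structure of streams": one stream per member, sharing the same index.
positions = [p["x"] for p in particles]
velocities = [p["v"] for p in particles]


def step_position(x, v, dt):
    """Kernel: reads both member streams at index i, writes one output element."""
    return x + v * dt


dt = 2.0
new_positions = [step_position(x, v, dt) for x, v in zip(positions, velocities)]
print(new_positions)  # [2.0, 2.0, 0.0]
```

The kernel writes only to its own index of the position stream, so it never needs scatter.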

3. Sparse Data Structures

The arrays and structures are dense structures. In other words, all elements in the address space of the arrays contain valid data. There are many problems, however, whose efficient solution requires sparse data structures (such as lists, trees, or sparse matrices). Sparse data structures are an important part of many optimized CPU-based algorithms; brute-force GPU-based implementations that use dense data structures in their place are often slower than their optimized CPU counterparts. In addition, sparse data structures can reduce an algorithm’s memory requirement — an important consideration given the limited amount of available GPU memory.

3.1 Static Sparse Structures : In these structures, the location and number of sparse elements are fixed throughout the GPU computation. For example, the location and number of triangles in the ray-traced scene do not change. Because the structures are static, they do not have to write to computed memory addresses.

All of these structures use one or more levels of indirection to represent the sparse structures in memory. Purcell’s ray acceleration structure, for example, begins with a regular 3D grid of triangle list pointers. The 3D grid texture contains a pointer to the start of the triangle list (stored in a second texture) for that grid cell. Each entry in the triangle list, in turn, contains a pointer to vertex data stored in a third texture. Similarly, the sparse matrix structures use a fixed number of levels of indirection to find nonzero matrix elements.
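
To illustrate "levels of indirection" for a sparse matrix, here is a Python sketch using the compressed sparse row (CSR) layout. CSR is my choice of example and is not named in the text; the array names are made up. A row pointer leads to a run of entries, and each entry's column index leads to the element of x it multiplies.

```python
# The 3x3 matrix [[10, 0, 20], [0, 30, 0], [0, 0, 40]] stored without its zeros.
values = [10, 20, 30, 40]  # nonzero values, row by row
col_index = [0, 2, 1, 2]   # column of each nonzero
row_start = [0, 2, 3, 4]   # where each row begins in `values` (plus an end marker)


def csr_matvec(row_start, col_index, values, x):
    """Sparse matrix-vector product; each row is an independent computation."""
    return [
        sum(values[k] * x[col_index[k]] for k in range(row_start[r], row_start[r + 1]))
        for r in range(len(row_start) - 1)
    ]


print(csr_matvec(row_start, col_index, values, [1, 2, 3]))  # [70, 60, 120]
```

The structure is static: the three index arrays are only read during the computation, never written, so no scatter is needed.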

Purcell’s Sparse Ray-Tracing Data Structure

3.2 Dynamic Sparse Structures : GPU-based sparse data structures that are updated during a GPU computation are an active area of research. Two noteworthy examples are the photon map in Purcell et al. 2003 and the deformable implicit surface representation in Lefohn et al. 2003, 2004.

A photon map is a sparse, adaptive data structure. Purcell et al. (2003) describe an entirely GPU-based photon map renderer. To create the photon map on the GPU, they devise two schemes for writing data to computed memory addresses (that is, scatter) on current GPUs:

  1. The first scheme computes the memory addresses and the data to be stored at those addresses. It then performs the scatter by performing a data-parallel sort operation on these buffers.
  2. The second scheme, stencil routing, uses the vertex processor to draw large points at positions defined by the computed memory addresses. Here is the link if you want to study this in detail.

Another GPU-based dynamic sparse data structure is the sparse volume structure used for implicit surface deformation by Lefohn et al. (2003, 2004).

A key component of the system is the way in which the GPU requests that the CPU allocate or free tiles. In summary, this dynamic sparse data structure solves the problem of requiring scatter functionality by sending small messages to the CPU when the GPU data structure needs to be updated. The structure uses the blocking strategy which is effective for several reasons:

  1. The amount of GPU-CPU communication is minimised by using the compressed bit vector message format.
  2. CPU serves only as a memory manager and lets the GPU perform all of the “heavy” computation. Note that the implicit surface data resides only on the GPU throughout the deformation.
  3. The dynamic sparse representation enables the computation and memory requirements to scale with the surface area of the implicit surface rather than the volume of its bounding box. This is an important optimisation, which if ignored would allow CPU-based implementations to easily outpace the GPU version.


CUDA is a parallel computing platform and programming model invented by NVIDIA. It enables dramatic increases in computing performance by harnessing the power of the graphics processing unit (GPU).

CUDA 8.0 comes with the following libraries:

  • CUBLAS — CUDA Basic Linear Algebra Subroutines library
  • CUDART — CUDA RunTime library
  • CUFFT — CUDA Fast Fourier Transform library
  • CURAND — CUDA Random Number Generation library
  • CUSOLVER — CUDA based collection of dense and sparse direct solvers
  • CUSPARSE — CUDA Sparse Matrix library
  • NPP — NVIDIA Performance Primitives library
  • NVGRAPH — NVIDIA Graph Analytics library
  • NVML — NVIDIA Management Library
  • NVRTC — NVRTC RunTime Compilation library for CUDA C++


  • There is a paradigm shift in the world of computation.
  • This shift is the primary reason why Deep Learning at its current level was made possible.
  • The thirst for more and more efficient ways of computing is only going to increase.
  • Investment in GPU computation is increasing rapidly, as GPUs do the same work at just a small fraction of a CPU’s cost.

This blog had minimal maths. If you want to study effective computation methods in detail, then “The Science of Programming Matrix Computations” by Robert A. van de Geijn is a good place to start.

To read more about NVIDIA GPU this link would be helpful. For details on CUDA this is the index link.

