
Navigating the Parallel Programming Landscape: A Deep Dive into SIMD and MIMD Architectures

by Durganshu Mishra · February 20th, 2024

Too Long; Didn't Read

This article gives a detailed overview of the SIMD and MIMD programming models. Dig deep into what SIMD intrinsics, shared memory, distributed memory, and PGAS programming models are all about.

In my previous blog, which introduced this multithreading series, we delved into Flynn’s taxonomy and meticulously explored various computer architectures. If you haven’t had the chance to peruse it yet, I highly encourage you to do so.

In this installment, we circle back to delve deeper into SIMD and MIMD architectures, fundamental pillars of parallel computing. These architectures are the cornerstone upon which several parallel programming models are built. The evolution of these models closely mirrors advancements in computer architecture. Hence, understanding architectural intricacies becomes imperative for maximizing resource efficiency.

In this journey, we’ll discuss how these architectures differ and what types of parallel programming models exist, shedding light on their diverse applications and implementations.

Highlights: SIMD or Vector Programming| Shared Memory Model | Distributed Memory model | Distributed Shared Memory Model | PGAS


SIMD or Vector Programming

Modern processors contain vector units under the hood. In contrast to scalar processors, which handle data elements one at a time, vector processors excel at processing one-dimensional arrays of data at once. Notably, Intel’s x86 processors with the SSE, AVX, or AVX-512 instruction sets all support SIMD instructions.

Similarly, Nvidia GPUs use the concept of ‘Warps’, and AMD uses ‘Wavefronts’.

How to exploit SIMD capabilities?

Well, there are many ways of doing so. Most modern compilers are very capable of auto-vectorizing the code. We discussed some of these compiler flags (for Intel C++ Compilers) in my post on Compiler Optimizations. Make sure to check that out.

GCC also supports many of these optimizations. Apart from compiler flags similar to ICC’s, pragmas can be used to request available optimizations.

#pragma GCC optimize("O3,unroll-loops")

#pragma GCC target("avx2,bmi,bmi2,lzcnt,popcnt")

Another common method is using the SIMD intrinsic functions.

SIMD intrinsics are specialized functions or operations provided by programming languages and compilers to utilize the capabilities of SIMD architectures directly. They allow programmers to explicitly write code that takes advantage of parallel processing capabilities without relying solely on compiler optimizations.

These intrinsics map directly to specific SIMD instructions supported by the hardware.

GCC allows users to include headers specific to the supported instruction sets (like xmmintrin.h for SSE, emmintrin.h for SSE2, etc.). For convenience, immintrin.h can be included instead: it pulls in all of the instruction-set-specific headers, and the compiler’s target flags determine which intrinsics are actually available.

The details of implementing and using SIMD intrinsics deserve an article of their own, and we’ll skip them for now. But feel free to check some of the code samples and Intel intrinsics guide to know more about them.

Besides pragmas and intrinsics, libraries, APIs, and even separate languages (built on C/C++) like CUDA, Kokkos, OpenCL, etc., let users leverage the SIMD capabilities of the underlying hardware.

MIMD: Shared memory vs distributed memory

MIMD is the most widely used class of computer architectures. Depending on how the processors access memory, MIMD machines are classified into two categories: shared memory and distributed memory.

Shared Memory Models

In shared memory architecture, multiple processors or cores within a system share access to a common, centralized memory space. All processors can read and write data to the same physical memory locations. This shared memory space allows for communication and data sharing among the processors, facilitating coordination and collaboration in parallel processing tasks.

From the viewpoint of a programmer, the shared memory model inherently assumes cache coherency, which is maintained by the hardware, so the programmer doesn’t need to worry about it. The programmer is still responsible for data consistency (data synchronization) and control flow (control synchronization) when coordinating accesses between the processing elements (like threads).

However, there is a lack of flexibility in distributing tasks. More importantly, scalability becomes an issue beyond a certain number of processors: peak performance stays low, and latency increases as multiple processors compete for the limited memory.

Over the years, several variants of the shared memory model have been proposed. The two most popular ones are Uniform Memory Access (UMA) and Non-uniform Memory Access systems (NUMA).

  1. Uniform Memory Access (UMA)

    In Uniform Memory Access (UMA), all the processors in a multiprocessor system have equal access time to the shared, global memory. Thus, accessing any location in the shared memory takes approximately the same time (same latency), regardless of which processor initiates the access. This architecture is often associated with bus-based symmetric multiprocessors (SMP).

    In the earlier days, computing systems were designed to follow this philosophy. However, as the number of processors increased, the system bus/crossbar switch connecting the processors to the main memory became a bottleneck. This posed a significant scalability issue, and alternatives were sought.

  2. Non-uniform Memory Access systems (NUMA)

    Modern multi-processor computing systems follow the NUMA model. Here, the memory is divided into multiple regions, and each processor or group of processors is assigned to a specific memory region. Thus, a processor can access its local memory faster than remote memory (memory local to another processor). NUMA systems are designed to address the scalability concerns posed by UMA systems: as every processor has its own local memory, aggregate memory bandwidth increases. However, these pros come at the price of more programmer intervention.

    NUMA issues are pervasive in HPC applications. The programmer should carefully consider memory access patterns to get the best performance. Standard programming practices include being aware of the performance difference between row-major and column-major accesses, using the first-touch allocation strategy for threads, pinning threads to cores (thread affinity), etc.

Possible UMA and NUMA configurations. Drawn by the author on Excalidraw

Some of the other shared memory architectures include Cache Only Memory Access (COMA), Cache-Coherent NUMA (ccNUMA), and non-cache-coherent NUMA (nccNUMA). However, these are not very common; almost all modern computing systems are based on a combination of UMA and NUMA.

The most popular shared memory programming models include OpenMP, OpenACC, POSIX Threads, C++ Threads, Intel TBB, etc. Later in this blog series, we’ll discuss POSIX Threads, C++ Threads, and OpenMP in detail.

Distributed Memory Models

Traditionally, computing has been confined to a single node, limiting the scale and efficiency of operations. However, by interconnecting disparate machines, the distributed memory model opens up a realm of possibilities for enhanced performance and scalability.

At its core, the distributed memory model leverages independent hardware units, or nodes, each equipped with its own processors and memory. Several such nodes distribute a given task among themselves and operate independently. This fragmentation allows for different levels of integration, accommodating a diverse range of computing resources and configurations.

Distributed memory architectures have seen much development in past decades. From the Beowulf clusters to the modern-day blade servers, many configurations have existed (and continue to exist) that allow different levels of integration between the software and hardware.

Distributed memory systems have a clear advantage in scalability. For example, let’s assume we have to do some computations on a matrix of size 100,000x100,000. Doing this on a single machine (even when fully utilizing the shared memory) will be time-consuming and resource-intensive. One of the solutions is distributing specific columns and rows to different nodes and letting them handle their fragment of computations.

The background color identifies the node computing the corresponding element. Drawn by the author on Excalidraw

Distributing matrix elements in this fashion allows many computations at once. Every node can focus on its part of the problem, and as soon as they’re done, one of the nodes can combine the final result. This decentralized approach facilitates load balancing and unlocks unprecedented levels of parallelism. It’s also important to highlight that each node can further exploit the shared memory architecture locally and use a shared memory programming model like OpenMP to potentially improve performance.

However, with great power comes great complexity. The distributed nature of this model introduces additional challenges that must be navigated with precision. Explicit communication between nodes becomes imperative, as data exchange and synchronization are essential for coherent execution. Moreover, the distribution of the problem itself demands careful consideration to ensure accurate partitioning and efficient utilization of resources. For instance, matrix elements being calculated in one of the nodes may depend on other nodes’ elements. The developer has to be very careful in handling such cases and communicate the data elements at appropriate time steps.

Message passing is one of the most common programming models for distributed memory systems, and many APIs and libraries implement it. The most popular is MPI, which provides point-to-point and collective operations to exchange and aggregate data across one or multiple nodes at once. MPI deserves an article of its own, and we’ll skip it for now.

Possible configuration of a distributed memory architecture. Drawn by the author on Excalidraw

Distributed Shared Memory Model (DSM or DGAS)

Both shared and distributed memory models offer their own set of pros and cons. To capitalize on the strengths of both, the Distributed Shared Memory model (DSM), also known as Distributed Global Address Space (DGAS), has been introduced.

In the DSM approach, each node exposes a portion of its local memory for access by other nodes, thus embodying a distributed memory paradigm. However, this distributed memory is accessed as a single shared address space, resulting in a seamless DSM environment. Consequently, the programmer only needs to refer to addresses in this shared space.

Logically, the programmer sees a shared memory, but physically, a distributed memory lies underneath. All necessary communication between nodes occurs transparently, simplifying the programming process and making the concept straightforward to use.

Complexities and Challenges

Like any model, DSM has drawbacks and intricacies that demand careful consideration during implementation. These complexities include several critical aspects:

  • Determining the specific nodes where data objects should reside within the distributed system.
  • Establishing a robust methodology for assigning addresses to data objects and efficiently retrieving them upon request.
  • Ensuring synchronization within the shared address space to maintain consistency. Modifications made by any node must be reflected uniformly across all nodes, preventing discrepancies and potential nondeterministic behavior.
  • Implementing a reliable mechanism to uphold fault tolerance, thereby mitigating unexpected issues that may arise across any node within the system.

Partitioned Global Address Space (PGAS)

DSM systems can be developed through various approaches, including direct hardware implementation or an operating system-based approach. Fortunately, a few programming models have been developed that enable DSM implementation without requiring hardware or OS modifications. One such approach is the Partitioned Global Address Space method.

The Partitioned Global Address Space (PGAS) is one of the most used programming models for implementing a Distributed Shared Memory (DSM) environment. In PGAS, all nodes collaboratively contribute to form a shared global address space corresponding to the DSM. Each node assumes responsibility for managing a distinct part of this global address space stored within its local memory. Consequently, the memory of each node is bifurcated into two segments: shared and private memory.

The shared memory segment is universally accessible by all the nodes. Every data object stored in the DSM is assigned a global address, enabling PGAS implementations to determine the appropriate node for accessing the desired data object. Should the global address point to a different node, a request is dispatched to the responsible node for access to this data object. In contrast, the private memory segment can solely be accessed from the local node.

Possible configuration of a PGAS model. Drawn by the author on Excalidraw

Some of the PGAS implementations include UPC++, OpenSHMEM, and Titanium. All of these and many other PGAS implementations use the Single Program Multiple Data (SPMD) model.

SPMD is a parallel programming paradigm where multiple processing units execute a single program simultaneously, each operating on different data. All program instances follow the same control flow but act on different portions of the data.

What next?

This brings us to the end of the second part of this blog series on multithreading. Equipped with insights into various parallel programming models and architectures, understanding the essence of multithreading will be much easier. In the next part, we’ll dig deep into what a thread is and why it is essential for parallel computing.


Suggested Reads:

[1] Distributed shared memory frameworks: Comparison of implementations

[2] Distributed Computing: Principles, Algorithms, and Systems — Kshemkalyani and Singhal

[3] An Introduction to Flynn’s taxonomy | Durganshu Mishra


Featured Photo was generated with DALL·E 3.