Process vs Threads, Hardware Threads vs Software Threads, Hyperthreading

In today’s world of high-performance and responsive software, concurrency is no longer a luxury — it’s a necessity. Whether you’re building a real-time trading engine, a game engine, or a simple web server, understanding how to execute tasks in parallel can dramatically improve performance and user experience. This is where threading comes into play.

In our previous blogs, we explored Flynn’s Taxonomy and various Parallel Programming Models — the essential building blocks of multithreading and advanced parallelism. Now that we’ve laid the groundwork, it’s time to move beyond theory and dive into the core question: what exactly is a thread, and what does it do? If you haven’t read those earlier stories yet, I highly recommend checking them out first for better context.

Threading allows programs to do more than one thing at a time: download files while updating the UI, compute results while responding to user input, or process incoming network requests in parallel. Behind the scenes, modern CPUs support multiple hardware threads, and programming languages like C++ provide powerful tools to take advantage of them.

Whether you’re new to multithreading or looking to reinforce your foundations, this story will give you a solid start. Let’s dive into the world of threads and unlock the power of parallel execution.

Highlights: Threads vs processes | Hardware threads vs software threads | Hyperthreading | Fork/join threading model

What Is a Thread?
One of the first hurdles for beginners learning about concurrency is understanding the difference between a thread and a process.

A process is an independent program in execution. It has its own memory space, file descriptors, and system resources, isolated from other processes. For instance, when you open a browser or a terminal, you’re launching a new process. The operating system manages processes and does not share memory between them unless explicitly configured to do so.

A thread, on the other hand, is the smallest unit of execution within a process. Multiple threads can exist within the same process, running concurrently and sharing the same memory space. This shared environment allows for faster communication between threads, but also opens the door to race conditions and synchronization issues if not managed properly.

Conceptually, you can think of a thread as a lightweight process — an independent stream of execution with its own program counter, registers, and stack, but sharing heap and global memory with other threads in the same process.

When discussing threads, it’s important to distinguish between hardware threads and software threads. Although both refer to “units of execution,” they operate at very different levels in the computing stack.

What is a Hardware Thread?

A hardware thread is an execution stream directly supported by the processor. It is effectively a dedicated control unit within a core that can fetch, decode, and execute a stream of instructions independently.

Traditionally, one processor equaled one hardware thread — in other words, one control unit per physical CPU.
On systems with multiple sockets (e.g., server motherboards), there would be one hardware thread per socket. But this model evolved rapidly with the introduction of multi-core and multi-threaded architectures.

In modern CPUs:
- Each core contains at least one hardware thread.
- With Simultaneous Multithreading (SMT) — Intel’s version is called Hyper-Threading — a single core can support multiple hardware threads.
- One processor (socket) may therefore have multiple cores, and each core may have multiple hardware threads.

To add to the confusion, many operating systems report hardware threads or logical cores as “processors.” So, when you check your CPU information using system tools, the number of “processors” shown might refer to logical threads, not physical cores or sockets.

How do I check the number of hardware threads I have?

1. Windows

There are several ways to check the number of hardware threads on Windows, but don’t worry, this isn’t one of those “10 ways to do it” blog posts. For quick reference, the easiest method is to open Task Manager using Ctrl + Shift + Esc. Head to the Performance tab and select CPU. You’ll see a summary that includes the number of cores and logical processors (i.e., hardware threads).
It looks something like this:

Other options, if you’d like to explore on your own:

Alternative 1: PowerShell:

(Get-WmiObject -Class Win32_Processor).NumberOfLogicalProcessors

Alternative 2: Command Prompt (requires wmic):

wmic cpu get NumberOfLogicalProcessors,NumberOfCores

2. Linux

If you’re using any flavor of Linux, the configuration of your system and the number of hardware threads can be checked by reading /proc/cpuinfo. The output gives one entry for each hardware thread. One entry of this file looks like this:

~$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 183
model name      : 13th Gen Intel(R) Core(TM) i5-13450HX
stepping        : 1
microcode       : 0xffffffff
cpu MHz         : 2611.201
cache size      : 20480 KB
physical id     : 0
siblings        : 16
core id         : 0
cpu cores       : 8
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 28
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves avx_vnni vnmi umip waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm md_clear serialize flush_l1d arch_capabilities
vmx flags       : vnmi invvpid ept_x_only ept_ad ept_1gb tsc_offset vtpr ept vpid unrestricted_guest ept_mode_based_exec tsc_scaling usr_wait_pause
bugs            : spectre_v1 spectre_v2 spec_store_bypass swapgs retbleed eibrs_pbrsb rfds bhi
bogomips        : 5222.40
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:
...

A few things to note from this:
- cpu cores: this might be misleading for hybrid architectures, as this number may not reflect the actual core count (e.g., my 13th Gen Intel i5 has 10 cores: 6 P-cores + 4 E-cores, but it shows 8 cores).
- siblings: refers to the total number of logical threads.
- The presence of the ht flag (short for Hyper-Threading) confirms that SMT is supported.
There are a couple of alternatives to this command, which also give clearer output:

Alternative 1: lscpu

I get the following output for lscpu:

~$ lscpu | grep -E 'Core|Socket|Thread'
Model name:          13th Gen Intel(R) Core(TM) i5-13450HX
Thread(s) per core:  2
Core(s) per socket:  8
Socket(s):           1

Alternative 2: hwloc (lstopo)

Another useful tool for inspecting your system’s CPU topology is the popular Linux utility hwloc, which provides both command-line and graphical representations of your hardware layout. It’s especially handy for visualizing cores, hardware threads, cache levels, and NUMA nodes. If hwloc is already installed, you can generate a visual map of your system’s architecture using the lstopo command.

What is Hyper-Threading?

Hyper-Threading (HT) is Intel’s implementation of Simultaneous Multithreading (SMT), allowing each physical core to run two instruction streams simultaneously. When one thread stalls (e.g., waiting on memory), the other can make use of the execution units. This leads to:
- Better CPU utilization
- Improved performance in I/O-bound or multitasking workloads

⚠ But: it doesn’t double performance — typical gains are around 15–30%.

Caution for parallel programming: while HT benefits everyday multitasking (e.g., running multiple programs on a laptop), it can negatively affect performance in HPC or parallel workloads. Running multiple heavy threads on the same core can lead to resource contention and reduce speedup. This is why many HPC centers disable HT by default — careful thread scheduling is critical on SMT systems.

Example — i5-13450HX:
- 6 P-cores with HT → 12 threads
- 4 E-cores without HT → 4 threads
➡️ Total = 16 logical threads
Understanding Software Threads: The Foundation

Unlike hardware threads, which exist at the processor level, software threads are programming abstractions that represent independent streams of execution within a process. A software thread is essentially a lightweight execution unit that exists within a process, sharing the same address space while maintaining its own execution context.

When you create a software thread in your code, the operating system and runtime environment work together to map it onto available hardware threads for actual execution. This mapping is dynamic and depends on the thread scheduler, which determines when and where each thread runs.

The distinction between software and hardware threads is crucial. Hardware threads represent the physical execution units available on your processor, while software threads are the abstraction that programmers work with. A single hardware thread can execute multiple software threads over time through context switching, and modern systems often support thousands of software threads running concurrently.
The Evolution From Processes to Threads

To understand why threading was introduced, we must first examine the traditional process model. Operating systems historically managed processes as the primary unit of execution, where each process had:
- One execution stream running sequentially
- Complete isolation with separate address spaces, file descriptors, and user IDs
- Communication only through Inter-Process Communication (IPC) mechanisms
- Heavy resource overhead due to complete process duplication

Threading was introduced to address the limitations of this process-centric model by enabling finer-grained concurrency. Threads provide several key advantages:

Simplified Data Sharing

Unlike processes, threads within the same process share the same address space, heap, and global variables. This eliminates the need for complex IPC mechanisms and allows for more efficient communication between concurrent execution units.

Resource Efficiency

Creating a thread requires significantly fewer resources than creating a process. Thread creation typically requires only 64KB for the thread’s private data area and two system calls, while process creation involves duplicating the entire parent process address space.

Enhanced Responsiveness

Threads enable asynchronous behavior patterns that are essential for modern applications. Consider a web browser: one thread handles the user interface, another manages network requests, while others handle rendering and background tasks. This separation ensures that the interface remains responsive even when heavy operations are running.

Operating System Level Scheduling

Threads still benefit from OS-level scheduling features, including preemption (the ability to interrupt a thread) and fair progress guarantees among threads. This provides a balance between user control and system management.

Thread Architecture and Memory Model

Each thread maintains its own private execution context while sharing certain resources with other threads in the same process.
Private Thread Resources

Each thread has its own:
- Thread Control Block (TCB) containing thread ID, program counter, register set, and scheduling information
- Stack memory for local variables and function call management
- Program counter tracking the current instruction being executed

Shared Resources

All threads in a process share:
- Code section containing the program instructions
- Data section with global and static variables
- Heap memory for dynamically allocated data
- File descriptors and other system resources

This shared memory model is both a strength and a challenge. While it enables efficient communication, it also introduces the need for careful synchronization to prevent data races and ensure thread safety.

The Fork/Join Model: Structured Parallelism

The fork/join model represents the most common pattern for structured parallel programming.
This model provides a clean abstraction for dividing work among multiple threads and collecting results. The execution flow of a fork/join program looks like this:
1. Sequential Start: the main thread begins executing sequentially
2. Fork Phase: when parallel work is needed, the main thread creates (forks) new threads, each starting at a specified function
3. Parallel Execution: both main and spawned threads execute concurrently, potentially on different hardware threads
4. Join Phase: the main thread waits for all spawned threads to complete before continuing
5. Sequential Continuation: execution resumes sequentially with results from the parallel work

What’s Next?

We’ve now reached the end of the third installment in this multithreading series. So far, we’ve covered the fundamental concepts of threads and processes, giving you a solid foundation to build on. In the next part, we’ll shift gears from theory to practice and explore the world of threading in action. Get ready to dive into POSIX threads (pthreads) and C++ std::thread, where we’ll write real code, analyze outputs, and understand how these libraries bring concurrency to life.
Suggested Reads

[1] Multithreaded Computer Architecture: A Summary of the State of the Art
[2] Distributed Computing: Principles, Algorithms, and Systems — Kshemkalyani and Singhal (uic.edu)