Goroutines are one of Go’s most powerful features; they make it easy to write concurrent programs by abstracting away thread management and scheduling. However, as with any powerful tool, misuse can make a program slower instead of faster. In this article, we’ll explore when goroutines actually become a liability by examining two fundamental categories of workloads: CPU-bound and I/O-bound.
Concurrency vs. Parallelism
Before diving into code, let’s clarify two concepts:
- Concurrency: Dealing with multiple tasks at once (e.g. via goroutines). Tasks may or may not run in parallel (https://go.dev/blog/waza-talk).
- Parallelism: Executing multiple tasks at the same time on multiple CPU cores.
A program can be concurrent without being parallel (e.g., goroutines running on a single core). Whether concurrency translates into parallelism depends on the type of workload and the availability of CPUs.
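To see this distinction in code, here is a small sketch (the `interleave` helper is my own, not part of the article's benchmarks) that restricts Go to one logical processor: the two goroutines take turns (concurrency) but can never run simultaneously (no parallelism).

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// interleave runs two goroutines on a single logical processor and records
// the order in which they execute their steps.
func interleave(steps int) []string {
	runtime.GOMAXPROCS(1) // concurrency is still possible, parallelism is not

	var (
		mu     sync.Mutex
		events []string
		wg     sync.WaitGroup
	)
	for _, name := range []string{"A", "B"} {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			for i := 0; i < steps; i++ {
				mu.Lock()
				events = append(events, fmt.Sprintf("%s%d", name, i))
				mu.Unlock()
				runtime.Gosched() // yield so the other goroutine can run
			}
		}(name)
	}
	wg.Wait()
	return events
}

func main() {
	fmt.Println(interleave(3)) // e.g. [A0 B0 A1 B1 A2 B2]: interleaved, never simultaneous
}
```

The exact interleaving is up to the scheduler, but both goroutines always complete even though only one can execute at any instant.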
Workload Types: CPU-Bound vs I/O-Bound
- CPU-Bound Work: The bottleneck is computation (e.g., calculating factorials, prime numbers, cryptographic hashing). The CPU is constantly busy, leaving little room for overlap.
- I/O-Bound Work: The bottleneck is waiting for input/output (e.g., network calls, disk reads). While one goroutine waits, another can execute, making concurrency very effective.
CPU-Bound Example
Here’s a simple Factorial function and two versions of using it for calculating the factorial of multiple numbers: sequentially and with goroutines.
package factorial

import (
	"math/big"
	"sync"
)

func Factorial(n int64) *big.Int {
	result := big.NewInt(1)
	for i := int64(2); i <= n; i++ {
		result.Mul(result, big.NewInt(i))
	}
	return result
}

func Factorials(nSlice ...int64) {
	for _, n := range nSlice {
		Factorial(n) // we don't care about the result
	}
}

func FactorialsWithGoroutines(nSlice ...int64) {
	wg := sync.WaitGroup{}
	for _, n := range nSlice {
		wg.Add(1)
		go func(n int64) {
			Factorial(n) // we don't care about the result
			wg.Done()
		}(n)
	}
	wg.Wait()
}
Benchmarks
This is the System Report for the machine I’ll be running these benchmarks on:
Hardware Overview:
Model Name: MacBook Pro
Model Identifier: MacBookPro18,3
Chip: Apple M1 Pro
Total Number of Cores: 8 (6 performance and 2 efficiency)
Memory: 16 GB
We’ll benchmark Factorials and FactorialsWithGoroutines with different GOMAXPROCS values to understand how they perform in different scenarios.
GOMAXPROCS controls the number of operating system threads allocated to goroutines in your program. The default value of GOMAXPROCS is the number of CPUs visible to the program at startup.
To keep the benchmark results consistent, we’ll turn off GOGC for all benchmarks in this article.

package factorial

import (
	"testing"
)

var numbers = reverseInts(1500, 500) // [1500, 1499, ..., 501, 500]

func reverseInts(start, end int64) []int64 {
	arr := []int64{}
	for i := start; i >= end; i-- {
		arr = append(arr, i)
	}
	return arr
}

func BenchmarkFactorials(b *testing.B) {
	for i := 0; i < b.N; i++ {
		Factorials(numbers...)
	}
}

func BenchmarkFactorialsWithGoroutines(b *testing.B) {
	for i := 0; i < b.N; i++ {
		FactorialsWithGoroutines(numbers...)
	}
}
GOMAXPROCS=1 (single core):
Note: GOMAXPROCS=1 doesn’t mean the machine has only one core; it just restricts Go to a single logical processor.
# first run
GOGC=off GOMAXPROCS=1 go test -run none -bench . -benchtime=1000x
goos: darwin
goarch: arm64
cpu: Apple M1 Pro
BenchmarkFactorials 1000 38572234 ns/op # sequential version ~54% less time
BenchmarkFactorialsWithGoroutines 1000 84424882 ns/op
# second run
GOGC=off GOMAXPROCS=1 go test -run none -bench . -benchtime=1000x
goos: darwin
goarch: arm64
cpu: Apple M1 Pro
BenchmarkFactorials 1000 39457195 ns/op # sequential version ~54% less time
BenchmarkFactorialsWithGoroutines 1000 85951160 ns/op
The concurrent version (FactorialsWithGoroutines) is actually slower than the sequential version because goroutines add scheduling overhead without providing parallelism.
GOMAXPROCS=2:
# first run
GOGC=off GOMAXPROCS=2 go test -run none -bench . -benchtime=1000x
goos: darwin
goarch: arm64
cpu: Apple M1 Pro
BenchmarkFactorials-2 1000 39183753 ns/op # sequential version ~19% less time
BenchmarkFactorialsWithGoroutines-2 1000 48595899 ns/op
# second run
GOGC=off GOMAXPROCS=2 go test -run none -bench . -benchtime=1000x
goos: darwin
goarch: arm64
cpu: Apple M1 Pro
BenchmarkFactorials-2 1000 38585135 ns/op # sequential version ~13% less time
BenchmarkFactorialsWithGoroutines-2 1000 44108161 ns/op
With 2 cores, FactorialsWithGoroutines improves slightly but still lags behind the sequential version due to scheduling overhead.
GOMAXPROCS=4:
# first run
GOGC=off GOMAXPROCS=4 go test -run none -bench . -benchtime=1000x
goos: darwin
goarch: arm64
cpu: Apple M1 Pro
BenchmarkFactorials-4 1000 39235967 ns/op
BenchmarkFactorialsWithGoroutines-4 1000 27673082 ns/op # goroutine version ~29% less time
# second run
GOGC=off GOMAXPROCS=4 go test -run none -bench . -benchtime=1000x
goos: darwin
goarch: arm64
cpu: Apple M1 Pro
BenchmarkFactorials-4 1000 39935345 ns/op
BenchmarkFactorialsWithGoroutines-4 1000 28286147 ns/op # goroutine version ~29% less time
Finally, with 4 cores, FactorialsWithGoroutines start to shine; work is distributed across multiple cores, and concurrency translates into real parallelism.
Even with 4 cores, FactorialsWithGoroutines isn't 4x faster than the sequential version, since scheduling overhead and coordination costs limit scalability.
Takeaway for CPU-Bound workloads:
- Goroutines don’t help on single-core execution; they only add overhead.
- Performance improves as you increase cores, but only when the workload can be parallelized.
- More goroutines than available cores will not help; they may even hurt performance.
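If you do want goroutines for CPU-bound work, a common pattern is a small worker pool sized to GOMAXPROCS, so concurrency never exceeds the available parallelism. The sketch below assumes that approach; FactorialsWithWorkerPool is my own name and was not benchmarked above.

```go
package main

import (
	"fmt"
	"math/big"
	"runtime"
	"sync"
)

// Factorial is the same function used in the benchmarks.
func Factorial(n int64) *big.Int {
	result := big.NewInt(1)
	for i := int64(2); i <= n; i++ {
		result.Mul(result, big.NewInt(i))
	}
	return result
}

// FactorialsWithWorkerPool caps concurrency at GOMAXPROCS: enough goroutines
// to keep every core busy, but no more than can actually run in parallel.
func FactorialsWithWorkerPool(nSlice ...int64) {
	workers := runtime.GOMAXPROCS(0)
	jobs := make(chan int64)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for n := range jobs {
				Factorial(n) // we don't care about the result
			}
		}()
	}
	for _, n := range nSlice {
		jobs <- n
	}
	close(jobs) // lets the workers' range loops terminate
	wg.Wait()
}

func main() {
	FactorialsWithWorkerPool(1500, 1000, 500, 100)
	fmt.Println(Factorial(20)) // 2432902008176640000
}
```

The closed channel is what shuts the pool down: each worker's `range jobs` loop exits once the channel is drained.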
I/O-Bound Example
Now let’s look at I/O-bound work: making HTTP requests to a local server.
package main

import (
	"io"
	"log"
	"net/http"
	"sync"
)

func HTTPHandler(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
	w.Write([]byte("Hello, World!"))
}

func MakeHTTPRequest(httpClient *http.Client, url string) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		log.Fatal(err)
	}
	resp, err := httpClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	_, err = io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
}

func MakeHTTPRequests(httpClient *http.Client, url string, n int) {
	for i := 0; i < n; i++ {
		MakeHTTPRequest(httpClient, url)
	}
}

func MakeHTTPRequestsWithGoroutines(httpClient *http.Client, url string, n int) {
	wg := sync.WaitGroup{}
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			MakeHTTPRequest(httpClient, url)
			wg.Done()
		}()
	}
	wg.Wait()
}

func main() {
	httpServer := http.Server{
		Addr:    ":8081",
		Handler: http.HandlerFunc(HTTPHandler),
	}
	log.Fatal(httpServer.ListenAndServe())
}
To benchmark, start the local server in one terminal (go run main.go) and run the tests in another.
Benchmarks
package main

import (
	"net/http"
	"testing"
	"time"
)

var (
	requestsCount = 100
	url           = "http://localhost:8081"
	// HTTP client with optimized settings for concurrent requests
	httpClient = &http.Client{
		Transport: &http.Transport{
			MaxIdleConns:        0,                // maximum idle connections, 0 means no limit
			MaxIdleConnsPerHost: requestsCount,    // maximum idle connections per host
			IdleConnTimeout:     90 * time.Second, // idle connection timeout
		},
	}
)

// BenchmarkMakeHTTPRequests benchmarks the sequential HTTP requests function
func BenchmarkMakeHTTPRequests(b *testing.B) {
	for i := 0; i < b.N; i++ {
		MakeHTTPRequests(httpClient, url, requestsCount)
	}
}

// BenchmarkMakeHTTPRequestsWithGoroutines benchmarks the concurrent HTTP requests function
func BenchmarkMakeHTTPRequestsWithGoroutines(b *testing.B) {
	for i := 0; i < b.N; i++ {
		MakeHTTPRequestsWithGoroutines(httpClient, url, requestsCount)
	}
}
GOMAXPROCS=1 (single core):
# first run
GOGC=off GOMAXPROCS=1 go test -run none -bench . -benchtime=1000x
goos: darwin
goarch: arm64
cpu: Apple M1 Pro
BenchmarkMakeHTTPRequests 1000 4959306 ns/op
BenchmarkMakeHTTPRequestsWithGoroutines 1000 1889895 ns/op # goroutine version ~62% less time
# second run
GOGC=off GOMAXPROCS=1 go test -run none -bench . -benchtime=1000x
goos: darwin
goarch: arm64
cpu: Apple M1 Pro
BenchmarkMakeHTTPRequests 1000 5075507 ns/op
BenchmarkMakeHTTPRequestsWithGoroutines 1000 1875215 ns/op # goroutine version ~63% less time
Even on a single core, goroutines improve throughput significantly because while one goroutine waits for I/O, another can make progress.
GOMAXPROCS=2:
# first run
GOGC=off GOMAXPROCS=2 go test -run none -bench . -benchtime=1000x
goos: darwin
goarch: arm64
cpu: Apple M1 Pro
BenchmarkMakeHTTPRequests-2 1000 6070332 ns/op
BenchmarkMakeHTTPRequestsWithGoroutines-2 1000 1485643 ns/op # goroutine version ~76% less time
# second run
GOGC=off GOMAXPROCS=2 go test -run none -bench . -benchtime=1000x
goos: darwin
goarch: arm64
cpu: Apple M1 Pro
BenchmarkMakeHTTPRequests-2 1000 6001653 ns/op
BenchmarkMakeHTTPRequestsWithGoroutines-2 1000 1480072 ns/op # goroutine version ~75% less time
MakeHTTPRequestsWithGoroutines remains much faster, and scaling cores helps only marginally.
GOMAXPROCS=4:
# first run
GOGC=off GOMAXPROCS=4 go test -run none -bench . -benchtime=1000x
goos: darwin
goarch: arm64
cpu: Apple M1 Pro
BenchmarkMakeHTTPRequests-4 1000 6073657 ns/op
BenchmarkMakeHTTPRequestsWithGoroutines-4 1000 1467110 ns/op # goroutine version ~76% less time
# second run
GOGC=off GOMAXPROCS=4 go test -run none -bench . -benchtime=1000x
goos: darwin
goarch: arm64
cpu: Apple M1 Pro
BenchmarkMakeHTTPRequests-4 1000 6807084 ns/op
BenchmarkMakeHTTPRequestsWithGoroutines-4 1000 1482042 ns/op # goroutine version ~78% less time
With 4 cores, the benefit of goroutines is still strong, but adding more cores doesn’t drastically change performance; the bottleneck is I/O, not CPU.
Takeaway for I/O-Bound workloads:
- Goroutines excel at I/O-bound tasks, even on a single core.
- Scaling cores beyond a certain point doesn’t add much benefit because I/O remains the bottleneck, not CPU.
Concurrency without Parallelism
The benchmarks in this article illustrate an important reality:
- For CPU-bound workloads on a single core, goroutines add concurrency without parallelism, which makes the execution slower.
- For CPU-bound workloads on multiple cores, goroutines enable parallelism, which makes the execution faster.
- For I/O-bound workloads, goroutines provide significant benefits even on a single core; concurrency hides latency.
Conclusion
Goroutines are not a silver bullet. Their impact depends heavily on the workload and CPU availability:
- Use goroutines for I/O-bound workloads (network calls, file I/O, database queries). They improve responsiveness and throughput by hiding latency.
- Be cautious with CPU-bound workloads. Goroutines won’t help unless you have multiple cores and the workload can be parallelized.
- Remember: Concurrency is not Parallelism. Without enough cores or the right workload type, goroutines may add overhead instead of improving performance.
