Goroutines are one of Go’s most powerful features; they make it easy to write concurrent programs by abstracting away thread management and scheduling. However, as with any powerful tool, misuse can add overhead and actually degrade performance. In this article, we’ll explore when goroutines become a liability by examining two fundamental categories of workloads: **CPU-bound** and **I/O-bound**.

## Concurrency vs. Parallelism

Before diving into code, let’s clarify two concepts:

- **Concurrency:** Dealing with multiple tasks at once (e.g. via goroutines). Tasks may or may not run in parallel (https://go.dev/blog/waza-talk).
- **Parallelism:** Executing multiple tasks *at the same time* on multiple CPU cores.

A program can be concurrent without being parallel (e.g., goroutines running on a single core). Whether concurrency translates into parallelism depends on the type of workload and the availability of CPUs.

## Workload Types: CPU-Bound vs I/O-Bound

- **CPU-Bound Work:** The bottleneck is computation (e.g., calculating factorials, prime numbers, cryptographic hashing). The CPU is constantly busy, leaving little room for overlap.
- **I/O-Bound Work:** The bottleneck is waiting for input/output (e.g., network calls, disk reads). While one goroutine waits, another can execute, making concurrency very effective.
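Since everything below hinges on how many logical processors Go may use, here is a minimal sketch of how a program can inspect and restrict that number from inside the code. It is only an illustration; the benchmarks in this article set the limit via the `GOMAXPROCS` environment variable instead.

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// Number of logical CPUs visible to the program.
	fmt.Println("NumCPU:", runtime.NumCPU())

	// GOMAXPROCS(0) reads the current setting without changing it.
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))

	// Restrict Go to a single logical processor: goroutines remain
	// concurrent (they interleave), but they can no longer run in parallel.
	runtime.GOMAXPROCS(1)
}
```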
## CPU-Bound Example

Here’s a simple `Factorial` function and two versions of using it to calculate the factorial of multiple numbers: sequentially and with goroutines.

```go
package factorial

import (
	"math/big"
	"sync"
)

func Factorial(n int64) *big.Int {
	result := big.NewInt(1)
	for i := int64(2); i <= n; i++ {
		result.Mul(result, big.NewInt(i))
	}
	return result
}

func Factorials(nSlice ...int64) {
	for _, n := range nSlice {
		Factorial(n) // we don't care about the result
	}
}

func FactorialsWithGoroutines(nSlice ...int64) {
	wg := sync.WaitGroup{}
	for _, n := range nSlice {
		wg.Add(1)
		go func(n int64) {
			defer wg.Done()
			Factorial(n) // we don't care about the result
		}(n)
	}
	wg.Wait()
}
```

### Benchmarks

This is the system report for the machine I’ll be running these benchmarks on:

```
Hardware Overview:
  Model Name: MacBook Pro
  Model Identifier: MacBookPro18,3
  Chip: Apple M1 Pro
  Total Number of Cores: 8 (6 performance and 2 efficiency)
  Memory: 16 GB
```

We benchmark `Factorials` and `FactorialsWithGoroutines` with different `GOMAXPROCS` values to see how they perform in each scenario. `GOMAXPROCS` limits the number of operating system threads that can execute Go code simultaneously; it defaults to the number of CPUs visible to the program at startup.

To keep the results consistent, garbage collection is disabled (`GOGC=off`) for all benchmarks in this article.

```go
package factorial

import (
	"testing"
)

var numbers = reverseInts(1500, 500) // [1500, 1499, ..., 501, 500]

func reverseInts(start, end int64) []int64 {
	arr := []int64{}
	for i := start; i >= end; i-- {
		arr = append(arr, i)
	}
	return arr
}

func BenchmarkFactorials(b *testing.B) {
	for i := 0; i < b.N; i++ {
		Factorials(numbers...)
	}
}

func BenchmarkFactorialsWithGoroutines(b *testing.B) {
	for i := 0; i < b.N; i++ {
		FactorialsWithGoroutines(numbers...)
	}
}
```

**GOMAXPROCS=1 (single core):**

Note: `GOMAXPROCS=1` doesn’t mean the machine has only one core; it just restricts Go to a single logical processor.

```
# first run
GOGC=off GOMAXPROCS=1 go test -run none -bench . -benchtime=1000x
goos: darwin
goarch: arm64
cpu: Apple M1 Pro
BenchmarkFactorials                  1000    38572234 ns/op  # sequential version ~54% less time
BenchmarkFactorialsWithGoroutines    1000    84424882 ns/op

# second run
GOGC=off GOMAXPROCS=1 go test -run none -bench . -benchtime=1000x
goos: darwin
goarch: arm64
cpu: Apple M1 Pro
BenchmarkFactorials                  1000    39457195 ns/op  # sequential version ~54% less time
BenchmarkFactorialsWithGoroutines    1000    85951160 ns/op
```

The concurrent version (`FactorialsWithGoroutines`) is actually slower than the sequential one because goroutines add scheduling overhead without providing parallelism.

**GOMAXPROCS=2:**

```
# first run
GOGC=off GOMAXPROCS=2 go test -run none -bench . -benchtime=1000x
goos: darwin
goarch: arm64
cpu: Apple M1 Pro
BenchmarkFactorials-2                  1000    39183753 ns/op  # sequential version ~19% less time
BenchmarkFactorialsWithGoroutines-2    1000    48595899 ns/op

# second run
GOGC=off GOMAXPROCS=2 go test -run none -bench . -benchtime=1000x
goos: darwin
goarch: arm64
cpu: Apple M1 Pro
BenchmarkFactorials-2                  1000    38585135 ns/op  # sequential version ~13% less time
BenchmarkFactorialsWithGoroutines-2    1000    44108161 ns/op
```

With 2 cores, `FactorialsWithGoroutines` improves but still lags behind the sequential version due to scheduling overhead.

**GOMAXPROCS=4:**

```
# first run
GOGC=off GOMAXPROCS=4 go test -run none -bench . -benchtime=1000x
goos: darwin
goarch: arm64
cpu: Apple M1 Pro
BenchmarkFactorials-4                  1000    39235967 ns/op
BenchmarkFactorialsWithGoroutines-4    1000    27673082 ns/op  # goroutine version ~29% less time

# second run
GOGC=off GOMAXPROCS=4 go test -run none -bench . -benchtime=1000x
goos: darwin
goarch: arm64
cpu: Apple M1 Pro
BenchmarkFactorials-4                  1000    39935345 ns/op
BenchmarkFactorialsWithGoroutines-4    1000    28286147 ns/op  # goroutine version ~29% less time
```

Finally, with 4 cores, `FactorialsWithGoroutines` starts to shine: work is distributed across multiple cores, and concurrency translates into real parallelism.

Even with 4 cores, though, `FactorialsWithGoroutines` isn’t 4x faster than the sequential version, since scheduling overhead and coordination costs limit scalability.

### Takeaway for CPU-Bound workloads

- Goroutines don’t help on single-core execution; they only add overhead.
- Performance improves as you increase cores, but only when the workload can be parallelized.
- More goroutines than available cores will not help; it may even hurt performance (see the bounded-worker sketch after this list).
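The last point suggests a common refinement: instead of spawning one goroutine per input, cap the number of workers at the number of logical processors and feed them from a channel. The benchmarks above don’t use this variant; the following is only a sketch of the pattern, and `FactorialsBounded` is a name I’ve made up for it (it needs the `runtime` import in addition to `sync`).

```go
// FactorialsBounded is a hypothetical bounded variant of
// FactorialsWithGoroutines: one worker goroutine per logical processor,
// all consuming from a shared jobs channel.
func FactorialsBounded(nSlice ...int64) {
	numWorkers := runtime.GOMAXPROCS(0) // read the current setting
	jobs := make(chan int64)

	wg := sync.WaitGroup{}
	for w := 0; w < numWorkers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for n := range jobs {
				Factorial(n) // we don't care about the result
			}
		}()
	}

	for _, n := range nSlice {
		jobs <- n
	}
	close(jobs) // lets the workers' range loops terminate
	wg.Wait()
}
```

For this particular workload the per-goroutine overhead is small relative to the computation, so the gain may be modest, but the pattern keeps scheduling overhead flat as the input grows.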
## I/O-Bound Example

Now let’s look at I/O-bound work: making HTTP requests to a local server.

```go
package main

import (
	"io"
	"log"
	"net/http"
	"sync"
)

func HTTPHandler(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
	w.Write([]byte("Hello, World!"))
}

func MakeHTTPRequest(httpClient *http.Client, url string) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		log.Fatal(err)
	}

	resp, err := httpClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	_, err = io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
}

func MakeHTTPRequests(httpClient *http.Client, url string, n int) {
	for i := 0; i < n; i++ {
		MakeHTTPRequest(httpClient, url)
	}
}

func MakeHTTPRequestsWithGoroutines(httpClient *http.Client, url string, n int) {
	wg := sync.WaitGroup{}
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			MakeHTTPRequest(httpClient, url)
		}()
	}
	wg.Wait()
}

func main() {
	httpServer := http.Server{
		Addr:    ":8081",
		Handler: http.HandlerFunc(HTTPHandler),
	}
	httpServer.ListenAndServe()
}
```

To benchmark, start the local server in one terminal (`go run main.go`) and run the tests in another.
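As an aside, the standard library’s `net/http/httptest` package can start the server inside the benchmark process itself, avoiding the two-terminal setup. I use a separate process here so the server doesn’t compete with the client for the same `GOMAXPROCS` budget, but for quick experiments a sketch like this works (the benchmark name and the request count of 100 are illustrative; the file would import `net/http`, `net/http/httptest`, and `testing`):

```go
// Hypothetical in-process variant: httptest starts HTTPHandler on a random
// local port and exposes its URL. Note that the server now shares the
// benchmark's GOMAXPROCS, which skews the comparison below.
func BenchmarkMakeHTTPRequestsInProcess(b *testing.B) {
	server := httptest.NewServer(http.HandlerFunc(HTTPHandler))
	defer server.Close()

	for i := 0; i < b.N; i++ {
		MakeHTTPRequests(http.DefaultClient, server.URL, 100)
	}
}
```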
### Benchmarks

```go
package main

import (
	"net/http"
	"testing"
	"time"
)

var (
	requestsCount = 100
	url           = "http://localhost:8081"

	// HTTP client tuned for concurrent requests: allowing requestsCount
	// idle connections per host lets the concurrent version reuse
	// connections instead of opening a new one for every request.
	httpClient = &http.Client{
		Transport: &http.Transport{
			MaxIdleConns:        0,                // maximum idle connections, 0 means no limit
			MaxIdleConnsPerHost: requestsCount,    // maximum idle connections per host
			IdleConnTimeout:     90 * time.Second, // idle connection timeout
		},
	}
)

// BenchmarkMakeHTTPRequests benchmarks the sequential HTTP requests function.
func BenchmarkMakeHTTPRequests(b *testing.B) {
	for i := 0; i < b.N; i++ {
		MakeHTTPRequests(httpClient, url, requestsCount)
	}
}

// BenchmarkMakeHTTPRequestsWithGoroutines benchmarks the concurrent HTTP requests function.
func BenchmarkMakeHTTPRequestsWithGoroutines(b *testing.B) {
	for i := 0; i < b.N; i++ {
		MakeHTTPRequestsWithGoroutines(httpClient, url, requestsCount)
	}
}
```

**GOMAXPROCS=1 (single core):**

```
# first run
GOGC=off GOMAXPROCS=1 go test -run none -bench . -benchtime=1000x
goos: darwin
goarch: arm64
cpu: Apple M1 Pro
BenchmarkMakeHTTPRequests                  1000    4959306 ns/op
BenchmarkMakeHTTPRequestsWithGoroutines    1000    1889895 ns/op  # goroutine version ~62% less time

# second run
GOGC=off GOMAXPROCS=1 go test -run none -bench . -benchtime=1000x
goos: darwin
goarch: arm64
cpu: Apple M1 Pro
BenchmarkMakeHTTPRequests                  1000    5075507 ns/op
BenchmarkMakeHTTPRequestsWithGoroutines    1000    1875215 ns/op  # goroutine version ~63% less time
```

Even on a single core, goroutines improve throughput significantly: while one goroutine waits for I/O, another can make progress.

**GOMAXPROCS=2:**

```
# first run
GOGC=off GOMAXPROCS=2 go test -run none -bench . -benchtime=1000x
goos: darwin
goarch: arm64
cpu: Apple M1 Pro
BenchmarkMakeHTTPRequests-2                  1000    6070332 ns/op
BenchmarkMakeHTTPRequestsWithGoroutines-2    1000    1485643 ns/op  # goroutine version ~76% less time

# second run
GOGC=off GOMAXPROCS=2 go test -run none -bench . -benchtime=1000x
goos: darwin
goarch: arm64
cpu: Apple M1 Pro
BenchmarkMakeHTTPRequests-2                  1000    6001653 ns/op
BenchmarkMakeHTTPRequestsWithGoroutines-2    1000    1480072 ns/op  # goroutine version ~75% less time
```

`MakeHTTPRequestsWithGoroutines` remains much faster, and adding cores helps only marginally.

**GOMAXPROCS=4:**

```
# first run
GOGC=off GOMAXPROCS=4 go test -run none -bench . -benchtime=1000x
goos: darwin
goarch: arm64
cpu: Apple M1 Pro
BenchmarkMakeHTTPRequests-4                  1000    6073657 ns/op
BenchmarkMakeHTTPRequestsWithGoroutines-4    1000    1467110 ns/op  # goroutine version ~76% less time

# second run
GOGC=off GOMAXPROCS=4 go test -run none -bench . -benchtime=1000x
goos: darwin
goarch: arm64
cpu: Apple M1 Pro
BenchmarkMakeHTTPRequests-4                  1000    6807084 ns/op
BenchmarkMakeHTTPRequestsWithGoroutines-4    1000    1482042 ns/op  # goroutine version ~78% less time
```

With 4 cores, the benefit of goroutines is still strong, but adding more cores doesn’t drastically change performance; the bottleneck is I/O, not CPU.

### Takeaway for I/O-Bound workloads

- Goroutines excel at I/O-bound tasks, even on a single core.
- Scaling cores beyond a certain point doesn’t add much benefit, because I/O remains the bottleneck, not CPU.
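One practical caveat: `MakeHTTPRequestsWithGoroutines` launches all `n` requests at once, which is fine for 100 requests against a local server but can overwhelm a remote service or exhaust file descriptors at larger `n`. A common remedy is a buffered channel used as a counting semaphore; the following is only a sketch of that pattern (not part of the benchmarks), and `MakeHTTPRequestsBounded` is an illustrative name.

```go
// Hypothetical bounded variant of MakeHTTPRequestsWithGoroutines: a
// buffered channel caps the number of requests in flight at maxInFlight.
func MakeHTTPRequestsBounded(httpClient *http.Client, url string, n, maxInFlight int) {
	semaphore := make(chan struct{}, maxInFlight)

	wg := sync.WaitGroup{}
	for i := 0; i < n; i++ {
		semaphore <- struct{}{} // blocks once maxInFlight slots are taken
		wg.Add(1)
		go func() {
			defer wg.Done()
			defer func() { <-semaphore }() // release the slot when done
			MakeHTTPRequest(httpClient, url)
		}()
	}
	wg.Wait()
}
```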
## Concurrency without Parallelism

The benchmarks in this article illustrate an important reality:

- For **CPU-bound workloads on a single core**, goroutines add concurrency without parallelism, which makes execution slower.
- For **CPU-bound workloads on multiple cores**, goroutines enable parallelism, which makes execution faster.
- For **I/O-bound workloads**, goroutines provide significant benefits even on a **single core**; concurrency hides latency.

## Conclusion

Goroutines are not a silver bullet. Their impact depends heavily on the workload and CPU availability:

- **Use goroutines for I/O-bound workloads** (network calls, file I/O, database queries). They improve responsiveness and throughput by hiding latency.
- **Be cautious with CPU-bound workloads.** Goroutines won’t help unless you have multiple cores and the workload can be parallelized.
- Remember: **Concurrency** is not **Parallelism**. Without enough cores or the right workload type, goroutines may add overhead instead of improving performance.