Goroutines are one of Go’s most powerful features; they make it easy to write concurrent programs by abstracting away thread management and scheduling. However, as with any powerful tool, misuse can make a program slower instead of faster. In this article, we’ll explore when goroutines actually become a liability by examining two fundamental categories of workloads: CPU-bound and I/O-bound.
Concurrency vs. Parallelism
Before diving into code, let’s clarify two concepts:
- Concurrency: Dealing with multiple tasks at once (e.g. via goroutines). Tasks may or may not run in parallel (https://go.dev/blog/waza-talk).
- Parallelism: Executing multiple tasks at the same time on multiple CPU cores.
A program can be concurrent without being parallel (e.g., goroutines running on a single core). Whether concurrency translates into parallelism depends on the type of workload and the availability of CPUs.
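To see this distinction in code, here is a small sketch (the `interleave` helper is my own, not part of the article's benchmarks) that restricts Go to one logical processor: the two goroutines take turns (concurrency) but can never run simultaneously (no parallelism).

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// interleave runs two goroutines on a single logical processor and records
// the order in which they execute their steps.
func interleave(steps int) []string {
	runtime.GOMAXPROCS(1) // concurrency is still possible, parallelism is not

	var (
		mu     sync.Mutex
		events []string
		wg     sync.WaitGroup
	)
	for _, name := range []string{"A", "B"} {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			for i := 0; i < steps; i++ {
				mu.Lock()
				events = append(events, fmt.Sprintf("%s%d", name, i))
				mu.Unlock()
				runtime.Gosched() // yield so the other goroutine can run
			}
		}(name)
	}
	wg.Wait()
	return events
}

func main() {
	fmt.Println(interleave(3)) // e.g. [A0 B0 A1 B1 A2 B2]: interleaved, never simultaneous
}
```

The exact interleaving is up to the scheduler, but both goroutines always complete even though only one can execute at any instant.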
Workload Types: CPU-Bound vs I/O-Bound
- CPU-Bound Work: The bottleneck is computation (e.g., calculating factorials, prime numbers, cryptographic hashing). The CPU is constantly busy, leaving little room for overlap.
- I/O-Bound Work: The bottleneck is waiting for input/output (e.g., network calls, disk reads). While one goroutine waits, another can execute, making concurrency very effective.
CPU-Bound Example
Here’s a simple Factorial function and two versions of using it for calculating the factorial of multiple numbers: sequentially and with goroutines.
package factorial

import (
	"math/big"
	"sync"
)

func Factorial(n int64) *big.Int {
	result := big.NewInt(1)
	for i := int64(2); i <= n; i++ {
		result.Mul(result, big.NewInt(i))
	}
	return result
}

func Factorials(nSlice ...int64) {
	for _, n := range nSlice {
		Factorial(n) // we don't care about the result
	}
}

func FactorialsWithGoroutines(nSlice ...int64) {
	wg := sync.WaitGroup{}
	for _, n := range nSlice {
		wg.Add(1)
		go func(n int64) {
			Factorial(n) // we don't care about the result
			wg.Done()
		}(n)
	}
	wg.Wait()
}
Benchmarks
This is the System Report for the machine I’ll be running these benchmarks on:
Hardware Overview:
Model Name: MacBook Pro
Model Identifier: MacBookPro18,3
Chip: Apple M1 Pro
Total Number of Cores: 8 (6 performance and 2 efficiency)
Memory: 16 GB
We’ll benchmark Factorials and FactorialsWithGoroutines with different GOMAXPROCS values to understand how they perform in different scenarios.
GOMAXPROCS controls the number of operating system threads allocated to goroutines in your program. The default value of GOMAXPROCS is the number of CPUs visible to the program at startup.
To keep the benchmark results consistent, we’ll turn off GOGC for all benchmarks in this article.

package factorial

import (
	"testing"
)

var numbers = reverseInts(1500, 500) // [1500, 1499, ..., 501, 500]

func reverseInts(start, end int64) []int64 {
	arr := []int64{}
	for i := start; i >= end; i-- {
		arr = append(arr, i)
	}
	return arr
}

func BenchmarkFactorials(b *testing.B) {
	for i := 0; i < b.N; i++ {
		Factorials(numbers...)
	}
}

func BenchmarkFactorialsWithGoroutines(b *testing.B) {
	for i := 0; i < b.N; i++ {
		FactorialsWithGoroutines(numbers...)
	}
}
GOMAXPROCS=1 (single core):
Note: GOMAXPROCS=1 doesn’t mean the machine has only one core; it just restricts Go to a single logical processor.
# first run
GOGC=off GOMAXPROCS=1 go test -run none -bench . -benchtime=1000x
goos: darwin
goarch: arm64
cpu: Apple M1 Pro
BenchmarkFactorials 1000 38572234 ns/op # sequential version ~54% less time
BenchmarkFactorialsWithGoroutines 1000 84424882 ns/op
# second run
GOGC=off GOMAXPROCS=1 go test -run none -bench . -benchtime=1000x
goos: darwin
goarch: arm64
cpu: Apple M1 Pro
BenchmarkFactorials 1000 39457195 ns/op # sequential version ~54% less time
BenchmarkFactorialsWithGoroutines 1000 85951160 ns/op
The concurrent version (FactorialsWithGoroutines) is actually slower than the sequential version because goroutines add scheduling overhead without providing parallelism.
GOMAXPROCS=2:
# first run
GOGC=off GOMAXPROCS=2 go test -run none -bench . -benchtime=1000x
goos: darwin
goarch: arm64
cpu: Apple M1 Pro
BenchmarkFactorials-2 1000 39183753 ns/op # sequential version ~19% less time
BenchmarkFactorialsWithGoroutines-2 1000 48595899 ns/op
# second run
GOGC=off GOMAXPROCS=2 go test -run none -bench . -benchtime=1000x
goos: darwin
goarch: arm64
cpu: Apple M1 Pro
BenchmarkFactorials-2 1000 38585135 ns/op # sequential version ~13% less time
BenchmarkFactorialsWithGoroutines-2 1000 44108161 ns/op
With 2 cores, FactorialsWithGoroutines improves slightly but still lags behind the sequential version due to scheduling overhead.
GOMAXPROCS=4:
# first run
GOGC=off GOMAXPROCS=4 go test -run none -bench . -benchtime=1000x
goos: darwin
goarch: arm64
cpu: Apple M1 Pro
BenchmarkFactorials-4 1000 39235967 ns/op
BenchmarkFactorialsWithGoroutines-4 1000 27673082 ns/op # goroutine version ~29% less time
# second run
GOGC=off GOMAXPROCS=4 go test -run none -bench . -benchtime=1000x
goos: darwin
goarch: arm64
cpu: Apple M1 Pro
BenchmarkFactorials-4 1000 39935345 ns/op
BenchmarkFactorialsWithGoroutines-4 1000 28286147 ns/op # goroutine version ~29% less time
Finally, with 4 cores, FactorialsWithGoroutines start to shine; work is distributed across multiple cores, and concurrency translates into real parallelism.
Even with 4 cores, FactorialsWithGoroutines isn't 4x faster than the sequential version, since scheduling overhead and coordination costs limit scalability.
Takeaway for CPU-Bound workloads:
- Goroutines don’t help on single-core execution; they only add overhead.
- Performance improves as you increase cores, but only when the workload can be parallelized.
- More goroutines than available cores will not help; they may even hurt performance.
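If you do want goroutines for CPU-bound work, a common pattern is a small worker pool sized to GOMAXPROCS, so concurrency never exceeds the available parallelism. The sketch below assumes that approach; FactorialsWithWorkerPool is my own name and was not benchmarked above.

```go
package main

import (
	"fmt"
	"math/big"
	"runtime"
	"sync"
)

// Factorial is the same function used in the benchmarks.
func Factorial(n int64) *big.Int {
	result := big.NewInt(1)
	for i := int64(2); i <= n; i++ {
		result.Mul(result, big.NewInt(i))
	}
	return result
}

// FactorialsWithWorkerPool caps concurrency at GOMAXPROCS: enough goroutines
// to keep every core busy, but no more than can actually run in parallel.
func FactorialsWithWorkerPool(nSlice ...int64) {
	workers := runtime.GOMAXPROCS(0)
	jobs := make(chan int64)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for n := range jobs {
				Factorial(n) // we don't care about the result
			}
		}()
	}
	for _, n := range nSlice {
		jobs <- n
	}
	close(jobs) // lets the workers' range loops terminate
	wg.Wait()
}

func main() {
	FactorialsWithWorkerPool(1500, 1000, 500, 100)
	fmt.Println(Factorial(20)) // 2432902008176640000
}
```

The closed channel is what shuts the pool down: each worker's `range jobs` loop exits once the channel is drained.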
I/O-Bound Example
Now let’s look at I/O-bound work: making HTTP requests to a local server.
package main

import (
	"io"
	"log"
	"net/http"
	"sync"
)

func HTTPHandler(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
	w.Write([]byte("Hello, World!"))
}

func MakeHTTPRequest(httpClient *http.Client, url string) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		log.Fatal(err)
	}
	resp, err := httpClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	_, err = io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
}

func MakeHTTPRequests(httpClient *http.Client, url string, n int) {
	for i := 0; i < n; i++ {
		MakeHTTPRequest(httpClient, url)
	}
}

func MakeHTTPRequestsWithGoroutines(httpClient *http.Client, url string, n int) {
	wg := sync.WaitGroup{}
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			MakeHTTPRequest(httpClient, url)
			wg.Done()
		}()
	}
	wg.Wait()
}

func main() {
	httpServer := http.Server{
		Addr:    ":8081",
		Handler: http.HandlerFunc(HTTPHandler),
	}
	log.Fatal(httpServer.ListenAndServe())
}
To benchmark, start the local server in one terminal (go run main.go) and run the tests in another.
Benchmarks
package main

import (
	"net/http"
	"testing"
	"time"
)

var (
	requestsCount = 100
	url           = "http://localhost:8081"
	// HTTP client with optimized settings for concurrent requests
	httpClient = &http.Client{
		Transport: &http.Transport{
			MaxIdleConns:        0,                // maximum idle connections, 0 means no limit
			MaxIdleConnsPerHost: requestsCount,    // maximum idle connections per host
			IdleConnTimeout:     90 * time.Second, // idle connection timeout
		},
	}
)

// BenchmarkMakeHTTPRequests benchmarks the sequential HTTP requests function
func BenchmarkMakeHTTPRequests(b *testing.B) {
	for i := 0; i < b.N; i++ {
		MakeHTTPRequests(httpClient, url, requestsCount)
	}
}

// BenchmarkMakeHTTPRequestsWithGoroutines benchmarks the concurrent HTTP requests function
func BenchmarkMakeHTTPRequestsWithGoroutines(b *testing.B) {
	for i := 0; i < b.N; i++ {
		MakeHTTPRequestsWithGoroutines(httpClient, url, requestsCount)
	}
}
GOMAXPROCS=1 (single core):
# first run
GOGC=off GOMAXPROCS=1 go test -run none -bench . -benchtime=1000x
goos: darwin
goarch: arm64
cpu: Apple M1 Pro
BenchmarkMakeHTTPRequests 1000 4959306 ns/op
BenchmarkMakeHTTPRequestsWithGoroutines 1000 1889895 ns/op # goroutine version ~62% less time
# second run
GOGC=off GOMAXPROCS=1 go test -run none -bench . -benchtime=1000x
goos: darwin
goarch: arm64
cpu: Apple M1 Pro
BenchmarkMakeHTTPRequests 1000 5075507 ns/op
BenchmarkMakeHTTPRequestsWithGoroutines 1000 1875215 ns/op # goroutine version ~63% less time
Even on a single core, goroutines improve throughput significantly because while one goroutine waits for I/O, another can make progress.
GOMAXPROCS=2:
# first run
GOGC=off GOMAXPROCS=2 go test -run none -bench . -benchtime=1000x
goos: darwin
goarch: arm64
cpu: Apple M1 Pro
BenchmarkMakeHTTPRequests-2 1000 6070332 ns/op
BenchmarkMakeHTTPRequestsWithGoroutines-2 1000 1485643 ns/op # goroutine version ~76% less time
# second run
GOGC=off GOMAXPROCS=2 go test -run none -bench . -benchtime=1000x
goos: darwin
goarch: arm64
cpu: Apple M1 Pro
BenchmarkMakeHTTPRequests-2 1000 6001653 ns/op
BenchmarkMakeHTTPRequestsWithGoroutines-2 1000 1480072 ns/op # goroutine version ~75% less time
MakeHTTPRequestsWithGoroutines remains much faster, and scaling cores helps only marginally.
GOMAXPROCS=4:
# first run
GOGC=off GOMAXPROCS=4 go test -run none -bench . -benchtime=1000x
goos: darwin
goarch: arm64
cpu: Apple M1 Pro
BenchmarkMakeHTTPRequests-4 1000 6073657 ns/op
BenchmarkMakeHTTPRequestsWithGoroutines-4 1000 1467110 ns/op # goroutine version ~76% less time
# second run
GOGC=off GOMAXPROCS=4 go test -run none -bench . -benchtime=1000x
goos: darwin
goarch: arm64
cpu: Apple M1 Pro
BenchmarkMakeHTTPRequests-4 1000 6807084 ns/op
BenchmarkMakeHTTPRequestsWithGoroutines-4 1000 1482042 ns/op # goroutine version ~78% less time
With 4 cores, the benefit of goroutines is still strong, but adding more cores doesn’t drastically change performance; the bottleneck is I/O, not CPU.
Takeaway for I/O-Bound workloads:
- Goroutines excel at I/O-bound tasks, even on a single core.
- Scaling cores beyond a certain point doesn’t add much benefit because I/O remains the bottleneck, not CPU.
Concurrency without Parallelism
The benchmarks in this article illustrate an important reality:
- For CPU-bound workloads on a single core, goroutines add concurrency without parallelism, which makes the execution slower.
- For CPU-bound workloads on multiple cores, goroutines enable parallelism, which makes the execution faster.
- For I/O-bound workloads, goroutines provide significant benefits even on a single core; concurrency hides latency.
Conclusion
Goroutines are not a silver bullet. Their impact depends heavily on the workload and CPU availability:
- Use goroutines for I/O-bound workloads (network calls, file I/O, database queries). They improve responsiveness and throughput by hiding latency.
- Be cautious with CPU-bound workloads. Goroutines won’t help unless you have multiple cores and the workload can be parallelized.
- Remember: Concurrency is not Parallelism. Without enough cores or the right workload type, goroutines may add overhead instead of improving performance.
