Our GPU Was Idle 77% of the Time. Here's How We Fixed It

Written by hacker22580379 | Published 2026/01/16
Tech Story Tags: machine-learning | pytorch | ai-training | cuda | gpu | performance-optimization | pinned-memory | non-blocking-streams

TL;DR: Our PyTorch data pipeline was the bottleneck. The CPU was spending 77% of its time on cudaMemcpyAsync while the GPU finished its matrix multiplications in milliseconds. Pinned memory, non-blocking transfers, and CUDA streams closed the gap.

A practical guide to eliminating data transfer bottlenecks in PyTorch — achieving 1.5x speedup with pinned memory, CUDA streams, and GPUDirect Storage.


We assumed the GPU was our bottleneck. We were wrong.

While training a transformer model, I noticed something strange in the profiler output: the CPU was spending 77% of its time on cudaMemcpyAsync. Our expensive A100 GPU wasn't compute-bound — it was starving for data.

This post covers how we diagnosed the problem, fixed it with three increasingly aggressive optimizations, and hit the next wall. If you're training models on large datasets and haven't profiled your data pipeline, you might be leaving significant performance on the table.


The Setup

We're training nanoTabPFN, a transformer for tabular data. Training data lives in HDF5 files: 30,000 samples, each with 5,000 rows and 5 features. Hardware: NVIDIA A100-SXM4-80GB.

The original data loading code was textbook PyTorch:

with h5py.File(filename, "r") as f:
    for step in range(num_steps):
        ptr, end = step * batch_size, (step + 1) * batch_size  # slice for this step's batch
        x = torch.from_numpy(f["X"][ptr:end])
        y = torch.from_numpy(f["y"][ptr:end])
        # .to(device) blocks here until the copy to GPU memory completes
        yield dict(x=x.to(device), y=y.to(device))

Simple. Correct. And devastatingly slow.


Profile First, Optimize Later

Before touching any code, we ran PyTorch's built-in profiler:

from torch.profiler import profile, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
) as prof:
    train(model, prior)

print(prof.key_averages().table(sort_by="cpu_time_total"))

The results were shocking:

Operation           CPU Time     % of Total
cudaMemcpyAsync     44,084 ms    76.78%
cudaMalloc           7,081 ms    12.33%
cudaLaunchKernel       645 ms     1.12%
aten::bmm              180 ms     0.31%

The GPU was doing matrix multiplications in milliseconds while the CPU spent 44 seconds copying data.


Understanding the Problem

The .to(device) call in PyTorch is synchronous by default. Here's the hidden pipeline:

  1. h5py reads from disk → CPU memory (pageable)
  2. PyTorch allocates → CPU staging buffer
  3. cudaMemcpy → GPU memory (blocks until complete)
  4. GPU computes → while the CPU starts over at step 1 for the next batch

The GPU sits idle during steps 1–3. With 5,000-row samples at float32, each batch transfer is ~120MB. That's 12GB of sequential transfers over 100 steps.
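
To see the stall directly, you can time a plain blocking transfer against the pinned, non-blocking variant. A minimal sketch, assuming a batch shaped roughly like ours (the shapes and the pre-pinned buffer reuse are illustrative, not our exact loader):

import time
import numpy as np
import torch

device = torch.device("cuda")
x_np = np.random.rand(6, 5000, 5).astype(np.float32)  # assumed batch shape

# Pageable, synchronous: the CPU blocks until the copy has landed in VRAM
t0 = time.perf_counter()
x = torch.from_numpy(x_np).to(device)
torch.cuda.synchronize()
print(f"pageable .to(device): {(time.perf_counter() - t0) * 1e3:.2f} ms")

# Pin once (a real pipeline reuses this buffer), then time the async enqueue
x_pinned = torch.from_numpy(x_np).pin_memory()
t0 = time.perf_counter()
x = x_pinned.to(device, non_blocking=True)
print(f"enqueue async copy:   {(time.perf_counter() - t0) * 1e3:.2f} ms")
torch.cuda.synchronize()  # the data is only guaranteed on-device after this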


Fix #1: Pinned Memory + Non-blocking Transfers

The first optimization: use page-locked (pinned) memory with async transfers.

# Before: synchronous, pageable memory
x = torch.from_numpy(x_np).to(device)

# After: pinned memory, async transfer
x = torch.from_numpy(x_np).pin_memory().to(device, non_blocking=True)

Why this works: Pinned memory is DMA-accessible — the GPU can read it directly without CPU intervention. Combined with non_blocking=True, the transfer happens in the background while the CPU continues working.

Impact: cudaMemcpyAsync time dropped from 44s to ~4s.
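
If your data fits the standard Dataset/DataLoader path, you get the same pattern without hand-rolling it: the loader pins each collated batch, and the training loop moves it with non_blocking=True. A minimal sketch with a toy TensorDataset standing in for our HDF5 pipeline (batch size and worker count are placeholders):

import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda")
dataset = TensorDataset(torch.randn(30_000, 5), torch.randn(30_000))
loader = DataLoader(
    dataset,
    batch_size=256,
    num_workers=4,     # workers overlap disk/CPU work with training
    pin_memory=True,   # batches arrive in page-locked memory
)

for x_cpu, y_cpu in loader:
    x = x_cpu.to(device, non_blocking=True)
    y = y_cpu.to(device, non_blocking=True)
    # ... forward/backward on x, y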


Fix #2: CUDA Streams for True Overlap

Non-blocking transfers alone aren't enough. By default, operations on the same CUDA stream are serialized. We need a separate stream for data transfer:

class PriorDumpDataLoader:
    def __init__(self, ...):
        self.transfer_stream = torch.cuda.Stream()

    def __iter__(self):
        # f, prefetch and num_steps come from the constructor (elided above)
        # Pre-fill the buffer with the first batches
        vram_buffer = [self._load_to_vram(f) for _ in range(prefetch)]

        for step in range(num_steps):
            batch = vram_buffer.pop(0)  # already copied to VRAM

            # Prefetch the next batch on the separate transfer stream
            with torch.cuda.stream(self.transfer_stream):
                next_batch = self._load_to_vram(f)
            vram_buffer.append(next_batch)

            # Make the compute stream wait for pending copies before the batch is used
            torch.cuda.current_stream().wait_stream(self.transfer_stream)
            yield batch

This is double buffering: while the GPU processes batch N, the CPU+DMA engine load batch N+1. The GPU never waits.
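
Stripped of the class machinery, the same idea fits in a dozen lines. A minimal sketch of the double-buffering loop, where load_numpy_batch, train_step, model and num_steps are hypothetical placeholders for your own pipeline:

import torch

device = torch.device("cuda")
copy_stream = torch.cuda.Stream()

def to_device_async(batch_np):
    # Pin on the host, then enqueue the H2D copy on the side stream
    pinned = torch.from_numpy(batch_np).pin_memory()
    with torch.cuda.stream(copy_stream):
        return pinned.to(device, non_blocking=True)

next_batch = to_device_async(load_numpy_batch(0))  # prime the pipeline
for step in range(1, num_steps + 1):
    # Ensure the pending copy for next_batch has finished before compute touches it
    torch.cuda.current_stream().wait_stream(copy_stream)
    batch = next_batch
    batch.record_stream(torch.cuda.current_stream())  # safe memory reuse across streams
    if step < num_steps:
        next_batch = to_device_async(load_numpy_batch(step))  # overlaps with the step below
    train_step(model, batch)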


Fix #3: GPU Direct Storage (GDS)

The ultimate optimization: bypass the CPU entirely.

NVIDIA's GPUDirect Storage reads directly from NVMe to GPU memory:

import kvikio
import cupy as cp

# Allocate GPU buffer
x_gpu = cp.empty((batch_size, seq_len, features), dtype=cp.float32)

# Direct read: NVMe → GPU (no CPU copy); pread returns a future, .get() waits for it
with kvikio.CuFile("data.bin", "r") as f:
    f.pread(x_gpu, file_offset=offset).get()

# Zero-copy to PyTorch
x = torch.as_tensor(x_gpu, device="cuda")

The catch: GDS requires raw binary files. HDF5 has headers that need CPU parsing. We added automatic conversion on first run:

def convert_h5_to_raw(h5_filename):
    base = h5_filename.rsplit(".", 1)[0]  # strip the .h5 extension
    with h5py.File(h5_filename, "r") as f:
        X = f["X"][:].astype(np.float32)
        y = f["y"][:].astype(np.float32)
    X.tofile(f"{base}_X.bin")
    y.tofile(f"{base}_y.bin")
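
With the raw layout, the file_offset passed to pread is plain arithmetic over the fixed sample shape. A sketch of the batch indexing, assuming contiguous float32 samples of 5,000 rows by 5 features as written by the conversion above (ptr, batch_size and base are placeholders):

import kvikio
import cupy as cp

ROWS, FEATURES, DTYPE_BYTES = 5000, 5, 4        # float32 samples
SAMPLE_BYTES = ROWS * FEATURES * DTYPE_BYTES    # bytes per sample in the .bin file

# Read samples [ptr, ptr + batch_size) straight into a GPU buffer
x_gpu = cp.empty((batch_size, ROWS, FEATURES), dtype=cp.float32)
with kvikio.CuFile(f"{base}_X.bin", "r") as f:
    f.pread(x_gpu, file_offset=ptr * SAMPLE_BYTES).get()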

Results

Metric                    Baseline     Optimized   Speedup
Total time (100 steps)    68.75 s      45.30 s     1.52x
cudaMemcpyAsync CPU       44,084 ms    268 ms      164x
Steps/sec                 1.5          2.2         1.47x

Memory transfer overhead dropped from 77% to <1% of CPU time.


The New Bottleneck

With data loading solved, the profile looks completely different:

Operation             CPU Time     % of Total
Command Buffer Full   23,450 ms    46.91%
cudaLaunchKernel      10,733 ms    21.47%
cudaMalloc             5,607 ms    11.22%

The GPU is now saturated. "Command Buffer Full" means the GPU can't keep up with kernel submissions. This is exactly what we want — the GPU is the bottleneck, not data loading.

The remaining compute bottleneck is attention (aten::bmm at 45% CUDA time). With 5,000-row sequences, attention's O(n²) scaling dominates. Flash Attention is the next optimization.
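
As a pointer in that direction: recent PyTorch releases expose fused attention through torch.nn.functional.scaled_dot_product_attention, which dispatches to a Flash-Attention-style kernel on supported GPUs such as the A100. A minimal sketch with illustrative shapes (not nanoTabPFN's actual head configuration):

import torch
import torch.nn.functional as F

device = torch.device("cuda")
batch, heads, seq_len, head_dim = 4, 8, 5000, 64  # illustrative shapes

q = torch.randn(batch, heads, seq_len, head_dim, device=device, dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Fused attention: never materializes the full 5,000 x 5,000 score matrix
out = F.scaled_dot_product_attention(q, k, v)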


Key Takeaways

Async is not automatic. non_blocking=True is silently synchronous on pageable memory, and even with pinned memory you need stream management to overlap transfers with GPU compute.

Pinned memory matters. 10x+ difference for large transfers.

GDS has constraints. True zero-copy requires raw binary files, GDS-compatible NVMe, and proper alignment.

Know when to stop. Once you're GPU-bound, data loading optimizations won't help. Move to model architecture changes.


Quick Reference

Technique            What it does                 When to use
pin_memory()         Page-locked CPU memory       Always for GPU training
non_blocking=True    Async H2D transfer           With CUDA streams
CUDA Streams         Parallel transfer/compute    Large batch sizes
Double buffering     Prefetch next batch          I/O-bound workloads
GDS (kvikio)         Disk → GPU direct            Large sequential reads


Code

All code is available at github.com/stprnvsh/nanoTabPFN:

# Baseline
python train.py --profile --steps=100 --batch-size=6

# Optimized with GDS
python train_optimized.py --gds-bin --batch-size=4 --steps=200

# With Flash Attention
python train_optimized.py --flash --gds-bin --batch-size=8 --steps=200

