Our GPU Was Idle 77% of the Time. Here's How We Fixed It

Written by hacker22580379 | Published 2026/01/16
Tech Story Tags: machine-learning | pytorch | ai-training | cuda | gpu | performance-optimization | pinned-memory | non-blocking-streams

TL;DR: Our PyTorch data pipeline was the bottleneck. The CPU was spending 77% of its time on cudaMemcpyAsync while the GPU finished its matrix multiplications in milliseconds. Pinned memory, non-blocking transfers, and CUDA streams closed the gap.

A practical guide to eliminating data transfer bottlenecks in PyTorch — achieving 1.5x speedup with pinned memory, CUDA streams, and GPUDirect Storage.


We assumed the GPU was our bottleneck. We were wrong.

While training a transformer model, I noticed something strange in the profiler output: the CPU was spending 77% of its time on cudaMemcpyAsync. Our expensive A100 GPU wasn't compute-bound — it was starving for data.

This post covers how we diagnosed the problem, fixed it with three increasingly aggressive optimizations, and hit the next wall. If you're training models on large datasets and haven't profiled your data pipeline, you might be leaving significant performance on the table.


The Setup

We're training nanoTabPFN, a transformer for tabular data. Training data lives in HDF5 files: 30,000 samples, each with 5,000 rows and 5 features. Hardware: NVIDIA A100-SXM4-80GB.

The original data loading code was textbook PyTorch:

with h5py.File(filename, "r") as f:
    for step in range(num_steps):
        ptr, end = step * batch_size, (step + 1) * batch_size  # slice for this step's batch
        x = torch.from_numpy(f["X"][ptr:end])
        y = torch.from_numpy(f["y"][ptr:end])
        # .to(device) blocks here until the copy to GPU memory completes
        yield dict(x=x.to(device), y=y.to(device))

Simple. Correct. And devastatingly slow.


Profile First, Optimize Later

Before touching any code, we ran PyTorch's built-in profiler:

from torch.profiler import profile, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
) as prof:
    train(model, prior)

print(prof.key_averages().table(sort_by="cpu_time_total"))

The results were shocking:

Operation           CPU Time     % of Total
cudaMemcpyAsync     44,084 ms    76.78%
cudaMalloc           7,081 ms    12.33%
cudaLaunchKernel       645 ms     1.12%
aten::bmm              180 ms     0.31%

The GPU was doing matrix multiplications in milliseconds while the CPU spent 44 seconds copying data.


Understanding the Problem

The .to(device) call in PyTorch is synchronous by default. Here's the hidden pipeline:

  1. h5py reads from disk → CPU memory (pageable)
  2. PyTorch allocates → CPU staging buffer
  3. cudaMemcpy → GPU memory (blocks until complete)
  4. GPU computes → while the CPU starts over at step 1 for the next batch

The GPU sits idle during steps 1–3. With 5,000-row samples at float32, each batch transfer is ~120MB. That's 12GB of sequential transfers over 100 steps.
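
To see the stall directly, you can time a plain blocking transfer against the pinned, non-blocking variant. A minimal sketch, assuming a batch shaped roughly like ours (the shapes and the pre-pinned buffer reuse are illustrative, not our exact loader):

import time
import numpy as np
import torch

device = torch.device("cuda")
x_np = np.random.rand(6, 5000, 5).astype(np.float32)  # assumed batch shape

# Pageable, synchronous: the CPU blocks until the copy has landed in VRAM
t0 = time.perf_counter()
x = torch.from_numpy(x_np).to(device)
torch.cuda.synchronize()
print(f"pageable .to(device): {(time.perf_counter() - t0) * 1e3:.2f} ms")

# Pin once (a real pipeline reuses this buffer), then time the async enqueue
x_pinned = torch.from_numpy(x_np).pin_memory()
t0 = time.perf_counter()
x = x_pinned.to(device, non_blocking=True)
print(f"enqueue async copy:   {(time.perf_counter() - t0) * 1e3:.2f} ms")
torch.cuda.synchronize()  # the data is only guaranteed on-device after this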


Fix #1: Pinned Memory + Non-blocking Transfers

The first optimization: use page-locked (pinned) memory with async transfers.

# Before: synchronous, pageable memory
x = torch.from_numpy(x_np).to(device)

# After: pinned memory, async transfer
x = torch.from_numpy(x_np).pin_memory().to(device, non_blocking=True)

Why this works: Pinned memory is DMA-accessible — the GPU can read it directly without CPU intervention. Combined with non_blocking=True, the transfer happens in the background while the CPU continues working.

Impact: cudaMemcpyAsync time dropped from 44s to ~4s.
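
If your data fits the standard Dataset/DataLoader path, you get the same pattern without hand-rolling it: the loader pins each collated batch, and the training loop moves it with non_blocking=True. A minimal sketch with a toy TensorDataset standing in for our HDF5 pipeline (batch size and worker count are placeholders):

import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda")
dataset = TensorDataset(torch.randn(30_000, 5), torch.randn(30_000))
loader = DataLoader(
    dataset,
    batch_size=256,
    num_workers=4,     # workers overlap disk/CPU work with training
    pin_memory=True,   # batches arrive in page-locked memory
)

for x_cpu, y_cpu in loader:
    x = x_cpu.to(device, non_blocking=True)
    y = y_cpu.to(device, non_blocking=True)
    # ... forward/backward on x, y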


Fix #2: CUDA Streams for True Overlap

Non-blocking transfers alone aren't enough. By default, operations on the same CUDA stream are serialized. We need a separate stream for data transfer:

class PriorDumpDataLoader:
    def __init__(self, ...):
        self.transfer_stream = torch.cuda.Stream()

    def __iter__(self):
        # f, prefetch and num_steps come from the constructor (elided above)
        # Pre-fill the buffer with the first batches
        vram_buffer = [self._load_to_vram(f) for _ in range(prefetch)]

        for step in range(num_steps):
            batch = vram_buffer.pop(0)  # already copied to VRAM

            # Prefetch the next batch on the separate transfer stream
            with torch.cuda.stream(self.transfer_stream):
                next_batch = self._load_to_vram(f)
            vram_buffer.append(next_batch)

            # Make the compute stream wait for pending copies before the batch is used
            torch.cuda.current_stream().wait_stream(self.transfer_stream)
            yield batch

This is double buffering: while the GPU processes batch N, the CPU+DMA engine load batch N+1. The GPU never waits.
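
Stripped of the class machinery, the same idea fits in a dozen lines. A minimal sketch of the double-buffering loop, where load_numpy_batch, train_step, model and num_steps are hypothetical placeholders for your own pipeline:

import torch

device = torch.device("cuda")
copy_stream = torch.cuda.Stream()

def to_device_async(batch_np):
    # Pin on the host, then enqueue the H2D copy on the side stream
    pinned = torch.from_numpy(batch_np).pin_memory()
    with torch.cuda.stream(copy_stream):
        return pinned.to(device, non_blocking=True)

next_batch = to_device_async(load_numpy_batch(0))  # prime the pipeline
for step in range(1, num_steps + 1):
    # Ensure the pending copy for next_batch has finished before compute touches it
    torch.cuda.current_stream().wait_stream(copy_stream)
    batch = next_batch
    batch.record_stream(torch.cuda.current_stream())  # safe memory reuse across streams
    if step < num_steps:
        next_batch = to_device_async(load_numpy_batch(step))  # overlaps with the step below
    train_step(model, batch)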


Fix #3: GPU Direct Storage (GDS)

The ultimate optimization: bypass the CPU entirely.

NVIDIA's GPUDirect Storage reads directly from NVMe to GPU memory:

import kvikio
import cupy as cp

# Allocate GPU buffer
x_gpu = cp.empty((batch_size, seq_len, features), dtype=cp.float32)

# Direct read: NVMe → GPU (no CPU copy); pread returns a future, .get() waits for it
with kvikio.CuFile("data.bin", "r") as f:
    f.pread(x_gpu, file_offset=offset).get()

# Zero-copy to PyTorch
x = torch.as_tensor(x_gpu, device="cuda")

The catch: GDS requires raw binary files. HDF5 has headers that need CPU parsing. We added automatic conversion on first run:

def convert_h5_to_raw(h5_filename):
    base = h5_filename.rsplit(".", 1)[0]  # strip the .h5 extension
    with h5py.File(h5_filename, "r") as f:
        X = f["X"][:].astype(np.float32)
        y = f["y"][:].astype(np.float32)
    X.tofile(f"{base}_X.bin")
    y.tofile(f"{base}_y.bin")
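
With the raw layout, the file_offset passed to pread is plain arithmetic over the fixed sample shape. A sketch of the batch indexing, assuming contiguous float32 samples of 5,000 rows by 5 features as written by the conversion above (ptr, batch_size and base are placeholders):

import kvikio
import cupy as cp

ROWS, FEATURES, DTYPE_BYTES = 5000, 5, 4        # float32 samples
SAMPLE_BYTES = ROWS * FEATURES * DTYPE_BYTES    # bytes per sample in the .bin file

# Read samples [ptr, ptr + batch_size) straight into a GPU buffer
x_gpu = cp.empty((batch_size, ROWS, FEATURES), dtype=cp.float32)
with kvikio.CuFile(f"{base}_X.bin", "r") as f:
    f.pread(x_gpu, file_offset=ptr * SAMPLE_BYTES).get()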

Results

Metric                    Baseline     Optimized   Speedup
Total time (100 steps)    68.75 s      45.30 s     1.52x
cudaMemcpyAsync CPU       44,084 ms    268 ms      164x
Steps/sec                 1.5          2.2         1.47x

Memory transfer overhead dropped from 77% to <1% of CPU time.


The New Bottleneck

With data loading solved, the profile looks completely different:

Operation             CPU Time     % of Total
Command Buffer Full   23,450 ms    46.91%
cudaLaunchKernel      10,733 ms    21.47%
cudaMalloc             5,607 ms    11.22%

The GPU is now saturated. "Command Buffer Full" means the GPU can't keep up with kernel submissions. This is exactly what we want — the GPU is the bottleneck, not data loading.

The remaining compute bottleneck is attention (aten::bmm at 45% CUDA time). With 5,000-row sequences, attention's O(n²) scaling dominates. Flash Attention is the next optimization.
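
As a pointer in that direction: recent PyTorch releases expose fused attention through torch.nn.functional.scaled_dot_product_attention, which dispatches to a Flash-Attention-style kernel on supported GPUs such as the A100. A minimal sketch with illustrative shapes (not nanoTabPFN's actual head configuration):

import torch
import torch.nn.functional as F

device = torch.device("cuda")
batch, heads, seq_len, head_dim = 4, 8, 5000, 64  # illustrative shapes

q = torch.randn(batch, heads, seq_len, head_dim, device=device, dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Fused attention: never materializes the full 5,000 x 5,000 score matrix
out = F.scaled_dot_product_attention(q, k, v)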


Key Takeaways

Async is not automatic. non_blocking=True is silently synchronous on pageable memory, and even with pinned memory you need stream management to overlap transfers with GPU compute.

Pinned memory matters. 10x+ difference for large transfers.

GDS has constraints. True zero-copy requires raw binary files, GDS-compatible NVMe, and proper alignment.

Know when to stop. Once you're GPU-bound, data loading optimizations won't help. Move to model architecture changes.


Quick Reference

Technique            What it does                 When to use
pin_memory()         Page-locked CPU memory       Always for GPU training
non_blocking=True    Async H2D transfer           With CUDA streams
CUDA Streams         Parallel transfer/compute    Large batch sizes
Double buffering     Prefetch next batch          I/O-bound workloads
GDS (kvikio)         Disk → GPU direct            Large sequential reads


Code

All code is available at github.com/stprnvsh/nanoTabPFN:

# Baseline
python train.py --profile --steps=100 --batch-size=6

# Optimized with GDS
python train_optimized.py --gds-bin --batch-size=4 --steps=200

# With Flash Attention
python train_optimized.py --flash --gds-bin --batch-size=8 --steps=200

