For ML engineers and AI infrastructure leads training deep learning models at Indian AI companies.
The Hidden Cost of a Waiting GPU
A GPU training a 7B parameter language model on an A100 processes roughly 2,000 tokens per second. At typical GPU rental rates in India of Rs.200–400 per A100-hour, every second of GPU idle time while the dataloader catches up costs money without producing any training progress.
GPU utilisation is typically monitored as the percentage of time the GPU is active (nvidia-smi shows this as GPU-Util). A training job showing 60% GPU utilisation is not just an abstract efficiency number: it means you are getting useful work from only 60% of the GPU-hours you pay for, and wasting the other 40% of your GPU budget on wait time.
The most common cause of low GPU utilisation early in a training run is data loading latency. The GPU processes a batch, then waits for the next batch to arrive from the dataloader. If the dataloader cannot produce batches at least as fast as the GPU consumes them, the GPU idles.
This guide covers how to diagnose this bottleneck and the five techniques that resolve it.
Diagnosing the Bottleneck
Before optimising, confirm that data loading is actually the bottleneck rather than a compute or memory constraint.
GPU-Util below 90% — run watch -n 1 nvidia-smi during training. If GPU-Util is consistently below 90%, the GPU has idle cycles.
Dataloader timing — add timing instrumentation around the data loading step in your training loop:
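A minimal sketch of that instrumentation for a standard PyTorch loop (training_step here is a placeholder for your forward, backward, and optimiser step):

```python
import time
import torch

data_start = time.perf_counter()
for batch in dataloader:
    load_time = time.perf_counter() - data_start      # time spent waiting on the dataloader

    train_start = time.perf_counter()
    loss = training_step(batch)                       # placeholder: forward + backward + optimiser step
    torch.cuda.synchronize()                          # flush queued GPU work so the timing is honest
    train_time = time.perf_counter() - train_start

    print(f"load {load_time:.3f}s | train {train_time:.3f}s | "
          f"load share {load_time / (load_time + train_time):.0%}")

    data_start = time.perf_counter()                  # restart the clock for the next fetch
```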
If load_time is consistently greater than 10–20% of train_time, data loading is the bottleneck.
Network I/O during training — monitor network throughput with nethogs or iftop (iotop measures disk I/O, not network traffic). If network throughput spikes and drops during training rather than staying steady, the training loop is waiting on object storage fetches instead of processing continuously.
Fix 1 — Shard Your Dataset
Loading a dataset of 1 million individual image files from object storage means 1 million separate GET requests per epoch. Each GET request has network overhead — the cost of initiating the request regardless of file size. For small files (thumbnails, audio clips, text samples), this overhead dominates total data loading time.
WebDataset format packages training examples into tar archives (shards), each containing hundreds or thousands of examples. Loading 1000 shards of 1000 examples each costs 1000 GET requests per epoch rather than 1 million — a 1000x reduction in request overhead.
Create WebDataset shards using the webdataset Python library:
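A sketch using webdataset.ShardWriter; the file paths, key names, and 1,000-example shard size are illustrative, and `samples` stands in for your own (path, label) index:

```python
import webdataset as wds

# Write shards of 1,000 examples each: shards/shard-000000.tar, shard-000001.tar, ...
with wds.ShardWriter("shards/shard-%06d.tar", maxcount=1000) as sink:
    for i, (image_path, label) in enumerate(samples):    # `samples`: your own dataset index
        with open(image_path, "rb") as f:
            sink.write({
                "__key__": f"sample{i:08d}",              # unique key shared by all fields of one example
                "jpg": f.read(),                          # raw image bytes
                "cls": str(label).encode("utf-8"),        # label stored as text
            })
```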
Upload the shards to IBEE, then read them directly from object storage during training:
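A sketch of the read side, assuming the shards are reachable over the S3 API via the AWS CLI; the endpoint URL and bucket name below are placeholders to replace with your own:

```python
import webdataset as wds
from torch.utils.data import DataLoader

# Brace notation enumerates the shards; the pipe: URL streams each shard through
# the AWS CLI pointed at the S3-compatible endpoint (placeholder endpoint and bucket).
shards = ("pipe:aws s3 cp --endpoint-url https://objectstorage.example.in "
          "s3://my-training-bucket/shards/shard-{000000..000999}.tar -")

dataset = (
    wds.WebDataset(shards, shardshuffle=True)   # shuffle shard order each epoch
    .shuffle(1000)                              # shuffle within a 1,000-sample buffer
    .decode("torchrgb")                         # decode jpg bytes into float CHW tensors
    .to_tuple("jpg", "cls")
)

loader = DataLoader(dataset, batch_size=64)
```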
Fix 2 — Parallel Prefetching
The PyTorch DataLoader's num_workers parameter controls how many parallel worker processes load data. With num_workers=0 (the default), data loading happens on the main process, blocking training. With num_workers=8, eight parallel processes prefetch data while the GPU processes the previous batch.
For WebDataset pipelines loaded from S3-compatible storage, each worker process opens its own connection to the storage endpoint. The parallel fetch requests saturate available network bandwidth more efficiently than a single sequential fetch.
Set num_workers to match the number of CPU cores available on the training instance. Monitor CPU utilisation during training — if CPUs are underutilised, increase num_workers. If CPUs are at 100%, reduce it.
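As a sketch, the DataLoader settings described above might look like this (the worker count of 8 is an example starting point, not a universal recommendation):

```python
from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,                  # the WebDataset pipeline from Fix 1
    batch_size=64,
    num_workers=8,            # start with roughly one worker per available CPU core
    prefetch_factor=4,        # batches each worker keeps ready ahead of the GPU
    pin_memory=True,          # page-locked host memory speeds up host-to-GPU copies
    persistent_workers=True,  # keep workers (and their storage connections) alive across epochs
)
```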
Fix 3 — First-Epoch Local Cache
For training runs that will iterate over the dataset multiple times (multiple epochs), download all training shards to local NVMe on the first epoch and serve subsequent epochs from local disk.
Local NVMe throughput (1–3 GB/s) eliminates the network dependency entirely after the initial download. The GPU never waits for network I/O after the first epoch.
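One way to get this behaviour is the shard cache built into recent versions of the webdataset library: pointing cache_dir at local NVMe makes the first epoch download each shard and later epochs read the local copies. A sketch, with a placeholder cache path:

```python
import webdataset as wds

dataset = (
    wds.WebDataset(
        shards,                                   # the same remote shard URLs as before
        cache_dir="/mnt/nvme/webdataset-cache",   # placeholder path on local NVMe
        shardshuffle=True,
    )
    .shuffle(1000)
    .decode("torchrgb")
    .to_tuple("jpg", "cls")
)
# Epoch 1 streams each shard from object storage and writes it to the cache;
# later epochs read the cached copies from local disk instead of the network.
```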
Fix 4 — Compressed Shards for Network-Limited Environments
When network bandwidth between the training instance and storage is the constraint, compressing shards reduces the bytes transferred per shard. WebDataset supports .tar.gz shards natively.
The tradeoff: decompression adds CPU time. If the training instance has spare CPU capacity and network bandwidth is the bottleneck, compressed shards reduce transfer time more than they increase decompression time — net positive for data loading throughput.
Monitor CPU utilisation when using compressed shards. If CPUs are at 100% and GPU is still idle, the bottleneck has shifted from network to CPU decompression, and uncompressed shards are preferable.
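A sketch of the compressed variant; the .tar.gz suffix selects gzip compression at write time, and the reader decompresses on the fly (paths and the `examples` iterable are placeholders):

```python
import webdataset as wds

# Writing: the .tar.gz suffix produces gzip-compressed shards
# (some webdataset versions also accept an explicit compress=True).
with wds.ShardWriter("shards/shard-%06d.tar.gz", maxcount=1000) as sink:
    for key, image_bytes, label in examples:      # `examples`: placeholder iterable of prepared samples
        sink.write({"__key__": key, "jpg": image_bytes, "cls": str(label).encode("utf-8")})

# Reading is unchanged; decompression happens inside the DataLoader worker
# processes, so it consumes spare CPU rather than GPU time.
dataset = wds.WebDataset("shards/shard-{000000..000099}.tar.gz")
```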
Fix 5 — Storage Colocation with Compute
The lowest-latency option is to run training jobs on instances in the same data centre or network region as the storage. Network round-trip times within a data centre are sub-millisecond. Cross-region or cross-provider network paths add 10–100ms to every object fetch.
For training on IBEE object storage, running training jobs on compute instances in IBEE's infrastructure or in the same Indian network region eliminates the latency component entirely.
Combining the Fixes
The optimal configuration for training from S3-compatible object storage:
First run: download shards to local NVMe using parallel workers, use WebDataset for efficient shard reading, prefetch aggressively with multiple DataLoader workers.
Subsequent runs: read from local cache, enabling GPU utilisation above 95%.
The initial download cost is a one-time investment per dataset version. After the first epoch completes, the training job is fully decoupled from storage network performance.
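Putting the fixes together, a sketch of the combined pipeline; the endpoint, bucket, cache path, and worker count are placeholders to adapt to your own setup:

```python
import webdataset as wds
from torch.utils.data import DataLoader

shards = ("pipe:aws s3 cp --endpoint-url https://objectstorage.example.in "
          "s3://my-training-bucket/shards/shard-{000000..000999}.tar -")

dataset = (
    wds.WebDataset(shards,
                   cache_dir="/mnt/nvme/webdataset-cache",  # Fix 3: first-epoch local cache on NVMe
                   shardshuffle=True)                       # Fix 1: sharded dataset, shuffled per epoch
    .shuffle(1000)
    .decode("torchrgb")
    .to_tuple("jpg", "cls")
)

loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=8,            # Fix 2: parallel prefetch workers, sized to the CPU count
    prefetch_factor=4,
    pin_memory=True,
    persistent_workers=True,
)
```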