For ML engineers, AI infrastructure leads, and technical founders at Indian AI companies building production machine learning systems.
The Data Infrastructure Problem Most Indian AI Teams Face
Indian AI teams building serious products encounter the same infrastructure challenge at different scales: the training data lives in one place, the GPU compute lives in another, and the cost of moving data between them — either in money or in latency — is the hidden tax on every training run.
For a team training on a few gigabytes, this is a minor inconvenience. For a team training on hundreds of gigabytes of Indian language data, healthcare records, financial documents, or user-generated content with data residency obligations, the architecture of the storage layer is not a secondary concern. It determines training throughput, compliance posture, and the cost model for data access as the dataset scales.
This guide describes the architecture that solves this problem: object storage as the persistent data layer, GPU compute as a stateless processing layer that reads from and writes to storage, and a pipeline structure that makes data access efficient at scale.
The Architecture: Storage-Centric AI Infrastructure
The core principle is that the data is permanent and the compute is ephemeral. Training jobs start, consume data from object storage, produce model artifacts that are written back to object storage, and terminate. The storage persists between runs. The GPU instance does not need to.
This architecture has three benefits: training jobs can be restarted from any checkpoint without data loss, multiple training jobs can access the same dataset simultaneously without data copying, and the cost of GPU compute is paid only for the duration of active training — not for idle time while data is being managed.
Storage Layer (IBEE Object Storage)
The storage layer holds everything that needs to persist between training runs:
Raw data bucket — source datasets as collected, never modified. Every file has a stable key that uniquely identifies it across all training runs. Indian language corpora, user interaction logs, labelled medical images, financial transaction records — whatever the domain, raw data is immutable in this bucket.
Processed data bucket — training-ready datasets produced from raw data by preprocessing pipelines. Stored as WebDataset shards, Parquet files, or TFRecord files depending on the training framework. Versioned by dataset version number in the key prefix.
Model artifacts bucket — checkpoints written during training, final model weights, evaluation outputs. Each training run writes to a prefix identified by a unique run ID.
Experiment cache bucket — preprocessed features, tokenised text, cached embeddings — intermediate representations that are expensive to compute and reused across multiple training experiments.
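As a sketch, the key layout across the four buckets might look like this (all bucket, prefix, and file names here are illustrative, not prescriptive):

```
s3://raw-data/crawl-2024-06/doc-00042187.json
s3://processed-data/v3/shard-000217.tar
s3://model-artifacts/run-20240618-a1b2/checkpoint-step-45000.pt
s3://experiment-cache/tokenised/v2/embeddings-0012.parquet
```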
Compute Layer (GPU Training Jobs)
Training jobs are Python processes running on GPU instances. They connect to object storage at startup, stream training data through the dataloader, write checkpoints at configured intervals, and write final artifacts at completion.
The GPU instance has no persistent storage obligation. It is a processing node that reads from and writes to the storage layer. When training completes, the instance can be terminated without any data loss — everything produced is already in object storage.
Data Loading: The Training Throughput Bottleneck
The most common performance problem in this architecture is data loading throughput. When the GPU can consume batches faster than the dataloader can supply them, the difference is paid for as idle GPU time. On object storage, data loading throughput depends on three variables: the number of objects (per-object GET request overhead), the size of each object, and the parallelism of the dataloader.
Shard your dataset to reduce GET requests. Instead of storing 1 million individual training examples as individual files — 1 million GET requests per epoch — package them into shards of 500–1000 examples each using the WebDataset format. One thousand shards of 1000 examples each means 1000 GET requests per epoch rather than 1 million. Each GET retrieves a tar archive of examples that the dataloader processes locally.
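A minimal sharding sketch using the webdataset library; the `examples` iterable and the field names are assumptions for illustration:

```python
import webdataset as wds

# 'examples' is assumed to be an iterable of (image_bytes, label) pairs.
# ShardWriter rolls over to a new tar file every `maxcount` examples.
with wds.ShardWriter("shard-%06d.tar", maxcount=1000) as sink:
    for i, (image_bytes, label) in enumerate(examples):
        sink.write({
            "__key__": f"sample{i:07d}",   # unique key per example
            "jpg": image_bytes,            # encoded image bytes
            "cls": str(label).encode(),    # label stored as bytes
        })
```

The resulting shard files are then uploaded to the processed data bucket, for example with the AWS CLI pointed at the IBEE endpoint.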
Use parallel prefetching. Both PyTorch DataLoader and TensorFlow's tf.data pipeline support parallel prefetching — loading the next batch of shards while the GPU processes the current batch. Configure num_workers (PyTorch) or num_parallel_calls (TensorFlow) to match the number of CPU cores available on the training instance.
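In PyTorch this looks like the following; the values are illustrative starting points, not tuned settings:

```python
from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,               # e.g. a WebDataset pipeline over the shards
    batch_size=64,
    num_workers=8,         # roughly one worker per available CPU core
    prefetch_factor=4,     # batches each worker prepares ahead of the GPU
    pin_memory=True,       # speeds up host-to-GPU transfer
)
```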
Cache to local NVMe on the first epoch. For training runs that will run for multiple epochs, download all shards to local NVMe on the first epoch pass and serve subsequent epochs from local disk. The local cache eliminates repeated object storage GET requests and fully decouples data loading throughput from storage network bandwidth after the first epoch.
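With WebDataset, one way to get this behaviour is the cache_dir argument, which writes each shard to local disk on first read and serves later epochs from there (the cache path is illustrative):

```python
import webdataset as wds

# 'urls' is the same shard URL pattern the dataloader reads (see the
# loading example below). Shards are fetched from object storage once,
# stored under /nvme/wds-cache, and re-read from local NVMe afterwards.
dataset = wds.WebDataset(urls, cache_dir="/nvme/wds-cache")
```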
Connecting Training Jobs to IBEE
Training frameworks that use the AWS S3 client connect to IBEE by setting the endpoint URL. The three environment variables needed:
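```bash
# Placeholder values: substitute the endpoint and credentials from your IBEE account.
export AWS_ENDPOINT_URL=https://<your-ibee-endpoint>
export AWS_ACCESS_KEY_ID=<your-ibee-access-key>
export AWS_SECRET_ACCESS_KEY=<your-ibee-secret-key>
```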
With AWS_ENDPOINT_URL set, recent versions of the AWS SDK route all S3 calls to IBEE automatically. boto3, the AWS CLI, and any library that uses the AWS SDK (including PyTorch's S3 DataLoader integration and Hugging Face Datasets) will connect to IBEE without any code changes.
For PyTorch with S3-compatible storage using WebDataset:
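```python
import webdataset as wds
from torch.utils.data import DataLoader

# Bucket, prefix, and shard range are illustrative. The 'pipe:' URL streams
# each shard through the AWS CLI, which routes to IBEE via AWS_ENDPOINT_URL.
urls = "pipe:aws s3 cp s3://processed-data/v3/shard-{000000..000999}.tar -"

dataset = (
    wds.WebDataset(urls)
    .decode("torchrgb")        # decode images into float tensors
    .to_tuple("jpg", "cls")    # yield (image, label) pairs
)

loader = DataLoader(dataset, batch_size=64, num_workers=8)
```

This is a sketch under the shard layout assumed above; adjust the URL pattern and field names to your dataset.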
Checkpoint Strategy for Long Training Runs
A training run that takes 12–48 hours loses all progress to an instance interruption if there is no checkpoint strategy. Write checkpoints to object storage at regular intervals: every 1000 steps or every hour, whichever comes first.
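A minimal save-and-upload sketch using boto3 and torch.save; the bucket name, run ID format, and local path are assumptions:

```python
import boto3
import torch

s3 = boto3.client("s3")  # endpoint and credentials are read from the environment

def save_checkpoint(model, optimizer, step, run_id):
    path = f"/tmp/checkpoint-step-{step}.pt"
    torch.save(
        {"step": step,
         "model": model.state_dict(),
         "optimizer": optimizer.state_dict()},
        path,
    )
    # Write into the model artifacts bucket under this run's prefix.
    s3.upload_file(path, "model-artifacts", f"{run_id}/checkpoint-step-{step}.pt")
```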
Resume from the latest checkpoint at training startup by listing the checkpoints prefix and loading the highest step number.
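Continuing the sketch above, resume logic lists the run's checkpoint keys and loads the one with the highest step number:

```python
def load_latest_checkpoint(run_id):
    resp = s3.list_objects_v2(
        Bucket="model-artifacts", Prefix=f"{run_id}/checkpoint-step-"
    )
    keys = [obj["Key"] for obj in resp.get("Contents", [])]
    if not keys:
        return None  # no checkpoint yet: start from scratch
    # Keys look like <run_id>/checkpoint-step-<step>.pt; pick the highest step.
    latest = max(keys, key=lambda k: int(k.rsplit("-", 1)[-1].removesuffix(".pt")))
    s3.download_file("model-artifacts", latest, "/tmp/resume.pt")
    return torch.load("/tmp/resume.pt")
```

Note that list_objects_v2 returns at most 1000 keys per call; runs with more checkpoints than that would need pagination.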
Data Residency and Indian AI
For Indian AI companies building models on Indian user data — health records, financial transactions, language data from Indian users — the data residency of the training infrastructure matters alongside the residency of the production serving infrastructure.
Training a model on data stored in AWS S3 Mumbai means Indian user data resides on infrastructure governed by US federal law during training. The model weights produced from that training are artifacts derived from Indian user data; the data governance question extends to the artifacts, not just the raw data.
IBEE's India-sovereign storage provides a clean answer: raw data, processed training data, and model artifacts all reside on Indian infrastructure, under Indian law. For AI companies with enterprise customers, government clients, or products in regulated sectors, this is an architecture decision with compliance implications, not just technical ones.

