For ML engineers, data scientists, and MLOps engineers at Indian AI teams managing training datasets, annotation pipelines, and model artifacts.
The Ad-Hoc ML Storage Problem
Most Indian ML teams start storing datasets on a researcher's laptop. The laptop becomes a shared server. The shared server becomes an NFS mount. The NFS mount gets a `datasets_v2`, `datasets_v2_final`, and `datasets_v2_final_ACTUALLY_FINAL` directory structure. Nobody can reproduce experiments from six months ago because nobody is certain which version of the dataset was used.
This is not a discipline problem. It is an infrastructure problem. Without a storage system that natively supports versioning, access control, and remote access, ML teams default to workarounds that create the reproducibility and collaboration problems that slow down research and production deployment.
Object storage with a deliberate organisation scheme provides all three: versioning, access control, and remote access.
The Bucket Structure
Organise ML storage into separate buckets by function rather than keeping everything in one flat bucket. The recommended structure (a provisioning sketch follows the bucket list):
`company-ml-raw` — Raw data as collected: scraped web data, API exports, sensor readings, user-generated content, third-party dataset downloads. This bucket is write-once. Nothing in raw is ever modified or deleted. If a data collection run produces bad data, the bad data stays and is noted in metadata — the raw record is the ground truth of what was collected.
`company-ml-datasets` — Processed and versioned datasets ready for training. Data in this bucket has been cleaned, labelled (or annotation metadata has been linked), and prepared in training-ready format. Every dataset is versioned with a semantic version key prefix: `datasets/image-classification/v1.2.0/` contains the complete training, validation, and test splits for version 1.2.0 of the image classification dataset.
`company-ml-artifacts` — Model checkpoints, final model weights, experiment outputs, evaluation metrics. Each experiment gets a directory with the run ID from your experiment tracker (MLflow, W&B, or similar). The run ID is the link between the code, the dataset version, and the model artifact.
`company-ml-annotations` — Raw annotation files, annotation tool exports, and annotation job metadata. Kept separate because annotation data has a different access pattern and lifecycle from training data.
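As a starting point, the four buckets can be provisioned with boto3 against any S3-compatible endpoint. A minimal sketch: the endpoint URL is a placeholder rather than a real IBEE address, and credentials are assumed to come from the standard AWS environment variables.

```python
import boto3

# Placeholder endpoint for your IBEE account's S3-compatible API;
# access keys are picked up from AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY.
s3 = boto3.client("s3", endpoint_url="https://s3.ibee.example")

for bucket in (
    "company-ml-raw",
    "company-ml-datasets",
    "company-ml-artifacts",
    "company-ml-annotations",
):
    s3.create_bucket(Bucket=bucket)
```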
Dataset Versioning
Dataset versioning is the practice of treating each meaningful version of a training dataset as an immutable, addressable artifact — exactly like software versions. Version 1.0.0 of a dataset is a specific set of files with a specific label distribution. Version 1.1.0 adds new data from a second collection run. Version 2.0.0 changes the label taxonomy.
Key prefix versioning is the simplest approach: store each version under a prefix that includes the version number — `datasets/ner-hindi/v1.0.0/train/`, `datasets/ner-hindi/v1.0.0/val/`, `datasets/ner-hindi/v1.0.0/test/`. When you create a new version, write to a new prefix. The old version remains unchanged.
Dataset manifest files record the contents of each version: a JSON file at `datasets/ner-hindi/v1.0.0/manifest.json` that lists every file in the dataset, its size, its SHA256 checksum, and any relevant metadata (annotation tool version, labeller IDs, collection date range). The manifest is the authoritative record of what a dataset version contains.
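A minimal sketch of manifest generation for a local copy of a dataset version. The field names are illustrative rather than a fixed schema; extend them with annotation tool version, labeller IDs, and collection dates as described above.

```python
import hashlib
import json
import pathlib

def sha256_of(path: pathlib.Path, chunk: int = 1 << 20) -> str:
    """Stream the file through SHA256 so large files never load fully into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def build_manifest(dataset_dir: str, version: str) -> dict:
    """List every file in the dataset version with its size and checksum."""
    root = pathlib.Path(dataset_dir)
    files = [
        {
            "path": str(p.relative_to(root)),
            "size_bytes": p.stat().st_size,
            "sha256": sha256_of(p),
        }
        for p in sorted(root.rglob("*"))
        if p.is_file() and p.name != "manifest.json"  # don't list the manifest itself
    ]
    return {"version": version, "files": files}

manifest = build_manifest("ner-hindi/v1.0.0", "1.0.0")
pathlib.Path("ner-hindi/v1.0.0/manifest.json").write_text(json.dumps(manifest, indent=2))
```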
IBEE object versioning provides an additional safety layer at the bucket level. Enable versioning on the `company-ml-datasets` bucket — if a key is accidentally overwritten, the previous version of the object is preserved and recoverable.
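Assuming IBEE exposes the standard S3 versioning API (the endpoint below is a placeholder), enabling it is one call:

```python
import boto3

s3 = boto3.client("s3", endpoint_url="https://s3.ibee.example")
s3.put_bucket_versioning(
    Bucket="company-ml-datasets",
    VersioningConfiguration={"Status": "Enabled"},
)
```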
Efficient Training Data Loading
Loading training data efficiently from object storage is a function of how data is stored and how the training loop reads it.
Use columnar formats for structured data. Parquet files store data in column-major order, which means reading a subset of columns (features for training) does not require reading the full row. For tabular ML datasets with many features, Parquet reading is significantly faster than CSV.
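A hedged sketch of a column-subset read with pandas; the dataset path and column names are hypothetical, and `storage_options` is forwarded to s3fs to reach the S3-compatible endpoint.

```python
import pandas as pd

# Only the listed columns are fetched; Parquet's column-major layout means
# the remaining columns are never read from object storage.
df = pd.read_parquet(
    "s3://company-ml-datasets/datasets/tabular-fraud/v1.0.0/train.parquet",
    columns=["amount", "merchant_id", "label"],
    storage_options={"client_kwargs": {"endpoint_url": "https://s3.ibee.example"}},
)
```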
Use sharded files for large image and audio datasets. Loading 1 million individual JPEG files from object storage involves 1 million GET requests, each with network round-trip overhead. Sharding the dataset into WebDataset-format tar archives (or TFRecord files for TensorFlow) reduces 1 million GET requests to a few hundred, dramatically improving data loading throughput.
WebDataset is an open format for streaming large datasets from object storage. Each shard is a tar archive containing matched sample files (e.g. `000001.jpg` and `000001.json` for image-label pairs). The webdataset library exposes shards as a PyTorch `IterableDataset`, so a standard `DataLoader` can stream them from S3-compatible URLs with read parallelism controlled by the `num_workers` setting.
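Shards are typically written once, at dataset-build time. A sketch using the webdataset library's `ShardWriter`; the sample iterator is a placeholder you would replace with your own data source.

```python
import json
import webdataset as wds

# Hypothetical sample source: replace with your own iterator of
# (sample_id, jpeg_bytes, label_dict) tuples.
samples = []

# Pack matched image/label pairs into ~10,000-sample tar shards.
with wds.ShardWriter("shard-%06d.tar", maxcount=10_000) as writer:
    for sample_id, jpeg_bytes, label in samples:
        writer.write({
            "__key__": sample_id,                # shared basename, e.g. "000001"
            "jpg": jpeg_bytes,                   # raw JPEG bytes
            "json": json.dumps(label).encode(),  # matching label file
        })
```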
The S3 URL format works with IBEE by setting the `AWS_ENDPOINT_URL` environment variable to IBEE's endpoint before initialising the dataset.
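A sketch of the loading side. It assumes the AWS CLI is installed for the `pipe:` transport; recent CLI releases honour `AWS_ENDPOINT_URL`, while older ones need an explicit `--endpoint-url` flag inside the pipe command. The endpoint and shard range are illustrative.

```python
import os
import webdataset as wds
from torch.utils.data import DataLoader

# Placeholder endpoint; the `aws s3 cp` subprocess inherits this variable.
os.environ["AWS_ENDPOINT_URL"] = "https://s3.ibee.example"

# Brace notation expands to shards 000000..000255.
url = ("pipe:aws s3 cp "
       "s3://company-ml-datasets/datasets/image-classification/v1.2.0/train/"
       "shard-{000000..000255}.tar -")

dataset = wds.WebDataset(url).decode("pil").to_tuple("jpg", "json")
loader = DataLoader(dataset, batch_size=64, num_workers=8)
```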
Prefetch and cache. For training runs longer than a few hours, consider caching the first epoch's data to a local NVMe disk and serving subsequent epochs from cache. This eliminates repeated object storage GET requests for the same files.
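One simple approximation, assuming the AWS CLI and a local NVMe mount (paths and endpoint below are illustrative): sync the shard prefix to disk before training and point the loader at the local copy.

```python
import os
import subprocess

CACHE_DIR = "/mnt/nvme/wds-cache"  # hypothetical local NVMe mount

def cache_shards(remote_prefix: str) -> str:
    """Mirror a shard prefix to local disk; `aws s3 sync` skips files
    that are already present, so reruns cost almost nothing."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    subprocess.run(
        ["aws", "s3", "sync", remote_prefix, CACHE_DIR,
         "--endpoint-url", "https://s3.ibee.example"],
        check=True,
    )
    return CACHE_DIR

local_dir = cache_shards(
    "s3://company-ml-datasets/datasets/image-classification/v1.2.0/train/")
# Subsequent epochs read shards from local_dir instead of object storage.
```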
Annotation Workflow Storage
Annotation pipelines generate data at multiple stages: raw assets for annotation, completed annotation exports, annotation quality review outputs, and final merged labels. Each stage needs to be preserved.
Raw annotation assets — images, audio clips, text documents — are stored in the `company-ml-annotations` bucket under a job ID prefix: `annotation-jobs/ner-job-2024-03/assets/`. These are the files sent to annotators.
Annotation exports from labelling tools (Label Studio, CVAT, Labelbox) are stored alongside the assets: `annotation-jobs/ner-job-2024-03/exports/2024-03-15-export.json`. Multiple exports per job are preserved — if you re-annotate disagreements, both the original and revised exports are kept.
Merged labels — the final reconciled annotation file used to produce a training dataset version — are stored in the dataset bucket under the corresponding version prefix.
This structure means that for any training dataset version, you can trace back to the exact annotation job that produced the labels and the exact export file used.
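One way to make that trace-back mechanical is to record the annotation job and export key in the dataset manifest at build time. The provenance field names below are illustrative, not a fixed schema.

```python
import json

manifest_path = "ner-hindi/v1.0.0/manifest.json"
with open(manifest_path) as f:
    manifest = json.load(f)

# Record exactly which annotation job and export produced this version's labels.
manifest["provenance"] = {
    "annotation_job": "annotation-jobs/ner-job-2024-03",
    "export_key": "annotation-jobs/ner-job-2024-03/exports/2024-03-15-export.json",
}

with open(manifest_path, "w") as f:
    json.dump(manifest, f, indent=2)
```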
Model Artifact Storage
Every training run produces artifacts: checkpoints saved during training, the final model weights, evaluation metrics, and any generated outputs (confusion matrices, loss curves, sample predictions).
Store artifacts under a run ID that links to your experiment tracker.
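A minimal sketch of one possible layout, uploading a run's outputs under its run-ID prefix; the local file names and endpoint are placeholders.

```python
import boto3

s3 = boto3.client("s3", endpoint_url="https://s3.ibee.example")  # placeholder endpoint

run_id = "run-abc123"  # the run ID issued by MLflow / W&B
for local_path, key_suffix in [
    ("checkpoints/epoch-010.pt", "checkpoints/epoch-010.pt"),
    ("model.pt",                 "final/model.pt"),
    ("eval.json",                "metrics/eval.json"),
]:
    s3.upload_file(local_path, "company-ml-artifacts", f"{run_id}/{key_suffix}")
```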
The `run-abc123` key links to the MLflow or W&B run record, which records the dataset version, hyperparameters, and code commit used for that run. The triangle of (code version, dataset version, artifact) is fully traceable from any direction.
Access Control for ML Data
ML data has different sensitivity profiles. Raw data collected from users may contain PII. Model weights for production models should be readable by the inference serving layer but not by every researcher. Annotation job assets should be accessible to annotation tooling but not to the entire engineering organisation.
IBEE's IAM-style access controls allow per-bucket policies. A production deployment role gets read access to the `company-ml-artifacts/final/` prefix. Annotation tooling gets read/write access to `company-ml-annotations/`. Researchers get read access to processed datasets and write access to their own experiment output prefix.
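As a hedged sketch, assuming IBEE's IAM implementation accepts S3-style bucket policies via the standard API (the principal ARN and endpoint are placeholders):

```python
import json
import boto3

s3 = boto3.client("s3", endpoint_url="https://s3.ibee.example")

# Read-only access to final model weights for the production deployment role;
# adjust the Resource prefix to match your artifact layout.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::123456789012:role/prod-inference"},
        "Action": ["s3:GetObject"],
        "Resource": "arn:aws:s3:::company-ml-artifacts/final/*",
    }],
}

s3.put_bucket_policy(Bucket="company-ml-artifacts", Policy=json.dumps(policy))
```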
IBEE for Indian ML Teams
For Indian AI companies and research teams, IBEE provides S3-compatible ML dataset storage on India-sovereign infrastructure — relevant for teams working with Indian language datasets, health data, financial data, or any dataset with data residency obligations. At Rs.1.50/GB-month and Rs.2/GB egress, the cost of storing and accessing multi-terabyte ML datasets is predictable and significantly cheaper than hyperscaler alternatives.