
Storage Architecture for LLM and RAG Systems — What AI Companies Get Wrong

Mohit, Engineering team
April 22, 2026 · 6 min read

For AI engineers and technical leads building LLM-powered products and RAG systems at Indian AI companies.

The Storage Layers Most Teams Underplan

A RAG (Retrieval-Augmented Generation) system feels architecturally simple: documents go in, embeddings get generated, queries retrieve relevant chunks, the LLM generates a response. The storage complexity is hidden in the details.

In practice, a production RAG system has at least five distinct storage concerns: the raw source documents, the processed and chunked text, the embedding vectors and their metadata, the conversation and session history, and the fine-tuned model weights if you are running your own models. Each of these has different access patterns, different retention requirements, and different storage technology choices.

Teams that treat all of this as a single "data" problem end up with architectures where a document update requires re-indexing the entire corpus, where conversation history cannot be audited, or where model rollbacks require reprocessing weeks of training data.

Layer 1 — Raw Document Storage (Object Storage)

The raw document layer is the system of record. Every document ingested into the RAG system — PDFs, Word files, web pages, Markdown files, structured data exports — is stored here in its original form, unchanged.

Object storage is the right home for raw documents. Documents are typically large (relative to the text extracted from them), rarely accessed after initial processing, and need to be preserved indefinitely in case re-processing is needed with a new chunking strategy or embedding model.

Store documents in IBEE with a stable, deterministic key derived from the document's source identity: documents/source/year/month/document-id.pdf. The key structure enables listing by source, by date range, or by document ID without full corpus scans.
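A minimal sketch of that key layout in Python — the function name and the extension handling are illustrative, not part of any IBEE SDK:

```python
from datetime import datetime, timezone

def raw_document_key(source: str, document_id: str, ingested_at: datetime, ext: str = "pdf") -> str:
    """Deterministic object key: documents/source/year/month/document-id.ext."""
    return (
        f"documents/{source}/{ingested_at.year:04d}/{ingested_at.month:02d}/"
        f"{document_id}.{ext}"
    )

# Listing every document from one source in one month is then a single prefix query.
key = raw_document_key("hospital-emr", "doc-7f3a12", datetime(2026, 4, 22, tzinfo=timezone.utc))
# -> "documents/hospital-emr/2026/04/doc-7f3a12.pdf"
```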

Keep raw documents immutable. If a document is updated at the source, add a new version with a new document ID and timestamp. Do not overwrite the previous version — the change history is part of the audit trail.
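As a sketch of the ingest path, assuming IBEE exposes an S3-compatible API (the endpoint, bucket name, and helper below are hypothetical): every update writes a new object under a fresh document ID, and nothing is ever overwritten.

```python
import uuid
from datetime import datetime, timezone

import boto3

# Hypothetical S3-compatible endpoint and bucket; substitute your own configuration.
s3 = boto3.client("s3", endpoint_url="https://storage.example-ibee.in")
BUCKET = "rag-raw-documents"

def ingest_document(source: str, payload: bytes, ext: str = "pdf") -> str:
    """Write a new immutable version; an updated source document gets a fresh document ID."""
    now = datetime.now(timezone.utc)
    document_id = f"doc-{uuid.uuid4().hex[:12]}"
    key = f"documents/{source}/{now.year:04d}/{now.month:02d}/{document_id}.{ext}"
    s3.put_object(Bucket=BUCKET, Key=key, Body=payload)
    return key
```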

The re-processing advantage: when you switch embedding models (from OpenAI text-embedding-ada-002 to a newer model, or to a self-hosted embedding model), you need to regenerate all chunk embeddings. With all raw documents preserved in object storage, re-processing is a compute cost — not a data recovery problem.
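A sketch of the re-processing loop under the same assumptions — chunk_text and embed here stand in for whatever chunking function and new embedding model you adopt:

```python
import boto3

s3 = boto3.client("s3", endpoint_url="https://storage.example-ibee.in")  # hypothetical endpoint
BUCKET = "rag-raw-documents"

def reprocess_corpus(chunk_text, embed):
    """Walk every raw document and regenerate chunks and embeddings with the new model."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix="documents/"):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
            for chunk in chunk_text(body):   # layer 2 is rebuilt from the originals
                vector = embed(chunk)        # layer 3 is repopulated with new vectors
                ...                          # write chunk and vector to their stores
```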

Layer 2 — Processed Chunks (Object Storage or Database)

After ingestion, documents are split into chunks of a configured size and overlap (typically 512–2000 tokens with 10–20% overlap). Each chunk is stored with its source document reference, position within the document, and any extracted metadata (section heading, page number, document date).
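A minimal token-window chunker, shown here with a tiktoken tokeniser for illustration (in practice, match the tokeniser to whichever embedding model you use):

```python
import tiktoken

def chunk_document(text: str, chunk_size: int = 1000, overlap_ratio: float = 0.15) -> list[dict]:
    """Split text into overlapping token windows; each chunk keeps its index for layer-2 metadata."""
    enc = tiktoken.get_encoding("cl100k_base")  # illustrative choice of tokeniser
    tokens = enc.encode(text)
    step = int(chunk_size * (1 - overlap_ratio))
    chunks = []
    for n, start in enumerate(range(0, len(tokens), step)):
        window = tokens[start:start + chunk_size]
        chunks.append({"chunk_index": n, "text": enc.decode(window)})
        if start + chunk_size >= len(tokens):
            break  # the last window already covers the tail; avoid an overlap-only trailing chunk
    return chunks
```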

Processed chunks belong in a lightweight, queryable store — either object storage (as JSON files with the chunk text and metadata) or a PostgreSQL table with a JSON column for metadata.

Object storage is appropriate for large corpora (millions of chunks) where the chunk store is rebuilt from raw documents when re-indexing. A relational database is appropriate for smaller corpora where incremental updates (adding a new document's chunks without touching existing ones) are important.

Store chunks in object storage under chunks/document-id/chunk-{n:04d}.json. This key structure lets you delete all chunks from a specific document by prefix, without scanning unrelated chunks — useful for document removal workflows.
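A sketch of the chunk-store write and prefix-delete paths, again assuming an S3-compatible endpoint (bucket and endpoint names are hypothetical):

```python
import json

import boto3

s3 = boto3.client("s3", endpoint_url="https://storage.example-ibee.in")  # hypothetical endpoint
BUCKET = "rag-chunks"

def store_chunks(document_id: str, chunks: list[dict]) -> None:
    for n, chunk in enumerate(chunks):
        s3.put_object(
            Bucket=BUCKET,
            Key=f"chunks/{document_id}/chunk-{n:04d}.json",
            Body=json.dumps(chunk).encode("utf-8"),
            ContentType="application/json",
        )

def delete_document_chunks(document_id: str) -> None:
    """Remove every chunk belonging to one document via its key prefix."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=f"chunks/{document_id}/"):
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:
            s3.delete_objects(Bucket=BUCKET, Delete={"Objects": keys})
```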

Layer 3 — Embedding Vectors (Vector Database)

Embedding vectors are the dense numerical representations of text chunks that enable semantic search. They are generated by an embedding model (a neural network, typically small relative to the LLM, that converts text to a high-dimensional float vector) and stored in a vector database that supports approximate nearest-neighbour search.

The vector database is a specialist store — not object storage, not a relational database. Options available for Indian teams include Qdrant (open-source, self-hosted), Weaviate (open-source, self-hosted or cloud), pgvector (PostgreSQL extension), and Chroma (lightweight, for development).

The vector database stores embedding vectors alongside the chunk metadata needed to retrieve the source text: the document ID, chunk index, and a truncated preview of the chunk text. It does not store the full chunk text — that lives in the chunk store (layer 2) and is retrieved by ID after the vector search identifies relevant chunks.
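As one concrete illustration, using the qdrant-client Python package against a self-hosted Qdrant instance (the collection name, vector size, and placeholder vectors below are assumptions):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(url="http://localhost:6333")  # self-hosted Qdrant

client.create_collection(
    collection_name="rag_chunks",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),  # match your embedding model
)

embedding = [0.0] * 1536  # placeholder; produced by the embedding model in practice
client.upsert(
    collection_name="rag_chunks",
    points=[PointStruct(
        id=1,
        vector=embedding,
        payload={  # references only -- the full chunk text stays in the layer-2 store
            "document_id": "doc-7f3a12",
            "chunk_index": 12,
            "preview": "Section 4.2: Claim settlement timelines for ...",
        },
    )],
)

# Search returns chunk references; the full text is hydrated from layer 2 by key.
query_embedding = [0.0] * 1536  # placeholder query vector
hits = client.search(collection_name="rag_chunks", query_vector=query_embedding, limit=5)
chunk_keys = [
    f"chunks/{h.payload['document_id']}/chunk-{h.payload['chunk_index']:04d}.json"
    for h in hits
]
```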

Keeping layers in sync: when a document is updated or deleted, the corresponding chunks in layer 2 and their embedding vectors in layer 3 must be updated. Design your ingestion pipeline to maintain a mapping from document ID to chunk IDs to vector IDs. This mapping (stored in the relational database) is what enables clean document-level updates without full re-indexing.
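A sketch of that mapping table; sqlite3 is used here only to keep the example self-contained, whereas in production the table would live in the same PostgreSQL instance as layer 4:

```python
import sqlite3

conn = sqlite3.connect("rag_mapping.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS document_chunks (
    document_id TEXT    NOT NULL,
    chunk_index INTEGER NOT NULL,
    chunk_key   TEXT    NOT NULL,   -- object key in the layer-2 chunk store
    vector_id   TEXT    NOT NULL,   -- point ID in the layer-3 vector database
    PRIMARY KEY (document_id, chunk_index)
);
""")

def ids_for_document(document_id: str) -> list[tuple[str, str]]:
    """Everything that must be deleted or replaced when this document changes."""
    return conn.execute(
        "SELECT chunk_key, vector_id FROM document_chunks WHERE document_id = ?",
        (document_id,),
    ).fetchall()
```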

Layer 4 — Conversation and Session History (Relational Database)

Conversation history — the multi-turn context that allows an LLM to refer to earlier parts of a conversation — needs low-latency reads and writes and structured querying. Object storage is not appropriate here; a relational database with indexed columns is.

Store conversations with the following structure: session ID, user ID, message sequence number, role (user/assistant), message content, timestamp, and any retrieved chunk IDs that were used to generate the assistant's response.
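An illustrative schema for that structure (sqlite3 again for a self-contained sketch; production would use PostgreSQL), with the retrieved chunk IDs stored as a JSON array on assistant turns:

```python
import sqlite3

conn = sqlite3.connect("conversations.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS messages (
    session_id  TEXT    NOT NULL,
    user_id     TEXT    NOT NULL,
    seq         INTEGER NOT NULL,   -- message sequence number within the session
    role        TEXT    NOT NULL CHECK (role IN ('user', 'assistant')),
    content     TEXT    NOT NULL,
    created_at  TEXT    NOT NULL,   -- ISO-8601 timestamp
    chunk_ids   TEXT,               -- JSON array of retrieved chunk IDs (assistant turns only)
    PRIMARY KEY (session_id, seq)
);
CREATE INDEX IF NOT EXISTS idx_messages_user ON messages (user_id, created_at);
""")
```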

The retrieved chunk IDs are the audit trail for each LLM response — they allow you to trace which documents contributed to a given answer, which is increasingly important for regulated applications (financial advice, medical information, legal guidance) where the provenance of AI-generated responses must be explainable.

Layer 5 — Fine-Tuned Model Weights (Object Storage)

If your product involves fine-tuning a base LLM on domain-specific data — legal documents, medical literature, financial reports, Indian language corpora — the resulting model weights need to be stored, versioned, and deployable.

Object storage is the correct store for model weights. A fine-tuned 7B parameter model produces weights of approximately 14–28 GB depending on numeric precision (16-bit versus 32-bit), and less if quantised. Store weights in IBEE under models/model-name/v1.2.0/weights.safetensors. Use versioned key prefixes so you can roll back to a previous model version by pointing the serving layer at the previous prefix.
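One way to make that rollback a pointer flip rather than a re-upload — the CURRENT pointer object below is an illustrative convention, not an IBEE feature, and the endpoint and bucket are hypothetical:

```python
import boto3

s3 = boto3.client("s3", endpoint_url="https://storage.example-ibee.in")  # hypothetical endpoint
BUCKET = "model-registry"

def weights_key(model_name: str, version: str) -> str:
    return f"models/{model_name}/{version}/weights.safetensors"

def set_serving_version(model_name: str, version: str) -> None:
    """The serving layer reads this small pointer object at startup to find its weights prefix."""
    s3.put_object(
        Bucket=BUCKET,
        Key=f"models/{model_name}/CURRENT",
        Body=version.encode("utf-8"),
    )

# Rolling back is re-pointing at the previous prefix, not re-uploading 14-28 GB of weights.
set_serving_version("legal-7b", "v1.1.0")
```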

Keep the training configuration (config.json) and a record of the training dataset version alongside the weights — the same run ID linkage described in the ML dataset storage article. The ability to reproduce a given model version is as important as the ability to serve it.

The India-Sovereignty Dimension

For Indian AI companies building on documents that contain Indian user data — medical records, legal filings, financial documents, government records — the data residency of the RAG system's document store is a compliance question.

If a RAG system ingests Indian health records and its raw document store lives on AWS S3 or GCP, that health data sits on infrastructure subject to US federal law. For healthcare AI companies with hospital clients, this is a procurement barrier. For companies building on government document corpora, it may be a legal requirement.

IBEE's India-sovereign storage provides the document store layer on Indian infrastructure, under Indian law. The vector database and relational database layers can run on self-hosted compute within the same Indian data centre. The complete RAG system can be India-resident end-to-end.
