For data engineers, analytics architects, and technical leads building enterprise data infrastructure for Indian businesses.
Object Storage as the Data Lake Foundation
The data warehouse held structured data in proprietary formats, queryable only through its own engine. The data lake replaced the proprietary format with a principle: store everything in open formats on cheap, scalable object storage, and let any query engine read it.
This architecture — where object storage is the single source of truth and compute is decoupled from storage — is now the default for any serious analytics infrastructure. AWS Glue reads from S3. Databricks reads from S3. Apache Spark reads from S3. DuckDB reads from S3. The query engine is interchangeable. The storage layer is not.
IBEE's full S3 compatibility means every tool that can read a data lake from S3 can read one from IBEE. For Indian businesses that need their analytics data to remain India-sovereign — under Indian jurisdiction, on Indian infrastructure — this is the path to a data lake that satisfies both the analytics team and the legal team.
The Medallion Architecture
The most widely adopted data lake structure is the medallion architecture: three layers of data refinement, each stored in the same object storage system, each serving a different consumer.
Bronze — Raw Ingestion Layer
The bronze layer stores data exactly as it arrives from source systems — unmodified, unvalidated, in whatever format the source emitted. Database change data capture (CDC) streams, API event logs, CSV exports from business systems, IoT sensor readings, clickstream data — all of it lands in the bronze bucket exactly as received.
The purpose of the bronze layer is durability and reproducibility. If a transformation in the silver layer turns out to be wrong, you can replay from bronze. If a source system changes its schema, the bronze record of what the data looked like before the change is preserved.
On IBEE, the bronze bucket is configured with versioning enabled — so objects written by streaming ingestion pipelines are preserved even if keys are reused. Write-once, never-delete is the operating principle.
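As a minimal sketch, here is how versioning can be enabled on a bronze bucket using the boto3 SDK. The endpoint URL, bucket name, and credentials are placeholders, not real IBEE values.

```python
import boto3

# Endpoint, bucket name, and credentials below are placeholders --
# substitute your actual IBEE endpoint and access keys.
s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.ibee.example",
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)

# Enable versioning so a reused key never overwrites a previously
# ingested object: write-once, never-delete.
s3.put_bucket_versioning(
    Bucket="datalake-bronze",
    VersioningConfiguration={"Status": "Enabled"},
)
```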
Silver — Cleaned and Conformed Layer
The silver layer contains data that has been validated, deduplicated, schema-enforced, and conformed to a standard format. Bronze JSON events become silver Parquet files with consistent column types and null handling. Bronze CSV exports become silver tables with normalised date formats and trimmed string values.
Silver data is queryable by analysts who need reliable, clean data but do not need it to be aggregated. A data scientist building a churn model reads from silver. A fraud detection pipeline training on transaction history reads from silver.
Silver data is typically stored in Parquet or ORC format, partitioned by date or entity key. This partitioning structure is what allows query engines to push down partition filters and avoid reading the entire dataset for time-bounded queries.
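A minimal PySpark sketch of a bronze-to-silver job illustrates the pattern. The bucket paths, column names, and deduplication key are illustrative assumptions, not fixed conventions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("bronze-to-silver").getOrCreate()

# Read raw JSON events from the bronze layer (paths are illustrative).
bronze = spark.read.json("s3a://datalake-bronze/events/")

# Conform: enforce timestamp types, derive a partition column, deduplicate.
silver = (
    bronze
    .withColumn("event_ts", F.to_timestamp("event_ts"))
    .withColumn("event_date", F.to_date("event_ts"))
    .dropDuplicates(["event_id"])
)

# Partitioning by date lets query engines push down partition filters
# and skip everything outside the queried time range.
silver.write.mode("append").partitionBy("event_date") \
    .parquet("s3a://datalake-silver/events/")
```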
Gold — Business-Ready Aggregated Layer
The gold layer contains pre-aggregated, business-metric-aligned data — daily user activity summaries, weekly revenue by geography, monthly cohort retention tables. Gold data is what feeds dashboards, business intelligence tools, and executive reporting.
Gold tables are typically small compared to silver, because aggregation compresses many raw events into a single summary row. They are optimised for fast read performance and are usually partitioned to match the time granularity of the reports they serve.
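As a sketch of a silver-to-gold aggregation job, reusing the illustrative event schema from the silver example above:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("silver-to-gold").getOrCreate()

# Collapse many silver event rows into one summary row per user per day.
daily_activity = (
    spark.read.parquet("s3a://datalake-silver/events/")
    .groupBy("event_date", "user_id")
    .agg(F.count("*").alias("event_count"))
)

# Partition the gold table to match the time granularity of the
# reports it serves.
daily_activity.write.mode("overwrite").partitionBy("event_date") \
    .parquet("s3a://datalake-gold/daily_user_activity/")
```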
Open Table Formats: Delta Lake and Apache Iceberg
Raw Parquet files in a data lake work for simple analytics. At production scale, two problems emerge: concurrent writes from multiple pipelines can leave a table in an inconsistent, partially written state, and schema evolution over time makes it impossible to know which columns a given file contains without reading it.
Open table formats — Delta Lake and Apache Iceberg are the two dominant options — solve both problems by adding a transaction log on top of the raw Parquet files.
Delta Lake stores a _delta_log directory alongside the data files in each table's S3 prefix. Every write to the table — append, overwrite, merge, delete — creates a new entry in the transaction log. Readers use the log to determine which files are part of the current table snapshot. Delta Lake provides ACID transactions, schema enforcement, time travel (query the table as it was at any previous point), and schema evolution.
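A short PySpark sketch of reading a Delta table and time travelling to an earlier version, assuming the delta-spark package is on the classpath; the table path is illustrative.

```python
from pyspark.sql import SparkSession

# Assumes the delta-spark package is available; the table path is
# illustrative.
spark = (
    SparkSession.builder.appName("delta-time-travel")
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Current snapshot: readers resolve the live file list via _delta_log.
current = spark.read.format("delta").load(
    "s3a://datalake-silver/transactions/")

# Time travel: query the table exactly as it was at an earlier version.
as_of_v5 = (
    spark.read.format("delta")
    .option("versionAsOf", 5)
    .load("s3a://datalake-silver/transactions/")
)
```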
Apache Iceberg achieves the same goals with a different metadata model — a tree of metadata files that track the current and historical snapshots of the table. Iceberg has stronger support for partition evolution (you can change how the table is partitioned without rewriting all the data) and hidden partitioning (the table tracks partition values without requiring them to appear as directory names in the path).
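A sketch of creating an Iceberg table with hidden partitioning, assuming the iceberg-spark-runtime package is available; the catalog name ("lake"), warehouse path, and schema are illustrative.

```python
from pyspark.sql import SparkSession

# Assumes the iceberg-spark-runtime package; catalog name, warehouse
# path, and table schema are illustrative.
spark = (
    SparkSession.builder.appName("iceberg-on-ibee")
    .config("spark.sql.catalog.lake",
            "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse",
            "s3a://datalake-silver/warehouse")
    .getOrCreate()
)

# Hidden partitioning: the table partitions by day(event_ts) without a
# separate date column appearing in the directory layout.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.events (
        event_id STRING,
        user_id  STRING,
        event_ts TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(event_ts))
""")
```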
Both Delta Lake and Iceberg read and write data from S3-compatible storage. Both work on IBEE with no modification — point the Spark configuration or the Iceberg catalog at IBEE's endpoint and they behave identically to S3.
Query Engine Integration
The value of the data lake architecture is that the storage layer is decoupled from the query layer. Multiple engines can read the same data simultaneously.
Apache Spark is the standard batch processing engine for large-scale data transformation. Spark reads from S3-compatible storage through the Hadoop S3A connector. Set spark.hadoop.fs.s3a.endpoint to IBEE's endpoint and spark.hadoop.fs.s3a.path.style.access to true; beyond supplying credentials, these are the only configuration changes required to run Spark jobs against IBEE rather than AWS S3.
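A minimal sketch of that configuration, with a placeholder endpoint and credentials:

```python
from pyspark.sql import SparkSession

# The endpoint and credentials are placeholders for your IBEE account.
spark = (
    SparkSession.builder.appName("spark-on-ibee")
    .config("spark.hadoop.fs.s3a.endpoint", "https://s3.ibee.example")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")
    .getOrCreate()
)

# From here, every s3a:// path resolves against IBEE instead of AWS S3.
df = spark.read.parquet("s3a://datalake-silver/events/")
```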
Trino and Presto are distributed SQL query engines designed for interactive analytics against data lake storage. Both read from S3-compatible storage through the Hive connector with S3 endpoint configuration, and both support Delta Lake and Iceberg table formats through their respective connectors.
DuckDB is an in-process analytical database that can query Parquet files directly from S3-compatible storage using the httpfs extension. For data teams that want SQL query capability against lake data without spinning up a cluster, DuckDB against IBEE provides a fast, low-cost interactive query option.
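A sketch of DuckDB querying silver Parquet through httpfs from Python; the endpoint, credentials, and object paths are placeholders.

```python
import duckdb

con = duckdb.connect()

# httpfs provides the S3 protocol support; endpoint and credentials
# below are placeholders.
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
con.execute("SET s3_endpoint = 's3.ibee.example'")
con.execute("SET s3_url_style = 'path'")
con.execute("SET s3_access_key_id = 'YOUR_ACCESS_KEY'")
con.execute("SET s3_secret_access_key = 'YOUR_SECRET_KEY'")

# Query silver Parquet directly -- no cluster required.
result = con.execute("""
    SELECT event_date, count(*) AS events
    FROM read_parquet('s3://datalake-silver/events/*/*.parquet')
    GROUP BY event_date
    ORDER BY event_date
""").fetchdf()
print(result)
```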
Apache Flink, for stream processing, reads from and writes to S3-compatible storage for checkpoints and output sinks. Configure the S3 endpoint in the Flink S3 filesystem plugin configuration.
Ingestion Patterns for the IBEE Data Lake
Batch ingestion from relational databases uses tools like Airbyte or custom Spark jobs, with dbt commonly handling the downstream transformations. The typical pattern is extract from the source, transform into the target schema, and write Parquet to the silver bucket on a schedule.
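As a sketch of the custom-Spark-job variant, assuming a PostgreSQL source with its JDBC driver on the Spark classpath; the connection details and table names are placeholders.

```python
from pyspark.sql import SparkSession

# Assumes the PostgreSQL JDBC driver is on the Spark classpath; the
# connection details and table names are placeholders.
spark = SparkSession.builder.appName("batch-ingest-orders").getOrCreate()

orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db.internal:5432/app")
    .option("dbtable", "public.orders")
    .option("user", "etl_user")
    .option("password", "YOUR_PASSWORD")
    .load()
)

# Land the extracted table as Parquet in the silver bucket on each run.
orders.write.mode("overwrite").parquet("s3a://datalake-silver/orders/")
```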
Stream ingestion from event queues (Kafka, Pulsar) uses Flink or Spark Streaming to consume events and write micro-batch Parquet files to the bronze bucket in near real-time. S3-compatible storage is the standard checkpoint and output store for both engines.
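A Spark Structured Streaming sketch of that pattern, assuming the spark-sql-kafka connector package is available; broker addresses, topic, and paths are placeholders.

```python
from pyspark.sql import SparkSession

# Assumes the spark-sql-kafka connector package; brokers, topic, and
# paths are placeholders.
spark = SparkSession.builder.appName("stream-ingest").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka.internal:9092")
    .option("subscribe", "app-events")
    .load()
)

# Land raw payloads in bronze as micro-batch Parquet files; the
# streaming checkpoint also lives on object storage.
query = (
    events.selectExpr("CAST(value AS STRING) AS raw_event", "timestamp")
    .writeStream.format("parquet")
    .option("path", "s3a://datalake-bronze/app-events/")
    .option("checkpointLocation",
            "s3a://datalake-bronze/_checkpoints/app-events/")
    .trigger(processingTime="1 minute")
    .start()
)
query.awaitTermination()
```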
Change data capture from databases uses Debezium to stream row-level changes to Kafka, then Flink to write change events to the bronze bucket. This captures every insert, update, and delete from the source database without impacting source query performance.
Data Governance on India-Sovereign Storage
For Indian enterprises running data lakes with customer data, financial records, or regulated data of any kind, the jurisdiction of the storage layer matters as much as its technical capabilities. A data lake on AWS S3 or GCP Cloud Storage contains Indian enterprise data on infrastructure governed by US federal law. A data lake on IBEE contains the same data on India-sovereign infrastructure, governed by Indian law.
For businesses subject to DPDP Act obligations, RBI data localisation requirements, or any enterprise contract that requires India-resident data custody, IBEE provides the sovereignty guarantee that hyperscaler storage cannot — regardless of which region the data is physically stored in.
Getting Started
Create your bronze, silver, and gold buckets on IBEE. Configure your Spark or Flink cluster with IBEE's S3 endpoint. Start with batch ingestion from your primary data source into the bronze bucket. Add a silver transformation job. Query with DuckDB or Trino.
The architecture is identical to an AWS S3-based data lake. The jurisdiction is different.
