Le Duy Khuong (Daniel)

Dev Productivity & Tools

Lakehouse BRD — Chapter 2: Proposed system architecture

Seven-layer Lakehouse architecture: Ingestion, Processing, Storage Format, Storage, Metadata, Query, Consumption. Batch and streaming.

2026-03-17 · 2 min read

2.1 Logical architecture overview

The Lakehouse system follows a standard 7-layer architecture that supports the full data flow from ingestion to consumption by BI and AI workloads:

| Layer | Main function |
| --- | --- |
| 1. Ingestion Layer | Connect and collect data from source systems (database, API, file, streaming) |
| 2. Processing Layer | Clean, normalize, enrich, transform (batch and real-time) |
| 3. Storage Format Layer | Columnar storage, versioning and schema evolution (Parquet, Delta, Iceberg) |
| 4. Storage Layer | Physical storage (S3/MinIO), partitioning, query optimization |
| 5. Metadata & Catalog | Schema, lineage, data description, and access control |
| 6. Query Layer | High-performance query for dashboard, AI, API |
| 7. Consumption Layer | BI (Superset, Power BI), partner API integration, ML/AI |

  • Architecture layers: 7 (Ingestion to Consumption)
  • Storage formats: 3 (Iceberg / Delta / Parquet)
  • Environments: 3 (DEV / STAGING / PROD)

2.2 Physical architecture diagram

  • The system can run on-prem, hybrid, or fully in the cloud.
  • Data is organized into four zones: raw, staging, curated, and analytics.

2.3 Processing model: Batch and streaming

| Pipeline type | Characteristics | Suggested stack |
| --- | --- | --- |
| Batch | Hourly/daily ETL from core systems | Airbyte + dbt + Spark |
| Streaming | Real-time data (scoring, transactions) | Kafka + Flink + Iceberg |
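The difference between the two processing models can be shown without any of the suggested tools. The following is a minimal plain-Python sketch, not the Spark or Flink implementation: the function names and the `(customer, amount)` record shape are assumptions made for illustration. The batch function returns one result after reading the whole input, while the streaming function emits an updated result after every event.

```python
from typing import Dict, Iterable, Iterator, Tuple

Transaction = Tuple[str, float]  # (customer, amount); hypothetical record shape


def batch_totals(transactions: Iterable[Transaction]) -> Dict[str, float]:
    """Batch model: read the whole window first, then return one result."""
    totals: Dict[str, float] = {}
    for customer, amount in transactions:
        totals[customer] = totals.get(customer, 0.0) + amount
    return totals


def streaming_totals(transactions: Iterable[Transaction]) -> Iterator[Dict[str, float]]:
    """Streaming model: update state per event and emit a snapshot each time."""
    totals: Dict[str, float] = {}
    for customer, amount in transactions:
        totals[customer] = totals.get(customer, 0.0) + amount
        yield dict(totals)  # copy, so earlier snapshots are not mutated later


events = [("a", 10.0), ("b", 5.0), ("a", 2.5)]
final_batch = batch_totals(events)
snapshots = list(streaming_totals(events))
```

Both models converge on the same totals; the streaming variant simply makes intermediate state available earlier, which is what real-time scoring and transaction use cases need.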

Technology complexity per layer (score from 0 to 100):

| Layer | Complexity score |
| --- | --- |
| 1. Ingestion | 40 |
| 2. Processing | 85 |
| 3. Storage Format | 60 |
| 4. Storage | 35 |
| 5. Metadata | 70 |
| 6. Query | 75 |
| 7. Consumption | 50 |

The Processing and Query layers require the most tuning.
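To make the comparison concrete, the scores above can be ranked in a few lines of Python. The dictionary only transcribes the values listed in this section; the variable names are illustrative.

```python
# Complexity scores (0-100) transcribed from this section.
complexity = {
    "Ingestion": 40,
    "Processing": 85,
    "Storage Format": 60,
    "Storage": 35,
    "Metadata": 70,
    "Query": 75,
    "Consumption": 50,
}

# Sort layers from most to least complex.
ranked = sorted(complexity.items(), key=lambda item: item[1], reverse=True)
top_two = [name for name, _ in ranked[:2]]
print(top_two)  # ['Processing', 'Query']
```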

2.4 Sample pipeline: CRM → Lakehouse → BI dashboard

  1. Airbyte ingests from PostgreSQL CRM (CDC).
  2. Spark processes and normalizes customer data.
  3. Data is stored in MinIO as Iceberg (curated layer).
  4. DataHub records metadata and lineage.
  5. Trino queries build materialized views.
  6. Superset serves reports.
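The first steps of this flow can be sketched in plain Python to show where normalization and lineage recording fit. This is a stand-in under stated assumptions, not the Airbyte, Spark, or DataHub APIs: the `Lineage` class, the `normalize` rules, the record fields, and the source name `postgres.crm.customer` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Lineage:
    """Hypothetical stand-in for the lineage a catalog such as DataHub would store."""
    edges: List[Tuple[str, str, str]] = field(default_factory=list)

    def record(self, source: str, target: str, step: str) -> None:
        self.edges.append((source, target, step))


def normalize(rows: List[Dict]) -> List[Dict]:
    """Stand-in for the processing step: drop rows without an id, trim names, lowercase emails."""
    cleaned = []
    for row in rows:
        if row.get("id") is None:
            continue
        cleaned.append({
            "id": row["id"],
            "name": (row.get("name") or "").strip(),
            "email": (row.get("email") or "").strip().lower(),
        })
    return cleaned


def run_pipeline(source_rows: List[Dict], lineage: Lineage) -> List[Dict]:
    raw = list(source_rows)  # step 1: ingest as-is into the raw zone
    lineage.record("postgres.crm.customer", "crm_raw.customer_profile", "ingest")
    cleaned = normalize(raw)  # step 2: process and normalize
    lineage.record("crm_raw.customer_profile", "crm_stg.customer_profile_cleaned", "normalize")
    return cleaned


lineage = Lineage()
result = run_pipeline(
    [{"id": 1, "name": " Ann ", "email": "ANN@Example.com "}, {"id": None, "name": "x"}],
    lineage,
)
```

Recording a lineage edge at every hop is what later allows a dashboard figure to be traced back to its source table.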
| Layer | Example |
| --- | --- |
| Raw | crm_raw.customer_profile |
| Staging | crm_stg.customer_profile_cleaned |
| Curated | crm_cur.customer_profile_enriched |
| Analytics | crm_ana.customer_segmentation_dashboard |
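The examples above appear to follow a `<domain>_<zone>.<entity>` pattern. That pattern is inferred from the four sample names rather than stated as a rule, so the helper below is a sketch under that assumption; `table_name` and `ZONE_SUFFIX` are hypothetical names.

```python
# Zone suffixes as they appear in the example table names.
ZONE_SUFFIX = {
    "raw": "raw",
    "staging": "stg",
    "curated": "cur",
    "analytics": "ana",
}


def table_name(domain: str, zone: str, entity: str) -> str:
    """Build a qualified table name such as 'crm_raw.customer_profile'."""
    if zone not in ZONE_SUFFIX:
        raise ValueError(f"unknown zone: {zone!r}")
    return f"{domain}_{ZONE_SUFFIX[zone]}.{entity}"


print(table_name("crm", "raw", "customer_profile"))  # crm_raw.customer_profile
```

Generating names from one function, rather than typing them by hand in each pipeline, keeps the zone of every table readable from its schema prefix.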

Data quality progression across layers:

| Layer | Approximate quality |
| --- | --- |
| Raw (as-is) | ~30% |
| Staging (cleaned) | ~65% |
| Curated (enriched) | ~90% |
| Analytics (aggregated) | ~98% |

2.5 Environment configuration

| Environment | Purpose | Data policy |
| --- | --- | --- |
| DEV | Pipeline testing, validation | Anonymized data |
| STAGING | UAT, close to PROD | Full data, delayed timing |
| PROD | Production operations | Real data, audit and control |
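One way to keep these data policies enforceable is to express them as configuration that pipeline code consults. The sketch below is an assumption-laden illustration: `EnvPolicy`, `PII_FIELDS`, and `prepare_row` are hypothetical, and truncated SHA-256 hashing is used only as a simple masking stand-in (strictly speaking it is pseudonymization, which may not satisfy a real anonymization requirement).

```python
import hashlib
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class EnvPolicy:
    purpose: str
    anonymize: bool  # DEV works on anonymized data
    audited: bool    # PROD runs under audit and control


POLICIES: Dict[str, EnvPolicy] = {
    "DEV": EnvPolicy("Pipeline testing, validation", anonymize=True, audited=False),
    "STAGING": EnvPolicy("UAT, close to PROD", anonymize=False, audited=False),
    "PROD": EnvPolicy("Production operations", anonymize=False, audited=True),
}

PII_FIELDS = ("name", "email")  # hypothetical list of sensitive columns


def prepare_row(env: str, row: Dict) -> Dict:
    """Return a copy of the row, masking sensitive fields when the policy requires it."""
    policy = POLICIES[env]
    result = dict(row)
    if not policy.anonymize:
        return result
    for key in PII_FIELDS:
        if result.get(key) is not None:
            digest = hashlib.sha256(str(result[key]).encode("utf-8")).hexdigest()
            result[key] = digest[:12]
    return result


row = {"id": 7, "name": "Ann", "email": "ann@example.com"}
dev_row = prepare_row("DEV", row)
prod_row = prepare_row("PROD", row)
```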

CI/CD: pipelines are deployed through CI, rolled back automatically on failure, and tracked with version tags and full change logs.
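The deploy-then-rollback behavior can be sketched as a small function. This is a minimal illustration rather than the configuration of any specific CI system: the `deploy` function, its `apply` and `healthy` callbacks, and the version strings are all assumptions.

```python
from typing import Callable, List, Tuple


def deploy(
    current: str,
    candidate: str,
    apply: Callable[[str], None],
    healthy: Callable[[], bool],
    change_log: List[Tuple[str, str]],
) -> str:
    """Apply the candidate version; restore the current one if it fails or is unhealthy."""
    try:
        apply(candidate)
        ok = healthy()
    except Exception:
        ok = False  # a failed apply or health check counts as a failed deployment

    if ok:
        change_log.append(("deployed", candidate))
        return candidate

    apply(current)  # rollback to the last known-good version
    change_log.append(("rolled_back", candidate))
    return current


applied: List[str] = []
log: List[Tuple[str, str]] = []
active = deploy("v1.0.0", "v1.1.0", applied.append, lambda: False, log)
```

In this run the health check fails, so `active` stays at `v1.0.0` and the change log records the rolled-back candidate, which is the audit trail the version tags and change logs are meant to provide.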

Le Duy Khuong

AI Transformation & Digital Strategy. Writing about agentic systems, engineering leadership, and building in public.