Lakehouse BRD — Chapter 2: Proposed system architecture
Seven-layer Lakehouse architecture: Ingestion, Processing, Storage Format, Storage, Metadata, Query, Consumption. Batch and streaming.
2026-03-17 · 2 min read
2.1 Logical architecture overview
The Lakehouse system follows a 7-layer architecture that covers the full flow from ingestion to consumption, including BI and AI workloads:
| Layer | Main function |
|---|---|
| 1. Ingestion Layer | Connect and collect data from source systems (database, API, file, streaming) |
| 2. Processing Layer | Clean, normalize, enrich, transform (batch and real-time) |
| 3. Storage Format Layer | Columnar storage, versioning and schema evolution (Parquet, Delta, Iceberg) |
| 4. Storage Layer | Physical storage (S3/MinIO), partitioning, query optimization |
| 5. Metadata & Catalog | Schema, lineage, data description, and access control |
| 6. Query Layer | High-performance query for dashboard, AI, API |
| 7. Consumption Layer | BI (Superset, Power BI), partner API integration, ML/AI |
At a glance: 7 architecture layers (Ingestion to Consumption), 3 storage formats (Iceberg, Delta, Parquet), 3 environments (DEV, STAGING, PROD).
2.2 Physical architecture diagram
- The system can run on-prem, hybrid, or fully in the cloud.
- Data is split into four zones: `raw`, `staging`, `curated`, and `analytics`.
2.3 Processing model: Batch and streaming
| Pipeline type | Characteristics | Suggested stack |
|---|---|---|
| Batch | Hourly/daily ETL from core systems | Airbyte + dbt + Spark |
| Streaming | Real-time data (scoring, transactions) | Kafka + Flink + Iceberg |
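The choice between the two pipeline types can be sketched as a simple routing rule. This is a minimal sketch, not a prescribed design: the stacks are taken from the table above, while `PipelineSpec`, `choose_pipeline`, and the one-hour threshold are hypothetical names and values introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PipelineSpec:
    kind: str
    stack: Tuple[str, ...]


# Stacks come from the table above.
BATCH = PipelineSpec("batch", ("Airbyte", "dbt", "Spark"))
STREAMING = PipelineSpec("streaming", ("Kafka", "Flink", "Iceberg"))

# Assumption: batch runs at most hourly, so anything that needs fresher
# data than one hour is routed to the streaming stack.
HOURLY_SECONDS = 3600


def choose_pipeline(required_freshness_seconds: int) -> PipelineSpec:
    """Return the pipeline type that can meet the requested data freshness."""
    if required_freshness_seconds < HOURLY_SECONDS:
        return STREAMING
    return BATCH
```

Under these assumptions, a transaction-scoring feed that needs data within seconds would be routed to the streaming stack, while a daily core-system extract would stay on batch.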
Technology complexity per layer (score 0–100):
| Layer | Complexity score |
|---|---|
| 1. Ingestion | 40 |
| 2. Processing | 85 |
| 3. Storage Format | 60 |
| 4. Storage | 35 |
| 5. Metadata | 70 |
| 6. Query | 75 |
| 7. Consumption | 50 |
The Processing and Query layers score highest and require the most tuning.
2.4 Sample pipeline: CRM → Lakehouse → BI dashboard
- Airbyte ingests from PostgreSQL CRM (CDC).
- Spark processes and normalizes customer data.
- Data is stored in MinIO as Iceberg (curated layer).
- DataHub records metadata and lineage.
- Trino queries build materialized views.
- Superset serves reports.
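The first steps of this flow (ingest, normalize, store as curated, record lineage) can be sketched in plain Python. This is an illustrative stand-in, not the Airbyte, Spark, or DataHub APIs: every function name, the `customer_id` and `email` columns, and the source identifier `postgres.crm.customer` are assumptions. The zone table names follow the examples given in this section.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]
LineageEdge = Tuple[str, str]  # (upstream, downstream)


def ingest(source_rows: List[Record]) -> List[Record]:
    """Raw zone: copy source rows unchanged (stand-in for the CDC ingest step)."""
    return [dict(row) for row in source_rows]


def normalize(raw_rows: List[Record]) -> List[Record]:
    """Staging zone: drop rows without an id, trim and lowercase the rest."""
    cleaned = []
    for row in raw_rows:
        customer_id = (row.get("customer_id") or "").strip()
        if not customer_id:
            continue
        email = (row.get("email") or "").strip().lower()
        cleaned.append({"customer_id": customer_id, "email": email})
    return cleaned


def enrich(staged_rows: List[Record]) -> List[Record]:
    """Curated zone: add a derived email_domain column."""
    enriched = []
    for row in staged_rows:
        email = row["email"]
        domain = email.split("@", 1)[1] if "@" in email else ""
        enriched.append({**row, "email_domain": domain})
    return enriched


def run_pipeline(source_rows: List[Record]) -> Tuple[List[Record], List[LineageEdge]]:
    """Run the three stages and collect one lineage edge per hop."""
    lineage: List[LineageEdge] = []
    raw = ingest(source_rows)
    # "postgres.crm.customer" is a hypothetical source identifier.
    lineage.append(("postgres.crm.customer", "crm_raw.customer_profile"))
    staged = normalize(raw)
    lineage.append(("crm_raw.customer_profile", "crm_stg.customer_profile_cleaned"))
    curated = enrich(staged)
    lineage.append(("crm_stg.customer_profile_cleaned", "crm_cur.customer_profile_enriched"))
    return curated, lineage
```

In a real deployment each function would be replaced by the corresponding tool; the point of the sketch is that every hop writes to a new zone and emits a lineage edge that a catalog can store.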
| Layer | Example |
|---|---|
| Raw | crm_raw.customer_profile |
| Staging | crm_stg.customer_profile_cleaned |
| Curated | crm_cur.customer_profile_enriched |
| Analytics | crm_ana.customer_segmentation_dashboard |
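The examples in the table suggest a `<domain>_<zone suffix>.<table>` naming pattern. A small helper can enforce it; this is a sketch that assumes the pattern holds for every domain, and `ZONE_SUFFIX` and `table_name` are hypothetical names.

```python
# Zone suffixes inferred from the example names in the table above.
ZONE_SUFFIX = {
    "raw": "raw",
    "staging": "stg",
    "curated": "cur",
    "analytics": "ana",
}


def table_name(domain: str, zone: str, table: str) -> str:
    """Build a qualified table name such as crm_stg.customer_profile_cleaned."""
    if zone not in ZONE_SUFFIX:
        raise ValueError(f"unknown zone: {zone!r}")
    return f"{domain}_{ZONE_SUFFIX[zone]}.{table}"
```

For example, `table_name("crm", "curated", "customer_profile_enriched")` yields `crm_cur.customer_profile_enriched`, matching the Curated row.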
Data quality progression across layers:
| Layer | Approximate quality |
|---|---|
| Raw (as-is) | ~30% |
| Staging (cleaned) | ~65% |
| Curated (enriched) | ~90% |
| Analytics (aggregated) | ~98% |
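The text does not define how these quality percentages are measured. One plausible reading, shown here purely as an assumption, is the share of records that pass every validation rule for a zone. The function name `quality_score` and the sample rules are hypothetical.

```python
from typing import Callable, Dict, Iterable

Row = Dict[str, object]
Rule = Callable[[Row], bool]


def quality_score(rows: Iterable[Row], rules: Iterable[Rule]) -> float:
    """Percentage of rows that satisfy all rules (0.0 for an empty input)."""
    rows = list(rows)
    rules = list(rules)
    if not rows:
        return 0.0
    passed = sum(1 for row in rows if all(rule(row) for rule in rules))
    return 100.0 * passed / len(rows)


# Illustrative rules only; real checks would come from the data contract.
def has_id(row: Row) -> bool:
    return bool(row.get("customer_id"))


def has_valid_email(row: Row) -> bool:
    return "@" in str(row.get("email", ""))
```

Running the same rule set against each zone would give comparable scores from raw through analytics.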
2.5 Environment configuration
| Environment | Purpose | Data policy |
|---|---|---|
| DEV | Pipeline testing, validation | Anonymized data |
| STAGING | UAT, close to PROD | Full data, delayed timing |
| PROD | Production operations | Real data, audit and control |
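The environment table can be expressed as configuration that pipeline code consults before writing data. This is a minimal sketch: the flags are one interpretation of the "Data policy" column, `EnvironmentPolicy` and `prepare_row` are hypothetical names, and the hashing shown is simple pseudonymization rather than a complete anonymization scheme.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class EnvironmentPolicy:
    purpose: str
    anonymize_data: bool   # DEV: anonymized data
    delayed_refresh: bool  # STAGING: full data, delayed timing
    audit_enabled: bool    # PROD: real data, audit and control


POLICIES: Dict[str, EnvironmentPolicy] = {
    "DEV": EnvironmentPolicy(
        "Pipeline testing, validation",
        anonymize_data=True, delayed_refresh=False, audit_enabled=False,
    ),
    "STAGING": EnvironmentPolicy(
        "UAT, close to PROD",
        anonymize_data=False, delayed_refresh=True, audit_enabled=False,
    ),
    "PROD": EnvironmentPolicy(
        "Production operations",
        anonymize_data=False, delayed_refresh=False, audit_enabled=True,
    ),
}


def prepare_row(environment: str, row: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the row, masking the email where the policy requires it."""
    policy = POLICIES[environment]
    if not policy.anonymize_data:
        return dict(row)
    # Assumption: a truncated SHA-256 digest is an acceptable stand-in for the
    # real value in DEV; a production setup would likely use a keyed scheme.
    digest = hashlib.sha256(row["email"].encode("utf-8")).hexdigest()[:12]
    return {**row, "email": f"user_{digest}@example.invalid"}
```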
CI/CD: pipelines are deployed through CI and rolled back on failure; each release carries a version tag and a full change log.
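The deploy-and-rollback behavior can be sketched as follows. This assumes the CI system exposes a deploy action and a health check as callables; `deploy_with_rollback` and its parameters are hypothetical and not tied to any specific CI product.

```python
from typing import Callable, List


def deploy_with_rollback(
    current_version: str,
    new_version: str,
    deploy: Callable[[str], None],
    health_check: Callable[[str], bool],
    change_log: List[str],
) -> str:
    """Deploy new_version; redeploy current_version if the release is unhealthy.

    Returns the version that is live when the function finishes.
    """
    try:
        deploy(new_version)
        healthy = health_check(new_version)
    except Exception as error:
        # A failed deploy or check is logged and treated as an unhealthy release.
        change_log.append(f"{new_version}: deploy raised {error!r}")
        healthy = False

    if healthy:
        change_log.append(f"{new_version}: deployed")
        return new_version

    deploy(current_version)
    change_log.append(f"{new_version}: rolled back to {current_version}")
    return current_version
```

Every outcome appends to `change_log`, so the version tag and the reason for a rollback stay traceable.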
