Lakehouse BRD Chapter 2 — Architecture Design and Technology Stack

2.1 Logical architecture overview

The Lakehouse system follows a standard seven-layer architecture that carries data from ingestion through to consumption by BI and AI workloads:

Layer	Main function
1. Ingestion Layer	Connect to and collect data from source systems (databases, APIs, files, streams)
2. Processing Layer	Clean, normalize, enrich, transform (batch and real-time)
3. Storage Format Layer	Columnar storage, versioning and schema evolution (Parquet, Delta, Iceberg)
4. Storage Layer	Physical storage (S3/MinIO), partitioning, query optimization
5. Metadata & Catalog	Schema, lineage, data description, and access control
6. Query Layer	High-performance queries for dashboards, AI, and APIs
7. Consumption Layer	BI (Superset, Power BI), partner API integration, ML/AI

Architecture layers: Ingestion to Consumption
Storage formats: Iceberg / Delta / Parquet
Environments: DEV / STAGING / PROD

Pipeline type	Characteristics	Suggested stack
Batch	Hourly/daily ETL from core systems	Airbyte + dbt + Spark
Streaming	Real-time data (scoring, transactions)	Kafka + Flink + Iceberg
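The choice between the two pipeline types in the table above can be sketched as a lookup on the freshness a dataset needs. This is only an illustration: the one-hour cut-off, the `PipelineChoice` type, and the `choose_pipeline` function are assumptions, not part of the BRD; the stacks are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineChoice:
    pipeline_type: str
    stack: tuple[str, ...]


# Stacks copied from the pipeline table above.
BATCH = PipelineChoice("Batch", ("Airbyte", "dbt", "Spark"))
STREAMING = PipelineChoice("Streaming", ("Kafka", "Flink", "Iceberg"))

# Hypothetical cut-off: data that may be an hour old or more is handled in batch.
BATCH_MIN_LATENCY_SECONDS = 3600


def choose_pipeline(max_latency_seconds: int) -> PipelineChoice:
    """Pick a pipeline type from the maximum acceptable data latency."""
    if max_latency_seconds <= 0:
        raise ValueError("max_latency_seconds must be positive")
    if max_latency_seconds >= BATCH_MIN_LATENCY_SECONDS:
        return BATCH
    return STREAMING
```

For example, a daily ETL job from a core system (`choose_pipeline(86400)`) resolves to the batch stack, while a transaction feed that must be visible within seconds resolves to the streaming stack.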

Technology complexity per layer (chart, complexity score 0–100): the Processing and Query layers require the most tuning.

Layer	Example
Raw	`crm_raw.customer_profile`
Staging	`crm_stg.customer_profile_cleaned`
Curated	`crm_cur.customer_profile_enriched`
Analytics	`crm_ana.customer_segmentation_dashboard`
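The examples above appear to follow a `<domain>_<layer suffix>.<table>` pattern. A minimal sketch of a helper that builds and validates such names, assuming that pattern holds in general; the `table_name` function and the lowercase-identifier rule are illustrative assumptions, while the suffixes `raw`, `stg`, `cur`, and `ana` are taken from the table.

```python
import re

# Layer suffixes as they appear in the examples above.
LAYER_SUFFIX = {
    "raw": "raw",
    "staging": "stg",
    "curated": "cur",
    "analytics": "ana",
}

# Assumed rule: lowercase snake_case identifiers only.
_IDENTIFIER = re.compile(r"^[a-z][a-z0-9_]*$")


def table_name(domain: str, layer: str, table: str) -> str:
    """Build a qualified name such as 'crm_raw.customer_profile'."""
    suffix = LAYER_SUFFIX.get(layer.lower())
    if suffix is None:
        raise ValueError(f"unknown layer: {layer!r}")
    for part in (domain, table):
        if not _IDENTIFIER.match(part):
            raise ValueError(f"invalid identifier: {part!r}")
    return f"{domain}_{suffix}.{table}"
```

Calling `table_name("crm", "staging", "customer_profile_cleaned")` yields `crm_stg.customer_profile_cleaned`, matching the Staging row.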

Data quality progression across layers:

Layer	Approximate quality
Raw (as-is)	~30%
Staging (cleaned)	~65%
Curated (enriched)	~90%
Analytics (aggregated)	~98%
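This chapter does not define how the quality percentages are measured. One possible reading, shown here purely as an assumption, is the share of rows that pass every validation check at a given layer. The `quality_score` function and the row and check shapes are hypothetical.

```python
from typing import Callable, Iterable, Mapping

Row = Mapping[str, object]
Check = Callable[[Row], bool]


def quality_score(rows: Iterable[Row], checks: Iterable[Check]) -> float:
    """Return the percentage of rows that pass all checks (assumed metric)."""
    row_list = list(rows)
    check_list = list(checks)
    if not row_list:
        return 0.0  # no rows: report 0 rather than divide by zero
    passed = sum(
        1 for row in row_list if all(check(row) for check in check_list)
    )
    return 100.0 * passed / len(row_list)
```

Under this definition, running the same checks against the Raw, Staging, and Curated tables would give a comparable score per layer, which is one way to track the progression listed above.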

Environment	Purpose	Data policy
DEV	Pipeline testing, validation	Anonymized data
STAGING	UAT, close to PROD	Full data, delayed timing
PROD	Production operations	Real data, audit and control
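The environment table can be expressed as configuration that a loader consults before writing data. The sketch below is an assumption-heavy illustration: the `EnvPolicy` fields, the `PII_FIELDS` set, and the hashing approach are hypothetical, and an unsalted hash is pseudonymization rather than strong anonymization.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvPolicy:
    purpose: str
    anonymize: bool  # DEV works on anonymized data
    audited: bool    # PROD is subject to audit and control


# Purposes copied from the table; the boolean flags are an interpretation of it.
ENVIRONMENTS = {
    "DEV": EnvPolicy("Pipeline testing, validation", anonymize=True, audited=False),
    "STAGING": EnvPolicy("UAT, close to PROD", anonymize=False, audited=False),
    "PROD": EnvPolicy("Production operations", anonymize=False, audited=True),
}

# Hypothetical list of columns treated as personal data.
PII_FIELDS = frozenset({"full_name", "email"})


def _mask(value: object) -> str:
    """Replace a value with a short, stable hash (illustrative only)."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:12]


def prepare_row(env: str, row: dict) -> dict:
    """Return a copy of the row, masking PII when the environment requires it."""
    policy = ENVIRONMENTS[env]
    if not policy.anonymize:
        return dict(row)
    return {k: (_mask(v) if k in PII_FIELDS else v) for k, v in row.items()}
```

With this layout, `prepare_row("DEV", row)` masks `email` and `full_name` while leaving other columns untouched, and the same call with `"PROD"` returns the row unchanged.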

CI/CD: pipelines are deployed through CI and rolled back on failure; every release carries a version tag and a full change log.
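The deploy-and-rollback behaviour can be sketched as a small function that a CI job might call. Everything here is hypothetical: the `release` function, the `deploy` and `is_healthy` callbacks, and the log wording stand in for whatever the real CI tooling provides.

```python
from typing import Callable


def release(
    new_tag: str,
    current_tag: str,
    deploy: Callable[[str], object],
    is_healthy: Callable[[], bool],
    change_log: list[str],
) -> str:
    """Deploy new_tag; on failure redeploy current_tag. Returns the live tag."""
    change_log.append(f"deploy {new_tag}")
    try:
        deploy(new_tag)
        ok = is_healthy()
    except Exception as exc:  # any deploy or health-check error triggers rollback
        change_log.append(f"deploy {new_tag} failed: {exc}")
        ok = False
    if ok:
        return new_tag
    # Roll back to the last known-good version tag and record it.
    deploy(current_tag)
    change_log.append(f"rollback to {current_tag}")
    return current_tag
```

Passing the change log in as a list keeps the sketch testable; in practice the entries would more likely go to the CI system's release notes or an audit table.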