Le Duy Khuong (Daniel)

Engineering Leadership

Lakehouse architecture overview (7 layers)

Seven-layer Lakehouse: Ingestion (Airbyte, Kafka), Processing (Spark, Flink, dbt), Storage Format (Iceberg, Delta Lake), Storage (MinIO), Metadata & Catalog (DataHub, Apache Atlas), Query (Trino, DuckDB), Consumption (Superset, FastAPI, MLflow).

2026-03-17 · 1 min read

Design goals

  1. Unified data: a single source of truth for the organization
  2. Real-time analytics: support for both batch and streaming workloads
  3. Scalability: capacity that scales with business demand
  4. Cost efficiency: built on an open source stack
  5. Compliance: meets applicable data protection regulations, e.g. for personal data

Seven-layer architecture

| Layer | Purpose | Technology | Pattern |
| --- | --- | --- | --- |
| 1. Ingestion | Collect from source systems | Airbyte, Kafka, Logstash | CDC, event streaming, batch |
| 2. Processing | Process, clean, transform | Spark, Flink, dbt | ETL/ELT, stream processing |
| 3. Storage Format | Storage format, schema evolution | Iceberg, Delta Lake, Parquet | ACID, time travel |
| 4. Storage | Physical object storage | MinIO (S3-compatible) | Partitioning, compression |
| 5. Metadata & Catalog | Lineage, governance | DataHub, Apache Atlas | Discovery, lineage, quality |
| 6. Query | SQL engine | Trino, DuckDB | Federated queries, caching |
| 7. Consumption | End-user consumption | Superset, FastAPI, MLflow | BI, API, ML serving |

Data flow

Source Systems → Ingestion → Processing → Storage Format → Object Storage → Metadata Catalog → Query Engine → Consumption apps.
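As a rough illustration, the seven layers and the flow above can be written down as a small lookup structure. This is only a sketch: the `Layer` tuple and the helper names `describe_flow` and `layer_for` are hypothetical, and only the layer names and technologies come from the table.

```python
from typing import NamedTuple


class Layer(NamedTuple):
    order: int
    name: str
    technologies: tuple[str, ...]


# Names and technologies copied from the seven-layer table.
# The data structure itself is an illustrative assumption.
LAYERS = (
    Layer(1, "Ingestion", ("Airbyte", "Kafka", "Logstash")),
    Layer(2, "Processing", ("Spark", "Flink", "dbt")),
    Layer(3, "Storage Format", ("Iceberg", "Delta Lake", "Parquet")),
    Layer(4, "Storage", ("MinIO",)),
    Layer(5, "Metadata & Catalog", ("DataHub", "Apache Atlas")),
    Layer(6, "Query", ("Trino", "DuckDB")),
    Layer(7, "Consumption", ("Superset", "FastAPI", "MLflow")),
)


def describe_flow(layers: tuple[Layer, ...] = LAYERS) -> str:
    """Render the data flow as an arrow-separated string, in layer order."""
    ordered = sorted(layers, key=lambda layer: layer.order)
    return " → ".join(["Source Systems", *(layer.name for layer in ordered)])


def layer_for(technology: str) -> Layer:
    """Return the layer that lists the given technology (case-insensitive)."""
    wanted = technology.lower()
    for layer in LAYERS:
        if wanted in (tech.lower() for tech in layer.technologies):
            return layer
    raise KeyError(technology)
```

For example, `layer_for("trino").name` gives `"Query"`, which makes it easy to check where a tool sits in the stack when reviewing a design.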

Source system → Lakehouse mapping

| Source system (example) | Ingestion method | Target layer |
| --- | --- | --- |
| CRM (PostgreSQL) | Airbyte CDC | Raw → Staging |
| Core Lending (SQL Server) | Batch ETL | Raw → Curated |
| Payment APIs | Kafka streaming | Raw → Analytics |
| Risk Engine | File-based ETL | Staging → Curated |
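The mapping above could be kept as configuration and checked automatically. The sketch below is one possible shape, not the author's implementation: `SourceMapping` and `zones_traversed` are hypothetical names, and it assumes that data only moves forward through raw, staging, curated, analytics and that a mapping such as Raw → Curated passes through the zones in between.

```python
from dataclasses import dataclass

# Zone order as listed in the "Data organization" section.
ZONES = ("raw", "staging", "curated", "analytics")


@dataclass(frozen=True)
class SourceMapping:
    source: str
    ingestion: str
    start_zone: str
    end_zone: str

    def __post_init__(self) -> None:
        for zone in (self.start_zone, self.end_zone):
            if zone not in ZONES:
                raise ValueError(f"unknown zone: {zone!r}")
        # Assumption: data never moves backwards (e.g. curated -> raw).
        if ZONES.index(self.start_zone) >= ZONES.index(self.end_zone):
            raise ValueError("data must move forward through the zones")

    def zones_traversed(self) -> tuple[str, ...]:
        """Zones from start to end, inclusive (assumes no zone is skipped)."""
        start = ZONES.index(self.start_zone)
        end = ZONES.index(self.end_zone)
        return ZONES[start : end + 1]


# Rows copied from the source system mapping table.
MAPPINGS = (
    SourceMapping("CRM (PostgreSQL)", "Airbyte CDC", "raw", "staging"),
    SourceMapping("Core Lending (SQL Server)", "Batch ETL", "raw", "curated"),
    SourceMapping("Payment APIs", "Kafka streaming", "raw", "analytics"),
    SourceMapping("Risk Engine", "File-based ETL", "staging", "curated"),
)
```

Validating in `__post_init__` means a mistyped zone or a backwards mapping fails when the configuration is loaded rather than when a pipeline runs.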

Data organization

Layers: raw/ (immutable) → staging/ → curated/ → analytics/

Naming: {environment}/{layer}/{domain}/{table_name}/
Examples: prod/curated/customer/customer_master/ and staging/raw/lending/loan_applications/ (in the second path, staging is the environment and raw is the layer).
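A small helper can keep paths consistent with this naming scheme. The following is a minimal sketch under stated assumptions: the function names `table_path` and `parse_table_path` are hypothetical, and the set of allowed environments is a guess based on the two examples (only prod and staging appear in the text).

```python
# Layer names from the "Data organization" section.
LAYERS = ("raw", "staging", "curated", "analytics")
# Assumption: only the two environments seen in the examples are allowed.
ENVIRONMENTS = ("prod", "staging")


def table_path(environment: str, layer: str, domain: str, table_name: str) -> str:
    """Build a {environment}/{layer}/{domain}/{table_name}/ prefix."""
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer!r}")
    for part in (domain, table_name):
        if not part or "/" in part:
            raise ValueError(f"invalid path component: {part!r}")
    return f"{environment}/{layer}/{domain}/{table_name}/"


def parse_table_path(path: str) -> dict[str, str]:
    """Split a prefix back into its four components, validating each one."""
    parts = path.strip("/").split("/")
    if len(parts) != 4:
        raise ValueError(f"expected 4 components, got {len(parts)}: {path!r}")
    environment, layer, domain, table_name = parts
    table_path(environment, layer, domain, table_name)  # reuse the validation
    return {
        "environment": environment,
        "layer": layer,
        "domain": domain,
        "table_name": table_name,
    }
```

Because the first component is always read as the environment and the second as the layer, `staging/raw/lending/loan_applications/` parses unambiguously even though "staging" is also a layer name.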
