Le Duy Khuong (Daniel)


Lakehouse BRD — Chapter 3: Data flow

End-to-end data flow: sources (Core, CRM, Payment, Risk, API), Airbyte/Kafka/Flink, Spark/dbt, Iceberg/MinIO, DataHub, Trino, Superset/ML/API.

2026-03-17

3.1 Input data sources

Organizations typically run several business systems that feed the lakehouse: internal core systems, CRM, payment, risk engine, log files, and partner APIs.

| System type | Typical data sources | Technology |
| --- | --- | --- |
| Core Lending | PostgreSQL, SQL Server | CDC/ETL |
| CRM | MySQL/PostgreSQL | CDC/ETL |
| Payment | Log file, message queue | File, Kafka |
| Risk & Scoring | Python output, CSV, Excel | File drop |
| Partners | API pull, webhook | REST, JSON |
| Logs | Nginx, app, transaction | Filebeat, Logstash |
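The source inventory above can be kept as a small registry so that ingestion choices are queryable. The following is a minimal sketch, not a prescribed implementation: the `Source` class and the `systems_using` helper are hypothetical names, and the rows simply restate the table.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """One row of the source inventory (hypothetical structure)."""
    system: str                  # system type
    origins: tuple[str, ...]     # typical data sources
    ingestion: tuple[str, ...]   # technology used to bring data in


SOURCES = (
    Source("Core Lending", ("PostgreSQL", "SQL Server"), ("CDC", "ETL")),
    Source("CRM", ("MySQL", "PostgreSQL"), ("CDC", "ETL")),
    Source("Payment", ("Log file", "message queue"), ("File", "Kafka")),
    Source("Risk & Scoring", ("Python output", "CSV", "Excel"), ("File drop",)),
    Source("Partners", ("API pull", "webhook"), ("REST", "JSON")),
    Source("Logs", ("Nginx", "app", "transaction"), ("Filebeat", "Logstash")),
)


def systems_using(technology: str) -> list[str]:
    """Return the system types whose ingestion column lists the technology."""
    wanted = technology.lower()
    return [
        s.system
        for s in SOURCES
        if wanted in (t.lower() for t in s.ingestion)
    ]


print(systems_using("CDC"))    # ['Core Lending', 'CRM']
print(systems_using("kafka"))  # ['Payment']
```

Keeping the inventory as data rather than prose makes it easier to check, for example, which systems need a CDC connector before any pipeline is built.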

3.2 End-to-end data flow diagram

Source Systems → Airbyte (batch) / Kafka (streaming) → Apache Flink (real-time ETL) → Spark / dbt → Apache Iceberg tables stored on MinIO → DataHub / Amundsen (catalog) → Trino / DuckDB → Superset / ML / API consumers.
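As a rough illustration of the flow above, the stages can be written as an ordered path per source. This sketch assumes every flow passes through all downstream stages in the order drawn, and that the only branching is the ingestion tool (Airbyte for batch, Kafka for streaming); `flow_path` is a hypothetical helper.

```python
from __future__ import annotations

# Ingestion tool per mode, as shown in the diagram.
INGESTION = {"batch": "Airbyte", "streaming": "Kafka"}

# Stages after ingestion, in diagram order (assumed to apply to every flow).
DOWNSTREAM = (
    "Apache Flink (real-time ETL)",
    "Spark / dbt",
    "Iceberg tables on MinIO",
    "DataHub / Amundsen (catalog)",
    "Trino / DuckDB",
    "Superset / ML / API consumers",
)


def flow_path(source: str, mode: str) -> list[str]:
    """Return the ordered stages a source passes through for a given mode."""
    if mode not in INGESTION:
        raise ValueError(f"unknown ingestion mode: {mode!r}")
    return [source, INGESTION[mode], *DOWNSTREAM]


for stage in flow_path("Payment", "streaming"):
    print(stage)
```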

3.3 Key data mapping

| System | Typical source tables | Lakehouse tables (curated) |
| --- | --- | --- |
| CRM | customer, customer_kyc | curated.crm.customer, curated.crm.kyc |
| Core Lending | loan_contracts, repayment | curated.lending.contracts, .repayment |
| Payment | payment_transactions | curated.payment.transaction |
| Risk | blacklist, risk_score_log | curated.risk.blacklist, curated.risk.scores |
| Logs | nginx_logs, error_logs | raw.logs.nginx, raw.logs.error |
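The mapping above lends itself to a simple lookup. The sketch below is illustrative only: `TABLE_MAP`, `target_table`, and `layer_of` are hypothetical names, and the `.repayment` shorthand from the table is assumed to mean `curated.lending.repayment`.

```python
# (system, source table) -> lakehouse table, restating the mapping table.
TABLE_MAP = {
    ("CRM", "customer"): "curated.crm.customer",
    ("CRM", "customer_kyc"): "curated.crm.kyc",
    ("Core Lending", "loan_contracts"): "curated.lending.contracts",
    # Assumption: ".repayment" in the table abbreviates this full name.
    ("Core Lending", "repayment"): "curated.lending.repayment",
    ("Payment", "payment_transactions"): "curated.payment.transaction",
    ("Risk", "blacklist"): "curated.risk.blacklist",
    ("Risk", "risk_score_log"): "curated.risk.scores",
    ("Logs", "nginx_logs"): "raw.logs.nginx",
    ("Logs", "error_logs"): "raw.logs.error",
}


def target_table(system: str, source_table: str) -> str:
    """Look up the lakehouse table for a source table, failing loudly if unmapped."""
    try:
        return TABLE_MAP[(system, source_table)]
    except KeyError:
        raise KeyError(f"no mapping for {system}.{source_table}") from None


def layer_of(lakehouse_table: str) -> str:
    """Return the layer prefix (first dotted component) of a lakehouse table."""
    return lakehouse_table.split(".", 1)[0]


print(target_table("CRM", "customer_kyc"))          # curated.crm.kyc
print(layer_of(target_table("Logs", "nginx_logs")))  # raw
```

Note that log tables land in the raw layer rather than curated, which the `layer_of` check makes visible.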

3.4 Metadata and cataloging

Open-source metadata stack:

| Component | Typical tool |
| --- | --- |
| Metadata platform | DataHub |
| Lineage tracking | Kafka + Spark + Iceberg + DataHub |
| Tag & classification | Domain, sensitivity, use case |
| API metadata | Sync from OpenAPI / REST spec |
| Data owner | Each table has steward/owner |

Example for table curated.crm.customer: Domain: Customer; Confidentiality: High; Owner/Steward: assigned by department.
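One way to picture such a catalog entry is as a small record with a completeness check. This is a sketch under assumptions, not DataHub's actual model: `TableMetadata`, `Sensitivity`, and `governance_gaps` are hypothetical names, the Low and Medium levels are invented for illustration (the source only mentions High), and "CRM department" is a placeholder owner.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Sensitivity(Enum):
    LOW = "Low"        # assumed level, not named in the source
    MEDIUM = "Medium"  # assumed level, not named in the source
    HIGH = "High"


@dataclass(frozen=True)
class TableMetadata:
    table: str
    domain: str
    confidentiality: Sensitivity
    owner: str = ""
    steward: str = ""


def governance_gaps(meta: TableMetadata) -> list[str]:
    """List the governance fields that are still missing for a table."""
    gaps = []
    if not meta.domain.strip():
        gaps.append("domain")
    # The rule "each table has steward/owner" is read as: at least one is set.
    if not (meta.owner.strip() or meta.steward.strip()):
        gaps.append("owner/steward")
    return gaps


customer = TableMetadata(
    table="curated.crm.customer",
    domain="Customer",
    confidentiality=Sensitivity.HIGH,
    owner="CRM department",  # placeholder value
)
print(governance_gaps(customer))  # []
```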

3.5 Lineage from raw to analytics

Standard lineage chain: Raw (original data, snapshot/CDC) → Staging (cleansing, validation) → Curated (enrichment, business logic) → Analytics (materialized views, pre-aggregation). Each step is recorded in the catalog, along with its lineage.
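The chain above can be sketched as an ordered list of layers with the edges a lineage tool would record. This is a minimal illustration assuming each step moves a dataset exactly one layer forward; the `layer.dataset` naming and both helper functions are hypothetical.

```python
from __future__ import annotations

# Layers in lineage order, from the chain above.
LAYERS = ("raw", "staging", "curated", "analytics")


def is_valid_promotion(src: str, dst: str) -> bool:
    """True only when dst is the layer immediately after src."""
    if src not in LAYERS or dst not in LAYERS:
        raise ValueError(f"unknown layer in {src!r} -> {dst!r}")
    return LAYERS.index(dst) == LAYERS.index(src) + 1


def lineage_edges(dataset: str) -> list[tuple[str, str]]:
    """Return the (upstream, downstream) pairs for one dataset across all layers."""
    return [
        (f"{upstream}.{dataset}", f"{downstream}.{dataset}")
        for upstream, downstream in zip(LAYERS, LAYERS[1:])
    ]


for upstream, downstream in lineage_edges("crm.customer"):
    print(f"{upstream} -> {downstream}")

print(is_valid_promotion("raw", "curated"))  # False: skips staging
```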
