Lakehouse BRD — Chapter 3: Data flow
End-to-end data flow: sources (Core, CRM, Payment, Risk, API), Airbyte/Kafka/Flink, Spark/dbt, Iceberg/MinIO, DataHub, Trino, Superset/ML/API.
2026-03-17
3.1 Input data sources
Organizations typically run several business systems side by side: a core internal system, CRM, payment, a risk engine, log files, and partner APIs. Each one feeds the lakehouse through a different ingestion path.
| System type | Typical data sources | Ingestion method |
|---|---|---|
| Core Lending | PostgreSQL, SQL Server | CDC/ETL |
| CRM | MySQL/PostgreSQL | CDC/ETL |
| Payment | Log file, message queue | File, Kafka |
| Risk & Scoring | Python output, CSV, Excel | File drop |
| Partners | API pull, webhook | REST, JSON |
| Logs | Nginx, app, transaction | Filebeat, Logstash |
3.2 End-to-end data flow diagram
Source systems → Airbyte (batch) / Kafka (streaming) → Apache Flink (real-time ETL) → Spark / dbt → Apache Iceberg tables stored on MinIO → DataHub / Amundsen (catalog) → Trino / DuckDB → Superset / ML / API consumers.
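
The flow above can be written down as an ordered list of layers, which makes it easy to ask what sits downstream of a given stage. This is a minimal sketch: the tool names come from the diagram, while the layer names (`ingestion`, `storage`, and so on) and the helper functions are assumptions made for illustration.

```python
# Ordered layers of the flow in 3.2. Tool names are taken from the diagram;
# the layer names themselves are hypothetical labels.
PIPELINE = [
    ("ingestion", ["Airbyte (batch)", "Kafka (streaming)"]),
    ("stream processing", ["Apache Flink"]),
    ("transformation", ["Spark", "dbt"]),
    ("storage", ["Apache Iceberg on MinIO"]),
    ("catalog", ["DataHub", "Amundsen"]),
    ("query", ["Trino", "DuckDB"]),
    ("consumption", ["Superset", "ML", "API"]),
]


def downstream_of(layer: str) -> list[str]:
    """Return the layers that follow `layer`, in flow order."""
    names = [name for name, _ in PIPELINE]
    idx = names.index(layer)  # raises ValueError for an unknown layer
    return names[idx + 1:]


def render() -> str:
    """Render the flow as a single arrow-separated line."""
    return " → ".join(" / ".join(tools) for _, tools in PIPELINE)


if __name__ == "__main__":
    print(render())
    print(downstream_of("storage"))
```

Keeping the order in one structure means a change to the flow (for example, dropping Amundsen) is made in a single place.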
3.3 Key data mapping
| System | Typical source tables | Lakehouse tables (curated) |
|---|---|---|
| CRM | customer, customer_kyc | curated.crm.customer, curated.crm.kyc |
| Core Lending | loan_contracts, repayment | curated.lending.contracts, curated.lending.repayment |
| Payment | payment_transactions | curated.payment.transaction |
| Risk | blacklist, risk_score_log | curated.risk.blacklist, curated.risk.scores |
| Logs | nginx_logs, error_logs | raw.logs.nginx, raw.logs.error |
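
The mapping table can be kept as a small lookup so that ingestion jobs resolve target names from one place. The sketch below assumes that source and lakehouse tables pair up in the order they are listed in each row, and that `.repayment` is shorthand for `curated.lending.repayment`. The names `TABLE_MAP`, `target_table`, and `split_identifier` are hypothetical.

```python
# Source-to-lakehouse mapping from the table in 3.3.
# Pairing within each row follows listing order (an assumption).
TABLE_MAP = {
    ("CRM", "customer"): "curated.crm.customer",
    ("CRM", "customer_kyc"): "curated.crm.kyc",
    ("Core Lending", "loan_contracts"): "curated.lending.contracts",
    ("Core Lending", "repayment"): "curated.lending.repayment",
    ("Payment", "payment_transactions"): "curated.payment.transaction",
    ("Risk", "blacklist"): "curated.risk.blacklist",
    ("Risk", "risk_score_log"): "curated.risk.scores",
    ("Logs", "nginx_logs"): "raw.logs.nginx",
    ("Logs", "error_logs"): "raw.logs.error",
}


def target_table(system: str, source_table: str) -> str:
    """Look up the lakehouse table for a source table, failing loudly if unmapped."""
    try:
        return TABLE_MAP[(system, source_table)]
    except KeyError:
        raise LookupError(f"no mapping for {system}.{source_table}") from None


def split_identifier(name: str) -> tuple[str, str, str]:
    """Split 'layer.domain.table' into its three parts."""
    layer, domain, table = name.split(".")
    return layer, domain, table


if __name__ == "__main__":
    print(target_table("CRM", "customer_kyc"))
    print(split_identifier("raw.logs.nginx"))
```

Note that log tables land in the `raw` layer rather than `curated`, which the three-part identifier makes visible.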
3.4 Metadata and cataloging
Open-source metadata stack:
| Component | Typical tool or practice |
|---|---|
| Metadata platform | DataHub |
| Lineage tracking | Kafka + Spark + Iceberg + DataHub |
| Tag & classification | Domain, sensitivity, use case |
| API metadata | Sync from OpenAPI / REST spec |
| Data owner | Each table has steward/owner |
Example for table curated.crm.customer: Domain: Customer; Confidentiality: High; Owner/Steward: assigned per department.
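
The rule that every table carries a domain, a sensitivity level, and an owner can be checked before a table is registered. The following is a sketch in plain Python and does not use the DataHub API. The `TableMetadata` class, the `validate` function, the Low/Medium/High scale, and the owner value "CRM department" are all assumptions; only "High" and the Customer domain appear in the example above.

```python
from dataclasses import dataclass, field

# Assumed sensitivity scale; the text only shows "High".
ALLOWED_CONFIDENTIALITY = {"Low", "Medium", "High"}


@dataclass
class TableMetadata:
    name: str
    domain: str
    confidentiality: str
    owner: str = ""
    tags: list[str] = field(default_factory=list)


def validate(meta: TableMetadata) -> list[str]:
    """Return a list of problems; an empty list means the record is acceptable."""
    problems = []
    if not meta.owner.strip():
        problems.append("missing owner/steward")
    if meta.confidentiality not in ALLOWED_CONFIDENTIALITY:
        problems.append(f"unknown confidentiality: {meta.confidentiality!r}")
    return problems


if __name__ == "__main__":
    customer = TableMetadata(
        name="curated.crm.customer",
        domain="Customer",
        confidentiality="High",
        owner="CRM department",  # hypothetical owner value
    )
    print(validate(customer))
```

Returning a list of problems instead of raising on the first one lets a catalog sync job report every gap for a table in a single pass.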
3.5 Lineage from raw to analytics
Standard lineage chain: Raw (original data, snapshot/CDC) → Staging (cleansing, validation) → Curated (enrichment, business logic) → Analytics (materialized views, pre-aggregation). Each step is registered in the catalog together with its lineage, so any analytics table can be traced back to its raw source.
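
The four-layer chain can be expressed as a list of edges, which is the shape most lineage tools expect. This sketch assumes that each layer holds a table with the same domain and table name. The `staging.*` and `analytics.*` identifiers are hypothetical, since the source only names `raw.*` and `curated.*` tables.

```python
# Layer order from the lineage chain in 3.5.
LAYERS = ["raw", "staging", "curated", "analytics"]


def upstream_chain(layer: str) -> list[str]:
    """Return every layer that feeds into `layer`, in flow order."""
    idx = LAYERS.index(layer)  # raises ValueError for an unknown layer
    return LAYERS[:idx]


def lineage_edges(domain: str, table: str) -> list[tuple[str, str]]:
    """Build (upstream, downstream) pairs for one table across all layers."""
    names = [f"{layer}.{domain}.{table}" for layer in LAYERS]
    return list(zip(names, names[1:]))


if __name__ == "__main__":
    for upstream, downstream in lineage_edges("crm", "customer"):
        print(f"{upstream} -> {downstream}")
    print(upstream_chain("curated"))
```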
