Dev Productivity & Tools
Lakehouse BRD — Appendix: Layer spec & standard lineage
BRD appendix: 6 layers (Raw, Staging, Curated, Analytics, Lineage & Metadata, Audit); 4-step lineage raw→stg→cur→ana.
2026-03-172 min read
Supplement to Chapter 3 (Data flow): specification of layers and standard lineage in the Lakehouse architecture.
Appendix A – Layer specification
| Layer | Full name | Main role | OSS technology | Characteristics |
|---|---|---|---|---|
| 1. Raw | Raw data | Store original data from sources | Airbyte, Kafka, Filebeat | No schema normalization; scheduled or stream |
| 2. Staging | Initial cleansing | Cleansing & mapping | Spark, dbt | Normalize format, nulls, validate keys |
| 3. Curated | Enriched data | Business logic, domain split | Spark, dbt, Dask | For dashboards, ML |
| 4. Analytics | Analytical aggregates | High-performance query, reporting | Trino, DuckDB | Materialized views, pre-aggregation |
| 5. Lineage & Metadata | Description, tagging, ownership | Controlled management | DataHub, Hive, Amundsen | API scan, domain-based access |
| 6. Audit | Change and trace log | Audit, rollback | Delta/Iceberg + Git/Nessie | Version, schema evolution |
Each layer has owner, purpose, retention policy, data contract. Raw tables are not used directly for BI/ML.
Appendix B – Standard lineage (4 steps)
Lineage shows data from raw to BI/AI.
Step 1 – Raw: raw.crm.customer_raw; source PostgreSQL CRM; snapshot or CDC; columns id, full_name, dob, phone_raw, created_at.
Step 2 – Staging: stg.crm.customer_clean; dedupe, normalize phone, validate email; add is_valid, cleaned_timestamp.
Step 3 – Curated: cur.crm.customer_profile; join loan_contracts, enrich behavior; segment, LTV, initial risk.
Step 4 – Analytics: ana.crm.customer_lifetime_value; aggregate LTV, frequency, churn rate; for BI, ML.
| Step | Table | Main transform | Frequency |
|---|---|---|---|
| 1 | raw.crm.customer_raw | None | real-time (CDC) |
| 2 | stg.crm.customer_clean | Regex clean, null | every 1h |
| 3 | cur.crm.customer_profile | Join loan, enrich risk | every 3h |
| 4 | ana.crm.customer_ltv | 90-day aggregate | daily |
Requirements: each table has .yml schema (dbt), domain & sensitivity tags, data_owner; lineage auto-updated in DataHub/Amundsen; schema versioning (Iceberg history / Nessie).
