Le Duy Khuong (Daniel)

Lakehouse BRD — Appendix: Layer spec & standard lineage

BRD appendix: 6 layers (Raw, Staging, Curated, Analytics, Lineage & Metadata, Audit); 4-step lineage raw→stg→cur→ana.

2026-03-17 · 2 min read

Supplement to Chapter 3 (Data flow): specification of layers and standard lineage in the Lakehouse architecture.

Appendix A – Layer specification

| Layer | Full name | Main role | OSS technology | Characteristics |
|---|---|---|---|---|
| 1. Raw | Raw data | Store original data from sources | Airbyte, Kafka, Filebeat | No schema normalization; scheduled or stream |
| 2. Staging | Initial cleansing | Cleansing & mapping | Spark, dbt | Normalize format, nulls, validate keys |
| 3. Curated | Enriched data | Business logic, domain split | Spark, dbt, Dask | For dashboards, ML |
| 4. Analytics | Analytical aggregates | High-performance query, reporting | Trino, DuckDB | Materialized views, pre-aggregation |
| 5. Lineage & Metadata | Description, tagging, ownership | Controlled management | DataHub, Hive, Amundsen | API scan, domain-based access |
| 6. Audit | Change and trace log | Audit, rollback | Delta/Iceberg + Git/Nessie | Version, schema evolution |

Each layer has an owner, a purpose, a retention policy, and a data contract. Raw tables are never used directly for BI/ML.
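
The rule above can be made checkable in code. The following is a minimal sketch, not part of the BRD itself: the `LayerSpec` class, the `retention_days` unit, and the `allowed_for_bi_ml` helper are hypothetical names introduced for illustration, and the `raw.` prefix is taken from the table naming used in Appendix B.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerSpec:
    """Governance attributes every layer is required to declare."""
    name: str
    owner: str
    purpose: str
    retention_days: int  # assumption: the appendix only says "retention policy"
    data_contract: str   # assumption: a path or ID pointing at the contract

    def missing_fields(self) -> list[str]:
        """Return the names of required attributes that were left empty."""
        missing = [
            field for field in ("owner", "purpose", "data_contract")
            if not getattr(self, field).strip()
        ]
        if self.retention_days <= 0:
            missing.append("retention_days")
        return missing


def allowed_for_bi_ml(table: str) -> bool:
    """Raw tables must not feed BI/ML directly; other layers pass this check."""
    return not table.startswith("raw.")


if __name__ == "__main__":
    raw = LayerSpec("raw", "data-platform", "Store original data", 90, "")
    print(raw.missing_fields())
    print(allowed_for_bi_ml("raw.crm.customer_raw"))
```

A check like this could run in CI before a new layer or dashboard source is registered. Whether staging tables should also be blocked from BI/ML is a policy decision this sketch leaves open.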

Appendix B – Standard lineage (4 steps)

The standard lineage traces data from raw ingestion through to BI/AI consumption.

Step 1 – Raw: raw.crm.customer_raw. Source: the PostgreSQL CRM, loaded by snapshot or CDC. Columns: id, full_name, dob, phone_raw, created_at.

Step 2 – Staging: stg.crm.customer_clean. Deduplicate records, normalize phone numbers, and validate emails; add the columns is_valid and cleaned_timestamp.
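
To make the staging step concrete, here is a minimal pure-Python sketch of the cleansing logic. It is illustrative only: in the stack described here this would normally be a Spark job or a dbt model. The function names, the "latest created_at wins" dedupe rule, the digits-only phone format, and the definition of is_valid are all assumptions, and email is treated as optional because it does not appear in the Step 1 column list.

```python
import re
from datetime import datetime, timezone

# Deliberately loose pattern: one "@", no whitespace, a dot in the domain part.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def normalize_phone(phone_raw):
    """Keep digits only; return None when nothing usable is left."""
    digits = re.sub(r"\D", "", phone_raw or "")
    return digits or None


def clean_customers(rows):
    """Dedupe by id (latest created_at wins), then add the staging columns."""
    latest = {}
    for row in rows:
        kept = latest.get(row["id"])
        if kept is None or row["created_at"] > kept["created_at"]:
            latest[row["id"]] = row

    now = datetime.now(timezone.utc)
    cleaned = []
    for row in latest.values():
        phone = normalize_phone(row.get("phone_raw"))
        email = row.get("email")  # assumption: optional, not a Step 1 column
        email_ok = email is None or bool(EMAIL_RE.match(email))
        cleaned.append({
            "id": row["id"],
            "full_name": (row.get("full_name") or "").strip(),
            "dob": row.get("dob"),
            "phone": phone,
            "email": email,
            # Assumed rule: valid means a usable phone and no malformed email.
            "is_valid": phone is not None and email_ok,
            "cleaned_timestamp": now,
        })
    return cleaned
```

Invalid rows are flagged rather than dropped, so downstream layers can decide how to treat them and the raw-to-staging row counts stay reconcilable.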

Step 3 – Curated: cur.crm.customer_profile. Join with loan_contracts and enrich with behavioral data; derive segment, LTV, and initial risk.

Step 4 – Analytics: ana.crm.customer_lifetime_value. Aggregate LTV, frequency, and churn rate for BI and ML consumers.

| Step | Table | Main transform | Frequency |
|---|---|---|---|
| 1 | raw.crm.customer_raw | None | Real-time (CDC) |
| 2 | stg.crm.customer_clean | Regex clean, null handling | Every 1h |
| 3 | cur.crm.customer_profile | Join loan, enrich risk | Every 3h |
| 4 | ana.crm.customer_ltv | 90-day aggregate | Daily |
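
The four-step chain can also be expressed as a small upstream map, which makes it possible to check that a table follows the raw, stg, cur, ana order. This is a sketch under simplifying assumptions: it uses the table names as listed in the step table, gives each table a single parent (the loan_contracts join in Step 3 is left out), and `trace_to_raw` and `follows_standard_lineage` are hypothetical helpers. They are not DataHub or Amundsen APIs.

```python
# Each table maps to its direct upstream parent (single-parent simplification).
UPSTREAM = {
    "stg.crm.customer_clean": "raw.crm.customer_raw",
    "cur.crm.customer_profile": "stg.crm.customer_clean",
    "ana.crm.customer_ltv": "cur.crm.customer_profile",
}

# Layer prefixes in the order the standard lineage requires.
LAYER_ORDER = ["raw", "stg", "cur", "ana"]


def trace_to_raw(table):
    """Walk upstream parents until a table with no recorded parent is reached."""
    path = [table]
    while path[-1] in UPSTREAM:
        path.append(UPSTREAM[path[-1]])
    return path


def follows_standard_lineage(table):
    """True when the path from source to `table` visits layers in order, with no skips."""
    layers = [name.split(".", 1)[0] for name in reversed(trace_to_raw(table))]
    return layers == LAYER_ORDER[:len(layers)]


if __name__ == "__main__":
    print(" -> ".join(reversed(trace_to_raw("ana.crm.customer_ltv"))))
    print(follows_standard_lineage("ana.crm.customer_ltv"))
```

A table in the ana layer with no recorded upstream fails the check, which is the kind of gap a lineage audit would want to surface.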

Requirements: each table has a .yml schema (dbt), domain and sensitivity tags, and a data_owner; lineage is updated automatically in DataHub/Amundsen; schemas are versioned (Iceberg history / Nessie).
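
The metadata requirements lend themselves to an automated check. The sketch below validates a dictionary shaped loosely like one parsed model entry from a dbt .yml file. The exact key layout (`tags`, `meta.data_owner`, `columns`) and the `domain:` / `sensitivity:` tag format are assumptions for illustration, and `metadata_problems` is a hypothetical helper. Parsing the YAML file itself is out of scope here.

```python
# Assumed tag convention: "domain:<name>" and "sensitivity:<level>".
REQUIRED_TAG_PREFIXES = ("domain:", "sensitivity:")


def metadata_problems(model: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means compliant."""
    problems = []
    tags = model.get("tags", [])
    for prefix in REQUIRED_TAG_PREFIXES:
        if not any(tag.startswith(prefix) for tag in tags):
            problems.append(f"missing {prefix.rstrip(':')} tag")
    if not model.get("meta", {}).get("data_owner"):
        problems.append("missing data_owner")
    if not model.get("columns"):
        problems.append("no columns documented")
    return problems


if __name__ == "__main__":
    model = {
        "name": "customer_clean",
        "tags": ["domain:crm"],
        "meta": {"data_owner": "crm-team"},
        "columns": [{"name": "id"}],
    }
    for problem in metadata_problems(model):
        print(f"{model['name']}: {problem}")
```

Returning every problem at once, rather than failing on the first, gives table owners a complete list to fix in a single pass.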

Le Duy Khuong

AI Transformation & Digital Strategy. Writing about agentic systems, engineering leadership, and building in public.