Dev Productivity & Tools
Lakehouse BRD — Chapter 4: Ingestion Layer (Airbyte, Kafka, Logstash)
Ingestion Layer: collect from CRM, Core, Risk, Payment, API; batch (Airbyte) and streaming (Kafka); CDC, retry, schema registry, quality control.
2026-03-172 min read
4.1 Objectives
- Collect data from business and partner systems (CRM, Core Lending, Risk, Payment, API)
- Support batch (ETL/ELT) and streaming (real-time)
- Ensure correct format, no data loss, full logging
- Foundation for staging → curated → analytics
4.2 Components
| Component | Main function | Tool |
|---|---|---|
| Airbyte | Batch ingest (PostgreSQL, MySQL, REST, CSV, Excel) | Airbyte OSS |
| Apache Kafka | Real-time stream (events, webhooks, changelog) | Kafka, Kafka Connect |
| Logstash | Log ingest, operational streams | Logstash, Filebeat |
| Schema Registry | Stream schema management | Confluent Schema Registry |
| Ingest Controller | Orchestration and validation | Airflow / Cron / CLI |
4.3 Ingest data sources
| Source | Ingest type | Technology |
|---|---|---|
| CRM (PostgreSQL) | Batch CDC | Airbyte |
| Core Lending (SQL Server) | Batch | Airbyte |
| Payment partners | API streaming | Kafka REST / Webhook |
| Internal transactions | Log file | Logstash |
| Risk Engine logs | CSV/JSON | Filebeat → Logstash |
| Manual upload | Batch snapshot | Airbyte / CLI |
4.4 Technical setup
Airbyte: CDC with PostgreSQL/MySQL; destination MinIO + Parquet; 3 retries with exponential backoff; schedule 1h/daily; log by job_id; metadata __ingest_time, __source_system, __batch_id.
Kafka: Topic pattern by domain (e.g. org.crm.customer_update, org.payment.result); JSON/Avro; partition by store_id/region; Schema Registry; monitoring Kafka Exporter + Prometheus + Grafana.
Logstash: Input Filebeat/Log/TCP; Grok filter; output Kafka or MinIO; tag log_type, system, error_level.
4.5 Control and quality
Schema drift detection; row count comparison source vs raw; optional hash check; per-job logging; failover and retry.
4.6 Security
Credentials in Vault; PII only tagged at this layer; domain-based access; Kafka audit trail.
4.7 Monitoring
Prometheus (jobs, lag, retries); Grafana (volume, throughput); Airbyte UI; Logstash logs; Kafka Exporter.
Sample pipeline: PostgreSQL CRM → Airbyte (CDC) → Parquet raw.crm.customer_raw → logging → Grafana.
