Le Duy Khuong (Daniel)

Dev Productivity & Tools

Lakehouse BRD — Chapter 4: Ingestion Layer (Airbyte, Kafka, Logstash)

Ingestion Layer: collect from CRM, Core, Risk, Payment, API; batch (Airbyte) and streaming (Kafka); CDC, retry, schema registry, quality control.

2026-03-172 min read

4.1 Objectives

  • Collect data from business and partner systems (CRM, Core Lending, Risk, Payment, API)
  • Support batch (ETL/ELT) and streaming (real-time)
  • Ensure correct format, no data loss, full logging
  • Foundation for staging → curated → analytics

4.2 Components

ComponentMain functionTool
AirbyteBatch ingest (PostgreSQL, MySQL, REST, CSV, Excel)Airbyte OSS
Apache KafkaReal-time stream (events, webhooks, changelog)Kafka, Kafka Connect
LogstashLog ingest, operational streamsLogstash, Filebeat
Schema RegistryStream schema managementConfluent Schema Registry
Ingest ControllerOrchestration and validationAirflow / Cron / CLI

4.3 Ingest data sources

SourceIngest typeTechnology
CRM (PostgreSQL)Batch CDCAirbyte
Core Lending (SQL Server)BatchAirbyte
Payment partnersAPI streamingKafka REST / Webhook
Internal transactionsLog fileLogstash
Risk Engine logsCSV/JSONFilebeat → Logstash
Manual uploadBatch snapshotAirbyte / CLI

4.4 Technical setup

Airbyte: CDC with PostgreSQL/MySQL; destination MinIO + Parquet; 3 retries with exponential backoff; schedule 1h/daily; log by job_id; metadata __ingest_time, __source_system, __batch_id.

Kafka: Topic pattern by domain (e.g. org.crm.customer_update, org.payment.result); JSON/Avro; partition by store_id/region; Schema Registry; monitoring Kafka Exporter + Prometheus + Grafana.

Logstash: Input Filebeat/Log/TCP; Grok filter; output Kafka or MinIO; tag log_type, system, error_level.

4.5 Control and quality

Schema drift detection; row count comparison source vs raw; optional hash check; per-job logging; failover and retry.

4.6 Security

Credentials in Vault; PII only tagged at this layer; domain-based access; Kafka audit trail.

4.7 Monitoring

Prometheus (jobs, lag, retries); Grafana (volume, throughput); Airbyte UI; Logstash logs; Kafka Exporter.

Sample pipeline: PostgreSQL CRM → Airbyte (CDC) → Parquet raw.crm.customer_raw → logging → Grafana.

LDK

Le Duy Khuong

AI Transformation & Digital Strategy. Writing about agentic systems, engineering leadership, and building in public.