A data pipeline is only as good as the decisions made at each stage of its construction. Too many organisations bolt pipelines together reactively — ingesting data here, transforming it there, exposing it through whichever tool was available — and end up with a fragile, undocumented mess that nobody trusts. Here is what a principled, production-grade pipeline actually looks like.
Ingestion: Batch vs Streaming
The first decision is batch versus streaming ingestion. For operational databases, CDC (Change Data Capture) tools like Debezium publishing to Apache Kafka give you near-real-time streams of row-level change events, with latency that can be sub-second in a well-tuned deployment. For external APIs and file-based sources, scheduled batch ingestion orchestrated with Airflow or Prefect remains the pragmatic choice. The right answer depends on your downstream latency requirements, not on what is fashionable at the time of building.
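To make the CDC idea concrete, here is a minimal sketch of how a consumer might apply change events to a local replica. It assumes a simplified form of the Debezium event envelope (the `op`, `before`, and `after` fields), with the Kafka consumer loop left out entirely. The `apply_change_event` helper, the `id` key, and the sample rows are hypothetical, and real events carry more metadata and may be wrapped differently depending on converter configuration.

```python
from typing import Any, Dict

Row = Dict[str, Any]


def apply_change_event(table: Dict[Any, Row], event: Dict[str, Any], key: str = "id") -> None:
    """Apply one simplified Debezium-style change event to an in-memory replica.

    Hypothetical helper: assumes each event exposes 'op', 'before' and 'after'.
    """
    op = event["op"]
    if op in ("c", "r", "u"):
        # create, snapshot read, or update: the new row state is in 'after'
        row = event["after"]
        table[row[key]] = row
    elif op == "d":
        # delete: only 'before' holds the row, so use it to find the key
        row = event["before"]
        table.pop(row[key], None)
    else:
        # other op codes are deliberately not handled in this sketch
        raise ValueError(f"unsupported op: {op!r}")


# Hypothetical sample events, in the order they would arrive on the topic
events = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "new"}},
    {"op": "c", "before": None, "after": {"id": 2, "status": "new"}},
    {"op": "u", "before": {"id": 1, "status": "new"}, "after": {"id": 1, "status": "paid"}},
    {"op": "d", "before": {"id": 2, "status": "new"}, "after": None},
]

replica: Dict[Any, Row] = {}
for e in events:
    apply_change_event(replica, e)

print(replica)  # {1: {'id': 1, 'status': 'paid'}}
```

Because each event carries the full row state, replaying the same events in order yields the same replica, which is part of what makes this style of ingestion easy to reason about.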
The Transformation Layer
Raw ingested data is almost never directly usable for analytics. It needs cleaning, deduplication, type coercion, business rule application, and dimensional modelling. dbt (data build tool) has become the de facto standard for SQL-based transformations: it brings software engineering practices like version control, testing, and documentation to the data transformation layer. Structure your transformations in layers: staging (raw to typed), intermediate (business logic), and mart (analytics-ready).
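The three layers can be illustrated with a small sketch. This is not dbt itself: in a dbt project each layer would be a SQL model file that references the previous one. Here, SQLite views created through Python's standard `sqlite3` module stand in for those models. All table names, column names, and the "large order" rule are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical raw landing table: everything arrives as text, with a duplicate row
CREATE TABLE raw_orders (order_id TEXT, customer TEXT, amount TEXT, ordered_at TEXT);
INSERT INTO raw_orders VALUES
    ('1', ' alice ', '10.50', '2024-01-01'),
    ('1', ' alice ', '10.50', '2024-01-01'),
    ('2', 'bob',     '4.00',  '2024-01-01'),
    ('3', 'alice',   '5.25',  '2024-01-02');

-- Staging: raw to typed, with cleaning and exact-duplicate removal
CREATE VIEW stg_orders AS
SELECT DISTINCT
    CAST(order_id AS INTEGER) AS order_id,
    TRIM(customer)            AS customer,
    CAST(amount AS REAL)      AS amount,
    DATE(ordered_at)          AS order_date
FROM raw_orders;

-- Intermediate: business logic (the 5.0 threshold is an invented rule)
CREATE VIEW int_orders_flagged AS
SELECT *, amount >= 5.0 AS is_large_order
FROM stg_orders;

-- Mart: analytics-ready daily aggregate
CREATE VIEW mart_daily_revenue AS
SELECT order_date,
       COUNT(*)            AS orders,
       SUM(amount)         AS revenue,
       SUM(is_large_order) AS large_orders
FROM int_orders_flagged
GROUP BY order_date;
""")

rows = conn.execute(
    "SELECT * FROM mart_daily_revenue ORDER BY order_date"
).fetchall()

# A uniqueness check in the spirit of a data test: no order_id may repeat in staging
dupes = conn.execute(
    "SELECT order_id FROM stg_orders GROUP BY order_id HAVING COUNT(*) > 1"
).fetchall()

print(rows)   # [('2024-01-01', 2, 14.5, 1), ('2024-01-02', 1, 5.25, 1)]
print(dupes)  # []
```

Keeping each layer as its own named object means a failed check points at a specific stage rather than at one monolithic query.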
Data quality is not a data engineering problem — it is a data governance problem that data engineers are left to solve. Fix it upstream or document it loudly.
Serving: Making Data Actually Useful
The final mile of a pipeline is often the least discussed and most critical. A beautifully engineered data model that nobody can query is worthless. Choose your serving layer based on your consumers: a semantic layer or BI tool like Cube or Metabase for business analysts who think in business terms, direct SQL access to a warehouse such as Redshift or BigQuery for data scientists, and pre-aggregated REST APIs for product teams embedding analytics into their applications. Document every mart table, every metric definition, and every refresh cadence, ideally in dbt docs auto-generated from your model metadata.
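One way to keep that documentation honest is to check it automatically. The sketch below is an illustration only, not part of dbt: in a dbt project the descriptions would normally live in YAML property files alongside the models. The `MartDoc` structure, the `find_doc_gaps` function, and the catalog entries are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MartDoc:
    """Hypothetical documentation record for one mart table."""
    description: str = ""
    refresh_cadence: str = ""
    # metric name -> plain-language definition
    metrics: Dict[str, str] = field(default_factory=dict)


def find_doc_gaps(catalog: Dict[str, MartDoc]) -> List[str]:
    """Return one message per missing description, cadence, or metric definition."""
    gaps: List[str] = []
    for table, doc in sorted(catalog.items()):
        if not doc.description.strip():
            gaps.append(f"{table}: missing description")
        if not doc.refresh_cadence.strip():
            gaps.append(f"{table}: missing refresh cadence")
        for metric, definition in sorted(doc.metrics.items()):
            if not definition.strip():
                gaps.append(f"{table}.{metric}: missing metric definition")
    return gaps


# Hypothetical catalog: the second table is deliberately under-documented
catalog = {
    "mart_daily_revenue": MartDoc(
        description="One row per calendar day of completed orders.",
        refresh_cadence="hourly",
        metrics={"revenue": "Sum of order amounts, in account currency."},
    ),
    "mart_customer_ltv": MartDoc(
        description="One row per customer.",
        metrics={"ltv": ""},
    ),
}

gaps = find_doc_gaps(catalog)
for gap in gaps:
    print(gap)
# mart_customer_ltv: missing refresh cadence
# mart_customer_ltv.ltv: missing metric definition
```

A check like this could run in CI so that an undocumented mart fails the build instead of quietly reaching consumers.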