👋 Hi, I'm Waqas — a Software Architect and Technical Consultant specializing in .NET, Azure, microservices, and API-first system design.
I help companies build reliable, maintainable, and high-performance backend platforms that scale.
Data Engineering on Azure: Pipelines, Lakehouse, and Analytics at Scale
Data engineering on Azure: lakehouse, pipelines, Delta Lake, and incremental loads.
August 30, 2024 · Waqas Ahmad
Introduction
This guidance applies when you are building or operating data pipelines and a lakehouse on Azure; it breaks down when your constraints differ (for example, real-time-only workloads or a non-Azure platform). I’ve applied it in real projects and refined the takeaways over time (as of 2026).
Ingesting, transforming, and serving data at scale on Azure requires choosing the right building blocks—Data Factory, Synapse, Delta Lake, Lakehouse, Event Hubs—or cost, latency, and maintainability suffer. This article explains what each piece is, when to use it, how to design pipelines and lakehouse patterns, incremental loads and watermarks, and how to avoid common mistakes. For architects and tech leads, designing for batch first, enforcing schema evolution and data quality from day one, and securing and monitoring cost keeps the data platform maintainable.
System scale: Data pipelines and lakehouse on Azure (Data Factory, Synapse, Data Lake, Delta Lake); from single-pipeline ingestion to multi-source ETL and analytics; applies when you’re building or operating data pipelines and a lakehouse.
Team size: Data engineering and/or platform team; ownership of pipelines, lake layers (bronze/silver/gold), and downstream consumers (BI, ML) must be clear.
Time / budget pressure: Fits greenfield and incremental; breaks down when “we’ll add quality checks later” and data debt grows.
Technical constraints: Azure Data Factory, Synapse, Data Lake Gen2, Delta Lake, Spark; .NET or Python for custom logic; assumes you can enforce schema and lineage.
Non-goals: This article does not optimise for real-time-only streaming or for non-Azure data platforms; it optimises for batch and hybrid pipelines and a lakehouse on Azure.
What is data engineering and why it matters
Data engineering is the practice of building pipelines and storage so that raw data from databases, files, or events becomes reliable, queryable, and available to analysts, data scientists, and applications. Think of it as plumbing: you move data from source A to destination B, transform it along the way (clean, aggregate, join), and store it so that others can query or train models.
Why it matters: Without data engineering, every team would copy data manually, use inconsistent formats, and struggle to scale. Pipelines automate ingestion and transformation; data lakes and warehouses give a single place to store and query; lakehouse combines lake flexibility with warehouse-like reliability. On Azure, you use Data Factory to orchestrate pipelines, Synapse or Databricks for compute, and Data Lake Storage plus Delta Lake for a lakehouse so that your data platform scales with your business.
Data engineering at a glance
Concept
What it is (simple)
Data lake
A store for raw and refined data as files (e.g. Parquet, CSV) in object storage (e.g. Azure Data Lake Storage). Flexible; no fixed schema up front.
Data warehouse
A relational store optimised for analytical queries (aggregations, joins). Schema-first; e.g. Synapse dedicated SQL pool.
Lakehouse
Lake + warehouse-like features: ACID transactions, schema enforcement, time travel. One store for batch, streaming, and ML; e.g. Delta Lake on ADLS.
Pipeline
A workflow that moves and transforms data: copy from source to lake, then run Spark or SQL to refine. Orchestrated by Data Factory or Synapse.
Watermark
A value (e.g. last modified date) that marks how far we have processed. Used for incremental loads so we do not reprocess full history every run.
Delta Lake
Open format on top of Parquet that adds a transaction log, ACID, time travel, and merge/upsert. Used to build a lakehouse on ADLS.
Data lake vs data warehouse vs lakehouse (newbie-friendly)
Data lake: Imagine a large folder (object storage) where you dump files—CSV, JSON, Parquet—from many sources. You do not have to define a table schema first; you can add new columns or new files anytime. Pros: flexible, cheap storage, good for raw data. Cons: without discipline, it becomes a “data swamp” (messy, no quality). Azure: Azure Data Lake Storage (ADLS).
Data warehouse: A relational database optimised for analytical queries (big joins, aggregations). You define tables and schemas; data is loaded into tables and queried with SQL. Pros: fast queries, strong consistency, familiar SQL. Cons: less flexible for raw/unstructured data; loading and schema changes can be heavy. Azure: Synapse dedicated SQL pool (formerly SQL DW).
Lakehouse: A single store that behaves like a lake (files, flexibility) but adds ACID transactions, schema enforcement, and time travel so that you can run both batch and streaming and ML on the same data without copying into a warehouse. Azure: Delta Lake on ADLS, with compute from Synapse Spark or Databricks.
When to use which: Use a lake when you need a cheap, flexible landing zone for raw data. Use a warehouse when your main workload is relational reporting and you do not need lake flexibility. Use a lakehouse when you want one store for raw and refined data, multiple workloads (batch, streaming, ML), and schema evolution with ACID.
When to choose which: Data Factory vs Synapse vs Lakehouse
Azure Data Factory (ADF) is the orchestration layer: it moves data (copy activities) and orchestrates transformations (run Spark, run SQL, run mapping data flows). It does not store or query data itself—it coordinates work. Use ADF when you need batch pipelines, scheduling, parameters (e.g. date range), and linked services (connections to SQL, ADLS, etc.). You can use ADF alone for simple copy-and-load; for heavy transformation, you call Synapse or Databricks from ADF.
Azure Synapse Analytics is a workspace that includes: (1) Synapse pipelines (same engine as ADF—pipelines are pipelines), (2) Dedicated SQL pool (MPP data warehouse for heavy analytical queries), (3) Serverless SQL (query files in the lake with SQL, pay per query), (4) Synapse Spark (Spark for transformation and ML). Use Synapse when you want one place for pipelines, SQL, and Spark and integrated analytics and data science. Use dedicated pool when you have predictable, high-volume reporting; use serverless when you want to query the lake without loading into a warehouse; use Spark when you need Delta Lake, streaming, or ML in the same workspace.
Lakehouse (e.g. Delta Lake on ADLS with Synapse Spark or Databricks) is an architecture: your data lives as Delta tables (Parquet + transaction log) in the lake, and you run batch, streaming, and ML on the same store. Use a lakehouse when you need a single source of truth for raw and refined data, ACID and time travel, and multiple workloads over the same data. The “choice” is often: do you need a lakehouse (Delta) or is a warehouse (dedicated pool) enough? If you need flexibility, schema evolution, and one store for many workloads, choose lakehouse.
Azure Data Factory in depth: pipelines, activities, parameters
A pipeline in ADF is a JSON-defined workflow (you edit in the UI or export as ARM/Bicep). It contains activities such as Copy data, Execute pipeline, Lookup, If condition, ForEach. A typical pattern: Lookup to get the last watermark from a control table → Copy data to move only new rows from source to lake (using the watermark in the query) → Execute pipeline or Notebook to run Spark/SQL transformation → Stored procedure or Lookup to update the watermark.
Parameters and variables make pipelines reusable: e.g. PipelineStartTime, PipelineEndTime, TableName. Pass them at runtime or from a trigger so that the same pipeline can run for different dates or tables.
Example (conceptual): A pipeline that runs daily and loads only new or changed rows from a SQL table into the lake. The source query uses a watermark (e.g. WHERE LastModifiedDate > @Watermark). The watermark is stored in a small control table (e.g. PipelineControl) with columns PipelineName, TableName, LastWatermark. Each run: (1) Lookup LastWatermark for this table, (2) Copy data with the watermark in the query, (3) Update LastWatermark to max(LastModifiedDate) from the copied data. That way, the next run continues from where the previous one left off—incremental load.
YAML-like structure (simplified; ADF uses JSON):
# Conceptual pipeline structure (ADF uses JSON; this is for illustration)
name: IncrementalLoad_Sales
parameters:
  - name: TableName
    type: string
  - name: WatermarkColumn
    type: string
    default: LastModifiedDate
activities:
  - name: GetWatermark
    type: Lookup
    source: ControlTable
    query: "SELECT LastWatermark FROM PipelineControl WHERE TableName = @TableName"
  - name: CopyIncremental
    type: Copy
    source: SQL
    query: "SELECT * FROM @TableName WHERE @WatermarkColumn > @LastWatermark"
    sink: DataLake
    format: Parquet
  - name: UpdateWatermark
    type: StoredProcedure
    # UPDATE PipelineControl SET LastWatermark = @NewWatermark WHERE TableName = @TableName
In real ADF, you define this in the Author canvas or as ARM template; the idea is the same: Lookup → Copy with watermark → Update watermark.
Synapse Analytics in depth: dedicated pool vs serverless vs Spark
Dedicated SQL pool (formerly SQL Data Warehouse): A massively parallel processing (MPP) data warehouse. You create tables, load data (e.g. from the lake via PolyBase or COPY), and run SQL for reporting. You pay for reserved compute (DWU); use auto-pause to save cost when idle. Use it when you have steady, heavy analytical workloads and want predictable performance.
Serverless SQL pool: No dedicated compute. You run SQL directly over files in the lake (Parquet, CSV, Delta). You pay per query (data scanned). Use it when you want to query the lake without loading into a warehouse—ad-hoc exploration, lightweight reporting, or as a thin SQL layer over Delta.
Synapse Spark: Apache Spark in the Synapse workspace. You run notebooks or Spark jobs for transformation (e.g. Delta merge, aggregations), streaming (Structured Streaming), and ML. Use it when you need Delta Lake, complex transforms, or ML in the same workspace as your pipelines and SQL.
Summary: Dedicated = reserved warehouse for heavy SQL. Serverless = pay-per-query SQL over the lake. Spark = Delta, streaming, ML.
Lakehouse and Delta Lake in depth: ACID, time travel, merge
Delta Lake is an open-source storage layer that sits on top of Parquet in your data lake. It adds:
ACID transactions: Multiple writers can append or merge without corrupting data; each write is a transaction with a version.
Time travel: Query or restore data as it was at a past version or timestamp. Useful for auditing, debugging, and rollback.
Schema enforcement and evolution: You can enforce a schema (reject bad data) and evolve it (add columns with mergeSchema).
Merge (upsert): Insert new rows and update existing rows in one operation, using a key (e.g. CustomerId). Essential for incremental and idempotent loads.
Where it runs: Delta works with Spark. On Azure, you use Synapse Spark or Databricks to read/write Delta tables stored in ADLS. The files are still in your storage account; Delta adds a _delta_log folder (transaction log) next to the Parquet data.
Example: merge (upsert) in Delta (Spark SQL):
-- Merge new/changed rows from a staging table into the main Delta table
MERGE INTO sales_delta AS target
USING staging_sales AS source
ON target.SaleId = source.SaleId
WHEN MATCHED AND target.LastModifiedDate < source.LastModifiedDate THEN
  UPDATE SET target.Amount = source.Amount, target.LastModifiedDate = source.LastModifiedDate
WHEN NOT MATCHED THEN
  INSERT (SaleId, CustomerId, Amount, SaleDate, LastModifiedDate)
  VALUES (source.SaleId, source.CustomerId, source.Amount, source.SaleDate, source.LastModifiedDate);
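The matched/not-matched logic of the merge can be made concrete with a plain-Python sketch. This is an illustration of the semantics only, not the Delta API: hypothetical in-memory "tables" (a dict keyed by SaleId for the target, a list of dicts for staging) stand in for real Delta and staging tables.

```python
# Plain-Python sketch of MERGE semantics: latest row wins on the business key.
# Hypothetical in-memory tables; not the Delta Lake API.

def merge_upsert(target: dict, staging: list) -> dict:
    """Upsert staging rows into target, keyed by SaleId.
    Updates only when the source row is newer (LastModifiedDate)."""
    for row in staging:
        key = row["SaleId"]
        existing = target.get(key)
        if existing is None:
            target[key] = row  # WHEN NOT MATCHED THEN INSERT
        elif existing["LastModifiedDate"] < row["LastModifiedDate"]:
            target[key] = row  # WHEN MATCHED AND source is newer THEN UPDATE
    return target

target = {1: {"SaleId": 1, "Amount": 10, "LastModifiedDate": "2024-06-01"}}
staging = [
    {"SaleId": 1, "Amount": 12, "LastModifiedDate": "2024-06-02"},  # newer -> update
    {"SaleId": 2, "Amount": 30, "LastModifiedDate": "2024-06-02"},  # new -> insert
]
merged = merge_upsert(target, staging)
```

Because the update clause checks that the source row is newer, re-running the same staging data is a no-op—the same property the SQL version relies on for idempotent re-runs.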
Example: time travel (query past version):
-- Query data as it was at version 5
SELECT * FROM sales_delta VERSION AS OF 5;
-- Query data as it was at a timestamp
SELECT * FROM sales_delta TIMESTAMP AS OF '2024-06-01 00:00:00';
Vacuum: Delta keeps old files for time travel. VACUUM removes files older than a retention period (e.g. 7 days). Run with care so you do not delete history that is still needed.
Pipeline design: batch vs streaming (with examples)
Batch pipelines: Run on a schedule (e.g. daily, hourly). They copy and transform data in chunks. Use batch when latency of hours is acceptable (e.g. nightly reporting, daily aggregates). Design for idempotency: re-running the same pipeline for the same time window should not duplicate data—use merge (upsert) or replace partition so that re-runs overwrite or upsert correctly.
Streaming: Data is processed as it arrives (e.g. Event Hubs → Spark Streaming or Azure Stream Analytics). Use streaming when you need low latency (real-time dashboards, alerting). Trade-offs: more complex (checkpointing, backpressure, exactly-once); more cost. Only add streaming when the business requirement justifies it; otherwise micro-batch (e.g. run a batch pipeline every 15 minutes) is simpler and cheaper.
Example: batch pipeline flow (conceptual):
Trigger: Schedule trigger, e.g. daily at 2 AM.
Get watermark: Lookup LastWatermark from control table (e.g. 2024-06-14).
Copy: Copy from SQL where LastModifiedDate > '2024-06-14' to ADLS as Parquet (e.g. /raw/sales/2024/06/15/data.parquet).
Transform: Run Synapse Spark or Databricks notebook: read Parquet, clean/aggregate, write to Delta with merge on SaleId.
Update watermark: Set LastWatermark = max(LastModifiedDate) from this run in the control table.
Example: streaming (high level): Event Hubs → Spark Structured Streaming job that reads events, aggregates by key, and writes to Delta with foreachBatch or Delta sink. Checkpoints in ADLS so that on restart the job continues from the last offset.
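The checkpointing idea—resume from the last committed offset after a restart—can be sketched in plain Python. This is a hypothetical illustration, not the Structured Streaming API: a dict stands in for the checkpoint files that a real job would write to ADLS.

```python
# Minimal sketch of offset checkpointing: on restart, resume from the last
# committed offset. A dict stands in for the checkpoint store in ADLS.

def process_stream(events: list, checkpoint: dict) -> list:
    """Process only events after the last committed offset, then commit."""
    start = checkpoint.get("offset", 0)
    processed = events[start:]
    checkpoint["offset"] = len(events)  # commit the new offset after success
    return processed

checkpoint = {}
batch1 = process_stream(["e1", "e2", "e3"], checkpoint)       # first run: all events
batch2 = process_stream(["e1", "e2", "e3", "e4"], checkpoint)  # "restart": only e4
```

The key design point is the same as in the real job: the offset is committed only after processing succeeds, so a crash before commit means the batch is reprocessed rather than lost.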
Incremental loads and watermarks (step-by-step)
Why incremental: Loading all data every run is slow and expensive. Incremental load means: only read new or changed rows since the last run. You need a watermark—a column that always increases (e.g. LastModifiedDate, SequenceId).
Step-by-step:
Choose a watermark column in the source (e.g. LastModifiedDate). Ensure it is updated whenever a row is inserted or updated.
Store the last watermark somewhere: a small table (e.g. PipelineControl) or a file in the lake. Columns: PipelineName, TableName, LastWatermark.
At pipeline start: Read LastWatermark (e.g. 2024-06-14 00:00:00). If first run, use a default (e.g. 1900-01-01).
Source query:SELECT * FROM Sales WHERE LastModifiedDate > @LastWatermark ORDER BY LastModifiedDate. Copy only these rows to the lake or staging.
After successful copy: Compute NewWatermark = MAX(LastModifiedDate) from the copied data (or from source in a separate query). Update PipelineControl set LastWatermark = NewWatermark for this table.
Next run: Uses the updated watermark, so only newer rows are copied. Idempotency: If you use merge in Delta (on SaleId or business key), re-running the same window will upsert and not duplicate.
Late-arriving data: Sometimes rows arrive “late” (e.g. LastModifiedDate from yesterday but the pipeline already ran). To catch them, use a lookback: e.g. read WHERE LastModifiedDate > @LastWatermark - 3 days and then merge into Delta so that late rows overwrite or insert correctly. The watermark still advances by MAX(LastModifiedDate) from the data you read, not the clock.
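The steps above, including the lookback for late-arriving rows, can be sketched in plain Python. The control dict and rows are hypothetical stand-ins for the PipelineControl table and the source table; the watermark advances by the maximum of the data read, not the clock.

```python
from datetime import datetime, timedelta

# Sketch of an incremental load with a watermark and a lookback window.
# `control` stands in for the PipelineControl table; `rows` for the source.

def run_incremental(rows: list, control: dict, lookback_days: int = 3) -> list:
    last = control.get("LastWatermark", datetime(1900, 1, 1))  # default on first run
    window_start = last - timedelta(days=lookback_days)        # catch late arrivals
    picked = [r for r in rows if r["LastModifiedDate"] > window_start]
    if picked:
        # Advance the watermark by the max of the data read, not by the clock.
        control["LastWatermark"] = max(r["LastModifiedDate"] for r in picked)
    return picked

rows = [
    {"SaleId": 1, "LastModifiedDate": datetime(2024, 6, 14)},  # late-arriving row
    {"SaleId": 2, "LastModifiedDate": datetime(2024, 6, 15)},
]
control = {"LastWatermark": datetime(2024, 6, 14)}
picked = run_incremental(rows, control)
```

Note that the lookback deliberately re-reads some rows; the downstream Delta merge on the business key is what keeps the re-read from duplicating data.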
Schema evolution and data quality (with code)
Schema evolution: Over time, the source may add columns, rename, or change types. In Delta Lake, you can:
Add new columns: Use mergeSchema: true when writing so that new columns in the DataFrame are added to the table. Existing data gets null for the new column.
Overwrite schema: overwriteSchema: true can replace the table schema—use with care, as it can break existing readers.
Rename / type change: Use ALTER TABLE (Delta supports a subset) or create a new table and migrate.
Example (Spark / PySpark):
# Write with schema evolution: new columns in df are added to the Delta table
(df.write
   .format("delta")
   .mode("append")
   .option("mergeSchema", "true")
   .save("/mnt/lake/sales"))
Data quality: Ensure invalid data does not pollute downstream. Options:
Validation in the pipeline: In ADF, add a validation activity (e.g. row count, null check). If it fails, fail the pipeline and do not update the watermark.
dbt tests: If you use dbt for transformations, define tests (e.g. not_null, unique, accepted_values). Run dbt in Synapse or Databricks; failing tests can block the pipeline.
Great Expectations: Define expectations (e.g. “column X must be non-null”) and run them in a notebook or job. Fail the pipeline if expectations fail.
Quarantine: Route rows that fail validation to a quarantine container or table. Downstream only reads from the “good” path. Fix and backfill quarantined data later.
Example (conceptual) validation in ADF: After Copy, run a Lookup that executes SELECT COUNT(*) FROM staging WHERE Amount < 0. If count > 0, fail the pipeline (use If condition → Fail activity). Alternatively, run a Synapse Notebook that runs Great Expectations and raises an exception if expectations fail.
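The quarantine split can be sketched in plain Python. The validation rule here (Amount must be present and non-negative) is a hypothetical example standing in for whatever checks your pipeline enforces.

```python
# Sketch of the quarantine pattern: split rows into a "good" path and a
# quarantine path so downstream only reads validated data.

def split_quarantine(rows: list) -> tuple:
    good, quarantined = [], []
    for row in rows:
        if row.get("Amount") is not None and row["Amount"] >= 0:
            good.append(row)
        else:
            quarantined.append(row)  # fix and backfill later
    return good, quarantined

rows = [
    {"SaleId": 1, "Amount": 10},
    {"SaleId": 2, "Amount": -5},   # fails the rule -> quarantine
    {"SaleId": 3, "Amount": None}, # missing value -> quarantine
]
good, quarantined = split_quarantine(rows)
```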
Security, cost, and operations
Security: Use Managed Identity for ADF, Synapse, and Databricks to access ADLS and SQL—no stored passwords. Use RBAC and ACLs on the data lake so that only authorised identities can read/write. In Synapse, use column-level security or dynamic data masking for PII. Never store credentials in code; use Azure Key Vault and reference secrets from linked services.
Cost: Main drivers: storage (ADLS, Delta), compute (Synapse dedicated pool, Databricks clusters), data movement (egress, cross-region). Right-size compute: use auto-pause for dedicated pool; use autoscale or job clusters for Databricks so you do not leave clusters running 24/7. Reserved capacity can reduce cost for predictable workloads. Monitor cost by resource group, workspace, and pipeline so you can attribute spend.
Operations: Logging and monitoring: send pipeline runs and Spark job logs to Log Analytics or Application Insights. Alert on failed pipelines or long-running jobs. Version pipeline definitions (ARM/Bicep or Git integration) so that changes are reviewable. Use parameters and variable groups so that dev/test/prod differ only by config, not by copy-pasted pipelines.
Best practices and pitfalls
Do not over-engineer for real-time: Many teams build streaming when daily or hourly batch would meet the requirement. Streaming is harder to operate and debug. Start with batch; add streaming only when there is a clear need (e.g. real-time alerting).
Do not under-invest in schema and contracts: Without a clear schema and evolution strategy, the lake becomes a swamp. Define expected columns and types (e.g. in a contract or Delta schema); use mergeSchema carefully and version schema in code.
Avoid copy-paste pipelines: Duplicating the same pipeline for every table or environment leads to drift and bugs. Use parameters (table name, watermark column), variables, and template pipelines or dbt so that logic is defined once and reused.
Avoid cost explosion: Auto-pause dedicated pool; use job clusters or autoscale for Spark. Monitor cost by pipeline and workspace; set budgets and alerts.
Design for idempotency: Re-running a pipeline for the same window should not duplicate data. Use merge (upsert) with a business key or replace partition so that re-runs overwrite or upsert correctly.
Position & Rationale
I use Data Factory for orchestration (pipelines, triggers, dependencies); I use Synapse (dedicated pool, serverless, Spark) or Databricks for compute depending on existing commitment and need for Spark. I use Delta Lake on ADLS for a lakehouse when I need ACID, time travel, and one store for batch, streaming, and ML. I design for batch first unless streaming is clearly justified; I use incremental loads and watermarks to avoid full reprocessing. I enforce schema evolution and data quality from day one so the platform stays maintainable. I secure and monitor cost (e.g. serverless SQL, Spark job size) so we don’t overspend. I reject building a lakehouse before we have a clear use case for ACID and time travel; I reject full refresh when incremental and watermarks would do.
Trade-Offs & Failure Modes
What this sacrifices: Lakehouse and multiple compute options add complexity and cost; schema evolution and quality require discipline and tooling.
Where it degrades: When we run full refresh every time—cost and latency. When we add streaming without a real need—complexity. When we don’t monitor serverless or Spark cost—bills spike. When schema changes break downstream—we need evolution and contracts.
How it fails when misapplied: Using Synapse or Databricks for tiny one-off loads—overkill. Using Delta when we only need raw blob storage—extra cost. Skipping watermarks so every run reprocesses everything. No data quality checks so bad data propagates.
Early warning signs: “Every pipeline runs for hours”; “our Azure data bill doubled”; “downstream broke when we added a column”; “we don’t know what’s in the lake.”
What Most Guides Miss
Guides often show one service (e.g. Data Factory) and skip when to choose which—orchestration vs compute vs storage. Incremental loads and watermarks are underplayed; without them we reprocess everything and cost and latency blow up. Schema evolution (adding columns, compatibility) and data quality (checks, alerts) are rarely built in from the start; retrofitting is painful. Cost control (serverless SQL, Spark autoscale, retention) is often ignored until the bill arrives. The lakehouse (Delta on ADLS) is a good fit when we need ACID and time travel; it’s not mandatory for every data lake—raw blob and folder convention can be enough for simple batch.
Decision Framework
If you need orchestration → Data Factory (pipelines, triggers, dependencies).
If you need compute → Synapse (dedicated, serverless, Spark) or Databricks; choose by existing stack and Spark need.
If you need ACID, time travel, and one store for batch/streaming/ML → Delta Lake on ADLS (lakehouse).
If you’re doing batch → Prefer incremental loads and watermarks; avoid full refresh unless required.
If you’re adding streaming → Justify with a real use case; otherwise batch is simpler and cheaper.
If cost is a concern → Set retention, tune serverless and Spark; monitor and alert on spend.
Summary
Data engineering on Azure requires clear choices: Data Factory for orchestration; Synapse or Databricks for compute; Delta Lake on ADLS for a lakehouse when you need ACID, time travel, and one store for batch, streaming, and ML—design for batch first, use incremental loads and watermarks, and enforce schema evolution and data quality from day one. Getting these wrong leads to full reprocessing, inconsistent data, or runaway cost; securing and monitoring from the start keeps the platform maintainable and cost-effective. Next, map your sources and consumers, then choose Data Factory vs Synapse vs Lakehouse and add incremental and quality checks before scaling; use the FAQs below as a quick reference.
When I Would Use This Again — and When I Wouldn’t
I would use this approach again when I’m building or operating data pipelines and a lakehouse on Azure (Data Factory, Synapse, Data Lake, Delta) and need layered storage and idempotent ETL. I wouldn’t use it for real-time-only streaming (then focus on Event Hubs and stream processing). I also wouldn’t skip data quality or lineage; bad data and unknown dependencies cost more later. Alternative: for a single-source, small dataset, a single pipeline and one layer may suffice; add bronze/silver/gold as sources and consumers grow.
Frequently Asked Questions
What is the difference between Azure Data Factory and Synapse pipelines?
Data Factory and Synapse pipelines use the same engine. Synapse is a workspace that includes pipelines plus SQL (dedicated and serverless) and Spark. So: pipelines are pipelines; the difference is that in Synapse you develop and run them alongside your data and notebooks. For data engineering only (copy, orchestrate), ADF alone is enough. For integrated analytics and data science (SQL + Spark + pipelines in one place), Synapse is often a better fit.
When should I use a data lakehouse instead of a data warehouse?
Use a lakehouse when you need: (1) a single store for raw and refined data, (2) multiple workloads (batch, streaming, ML) over the same data, (3) schema evolution and ACID (e.g. Delta Lake). Use a dedicated warehouse (e.g. Synapse dedicated SQL pool) when your workload is primarily relational reporting and you do not need the flexibility of a lake. If you need both flexibility and reliability over the same data, choose lakehouse.
How do I handle late-arriving data in batch pipelines?
Use watermarks and a lookback window. For example: each run reads WHERE LastModifiedDate > @LastWatermark - 3 days (lookback) and then merges into Delta on the business key. Late-arriving rows (e.g. with yesterday’s date) are still in that window and get merged. Update the watermark to MAX(LastModifiedDate) from the data you read so the next run does not reprocess the same rows. Design the merge so that the latest row wins (e.g. WHEN MATCHED AND source.LastModifiedDate > target.LastModifiedDate THEN UPDATE).
What are the main cost drivers in Azure data engineering?
Storage (ADLS, Delta)—cheap. Compute (Synapse dedicated pool, Databricks clusters)—often the largest cost; use auto-pause, right-sizing, job clusters. Data movement (egress, cross-region copy)—can add up. Right-size compute, monitor by pipeline and workspace, and archive cold data to cheaper tiers.
How do I enforce data quality in Azure pipelines?
In the pipeline: Add validation activities (e.g. row count, null check) in ADF; if validation fails, fail the pipeline and do not update the watermark. In the transformation layer: Use dbt tests or Great Expectations in Synapse/Databricks; fail the job if tests fail. Quarantine: Route bad rows to a separate container or table so downstream only sees validated data; fix and backfill later.
Should I use Delta Lake or plain Parquet in the lake?
Use Delta Lake when you need ACID, time travel, merge/upsert, or schema enforcement. Use plain Parquet when you only append and do not need updates, deletes, or time travel. Delta adds a transaction log and metadata but simplifies many operational scenarios (re-runs, late data, audits).
What is a watermark in batch pipelines?
A watermark is a value (e.g. the maximum LastModifiedDate or SequenceId) that marks how far we have already processed. Each pipeline run: (1) reads the last watermark from a control table, (2) copies only rows where the watermark column is greater than the last watermark, (3) updates the stored watermark to the new maximum. That way, the next run continues from where the previous one left off—incremental load without reprocessing full history.
What is Delta Lake time travel?
Time travel in Delta Lake lets you query or restore data as it was at a past version or timestamp. For example: SELECT * FROM my_table VERSION AS OF 5 or TIMESTAMP AS OF '2024-06-01'. Useful for auditing, debugging, and rollback. Delta keeps old data files until you run VACUUM; use vacuum with care so you do not delete history that is still needed.
How do I choose between Synapse dedicated SQL pool and serverless?
Dedicated SQL pool: For predictable, high-volume analytical workloads. You pay for reserved compute (DWU); use auto-pause when idle. Serverless SQL pool: For ad-hoc querying over files in the lake (Parquet, Delta). You pay per query (data scanned). Use dedicated when you have steady, heavy reporting; use serverless when you want to query the lake without loading into a warehouse.
What is idempotency in data pipelines?
Idempotency means: re-running the pipeline for the same time window or key produces the same result and does not duplicate data. Achieve it by merge (upsert) on a business key (e.g. SaleId) or by replacing a partition (e.g. overwrite date=2024-06-15). Essential when pipelines retry after failure or when you re-run manually.
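The partition-replace variant of idempotency can be sketched in plain Python, with a dict keyed by partition standing in for partitioned lake storage (a hypothetical stand-in, not a real writer API).

```python
# Sketch of idempotency via partition replace: re-running the same window
# overwrites the whole partition instead of appending duplicates.

def write_partition(store: dict, partition: str, rows: list) -> None:
    store[partition] = rows  # replace, never append

store = {}
write_partition(store, "date=2024-06-15", [{"SaleId": 1}, {"SaleId": 2}])
write_partition(store, "date=2024-06-15", [{"SaleId": 1}, {"SaleId": 2}])  # re-run
total = sum(len(rows) for rows in store.values())
```

Appending instead of replacing would leave four rows after the re-run; replacing leaves two, which is why partition overwrite (or merge on a business key) is the safe default for retried windows.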
How do I secure a data lake on Azure?
Use Managed Identity for ADF, Synapse, and Databricks so they access ADLS without stored passwords. Use RBAC and ACLs on the storage account so only authorised identities can read/write. Use column-level security or dynamic data masking in Synapse for PII. Store secrets in Azure Key Vault and reference them from linked services. Never put credentials in code or in pipeline parameters in plain text.
What is dbt and when do I use it with Azure?
dbt (data build tool) is a framework for transformation logic as code (SQL + Jinja templates). You define models (e.g. stg_sales, fct_orders), tests (e.g. not null, unique), and documentation. It runs on top of your warehouse or Spark (e.g. Synapse, Databricks). Use dbt when your transform layer is SQL-based and you want versioned, testable transformations and incremental models. Pair with Azure by running dbt in Synapse Spark or Databricks.
How do I handle schema evolution in Delta Lake?
Add columns: Use mergeSchema: true when writing so that new columns in your DataFrame are added to the Delta table; existing data gets null for the new column. Overwrite schema: overwriteSchema: true can replace the table schema—use with care. Rename or change type: Use ALTER TABLE where supported, or create a new table and migrate. Version your schema in code (e.g. in dbt or in notebook config) and test evolution in dev before prod.
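What mergeSchema does can be illustrated with a plain-Python sketch (rows as dicts; not the Delta implementation): the table schema becomes the union of the old schema and the new batch's columns, and rows written before the new column existed read back as null (None) for it.

```python
# Sketch of mergeSchema semantics: schema = union of old schema and the new
# batch's columns; pre-existing rows get None for columns they never had.

def append_with_merge_schema(table: dict, batch: list) -> dict:
    new_cols = {c for row in batch for c in row}
    added = sorted(new_cols - set(table["schema"]))
    table["schema"] = table["schema"] + added
    table["rows"].extend(batch)
    return table

def read_rows(table: dict) -> list:
    # Project every row onto the full schema, filling missing columns with None.
    return [{c: row.get(c) for c in table["schema"]} for row in table["rows"]]

table = {"schema": ["SaleId", "Amount"], "rows": [{"SaleId": 1, "Amount": 10}]}
append_with_merge_schema(table, [{"SaleId": 2, "Amount": 20, "Region": "EU"}])
rows = read_rows(table)
```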