👋 Hi, I'm Waqas — a Software Architect and Technical Consultant specializing in .NET, Azure, microservices, and API-first system design.
I help companies build reliable, maintainable, and high-performance backend platforms that scale.
Batch vs streaming in data engineering: when to use which and watermarks.
September 12, 2024 · Waqas Ahmad
Introduction
This guidance applies when you are choosing between batch and streaming for a pipeline you design or operate; it breaks down when your constraints or context differ materially. I've applied it in real projects and refined the takeaways over time.
Choosing the wrong data-processing paradigm—batch vs streaming—leads to either unnecessary complexity and cost or missed latency and real-time needs. This article explains when to use batch and when to use streaming, common patterns, Azure services for each, and how to handle late data and exactly-once processing. For data engineers and architects, getting the choice right matters for latency, cost, and operational complexity.
System scale: Data volumes from GB to TB (or more); processing on a schedule (batch) or continuously (streaming). Applies when you’re designing or refactoring data pipelines and need to choose batch vs streaming.
Team size: Data engineers and platform owners; someone must own pipeline definition, watermarks, and late-data policy. Works when the team can reason about event time vs processing time and consistency trade-offs.
Time / budget pressure: Fits greenfield pipelines and “we need real-time” or “we need nightly reports”; breaks down when requirements are vague (e.g. “as real-time as possible” with no latency target) or when there’s no capacity to operate streaming.
Technical constraints: Azure (Data Factory, Synapse, Event Hubs, Stream Analytics) or similar; batch = scheduled jobs and storage; streaming = event ingestion and continuous processing. Assumes you can handle late data and define watermarks.
Non-goals: This article does not optimise for on-prem only or for a specific vendor’s API; it focuses on when to use batch vs streaming and common patterns.
Batch vs streaming: overview
Aspect      | Batch                     | Streaming
------------|---------------------------|------------------------------
Processing  | Scheduled (hourly, daily) | Continuous
Latency     | Minutes to hours          | Seconds to minutes
Complexity  | Lower                     | Higher
Cost        | Pay when running          | Pay continuously
Use case    | Reports, analytics, ETL   | Real-time dashboards, alerts
When to use batch
Batch processing runs on a schedule and processes data in windows (e.g., last 24 hours).
Use batch when:
Latency of hours is acceptable
Processing is compute-heavy (aggregations, ML training)
Data arrives in files (CSV, Parquet)
Cost matters more than latency
Examples:
Daily sales reports
Monthly billing runs
Data warehouse ETL
ML model training
Key concepts:
Incremental load: Only process new/changed data since last run
Watermark: Track what was already processed
Idempotency: Re-running produces same result
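The three concepts above fit together in a few lines. The sketch below is a hedged illustration, not a specific library's API: `fetch_rows_since`, `upsert`, and the watermark callbacks are assumed stand-ins for your real data access layer.

```python
from datetime import datetime, timezone

def run_incremental_batch(fetch_rows_since, upsert, load_watermark, save_watermark):
    """Incremental load: process only rows changed since the last watermark.
    Idempotent because writes are upserts; re-running processes nothing new."""
    last = load_watermark()                      # last processed timestamp
    rows = fetch_rows_since(last)                # only new/changed rows
    new_watermark = last
    for row in rows:
        upsert(row)                              # upsert, not insert
        if row["updated_at"] > new_watermark:
            new_watermark = row["updated_at"]
    save_watermark(new_watermark)                # advance only after success
    return len(rows)
```

Note that the watermark is advanced only after all rows are written, so a failed run re-reads the same rows on the next attempt and the upserts absorb the repeats.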
When to use streaming
Stream processing runs continuously and processes events as they arrive.
Use streaming when:
Latency of seconds or minutes is required
Events trigger immediate actions
Real-time dashboards or alerts needed
Data arrives as continuous stream
Examples:
Real-time fraud detection
Live dashboards
IoT sensor processing
Clickstream analytics
Key concepts:
Event time vs processing time: When event occurred vs when processed
Windowing: Group events by time (tumbling, sliding, session)
Watermarks: Handle late-arriving data
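A tumbling window over event time can be illustrated like this — a minimal sketch where events are plain epoch-second timestamps and `tumbling_window_counts` is an assumed helper, not a streaming engine's API. Because the window is derived from the event's own timestamp, a late arrival still lands in the window where it occurred, not where it was processed.

```python
from collections import Counter

def tumbling_window_counts(event_times, window_seconds):
    """Count events per fixed, non-overlapping window keyed by event time.
    Each window is identified by its start (epoch seconds)."""
    counts = Counter()
    for event_time in event_times:
        window_start = (event_time // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)
```

Sliding and session windows follow the same idea but assign each event to overlapping windows or to gaps in activity, respectively.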
Azure services for batch
Service                 | Use case
------------------------|------------------------------
Azure Data Factory      | Orchestration, ETL pipelines
Azure Synapse Pipelines | Data warehouse ETL
Azure Databricks        | Spark batch jobs, ML
Azure Batch             | Large-scale parallel compute
Azure Functions (Timer) | Scheduled lightweight jobs
Example: Data Factory pipeline
Copy data from Blob Storage
Transform with Mapping Data Flow
Load into Synapse Analytics
Trigger on schedule (daily at 2 AM)
Azure services for streaming
Service                           | Use case
----------------------------------|--------------------------------
Azure Event Hubs                  | High-throughput event ingestion
Azure Stream Analytics            | SQL-based stream processing
Azure Databricks Streaming        | Spark Structured Streaming
Azure Functions (Event Hub trigger)| Event-driven compute
Kafka on HDInsight/Confluent      | Kafka workloads
Example: Stream Analytics query
-- Tumbling window: count events per minute
SELECT
    System.Timestamp AS WindowEnd,
    COUNT(*) AS EventCount
FROM
    EventHubInput TIMESTAMP BY EventTime
GROUP BY
    TumblingWindow(minute, 1)
Common patterns
Lambda architecture
Combine batch and streaming:
Batch layer: Accurate, complete data (daily)
Speed layer: Real-time approximations
Serving layer: Query both
Use when you need both historical accuracy and real-time speed.
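The serving-layer merge can be sketched as a toy function, assuming the batch and speed views are dicts keyed by window start and `batch_high_watermark` marks how far the batch layer has covered (all names are illustrative):

```python
def serve_counts(batch_view, speed_view, batch_high_watermark):
    """Lambda serving layer: authoritative batch counts, overlaid with
    speed-layer counts only for windows the batch layer hasn't reached yet."""
    merged = dict(batch_view)                    # batch is the source of truth
    for window_start, count in speed_view.items():
        if window_start > batch_high_watermark:  # beyond batch coverage
            merged[window_start] = count         # real-time approximation
    return merged
```

When the next batch run completes, its high watermark advances and the real-time approximations for those windows are replaced by the authoritative values.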
Kappa architecture
Streaming only:
Process everything as streams
Replay from event store when needed
Simpler than Lambda; use when streaming can handle all needs.
Event sourcing + streaming
Store all events; process stream for projections and analytics.
Incremental batch
Process only new data since last watermark:
-- Watermark pattern
SELECT * FROM Orders
WHERE UpdatedAt > @LastWatermark
ORDER BY UpdatedAt
Handling late data
Events can arrive late (network delays, offline devices).
Strategies:
Watermarks: Declare “all data before X has arrived”
Allowed lateness: Accept events up to N minutes late
Reprocessing: Re-run batch for late corrections
Stream Analytics example:
-- Allow events up to 5 minutes late
SELECT * FROM EventHubInput TIMESTAMP BY EventTime
WHERE DATEDIFF(minute, EventTime, System.Timestamp) <= 5
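The same policy can be expressed as a small decision function — a hedged sketch, not Stream Analytics semantics; timestamps here are plain numbers and the names are illustrative:

```python
def classify_event(event_time, watermark, allowed_lateness):
    """Decide how to treat an event relative to the current watermark.
    Returns 'on_time', 'late' (within allowed lateness, update the
    already-emitted window), or 'drop' (route to a side output instead)."""
    if event_time >= watermark:
        return "on_time"
    if watermark - event_time <= allowed_lateness:
        return "late"
    return "drop"
```

Whatever the engine, the important part is that the three outcomes and their thresholds are an explicit, documented policy rather than an accident of defaults.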
Enterprise best practices
1. Start with batch; add streaming when needed. Batch is simpler. Add streaming only when latency requirements demand it.
2. Use incremental loads. Do not reprocess everything. Track watermarks; process only new data.
3. Make processing idempotent. Re-running should produce the same result. Use upserts, not inserts.
4. Monitor lag. In streaming, track how far behind processing is. Alert if lag grows.
5. Plan for late data. Define allowed lateness; decide how to handle late corrections.
6. Use dead-letter queues. Capture failed events for investigation and reprocessing.
7. Test with realistic data. Simulate real volumes, including late and out-of-order events.
8. Document SLAs. Define expected latency, throughput, and completeness.
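Practice 3 — idempotent, "latest wins" writes — can be sketched as follows; an in-memory dict stands in for a keyed store, and `event_id`/`version` are assumed field names, not a real SDK's schema:

```python
def upsert_event(store, event):
    """Keyed write: a redelivered event overwrites rather than duplicates
    (safe under at-least-once delivery), and a newer version of the same
    event wins while an older redelivery is ignored."""
    current = store.get(event["event_id"])
    if current is None or event["version"] >= current["version"]:
        store[event["event_id"]] = event
    return store
```

With writes shaped like this, replaying a stream from a checkpoint converges to the same state no matter how many times each event is delivered.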
Common issues
Issue        | Cause                      | Fix
-------------|----------------------------|----------------------------------------
High latency | Processing too slow        | Scale out; optimize queries
Late data    | Events arrive after window | Allow lateness; use watermarks
Duplicates   | At-least-once delivery     | Idempotent writes; deduplication
Data loss    | Processing errors          | Dead-letter queue; checkpointing
Cost spike   | Unexpected volume          | Auto-scale limits; budget alerts
Drift        | Batch and streaming differ | Reconciliation; single source of truth
Summary
Batch processes data on a schedule with higher latency but simpler architecture; streaming processes continuously with lower latency but higher complexity—choose by latency requirement and ops capacity. Picking streaming by default when batch would suffice wastes cost and complexity; skipping streaming when the product needs near-real-time leads to wrong architecture. Next, define your latency target (e.g. < 1 min), then choose batch, streaming, or a hybrid and design watermarks and late-data policy if you stream.
Position & Rationale
I use batch when latency of hours (or at least many minutes) is acceptable and when the workload is easier to reason about and debug in scheduled runs—reports, ETL, aggregations that don’t need to be real-time. I use streaming when the business needs low-latency visibility (e.g. dashboards, alerts) or event-driven actions and when we can invest in watermarks, late data, and operational complexity. I avoid defaulting to streaming because it sounds modern; batch is often simpler and cheaper. I also avoid “batch when we could stream” when the product clearly needs near-real-time; in that case I design for streaming and accept the complexity. Hybrid (batch for correctness/reconciliation, streaming for real-time) is common and I use it when both completeness and latency matter.
Trade-Offs & Failure Modes
Batch sacrifices latency and real-time feedback; you gain simpler architecture, easier debugging, and lower operational cost. Streaming sacrifices simplicity and often exact consistency; you gain low latency and continuous processing. Hybrid adds two pipelines to maintain but can deliver both. Failure modes: choosing streaming without a clear latency requirement and then overpaying in complexity; ignoring late data and watermarks and getting wrong results; treating “real-time” as a single bucket instead of defining acceptable delay; running batch and streaming without a reconciliation path so they drift.
What Most Guides Miss
Most guides list “batch vs streaming” features but don’t stress that the real decision is latency requirement and operational capacity. If the business can’t articulate why sub-minute latency matters, batch is often enough. Another gap: late data in streaming is underplayed—events arrive out of order and after the window closes; you need a policy (allow lateness, side outputs, or drop) and to document it. Reconciliation between batch and streaming (e.g. “batch is source of truth, streaming is for real-time view”) is rarely discussed but matters when you run both.
Decision Framework
If latency of hours is acceptable and the workload is reports/ETL → Batch; schedule and run; keep it simple.
If latency of seconds/minutes is required (dashboards, alerts) → Streaming; design event flow, watermarks, and late-data policy.
If both completeness and real-time matter → Hybrid: batch for authoritative aggregates, streaming for real-time view; reconcile periodically.
For late data → Define watermark and lateness policy; use side outputs or allow lateness windows and document the semantics.
For production → Monitor backpressure, lag, and pipeline health; have runbooks for replay and failure.
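The framework above can be encoded as a toy decision function; the thresholds and parameter names are illustrative, not prescriptive:

```python
def choose_paradigm(latency_target_s, needs_authoritative_history, team_can_run_streaming):
    """Toy encoding of the decision framework: latency target first,
    then operational capacity, then completeness requirements."""
    if latency_target_s >= 3600:        # hours are acceptable -> keep it simple
        return "batch"
    if not team_can_run_streaming:      # no capacity to operate streaming
        return "batch"
    if needs_authoritative_history:     # batch for truth, streaming for view
        return "hybrid"
    return "streaming"
```

The point of writing it down, even as a toy, is that every branch forces a concrete input: a latency number, an honest capacity assessment, and a stated completeness requirement.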
Key Takeaways
Batch = scheduled, higher latency, simpler; streaming = continuous, lower latency, more complex. Choose by latency requirement and ops capacity.
Don’t default to streaming; batch is often sufficient and cheaper to run.
Late data in streaming needs a clear policy (watermarks, allow lateness, side outputs).
Hybrid (batch + streaming) is valid when you need both correctness and real-time; design reconciliation.
Define “real-time” with a number (e.g. < 1 min) so you can design and measure.
When I Would Use This Again — and When I Wouldn’t
I’d use batch again for reports, ETL, and any workload where hourly or daily latency is fine and I want to minimise operational complexity. I’d use streaming again when the product needs near-real-time (e.g. alerts, live dashboards) and the team can own watermarks and late data. I wouldn’t choose streaming without a clear latency target; “as fast as possible” leads to over-engineering. I also wouldn’t run streaming and batch without a defined reconciliation or source-of-truth story—otherwise the two pipelines drift and no one knows which to trust.
Frequently Asked Questions
What is batch processing?
Batch processing runs on a schedule (hourly, daily) and processes data in windows. Use when latency of hours is acceptable.
What is streaming processing?
Stream processing runs continuously and processes events as they arrive. Use when latency of seconds or minutes is required.
When should I use batch vs streaming?
Use batch for reports, ETL, and analytics where latency of hours is fine. Use streaming for real-time dashboards, alerts, and event-driven actions.
What is a watermark?
A watermark tracks how far processing has progressed. In batch, it marks the last processed timestamp. In streaming, it declares “all events before X have arrived.”
What is event time vs processing time?
Event time is when the event occurred. Processing time is when it was processed. Use event time for accurate analytics; processing time can vary.
What is windowing?
Windowing groups events by time. Types: tumbling (fixed, non-overlapping), sliding (overlapping), session (activity-based).
How do I handle late data?
Use watermarks and allowed lateness. Accept events up to N minutes late. For corrections, re-run batch or use append-only with latest wins.
What is Lambda architecture?
Lambda combines batch (accurate) and streaming (fast). Serving layer queries both. Complex but handles both accuracy and real-time.
What is Kappa architecture?
Kappa uses streaming only. Replay from event store when needed. Simpler than Lambda; works when streaming can handle all needs.
What Azure services are for batch?
Data Factory, Synapse Pipelines, Databricks, Azure Batch. Use for ETL, data warehouse loads, and heavy compute.