👋 Hi, I'm Waqas — a Software Architect and Technical Consultant specializing in .NET, Azure, microservices, and API-first system design.
I help companies build reliable, maintainable, and high-performance backend platforms that scale.
Batch vs streaming in data engineering: when to use which and watermarks.
September 12, 2024 · Waqas Ahmad
Introduction
This guidance applies when you are choosing between batch and streaming for a pipeline you design or operate; it breaks down when your constraints or context differ materially. I've applied it in real projects and refined the takeaways over time.
Choosing the wrong data-processing paradigm—batch vs streaming—leads to either unnecessary complexity and cost or missed latency and real-time needs. This article explains when to use batch and when to use streaming, common patterns, Azure services for each, and how to handle late data and exactly-once processing. For data engineers and architects, getting the choice right matters for latency, cost, and operational complexity.
System scale: Data volumes from GB to TB (or more); processing on a schedule (batch) or continuously (streaming). Applies when you’re designing or refactoring data pipelines and need to choose batch vs streaming.
Team size: Data engineers and platform owners; someone must own pipeline definition, watermarks, and late-data policy. Works when the team can reason about event time vs processing time and consistency trade-offs.
Time / budget pressure: Fits greenfield pipelines and “we need real-time” or “we need nightly reports”; breaks down when requirements are vague (e.g. “as real-time as possible” with no latency target) or when there’s no capacity to operate streaming.
Technical constraints: Azure (Data Factory, Synapse, Event Hubs, Stream Analytics) or similar; batch = scheduled jobs and storage; streaming = event ingestion and continuous processing. Assumes you can handle late data and define watermarks.
Non-goals: This article does not optimise for on-prem only or for a specific vendor’s API; it focuses on when to use batch vs streaming and common patterns.
Batch vs streaming: overview
Aspect      | Batch                     | Streaming
------------|---------------------------|------------------------------
Processing  | Scheduled (hourly, daily) | Continuous
Latency     | Minutes to hours          | Seconds to minutes
Complexity  | Lower                     | Higher
Cost        | Pay when running          | Pay continuously
Use case    | Reports, analytics, ETL   | Real-time dashboards, alerts
When to use batch
Batch processing runs on a schedule and processes data in windows (e.g., last 24 hours).
Use batch when:
Latency of hours is acceptable
Processing is compute-heavy (aggregations, ML training)
Data arrives in files (CSV, Parquet)
Cost matters more than latency
Examples:
Daily sales reports
Monthly billing runs
Data warehouse ETL
ML model training
Key concepts:
Incremental load: Only process new/changed data since last run
Watermark: Track what was already processed
Idempotency: Re-running produces same result
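The three concepts above fit together in a few lines. The sketch below is a hedged illustration, not a specific library's API: `fetch_rows_since`, `upsert`, and the watermark callbacks are assumed stand-ins for your real data access layer.

```python
from datetime import datetime, timezone

def run_incremental_batch(fetch_rows_since, upsert, load_watermark, save_watermark):
    """Incremental load: process only rows changed since the last watermark.
    Idempotent because writes are upserts; re-running processes nothing new."""
    last = load_watermark()                      # last processed timestamp
    rows = fetch_rows_since(last)                # only new/changed rows
    new_watermark = last
    for row in rows:
        upsert(row)                              # upsert, not insert
        if row["updated_at"] > new_watermark:
            new_watermark = row["updated_at"]
    save_watermark(new_watermark)                # advance only after success
    return len(rows)
```

Note that the watermark is advanced only after all rows are written, so a failed run re-reads the same rows on the next attempt and the upserts absorb the repeats.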
When to use streaming
Stream processing runs continuously and processes events as they arrive.
Use streaming when:
Latency of seconds or minutes is required
Events trigger immediate actions
Real-time dashboards or alerts needed
Data arrives as continuous stream
Examples:
Real-time fraud detection
Live dashboards
IoT sensor processing
Clickstream analytics
Key concepts:
Event time vs processing time: When event occurred vs when processed
Windowing: Group events by time (tumbling, sliding, session)
Watermarks: Handle late-arriving data
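A tumbling window over event time can be illustrated like this — a minimal sketch where events are plain epoch-second timestamps and `tumbling_window_counts` is an assumed helper, not a streaming engine's API. Because the window is derived from the event's own timestamp, a late arrival still lands in the window where it occurred, not where it was processed.

```python
from collections import Counter

def tumbling_window_counts(event_times, window_seconds):
    """Count events per fixed, non-overlapping window keyed by event time.
    Each window is identified by its start (epoch seconds)."""
    counts = Counter()
    for event_time in event_times:
        window_start = (event_time // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)
```

Sliding and session windows follow the same idea but assign each event to overlapping windows or to gaps in activity, respectively.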
Azure services for batch
Service                 | Use case
------------------------|------------------------------
Azure Data Factory      | Orchestration, ETL pipelines
Azure Synapse Pipelines | Data warehouse ETL
Azure Databricks        | Spark batch jobs, ML
Azure Batch             | Large-scale parallel compute
Azure Functions (Timer) | Scheduled lightweight jobs
Example: Data Factory pipeline
Copy data from Blob Storage
Transform with Mapping Data Flow
Load into Synapse Analytics
Trigger on schedule (daily at 2 AM)
Azure services for streaming
Service                           | Use case
----------------------------------|--------------------------------
Azure Event Hubs                  | High-throughput event ingestion
Azure Stream Analytics            | SQL-based stream processing
Azure Databricks Streaming        | Spark Structured Streaming
Azure Functions (Event Hub trigger)| Event-driven compute
Kafka on HDInsight/Confluent      | Kafka workloads
Example: Stream Analytics query
-- Tumbling window: count events per minute
SELECT
    System.Timestamp AS WindowEnd,
    COUNT(*) AS EventCount
FROM
    EventHubInput TIMESTAMP BY EventTime
GROUP BY
    TumblingWindow(minute, 1)
Common patterns
Lambda architecture
Combine batch and streaming:
Batch layer: Accurate, complete data (daily)
Speed layer: Real-time approximations
Serving layer: Query both
Use when you need both historical accuracy and real-time speed.
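The serving-layer merge can be sketched as a toy function, assuming the batch and speed views are dicts keyed by window start and `batch_high_watermark` marks how far the batch layer has covered (all names are illustrative):

```python
def serve_counts(batch_view, speed_view, batch_high_watermark):
    """Lambda serving layer: authoritative batch counts, overlaid with
    speed-layer counts only for windows the batch layer hasn't reached yet."""
    merged = dict(batch_view)                    # batch is the source of truth
    for window_start, count in speed_view.items():
        if window_start > batch_high_watermark:  # beyond batch coverage
            merged[window_start] = count         # real-time approximation
    return merged
```

When the next batch run completes, its high watermark advances and the real-time approximations for those windows are replaced by the authoritative values.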
Kappa architecture
Streaming only:
Process everything as streams
Replay from event store when needed
Simpler than Lambda; use when streaming can handle all needs.
Event sourcing + streaming
Store all events; process stream for projections and analytics.
Incremental batch
Process only new data since last watermark:
-- Watermark pattern
SELECT * FROM Orders
WHERE UpdatedAt > @LastWatermark
ORDER BY UpdatedAt
Handling late data
Events can arrive late (network delays, offline devices).
Strategies:
Watermarks: Declare “all data before X has arrived”
Allowed lateness: Accept events up to N minutes late
Reprocessing: Re-run batch for late corrections
Stream Analytics example:
-- Allow events up to 5 minutes late
SELECT * FROM EventHubInput TIMESTAMP BY EventTime
WHERE DATEDIFF(minute, EventTime, System.Timestamp) <= 5
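The same policy can be expressed as a small decision function — a hedged sketch, not Stream Analytics semantics; timestamps here are plain numbers and the names are illustrative:

```python
def classify_event(event_time, watermark, allowed_lateness):
    """Decide how to treat an event relative to the current watermark.
    Returns 'on_time', 'late' (within allowed lateness, update the
    already-emitted window), or 'drop' (route to a side output instead)."""
    if event_time >= watermark:
        return "on_time"
    if watermark - event_time <= allowed_lateness:
        return "late"
    return "drop"
```

Whatever the engine, the important part is that the three outcomes and their thresholds are an explicit, documented policy rather than an accident of defaults.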
Enterprise best practices
1. Start with batch; add streaming when needed. Batch is simpler. Add streaming only when latency requirements demand it.
2. Use incremental loads. Do not reprocess everything. Track watermarks; process only new data.
3. Make processing idempotent. Re-running should produce the same result. Use upserts, not inserts.
4. Monitor lag. In streaming, track how far behind processing is. Alert if lag grows.
5. Plan for late data. Define allowed lateness; decide how to handle late corrections.
6. Use dead-letter queues. Capture failed events for investigation and reprocessing.
7. Test with realistic data. Simulate real volumes, including late and out-of-order events.
8. Document SLAs. Define expected latency, throughput, and completeness.
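Practice 3 — idempotent, "latest wins" writes — can be sketched as follows; an in-memory dict stands in for a keyed store, and `event_id`/`version` are assumed field names, not a real SDK's schema:

```python
def upsert_event(store, event):
    """Keyed write: a redelivered event overwrites rather than duplicates
    (safe under at-least-once delivery), and a newer version of the same
    event wins while an older redelivery is ignored."""
    current = store.get(event["event_id"])
    if current is None or event["version"] >= current["version"]:
        store[event["event_id"]] = event
    return store
```

With writes shaped like this, replaying a stream from a checkpoint converges to the same state no matter how many times each event is delivered.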
Common issues
Issue        | Cause                      | Fix
-------------|----------------------------|----------------------------------------
High latency | Processing too slow        | Scale out; optimize queries
Late data    | Events arrive after window | Allow lateness; use watermarks
Duplicates   | At-least-once delivery     | Idempotent writes; deduplication
Data loss    | Processing errors          | Dead-letter queue; checkpointing
Cost spike   | Unexpected volume          | Auto-scale limits; budget alerts
Drift        | Batch and streaming differ | Reconciliation; single source of truth
Summary
Batch processes data on a schedule with higher latency but simpler architecture; streaming processes continuously with lower latency but higher complexity—choose by latency requirement and ops capacity. Picking streaming by default when batch would suffice wastes cost and complexity; skipping streaming when the product needs near-real-time leads to wrong architecture. Next, define your latency target (e.g. < 1 min), then choose batch, streaming, or a hybrid and design watermarks and late-data policy if you stream.
Position & Rationale
I use batch when latency of hours (or at least many minutes) is acceptable and when the workload is easier to reason about and debug in scheduled runs—reports, ETL, aggregations that don’t need to be real-time. I use streaming when the business needs low-latency visibility (e.g. dashboards, alerts) or event-driven actions and when we can invest in watermarks, late data, and operational complexity. I avoid defaulting to streaming because it sounds modern; batch is often simpler and cheaper. I also avoid “batch when we could stream” when the product clearly needs near-real-time; in that case I design for streaming and accept the complexity. Hybrid (batch for correctness/reconciliation, streaming for real-time) is common and I use it when both completeness and latency matter.
Trade-Offs & Failure Modes
Batch sacrifices latency and real-time feedback; you gain simpler architecture, easier debugging, and lower operational cost. Streaming sacrifices simplicity and often exact consistency; you gain low latency and continuous processing. Hybrid adds two pipelines to maintain but can deliver both. Failure modes: choosing streaming without a clear latency requirement and then overpaying in complexity; ignoring late data and watermarks and getting wrong results; treating “real-time” as a single bucket instead of defining acceptable delay; running batch and streaming without a reconciliation path so they drift.
What Most Guides Miss
Most guides list “batch vs streaming” features but don’t stress that the real decision is latency requirement and operational capacity. If the business can’t articulate why sub-minute latency matters, batch is often enough. Another gap: late data in streaming is underplayed—events arrive out of order and after the window closes; you need a policy (allow lateness, side outputs, or drop) and to document it. Reconciliation between batch and streaming (e.g. “batch is source of truth, streaming is for real-time view”) is rarely discussed but matters when you run both.
Decision Framework
If latency of hours is acceptable and the workload is reports/ETL → Batch; schedule and run; keep it simple.
If latency of seconds/minutes is required (dashboards, alerts) → Streaming; design event flow, watermarks, and late-data policy.
If both completeness and real-time matter → Hybrid: batch for authoritative aggregates, streaming for real-time view; reconcile periodically.
For late data → Define watermark and lateness policy; use side outputs or allow lateness windows and document the semantics.
For production → Monitor backpressure, lag, and pipeline health; have runbooks for replay and failure.
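The framework above can be encoded as a toy decision function; the thresholds and parameter names are illustrative, not prescriptive:

```python
def choose_paradigm(latency_target_s, needs_authoritative_history, team_can_run_streaming):
    """Toy encoding of the decision framework: latency target first,
    then operational capacity, then completeness requirements."""
    if latency_target_s >= 3600:        # hours are acceptable -> keep it simple
        return "batch"
    if not team_can_run_streaming:      # no capacity to operate streaming
        return "batch"
    if needs_authoritative_history:     # batch for truth, streaming for view
        return "hybrid"
    return "streaming"
```

The point of writing it down, even as a toy, is that every branch forces a concrete input: a latency number, an honest capacity assessment, and a stated completeness requirement.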
Key Takeaways
Batch = scheduled, higher latency, simpler; streaming = continuous, lower latency, more complex. Choose by latency requirement and ops capacity.
Don’t default to streaming; batch is often sufficient and cheaper to run.
Late data in streaming needs a clear policy (watermarks, allow lateness, side outputs).
Hybrid (batch + streaming) is valid when you need both correctness and real-time; design reconciliation.
Define “real-time” with a number (e.g. < 1 min) so you can design and measure.
When I Would Use This Again — and When I Wouldn’t
I’d use batch again for reports, ETL, and any workload where hourly or daily latency is fine and I want to minimise operational complexity. I’d use streaming again when the product needs near-real-time (e.g. alerts, live dashboards) and the team can own watermarks and late data. I wouldn’t choose streaming without a clear latency target; “as fast as possible” leads to over-engineering. I also wouldn’t run streaming and batch without a defined reconciliation or source-of-truth story—otherwise the two pipelines drift and no one knows which to trust.
Frequently Asked Questions
What is batch processing?
Batch processing runs on a schedule (hourly, daily) and processes data in windows. Use when latency of hours is acceptable.
What is streaming processing?
Stream processing runs continuously and processes events as they arrive. Use when latency of seconds or minutes is required.
When should I use batch vs streaming?
Use batch for reports, ETL, and analytics where latency of hours is fine. Use streaming for real-time dashboards, alerts, and event-driven actions.
What is a watermark?
A watermark tracks how far processing has progressed. In batch, it marks the last processed timestamp. In streaming, it declares “all events before X have arrived.”
What is event time vs processing time?
Event time is when the event occurred. Processing time is when it was processed. Use event time for accurate analytics; processing time can vary.
What is windowing?
Windowing groups events by time. Types: tumbling (fixed, non-overlapping), sliding (overlapping), session (activity-based).
How do I handle late data?
Use watermarks and allowed lateness. Accept events up to N minutes late. For corrections, re-run batch or use append-only with latest wins.
What is Lambda architecture?
Lambda combines batch (accurate) and streaming (fast). Serving layer queries both. Complex but handles both accuracy and real-time.
What is Kappa architecture?
Kappa uses streaming only. Replay from event store when needed. Simpler than Lambda; works when streaming can handle all needs.
What Azure services are for batch?
Data Factory, Synapse Pipelines, Databricks, Azure Batch. Use for ETL, data warehouse loads, and heavy compute.