Waqas Ahmad — Software Architect & Technical Consultant - Available USA, Europe, Global

👋 Hi, I'm Waqas, a Software Architect and Technical Consultant specializing in .NET, Azure, microservices, and API-first system design.
I help companies build reliable, maintainable, and high-performance backend platforms that scale.

Experienced across engineering ecosystems shaped by Microsoft, the Cloud Native Computing Foundation, and the Apache Software Foundation.

Available for remote consulting (USA, Europe, Global) — flexible across EST, PST, GMT & CET.

Event-Driven Architecture on Azure

Event Grid, Service Bus, Event Hubs: when to use which, failure, and replay.

Introduction

This guidance applies when you are choosing between Azure's messaging services or hardening event handlers for production; it breaks down when your constraints differ from the ones listed under Decision Context. I’ve applied it in real projects and refined the takeaways over time (as of 2026).

Event-driven systems on Azure need the right service—Service Bus, Event Grid, or Event Hubs—and design for at-least-once delivery, idempotency, and dead-letter handling, or production reliability suffers. This article is a full guide to event-driven architecture on Azure: what each service is and when to use it, how to implement producers and consumers, and how to harden for failure and recovery. For architects and tech leads, picking the wrong service or ignoring idempotency and DLQ leads to duplicate processing and hard-to-debug failures; the patterns below avoid those pitfalls.

If you are new to event-driven architecture on Azure, start with Topics covered and Event-driven on Azure at a glance.

For a deeper overview of this topic, explore the full Event-Driven Architecture guide.

Decision Context

  • System scale: Varies by context; the approach in this article applies to the scales and scenarios described in the body.
  • Team size: Typically small to medium teams; ownership and clarity matter more than headcount.
  • Time / budget pressure: Applicable under delivery pressure; I’ve used it in both greenfield and incremental refactors.
  • Technical constraints: .NET and related stack where relevant; constraints are noted in the article where they affect the approach.
  • Non-goals: This article does not optimize for every possible scenario; boundaries are stated where they matter.

What is event-driven architecture and why it matters

Event-driven architecture is an approach where components of a system communicate by producing and consuming events—discrete things that happened (e.g. “OrderPlaced”, “BlobCreated”, “DeviceTelemetry”). Producers publish events without knowing who will consume them; consumers subscribe and react. This decouples services: the order service does not call the inventory or notification service directly; it publishes OrderPlaced and any number of subscribers can react. Benefits include scalability (add consumers without changing producers), resilience (if a consumer is down, events can be queued and processed later where the broker supports it), and flexibility (new subscribers can be added without changing existing ones).

On Azure, the three main services are Service Bus (reliable queues and topics for your own services), Event Grid (high-throughput event routing and Azure resource events with fan-out), and Event Hubs (high-throughput ingestion and stream processing). Each has a different throughput, delivery guarantee, and use case. Choosing the wrong one leads to cost, complexity, or reliability issues. This article explains when to use which and how to design for at-least-once delivery, idempotency, and replay so that your event-driven systems are production-ready.


Event-driven on Azure at a glance

Service | What it is | When to use
Azure Service Bus | Queues (point-to-point) and topics (publish-subscribe with filters). Durable, ordered (with sessions), dead-letter, deferral. | Reliable messaging between your services: order processing, workflow steps, any scenario where you cannot afford to lose a message and order may matter. Not for millions of events/sec.
Azure Event Grid | Event routing: push-based, high throughput, deep Azure integration (blob created, resource changes, custom topics). At-least-once; subscribers must be idempotent. | Reacting to events with fan-out to many subscribers; Azure resource events; notifications; low-latency event-driven workflows. Does not queue for long-term processing.
Azure Event Hubs | High-throughput ingestion: millions of events per second; consumer groups (e.g. hot path vs cold analytics). | Ingesting telemetry, logs, and streams; processing with Stream Analytics, Spark, or your own consumers. Not a general-purpose queue.

Azure Service Bus: queues and topics

Azure Service Bus provides queues (point-to-point: one producer, one consumer pool) and topics (publish-subscribe: one producer, many subscriptions with optional filters). Messages are durable, ordered within a session if you use sessions, and support dead-lettering and deferral. Use Service Bus when you need reliable messaging between your own services—order processing, workflow steps, or any scenario where you cannot afford to lose a message and where processing order may matter. It is not designed for millions of events per second; it is designed for consistency and durability.

Step 1: Send a message to a queue

// Producer – send to Service Bus queue
using System;
using System.Text.Json;
using Azure.Messaging.ServiceBus;

// One key per business event; reused as MessageId so consumers can deduplicate
var idempotencyId = Guid.NewGuid().ToString();

await using var client = new ServiceBusClient(connectionString);
await using var sender = client.CreateSender("order-queue");

var message = new ServiceBusMessage(JsonSerializer.Serialize(new OrderPlacedEvent
{
    OrderId = orderId,
    IdempotencyId = idempotencyId,
    Timestamp = DateTime.UtcNow
}))
{
    MessageId = idempotencyId,
    CorrelationId = correlationId
};
await sender.SendMessageAsync(message);
// client and sender are disposed at the end of the enclosing scope

What this does: Creates a Service Bus client and sender for the queue order-queue. Serializes an OrderPlacedEvent (including an idempotency ID and correlation ID) and sends it. MessageId and CorrelationId are set so that consumers can deduplicate and trace the request.

Step 2: Receive and process messages

// Consumer – receive from Service Bus queue
using System;
using System.Text.Json;
using System.Threading.Tasks;
using Azure.Messaging.ServiceBus;

var client = new ServiceBusClient(connectionString);
var processor = client.CreateProcessor("order-queue", new ServiceBusProcessorOptions
{
    MaxConcurrentCalls = 4,
    AutoCompleteMessages = false
});

processor.ProcessMessageAsync += async args =>
{
    try
    {
        var evt = JsonSerializer.Deserialize<OrderPlacedEvent>(args.Message.Body.ToString());
        await HandleOrderPlacedAsync(evt, args.CancellationToken);
        await args.CompleteMessageAsync(args.Message);
    }
    catch (JsonException ex)
    {
        // A malformed payload will never succeed; dead-letter it instead of retrying
        await args.DeadLetterMessageAsync(args.Message, "InvalidPayload", ex.Message);
    }
    catch (Exception)
    {
        await args.AbandonMessageAsync(args.Message); // retry, or eventually DLQ
    }
};
processor.ProcessErrorAsync += args =>
{
    Console.Error.WriteLine(args.Exception); // replace with your logger
    return Task.CompletedTask;
};
await processor.StartProcessingAsync();

What this does: Creates a processor that receives messages from order-queue. For each message, it deserializes the event, calls HandleOrderPlacedAsync, and completes the message on success or abandons it so that Service Bus can retry (and eventually move to the dead-letter queue if max delivery count is exceeded). AutoCompleteMessages = false so that you complete only after successful processing.

Step 3: Configure queue and dead-letter in Azure

Create the queue in Azure (Portal, ARM, or Bicep) with MaxDeliveryCount (e.g. 10) so that failed messages move to the dead-letter queue instead of being dropped. Enable sessions if you need FIFO per session key.
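If you prefer to create the queue from code rather than from the Portal, ARM, or Bicep, the same settings can be expressed with the administration client from the Azure.Messaging.ServiceBus package. This is a minimal sketch, not the article's deployment method: the QueueSetup class and its method name are hypothetical, and the connection string is assumed to carry Manage rights on the namespace.

```csharp
using System.Threading.Tasks;
using Azure.Messaging.ServiceBus.Administration;

// Hypothetical helper: creates "order-queue" if it does not exist yet.
public static class QueueSetup
{
    public static async Task EnsureOrderQueueAsync(string connectionString, bool requiresSession = false)
    {
        var admin = new ServiceBusAdministrationClient(connectionString);

        // Nothing to do when the queue is already there.
        if ((await admin.QueueExistsAsync("order-queue")).Value)
            return;

        var options = new CreateQueueOptions("order-queue")
        {
            // After 10 failed deliveries the message moves to the dead-letter subqueue.
            MaxDeliveryCount = 10,
            // Sessions give FIFO per session key; this cannot be changed after creation.
            RequiresSession = requiresSession,
            // Expired messages are dead-lettered rather than silently removed.
            DeadLetteringOnMessageExpiration = true
        };
        await admin.CreateQueueAsync(options);
    }
}
```

Infrastructure-as-code (ARM or Bicep) is usually the better home for these settings in production; a code path like this is mostly useful for local environments and integration tests.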

How this fits together: Producers send messages with correlation ID and idempotency key; consumers process them and complete or abandon. Failed messages are retried and eventually land in the DLQ for inspection and resubmit. Use Application Insights and correlation ID to trace a single business transaction across producers and consumers.


Azure Event Grid: event routing and fan-out

Azure Event Grid is an event routing service: high throughput, push-based delivery, and deep integration with Azure (e.g. blob created, resource group changes, custom topics). Use Event Grid when you are reacting to events and need fan-out to many subscribers with low latency. It is at-least-once delivery; subscribers must be idempotent. Event Grid does not queue for long-term processing—it pushes and retries with backoff. Ideal for notifications, integrations, and event-driven workflows that react quickly.

Publish to a custom topic

// Publisher – Event Grid custom topic
using System;
using Azure; // AzureKeyCredential
using Azure.Messaging.EventGrid;

var client = new EventGridPublisherClient(
    new Uri(topicEndpoint),
    new AzureKeyCredential(topicKey));

var evt = new EventGridEvent(
    subject: "orders/placed",
    eventType: "OrderPlaced",
    dataVersion: "1.0",
    data: BinaryData.FromObjectAsJson(new OrderPlacedEvent { OrderId = orderId }))
{
    Id = idempotencyId, // stable key so subscribers can deduplicate
    EventTime = DateTimeOffset.UtcNow
};
await client.SendEventAsync(evt);

What this does: Sends a custom event to an Event Grid topic. Subscribers (webhooks, Azure Functions, Logic Apps, etc.) receive the event. Use Id (idempotency) and subject / eventType for filtering and routing.

Subscribe with Azure Functions

// Function – Event Grid trigger
[FunctionName("OnOrderPlaced")]
public static async Task Run(
    [EventGridTrigger] EventGridEvent evt,
    ILogger log)
{
    log.LogInformation("Event: {Subject} {Id}", evt.Subject, evt.Id);
    var data = JsonSerializer.Deserialize<OrderPlacedEvent>(evt.Data.ToString());
    await HandleOrderPlacedAsync(data);
}

What this does: An Azure Function triggered by Event Grid receives the event. The function must be idempotent (same event may be delivered more than once). Use evt.Id as idempotency key.

How this fits together: Publishers send events to a topic; Event Grid delivers to all subscriptions (webhook, Function, etc.). Subscribers process and return 2xx; Event Grid retries on failure. No long-term queue—delivery is push-based with retries.


Azure Event Hubs: ingestion and stream processing

Azure Event Hubs is for high-throughput ingestion: telemetry, logs, and stream processing. It accepts millions of events per second; you consume via consumer groups (e.g. one group for hot path, another for cold analytics). Use Event Hubs when you are ingesting large volumes of data and processing with Stream Analytics, Spark, or your own consumers. It is not a general-purpose queue; use Service Bus for that.

Send events to Event Hubs

// Producer – Event Hubs
using System.Text.Json;
using Azure.Messaging.EventHubs;
using Azure.Messaging.EventHubs.Producer;

await using var producer = new EventHubProducerClient(connectionString, eventHubName);
using EventDataBatch batch = await producer.CreateBatchAsync();
// TryAdd returns false when the event does not fit in the batch
if (!batch.TryAdd(new EventData(JsonSerializer.Serialize(telemetryEvent))))
    throw new InvalidOperationException("Event is too large for an empty batch.");
await producer.SendAsync(batch);

What this does: Creates an Event Hubs producer and sends a batch of events. Event Hubs is optimized for throughput; use batching and multiple partitions for scale.

Consume with a processor

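// Consumer – Event Hubs processor
using Azure.Messaging.EventHubs;
using Azure.Messaging.EventHubs.Processor;

var processor = new EventProcessorClient(
    blobContainerClient, // checkpoint store
    consumerGroup,
    connectionString,
    eventHubName);
processor.ProcessEventAsync += async args =>
{
    if (!args.HasEvent) return;
    // EventBody is BinaryData; Body.ToString() would not return the payload text
    var body = args.Data.EventBody.ToString();
    await ProcessTelemetryAsync(body);
    await args.UpdateCheckpointAsync(); // at high volume, checkpoint every N events instead
};
// A ProcessErrorAsync handler must be registered before StartProcessingAsync
processor.ProcessErrorAsync += args => { /* log args.Exception */ return Task.CompletedTask; };
await processor.StartProcessingAsync();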

What this does: EventProcessorClient reads from a consumer group and checkpoints progress in blob storage. Each partition is processed by one processor instance; scale out by adding instances. Use consumer groups to have multiple independent consumers (e.g. real-time alerts vs batch analytics).

How this fits together: Producers send high-volume events to Event Hubs; consumers in one or more consumer groups process them. Stream Analytics, Spark, or custom processors read from a consumer group. Retention allows replay by resetting consumer offset. Do not use Event Hubs as a queue for transactional workflows—use Service Bus for that.
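Replay from retention can be sketched with EventHubConsumerClient, which lets you read a partition from a chosen point in time without touching the processor's checkpoints. This is an illustrative sketch only: the EventHubReplay class, its parameters, and the handle callback are hypothetical names, and it reads the default consumer group where a dedicated replay group would often be the better choice.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Azure.Messaging.EventHubs.Consumer;

// Hypothetical helper: re-reads one partition starting at a point in time.
public static class EventHubReplay
{
    public static async Task ReplayPartitionAsync(
        string connectionString, string eventHubName, string partitionId,
        DateTimeOffset from, Func<string, Task> handle, CancellationToken ct)
    {
        await using var consumer = new EventHubConsumerClient(
            EventHubConsumerClient.DefaultConsumerGroupName, connectionString, eventHubName);

        // Start at the first event enqueued at or after the given time.
        var start = EventPosition.FromEnqueuedTime(from);
        try
        {
            await foreach (PartitionEvent pe in
                consumer.ReadEventsFromPartitionAsync(partitionId, start, ct))
            {
                // Handlers must be idempotent: these events were processed before.
                await handle(pe.Data.EventBody.ToString());
            }
        }
        catch (OperationCanceledException)
        {
            // Cancelling the token is how the caller ends the replay.
        }
    }
}
```

The loop keeps waiting for new events once it reaches the end of the partition, so the caller decides when to stop, for example with a CancellationTokenSource that has a timeout. Only events still inside the retention window can be replayed this way.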


Designing for at-least-once delivery and idempotency

Service Bus (in peek-lock mode), Event Grid, and Event Hubs all deliver at least once. Your handler may receive the same message or event more than once (e.g. after a retry or replay), so your processing logic must be idempotent: processing the same message twice should not double-charge a customer or create duplicate records. Use an idempotency key (stored in a cache or database) to detect and skip duplicates, and a correlation ID to trace them.

Idempotent handler pattern

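// Idempotent handler – check, process, then record
public async Task HandleOrderPlacedAsync(OrderPlacedEvent evt, CancellationToken ct)
{
    if (await _processedIds.ExistsAsync(evt.IdempotencyId, ct))
        return; // already processed

    // CreateOrderAsync should itself be an upsert by OrderId, so a crash
    // before the key is recorded cannot create a duplicate on retry
    await _orderService.CreateOrderAsync(evt.OrderId, evt.Payload, ct);

    // Record the key only after the side effect succeeded
    await _processedIds.AddAsync(evt.IdempotencyId, ct);
}

What this does: Before processing, check whether IdempotencyId has already been processed (e.g. in a cache or table). If yes, return without side effects. If no, perform the operation and then record the key; recording the key first would cause a failed operation to be skipped on retry. Where the store allows it, write the key and the business change in one transaction. For updates, use upsert (insert or update by key) so that replay overwrites instead of duplicating.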

How this fits together: Every event carries an idempotency key; handlers record processed keys and skip duplicates. Combined with correlation ID in headers and logs, you can trace a business transaction end-to-end and avoid duplicate side effects on retry or replay.
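A separate check followed by a separate write leaves a small window in which two concurrent deliveries of the same message both pass the check. One way to close it is to claim the key atomically and release the claim if the side effect fails. The sketch below shows that idea with an in-memory store; InMemoryProcessedIds and IdempotentRunner are hypothetical types, and a real system would need a durable store (for example a table with a unique constraint on the key).

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// In-memory stand-in for a durable processed-key store (hypothetical type).
public sealed class InMemoryProcessedIds
{
    private readonly ConcurrentDictionary<string, bool> _keys = new();

    // Atomically claims the key; returns false if it was already claimed.
    public bool TryClaim(string idempotencyId) => _keys.TryAdd(idempotencyId, true);

    // Releases a claim so that a retry can process the message again.
    public void Release(string idempotencyId) => _keys.TryRemove(idempotencyId, out _);
}

public sealed class IdempotentRunner
{
    private readonly InMemoryProcessedIds _ids;
    public IdempotentRunner(InMemoryProcessedIds ids) => _ids = ids;

    // Runs the side effect at most once per key; returns true if it ran.
    public async Task<bool> RunOnceAsync(string idempotencyId, Func<Task> sideEffect)
    {
        if (!_ids.TryClaim(idempotencyId))
            return false; // duplicate delivery: skip

        try
        {
            await sideEffect();
            return true;
        }
        catch
        {
            _ids.Release(idempotencyId); // let the broker's retry try again
            throw;
        }
    }
}
```

A handler would wrap its work as `await runner.RunOnceAsync(evt.IdempotencyId, () => CreateOrderAsync(...))` and complete the message whether the call returns true or false, since false only means the work was already done.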


Dead-letter queues and monitoring

Configure dead-letter queues (DLQ) on Service Bus queues and subscriptions so that messages that fail after max delivery count are moved there instead of dropped. Process the DLQ periodically: fix the bug or data, then resubmit to the main queue or abandon. Use Application Insights or your logging stack to alert on DLQ depth and processing failures. Correlate requests with trace IDs across services so you can follow a single business transaction through the event pipeline.

  • Service Bus: Set MaxDeliveryCount on the queue or subscription; messages that exceed it move to the dead-letter subqueue. Read the DLQ with a ServiceBusReceiver created with SubQueue = SubQueue.DeadLetter, then resubmit or discard each message.
  • Event Grid: Subscribers return 2xx on success; Event Grid retries with backoff on failure. Dead-lettering is off by default: unless you configure a dead-letter destination (a storage blob container) on the event subscription, events that exhaust their retries are dropped. Keep your endpoint idempotent and log failures for manual or automated retry.
  • Event Hubs: No DLQ; consumers checkpoint progress. Failed processing is handled in your code (retry, skip, or write to a separate store for later replay).
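Draining a Service Bus dead-letter queue after a fix has been deployed could look like the sketch below. It is an assumption-laden example rather than a prescribed tool: the DeadLetterResubmitter class is hypothetical, it resubmits everything it receives without inspecting the failure, and it logs to the console where you would use your own logger.

```csharp
using System;
using System.Threading.Tasks;
using Azure.Messaging.ServiceBus;

// Hypothetical helper: moves dead-lettered messages back to the main queue.
public static class DeadLetterResubmitter
{
    // Returns the number of messages that were resubmitted.
    public static async Task<int> ResubmitAsync(
        ServiceBusClient client, string queueName, int maxMessages = 50)
    {
        await using var dlqReceiver = client.CreateReceiver(queueName,
            new ServiceBusReceiverOptions { SubQueue = SubQueue.DeadLetter });
        await using var sender = client.CreateSender(queueName);

        var deadLettered = await dlqReceiver.ReceiveMessagesAsync(
            maxMessages, TimeSpan.FromSeconds(5));

        foreach (var dead in deadLettered)
        {
            Console.WriteLine($"{dead.MessageId}: {dead.DeadLetterReason}");

            // Copy the body and properties of the received message into a new one.
            var copy = new ServiceBusMessage(dead);
            await sender.SendMessageAsync(copy);

            // Remove from the DLQ only after the resend succeeded.
            await dlqReceiver.CompleteMessageAsync(dead);
        }
        return deadLettered.Count;
    }
}
```

If the resend succeeds but the complete call fails, the message is resubmitted again on the next run, which is one more reason the handlers need to be idempotent. If the queue has duplicate detection enabled, a copy that keeps the original MessageId may be discarded inside the detection window, so check that setting before relying on this approach.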


Best practices and common issues

Do: Match the service to the pattern—Service Bus for queues/topics, Event Grid for event routing, Event Hubs for ingestion and streaming. Design for idempotency (idempotency key, skip or upsert). Propagate correlation ID in message headers and logs. Configure DLQ on Service Bus and monitor DLQ depth. Use async (Service Bus, Event Grid) for cross-service communication; use sync only at the edge or within a single service.

Don’t: Use Service Bus for ingestion at millions of events per second, or Event Hubs as a general queue; both lead to cost and complexity. Don’t ship non-idempotent handlers (they cause duplicate processing), ignore the DLQ (failed messages are never fixed), or omit correlation IDs (debugging and tracing become impractical).

Common issues:

  • Using the wrong service: Service Bus for millions/sec ingestion, or Event Hubs as a general queue. Fix: Use Service Bus for reliable queues/topics; Event Grid for event routing; Event Hubs for ingestion and stream processing.
  • Non-idempotent handlers: Processing the same message twice double-charges or duplicates records. Fix: Use idempotency key and store processed keys; skip or upsert on replay.
  • Ignoring dead-letter queues: Failed messages sit in the DLQ; no one processes them. Fix: Monitor DLQ depth; process (fix and resubmit or abandon) periodically; alert when depth exceeds a threshold.
  • Missing correlation IDs: Cannot trace a business transaction end-to-end. Fix: Propagate correlation ID in message headers and logs; use Application Insights or OpenTelemetry with distributed tracing.
  • Over-relying on sync: Mixing sync REST between services with events increases coupling and latency. Fix: Prefer async (Service Bus, Event Grid) for cross-service communication.
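The correlation ID fix from the list above can be sketched for a handler that publishes a follow-up message. The Correlation class, the stepName parameter, and the MessageId format are hypothetical choices for illustration; the point is that the follow-up reuses the incoming correlation ID and gets a deterministic MessageId, so a redelivered input produces the same output ID.

```csharp
using Azure.Messaging.ServiceBus;

// Hypothetical helper: builds an outgoing message that continues the same trace.
public static class Correlation
{
    public static ServiceBusMessage CreateFollowUp(
        ServiceBusReceivedMessage incoming, string stepName, string jsonBody)
    {
        // Reuse the upstream ID; fall back to the MessageId if the producer set none.
        var correlationId = string.IsNullOrEmpty(incoming.CorrelationId)
            ? incoming.MessageId
            : incoming.CorrelationId;

        return new ServiceBusMessage(jsonBody)
        {
            CorrelationId = correlationId,
            // Deterministic per input and step, so downstream consumers can deduplicate.
            MessageId = $"{incoming.MessageId}:{stepName}"
        };
    }
}
```

Inside a processor callback this would be used as `Correlation.CreateFollowUp(args.Message, "reserve-stock", json)`, with the same correlation ID also added to the logging scope so that log lines and messages can be joined.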

Summary

Event-driven architecture on Azure uses Service Bus (reliable queues/topics), Event Grid (event routing, fan-out), and Event Hubs (ingestion, stream processing)—choose the right service for the pattern. Wrong service choice or non-idempotent handlers cause duplicate work and production incidents; designing for at-least-once, idempotency keys, and DLQ monitoring keeps systems recoverable. Next, map your use case to Service Bus (workflows, ordered work), Event Grid (Azure events, fan-out), or Event Hubs (telemetry, streams), then add idempotency and correlation to every handler.

  • Service Bus: Reliable, ordered messaging between your services; queues and topics; dead-letter; use for order processing, workflows.
  • Event Grid: High-throughput event routing; Azure resource events; fan-out to many subscribers; push-based; use for notifications, integrations, low-latency reactions.
  • Event Hubs: High-throughput ingestion; consumer groups; use for telemetry, logs, stream processing (Stream Analytics, Spark).
  • Design for at-least-once: Idempotent handlers (idempotency key, skip or upsert); correlation ID for tracing; dead-letter queues and monitoring so that failures are visible and recoverable. Avoid wrong service choice, non-idempotent handlers, and ignoring the DLQ.

Position & Rationale

I choose Service Bus when I need reliable, ordered messaging between our own services—order processing, workflow steps—where we cannot afford to lose a message. I use Event Grid when we are reacting to events with fan-out (many subscribers) or integrating with Azure resource events; I do not use it as a long-term queue. I use Event Hubs only for high-throughput ingestion (telemetry, logs) and stream processing; I never use it as a general-purpose queue. I reject using Event Hubs for transactional workflows—throughput and partitioning are wrong for that. I insist on idempotency keys and correlation IDs on every event and message; without them, debugging and duplicate handling are unsustainable. I avoid fire-and-forget publishing without at-least-once semantics and a clear DLQ story for Service Bus.


Trade-Offs & Failure Modes

  • What this sacrifices: Event-driven systems trade synchronous consistency for eventual consistency; you give up simple request–response debugging and must design for replay and duplicates.
  • Where it degrades: Wrong service choice (e.g. Event Hubs as a queue) leads to cost and complexity. Non-idempotent handlers cause duplicate charges or duplicate records under retry. Ignoring the DLQ means failed messages are never fixed.
  • How it fails when misapplied: Using Service Bus for millions of events per second, or Event Hubs for ordered workflow steps. Missing correlation IDs make cross-service tracing impossible. Sync coupling between services alongside events increases latency and coupling.
  • Early warning signs: “We can’t trace a single order across services”; DLQ depth growing without a process to fix; handlers that assume exactly-once delivery.

What Most Guides Miss

Most guides show publishing and subscribing but skip idempotency and DLQ handling. In production, at-least-once delivery means every handler will see duplicates; without an idempotency key and a store of processed IDs (or upsert semantics), you get duplicate side effects. Few tutorials emphasise monitoring the DLQ and having a clear process to fix, resubmit, or abandon—so teams discover too late that failed messages pile up. Correlation ID propagation in headers and logs is also underplayed; without it, debugging a single business transaction across Event Grid, Service Bus, and multiple handlers is painful.


Decision Framework

  • If you need reliable messaging between your services (orders, workflows) → Use Service Bus (queue or topic); set MaxDeliveryCount and use the DLQ; use idempotency keys and correlation ID.
  • If you need fan-out to many subscribers or Azure resource events → Use Event Grid; make handlers idempotent; do not use it as a long-term queue.
  • If you need to ingest telemetry or logs at high throughput → Use Event Hubs with consumer groups; use Stream Analytics or your own processors; do not use it as a general queue.
  • If you are unsure → Match the service to the pattern (messaging vs routing vs ingestion); avoid using one service for another’s job.
  • If handlers are not idempotent → Add idempotency key and processed-key check (or upsert) before any side effect; treat this as non-negotiable.

You can also explore more patterns in the Event-Driven Architecture resource page.

Key Takeaways

  • Service Bus for reliable queues/topics; Event Grid for event routing and fan-out; Event Hubs for ingestion and streaming—do not swap their roles.
  • Design every handler for at-least-once: idempotency key, skip or upsert on replay, correlation ID for tracing.
  • Configure and monitor the DLQ on Service Bus; have a process to fix and resubmit or abandon.
  • Revisit service choice when requirements change (e.g. scale, ordering, or consistency needs).

If you’re adopting event-driven architecture, I help teams design messaging workflows, pub/sub models, and scalable distributed event systems.

When I Would Use This Again — and When I Wouldn’t

I would use Service Bus again for any system that needs reliable, ordered messaging between our services and can invest in idempotency and DLQ handling. I would use Event Grid again for Azure resource events and custom event routing with fan-out. I would use Event Hubs again for telemetry and stream processing. I wouldn’t use Event Hubs as a queue, or Service Bus for millions-of-events-per-second ingestion—wrong tool, wrong cost and behaviour. I wouldn’t go event-driven if the team cannot own idempotency, correlation, and DLQ monitoring; sync APIs or a simple queue may be better. If the domain does not need decoupling or replay, a straightforward request–response API is simpler.


Frequently Asked Questions

When should I use Service Bus vs Event Grid vs Event Hubs?

Service Bus: Reliable queues and topics between your services; ordered, durable; dead-letter. Event Grid: High-throughput event routing, Azure resource events, fan-out to many subscribers. Event Hubs: High-throughput ingestion (telemetry, logs) and stream processing. Do not use Event Hubs as a general queue.

How do I make my event handler idempotent?

Use an idempotency key (e.g. in the message) and store processed keys (cache or DB); if the key exists, skip or return success. For updates, use upsert (insert or update by key) so that replay does not duplicate. Design so that processing the same message twice has the same effect as once.

What happens when a message fails in Service Bus?

After max delivery count (configurable), the message is moved to the dead-letter queue (DLQ). Process the DLQ: fix the bug or data, then resubmit to the main queue or abandon. Alert on DLQ depth so that failures are visible.

How do I trace a request across event-driven services?

Propagate correlation ID (and trace ID) in message headers; each service logs the correlation ID. Use Application Insights or OpenTelemetry with distributed tracing so that you can follow a single business transaction across Service Bus, Event Grid, and your handlers.

What are common mistakes with event-driven architecture on Azure?

Wrong service choice (e.g. Event Hubs as a queue). Non-idempotent handlers leading to duplicate processing. Ignoring DLQ so that failed messages are never fixed. Sync coupling between services instead of async. Missing correlation so that debugging is impossible.

How do I handle late-arriving or out-of-order events?

Design for eventual consistency. In stream processing, use watermarks or event time so that late data can be incorporated. For batch, use idempotent merge (e.g. upsert by key) so that late arrivals overwrite or append correctly.
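An idempotent merge that tolerates late and duplicate events can be as small as an upsert guarded by event time. The sketch below keeps only the latest reading per device; Reading and LatestReadingStore are hypothetical types, and the in-memory dictionary stands in for a database upsert with the same comparison.

```csharp
using System;
using System.Collections.Generic;

public sealed record Reading(string DeviceId, DateTimeOffset EventTime, double Value);

// Hypothetical store: last-write-wins by event time, not by arrival order.
public sealed class LatestReadingStore
{
    private readonly Dictionary<string, Reading> _byDevice = new();

    // Upserts by key; an older or equal event time never overwrites a newer reading.
    public bool Upsert(Reading incoming)
    {
        if (_byDevice.TryGetValue(incoming.DeviceId, out var current)
            && current.EventTime >= incoming.EventTime)
            return false; // late or duplicate event: keep what we have

        _byDevice[incoming.DeviceId] = incoming;
        return true;
    }

    public Reading? Get(string deviceId) =>
        _byDevice.TryGetValue(deviceId, out var reading) ? reading : null;
}
```

Because the comparison uses the event's own timestamp, applying the same events in any order, or more than once, leaves the store in the same state.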

When should I use Service Bus?

Use Service Bus when you need reliable messaging between your own services: order processing, workflow steps, or any scenario where you cannot afford to lose a message and where processing order may matter. Supports queues (point-to-point) and topics (publish-subscribe with filters); sessions for ordering; dead-letter and deferral.

When should I use Event Grid?

Use Event Grid when you are reacting to events and need fan-out to many subscribers with low latency; Azure resource events (blob created, resource changes); custom topics for your own events. Push-based delivery; at-least-once; subscribers must be idempotent. Not for long-term queuing.

When should I use Event Hubs?

Use Event Hubs when you are ingesting large volumes of data (telemetry, logs) and processing with Stream Analytics, Spark, or your own consumers. Millions of events per second; consumer groups for multiple independent consumers. Not a general-purpose queue.

What is at-least-once delivery?

At-least-once means the message or event may be delivered more than once (e.g. after a retry or replay). Your handler must be idempotent: processing the same message twice must not cause duplicate side effects (e.g. double charge, duplicate record). Use idempotency key and skip or upsert on replay.

How do I monitor event-driven systems?

Use Application Insights (or similar) for logs and metrics; DLQ depth alerts on Service Bus; correlation ID and distributed tracing to follow a transaction across producers and consumers. Alert on processing failures and DLQ depth so that issues are visible.

What is a correlation ID?

A correlation ID is an identifier (e.g. GUID) that traces a single business transaction across services. Propagate it in message headers and logs so that when debugging, you can find all related logs and spans for that transaction. Use with Application Insights or OpenTelemetry for distributed tracing.

How do I handle failures and retries?

Service Bus: Configure max delivery count; failed messages go to DLQ. Process DLQ to fix and resubmit or abandon. Use AbandonMessageAsync in the handler to trigger retry. Event Grid: Subscriber returns 2xx on success; Event Grid retries with backoff on failure. Event Hubs: No built-in DLQ; implement retry or write failed events to a store for replay.

Event Grid vs webhooks?

Event Grid is a managed event routing service: retries, filters, Azure integration, and multiple subscribers. Webhooks are a simple HTTP callback—you implement the endpoint; no built-in retry or filtering. Event Grid is more robust for production event-driven workflows.

How do I replay events?

Event Hubs: Events are stored for the retention period (configurable). Replay by creating a new consumer group or resetting the consumer offset to an earlier position. Service Bus: No built-in replay; deferral and scheduled messages only postpone delivery of messages that have not been completed yet. To replay past events, keep them in Event Hubs or a separate event store and re-read from there.
