👋 Hi, I'm Waqas — a Software Architect and Technical Consultant specializing in .NET, Azure, microservices, and API-first system design.
I help companies build reliable, maintainable, and high-performance backend platforms that scale.
Azure microservices: AKS vs Service Fabric, event-driven design, health checks, and cost.
April 12, 2024 · Waqas Ahmad
Introduction
This guidance applies when you are building or operating multiple services on Azure and choosing orchestration, messaging, and operational patterns; it breaks down when your constraints differ (for example, a single-team monolith or a non-Azure stack). I’ve applied it in real projects and refined the takeaways over time (as of 2026).
Building microservices on Azure forces choices about orchestration (AKS vs Service Fabric), messaging (Service Bus, Event Grid), and cost—choices that affect scalability, operational complexity, and maintainability. This article explains when to choose AKS vs Service Fabric, how to design event-driven communication, how to implement health checks and readiness, and how to control cost without sacrificing reliability, with concrete code (Program.cs, Dockerfile, appsettings). Getting these decisions right matters for architects and tech leads who need consistent resilience, observability, and API design across services.
If you are new to Azure microservices, start with Topics covered and Azure microservices at a glance. We explain AKS, Service Fabric, event-driven patterns, health checks, and cost with tables, diagrams, and code.
System scale: Multiple services (typically 3+) on Azure; from a handful to dozens; applies when you’re building or operating microservices and need consistency in APIs, resilience, and observability.
Team size: One team per service or a small number of services; platform may own gateway, messaging, and observability; delivery teams own their services.
Time / budget pressure: Fits greenfield and incremental decomposition; breaks down when “we’ll add resilience later” and never do—then production bites.
Technical constraints: Azure (App Service, AKS, Service Bus, Event Grid, API Management, etc.); .NET typical; assumes you can add circuit breaker, retry, and tracing.
Non-goals: This article does not optimize for monoliths or for “microservices at any cost”; it optimizes for consistent, resilient service design when you’ve already chosen a multi-service architecture.
What are microservices and why they matter
Microservices are an architectural style where you build a system as a set of small, independently deployable services. Each service owns a bounded piece of business capability (e.g. orders, billing, notifications) and communicates with others via APIs or messages. Unlike a monolith (one big application and one database), you deploy and scale each service separately; teams can own and release their service without waiting for the whole system. That brings benefits—independent scaling, technology diversity, clearer ownership—but also complexity: distributed tracing, eventual consistency, and more moving parts.
On Azure, you run these services on containers (e.g. in AKS) or managed runtimes (e.g. Service Fabric, App Service), with messaging (Service Bus, Event Grid) and databases (Azure SQL, Cosmos DB) tying them together. Getting the orchestration, messaging, and operational choices right from the start avoids costly rework and keeps delivery fast. The rest of this article focuses on how to do it well on Azure.
What is Azure Kubernetes Service (AKS)?
Azure Kubernetes Service (AKS) is Microsoft’s managed Kubernetes offering. Kubernetes (K8s) is an open-source system for orchestrating containers: it schedules and runs your containers (e.g. Docker images of your .NET API) on a cluster of machines, restarts failed containers, scales them up or down, and handles load balancing and rolling updates. You describe what you want (e.g. “run 3 replicas of my API”) in manifests (YAML or Helm charts), and Kubernetes keeps the cluster in that state.
AKS runs the control plane for you; you get a cluster and add node pools (the VMs that run your workloads). You focus on your apps; Microsoft handles upgrades, security patches, and scaling of the control plane. AKS fits stateless services well: your API, workers, or frontend run in containers; state lives in databases or caches outside the pod. It is the default choice for most new microservices on Azure because of portability (same K8s elsewhere) and a large ecosystem (Helm, Kustomize, GitOps, Azure Monitor integration).
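The “run 3 replicas of my API” example above looks like this as a minimal Deployment manifest. A sketch only: the name, image, and resource values are illustrative, not from a real cluster.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-api
spec:
  replicas: 3                  # Kubernetes keeps three pods running at all times
  selector:
    matchLabels:
      app: order-api
  template:
    metadata:
      labels:
        app: order-api
    spec:
      containers:
        - name: order-api
          image: myregistry.azurecr.io/order-api:1.0.0   # hypothetical image
          ports:
            - containerPort: 8080
          resources:
            requests:          # used by the scheduler; basis for right-sizing nodes
              cpu: 250m
              memory: 256Mi
```

Apply it with `kubectl apply -f deployment.yaml`; if a pod crashes or a node dies, the controller replaces it to restore three replicas.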
What is Azure Service Fabric?
Azure Service Fabric is a different platform: a distributed systems runtime from Microsoft that can run containers but also native .NET services (stateful or stateless). It is built for stateful scenarios: in-memory state, reliable collections (distributed key-value stores), actors (stateful objects with single-threaded access), and long-running workflows. Service Fabric handles replication, failover, and rolling upgrades for you.
If your domain has actors (e.g. user session, device state) or stateful workflows that benefit from colocating logic and state, Service Fabric can simplify your design. The trade-off is lock-in to Microsoft’s stack and a steeper learning curve for teams used to Kubernetes. It is often chosen when you already have Service Fabric workloads, need Windows containers, or want strong .NET integration without adopting Kubernetes.
Azure microservices at a glance
| Concept | What it is | When to use |
| --- | --- | --- |
| AKS | Managed Kubernetes; container orchestration | Stateless or state-externalised microservices; portability; default for greenfield |
| Service Fabric | Distributed runtime for containers and native .NET services; stateful primitives (reliable collections, actors) | Stateful .NET services; Windows containers; existing SF workloads |
| Service Bus | Queues and topics; reliable, ordered messaging | Work between services; dead-letter; sessions |
| Event Grid | Event routing; push, high throughput | Fan-out; Azure resource events |
| Health checks | Liveness (am I up?) and readiness (can I take traffic?) | AKS/K8s probes; load balancer removal when not ready |
| Managed Identity | Azure AD identity for the app; no secrets in code | Auth to Key Vault, SQL, Service Bus |
AKS vs Service Fabric in depth
AKS is the default choice for most new microservices. It is based on Kubernetes, uses open standards, and has a large ecosystem (Helm, Kustomize, GitOps). AKS fits well when your services are stateless or when state is externalised to databases and caches. You get portability: the same manifests can run on other Kubernetes offerings or on-prem. Teams that already know Kubernetes ramp up quickly, and Azure integration (Managed Identity, Key Vault, Monitor) is solid.
Service Fabric shines when you need stateful services (in-memory state, reliable collections), strong .NET integration, or Windows containers. It gives you a distributed runtime with built-in replication, failover, and rolling upgrades. If your domain has actors or long-running stateful workflows, Service Fabric can simplify your design. The trade-off is lock-in to Microsoft’s stack and a steeper learning curve for teams coming from Kubernetes.
Recommendation: Prefer AKS for greenfield, container-first microservices. Choose Service Fabric when you have existing Service Fabric workloads, need stateful .NET services, or require Windows containers for legacy components.
Event-driven communication: Service Bus and Event Grid
Prefer asynchronous messaging between microservices so that availability and latency of one service do not cascade. On Azure, the main options are Azure Service Bus (queues and topics) and Azure Event Grid (event routing).
Service Bus gives you queues (point-to-point) and topics (publish-subscribe with filters). Messages are reliable, ordered (with sessions), and support dead-letter for failed processing. Use Service Bus when work must be processed exactly once or with explicit retries, or when you need sessions (e.g. per-user ordering). Use Managed Identity so no connection strings are in code.
Event Grid is high-throughput, push-based event delivery. It fits fan-out (one event to many subscribers) and Azure resource events (e.g. blob created, resource updated). It is at-least-once and does not replace a queue for ordered, exactly-once work. Use Event Grid when you need event routing at scale or integration with Azure services.
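To make the Service Bus side concrete, here is a consumer sketch using the Azure.Messaging.ServiceBus package with Managed Identity. The namespace, topic, and subscription names, and the HandleOrderCreatedAsync helper, are illustrative; it assumes the identity has the Azure Service Bus Data Receiver role.

```csharp
using Azure.Identity;
using Azure.Messaging.ServiceBus;

// Connect with Managed Identity — no connection string in code or config.
var client = new ServiceBusClient(
    "mynamespace.servicebus.windows.net",        // hypothetical namespace
    new DefaultAzureCredential());

var processor = client.CreateProcessor("order-events", "billing-subscription",
    new ServiceBusProcessorOptions { MaxConcurrentCalls = 4 });

processor.ProcessMessageAsync += async args =>
{
    // Delivery is at-least-once: this handler must be idempotent.
    await HandleOrderCreatedAsync(args.Message.Body);
    await args.CompleteMessageAsync(args.Message);
};

processor.ProcessErrorAsync += args =>
{
    // After max delivery attempts, the broker dead-letters the message.
    Console.Error.WriteLine(args.Exception);
    return Task.CompletedTask;
};

await processor.StartProcessingAsync();
```

If the handler throws, the message is abandoned and redelivered; once the delivery count is exhausted it lands in the dead-letter queue for inspection, which is exactly the safety net Event Grid does not give you for work queues.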
Health checks and readiness
In AKS, liveness and readiness probes determine whether a pod is kept running and whether it receives traffic. A liveness probe that fails causes the pod to be restarted; a readiness probe that fails removes the pod from the Service’s endpoints so it no longer receives requests (e.g. during startup or when a dependency is down).
Implement a health endpoint in your .NET API that checks dependencies (database, message bus). Use ASP.NET Core Health Checks: separate liveness (minimal: “process is up”) from readiness (dependencies OK). Expose them on different paths (e.g. /health/live and /health/ready) and point Kubernetes probes at them. That way, the orchestrator does not kill the pod when the database is temporarily slow, but it does stop sending traffic until the service is ready.
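The probe wiring described above looks like this in the container spec of a Deployment. A sketch: the paths match the /health/live and /health/ready endpoints from the text, while the port and timing values are illustrative.

```yaml
containers:
  - name: order-api
    image: myregistry.azurecr.io/order-api:1.0.0   # hypothetical image
    livenessProbe:
      httpGet:
        path: /health/live     # minimal: process is up; failure restarts the pod
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 15
    readinessProbe:
      httpGet:
        path: /health/ready    # checks DB and Service Bus; failure removes traffic
        port: 8080
      periodSeconds: 10
      failureThreshold: 3      # removed from Service endpoints after 3 failures
```

Note that a slow database only trips the readiness probe, so the pod stops receiving traffic but is not restarted.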
Cost optimization
Main cost drivers for Azure microservices: compute (AKS node pools, Service Fabric VMs), messaging (Service Bus, Event Grid), data (Azure SQL, Cosmos DB), and egress. To keep cost under control:
Right-size node pools: Start with the smallest node SKU that meets your resource requests; use scale-in and scale-out (or cluster autoscaler) for variable load.
Use Managed Identity: Avoid storing connection strings and keys; use Key Vault references and Managed Identity so you do not pay for extra secret management and reduce risk.
Reserved capacity: For baseline load, reserved instances or Savings Plans reduce compute cost.
Tag everything: Tag resources by team, environment, and project so you can attribute cost and set budgets.
Review messaging: Service Bus pricing is per operation and per topic/queue; consolidate or archive old topics. Event Grid is per event; avoid fan-out explosion if cost is a concern.
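Step 1: Split liveness and readiness endpoints

A minimal Program.cs sketch for the split (the OrderService name is illustrative; it uses only ASP.NET Core Health Checks, no external packages yet):

```csharp
// OrderService/Program.cs — minimal liveness/readiness split
using Microsoft.AspNetCore.Diagnostics.HealthChecks;

var builder = WebApplication.CreateBuilder(args);
builder.Services.AddHealthChecks();

var app = builder.Build();

// Liveness: run no checks — returns 200 as long as the process is up.
app.MapHealthChecks("/health/live", new HealthCheckOptions
{
    Predicate = _ => false
});

// Readiness: runs all registered checks (none yet; Step 2 adds them).
app.MapHealthChecks("/health/ready");

app.Run();
```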
What this does: /health/live returns 200 with no checks (liveness = “process is up”). /health/ready runs all registered checks (readiness). In AKS, point liveness at /health/live and readiness at /health/ready.
Step 2: Add dependency checks (readiness)
```csharp
// OrderService/Program.cs — add DbContext and Service Bus check
builder.Services.AddDbContext<OrderDbContext>(options =>
    options.UseSqlServer(builder.Configuration.GetConnectionString("Orders")));

builder.Services.AddSingleton<ServiceBusHealthCheck>();

builder.Services.AddHealthChecks()
    .AddDbContextCheck<OrderDbContext>("db")
    .AddCheck<ServiceBusHealthCheck>("servicebus");
```
What this does: Readiness now fails if the database or Service Bus is unreachable, so the pod is removed from the load balancer until dependencies are back.
How this fits together: The API uses IOrderMessagePublisher to publish events; the implementation uses Azure.Messaging.ServiceBus with a topic. Register ServiceBusClient with Managed Identity in production so no connection string is stored. Health check can ping the namespace or send a probe message to confirm connectivity.
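A sketch of that publisher (the IOrderMessagePublisher name comes from the text; the topic name, event shape, and registration details are illustrative):

```csharp
using Azure.Identity;
using Azure.Messaging.ServiceBus;

public interface IOrderMessagePublisher
{
    Task PublishOrderCreatedAsync(Guid orderId, CancellationToken ct = default);
}

public sealed class ServiceBusOrderMessagePublisher : IOrderMessagePublisher
{
    private readonly ServiceBusSender _sender;

    public ServiceBusOrderMessagePublisher(ServiceBusClient client)
        => _sender = client.CreateSender("order-events");   // hypothetical topic

    public async Task PublishOrderCreatedAsync(Guid orderId, CancellationToken ct = default)
    {
        var message = new ServiceBusMessage(BinaryData.FromObjectAsJson(new { orderId }))
        {
            MessageId = orderId.ToString(),   // stable ID enables duplicate detection
            Subject = "OrderCreated"
        };
        await _sender.SendMessageAsync(message, ct);
    }
}

// Registration sketch (Program.cs): Managed Identity, no connection string.
// builder.Services.AddSingleton(_ => new ServiceBusClient(
//     builder.Configuration["ServiceBus:Namespace"], new DefaultAzureCredential()));
// builder.Services.AddSingleton<IOrderMessagePublisher, ServiceBusOrderMessagePublisher>();
```

Because business logic depends only on the interface, swapping the Service Bus implementation for an Event Grid one later does not touch the API layer.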
Dockerfile and appsettings
Dockerfile: Multi-stage build keeps the image small. Use a non-root user and expose the port your app listens on.
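A multi-stage Dockerfile along those lines. The tags and project name are illustrative; the non-root `app` user is built into recent .NET base images, so verify it exists in the tag you use.

```dockerfile
# Build stage: SDK image restores and publishes the app
FROM mcr.microsoft.com/dotnet/sdk:8.0 AS build
WORKDIR /src
COPY OrderService.csproj .
RUN dotnet restore
COPY . .
RUN dotnet publish -c Release -o /app/publish

# Runtime stage: small ASP.NET image, non-root user
FROM mcr.microsoft.com/dotnet/aspnet:8.0 AS final
WORKDIR /app
COPY --from=build /app/publish .
ENV ASPNETCORE_URLS=http://+:8080
EXPOSE 8080
USER app                                  # run as non-root
ENTRYPOINT ["dotnet", "OrderService.dll"]
```

The final image contains only the runtime and published output, not the SDK or source, which keeps it small and reduces attack surface.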
appsettings: Keep secrets out of config. Use environment variables or Key Vault references (e.g. App Service Key Vault references, or AKS secrets / external-secrets). Example structure:
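One possible shape for that structure (keys are illustrative; .NET’s JSON configuration provider permits comments). The Service Bus section holds only the namespace, never a connection string, because auth uses Managed Identity:

```json
{
  "ConnectionStrings": {
    "Orders": ""   // injected via environment variable or Key Vault reference
  },
  "ServiceBus": {
    "Namespace": "mynamespace.servicebus.windows.net",
    "OrderTopic": "order-events"
  },
  "Logging": {
    "LogLevel": { "Default": "Information" }
  }
}
```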
In production, replace connection strings with Managed Identity and Key Vault; Namespace can stay in config if you use Managed Identity for auth.
Class structure: how the pieces fit together
An Azure microservices solution typically involves: the API edge, backend services, messaging abstractions, and health/observability. Each service exposes HTTP or gRPC and depends on interfaces for messaging and persistence so you can swap implementations (e.g. Service Bus vs Event Grid) without changing business logic.
Common issues and challenges
Synchronous chains: Calling multiple services over HTTP in a chain increases latency and couples availability. Prefer async messaging for cross-service work.
No health/readiness split: Using a single health endpoint that checks dependencies can cause the orchestrator to restart the pod when the DB is slow. Split liveness (minimal) and readiness (dependencies).
Secrets in config: Connection strings in appsettings or env vars are a security and rotation burden. Use Managed Identity and Key Vault.
Over-partitioning: Too many microservices too early increases operational and network cost. Start with a small number of bounded contexts and split when ownership or scaling justifies it.
Ignoring cost: Unbounded node pools, unused topics, or large Cosmos/SQL tiers add up. Tag, budget, and right-size from day one.
Best practices and pitfalls
Do:
Prefer AKS for greenfield microservices unless you have a strong reason for Service Fabric.
Use Managed Identity for all Azure resource auth (Key Vault, SQL, Service Bus).
Implement liveness and readiness and wire them to K8s probes.
Use async messaging (Service Bus or Event Grid) for cross-service communication.
Tag resources and set budgets and alerts.
Keep ADRs (Architecture Decision Records) for why you chose AKS vs Service Fabric, or Service Bus vs Event Grid.
Do not:
Store connection strings or keys in code or config when Managed Identity is available.
Use a single health check that fails on dependency issues for liveness (or the orchestrator will restart the pod).
Call many services synchronously in a chain; use events or queues.
Split into dozens of services before you have clear ownership and release needs.
Skip monitoring and distributed tracing; use correlation IDs and Azure Monitor / Application Insights.
Summary
AKS is the default for new Azure microservices; Service Fabric fits stateful .NET, actors, or existing SF workloads. Use Service Bus for reliable work between services and Event Grid for fan-out, and implement liveness and readiness with dependency checks. Getting orchestration, messaging, and observability wrong leads to production incidents and rework; designing for failure and tracing from day one keeps systems reliable. Next, map your bounded contexts and deployment targets, then choose AKS or Service Fabric and add health checks and correlation IDs before scaling out.
AKS is the default for new Azure microservices; Service Fabric fits stateful .NET, actors, or existing SF workloads.
Use Service Bus for reliable, ordered work between services; Event Grid for fan-out and Azure events.
Implement liveness (minimal) and readiness (with dependencies) and map them to AKS probes.
Managed Identity and Key Vault keep secrets out of code; tag and budget to control cost.
Structure services with interfaces for messaging and persistence; use Dockerfile multi-stage builds and appsettings without secrets in repo.
Avoid synchronous chains, over-partitioning, and skipping health checks or observability.
Position & Rationale
I favour API-first and contracts (OpenAPI, versioning) so services don’t break each other; resilience (circuit breaker, retry with backoff) so one failing service doesn’t cascade. I use messaging (Service Bus, Event Grid) for async and distributed tracing (e.g. Application Insights, W3C) so we can follow a request across services. I avoid shared databases between services; each service owns its data and exposes an API. I also avoid “we’ll add observability later”—correlation IDs and health checks from day one.
Trade-Offs & Failure Modes
What this sacrifices: Operational complexity (many services, many deploys, many failure modes); you accept eventual consistency and network failures as normal.
Where it degrades: When services are too fine-grained (network hop for every operation) or when nobody owns cross-cutting concerns (auth, tracing, gateway).
How it fails when misapplied: No circuit breaker so one slow dependency takes down the service; or no idempotency so retries duplicate side effects.
Early warning signs: “We don’t know which service is slow”; “our gateway is a single point of failure”; “we have no distributed tracing.”
What Most Guides Miss
Most guides list “best practices” without who owns what. Gateway, messaging, and tracing are often platform concerns; service teams own their API and resilience. If that split is unclear, you get gaps. The other gap: contract testing—services that consume others should have contract tests (e.g. Pact) so breaking changes are caught before deploy. Finally: idempotency—when you retry or replay messages, handlers must be idempotent or you get duplicate orders, double charges, etc.; many guides mention retry but not idempotency.
Decision Framework
If you’re adding a new service → Define API (OpenAPI), version it, add health and readiness; use circuit breaker and retry for outbound calls.
If you’re integrating services → Prefer async (messaging) for fire-and-forget; sync (HTTP) when you need an immediate response; ensure idempotency for retries.
If you have no distributed tracing → Add correlation IDs (W3C trace context) and send them to Application Insights or similar; start with one service and expand.
If the gateway is a bottleneck or single point of failure → Scale it, add health checks, and consider multi-region if needed.
If services share a database → Plan to split; shared DB creates coupling and blocks independent deploy.
One service per bounded context; each owns its data and exposes an API; no shared database.
Resilience: circuit breaker and retry with backoff for outbound calls; design for failure.
Observability: correlation IDs and distributed tracing from day one; health checks for every service.
Contracts and versioning (OpenAPI, URL or header versioning) so consumers don’t break.
Idempotency for message handlers and retried operations so retries don’t duplicate side effects.
Need help designing resilient microservices? I support teams with domain boundaries, service decomposition, and distributed systems architecture.
When I Would Use This Again — and When I Wouldn’t
I would use these practices again when I’m building or operating microservices on Azure and need consistent resilience, observability, and API design. I wouldn’t use them for a monolith—then focus on modular monolith and in-process boundaries first. I also wouldn’t skip resilience or tracing “to ship faster”; production incidents cost more. Alternative: if you’re decomposing a monolith, introduce circuit breaker and tracing for the first extracted service and then apply the same pattern as you split further.
Frequently Asked Questions
When should I use AKS vs Service Fabric?
Use AKS for greenfield, container-first microservices when your services are stateless or state lives in databases/caches. Use Service Fabric when you have existing Service Fabric workloads, need stateful .NET services (reliable collections, actors), or require Windows containers. AKS gives portability and a large ecosystem; Service Fabric gives strong .NET and stateful primitives.
What is the difference between liveness and readiness?
Liveness answers “is the process alive?”—if it fails, the orchestrator restarts the pod. Readiness answers “can this instance take traffic?”—if it fails, the pod is removed from the load balancer. Use a minimal liveness check (or none) and put dependency checks (DB, message bus) in readiness so the pod is not restarted when a dependency is temporarily down.
Should I use synchronous REST or messaging between microservices?
Prefer asynchronous messaging (Service Bus, Event Grid) for cross-service communication so that availability and latency of one service do not cascade. Use synchronous REST only within a single service boundary or at the edge (e.g. API Gateway to one backend for a request-response flow).
How do I secure Service Bus and avoid connection strings?
Use Managed Identity for producers and consumers: enable system- or user-assigned identity on your App Service or AKS pod identity, and grant the identity Azure Service Bus Data Sender/Receiver (or similar) on the namespace. In code, use DefaultAzureCredential or ManagedIdentityCredential; do not put connection strings in config.
What are the main cost drivers for microservices on Azure?
Compute (AKS node pools or Service Fabric VMs), messaging (Service Bus, Event Grid), data (Azure SQL, Cosmos DB), and egress. Right-size node pools, use reserved capacity for baseline load, tag resources, and set budgets. Review messaging usage (per-operation cost) and storage tiers regularly.
How do I implement health checks in .NET for AKS?
Use ASP.NET Core Health Checks: register AddHealthChecks(), add AddDbContextCheck and custom checks for Service Bus or other dependencies. Map liveness to a path with Predicate = _ => false (no checks) and readiness to a path that runs all checks. In Kubernetes, set livenessProbe and readinessProbe to hit those URLs.
When should I use Event Grid vs Service Bus?
Use Event Grid for high-throughput, push-based event delivery and Azure resource events (e.g. blob created). Use Service Bus for reliable, ordered processing with dead-letter and sessions when work must be processed exactly once or with explicit retries. Event Grid is at-least-once and fan-out; Service Bus is for work queues and ordered processing.
How many microservices should I start with?
Start with a small number (e.g. 2–5) and split only when you have clear ownership, independent release needs, or scaling/resilience requirements that justify the cost. Avoid splitting by technical layer; split by bounded context.
What is Managed Identity and why use it for microservices?
Managed Identity lets Azure resources (App Service, AKS pods, Functions) authenticate to other Azure services (Key Vault, SQL, Service Bus) without storing secrets. Use it for all service-to-service auth so that no connection strings or keys are in code or config; rotation is handled by Azure.
How do I run microservices locally?
Run dependencies in Docker (e.g. Azurite for storage, local emulators where available) or use stub implementations. Integration tests can hit a real Service Bus namespace in a dev subscription. For full local stacks, consider Tye or Docker Compose to orchestrate multiple services.
What is the role of correlation ID in microservices?
A correlation ID (or trace ID) is passed in headers across all services involved in a request. It lets you search logs and traces for every log line related to that request, making debugging and observability possible across service boundaries. Use it in middleware and when publishing/consuming messages.
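A minimal middleware sketch for the HTTP side (the header name is illustrative; note that ASP.NET Core also propagates the W3C traceparent automatically via Activity, so treat this as a simplified illustration rather than a replacement for distributed tracing):

```csharp
// Program.cs fragment: reuse an incoming correlation ID or create one,
// attach it to the logging scope, and echo it back to the caller.
app.Use(async (context, next) =>
{
    const string Header = "X-Correlation-ID";   // hypothetical header name
    var correlationId = context.Request.Headers[Header].FirstOrDefault()
                        ?? Guid.NewGuid().ToString();

    context.Response.Headers[Header] = correlationId;

    var logger = context.RequestServices.GetRequiredService<ILogger<Program>>();
    using (logger.BeginScope(new Dictionary<string, object?>
           { ["CorrelationId"] = correlationId }))
    {
        await next();
    }
});
```

When publishing messages, copy the same ID into a message application property so consumers can continue the trace.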
When should I use Service Fabric actors?
Use Service Fabric actors when you have stateful, per-entity logic (e.g. user session, device state, workflow per order) that benefits from colocated state and single-threaded access. If your state is already in a database and you do not need in-memory state or actor semantics, AKS with stateless services is usually simpler.
How do I reduce AKS cost?
Right-size node pools (smallest SKU that meets resource requests), use cluster autoscaler to scale in when idle, reserved instances or Savings Plans for baseline load, and tag everything for attribution. Avoid over-provisioning “just in case”; scale up when metrics justify it.
What is the minimum I need for production microservices on Azure?
Compute (AKS or Service Fabric) with Managed Identity, health checks (liveness + readiness), messaging (Service Bus or Event Grid), storage (Azure SQL or Cosmos) with Key Vault, HTTPS and Azure AD where applicable, Azure Monitor (logs + metrics + alerts), and distributed tracing with correlation IDs. Do not skip monitoring and health.
How do I structure configuration for many services?
Use Azure App Configuration or Key Vault for shared config and secrets; use environment-specific labels or key prefixes. Per service, use environment variables or mounted config maps in AKS. Avoid embedding environment names in code; use feature flags and config for behaviour.
What are ADRs and why use them for microservices?
Architecture Decision Records are short documents that capture a decision, context, and consequences. Use them so that future teams understand why you chose AKS over Service Fabric, or Service Bus over Event Grid, and can revisit when requirements change.
How do I do blue-green or canary on AKS?
Use Kubernetes deployment strategies: multiple deployments with different versions and a Service that you switch for blue-green. For canary, use two deployments with a fraction of replicas on the new version and gradually shift traffic (e.g. with Istio or a custom ingress).
Should I use Windows or Linux containers on AKS?
Use Linux containers unless you have a legacy or vendor requirement for Windows. Linux node pools are the default, have broader image support, and are often cheaper. Use Windows node pools only when necessary (e.g. .NET Framework, Windows-specific APIs).
How do I test microservices locally?
Run dependencies in Docker (Azurite, local emulators) or stubs. Use Tye or Docker Compose to run multiple services. Integration tests can target a dev Service Bus or SQL instance. Keep contract tests (e.g. consumer-driven) so that API changes are caught before deployment.
How do I secure the Service Bus namespace?
Use Managed Identity for producers and consumers. Restrict the namespace to a VNet with private endpoints. Use RBAC (e.g. Azure Service Bus Data Sender/Receiver) so each service has least privilege. Enable TLS and consider customer-managed keys for encryption at rest.
Related Guides & Resources
Explore the matching guide, related services, and more articles.