👋 Hi, I'm Waqas — a Software Architect and Technical Consultant specializing in .NET, Azure, microservices, and API-first system design.
I help companies build reliable, maintainable, and high-performance backend platforms that scale.
Microservices Resilience: Circuit Breaker, Retry, Timeout, Bulkhead, and Fallback
Resilience in microservices: retry, circuit breaker, Polly in .NET, and best practices.
April 16, 2025 · Waqas Ahmad
Introduction
This guidance applies when your services call remote dependencies (HTTP, gRPC, queues); it breaks down when everything runs in a single process with no outbound calls. I've applied these patterns in real projects and refined the takeaways over time.
When one microservice calls another, failures and latency are inevitable—without resilience patterns, a single slow or failing dependency can cascade and take down the system. This article covers retry, circuit breaker, timeout, bulkhead, and fallback: what each does, when to use it, and full Polly implementations in .NET with Program.cs and presentation, plus all failure scenarios and best practices. For architects and tech leads, applying these patterns (and ordering them correctly) keeps services degrading gracefully instead of cascading.
System scale: These patterns apply to any service with remote dependencies; thresholds, break durations, and bulkhead limits must be tuned to your traffic.
Team size: Small to medium teams; clear ownership of the policies for each dependency matters more than headcount.
Time / budget pressure: Applicable under delivery pressure; I've used this approach in both greenfield builds and incremental refactors.
Technical constraints: The examples use .NET and Polly; the patterns themselves are stack-agnostic.
Non-goals: This article does not cover every failure mode; boundaries are stated where they matter.
What is resilience and why it matters
Resilience means your system continues to function (or degrades gracefully) when dependencies fail or are slow. In microservices, dependencies are remote: another service, a database, or an external API. Failures can be transient (network blip, timeout, 503) or permanent (404, 400, dependency down). Without patterns:
Cascading failure — One failing dependency causes your service to pile up requests, exhaust threads, and crash; then callers of your service fail too.
Wasted resources — Retrying a dependency that is down burns CPU and threads.
Hanging — A call that never returns blocks a thread and can exhaust the pool.
Retry handles transient failures by trying again (with backoff). Circuit Breaker stops calling a failing dependency after a threshold so it can recover and you stop wasting resources. Timeout cancels long-running calls. Bulkhead limits how many concurrent calls go to a dependency so one bad dependency does not consume all threads. Fallback returns a cached or default value when the dependency fails. Together they make your service resilient.
Resilience at a glance

| Pattern | What it does | When to use |
|---|---|---|
| Retry | Retries a failed call (with delay). | Transient failures: network blip, timeout, 5xx. |
| Circuit Breaker | Stops calling after N failures; opens the circuit; after a duration tries again (half-open). | Remote shared dependency; prevent cascade. |
| Timeout | Cancels the call after a duration. | Prevent hanging; fail fast. |
| Bulkhead | Limits concurrent calls (per dependency or global). | Isolate one dependency so it does not exhaust resources. |
| Fallback | Returns a default or cached value when the call fails. | Graceful degradation; avoid failing the whole request. |
Retry: when, how, and all scenarios
When to retry: Transient failures only: HttpRequestException, TaskCanceledException (timeout), HTTP 5xx, and 408 Request Timeout. Do not retry other 4xx or permanent errors (e.g. 404, 400).
How: Use exponential backoff (delay grows: 1s, 2s, 4s) so you do not hammer the dependency. Add jitter (random offset) so many clients do not retry at the same time. Set a max retry count (e.g. 3–5) and optionally a per-call timeout so you fail fast.
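The backoff-with-jitter calculation above can be sketched in a few lines. This is the common "full jitter" variant (the delay is drawn uniformly from zero up to the exponential ceiling); the class and method names are my own, not from any library:

```csharp
using System;

public static class Backoff
{
    // Exponential backoff with full jitter: the delay for attempt n is a
    // random value in [0, baseSeconds * 2^n). Illustrative names only.
    public static TimeSpan DelayForAttempt(int attempt, Random rng, double baseSeconds = 1.0)
    {
        double ceiling = baseSeconds * Math.Pow(2, attempt); // 1s, 2s, 4s, ... ceilings
        return TimeSpan.FromSeconds(rng.NextDouble() * ceiling);
    }
}
```

Because each client draws a different random delay below the ceiling, many clients that failed at the same moment do not retry at the same moment.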
Scenarios:

| Scenario | What happens | Policy |
|---|---|---|
| Transient network error | First call fails; retry succeeds. | Retry with backoff. |
| Dependency temporarily slow | First call times out; retry succeeds. | Retry + timeout per attempt. |
| Dependency returns 503 | Retry with backoff; stop after max retries. | Retry on 5xx only; do not retry 4xx. |
| Dependency is down | All retries fail; circuit breaker opens after threshold. | Retry + circuit breaker. |
Circuit Breaker: states, when it trips, when it resets
States:

| State | Meaning | Behavior |
|---|---|---|
| Closed | Normal. | Calls go through. Failures are counted. |
| Open | Too many failures. | Calls fail immediately (no call to the dependency). After a duration, move to Half-Open. |
| Half-Open | Testing recovery. | One trial call is allowed. Success → Closed. Failure → Open again. |
When it trips (Closed → Open): After N consecutive failures, or N failures in a time window (e.g. 5 failures in 30 seconds). Polly: CircuitBreakerAsync(failureThreshold, durationOfBreak) or AdvancedCircuitBreakerAsync(failureThreshold, samplingDuration, minimumThroughput, durationOfBreak).
When it resets (Open → Half-Open): After durationOfBreak (e.g. 30 seconds). Polly automatically moves to Half-Open and allows one call.
Half-Open: One success → Closed. One failure → Open again (the break duration restarts).

| Trial outcome | What happens |
|---|---|
| Trial call succeeds | After the break duration, one trial call succeeds; the circuit closes. |
| Trial call fails | The circuit reopens; the break duration starts again. |
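To make the three states concrete, here is a minimal, single-threaded circuit breaker sketch in plain C# (not Polly). The class name and API are illustrative only; `now` is passed in so the state transitions are easy to follow and test:

```csharp
using System;

public enum CircuitState { Closed, Open, HalfOpen }

// Minimal circuit breaker sketch: opens after N consecutive failures, fails
// fast while open, allows one trial call after the break duration, and closes
// again only if that trial succeeds. Not thread-safe; illustration only.
public class SimpleCircuitBreaker
{
    private readonly int _failureThreshold;
    private readonly TimeSpan _breakDuration;
    private int _consecutiveFailures;
    private DateTime _openedAt;

    public CircuitState State { get; private set; } = CircuitState.Closed;

    public SimpleCircuitBreaker(int failureThreshold, TimeSpan breakDuration)
    {
        _failureThreshold = failureThreshold;
        _breakDuration = breakDuration;
    }

    public T Execute<T>(Func<T> call, DateTime now)
    {
        if (State == CircuitState.Open)
        {
            if (now - _openedAt < _breakDuration)
                throw new InvalidOperationException("Circuit is open; failing fast.");
            State = CircuitState.HalfOpen; // break elapsed: allow one trial call
        }
        try
        {
            T result = call();
            State = CircuitState.Closed;   // any success (incl. the trial) closes
            _consecutiveFailures = 0;
            return result;
        }
        catch
        {
            _consecutiveFailures++;
            if (State == CircuitState.HalfOpen || _consecutiveFailures >= _failureThreshold)
            {
                State = CircuitState.Open; // trip, or reopen after a failed trial
                _openedAt = now;
            }
            throw;
        }
    }
}
```

Polly's real implementation adds thread safety, result-based handling, and callbacks (onBreak, onReset, onHalfOpen); this sketch only shows the state machine.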
Timeout, Bulkhead, and Fallback
Timeout: Cancel the call after a duration (e.g. 10 seconds). Prevents one slow dependency from hanging your thread. Polly: Policy.TimeoutAsync(TimeSpan.FromSeconds(10)) or TimeoutStrategy.Optimistic (CancellationToken).
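The optimistic (CancellationToken-based) timeout can be sketched without Polly; the helper name below is hypothetical, and the delegate stands in for any call (such as HttpClient.GetAsync) that honours its token:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class TimeoutDemo
{
    // Per-call timeout sketch: cancel the outbound call if it exceeds the
    // limit, and surface that as a TimeoutException for the caller.
    public static async Task<string> CallWithTimeoutAsync(
        Func<CancellationToken, Task<string>> call, TimeSpan limit)
    {
        using var cts = new CancellationTokenSource(limit); // fires after 'limit'
        try
        {
            return await call(cts.Token);
        }
        catch (OperationCanceledException)
        {
            // Upstream policies can treat this as a transient failure (retry).
            throw new TimeoutException($"Call exceeded {limit.TotalSeconds}s.");
        }
    }
}
```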
Bulkhead: Limit concurrent calls (e.g. max 10 to OrderApi). If 10 are in flight, the 11th waits or fails. Isolates one dependency so it cannot exhaust all threads. Polly: Policy.BulkheadAsync(maxParallelization, maxQueuingActions).
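A bulkhead can be sketched with SemaphoreSlim. This illustrative version (not Polly's) rejects excess calls immediately rather than queuing them, which is the simplest way to see the isolation effect:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public class SimpleBulkhead
{
    // Bulkhead sketch: at most maxConcurrent calls in flight per dependency.
    private readonly SemaphoreSlim _slots;

    public SimpleBulkhead(int maxConcurrent) => _slots = new SemaphoreSlim(maxConcurrent);

    public async Task<T> ExecuteAsync<T>(Func<Task<T>> call)
    {
        if (!await _slots.WaitAsync(0)) // 0ms wait: reject immediately when full
            throw new InvalidOperationException("Bulkhead full; rejecting call.");
        try { return await call(); }
        finally { _slots.Release(); }   // free the slot even if the call throws
    }
}
```

Polly's BulkheadAsync additionally supports a bounded queue (maxQueuingActions) for callers beyond the concurrency limit.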
Fallback: When the call fails (after retries and/or circuit open), return a default value or cached value. Polly: Policy.Handle<...>().FallbackAsync(fallbackAction) or FallbackAsync(fallbackValue).
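A minimal fallback wrapper can be sketched as below, assuming any exception escaping the inner pipeline (including a BrokenCircuitException when the circuit is open) should degrade to a supplied default; the names are illustrative:

```csharp
using System;
using System.Threading.Tasks;

public static class FallbackDemo
{
    // Fallback sketch: if the inner call fails, return a cached or default
    // value instead of failing the whole request. Illustration only; a real
    // service would log the failure and usually catch narrower exception types.
    public static async Task<T> WithFallbackAsync<T>(Func<Task<T>> call, T fallbackValue)
    {
        try { return await call(); }
        catch (Exception)
        {
            return fallbackValue; // graceful degradation
        }
    }
}
```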
Recommended order (outer to inner): Fallback (outer) → Circuit Breaker → Retry → Timeout → Bulkhead (inner, closest to the call). So: Bulkhead limits concurrency, Timeout cancels long calls, Retry retries transient failures, Circuit Breaker opens after repeated failures, and Fallback returns a default when the circuit is open or all retries fail.
Polly: PolicyWrap

```csharp
// Requires the Polly NuGet package.
using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using Polly;

var bulkhead = Policy.BulkheadAsync<HttpResponseMessage>(10, 100);
var timeout  = Policy.TimeoutAsync<HttpResponseMessage>(TimeSpan.FromSeconds(10));

var retry = Policy<HttpResponseMessage>
    .Handle<HttpRequestException>()
    .OrResult(r => !r.IsSuccessStatusCode)
    .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

var circuitBreaker = Policy<HttpResponseMessage>
    .Handle<HttpRequestException>()
    .OrResult(r => !r.IsSuccessStatusCode)
    .CircuitBreakerAsync(5, TimeSpan.FromSeconds(30));

var fallback = Policy<HttpResponseMessage>
    .Handle<HttpRequestException>()
    .OrResult(r => !r.IsSuccessStatusCode)
    .FallbackAsync((CancellationToken ct) => Task.FromResult(
        new HttpResponseMessage(HttpStatusCode.OK)
        {
            Content = new StringContent("[]") // placeholder fallback payload
        }));

// Outermost policy first: Fallback → Circuit Breaker → Retry → Timeout → Bulkhead.
var pipeline = Policy.WrapAsync(fallback, circuitBreaker, retry, timeout, bulkhead);
```
All failure scenarios and how to handle them

| Scenario | What to do |
|---|---|
| Transient network error | Retry with backoff; circuit opens if many fail. |
| Dependency timeout | Per-call timeout + retry (the next attempt may succeed). |
| Dependency returns 5xx | Retry; circuit breaker after threshold. |
| Dependency returns 4xx | Do not retry (client error); fail fast. |
| Dependency down (connection refused) | Retry a few times; circuit opens; fallback or fail. |
| Cascading failure | Circuit breaker stops calling; bulkhead limits concurrency so one dependency does not exhaust threads. |
| Partial failure (one instance down) | Retry (the next request may hit a healthy instance); circuit breaker per dependency. |
| Half-open: trial succeeds | Circuit closes; normal operation. |
| Half-open: trial fails | Circuit reopens; break duration restarts. |
| Non-idempotent call | Do not retry; or use an idempotency key and retry. |
| Slow dependency | Timeout per call; retry if the timeout is transient. |
Polly in .NET: full implementations
Package: Polly.Extensions.Http, or Polly plus Microsoft.Extensions.Http.Polly.
Circuit Breaker: Handle<HttpRequestException>().OrResult(r => !r.IsSuccessStatusCode).CircuitBreakerAsync(5, TimeSpan.FromSeconds(30)). Advanced: AdvancedCircuitBreakerAsync(0.5, TimeSpan.FromSeconds(30), 10, TimeSpan.FromSeconds(30)) — opens when 50% of at least 10 calls fail within a 30-second sampling window, then breaks for 30 seconds.
Flow: Client calls your API → OrdersController → OrderService → HttpClient (named OrderApi). Each GetAsync is wrapped by Circuit Breaker → Retry → Timeout (outer to inner). With AddPolicyHandler, the first handler added is the outermost, so add the circuit breaker first and the timeout last to get this order: each attempt is bounded by the timeout, retry handles transient failures, and the circuit breaker opens after repeated failure. If the circuit is open, Polly throws BrokenCircuitException; you can catch it and return 503, or use a fallback policy.
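A Program.cs wiring sketch under these assumptions: the Polly and Microsoft.Extensions.Http.Polly NuGet packages are installed, the policies are inlined rather than in GetRetryPolicy-style helpers, and the base address is a placeholder. With AddPolicyHandler the first handler added is the outermost:

```csharp
// DI wiring sketch only; requires the Polly and Microsoft.Extensions.Http.Polly packages.
using Polly;
using Polly.Extensions.Http;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddHttpClient("OrderApi", c =>
        c.BaseAddress = new Uri("https://orders.example.com")) // placeholder URL
    // First added = outermost: Circuit Breaker → Retry → Timeout (inner).
    .AddPolicyHandler(HttpPolicyExtensions.HandleTransientHttpError()
        .CircuitBreakerAsync(5, TimeSpan.FromSeconds(30)))
    .AddPolicyHandler(HttpPolicyExtensions.HandleTransientHttpError()
        .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt))))
    .AddPolicyHandler(Policy.TimeoutAsync<HttpResponseMessage>(
        TimeSpan.FromSeconds(10)));

var app = builder.Build();
app.Run();
```

HandleTransientHttpError() covers HttpRequestException, 5xx, and 408 — matching the "retry transient only" rule above.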
Best practices and pitfalls
Do:
Retry only transient failures (5xx, timeout, HttpRequestException). Do not retry 4xx (except 408).
Use exponential backoff and jitter for retry.
Use a shared circuit breaker per dependency (same HttpClient name = same policy instance in typical setup).
Set timeout per call so one slow dependency does not hang.
Monitor circuit state (Closed/Open/Half-Open) and log onBreak, onReset, onHalfOpen.
Use fallback for read-only or optional data (cached or default); avoid fallback for non-idempotent writes unless you use idempotency keys.
Pitfalls:
Retrying non-idempotent operations — Double payment, double order. Use idempotency key or do not retry.
Circuit too sensitive — Opens on a small number of failures; tune threshold and sampling duration.
Circuit too slow — Long break duration; dependency recovers but circuit stays open. Tune duration.
No timeout — One hanging call can exhaust threads; always set timeout.
Ignoring half-open — Ensure one failure in half-open reopens the circuit (Polly does this by default).
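The non-idempotent-retry pitfall can be illustrated with a toy server-side deduplication store: the server records the result of the first request for a given idempotency key, so a retried duplicate returns the same result instead of applying the side effect twice. PaymentService and its API are hypothetical:

```csharp
using System;
using System.Collections.Concurrent;

// Toy idempotency-key store. In-memory and not race-proof under concurrent
// duplicates (GetOrAdd's factory can run more than once); a real service
// would use a durable store with atomic insert and a TTL per key.
public class PaymentService
{
    private readonly ConcurrentDictionary<string, string> _results = new();
    public int ChargesApplied { get; private set; }

    public string Charge(string idempotencyKey, decimal amount)
    {
        return _results.GetOrAdd(idempotencyKey, _ =>
        {
            ChargesApplied++; // the side effect runs once per key
            return $"charged:{amount}";
        });
    }
}
```

With this in place, a retry policy can safely re-send the same request with the same Idempotency-Key header; without it, the only safe option is not to retry.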
Summary
Resilience means the system keeps working or degrades gracefully when dependencies fail; the core patterns are retry (transient), circuit breaker (stop calling a failing dependency), timeout (no hanging), bulkhead (isolation), and fallback (default value). Skipping them leads to cascading failures and outages; applying them in the right order (e.g. fallback → circuit breaker → retry → timeout → bulkhead from outer to inner) with Polly in .NET keeps callers stable. Next, add retry with exponential backoff and jitter for your critical outbound calls, then add circuit breaker and timeout; use Polly’s AddPolicyHandler and Policy.WrapAsync for combined policies.
Resilience = system keeps working or degrades gracefully when dependencies fail. Retry (transient), Circuit Breaker (stop calling failing dependency), Timeout (no hanging), Bulkhead (isolation), Fallback (default value).
Retry: Exponential backoff + jitter; max count; only for 5xx/timeout/transient. Circuit Breaker: Closed → Open (after N failures) → Half-Open (after duration) → Closed if trial succeeds.
Order: Fallback → Circuit Breaker → Retry → Timeout → Bulkhead (inner). Polly in .NET: AddPolicyHandler for HttpClient; use Policy.WrapAsync for custom order.
All scenarios: Transient → retry; dependency down → circuit opens; timeout → per-call timeout; non-idempotent → do not retry or use idempotency key.
Full code: Centralised policies (Retry, Circuit Breaker, Timeout, Bulkhead), Program.cs with AddHttpClient + AddPolicyHandler, a service that uses IHttpClientFactory, and a controller that calls the service.
Position & Rationale
I apply retry only for transient failures (5xx, timeout, HttpRequestException); I never retry 4xx (except 408) or non-idempotent operations without an idempotency key. I use circuit breaker on every outbound call to a shared dependency so we stop hammering it when it’s down and fail fast. I always set a per-call timeout so a hanging dependency doesn’t exhaust threads. I combine policies in a clear order: Fallback (outer) so we have a safe default, then Circuit Breaker, Retry, Timeout, Bulkhead (inner). I use Polly with IHttpClientFactory so each named client has its own policy and we don’t share state across unrelated calls. I reject retrying non-idempotent operations (e.g. payment) unless we have idempotency keys and server-side deduplication.
Trade-Offs & Failure Modes
What this sacrifices: Retry adds latency on transient failure; circuit breaker adds “fail fast” when the circuit is open (users see errors until half-open succeeds). Fallback can hide real outages if overused.
Where it degrades: Wrong policy order (e.g. retry outside circuit breaker) wastes calls on a down dependency. Retrying non-idempotent calls causes duplicate side effects. No timeout lets one slow dependency block the pool. Too many bulkhead partitions or too few can under- or over-isolate.
How it fails when misapplied: Retrying 4xx (e.g. 400, 404) never helps and burns resources. Circuit breaker with too low a threshold opens on normal blips; too high and we cascade. Fallback that always returns success masks dependency failure and can corrupt data if the fallback is “return stale.”
Early warning signs: “We’re still calling the failing service thousands of times”; “timeouts are piling up”; “circuit never closes” or “opens too easily”; duplicate payments or orders after retries.
What Most Guides Miss
Docs often show retry and circuit breaker in isolation and skip policy order and non-idempotent operations. In practice, order matters: retry inside circuit breaker so we don’t burn retries when the circuit is open; timeout inside retry so each attempt is bounded. Another gap: idempotency—if the operation is not idempotent (e.g. “create order”), retry can double-submit; you need idempotency keys and server-side deduplication or you must not retry. Many guides also don’t stress per-dependency policies (e.g. one circuit breaker per downstream service) so one bad dependency doesn’t share state with others.
Decision Framework
If you call a remote dependency (HTTP, gRPC) → Use timeout and retry (transient only); add circuit breaker so repeated failure stops calls; add fallback only when you have a safe default.
If the operation is non-idempotent → Do not retry, or use idempotency key and server-side deduplication; otherwise retry can double-apply.
If you have multiple dependencies → Use separate policies per dependency (named HttpClient, separate circuit breaker state) so one failure doesn’t affect others.
If the dependency is slow or flaky → Use bulkhead to limit concurrent calls; combine with timeout so one slow dependency doesn’t exhaust the pool.
If the circuit opens too often or never → Tune failure threshold and duration; add logging so you can see why it opened and when it half-opens.
Retry for transient failures only (5xx, timeout); circuit breaker to stop calling a failing dependency; timeout on every outbound call; bulkhead to isolate; fallback only when safe.
Policy order: Fallback → Circuit Breaker → Retry → Timeout → Bulkhead (inner). Use Polly with IHttpClientFactory and per-client policies.
Do not retry non-idempotent operations without idempotency; tune circuit breaker threshold and duration from real failure patterns.
Need help designing resilient microservices? I support teams with domain boundaries, service decomposition, and distributed systems architecture.
When I Would Use This Again — and When I Wouldn’t
I would use these resilience patterns again for any service that calls remote dependencies—HTTP APIs, gRPC, or queues. I’d apply retry, circuit breaker, and timeout as the baseline; add bulkhead when we have multiple dependencies or high concurrency; add fallback only when we have a defined safe default. I wouldn’t retry non-idempotent operations (e.g. payment, order submit) without idempotency keys. I wouldn’t skip circuit breaker to “simplify”—cascading failure is costlier than the extra config. If we’re in a single process with no remote calls, resilience policies add little; once we have outbound dependencies, they’re mandatory in my view.
Frequently Asked Questions
What is a Circuit Breaker?
Stops calling a failing dependency after a threshold of failures. States: Closed (normal), Open (no calls, fail fast), Half-Open (one trial). After a duration, circuit moves to Half-Open; one success closes it, one failure reopens it.
What is Retry?
Retries a failed call with a delay. Use for transient failures (network, timeout, 5xx). Use exponential backoff and jitter; set max retry count.
When should you use Retry vs Circuit Breaker?
Use Retry for transient failures (a few retries). Use Circuit Breaker for remote shared dependencies so that after many failures you stop calling and let the dependency recover. Combine: retry a few times, then let the circuit open if failures persist.
Circuit Breaker states?
Closed — calls go through; failures counted. Open — no calls; fail immediately. Half-Open — one trial call; success → Closed, failure → Open.
When should you use Timeout?
Always for outbound calls. Prevents one slow dependency from hanging your thread. Typical: 10–30 seconds per call.
What is Bulkhead?
Limits concurrent calls to a dependency (e.g. max 10). Prevents one bad dependency from consuming all threads and causing cascading failure.
What is Fallback?
Returns a default or cached value when the call fails (after retries or when circuit is open). Use for graceful degradation; avoid for non-idempotent writes.
How do you combine Retry, Circuit Breaker, and Timeout?
Order (outer to inner): Fallback → Circuit Breaker → Retry → Timeout → (Bulkhead). In Polly: AddPolicyHandler(CircuitBreaker), then AddPolicyHandler(Retry), then AddPolicyHandler(Timeout) — the first handler added is the outermost.
How do you implement this in .NET?
Use Polly and IHttpClientFactory. AddHttpClient("Name", ...).AddPolicyHandler(GetRetryPolicy()).AddPolicyHandler(GetCircuitBreakerPolicy()). All calls with that client share the same policies and (for circuit breaker) the same state.
Retry on 5xx only?
Yes. Retry on 5xx and 408 and HttpRequestException. Do not retry 4xx (client error) except 408.
What is exponential backoff?
Delay between retries grows: e.g. 1s, 2s, 4s. Prevents hammering the dependency. Add jitter (random offset) so many clients do not retry at once.
What is jitter?
Random offset added to the retry delay. Prevents thundering herd when many clients retry at the same time.
Circuit Breaker threshold?
Common: 5 failures in a row or 50% of at least 10 calls in 30 seconds. Tune per dependency.
Half-open behavior?
After the break duration, circuit moves to Half-Open. One trial call is allowed. Success → Closed. Failure → Open again (break duration restarts).
Non-idempotent and retry?
Do not retry non-idempotent operations (e.g. payment) without an idempotency key. Otherwise retry can double-submit.