Waqas Ahmad — Software Architect & Technical Consultant - Available USA, Europe, Global

Specializing in

Distributed Systems

.NET Architecture · Cloud-Native Architecture · Azure Cloud Engineering · API Architecture · Microservices Architecture · Event-Driven Architecture · Database Design & Optimization

👋 Hi, I'm Waqas — a Software Architect and Technical Consultant specializing in .NET, Azure, microservices, and API-first system design.
I help companies build reliable, maintainable, and high-performance backend platforms that scale.

Experienced across engineering ecosystems shaped by Microsoft, the Cloud Native Computing Foundation, and the Apache Software Foundation.

Available for remote consulting (USA, Europe, Global) — flexible across EST, PST, GMT & CET.


Observability for .NET on Azure

Logging, metrics, tracing in .NET on Azure: Application Insights and OpenTelemetry.


Introduction

This guidance applies when you run .NET services in production, especially when a request crosses multiple services or queues and you need to diagnose failures and latency quickly; it matters less for a tiny internal tool with one service and no dependencies. I’ve applied it in real projects and refined the takeaways over time (as of 2026).

In microservices and distributed systems, when something goes wrong you need to know where and why quickly—observability (logs, metrics, traces) is what makes that possible. This article covers observability for .NET on Azure: Application Insights, OpenTelemetry, structured logging, custom metrics, and distributed tracing with correlation IDs. For architects and tech leads, consistent logs, metrics, and traces reduce mean time to diagnosis and build confidence in releases.

If you are new to observability, start with Topics covered and Observability at a glance.

For a deeper overview of this topic, explore the full Cloud-Native Architecture guide.

Decision Context

  • System scale: From a single API to tens of microservices; correlation and tracing pay off as soon as a request crosses more than one service.
  • Team size: Typically small to medium teams; ownership of alerts and runbooks matters more than headcount.
  • Time / budget pressure: Applicable under delivery pressure; I’ve used it in both greenfield builds and incremental refactors.
  • Technical constraints: .NET on Azure with Application Insights or OpenTelemetry; further constraints are noted in the article where they affect the approach.
  • Non-goals: This is not a full SRE programme; sampling strategy, alert thresholds, and retention always need tuning to your traffic and budget.

What is observability and why it matters

Observability is the ability to understand the internal state of a system from its outputs—logs, metrics, and traces. When something goes wrong in production, you need to know where and why quickly, without guessing or adding more logging after the fact. In microservices and distributed systems, a single user request may touch many services and many databases; without correlated logs and distributed traces, debugging becomes guesswork.

Why it matters: Good observability reduces mean time to diagnosis (MTTD), builds confidence in releases (you can verify behaviour and catch regressions), and enables on-call teams to react to alerts with context. From BAT in-house and other production systems, I have seen MTTD drop when logs, metrics, and traces are consistent, structured, and correlated across services.


Observability at a glance

Concept — What it is
Logs — Discrete events (e.g. “Request started”, “Error”, “Order processed”). Structured logs have named properties (OrderId, CustomerId) so you can query and filter.
Metrics — Aggregated values over time (e.g. request count, latency p95, error rate). Used for dashboards and alerts.
Traces — Request flow across services: a trace is a tree of spans (units of work). Each span has an ID, parent, name, and duration. Distributed tracing links spans across services via a trace ID in headers.
Correlation ID — A business or request ID (e.g. order ID, request ID) you add to logs and headers so support and developers can search by business context.
Trace ID (W3C) — Standard header (traceparent) that links spans across services so one request shows as one trace in Azure Monitor or Jaeger.
Application Insights — Azure service that ingests logs, metrics, and traces from .NET (and other stacks); provides search, dashboards, alerts, and dependency tracking.
OpenTelemetry — Vendor-neutral API and SDK for logs, metrics, and traces; you can export to Azure Monitor, Jaeger, Zipkin, or any OTLP backend.

The three pillars: logs, metrics, and traces

Logs are discrete events: “Request started”, “Order processed”, “Error: connection timeout”. They give you detail for debugging. Use structured logging (named properties like OrderId, CustomerId) so that the backend can index and query instead of grepping free text. Include trace ID and correlation ID in every log entry so you can follow a single request across services.

Metrics are aggregated values over time: request count, latency p95, error rate, queue depth. They power dashboards and alerts. You typically sample or aggregate (e.g. count per minute) rather than logging every event. Use custom metrics for business and technical health (e.g. orders processed, cache hit rate).

Traces represent request flow across services. A trace is a tree of spans; each span is a unit of work (one HTTP request, one DB call). Spans have parent-child relationships and trace ID so that one user request shows as one trace in Azure Monitor or Jaeger. Distributed tracing requires propagating trace ID (e.g. W3C Trace Context headers) in all outbound HTTP and message calls.

In .NET on Azure, you use Application Insights or OpenTelemetry to export logs, metrics, and traces to Azure Monitor (or Jaeger, Zipkin). Correlation (trace ID, correlation ID) ties them together so you can search by request or business context.


Structured logging in .NET with correlation

Use ILogger with named parameters so that the backend can index and query. Include correlation ID (e.g. from middleware or request header) and trace ID (from Activity.Current?.Id) in every request so you can filter and follow a single flow.

// Structured logging with correlation
_logger.LogInformation(
    "Order {OrderId} processed for customer {CustomerId}. TraceId: {TraceId}",
    orderId, customerId, Activity.Current?.Id ?? "none");

Correlation ID middleware: Read or generate a correlation ID (e.g. from X-Correlation-ID header or Guid.NewGuid()), add it to HttpContext.Items and to the response headers so that downstream and clients can use it. Log it in a scope so every log entry in that request gets the correlation ID:

using (logger.BeginScope(new Dictionary<string, object> { ["CorrelationId"] = correlationId }))
{
    // all logs in this scope include CorrelationId
}

Application Insights (or your logging sink) ingests these and indexes them; you can search by OrderId, CorrelationId, or TraceId.
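A minimal sketch of such middleware, pulling the pieces above together (the class, header name, and registration shown here are illustrative, not from a specific library):

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Logging;

// Sketch: read X-Correlation-ID or mint a new ID, echo it on the response,
// and open a log scope so every entry in the request carries CorrelationId.
public class CorrelationIdMiddleware
{
    private const string HeaderName = "X-Correlation-ID";
    private readonly RequestDelegate _next;
    private readonly ILogger<CorrelationIdMiddleware> _logger;

    public CorrelationIdMiddleware(RequestDelegate next, ILogger<CorrelationIdMiddleware> logger)
    {
        _next = next;
        _logger = logger;
    }

    public async Task InvokeAsync(HttpContext context)
    {
        // Reuse the caller's ID if present, otherwise generate one
        var correlationId = context.Request.Headers.TryGetValue(HeaderName, out var value)
            ? value.ToString()
            : Guid.NewGuid().ToString();

        context.Items["CorrelationId"] = correlationId;
        context.Response.Headers[HeaderName] = correlationId;

        // All logs written inside this scope include the CorrelationId property
        using (_logger.BeginScope(new Dictionary<string, object> { ["CorrelationId"] = correlationId }))
        {
            await _next(context);
        }
    }
}

// Registration in Program.cs:
// app.UseMiddleware<CorrelationIdMiddleware>();
```

Place it early in the pipeline so the scope wraps everything downstream, including exception handling.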


Application Insights: setup and configuration

Application Insights is Azure’s APM: it ingests logs (from ILogger), metrics (including custom), and traces (with dependency tracking). You add the SDK to your ASP.NET Core app and set the connection string (or instrumentation key) so that telemetry is sent to your Application Insights resource.

Minimal setup: Add package Microsoft.ApplicationInsights.AspNetCore, then in Program.cs:

builder.Services.AddApplicationInsightsTelemetry(options =>
{
    options.ConnectionString = builder.Configuration["ApplicationInsights:ConnectionString"];
});

Store the connection string in Azure Key Vault or App Configuration; never in source control. W3C Trace Context is supported so that distributed tracing works across services that also use Application Insights or OpenTelemetry.

Custom metrics: Use TelemetryClient.TrackMetric or the meter API; metrics appear in Azure Monitor and can be used in alert rules and dashboards.


OpenTelemetry: setup and configuration

OpenTelemetry is a vendor-neutral API and SDK for logs, metrics, and traces. You add the OpenTelemetry packages and export to Azure Monitor (or OTLP, Jaeger, Zipkin). This keeps your code independent of Application Insights so you can switch backends later.

Typical setup in Program.cs:

builder.Services.AddOpenTelemetry()
    .WithTracing(tracing =>
    {
        tracing.AddAspNetCoreInstrumentation()
              .AddHttpClientInstrumentation()
              .AddSqlClientInstrumentation();
        tracing.AddAzureMonitorTraceExporter(o =>
            o.ConnectionString = builder.Configuration["ApplicationInsights:ConnectionString"]);
    })
    .WithMetrics(metrics =>
    {
        metrics.AddAspNetCoreInstrumentation()
               .AddHttpClientInstrumentation();
        metrics.AddAzureMonitorMetricExporter(o =>
            o.ConnectionString = builder.Configuration["ApplicationInsights:ConnectionString"]);
    });

Propagate trace context in outbound HTTP and message calls so that downstream services are linked. The ASP.NET Core and HttpClient instrumentations do this for HTTP when W3C is enabled (default). For ILogger, use the OpenTelemetry logging provider or continue sending logs to Application Insights via the Application Insights SDK so that logs and traces are in one place.
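One way to wire up the OpenTelemetry logging provider with the Azure Monitor exporter — a sketch assuming the OpenTelemetry.Extensions.Hosting and Azure.Monitor.OpenTelemetry.Exporter packages (verify the extension names against your package versions):

```csharp
// Sketch: route ILogger output through OpenTelemetry to Azure Monitor,
// so logs land in the same backend as the traces configured above.
builder.Logging.AddOpenTelemetry(options =>
{
    options.IncludeScopes = true;           // carry BeginScope properties (e.g. CorrelationId)
    options.IncludeFormattedMessage = true; // keep the rendered message alongside the template
    options.AddAzureMonitorLogExporter(o =>
        o.ConnectionString = builder.Configuration["ApplicationInsights:ConnectionString"]);
});
```

With IncludeScopes enabled, correlation IDs added via BeginScope travel with every exported log record.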


Custom metrics and dashboards

Custom metrics (e.g. orders processed, cache hit rate, queue depth) help you track business and technical health. With Application Insights, use TelemetryClient.TrackMetric or the Meter API; with OpenTelemetry, use Meter.CreateCounter, CreateHistogram, etc., and export to Azure Monitor.

// Application Insights
_telemetry.TrackMetric("Orders.Processed", count, new Dictionary<string, string> { ["Environment"] = "Production" });

// OpenTelemetry meter
_counter.Add(1, new KeyValuePair<string, object>("env", "Production"));

Build dashboards in Azure Monitor from these metrics and from default ones (request rate, latency, failure rate). Use dashboards for at-a-glance health and alerts so that on-call reacts when thresholds are breached.
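On the OpenTelemetry side, a sketch of the System.Diagnostics.Metrics API behind a call like _counter.Add (the meter and instrument names such as Shop.Orders are illustrative):

```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Sketch: create one Meter per component and reuse its instruments.
public class OrderMetrics
{
    private static readonly Meter Meter = new("Shop.Orders", "1.0");

    private readonly Counter<long> _ordersProcessed =
        Meter.CreateCounter<long>("orders.processed", description: "Orders completed");

    private readonly Histogram<double> _processingMs =
        Meter.CreateHistogram<double>("orders.processing_ms", unit: "ms");

    public void Record(double elapsedMs)
    {
        // Tags let you slice the metric in dashboards and alerts
        _ordersProcessed.Add(1, new KeyValuePair<string, object?>("env", "Production"));
        _processingMs.Record(elapsedMs);
    }
}
```

Register the meter in the OpenTelemetry setup (e.g. metrics.AddMeter("Shop.Orders")) so its instruments are collected and exported.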


Distributed tracing and W3C Trace Context

Distributed tracing lets you follow a single request across services. Each service creates a span for its work and propagates trace ID (and span ID) to downstream calls via W3C Trace Context headers (traceparent, tracestate). In Azure Monitor (or Jaeger, Zipkin), you see the full trace and where latency or errors occur.

Correlation ID: In addition to trace ID, use a business correlation ID (e.g. order ID, request ID) in logs and headers so that support and developers can search by business context. Propagate both in HTTP and message headers so that every service in the path can log and link them.
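Beyond the automatic HTTP and SQL spans, you can wrap your own units of work in custom spans with System.Diagnostics.ActivitySource — a sketch with an illustrative source name:

```csharp
using System.Diagnostics;

// Sketch: create a custom span around a unit of work. "Shop.Orders" is an
// assumed source name; register it with tracing.AddSource("Shop.Orders")
// in the OpenTelemetry setup so the SDK listens to it.
public class OrderProcessor
{
    private static readonly ActivitySource Source = new("Shop.Orders");

    public void Process(string orderId)
    {
        // StartActivity returns null when no listener sampled this trace
        using var activity = Source.StartActivity("ProcessOrder");
        activity?.SetTag("order.id", orderId);

        // ... do the work; the span's duration is recorded on dispose
    }
}
```

The custom span joins the current trace automatically, so it appears as a child of the incoming request span in Azure Monitor or Jaeger.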


Sampling and cost control

Application Insights and similar backends can get expensive at high volume. Sample traces (e.g. 10% of requests) and log 100% of errors so that cost stays under control but failures are never missed. Use adaptive sampling (Application Insights) or head-based sampling (OpenTelemetry) so that related spans are either all sampled or all dropped—never half a trace.

Tune per environment: 100% in dev for full visibility; 10–20% in prod for cost. Aggregate metrics instead of logging every event. Archive old data to cheaper storage. Limit custom events and dependencies if they drive cost.
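A sketch of head-based sampling in the OpenTelemetry setup (the 10% ratio is the per-environment example above; adjust per environment):

```csharp
using OpenTelemetry.Trace;

// Sketch: head-based ratio sampling. ParentBasedSampler keeps a trace
// all-or-nothing: child spans follow the root span's sampling decision,
// so you never end up with half a trace.
builder.Services.AddOpenTelemetry()
    .WithTracing(tracing =>
    {
        tracing.SetSampler(new ParentBasedSampler(new TraceIdRatioBasedSampler(0.10)));
        tracing.AddAspNetCoreInstrumentation();
    });
```

Read the ratio from configuration so dev can run at 100% while prod runs at 10–20%.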


Alerts and runbooks

Collecting metrics without alerts means no one reacts when things break. Define alert rules for latency (e.g. p95 > 500ms), error rate (e.g. > 1%), dependency failures (e.g. database, external API), and availability if you have health checks. Use Azure Monitor alert rules on metrics or log queries (e.g. error count > 10 in 5 minutes). Action groups can notify (email, Slack, PagerDuty) or run runbooks.

Runbooks: Document what to do when each alert fires (e.g. “Check DB connection pool; scale out if needed; check dependency status”). On-call should have clear steps so that mean time to resolution stays low.


Common issues and challenges

Logging too much or too little: Too much and you get cost and noise; too little and you cannot debug. Use structured logging with levels (Info for normal, Warning for retries, Error for failures) and sample in high-throughput scenarios.

Missing correlation: Without trace ID or correlation ID across services, you cannot follow a request end-to-end. Propagate trace ID and correlation ID in headers and log them in every entry; use W3C Trace Context or your APM’s format.

Metrics without alerts: Define alerts for latency, error rate, and dependency failures; tune thresholds to avoid noise. Document runbooks for common alerts.

High cost: Sample traces and logs; aggregate metrics; archive old data. Tune sampling per environment.

Ignoring distributed tracing: Enable OpenTelemetry or Application Insights with distributed tracing and propagate trace ID in all outbound calls so the full path is visible.


Summary

Structured logging, custom metrics, and distributed tracing with correlation IDs are the foundation of observability for .NET on Azure; use Application Insights or OpenTelemetry to export to Azure Monitor and sample to control cost. Logging without structure or missing correlation makes diagnosis slow; no alerts mean you find out about failures from users. Next, add structured logging with correlation IDs, wire up Application Insights or OpenTelemetry, then add custom metrics and alerts; use the FAQs below as a quick reference.


Position & Rationale

I treat structured logging (consistent fields, correlation ID on every span) as non-negotiable for any .NET service that runs in production. I use distributed tracing (Application Insights or OpenTelemetry) so a single request can be followed across APIs and queues; I propagate the same trace/correlation ID in headers and logs. I add custom metrics only for business or reliability signals we actually act on. I sample when volume is high and tune retention and sampling so cost stays predictable. I reject logging without structure and deploying without correlation ID propagation.


Trade-Offs & Failure Modes

  • What this sacrifices: Some simplicity and upfront effort (instrumentation, propagation, naming conventions) plus ingestion cost that grows with traffic.
  • Where it degrades: At high volume without sampling, or when teams log free text beside the structured fields; early signs are rising ingestion bills and logs you cannot query.
  • How it fails when misapplied: Instrumenting everything without a question to answer, or skipping correlation so one request fragments into unrelated operations. The “When I Would Use This Again” section below reinforces boundaries.
  • Early warning signs: Dashboards nobody reads, alerts nobody acts on, and logs that cannot answer “what happened to this request”.

What Most Guides Miss

Most guides show how to add Application Insights and maybe logs, then stop. What they skip: what to actually trace—every span is cost and noise if you’re not answering a question; correlation across services (trace ID propagation) so one request doesn’t look like 10 unrelated operations; and sampling—when you have high traffic, you can’t keep everything, so you need a strategy (head-based, tail-based) or you’ll miss the slow request. The hard part is the strategy and the “what do we need to debug in production,” not the SDK call. Most posts don’t tie instrumentation to real incident scenarios.


Decision Framework

  • If your system spans multiple services or queues → Apply the full approach: structured logs, correlation IDs, distributed tracing, and alerts.
  • If you run a single small service with no dependencies → Start with structured logging and basic metrics; add tracing when a second service appears.
  • If you’re under heavy time pressure → Wire up structured logging with correlation first; it gives the most diagnostic value per hour invested.
  • If ownership of alerts and runbooks is unclear → Clarify before scaling the approach; signals nobody acts on are an early warning sign.

You can also explore more patterns in the Cloud-Native Architecture resource page.

Key Takeaways

  • Structured logs, correlated traces, and actionable metrics are the baseline; the Summary covers the mechanics, this section the judgment.
  • Apply the approach wherever a request crosses more than one service; avoid over-instrumenting a single-service tool.
  • Trade-offs (cost, noise, upfront effort) and failure modes are real; treat them as part of the decision.
  • Revisit “When I Would Use This Again” when deciding on a new project or refactor.

For production-grade Azure systems, I offer consulting on cloud architecture, scalability, and cloud-native platform design.

When I Would Use This Again — and When I Wouldn’t

I would use this observability approach again for any .NET service that runs in production—structured logging, correlation ID, and distributed tracing are baseline in my view. I’d add Application Insights or OpenTelemetry and tune sampling and retention so we can trace requests and control cost. I wouldn’t deploy without correlation ID propagation; I wouldn’t add custom metrics we don’t act on. For a tiny internal tool with one service and no dependencies, minimal logging may be enough; as soon as we have multiple services or queues, full observability pays off. If the team doesn’t own alerting and runbooks, I’d still add the instrumentation and then fix the process so someone acts on the signals.


Frequently Asked Questions

How do I add distributed tracing to my .NET application?

Use OpenTelemetry or Application Insights SDK; both support W3C Trace Context. Add the SDK to your app and configure export to Azure Monitor (or Jaeger, Zipkin). Propagate trace ID in HTTP headers and message headers so that downstream services are linked.

What is the difference between logs, metrics, and traces?

Logs: discrete events (e.g. “Request started”, “Error”). Metrics: aggregated values (e.g. request count, latency p95). Traces: request flow across services (spans with parent-child). Use all three: logs for detail, metrics for dashboards and alerts, traces for request flow.

How do I reduce Application Insights cost?

Sample traces and logs (e.g. 10% of requests, or 100% of errors). Aggregate metrics instead of logging every event. Limit custom events and dependencies. Archive old data to cheaper storage. Tune sampling per environment (e.g. 100% in dev, 10% in prod).

What should I alert on for a .NET API on Azure?

Latency (e.g. p95 > 500ms), error rate (e.g. > 1%), dependency failures (e.g. database, external API). Availability if you have health checks. Tune thresholds to avoid noise; document runbooks so that on-call knows what to do.

How do I correlate logs across microservices?

Propagate trace ID (and correlation ID) in HTTP headers and message headers; each service logs them. Use Application Insights or OpenTelemetry so that traces are linked by trace ID; search by trace ID to see the full request path.

What are common mistakes with observability?

Logging without structure (free text that cannot be queried). Missing correlation so that cross-service debugging is impossible. No alerts so that no one reacts when things break. Over-logging so that cost explodes. Structured logs, traces, metrics, and alerts together make production debuggable.

What is W3C Trace Context?

W3C Trace Context is a standard for propagating trace ID and span ID in HTTP headers (traceparent, tracestate) so that distributed traces are linked across services. Use OpenTelemetry or Application Insights SDK with W3C enabled so that outbound HTTP calls automatically add these headers and downstream services continue the trace.

How do I add OpenTelemetry to a .NET API?

Add OpenTelemetry packages (e.g. OpenTelemetry.Instrumentation.AspNetCore, .HttpClient, .SqlClient) and export to Azure Monitor (or OTLP endpoint). Configure TracerProvider and MeterProvider in Program.cs; the SDK will create spans and metrics for HTTP, SQL, and other instrumented operations. Propagate trace context in outbound calls so that downstream services are linked.

What is a span in distributed tracing?

A span represents a unit of work (e.g. one HTTP request, one DB call). Each span has an ID, parent span ID, name, start/end time, and attributes. A trace is a tree of spans; the root span is the entry point (e.g. API request); child spans are downstream calls. Viewing the trace shows you the full path and where latency or errors occur.

How do I log structured data in .NET?

Use ILogger with named parameters (e.g. _logger.LogInformation("Order {OrderId} processed", orderId)) so that the backend can index and query by OrderId. Use Serilog or similar for rich structured properties. Include trace ID, correlation ID, and user ID in every request so that you can filter and follow a single flow.
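As a sketch of that setup with Serilog (assuming the Serilog.AspNetCore and Serilog.Formatting.Compact packages; verify the API against your versions):

```csharp
using Serilog;
using Serilog.Formatting.Compact;

var builder = WebApplication.CreateBuilder(args);

// Sketch: route ILogger through Serilog, emitting one structured
// JSON event per log entry instead of interpolated free text.
builder.Host.UseSerilog((ctx, cfg) => cfg
    .Enrich.FromLogContext()                       // pick up BeginScope properties
    .WriteTo.Console(new CompactJsonFormatter())); // structured JSON output

var app = builder.Build();

// {OrderId} becomes a queryable property named OrderId, not just text
app.Logger.LogInformation("Order {OrderId} processed", 12345);
app.Run();
```

Swap the console sink for an Application Insights or file sink in production; the structured properties survive either way.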

What should I sample in high-throughput scenarios?

Sample traces (e.g. 10% of requests) and log 100% of errors so that cost stays under control but failures are never missed. Use adaptive sampling (Application Insights) or head-based sampling (OpenTelemetry) so that related spans are either all sampled or all dropped—never half a trace.

How do I set up alerts for a .NET API on Azure?

Use Azure Monitor (or Application Insights) to create alert rules on metrics (e.g. latency p95 > 500ms, error rate > 1%) or log queries (e.g. error count > 10 in 5 minutes). Action groups can notify (email, Slack, PagerDuty) or run runbooks. Document runbooks so that on-call knows what to do when an alert fires.

What is the difference between ILogger and Application Insights?

ILogger is the .NET abstraction for logging; you use it in code. Application Insights is a backend that can ingest ILogger output (via a sink or the Application Insights SDK) and index it for search, correlate it with traces, and alert on it. You can use ILogger with Serilog and export to Application Insights, or use the Application Insights SDK directly.

How do I reduce noise in logs?

Use log levels (e.g. Info for normal, Warning for retries, Error for failures); avoid logging every request at Info if you have high throughput. Sample or aggregate instead. Use structured properties so that you can filter in the backend (e.g. only errors, only a specific correlation ID) instead of grepping free text.

What is correlation ID vs trace ID?

Trace ID is from the tracing standard (W3C); it links spans across services. Correlation ID is often a business or request ID (e.g. order ID, request ID) that you add to logs and headers so that support and developers can search by business context. Use both: trace ID for the full technical path, correlation ID for business context.

How do I debug a slow request across microservices?

Use distributed tracing: search by trace ID (or correlation ID) in Application Insights or your APM to see the full trace (all spans). Identify the slowest span (e.g. a DB call or external API); drill into that service’s logs and metrics. Correlation is key—without trace ID propagation, you cannot see the full path.
