Distributed Tracing
The operational guide to observing requests as they traverse microservices, databases, queues, and LLM APIs. Covers OpenTelemetry instrumentation, Jaeger and Zipkin deployment, context propagation, sampling strategy, and end-to-end debugging patterns.
Why Distributed Tracing
In a monolith, a slow request is obvious — profile it, find the slow function, fix it. In a distributed system spanning dozens of microservices, the request is a ghost: it enters your API gateway and touches 8 services, 3 databases, 2 queues, and 3 LLM API calls before a response returns. Which hop added the 4 seconds? Logs can't tell you. Metrics can't tell you. Traces can.
Distributed tracing reconstructs the complete journey of a request as a causally-linked tree of spans — one per operation, each with a start time, duration, service name, and any relevant metadata. It answers the questions that logs and metrics fundamentally cannot.
Logs: discrete events per service, with no cross-service correlation. You need a request ID to join logs across services, and you still can't see the timing relationships.
Metrics: aggregated rates and latencies per service, with no per-request view. You know the p99 of the order service is 2s but not which downstream call caused it for which request.
Traces: a per-request causal graph of every operation. You see exactly which service, which database query, or which LLM call caused the slowdown for a specific user's specific request.
What Tracing Solves That Nothing Else Does
Which downstream dependency added 3.8 seconds to this specific request? Was it the database query in service-6, or the LLM call in service-4? Traces give you the exact span with its duration, parent, and all child spans — in one view.
A 500 from the API gateway tells you nothing. A trace shows the error originated as a timeout in the recommendations service calling a third-party embedding API — 7 hops deep. Navigate directly to the failing span.
Tracing backends automatically derive your actual service dependency graph from real traffic. No manual documentation required. Discover undocumented dependencies and fanout patterns your architecture diagrams don't show.
LLM API calls (OpenAI, Anthropic, local models) are often the dominant latency and cost contributor in AI-augmented services. Trace each LLM call as a span carrying the model, token count, latency, cost, and prompt hash. Correlate LLM latency spikes with user-facing degradation.
Core Concepts
Distributed tracing has a small, precise vocabulary. Understanding these five concepts is sufficient to work with any tracing system — OpenTelemetry, Jaeger, Zipkin, or a commercial APM.
| Concept | Definition | Analogy |
|---|---|---|
| Trace | The complete end-to-end record of a single request as it flows through a distributed system. A tree of spans sharing the same trace_id. | The full itinerary of a package shipment |
| Span | A single unit of work within a trace. Has a name, span_id, parent span_id, start timestamp, duration, status, and optional attributes/events. | One leg of the shipment journey (city → city) |
| Trace Context | The propagated headers (traceparent, tracestate) that link spans across service boundaries. Injected by the caller, extracted by the callee. | The tracking number printed on the label |
| SpanContext | The immutable, serializable identity of a span: trace_id (128-bit), span_id (64-bit), trace flags, trace state. Propagated in-process and cross-process. | The barcode on the package |
| Baggage | Key-value pairs propagated with the trace context across service boundaries. Useful for passing user ID, tenant, feature flag, or correlation metadata without modifying every service's API. | Notes written on the shipping label |
Span Anatomy
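The fields listed for a span in the table above can be sketched as a plain data structure. This is an illustrative model only, not the OpenTelemetry SDK's span class; the `Span` and `new_span` names and the millisecond duration helper are assumptions made for this example.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Illustrative span record (not the OpenTelemetry SDK class)."""
    name: str
    trace_id: str                   # 32 hex chars (128-bit), shared by every span in the trace
    span_id: str                    # 16 hex chars (64-bit), unique per span
    parent_span_id: Optional[str]   # None for the root span
    start_ns: int
    end_ns: Optional[int] = None
    status: str = "UNSET"           # UNSET, OK, or ERROR
    attributes: dict = field(default_factory=dict)
    events: list = field(default_factory=list)

    def end(self) -> None:
        """Mark the span as finished."""
        self.end_ns = time.time_ns()

    @property
    def duration_ms(self) -> float:
        if self.end_ns is None:
            raise ValueError("span has not ended")
        return (self.end_ns - self.start_ns) / 1e6


def new_span(name: str, parent: Optional[Span] = None) -> Span:
    """Start a root span, or a child that inherits the parent's trace_id."""
    return Span(
        name=name,
        trace_id=parent.trace_id if parent else secrets.token_hex(16),
        span_id=secrets.token_hex(8),
        parent_span_id=parent.span_id if parent else None,
        start_ns=time.time_ns(),
    )
```

A child created with `new_span("db.query", parent=root)` shares the root's `trace_id` and points at the root's `span_id`, which is what lets a backend rebuild the tree.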
W3C TraceContext Headers
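The `traceparent` header carries four dash-separated lowercase hex fields: version, trace_id (32 chars), parent span_id (16 chars), and trace flags (2 chars), where the lowest flag bit means "sampled". The sketch below parses and formats the version `00` layout only; the function and type names are assumptions for illustration, and production code would normally rely on an OpenTelemetry propagator instead.

```python
import re
from typing import NamedTuple, Optional

# Layout of a version-00 traceparent value: version-traceid-parentid-flags
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class ParsedContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[ParsedContext]:
    """Return the parsed context, or None if the header is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the W3C spec.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return ParsedContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def format_traceparent(ctx: ParsedContext) -> str:
    """Serialize a context back into a version-00 header value."""
    return f"00-{ctx.trace_id}-{ctx.span_id}-{'01' if ctx.sampled else '00'}"
```

Returning `None` for a bad header lets the callee fall back to starting a new trace rather than failing the request.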
OpenTelemetry
OpenTelemetry (OTel) is the CNCF standard for emitting traces, metrics, and logs from applications. It replaces the fragmented ecosystem of Jaeger clients, Zipkin clients, and vendor SDKs with a single, vendor-neutral API and SDK. Instrument once, export to any backend.
Jaeger
Jaeger (pronounced "YAY-ger") is a CNCF graduated project, originally developed by Uber. It is the production standard for large-scale distributed tracing — with support for Cassandra, Elasticsearch, and OpenSearch backends, adaptive sampling, service dependency graphs, and a mature query UI. The v2 architecture unifies ingestion via the OTel Collector.
Components
- Jaeger Collector: ingests OTLP over gRPC/HTTP, writes to storage
- Jaeger Query: serves the UI and the Jaeger Query API
- Storage backends: Elasticsearch, OpenSearch, Cassandra, Badger (dev), ClickHouse (plugin)
- Sampling service: serves adaptive sampling strategies to instrumented clients
Key Features
- Native OTLP ingestion (Jaeger v2+)
- Adaptive sampling that adjusts per-service rates dynamically
- Service dependency graph from traces
- Deep span search with tag filters, duration ranges, error filters
- Trace comparison (side-by-side waterfall)
- Grafana data source plugin
Zipkin
Zipkin was the first widely adopted open-source distributed tracing system, created by Twitter in 2012 (based on Google Dapper). It pioneered the B3 propagation format and remains one of the simplest ways to get traces working, especially in Spring Boot ecosystems: Spring Cloud Sleuth has first-class Zipkin support out of the box on Spring Boot 2.x, and Spring Boot 3 replaces Sleuth with Micrometer Tracing.
Components
- Zipkin Server: single process combining collector, storage, UI, and API
- Storage: in-memory (dev), MySQL, Cassandra, Elasticsearch
- Transport: HTTP (POST /api/v2/spans), Kafka, Scribe
- Reporter: library-side component that buffers and flushes spans to the server
Best Suited For
- Spring Boot services: native integration, zero extra config
- Smaller deployments preferring simplicity over scalability
- Existing B3-instrumented stacks being migrated incrementally
- Teams wanting a minimal single-binary deployment
- Development/staging environments
The OTel Collector provides a zipkinexporter, so you don't need Zipkin-specific SDKs. Instrument with OTel once and export to Zipkin (dev), Jaeger (staging), and Grafana Tempo (production) simultaneously via the Collector pipeline.
Jaeger vs Zipkin
| Dimension | Jaeger | Zipkin |
|---|---|---|
| Maturity | CNCF Graduated — production battle-tested at Uber scale | Mature — Twitter-born, 12 years production use |
| Ingest Protocol | OTLP-native (v2), Jaeger Thrift, Zipkin JSON/Thrift | Zipkin JSON/Proto (B3 is its propagation format); OTel via Collector |
| Scale | High — designed for millions of spans/sec with ES/Cassandra | Medium — sufficient for most orgs; ES backend required at scale |
| UI quality | Full-featured: trace comparison, service graph, search | Clean, minimal — good for trace lookup, less analytical |
| Adaptive Sampling | YES — per-service, dynamic, server-driven | NO — client-side only, manual configuration |
| Deployment complexity | Medium — Operator recommended; more components | Low — single binary, trivial to start |
| Spring Boot integration | Via OTel SDK | Native — Spring Cloud Sleuth, zero config |
| Grafana integration | Built-in data source | Built-in data source, or via OTel Collector → Grafana Tempo |
| Choose when | Kubernetes, high volume, need adaptive sampling, Grafana stack | Spring Boot shop, simplicity priority, getting started fast |
Jaeger Strengths
- Native OTLP ingestion in v2, no adapter needed
- Adaptive sampling adjusts per service automatically
- Better UI: trace comparison, dependency graph, deep search
- Kubernetes-native: Jaeger Operator manages the full stack
- Multi-tenant support via tenant headers
Zipkin Strengths
- Simpler architecture, one binary to start
- Spring Boot zero-config integration (Sleuth)
- Lower operational overhead for smaller teams
- B3 format ubiquitous in older instrumented stacks
- Faster to get running from zero
8-Service Trace Walkthrough
A user submits a chat message to an AI assistant app. The request traverses 8 microservices and involves 3 LLM API calls, 4 database queries, 2 cache operations, and 1 background queue. Total latency: 3,841ms. Without tracing, you'd never know the LLM orchestrator alone consumed 2,900ms.
The llm-router/generate span with claude-3-5-sonnet accounts for 2,208ms, or 57% of total latency. Without tracing, you'd see the API is slow but have no idea where to optimize. With the trace, you immediately know where to look: investigate caching LLM responses, streaming, or prompt compression for the generate step.
The Span Tree for This Request
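A minimal sketch of how a span tree like this one can be analyzed programmatically. Only the 3,841 ms total, the 2,900 ms orchestrator figure, and the 2,208 ms llm-router/generate figure come from the walkthrough; every other span name and duration below is made up for illustration, as are the helper names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SpanRecord:
    span_id: str
    parent_id: Optional[str]
    name: str
    duration_ms: int


def slowest_leaf(spans: List[SpanRecord]) -> SpanRecord:
    """Return the longest span that has no children."""
    parent_ids = {s.parent_id for s in spans if s.parent_id is not None}
    leaves = [s for s in spans if s.span_id not in parent_ids]
    return max(leaves, key=lambda s: s.duration_ms)


def share_of_root(span: SpanRecord, spans: List[SpanRecord]) -> float:
    """Fraction of the root span's duration taken by the given span."""
    root = next(s for s in spans if s.parent_id is None)
    return span.duration_ms / root.duration_ms


SPANS = [
    SpanRecord("a", None, "api-gateway/chat", 3841),        # total from the walkthrough
    SpanRecord("b", "a", "auth/verify", 40),                # hypothetical
    SpanRecord("c", "a", "llm-orchestrator/handle", 2900),  # duration from the walkthrough, name assumed
    SpanRecord("d", "c", "llm-router/generate", 2208),      # from the walkthrough
    SpanRecord("e", "c", "context-service/load", 310),      # hypothetical
]

if __name__ == "__main__":
    hot = slowest_leaf(SPANS)
    print(f"{hot.name}: {hot.duration_ms} ms ({share_of_root(hot, SPANS):.0%} of total)")
```

Looking at leaf spans rather than the longest span overall avoids always reporting the root or an intermediate orchestrator whose time is mostly spent waiting on its children.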
Tracing LLM Calls
LLM API calls are unlike normal HTTP calls. They have unique observability requirements: token counts (prompt vs. completion), model name and version, cost attribution, finish reason (stop/length/content_filter), streaming vs. non-streaming latency, and prompt content (with PII redaction). Each LLM call deserves a rich, dedicated span.
The OTel GenAI semantic conventions define attributes under the gen_ai.* namespace. Use these to ensure compatibility with observability platforms that understand GenAI semantics: gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.response.finish_reasons.
What LLM Span Attributes Enable
Sum llm.cost.input_usd + llm.cost.output_usd across all LLM spans in a trace. Aggregate by user.id, tenant.id, or feature to understand who's driving LLM costs. Build dashboards that break down inference spend by product surface.
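The cost roll-up described above can be sketched as a small aggregation over exported span attributes. The example assumes each span is available as a flat attribute dictionary and that LLM spans can be recognized by the presence of `gen_ai.request.model`; the function name and sample values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable


def llm_cost_by(spans: Iterable[dict], group_key: str) -> Dict[str, float]:
    """Sum input + output LLM cost per value of group_key (e.g. tenant.id)."""
    totals: Dict[str, float] = defaultdict(float)
    for attrs in spans:
        if "gen_ai.request.model" not in attrs:
            continue  # skip spans that are not LLM calls
        cost = attrs.get("llm.cost.input_usd", 0.0) + attrs.get("llm.cost.output_usd", 0.0)
        totals[attrs.get(group_key, "unknown")] += cost
    return dict(totals)


# Hypothetical span attributes for illustration.
SAMPLE_SPANS = [
    {"gen_ai.request.model": "model-a", "tenant.id": "acme",
     "llm.cost.input_usd": 0.012, "llm.cost.output_usd": 0.030},
    {"gen_ai.request.model": "model-a", "tenant.id": "acme",
     "llm.cost.input_usd": 0.004, "llm.cost.output_usd": 0.010},
    {"gen_ai.request.model": "model-b", "tenant.id": "globex",
     "llm.cost.input_usd": 0.002, "llm.cost.output_usd": 0.001},
    {"db.system": "postgresql", "tenant.id": "acme"},  # ignored: not an LLM span
]
```

Calling `llm_cost_by(SAMPLE_SPANS, "tenant.id")` groups spend per tenant; passing `"user.id"` or a feature attribute gives the other breakdowns mentioned above.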
Separate TTFT (time-to-first-token) from full generation time with span events. Identify whether latency is dominated by prompt processing or generation. Detect when a model has higher-than-usual latency indicating capacity issues on the provider side.
Track gen_ai.usage.cache_read_tokens vs gen_ai.usage.input_tokens. Measure cache hit rate per prompt template. Guide prompt engineering decisions: long system prompts benefit from prefix caching — but only if you can measure the impact.
Track gen_ai.response.finish_reasons across requests. High rate of length (truncation) means your max_tokens is too low. content_filter spikes indicate prompt injection attempts or edge-case inputs needing guardrails.
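As a rough illustration of the finish-reason tracking above, the following sketch computes the share of each finish reason across a batch of LLM spans. It assumes `gen_ai.response.finish_reasons` is stored as a list of strings per span; the function name and the alert threshold in the usage note are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable


def finish_reason_rates(llm_spans: Iterable[dict]) -> Dict[str, float]:
    """Return the fraction of responses per finish reason."""
    counts: Counter = Counter()
    for attrs in llm_spans:
        for reason in attrs.get("gen_ai.response.finish_reasons", []):
            counts[reason] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {reason: n / total for reason, n in counts.items()}
```

A dashboard or alert could then flag, for example, a `length` rate above some chosen threshold as a sign that max_tokens is set too low.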
Context Propagation
Context propagation is the mechanism that links spans across service boundaries into a single trace. The calling service injects the current span context into outgoing request headers. The receiving service extracts it and creates its child span with that parent. Without correct propagation, you get disconnected single-service traces.
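The inject and extract steps can be sketched with a plain dictionary standing in for HTTP headers. This is a simplified model that skips validation, not the OpenTelemetry propagator API; the service names and helper functions are hypothetical.

```python
import secrets
from typing import Optional


def start_span(name: str, parent_ctx: Optional[dict] = None) -> dict:
    """Create a span; reuse the parent's trace_id when a context was extracted."""
    return {
        "name": name,
        "trace_id": parent_ctx["trace_id"] if parent_ctx else secrets.token_hex(16),
        "span_id": secrets.token_hex(8),
        "parent_span_id": parent_ctx["span_id"] if parent_ctx else None,
    }


def inject(span: dict, headers: dict) -> None:
    """Caller side: write the current span context into outgoing headers."""
    headers["traceparent"] = f"00-{span['trace_id']}-{span['span_id']}-01"


def extract(headers: dict) -> Optional[dict]:
    """Callee side: read the parent context, or None if it is absent or malformed."""
    parts = headers.get("traceparent", "").split("-")
    if len(parts) != 4:
        return None
    _version, trace_id, span_id, _flags = parts
    return {"trace_id": trace_id, "span_id": span_id}


def checkout_service(headers: dict) -> dict:
    """Hypothetical callee: its span becomes a child of the extracted context."""
    return start_span("checkout/handle", extract(headers))


def gateway() -> tuple:
    """Hypothetical caller: starts a span, injects it, and calls the callee."""
    caller = start_span("gateway/request")
    headers: dict = {}
    inject(caller, headers)
    return caller, checkout_service(headers)
```

If the caller forgets to inject, `checkout_service({})` starts a span with a fresh trace_id, which is exactly the disconnected single-service trace described above.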
For asynchronous flows such as queues, give the consumer span a link to the producer span, indicating a relationship without implying direct parent-child causality. This is the correct model for fanout patterns.
Auto vs Manual Instrumentation
OTel auto-instrumentation uses bytecode manipulation (Java), monkey-patching (Python), or module wrapping (Node.js) to instrument popular libraries without code changes. It's the right starting point — but insufficient alone for business-meaningful observability.
| Layer | Auto-Instrumented | Manual Required |
|---|---|---|
| HTTP | YES — all inbound/outbound HTTP spans, status codes, routes | Business context: user.id, tenant.id, feature.flag |
| Databases | YES — query spans with statement text, table name, duration | Slow query context: row count, index used, affected records |
| Redis / Memcached | YES — command + key pattern, latency | Hit/miss rate as span events, cache key namespace |
| gRPC | YES — method, status code, propagation | Payload size, retry count, deadline remaining |
| LLM APIs | PARTIAL — openai-python has OTel contrib; Anthropic manual only | Token counts, cost, finish reason, model, prompt hash — all manual |
| Message queues | PARTIAL — Kafka, RabbitMQ instrumentations available | Consumer lag, queue depth at time of processing, span links |
| Business logic | NO | Everything: business operations, user intent, A/B variant, recommendation algorithm used |
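The monkey-patching approach that Python auto-instrumentation relies on can be illustrated with a small wrapper that replaces a library method at runtime. This is a sketch of the general technique, not the OpenTelemetry instrumentation packages; `FakeDbClient`, `instrument`, and the in-memory `RECORDED_SPANS` list are stand-ins invented for the example.

```python
import time
from functools import wraps

RECORDED_SPANS = []  # stand-in for a span exporter


def instrument(owner, attr_name: str, span_name: str):
    """Replace owner.attr_name with a wrapper that records a span per call."""
    original = getattr(owner, attr_name)

    @wraps(original)
    def traced(*args, **kwargs):
        start = time.perf_counter()
        status = "OK"
        try:
            return original(*args, **kwargs)
        except Exception:
            status = "ERROR"
            raise
        finally:
            RECORDED_SPANS.append({
                "name": span_name,
                "status": status,
                "duration_ms": (time.perf_counter() - start) * 1000,
            })

    setattr(owner, attr_name, traced)
    return original  # returned so the patch can be undone


class FakeDbClient:
    """Hypothetical stand-in for a third-party database client."""

    def query(self, sql: str):
        if "DROP" in sql:
            raise PermissionError("refused")
        return [("row-1",)]
```

After `instrument(FakeDbClient, "query", "db.query")`, every existing call site produces a span without any code change. The wrapper knows nothing about user.id or tenant.id, though, which is why the manual column in the table still matters.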
Sampling Strategies
At scale — hundreds of services, thousands of requests per second — tracing 100% of requests is prohibitively expensive. Sampling controls which traces are recorded and stored. The key insight: you need representative coverage of normal traffic, but always need to capture errors, high latency, and critical business flows.
Head-Based Sampling
The sampling decision is made at the start of a request, at the root span. All downstream services inherit the decision via the sampled flag in the traceparent header. Simple to implement, but you decide to sample before you know if the trace will be interesting.
- Probabilistic: sample X% of all traces
- Rate-limiting: max N traces/second regardless of volume
- Always-on for flagged requests: sample 100% when an attribute known at the root marks the request as important. A 5xx status is not known yet at this point, so capturing every error requires tail-based sampling
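A probabilistic head-based decision is usually derived from the trace_id itself, so every service that evaluates the same ratio reaches the same answer for a given trace. The sketch below follows that idea by comparing the low 64 bits of the trace_id against a threshold; it is written in the spirit of ratio-based samplers and is not a copy of any SDK's implementation.

```python
def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Deterministically sample roughly `ratio` of traces based on trace_id."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    # Treat the low 64 bits of the (random) trace_id as a uniform number.
    low_64_bits = int(trace_id_hex[-16:], 16)
    return low_64_bits < int(ratio * (1 << 64))


def trace_flags(sampled: bool) -> str:
    """Value of the traceparent flags field that downstream services inherit."""
    return "01" if sampled else "00"
```

Because the decision depends only on the trace_id and the ratio, no coordination between services is needed beyond propagating the header.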
Tail-Based Sampling
The sampling decision is made after the trace is complete: all spans are buffered, then a policy evaluates the entire trace. This captures slow, erroneous, or unusual traces that head-based sampling would discard. Requires a stateful collector (OTel Collector tail-sampling processor, Grafana Tempo).
- Latency-based: keep all traces where total > 2000ms
- Error-based: keep all traces containing error spans
- Attribute-based: always keep VIP user traces
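The three policies above can be sketched as a single function evaluated over a fully buffered trace. The span dictionary layout, the `user.tier` value `"vip"`, and the function name are assumptions for illustration; a real deployment would express these as collector policies rather than application code.

```python
from typing import List, Optional, Sequence


def tail_sampling_decision(
    spans: List[dict],
    latency_threshold_ms: float = 2000.0,
    keep_tiers: Sequence[str] = ("vip",),
) -> Optional[str]:
    """Return the name of the first matching policy, or None to drop the trace."""
    # Latency-based: wall-clock extent of the whole trace.
    total_ms = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    if total_ms > latency_threshold_ms:
        return "latency"
    # Error-based: any span marked as an error.
    if any(s.get("status") == "ERROR" for s in spans):
        return "error"
    # Attribute-based: any span tagged with a tier that must always be kept.
    if any(s.get("attributes", {}).get("user.tier") in keep_tiers for s in spans):
        return "attribute"
    return None
```

Traces that match no policy would typically still pass through a low-rate probabilistic policy so that normal traffic stays represented.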
Span Attributes
Span attributes are the metadata that make a trace actually searchable and debuggable. OTel publishes semantic conventions — standardized attribute names for common operations. Use them to ensure your traces are compatible with tracing backends and APM platforms without custom parsing.
| Category | Required Attributes | Optional |
|---|---|---|
| All Services | service.name, service.version, deployment.environment | k8s.pod.name, k8s.namespace.name, host.id |
| HTTP Server | http.method, http.route, http.response.status_code | http.request.body.size, net.peer.ip, http.user_agent |
| HTTP Client | http.method, server.address, http.response.status_code | http.request.body.size, net.peer.port |
| Database | db.system, db.name, db.statement (sanitized), db.operation | db.sql.table, db.rows_affected, server.address |
| LLM / GenAI | gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens | gen_ai.response.finish_reasons, gen_ai.response.id, llm.cost.usd |
| Messaging | messaging.system, messaging.destination.name, messaging.operation | messaging.message.body.size, messaging.kafka.offset |
| Business | user.id, tenant.id | feature.flag, ab.experiment, user.tier, domain-specific fields |
Never put PII in span attributes: use user.id (opaque), never user.email or user.name.
Debugging with Traces
A trace is the shortest path from a user complaint to a root cause. This section describes the five most common debugging workflows and how to execute them in Jaeger or Zipkin.
Search Jaeger: service=api-gateway, tag=user.id:usr_9kx2z, min-duration=2s, last 1h. Click the highest-latency trace. Expand the waterfall. Sort spans by duration. The longest non-root span is your starting point for investigation.
Search: service=order-service, tag=error:true, last 30m. Find a representative error trace. The red (error) span tells you the exact service, span name, and exception. The span events contain the full stack trace.
Find a trace with high database span count. Filter spans by db.system=postgresql. 30 identical SELECT spans in a loop reveals the N+1 instantly — you can see the parent span that spawned them and the caller code via code.filepath attributes.
Find the trace by trace_id from your LLM cost anomaly alert. Inspect the llm-router/generate span. Check gen_ai.usage.input_tokens: was there context accumulation? Check the upstream context-service span for how much history was loaded. The causal chain is visible.
Search by duration range overlapping the timeout SLA. Compare 5 slow traces against 5 fast traces using Jaeger's compare view. Look for which span is consistently longer in slow traces — retry storms, connection pool exhaustion, and GC pauses all show distinct patterns.
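The N+1 workflow above lends itself to automation: group database spans by parent and sanitized statement, then flag groups that repeat too often. The sketch assumes spans are exported as dictionaries with a `parent_span_id` and a `db.statement` attribute; the threshold of 10 and the function name are arbitrary choices for illustration.

```python
from collections import Counter
from typing import List, Tuple


def find_n_plus_one(spans: List[dict], threshold: int = 10) -> List[Tuple[str, str, int]]:
    """Return (parent_span_id, statement, count) for repeated identical queries."""
    groups: Counter = Counter()
    for span in spans:
        statement = span.get("attributes", {}).get("db.statement")
        if statement is not None:
            groups[(span["parent_span_id"], statement)] += 1
    return [
        (parent, statement, count)
        for (parent, statement), count in groups.items()
        if count >= threshold
    ]
```

This only works when statements are sanitized (parameters replaced by placeholders), since otherwise each query in the loop looks different.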
Performance Analysis
The trace waterfall is your primary tool for understanding where time goes. Learn to read it fluently: gaps between spans reveal network overhead; spans that start late reveal queuing; parallel spans that collapse into a long sequential chain reveal concurrency bugs.
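Two of the waterfall patterns above, gaps between spans and work that runs sequentially when it could be parallel, can be checked mechanically for the children of one parent span. The sketch assumes each child has `start_ms` and `end_ms` offsets relative to the parent; the field and function names are illustrative.

```python
from typing import List, Tuple


def find_gaps(children: List[dict], min_gap_ms: float = 0.0) -> List[Tuple[str, str, float]]:
    """Return (previous_span, next_span, gap_ms) where no child was running."""
    ordered = sorted(children, key=lambda s: s["start_ms"])
    gaps = []
    latest_end = None
    latest_name = None
    for span in ordered:
        if latest_end is not None and span["start_ms"] - latest_end > min_gap_ms:
            gaps.append((latest_name, span["name"], span["start_ms"] - latest_end))
        # Track the span that finishes last so overlapping spans do not hide gaps.
        if latest_end is None or span["end_ms"] > latest_end:
            latest_end, latest_name = span["end_ms"], span["name"]
    return gaps


def runs_sequentially(children: List[dict]) -> bool:
    """True when no two sibling spans overlap in time."""
    ordered = sorted(children, key=lambda s: s["start_ms"])
    return all(cur["start_ms"] >= prev["end_ms"] for prev, cur in zip(ordered, ordered[1:]))
```

A set of independent calls for which `runs_sequentially` returns True is a candidate for parallelization, and large values from `find_gaps` point at uninstrumented work, queuing, or network overhead.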
If a database span shows unexplained wait time, add db.wait_ms as a span event to confirm.
Deployment & Storage
| Backend | Scale | Cost |
|---|---|---|
| In-memory | Dev only | Free |
| Badger | Single node | Free |
| Elasticsearch | High | Medium |
| OpenSearch | High | Medium |
| Cassandra | Very High | High ops |
| ClickHouse | High | Low |
| Grafana Tempo | Cloud-scale | Object store |
- Run Collector as a DaemonSet (one per node) for local span collection — low latency, no network hops from pods
- Or as a Deployment (central gateway) for batching, fan-out, and tail sampling
- Use both: DaemonSet forwards to a central gateway Deployment that does tail sampling
- Size memory: 2–4GB per collector at 10k spans/sec with tail sampling enabled
- Always set the memory_limiter processor to prevent OOM under spikes
Tracing Maturity Model
Most organizations start with ad-hoc traces on a few services and build toward full-system observability. This model shows the progression. Focus on getting all services to Level 2 before perfecting any one service at Level 4.
- No distributed tracing
- Log correlation by request ID only
- Manual service topology diagrams
- Latency debugging = log trawling
- No cross-service visibility
- OTel auto-instrumentation on HTTP
- Traces visible in Jaeger/Zipkin
- Basic sampling (probabilistic)
- Some services missing context propagation
- No LLM / queue tracing
- All services traced, propagation correct
- DB, cache, queue spans included
- Tail-based sampling for errors/slowness
- Business attributes on key spans
- OTel Collector pipeline in production
- LLM calls as rich spans with tokens/cost
- Span links for async/queue flows
- Traces correlated with logs & metrics
- Adaptive sampling per-service
- Automated anomaly detection on traces
OpenTelemetry documentation at opentelemetry.io. Jaeger documentation at jaegertracing.io. W3C TraceContext spec at w3.org/TR/trace-context. OTel GenAI semantic conventions at opentelemetry.io/docs/specs/semconv/gen-ai. Grafana Tempo for cost-efficient trace storage at object-store scale.