Distributed Tracing — Field Handbook
TRACE-HDK-2024

Distributed
Tracing

// "If you can't see the path, you can't fix the bottleneck."

The operational guide to observing requests as they traverse microservices, databases, queues, and LLM APIs. Covers OpenTelemetry instrumentation, Jaeger and Zipkin deployment, context propagation, sampling strategy, and end-to-end debugging patterns.

Jaeger Zipkin OpenTelemetry LLM Tracing W3C TraceContext Debugging
01

Why Distributed Tracing

// THE OBSERVABILITY GAP

In a monolith, a slow request is obvious — profile it, find the slow function, fix it. In a distributed system spanning dozens of microservices, the request is a ghost: it enters your API gateway and touches 8 services, 3 databases, 2 queues, and 3 LLM API calls before a response returns. Which hop added the 4 seconds? Logs can't tell you. Metrics can't tell you. Traces can.

Distributed tracing reconstructs the complete journey of a request as a causally-linked tree of spans — one per operation, each with a start time, duration, service name, and any relevant metadata. It answers the questions that logs and metrics fundamentally cannot.

Logs — What Happened

Discrete events per service. No cross-service correlation. You need a request ID to join logs across services — and you still can't see the timing relationships.

Metrics — How Much

Aggregated rates and latencies per service. No per-request view. You know the p99 of the order service is 2s but not which downstream call caused it for which request.

Traces — Why & Where

Per-request causal graph of every operation. Exactly which service, which database query, which LLM call caused the slowdown — for a specific user's specific request.

The three pillars of observability: Logs, metrics, and traces are complementary — not alternatives. Traces tell you where the problem is. Metrics tell you it's happening in aggregate. Logs tell you what data was involved. All three should be correlated: traces enriched with log links, metrics tagged with exemplar trace IDs.

What Tracing Solves That Nothing Else Does

Latency Attribution Critical

Which downstream dependency added 3.8 seconds to this specific request? Was it the database query in service-6, or the LLM call in service-4? Traces give you the exact span with its duration, parent, and all child spans — in one view.

Error Root Cause Debug

A 500 from the API gateway tells you nothing. A trace shows the error originated as a timeout in the recommendations service calling a third-party embedding API — 7 hops deep. Navigate directly to the failing span.

Dependency Mapping Topology

Tracing backends automatically derive your actual service dependency graph from real traffic. No manual documentation required. Discover undocumented dependencies and fanout patterns your architecture diagrams don't show.

LLM Cost & Latency Visibility AI/ML

LLM API calls (OpenAI, Anthropic, local models) are the dominant latency and cost contributor in AI-augmented services. Trace each LLM call as a span: model, token count, latency, cost, prompt hash. Correlate LLM latency spikes with user-facing degradation.

02

Core Concepts

// TRACES, SPANS, CONTEXT

Distributed tracing has a small, precise vocabulary. Understanding these five concepts is sufficient to work with any tracing system — OpenTelemetry, Jaeger, Zipkin, or a commercial APM.

| Concept | Definition | Analogy |
| --- | --- | --- |
| Trace | The complete end-to-end record of a single request as it flows through a distributed system. A tree of spans sharing the same trace_id. | The full itinerary of a package shipment |
| Span | A single unit of work within a trace. Has a name, span_id, parent span_id, start timestamp, duration, status, and optional attributes/events. | One leg of the shipment journey (city → city) |
| Trace Context | The propagated headers (traceparent, tracestate) that link spans across service boundaries. Injected by the caller, extracted by the callee. | The tracking number printed on the label |
| SpanContext | The immutable, serializable identity of a span: trace_id (128-bit), span_id (64-bit), trace flags, trace state. Propagated in-process and cross-process. | The barcode on the package |
| Baggage | Key-value pairs propagated with the trace context across service boundaries. Useful for passing user ID, tenant, feature flag, or correlation metadata without modifying every service's API. | Notes written on the shipping label |

Span Anatomy

JSON — Span Structure (OTLP)
{
  "traceId": "3fa85f645717456293fc2c963f66afa6",   // 32 hex chars, no dashes
  "spanId": "a1b2c3d4e5f60001",
  "parentSpanId": "0000000000000001",   // absent on root span
  "name": "POST /v1/recommendations",
  "kind": "SERVER",   // SERVER | CLIENT | PRODUCER | CONSUMER | INTERNAL
  "startTimeUnixNano": 1715000000000000000,
  "endTimeUnixNano": 1715000000843000000,   // 843ms
  "status": { "code": "OK" },   // OK | ERROR | UNSET
  "attributes": {
    // HTTP semantic conventions (OpenTelemetry semconv)
    "http.request.method": "POST",
    "http.route": "/v1/recommendations",
    "http.response.status_code": 200,
    "server.address": "recommendations.svc.cluster.local",
    "http.request.body.size": 1024,
    // Custom business attributes
    "user.id": "usr_9kx2z",
    "recommendation.count": 12,
    "model.used": "colbert-v2"
  },
  "events": [
    {
      "name": "cache.miss",
      "timeUnixNano": 1715000000012000000,
      "attributes": { "cache.key": "recs:usr_9kx2z:v2" }
    },
    {
      "name": "embedding.computed",
      "timeUnixNano": 1715000000524000000,
      "attributes": { "embedding.dims": 768, "embedding.model": "text-embedding-ada-002" }
    }
  ],
  "links": [
    {
      // Link to async span in a queue processor — cross-trace relationship
      "traceId": "9c4fd3a1...",
      "spanId": "b7c2e4f8...",
      "attributes": { "link.type": "enqueued_by" }
    }
  ]
}

W3C TraceContext Headers

HTTP Headers — W3C TraceContext (W3C Recommendation)
# The standard propagation format — use this over Zipkin B3 for new systems
traceparent: 00-3fa85f645717456293fc2c963f66afa6-a1b2c3d4e5f60001-01
#            ^^ version (currently always 00)
#               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ trace-id (128-bit hex)
#                                                ^^^^^^^^^^^^^^^^ parent-span-id (64-bit hex)
#                                                                 ^^ flags: 01=sampled, 00=not sampled
tracestate: jaeger=a1b2c3d4e5f60001,vendor2=xyz
# tracestate: vendor-specific key-value pairs, passed through unchanged

# B3 format (Zipkin-originated, still common in older stacks)
X-B3-TraceId: 3fa85f645717456293fc2c963f66afa6
X-B3-SpanId: a1b2c3d4e5f60001
X-B3-ParentSpanId: 0000000000000001
X-B3-Sampled: 1

# B3 single-header format: {trace-id}-{span-id}-{sampled}-{parent-span-id}
b3: 3fa85f645717456293fc2c963f66afa6-a1b2c3d4e5f60001-1-0000000000000001
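To make the traceparent layout concrete, here is a minimal standard-library sketch that validates and splits a version-00 header. The names `TraceParent`, `parse_traceparent`, and `format_traceparent` are illustrative, not part of any OpenTelemetry API; in a real service the OTel propagators do this for you.

```python
import re
from typing import NamedTuple, Optional

# version-traceid-parentid-flags, lowercase hex, as in the header shown above.
# This sketch only accepts the four-field version-00 layout.
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff is reserved as invalid, and all-zero IDs are not allowed.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    sampled = bool(int(flags, 16) & 0x01)  # lowest bit is the sampled flag
    return TraceParent(version, trace_id, parent_id, sampled)


def format_traceparent(tp: TraceParent) -> str:
    """Serialize back to a header; only the sampled bit is preserved."""
    flags = "01" if tp.sampled else "00"
    return f"{tp.version}-{tp.trace_id}-{tp.parent_id}-{flags}"
```

A callee that receives a malformed header should treat the request as having no parent and start a new trace, which is why the parser returns `None` instead of raising.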
03

OpenTelemetry

// THE VENDOR-NEUTRAL INSTRUMENTATION STANDARD

OpenTelemetry (OTel) is the CNCF standard for emitting traces, metrics, and logs from applications. It replaces the fragmented ecosystem of Jaeger clients, Zipkin clients, and vendor SDKs with a single, vendor-neutral API and SDK. Instrument once, export to any backend.

The architecture: Your application uses the OTel API (no-op without SDK). The OTel SDK processes and batches telemetry. The OTel Collector (optional but recommended) receives OTLP from services, transforms it, and fans out to multiple backends — Jaeger, Zipkin, Prometheus, Grafana Tempo, Datadog — simultaneously.
Python — OTel SDK Setup + OTLP Exporter
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource

# 1. Define service resource
resource = Resource.create({
    "service.name": "recommendations-service",
    "service.version": "2.4.1",
    "deployment.environment": "production",
    "k8s.namespace.name": "prod-api",
})

# 2. Configure exporter → OTel Collector (which fans out to Jaeger/Zipkin)
exporter = OTLPSpanExporter(
    endpoint="http://otel-collector.observability.svc:4317",
    # Or directly to Jaeger: endpoint="http://jaeger-collector:4317"
)

# 3. Build provider with batch processor
provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(
    exporter,
    max_queue_size=2048,
    max_export_batch_size=512,
    export_timeout_millis=30000,
))
trace.set_tracer_provider(provider)

# 4. Get a tracer — use this throughout your service
tracer = trace.get_tracer("recommendations.core", "2.4.1")

# 5. Create and annotate spans manually
def get_recommendations(user_id: str, context_items: list):
    with tracer.start_as_current_span("get_recommendations") as span:
        span.set_attributes({
            "user.id": user_id,
            "context.size": len(context_items),
        })
        try:
            # _fetch_recs is the service's own lookup function (not shown)
            result = _fetch_recs(user_id, context_items)
            span.set_attribute("recommendations.returned", len(result))
            return result
        except Exception as e:
            span.record_exception(e)
            span.set_status(trace.StatusCode.ERROR, str(e))
            raise
TypeScript / Node.js — OTel Auto-Instrumentation
// instrumentation.ts — load BEFORE all other imports (--require flag)
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { Resource } from '@opentelemetry/resources';
import { ATTR_SERVICE_NAME } from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
  resource: new Resource({ [ATTR_SERVICE_NAME]: 'api-gateway' }),
  traceExporter: new OTLPTraceExporter({ url: 'http://otel-collector:4317' }),
  instrumentations: [
    getNodeAutoInstrumentations({
      // Auto-instruments: http, express, grpc, pg, redis, mongodb, aws-sdk ...
      '@opentelemetry/instrumentation-http': { enabled: true },
      '@opentelemetry/instrumentation-express': { enabled: true },
      '@opentelemetry/instrumentation-pg': { enabled: true },
      '@opentelemetry/instrumentation-redis': { enabled: true },
    })
  ]
});

sdk.start();
// → Every HTTP request, DB query, Redis call now generates spans automatically
04

Jaeger

// CNCF GRADUATED — PRODUCTION-GRADE DISTRIBUTED TRACING

Jaeger (pronounced "YAY-ger") is a CNCF graduated project, originally developed by Uber. It is the production standard for large-scale distributed tracing — with support for Cassandra, Elasticsearch, and OpenSearch backends, adaptive sampling, service dependency graphs, and a mature query UI. The v2 architecture unifies ingestion via the OTel Collector.

Jaeger v2 Architecture Current
  • Jaeger Collector — ingests OTLP over gRPC/HTTP, writes to storage
  • Jaeger Query — serves the UI and the Jaeger Query API
  • Storage backends — Elasticsearch, OpenSearch, Cassandra, Badger (dev), ClickHouse (plugin)
  • Sampling service — serves adaptive sampling strategies to agents
Key Features Capabilities
  • Native OTLP ingestion (Jaeger v2+)
  • Adaptive sampling — adjusts per-service rates dynamically
  • Service dependency graph from traces
  • Deep span search with tag filters, duration ranges, error filters
  • Trace comparison (side-by-side waterfall)
  • Grafana data source plugin
Docker Compose — Jaeger All-in-One (Dev)
# dev-tracing.yml — Jaeger all-in-one with in-memory storage (dev/test only)
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"   # Jaeger UI
      - "4317:4317"     # OTLP gRPC — send traces here
      - "4318:4318"     # OTLP HTTP
    environment:
      - COLLECTOR_OTLP_ENABLED=true
      - SPAN_STORAGE_TYPE=memory
      - MEMORY_MAX_TRACES=50000
Kubernetes — Jaeger Production via Jaeger Operator
# 1. Install Jaeger Operator
kubectl apply -f https://github.com/jaegertracing/jaeger-operator/releases/latest/download/jaeger-operator.yaml -n observability

# 2. Deploy Jaeger with Elasticsearch backend
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: production-jaeger
  namespace: observability
spec:
  strategy: production   # Deploys separate collector + query + ES
  collector:
    replicas: 3
    resources:
      limits: { cpu: "1", memory: "2Gi" }
    options:
      collector:
        queue-size: 2000
        num-workers: 50
  query:
    replicas: 2
    metricsPort: 8686
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: https://elasticsearch.observability.svc:9200
        index-prefix: jaeger
        tls.enabled: true
        num-shards: 5
        num-replicas: 1
  sampling:
    options:
      default_strategy:
        type: probabilistic
        param: 0.1   # 10% default; overridden per-service below
      service_strategies:
        - service: payment-service
          type: probabilistic
          param: 1.0   # 100% — always trace payments
        - service: analytics-service
          type: ratelimiting
          param: 5     # max 5 traces/sec regardless of traffic
05

Zipkin

// THE ORIGINAL — SIMPLE, FAST, BATTLE-TESTED

Zipkin was the first widely adopted open-source distributed tracing system, created by Twitter in 2012 (based on Google Dapper). It pioneered the B3 propagation format and remains the simplest path to get traces working, especially in Spring Boot ecosystems: Spring Cloud Sleuth (Spring Boot 2.x) and its successor Micrometer Tracing (Spring Boot 3.x) both support Zipkin out of the box.

Zipkin Architecture
  • Zipkin Server — single process: collector + storage + UI + API
  • Storage — In-memory (dev), MySQL, Cassandra, Elasticsearch
  • Transport — HTTP (POST /api/v2/spans), Kafka, Scribe
  • Reporter — library-side component that buffers and flushes spans to server
Best Fit For Use Cases
  • Spring Boot services — native integration, zero extra config
  • Smaller deployments preferring simplicity over scalability
  • Existing B3-instrumented stacks being migrated incrementally
  • Teams wanting a minimal single-binary deployment
  • Development/staging environments
Docker — Zipkin Quickstart
# Simplest possible start — in-memory storage
docker run -d -p 9411:9411 --name zipkin openzipkin/zipkin
# UI: http://localhost:9411

# With Elasticsearch backend (docker-compose)
services:
  zipkin:
    image: openzipkin/zipkin:latest
    ports:
      - "9411:9411"
    environment:
      - STORAGE_TYPE=elasticsearch
      - ES_HOSTS=http://elasticsearch:9200
      - ES_INDEX=zipkin
      - ES_INDEX_REPLICAS=1
      - ES_INDEX_SHARDS=5

# Spring Boot 3.x — application.yaml (zero code changes, Micrometer Tracing)
management:
  tracing:
    sampling:
      probability: 1.0   # 100% in dev
  zipkin:
    tracing:
      endpoint: http://zipkin:9411/api/v2/spans
# Spring Boot 2.x with Spring Cloud Sleuth used spring.zipkin.base-url
# (plus spring.zipkin.sender.type: web or kafka) instead.
OTel → Zipkin: The OpenTelemetry Collector can fan out to Zipkin using the zipkinexporter — so you don't need Zipkin-specific SDKs. Instrument with OTel once and export to Zipkin (dev), Jaeger (staging), and Grafana Tempo (production) simultaneously via the Collector pipeline.
06

Jaeger vs Zipkin

// CHOOSING THE RIGHT TOOL
| Dimension | Jaeger | Zipkin |
| --- | --- | --- |
| Maturity | CNCF Graduated — production battle-tested at Uber scale | Mature — Twitter-born, 12 years production use |
| Ingest Protocol | OTLP-native (v2), Jaeger Thrift, Zipkin B3 | Zipkin JSON/Proto, B3; OTel via Collector |
| Scale | High — designed for millions of spans/sec with ES/Cassandra | Medium — sufficient for most orgs; ES backend required at scale |
| UI quality | Full-featured: trace comparison, service graph, search | Clean, minimal — good for trace lookup, less analytical |
| Adaptive Sampling | YES — per-service, dynamic, server-driven | NO — client-side only, manual configuration |
| Deployment complexity | Medium — Operator recommended; more components | Low — single binary, trivial to start |
| Spring Boot integration | Via OTel SDK | Native — Spring Cloud Sleuth, zero config |
| Grafana integration | First-class data source | Via OTel Collector → Grafana Tempo |
| Choose when | Kubernetes, high volume, need adaptive sampling, Grafana stack | Spring Boot shop, simplicity priority, getting started fast |
⬡ Jaeger Strengths
  • Native OTLP ingestion in v2 — no adapter needed
  • Adaptive sampling adjusts per service automatically
  • Better UI: trace comparison, dependency graph, deep search
  • Kubernetes-native: Jaeger Operator manages the full stack
  • Multi-tenant support via tenant headers
◎ Zipkin Strengths
  • Simpler architecture — one binary to start
  • Spring Boot zero-config integration (Sleuth)
  • Lower operational overhead for smaller teams
  • B3 format ubiquitous in older instrumented stacks
  • Faster to get running from zero
07

8-Service Trace Walkthrough

// FOLLOWING A REQUEST END-TO-END

A user submits a chat message to an AI assistant app. The request traverses 8 microservices plus a shared llm-router, and makes 4 LLM API calls (3 completions and 1 embedding), 4 Postgres queries, 1 vector search, 1 cache lookup, and 1 Kafka publish. Total latency: 3,841ms. Without tracing, you'd never know the LLM orchestrator alone consumed 2,900ms.

TRACE: chat-completion — 3841ms total
api-gateway / POST /chat: 3841ms (root)
  auth-service / validateJWT: 38ms
  session-service / getSession: 12ms
  context-service / loadHistory: 384ms (db)
  rag-service / vectorSearch: 231ms (embed + search)
    ↳ llm-router / embedding: 118ms (ada-002)
  llm-orchestrator / complete: 2900ms (3 LLM hops)
    ↳ llm-router / classify: 268ms (gpt-4o-mini)
    ↳ llm-router / generate: 2208ms (claude-3-5-sonnet) ← BOTTLENECK
    ↳ llm-router / postprocess: 192ms (gpt-4o-mini)
  storage-service / saveMessage: 154ms (postgres)
  audit-service / enqueue: 96ms (kafka)
Legend: Root / Gateway, Service Call, LLM API Call, Database Query, Cache Hit/Miss, External (Kafka)
What the trace reveals immediately: The llm-router/generate span with claude-3-5-sonnet accounts for 2,208ms — 57% of total latency. Without tracing, you'd see the API is slow but have no idea where to optimize. With the trace, you immediately know: investigate caching LLM responses, streaming, or prompt compression for the generate step.

The Span Tree for This Request

Text — Span Hierarchy (Jaeger UI view)
TRACE: 3fa85f64...   Total: 3841ms   Service count: 8   Span count: 19

api-gateway [0ms → 3841ms] POST /api/v1/chat 3841ms
├── auth-service [0ms → 38ms] validateJWT 38ms
│   └── redis [2ms → 8ms] GET session:token 6ms (cache HIT)
├── session-service [40ms → 52ms] getSession 12ms
│   └── postgres [41ms → 50ms] SELECT sessions 9ms
├── context-service [54ms → 438ms] loadHistory 384ms
│   ├── postgres [55ms → 210ms] SELECT messages 155ms ← slow query
│   └── postgres [211ms → 436ms] SELECT context 225ms ← slow query
├── rag-service [56ms → 287ms] vectorSearch 231ms
│   ├── llm-router [58ms → 176ms] embedding 118ms (ada-002, 312 tokens)
│   └── qdrant [177ms → 283ms] query 106ms
├── llm-orchestrator [615ms → 3515ms] complete 2900ms
│   ├── llm-router [616ms → 884ms] classify 268ms (gpt-4o-mini, 156→43 tokens)
│   ├── llm-router [885ms → 3093ms] generate 2208ms ← BOTTLENECK (claude-3-5-sonnet, 2847→612 tokens)
│   └── llm-router [3094ms → 3286ms] postprocess 192ms (gpt-4o-mini, 674→124 tokens)
├── storage-service [3520ms → 3674ms] saveMessage 154ms
│   └── postgres [3521ms → 3670ms] INSERT messages 149ms
└── audit-service [3680ms → 3776ms] enqueueEvent 96ms
    └── kafka [3682ms → 3774ms] PRODUCE chat.events 92ms
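The "find the bottleneck" step can be automated once spans are exported. The sketch below assumes a simplified `Span` record (hypothetical, not an OTel type) and uses a subset of the timings from the hierarchy above. It picks the longest leaf span, since parent spans mostly wait on their children.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def slowest_leaf(spans: list[Span]) -> tuple[Span, float]:
    """Return the longest leaf span and its share of the root span's duration."""
    parent_ids = {s.parent_id for s in spans if s.parent_id is not None}
    root = next(s for s in spans if s.parent_id is None)
    # A leaf is a span that no other span names as its parent.
    leaves = [s for s in spans if s.span_id not in parent_ids]
    worst = max(leaves, key=lambda s: s.duration_ms)
    return worst, worst.duration_ms / root.duration_ms


# Subset of the walkthrough trace; span IDs are made up for the example.
spans = [
    Span("1", None, "api-gateway POST /api/v1/chat", 0, 3841),
    Span("2", "1", "context-service loadHistory", 54, 438),
    Span("3", "2", "postgres SELECT messages", 55, 210),
    Span("4", "2", "postgres SELECT context", 211, 436),
    Span("5", "1", "llm-orchestrator complete", 615, 3515),
    Span("6", "5", "llm-router classify", 616, 884),
    Span("7", "5", "llm-router generate", 885, 3093),
    Span("8", "5", "llm-router postprocess", 3094, 3286),
]

worst, share = slowest_leaf(spans)
print(f"{worst.name}: {worst.duration_ms}ms ({share:.0%} of trace)")
# → llm-router generate: 2208ms (57% of trace)
```

Ranking leaves by duration is only a first pass. It ignores parallelism, so a long span that overlaps other work may not be on the critical path.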
08

Tracing LLM Calls

// OBSERVING AI INFERENCE AS FIRST-CLASS SPANS

LLM API calls are unlike normal HTTP calls. They have unique observability requirements: token counts (prompt vs. completion), model name and version, cost attribution, finish reason (stop/length/content_filter), streaming vs. non-streaming latency, and prompt content (with PII redaction). Each LLM call deserves a rich, dedicated span.

OpenTelemetry GenAI Semantic Conventions: The OTel community has published standardized attribute names for LLM spans under gen_ai.*. Use these to ensure compatibility with observability platforms that understand GenAI semantics: gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.response.finish_reasons.
Python — Instrumenting LLM Calls (OpenAI / Anthropic)
import hashlib
import time

import anthropic
from opentelemetry import trace
from opentelemetry.trace import SpanKind

tracer = trace.get_tracer("llm-router")
TEMPERATURE = 0.7


def hash_prompt(prompt: str) -> str:
    # Short, stable fingerprint; safe to store, unlike the prompt itself
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:16]


async def call_llm_with_tracing(
    prompt: str,
    model: str = "claude-3-5-sonnet-20241022",
    max_tokens: int = 1024,
    purpose: str = "generate",
) -> str:
    # Async client, so the awaited call does not block the event loop
    client = anthropic.AsyncAnthropic()
    with tracer.start_as_current_span(f"llm.{purpose}", kind=SpanKind.CLIENT) as span:
        # OTel GenAI semantic conventions
        span.set_attributes({
            "gen_ai.system": "anthropic",
            "gen_ai.request.model": model,
            "gen_ai.request.max_tokens": max_tokens,
            "gen_ai.request.temperature": TEMPERATURE,
            # Custom: purpose and prompt fingerprint (NOT the full prompt in prod)
            "llm.purpose": purpose,
            "llm.prompt.length": len(prompt),
            "llm.prompt.hash": hash_prompt(prompt),  # for dedup
        })
        start = time.monotonic()
        try:
            response = await client.messages.create(
                model=model,
                max_tokens=max_tokens,
                temperature=TEMPERATURE,
                messages=[{"role": "user", "content": prompt}],
            )
            latency_ms = (time.monotonic() - start) * 1000
            # Record response telemetry
            span.set_attributes({
                "gen_ai.usage.input_tokens": response.usage.input_tokens,
                "gen_ai.usage.output_tokens": response.usage.output_tokens,
                "gen_ai.response.finish_reasons": [response.stop_reason],
                "gen_ai.response.id": response.id,
                # Cost attribution (token counts × model pricing; rates are
                # hard-coded here and must be kept in sync with the provider)
                "llm.cost.input_usd": response.usage.input_tokens * 3e-6,
                "llm.cost.output_usd": response.usage.output_tokens * 15e-6,
                "llm.latency_ms": round(latency_ms, 2),
                "llm.tokens_per_second": response.usage.output_tokens / (latency_ms / 1000),
            })
            return response.content[0].text
        except anthropic.APITimeoutError as e:
            span.set_status(trace.StatusCode.ERROR, "LLM timeout")
            span.record_exception(e)
            span.set_attribute("llm.error.type", "timeout")
            raise
        except anthropic.RateLimitError as e:
            span.set_status(trace.StatusCode.ERROR, "rate_limit")
            span.set_attribute("llm.error.type", "rate_limit")
            span.record_exception(e)
            raise

What LLM Span Attributes Enable

Cost Attribution by Request FinOps

Sum llm.cost.input_usd + llm.cost.output_usd across all LLM spans in a trace. Aggregate by user.id, tenant.id, or feature to understand who's driving LLM costs. Build dashboards that break down inference spend by product surface.
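As a rough illustration of that aggregation, the sketch below sums the two cost attributes over exported spans and groups them by an arbitrary attribute. It assumes spans are available as plain dicts with an `attributes` mapping, and that each LLM span carries the grouping attribute (for example `user.id` copied from baggage). Both are assumptions about your export pipeline, and `cost_by_attribute` is a hypothetical helper.

```python
from collections import defaultdict


def cost_by_attribute(spans: list[dict], key: str) -> dict[str, float]:
    """Sum LLM cost per value of `key` (e.g. user.id or tenant.id)."""
    totals: dict[str, float] = defaultdict(float)
    for span in spans:
        attrs = span.get("attributes", {})
        # Only spans tagged with the GenAI conventions count as LLM calls.
        if "gen_ai.system" not in attrs:
            continue
        cost = attrs.get("llm.cost.input_usd", 0.0) + attrs.get("llm.cost.output_usd", 0.0)
        totals[attrs.get(key, "unknown")] += cost
    return dict(totals)


# Example spans; the values are invented for illustration.
example_spans = [
    {"attributes": {"gen_ai.system": "anthropic", "user.id": "usr_a",
                    "llm.cost.input_usd": 0.008541, "llm.cost.output_usd": 0.00918}},
    {"attributes": {"gen_ai.system": "openai", "user.id": "usr_a",
                    "llm.cost.input_usd": 0.001, "llm.cost.output_usd": 0.002}},
    {"attributes": {"gen_ai.system": "openai", "user.id": "usr_b",
                    "llm.cost.input_usd": 0.5, "llm.cost.output_usd": 0.25}},
    {"attributes": {"db.system": "postgresql", "user.id": "usr_a"}},  # ignored
]

totals = cost_by_attribute(example_spans, "user.id")
```

In practice this grouping usually runs as a query in the tracing backend or a metrics pipeline; the function only shows the shape of the computation.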

Latency Breakdown Performance

Separate TTFT (time-to-first-token) from full generation time with span events. Identify whether latency is dominated by prompt processing or generation. Detect when a model has higher-than-usual latency indicating capacity issues on the provider side.

Prompt Cache Efficiency Optimization

Track gen_ai.usage.cache_read_tokens vs gen_ai.usage.input_tokens. Measure cache hit rate per prompt template. Guide prompt engineering decisions: long system prompts benefit from prefix caching — but only if you can measure the impact.

Failure Mode Analysis Reliability

Track gen_ai.response.finish_reasons across requests. High rate of length (truncation) means your max_tokens is too low. content_filter spikes indicate prompt injection attempts or edge-case inputs needing guardrails.

09

Context Propagation

// PASSING THE BATON ACROSS SERVICES

Context propagation is the mechanism that links spans across service boundaries into a single trace. The calling service injects the current span context into outgoing request headers. The receiving service extracts it and creates its child span with that parent. Without correct propagation, you get disconnected single-service traces.

Python — Manual Propagation (HTTP & Kafka)
import json

import httpx
from confluent_kafka import Producer
from fastapi import Request
from opentelemetry import propagate, trace

# propagate.inject / propagate.extract use the globally configured propagator,
# which defaults to W3C TraceContext + Baggage.
tracer = trace.get_tracer("api-gateway")
# Broker address is a placeholder; point this at your own cluster
producer = Producer({"bootstrap.servers": "kafka:9092"})


# ── HTTP outbound call ──
async def call_downstream(url: str, payload: dict) -> dict:
    with tracer.start_as_current_span("http.client.call", kind=trace.SpanKind.CLIENT) as span:
        headers = {}
        # Inject current span context into HTTP headers
        propagate.inject(headers)
        # → headers now contains: {"traceparent": "00-3fa8...-01"}
        async with httpx.AsyncClient() as client:
            resp = await client.post(url, json=payload, headers=headers)
        span.set_attribute("http.response.status_code", resp.status_code)
        return resp.json()


# ── Kafka producer (inject into message headers) ──
def publish_event(topic: str, event: dict):
    kafka_headers = {}
    propagate.inject(kafka_headers)  # Same inject API
    kafka_header_list = [(k, v.encode()) for k, v in kafka_headers.items()]
    producer.produce(
        topic,
        key=event["id"],
        value=json.dumps(event).encode(),
        headers=kafka_header_list,
    )


# ── Receiving service — extract context ──
async def my_endpoint(request: Request):
    # Auto-instrumented with FastAPI middleware — but here's the manual version:
    ctx = propagate.extract(dict(request.headers))  # Extract propagated context
    with tracer.start_as_current_span(
        "process_request",
        context=ctx,  # parent = the extracted span context
        kind=trace.SpanKind.SERVER,
    ) as span:
        user_id = request.headers.get("x-user-id")
        if user_id is not None:  # attribute values must not be None
            span.set_attribute("user.id", user_id)
        ...
Async and queues: Message queues (Kafka, SQS, RabbitMQ) break the synchronous parent-child span relationship. Use span links to connect a producer span to a consumer span when there's no direct causal chain — the consumer span has a link to the producer span, indicating relationship without implying causality. This is the correct model for fanout patterns.
10

Auto vs Manual Instrumentation

// WHAT YOU GET FOR FREE AND WHEN TO GO DEEPER

OTel auto-instrumentation uses bytecode manipulation (Java), monkey-patching (Python), or module wrapping (Node.js) to instrument popular libraries without code changes. It's the right starting point — but insufficient alone for business-meaningful observability.

| Layer | Auto-Instrumented | Manual Required |
| --- | --- | --- |
| HTTP | YES — all inbound/outbound HTTP spans, status codes, routes | Business context: user.id, tenant.id, feature.flag |
| Databases | YES — query spans with statement text, table name, duration | Slow query context: row count, index used, affected records |
| Redis / Memcached | YES — command + key pattern, latency | Hit/miss rate as span events, cache key namespace |
| gRPC | YES — method, status code, propagation | Payload size, retry count, deadline remaining |
| LLM APIs | PARTIAL — openai-python has OTel contrib; Anthropic manual only | Token counts, cost, finish reason, model, prompt hash — all manual |
| Message queues | PARTIAL — Kafka, RabbitMQ instrumentations available | Consumer lag, queue depth at time of processing, span links |
| Business logic | NO | Everything: business operations, user intent, A/B variant, recommendation algorithm used |
The 80/20 rule: Auto-instrumentation gives you 80% of trace coverage with 0% code changes. Add manual spans at decision points that matter for debugging: each distinct algorithmic step, each third-party API call not covered by auto-instrumentation, each conditional branch that changes the request path significantly.
11

Sampling Strategies

// NOT EVERY TRACE NEEDS TO LIVE FOREVER

At scale — hundreds of services, thousands of requests per second — tracing 100% of requests is prohibitively expensive. Sampling controls which traces are recorded and stored. The key insight: you need representative coverage of normal traffic, but always need to capture errors, high latency, and critical business flows.

Head-Based Sampling Simple

The sampling decision is made at the start of a request, at the root span. All downstream services inherit the decision via the sampled flag in the traceparent header. Simple to implement, but you decide to sample before you know if the trace will be interesting.

  • Probabilistic — sample X% of all traces
  • Rate-limiting — max N traces/second regardless of volume
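  • Always-on for errors — not achievable with head sampling alone: the response status is unknown when the decision is made, so keeping 100% of 5xx traces requires tail-based sampling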
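A probabilistic head sampler is usually made deterministic on the trace ID, so every service that sees the same ID reaches the same decision. The sketch below shows that idea with only the standard library. It is similar in spirit to ratio-based samplers in tracing SDKs but is not copied from one, and `should_sample` is a hypothetical name.

```python
def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Deterministic head-sampling decision for a 128-bit hex trace ID."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    # Treat the low 64 bits of the trace ID as a uniform random number
    # and keep the trace if it falls below the ratio-sized threshold.
    low64 = int(trace_id_hex, 16) & 0xFFFFFFFFFFFFFFFF
    threshold = int(ratio * (1 << 64))
    return low64 < threshold


trace_id = "3fa85f645717456293fc2c963f66afa6"
decision = should_sample(trace_id, 0.10)
# The result becomes the sampled flag (01 or 00) in the outgoing traceparent.
flags = "01" if decision else "00"
```

Because the decision depends only on the trace ID and the ratio, it needs no coordination between services. The trade-off is that it is made before anything is known about how the request will turn out.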
Tail-Based Sampling Advanced

The sampling decision is made after the trace is complete — all spans are buffered, then a policy evaluates the entire trace. Captures slow, erroneous, or unusual traces that head-sampling would discard. Requires a stateful collector (OTel Collector tail-sampling processor, Grafana Tempo).

  • Latency-based — keep all traces where total > 2000ms
  • Error-based — keep all traces containing error spans
  • Attribute-based — always keep VIP user traces
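The three policy types above amount to a function over the complete set of spans for one trace. This is a minimal sketch of that evaluation, assuming spans arrive as plain dicts with `start_ms`, `end_ms`, `status`, and `attributes` keys (an invented shape for illustration). A real collector also has to buffer spans and decide when a trace is complete.

```python
from typing import Optional


def tail_sampling_decision(
    spans: list[dict],
    latency_threshold_ms: float = 2000,
    vip_tiers: tuple[str, ...] = ("enterprise", "vip"),
) -> Optional[str]:
    """Return the name of the first matching policy, or None to drop the trace."""
    if not spans:
        return None
    # Error-based: any span with an error status keeps the whole trace.
    if any(s.get("status") == "ERROR" for s in spans):
        return "error"
    # Latency-based: wall-clock time from earliest start to latest end.
    total_ms = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    if total_ms > latency_threshold_ms:
        return "latency"
    # Attribute-based: keep traces that belong to VIP users.
    if any(s.get("attributes", {}).get("user.tier") in vip_tiers for s in spans):
        return "vip"
    return None
```

A production setup would normally add a low-rate probabilistic fallback for traces that match none of these rules, so that normal traffic is still represented.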
YAML — OTel Collector Tail Sampling Config
# collector-config.yaml
processors:
  tail_sampling:
    decision_wait: 10s                  # Wait up to 10s for all spans before deciding
    num_traces: 50000                   # Max traces in memory buffer
    expected_new_traces_per_sec: 2000
    # A trace is kept if ANY policy below decides to sample it,
    # so no extra composite policy is needed to combine them.
    policies:
      # Policy 1: Always keep errors
      - name: keep-errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      # Policy 2: Keep slow traces (> 2s)
      - name: keep-slow-traces
        type: latency
        latency: { threshold_ms: 2000 }
      # Policy 3: Always keep payment traces
      - name: keep-payments
        type: string_attribute
        string_attribute:
          key: service.name
          values: [payment-service]
          enabled_regex_matching: false
      # Policy 4: Always keep VIP user traces
      - name: keep-vip-users
        type: string_attribute
        string_attribute:
          key: user.tier
          values: [enterprise, vip]
      # Policy 5: Probabilistic 10% of everything else
      - name: probabilistic-base
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }
12

Span Attributes

// WHAT TO RECORD AND HOW

Span attributes are the metadata that make a trace actually searchable and debuggable. OTel publishes semantic conventions — standardized attribute names for common operations. Use them to ensure your traces are compatible with tracing backends and APM platforms without custom parsing.

| Category | Required Attributes | Optional |
| --- | --- | --- |
| All Services | service.name, service.version, deployment.environment | k8s.pod.name, k8s.namespace.name, host.id |
| HTTP Server | http.request.method, http.route, http.response.status_code | http.request.body.size, client.address, user_agent.original |
| HTTP Client | http.request.method, server.address, http.response.status_code | http.request.body.size, server.port |
| Database | db.system, db.name, db.statement (sanitized), db.operation | db.sql.table, db.rows_affected, server.address |
| LLM / GenAI | gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens | gen_ai.response.finish_reasons, gen_ai.response.id, llm.cost.usd |
| Messaging | messaging.system, messaging.destination.name, messaging.operation | messaging.message.body.size, messaging.kafka.offset |
| Business | user.id, tenant.id | feature.flag, ab.experiment, user.tier, domain-specific fields |
NEVER put PII in span attributes. Spans are stored long-term, indexed, and readable by anyone with access to the tracing backend. Sanitize DB statements (remove parameter values), hash user identifiers, truncate request bodies, and strip authentication headers before adding to spans. Use user.id (opaque), never user.email or user.name.
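Two of those steps, sanitizing DB statements and hashing user identifiers, can be sketched with the standard library. The helper names are hypothetical. The regex approach is deliberately simple and will miss some literal forms (dollar-quoted strings, for example), so treat it as a starting point, not a complete SQL sanitizer.

```python
import hashlib
import hmac
import re

_STRING_LITERAL = re.compile(r"'(?:[^']|'')*'")   # 'text', including '' escapes
_NUMBER_LITERAL = re.compile(r"\b\d+(?:\.\d+)?\b")  # standalone numbers only


def sanitize_sql(statement: str) -> str:
    """Replace literal values with ? so the statement shape is kept but data is not."""
    statement = _STRING_LITERAL.sub("?", statement)
    return _NUMBER_LITERAL.sub("?", statement)


def pseudonymize(identifier: str, secret: bytes) -> str:
    """Keyed hash of an identifier: stable for joins, not reversible without the secret."""
    digest = hmac.new(secret, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
    return "usr_" + digest[:16]


raw = "SELECT * FROM users WHERE email = 'a@b.com' AND age > 30"
clean = sanitize_sql(raw)
# clean == "SELECT * FROM users WHERE email = ? AND age > ?"
opaque_id = pseudonymize("a@b.com", secret=b"rotate-me")  # placeholder secret
```

A keyed hash (HMAC) is used instead of a bare hash so that identifiers cannot be recovered by hashing a list of known emails. The secret would live in your secret manager, not in code.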
13

Debugging with Traces

// FROM ALERT TO ROOT CAUSE IN MINUTES

A trace is the shortest path from a user complaint to a root cause. This section describes the five most common debugging workflows and how to execute them in Jaeger or Zipkin.

Workflow 1 — "Request X was slow for user Y" Latency

Search Jaeger: service=api-gateway, tag=user.id:usr_9kx2z, min-duration=2s, last 1h. Click the highest-latency trace. Expand the waterfall. Sort spans by duration. The longest non-root span is your starting point for investigation.

Workflow 2 — "We're getting 5xx errors since deploy" Error

Search: service=order-service, tag=error:true, last 30m. Find a representative error trace. The red (error) span tells you the exact service, span name, and exception. The span events contain the full stack trace.

Workflow 3 — "Which service is causing N+1 queries" Database

Find a trace with high database span count. Filter spans by db.system=postgresql. 30 identical SELECT spans in a loop reveals the N+1 instantly — you can see the parent span that spawned them and the caller code via code.filepath attributes.
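The same check can be scripted against exported spans. The sketch below groups database spans by parent and statement text and flags groups above a threshold. It assumes spans are plain dicts with `parent_id` and `attributes` keys, and that `db.statement` is already sanitized so repeated queries compare equal. `find_n_plus_one` is a hypothetical helper.

```python
from collections import Counter


def find_n_plus_one(spans: list[dict], threshold: int = 10) -> list[tuple[str, str, int]]:
    """Return (parent_id, statement, count) for statements repeated under one parent."""
    counts = Counter(
        (span["parent_id"], span["attributes"]["db.statement"])
        for span in spans
        if "db.statement" in span.get("attributes", {})
    )
    return [
        (parent_id, statement, count)
        for (parent_id, statement), count in counts.items()
        if count >= threshold
    ]


# 30 identical lookups under one parent span, plus one unrelated query.
example = [
    {"parent_id": "p1", "attributes": {"db.system": "postgresql",
                                       "db.statement": "SELECT * FROM items WHERE order_id = ?"}}
    for _ in range(30)
]
example.append({"parent_id": "p1", "attributes": {"db.system": "postgresql",
                                                  "db.statement": "SELECT * FROM orders WHERE user_id = ?"}})

suspects = find_n_plus_one(example)
```

The threshold of 10 is arbitrary. Pick a value that reflects how many repeated queries are normal for your services.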

Workflow 4 — "Why did this LLM call cost 10× the normal amount" LLM Cost

Find the trace by trace_id from your LLM cost anomaly alert. Inspect the llm-router/generate span. Check gen_ai.usage.input_tokens: was there context accumulation? Check the upstream context-service span for how much history was loaded. The causal chain is visible.

Workflow 5 — "Service mesh is timing out intermittently" Reliability

Search by duration range overlapping the timeout SLA. Compare 5 slow traces against 5 fast traces using Jaeger's compare view. Look for which span is consistently longer in slow traces — retry storms, connection pool exhaustion, and GC pauses all show distinct patterns.

14

Performance Analysis

// READING THE WATERFALL

The trace waterfall is your primary tool for understanding where time goes. Learn to read it fluently: gaps between spans reveal network overhead; spans that start late reveal queuing; parallel spans that collapse into a long sequential chain reveal concurrency bugs.

Common Latency Patterns
A. SEQUENTIAL BLOCKING: should be parallel
fetch_user_profile      200ms
fetch_user_orders       180ms
fetch_recommendations   220ms
→ Total: 600ms | Fix: asyncio.gather() reduces this to ~220ms (the longest operation)
B. PARALLEL (CORRECT): same calls, concurrent
fetch_user_profile      200ms
fetch_user_orders       180ms
fetch_recommendations   220ms
→ Total: 220ms | 63% lower latency (600ms → 220ms)
C. GAPS: network/scheduling overhead
parent_service call     800ms
child span (actual)     352ms
→ Gap before child (18% of parent) = 144ms: DNS + TCP + TLS handshake overhead
→ Gap after child (38% of parent) = 304ms: response processing time not instrumented
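The difference between patterns A and B can be reproduced in a few lines. This is a sketch in which asyncio.sleep stands in for the three downstream calls; fake_call and the durations are illustrative, and real clients would need to be async for asyncio.gather() to help.

```python
import asyncio
import time


async def fake_call(name: str, seconds: float) -> str:
    """Stand-in for a downstream request; sleeps instead of doing I/O."""
    await asyncio.sleep(seconds)
    return name


async def sequential() -> float:
    """Pattern A: each call waits for the previous one to finish."""
    start = time.perf_counter()
    await fake_call("fetch_user_profile", 0.20)
    await fake_call("fetch_user_orders", 0.18)
    await fake_call("fetch_recommendations", 0.22)
    return time.perf_counter() - start


async def concurrent() -> float:
    """Pattern B: all three calls are in flight at the same time."""
    start = time.perf_counter()
    await asyncio.gather(
        fake_call("fetch_user_profile", 0.20),
        fake_call("fetch_user_orders", 0.18),
        fake_call("fetch_recommendations", 0.22),
    )
    return time.perf_counter() - start


async def measure():
    """Return (sequential_seconds, concurrent_seconds)."""
    return await sequential(), await concurrent()


if __name__ == "__main__":
    seq_s, conc_s = asyncio.run(measure())
    print(f"sequential: {seq_s * 1000:.0f}ms, concurrent: {conc_s * 1000:.0f}ms")
```

The concurrent total tracks the slowest call rather than the sum. This only applies when the calls are independent; if one call needs another's result, the sequential waterfall is correct.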
Connection pool exhaustion: If you see a pattern where a child span starts late — large gap between parent start and child start — and multiple requests share the same pattern at the same time, suspect connection pool exhaustion. The request is waiting for a database connection, not for the database to respond. Add db.wait_ms as a span event to confirm.
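One way to record the wait is to time the pool acquisition and attach the result as a span event. This is a sketch under assumptions: the pool is modelled with queue.Queue, the event name db.pool.acquire is made up, and span is any object exposing add_event(name, attributes=...), which is the shape of the OpenTelemetry Python Span method. RecordingSpan is a stand-in so the example runs without the SDK.

```python
import queue
import time


class RecordingSpan:
    """Stand-in for a tracing span; stores events in a list for inspection."""

    def __init__(self):
        self.events = []

    def add_event(self, name, attributes=None):
        self.events.append((name, dict(attributes or {})))


def acquire_with_wait_event(pool, span, timeout_s=5.0):
    """Take a connection from the pool and record how long the caller waited for it."""
    start = time.perf_counter()
    conn = pool.get(timeout=timeout_s)  # raises queue.Empty if the pool stays exhausted
    wait_ms = (time.perf_counter() - start) * 1000.0
    span.add_event("db.pool.acquire", attributes={"db.wait_ms": wait_ms})
    return conn


# Usage with a hypothetical one-connection pool.
pool = queue.Queue()
pool.put("conn-1")
span = RecordingSpan()
conn = acquire_with_wait_event(pool, span)
```

If db.wait_ms accounts for most of the gap before the child span, the fix is pool sizing or shorter connection hold times, not query tuning.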
15

Deployment & Storage

// RUNNING TRACING INFRASTRUCTURE AT SCALE
Storage Backend Selection
Backend          Scale          Cost
In-memory        Dev only       Free
Badger           Single node    Free
Elasticsearch    High           Medium
OpenSearch       High           Medium
Cassandra        Very High      High ops
ClickHouse       High           Low
Grafana Tempo    Cloud-scale    Object store
OTel Collector — Production Config
  • Run Collector as a DaemonSet (one per node) for local span collection — low latency, no network hops from pods
  • Or as a Deployment (central gateway) for batching, fan-out, and tail sampling
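  • Use both: DaemonSet agents forward to a central gateway Deployment that does tail sampling. With more than one gateway replica, route by trace ID (for example with the loadbalancing exporter) so every span of a trace reaches the same replica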
  • Size memory: 2–4GB per collector at 10k spans/sec with tail sampling enabled
  • Always set memory_limiter processor to prevent OOM under spike
YAML — OTel Collector Production Pipeline
# collector-production.yaml
receivers:
  otlp:
    protocols:
      grpc: { endpoint: "0.0.0.0:4317" }
      http: { endpoint: "0.0.0.0:4318" }

processors:
  memory_limiter:
    check_interval: 5s
    limit_mib: 3000
    spike_limit_mib: 500
  batch:
    timeout: 5s
    send_batch_size: 1024
    send_batch_max_size: 2048
  attributes/sanitize:            # Remove sensitive headers before storage
    actions:
      - key: http.request.header.authorization
        action: delete
  # The attributes processor has no regex-replace action, so pattern-based
  # masking of db.statement is done with the transform processor (OTTL).
  transform/sanitize:
    trace_statements:
      - context: span
        statements:
          - replace_pattern(attributes["db.statement"], "(?i)(password|secret|token)\\s*=\\s*\\S+", "[sanitized]")
  tail_sampling:                  # Keep errors + slow traces
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow
        type: latency
        latency: { threshold_ms: 2000 }
      - name: base
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }

exporters:
  otlp/jaeger:
    endpoint: jaeger-collector.observability.svc:4317
    tls: { insecure: false, ca_file: /certs/ca.pem }
  zipkin:                         # Fan out to Zipkin for legacy consumers
    endpoint: http://zipkin.observability.svc:9411/api/v2/spans
  otlp/tempo:                     # And Grafana Tempo for long-term storage
    endpoint: tempo.observability.svc:4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      # memory_limiter first; batch last, after the sampling decision,
      # so spans are not regrouped before tail_sampling sees them.
      processors: [memory_limiter, attributes/sanitize, transform/sanitize, tail_sampling, batch]
      exporters: [otlp/jaeger, zipkin, otlp/tempo]
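The three tail-sampling policies combine as "keep if any policy says keep". The sketch below illustrates that decision for a single completed trace. It is not the Collector's implementation: the span fields (start_ms, duration_ms, status) and the hash-based probabilistic bucket are assumptions chosen to make the example deterministic.

```python
import hashlib


def keep_trace(trace_id, spans, threshold_ms=2000, sampling_percentage=5.0):
    """Decide whether to keep a trace; spans must be a non-empty list of dicts."""
    # Policy "errors": keep any trace containing an error span.
    if any(s.get("status") == "ERROR" for s in spans):
        return True
    # Policy "slow": keep traces whose end-to-end duration exceeds the threshold.
    start = min(s["start_ms"] for s in spans)
    end = max(s["start_ms"] + s["duration_ms"] for s in spans)
    if end - start > threshold_ms:
        return True
    # Policy "base": keep a stable percentage of everything else,
    # bucketed by a hash of the trace ID so the decision is repeatable.
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 10_000
    return bucket < sampling_percentage * 100


# Hypothetical traces for illustration.
error_trace = [{"start_ms": 0, "duration_ms": 50, "status": "ERROR"}]
slow_trace = [
    {"start_ms": 0, "duration_ms": 100, "status": "OK"},
    {"start_ms": 2500, "duration_ms": 10, "status": "OK"},
]
fast_trace = [{"start_ms": 0, "duration_ms": 100, "status": "OK"}]
```

The decision needs the whole trace, which is why decision_wait exists and why every span of a trace has to arrive at the same Collector instance.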
16

Tracing Maturity Model

// WHERE ARE YOU TODAY

Most organizations start with ad-hoc traces on a few services and build toward full-system observability. This model shows the progression. Focus on getting all services to Level 2 before perfecting any one service at Level 4.

Level 01
Blind
  • No distributed tracing
  • Log correlation by request ID only
  • Manual service topology diagrams
  • Latency debugging = log trawling
  • No cross-service visibility
Level 02
Aware
  • OTel auto-instrumentation on HTTP
  • Traces visible in Jaeger/Zipkin
  • Basic sampling (probabilistic)
  • Some services missing context propagation
  • No LLM / queue tracing
Level 03
Operational
  • All services traced, propagation correct
  • DB, cache, queue spans included
  • Tail-based sampling for errors/slowness
  • Business attributes on key spans
  • OTel Collector pipeline in production
Level 04
Optimal
  • LLM calls as rich spans with tokens/cost
  • Span links for async/queue flows
  • Traces correlated with logs & metrics
  • Adaptive sampling per-service
  • Automated anomaly detection on traces
Reference reading: Google Dapper paper (2010) — the theoretical foundation. OpenTelemetry docs at opentelemetry.io. Jaeger documentation at jaegertracing.io. W3C TraceContext spec at w3.org/TR/trace-context. OTel GenAI semantic conventions at opentelemetry.io/docs/specs/semconv/gen-ai. Grafana Tempo for cost-efficient trace storage at object-store scale.