Distributed Tracing: Field Handbook

// "If you can't see the path, you can't fix the bottleneck."

The operational guide to observing requests as they traverse microservices, databases, queues, and LLM APIs. Covers OpenTelemetry instrumentation, Jaeger and Zipkin deployment, context propagation, sampling strategy, and end-to-end debugging patterns.

Topics: Jaeger, Zipkin, OpenTelemetry, LLM Tracing, W3C TraceContext, Debugging

Why Distributed Tracing

// THE OBSERVABILITY GAP

In a monolith, a slow request is obvious — profile it, find the slow function, fix it. In a distributed system spanning dozens of microservices, the request is a ghost: it enters your API gateway and touches 8 services, 3 databases, 2 queues, and 3 LLM API calls before a response returns. Which hop added the 4 seconds? Logs can't tell you. Metrics can't tell you. Traces can.

Distributed tracing reconstructs the complete journey of a request as a causally-linked tree of spans — one per operation, each with a start time, duration, service name, and any relevant metadata. It answers the questions that logs and metrics fundamentally cannot.

Logs — What Happened

Discrete events per service. No cross-service correlation. You need a request ID to join logs across services — and you still can't see the timing relationships.

Metrics — How Much

Aggregated rates and latencies per service. No per-request view. You know the p99 of the order service is 2s but not which downstream call caused it for which request.

Traces — Why & Where

Per-request causal graph of every operation. Exactly which service, which database query, which LLM call caused the slowdown — for a specific user's specific request.

◉

The three pillars of observability: Logs, metrics, and traces are complementary — not alternatives. Traces tell you where the problem is. Metrics tell you it's happening in aggregate. Logs tell you what data was involved. All three should be correlated: traces enriched with log links, metrics tagged with exemplar trace IDs.

What Tracing Solves That Nothing Else Does

Latency Attribution (Critical)

Which downstream dependency added 3.8 seconds to this specific request? Was it the database query in service-6, or the LLM call in service-4? Traces give you the exact span with its duration, parent, and all child spans — in one view.

Error Root Cause (Debug)

A 500 from the API gateway tells you nothing. A trace shows the error originated as a timeout in the recommendations service calling a third-party embedding API — 7 hops deep. Navigate directly to the failing span.

Dependency Mapping (Topology)

Tracing backends automatically derive your actual service dependency graph from real traffic. No manual documentation required. Discover undocumented dependencies and fanout patterns your architecture diagrams don't show.

LLM Cost & Latency Visibility (AI/ML)

LLM API calls (OpenAI, Anthropic, local models) are often the dominant latency and cost contributor in AI-augmented services. Trace each LLM call as a span: model, token count, latency, cost, prompt hash. Correlate LLM latency spikes with user-facing degradation.

Core Concepts

// TRACES, SPANS, CONTEXT

Distributed tracing has a small, precise vocabulary. Understanding these five concepts is sufficient to work with any tracing system — OpenTelemetry, Jaeger, Zipkin, or a commercial APM.

Concept	Definition	Analogy
Trace	The complete end-to-end record of a single request as it flows through a distributed system. A tree of spans sharing the same `trace_id`.	The full itinerary of a package shipment
Span	A single unit of work within a trace. Has a name, `span_id`, parent `span_id`, start timestamp, duration, status, and optional attributes/events.	One leg of the shipment journey (city → city)
Trace Context	The propagated headers (`traceparent`, `tracestate`) that link spans across service boundaries. Injected by the caller, extracted by the callee.	The tracking number printed on the label
SpanContext	The immutable, serializable identity of a span: `trace_id` (128-bit), `span_id` (64-bit), trace flags, trace state. Propagated in-process and cross-process.	The barcode on the package
Baggage	Key-value pairs propagated with the trace context across service boundaries. Useful for passing user ID, tenant, feature flag, or correlation metadata without modifying every service's API.	Notes written on the shipping label
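
To make the vocabulary concrete, the following is a minimal sketch in plain Python (no tracing library) of how a backend could reassemble a trace tree from flat span records using only `span_id` and the parent `span_id`. The `Span` dataclass, `build_tree`, and `render` are illustrative names, not part of any SDK, and the short span IDs are made up for readability.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]  # None on the root span
    name: str


def build_tree(spans):
    """Return (root, children), where children maps span_id -> child spans."""
    children = defaultdict(list)
    root = None
    for span in spans:
        if span.parent_span_id is None:
            root = span
        else:
            children[span.parent_span_id].append(span)
    return root, children


def render(span, children, depth=0):
    """Flatten the tree into indented lines, one per span."""
    lines = ["  " * depth + span.name]
    for child in children[span.span_id]:
        lines.extend(render(child, children, depth + 1))
    return lines


TRACE_ID = "3fa85f645717456293fc2c963f66afa6"
# Spans usually arrive out of order, from different services
spans = [
    Span(TRACE_ID, "02", "01", "auth-service validateJWT"),
    Span(TRACE_ID, "01", None, "api-gateway POST /chat"),
    Span(TRACE_ID, "03", "02", "redis GET session:token"),
]
root, children = build_tree(spans)
print("\n".join(render(root, children)))
```

Arrival order does not matter: the parent pointers alone are enough to rebuild the hierarchy, which is why correct propagation of the parent `span_id` is essential.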

Span Anatomy

JSON — Span Structure (OTLP)

{
  "traceId":       "3fa85f645717456293fc2c963f66afa6",   // 32 hex chars, no dashes
  "spanId":        "a1b2c3d4e5f60001",
  "parentSpanId":  "0000000000000001",       // absent on root span
  "name":          "POST /v1/recommendations",

  "kind":  "SERVER",  // SERVER | CLIENT | PRODUCER | CONSUMER | INTERNAL
  "startTimeUnixNano": 1715000000000000000,
  "endTimeUnixNano":   1715000000843000000,     // 843ms

  "status": { "code": "OK" },   // OK | ERROR | UNSET

  "attributes": {
    // HTTP semantic conventions (OpenTelemetry semconv)
    "http.request.method":  "POST",
    "http.route":           "/v1/recommendations",
    "http.response.status_code": 200,
    "server.address":       "recommendations.svc.cluster.local",
    "http.request.body.size": 1024,

    // Custom business attributes
    "user.id":              "usr_9kx2z",
    "recommendation.count": 12,
    "model.used":           "colbert-v2"
  },

  "events": [
    {
      "name": "cache.miss",
      "timeUnixNano": 1715000000012000000,
      "attributes": { "cache.key": "recs:usr_9kx2z:v2" }
    },
    {
      "name": "embedding.computed",
      "timeUnixNano": 1715000000524000000,
      "attributes": { "embedding.dims": 768, "embedding.model": "text-embedding-ada-002" }
    }
  ],

  "links": [
    {
      // Link to async span in a queue processor — cross-trace relationship
      "traceId": "9c4fd3a1...",
      "spanId":  "b7c2e4f8...",
      "attributes": { "link.type": "enqueued_by" }
    }
  ]
}

W3C TraceContext Headers

HTTP Headers — W3C TraceContext (W3C Recommendation)

# The standard propagation format — use this over Zipkin B3 for new systems

traceparent: 00-3fa85f645717456293fc2c963f66afa6-a1b2c3d4e5f60001-01
#            ^^ version (always 00)
#               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ trace-id (128-bit hex)
#                                                ^^^^^^^^^^^^^^^^ parent-span-id (64-bit hex)
#                                                                 ^^ flags: 01=sampled, 00=not sampled

tracestate: jaeger=a1b2c3d4e5f60001,vendor2=xyz
# tracestate: vendor-specific key-value pairs, passed through unchanged

# B3 format (Zipkin-originated, still common in older stacks)
X-B3-TraceId:      3fa85f645717456293fc2c963f66afa6
X-B3-SpanId:       a1b2c3d4e5f60001
X-B3-ParentSpanId: 0000000000000001
X-B3-Sampled:      1

# B3 single-header format
b3: 3fa85f645717456293fc2c963f66afa6-a1b2c3d4e5f60001-1-0000000000000001
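
As a rough illustration of the `traceparent` layout annotated above, here is a small standard-library parser. It is a sketch that accepts only the version `00` layout and rejects the all-zero IDs the specification treats as invalid; `TraceParent` and `parse_traceparent` are hypothetical names. In a real service the OTel propagators do this for you.

```python
import re
from typing import NamedTuple, Optional

# version - trace-id (32 hex) - parent-span-id (16 hex) - flags, lowercase hex
TRACEPARENT_RE = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Parse a version-00 traceparent header; return None if it is malformed."""
    match = TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    if version != "00":
        return None  # this sketch does not attempt forward-compatible parsing
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None  # all-zero IDs are invalid
    # Bit 0 of the flags byte is the sampled flag
    return TraceParent(version, trace_id, span_id, bool(int(flags, 16) & 0x01))


tp = parse_traceparent("00-3fa85f645717456293fc2c963f66afa6-a1b2c3d4e5f60001-01")
print(tp)
```

A receiver that gets `None` back should start a new trace rather than fail the request.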

OpenTelemetry

// THE VENDOR-NEUTRAL INSTRUMENTATION STANDARD

OpenTelemetry (OTel) is the CNCF standard for emitting traces, metrics, and logs from applications. It replaces the fragmented ecosystem of Jaeger clients, Zipkin clients, and vendor SDKs with a single, vendor-neutral API and SDK. Instrument once, export to any backend.

▸

The architecture: Your application uses the OTel API (no-op without SDK). The OTel SDK processes and batches telemetry. The OTel Collector (optional but recommended) receives OTLP from services, transforms it, and fans out to multiple backends — Jaeger, Zipkin, Prometheus, Grafana Tempo, Datadog — simultaneously.

Python — OTel SDK Setup + OTLP Exporter (to the Collector or Jaeger)

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource

# 1. Define service resource
resource = Resource.create({
    "service.name":      "recommendations-service",
    "service.version":   "2.4.1",
    "deployment.environment": "production",
    "k8s.namespace.name": "prod-api",
})

# 2. Configure exporter → OTel Collector (which fans out to Jaeger/Zipkin)
exporter = OTLPSpanExporter(
    endpoint="http://otel-collector.observability.svc:4317",
    # Or directly to Jaeger: endpoint="http://jaeger-collector:4317"
)

# 3. Build provider with batch processor
provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(
    exporter,
    max_queue_size=2048,
    max_export_batch_size=512,
    export_timeout_millis=30000,
))
trace.set_tracer_provider(provider)

# 4. Get a tracer — use this throughout your service
tracer = trace.get_tracer("recommendations.core", "2.4.1")

# 5. Create and annotate spans manually
def get_recommendations(user_id: str, context_items: list):
    with tracer.start_as_current_span("get_recommendations") as span:
        span.set_attributes({
            "user.id":     user_id,
            "context.size": len(context_items),
        })
        try:
            result = _fetch_recs(user_id, context_items)
            span.set_attribute("recommendations.returned", len(result))
            return result
        except Exception as e:
            span.record_exception(e)
            span.set_status(trace.StatusCode.ERROR, str(e))
            raise

TypeScript / Node.js — OTel Auto-Instrumentation

// instrumentation.ts — load BEFORE all other imports (--require flag)
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { Resource } from '@opentelemetry/resources';
import { ATTR_SERVICE_NAME } from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
  resource: new Resource({ [ATTR_SERVICE_NAME]: 'api-gateway' }),
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4317'
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      // Auto-instruments: http, express, grpc, pg, redis, mongodb, aws-sdk ...
      '@opentelemetry/instrumentation-http': { enabled: true },
      '@opentelemetry/instrumentation-express': { enabled: true },
      '@opentelemetry/instrumentation-pg': { enabled: true },
      '@opentelemetry/instrumentation-redis': { enabled: true },
    })
  ]
});
sdk.start();
// → Every HTTP request, DB query, Redis call now generates spans automatically

Jaeger

// CNCF GRADUATED — PRODUCTION-GRADE DISTRIBUTED TRACING

Jaeger (pronounced "YAY-ger") is a CNCF graduated project, originally developed by Uber. It is the production standard for large-scale distributed tracing — with support for Cassandra, Elasticsearch, and OpenSearch backends, adaptive sampling, service dependency graphs, and a mature query UI. The v2 architecture unifies ingestion via the OTel Collector.

Jaeger v2 Architecture (Current)

Jaeger Collector — ingests OTLP over gRPC/HTTP, writes to storage
Jaeger Query — serves the UI and the Jaeger Query API
Storage backends — Elasticsearch, OpenSearch, Cassandra, Badger (dev), ClickHouse (plugin)
Sampling service — serves sampling strategies (including adaptive ones) to SDKs via remote sampling

Key Features (Capabilities)

Native OTLP ingestion (Jaeger v2+)
Adaptive sampling — adjusts per-service rates dynamically
Service dependency graph from traces
Deep span search with tag filters, duration ranges, error filters
Trace comparison (side-by-side waterfall)
Grafana data source plugin

Docker Compose — Jaeger All-in-One (Dev)

# dev-tracing.yml — Jaeger all-in-one with in-memory storage (dev/test only)
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"   # Jaeger UI
      - "4317:4317"     # OTLP gRPC — send traces here
      - "4318:4318"     # OTLP HTTP
    environment:
      - COLLECTOR_OTLP_ENABLED=true
      - SPAN_STORAGE_TYPE=memory
      - MEMORY_MAX_TRACES=50000

Kubernetes — Jaeger Production via Jaeger Operator

# 1. Install Jaeger Operator
kubectl apply -f https://github.com/jaegertracing/jaeger-operator/releases/latest/download/jaeger-operator.yaml -n observability

# 2. Deploy Jaeger with Elasticsearch backend
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: production-jaeger
  namespace: observability
spec:
  strategy: production   # Deploys separate collector + query + ES

  collector:
    replicas: 3
    resources:
      limits: { cpu: "1", memory: "2Gi" }
    options:
      collector:
        queue-size: 2000
        num-workers: 50

  query:
    replicas: 2
    metricsPort: 8686

  storage:
    type: elasticsearch
    options:
      es:
        server-urls: https://elasticsearch.observability.svc:9200
        index-prefix: jaeger
        tls.enabled: true
        num-shards: 5
        num-replicas: 1

  sampling:
    options:
      default_strategy:
        type: probabilistic
        param: 0.1   # 10% default; overridden per-service below
      service_strategies:
        - service: payment-service
          type: probabilistic
          param: 1.0   # 100% — always trace payments
        - service: analytics-service
          type: ratelimiting
          param: 5     # max 5 traces/sec regardless of traffic

Zipkin

// THE ORIGINAL — SIMPLE, FAST, BATTLE-TESTED

Zipkin was the first widely-adopted open-source distributed tracing system, created by Twitter in 2012 (based on Google Dapper). It pioneered the B3 propagation format and remains the simplest path to get traces working, especially in Spring Boot ecosystems: Spring Cloud Sleuth (Boot 2.x) and its successor Micrometer Tracing (Boot 3.x) both support Zipkin out of the box.

Zipkin Architecture

Zipkin Server — single process: collector + storage + UI + API
Storage — In-memory (dev), MySQL, Cassandra, Elasticsearch
Transport — HTTP (POST /api/v2/spans), Kafka, Scribe
Reporter — library-side component that buffers and flushes spans to server

Best Fit For (Use Cases)

Spring Boot services — native integration, zero extra config
Smaller deployments preferring simplicity over scalability
Existing B3-instrumented stacks being migrated incrementally
Teams wanting a minimal single-binary deployment
Development/staging environments

Docker — Zipkin Quickstart

# Simplest possible start — in-memory storage
docker run -d -p 9411:9411 --name zipkin openzipkin/zipkin
# UI: http://localhost:9411

# With Elasticsearch backend
services:
  zipkin:
    image: openzipkin/zipkin:latest
    ports:
      - "9411:9411"
    environment:
      - STORAGE_TYPE=elasticsearch
      - ES_HOSTS=http://elasticsearch:9200
      - ES_INDEX=zipkin
      - ES_INDEX_REPLICAS=1
      - ES_INDEX_SHARDS=5

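# Spring Boot 3.x application.yaml (no code changes; needs the
# micrometer-tracing-bridge-brave and zipkin-reporter-brave dependencies)
management:
  tracing:
    sampling:
      probability: 1.0   # 100% in dev
  zipkin:
    tracing:
      endpoint: http://zipkin:9411/api/v2/spans

# Spring Boot 2.x with Spring Cloud Sleuth uses spring.sleuth.sampler.probability,
# spring.zipkin.base-url, and spring.zipkin.sender.type (web or kafka) instead.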

◎

OTel → Zipkin: The OpenTelemetry Collector can fan out to Zipkin using the zipkinexporter — so you don't need Zipkin-specific SDKs. Instrument with OTel once and export to Zipkin (dev), Jaeger (staging), and Grafana Tempo (production) simultaneously via the Collector pipeline.

Jaeger vs Zipkin

// CHOOSING THE RIGHT TOOL

Dimension	Jaeger	Zipkin
Maturity	CNCF Graduated — production battle-tested at Uber scale	Mature — Twitter-born, in production use since 2012
Ingest Protocol	OTLP-native (v2), Jaeger Thrift, Zipkin JSON/Thrift	Zipkin JSON/Proto over HTTP or Kafka; OTel via Collector
Scale	High — designed for millions of spans/sec with ES/Cassandra	Medium — sufficient for most orgs; ES backend required at scale
UI quality	Full-featured: trace comparison, service graph, search	Clean, minimal — good for trace lookup, less analytical
Adaptive Sampling	YES — per-service, dynamic, server-driven	NO — client-side only, manual configuration
Deployment complexity	Medium — Operator recommended; more components	Low — single binary, trivial to start
Spring Boot integration	Via OTel SDK	Native: Spring Cloud Sleuth (Boot 2.x) or Micrometer Tracing (Boot 3.x)
Grafana integration	First-class data source	Built-in Zipkin data source
Choose when	Kubernetes, high volume, need adaptive sampling, Grafana stack	Spring Boot shop, simplicity priority, getting started fast

⬡ Jaeger Strengths

Native OTLP ingestion in v2 — no adapter needed
Adaptive sampling adjusts per service automatically
Better UI: trace comparison, dependency graph, deep search
Kubernetes-native: Jaeger Operator manages the full stack
Multi-tenant support via tenant headers

◎ Zipkin Strengths

Simpler architecture — one binary to start
Spring Boot near-zero-config integration (Sleuth on Boot 2.x, Micrometer Tracing on Boot 3.x)
Lower operational overhead for smaller teams
B3 format ubiquitous in older instrumented stacks
Faster to get running from zero

8-Service Trace Walkthrough

// FOLLOWING A REQUEST END-TO-END

A user submits a chat message to an AI assistant app. The request traverses 8 microservices and produces 4 LLM API calls (1 embedding, 3 completions), 4 Postgres queries, 1 vector search, 1 cache lookup, and 1 Kafka publish. Total latency: 3,841ms. Without tracing, you'd never know the LLM orchestrator alone consumed 2,900ms.

TRACE: chat-completion — 3841ms total

Span	Detail	Duration
api-gateway / POST /chat	-	3841ms
auth-service / validateJWT	-	38ms
session-service / getSession	-	12ms
context-service / loadHistory	db	384ms
rag-service / vectorSearch	embed+search	231ms
↳ llm-router / embedding	ada-002	118ms
llm-orchestrator / complete	3 LLM hops	2900ms
↳ llm-router / classify	gpt-4o-mini	268ms
↳ llm-router / generate	claude-3-5-sonnet	2208ms ← BOTTLENECK
↳ llm-router / postprocess	gpt-4o-mini	192ms
storage-service / saveMessage	postgres	154ms
audit-service / enqueue	kafka	96ms

Legend (span color key in the original waterfall): Root / Gateway, Service Call, LLM API Call, Database Query, Cache Hit/Miss, External (Kafka)

▸

What the trace reveals immediately: The llm-router/generate span with claude-3-5-sonnet accounts for 2,208ms — 57% of total latency. Without tracing, you'd see the API is slow but have no idea where to optimize. With the trace, you immediately know: investigate caching LLM responses, streaming, or prompt compression for the generate step.

The Span Tree for This Request

Text — Span Hierarchy (Jaeger UI view)

TRACE: 3fa85f64...  Total: 3841ms   Service count: 8   Span count: 19

api-gateway [0ms → 3841ms] POST /api/v1/chat   3841ms
├── auth-service [0ms → 38ms] validateJWT       38ms
│   └── redis [2ms → 8ms] GET session:token   6ms   (cache HIT)
├── session-service [40ms → 52ms] getSession     12ms
│   └── postgres [41ms → 50ms] SELECT sessions  9ms
├── context-service [54ms → 438ms] loadHistory   384ms
│   ├── postgres [55ms → 210ms] SELECT messages  155ms  ← slow query
│   └── postgres [211ms → 436ms] SELECT context  225ms  ← slow query
├── rag-service [56ms → 287ms] vectorSearch      231ms
│   ├── llm-router [58ms → 176ms] embedding       118ms   (ada-002, 312 tokens)
│   └── qdrant [177ms → 283ms] query            106ms
├── llm-orchestrator [615ms → 3515ms] complete   2900ms
│   ├── llm-router [616ms → 884ms] classify        268ms   (gpt-4o-mini, 156→43 tokens)
│   ├── llm-router [885ms → 3093ms] generate       2208ms  ← BOTTLENECK (claude-3-5-sonnet, 2847→612 tokens)
│   └── llm-router [3094ms → 3286ms] postprocess    192ms   (gpt-4o-mini, 674→124 tokens)
├── storage-service [3520ms → 3674ms] saveMessage  154ms
│   └── postgres [3521ms → 3670ms] INSERT messages 149ms
└── audit-service [3680ms → 3776ms] enqueueEvent   96ms
    └── kafka [3682ms → 3774ms] PRODUCE chat.events  92ms
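
The attribution can be checked mechanically. The sketch below takes a subset of the spans from the hierarchy above (database and cache child spans are omitted for brevity), finds the slowest leaf span, and reports its share of the root duration. The `Span` tuple and `slowest_leaf` helper are hypothetical names used only for illustration.

```python
from typing import NamedTuple, Optional


class Span(NamedTuple):
    name: str
    start_ms: int
    end_ms: int
    parent: Optional[str]  # name of the parent span, None for the root

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


ROOT = "api-gateway POST /api/v1/chat"
ORCH = "llm-orchestrator complete"
SPANS = [
    Span(ROOT, 0, 3841, None),
    Span(ORCH, 615, 3515, ROOT),
    Span("llm-router classify", 616, 884, ORCH),
    Span("llm-router generate", 885, 3093, ORCH),
    Span("llm-router postprocess", 3094, 3286, ORCH),
    Span("storage-service saveMessage", 3520, 3674, ROOT),
]


def slowest_leaf(spans):
    """Return the longest span that has no children in this span set."""
    parents = {s.parent for s in spans}
    leaves = [s for s in spans if s.name not in parents]
    return max(leaves, key=lambda s: s.duration_ms)


root = next(s for s in SPANS if s.parent is None)
worst = slowest_leaf(SPANS)
share = 100 * worst.duration_ms / root.duration_ms
print(f"{worst.name}: {worst.duration_ms}ms ({share:.0f}% of {root.duration_ms}ms)")
```

Looking at leaf spans rather than the longest span overall avoids blaming a parent such as `llm-orchestrator` for time that is actually spent in one of its children.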

Tracing LLM Calls

// OBSERVING AI INFERENCE AS FIRST-CLASS SPANS

LLM API calls are unlike normal HTTP calls. They have unique observability requirements: token counts (prompt vs. completion), model name and version, cost attribution, finish reason (stop/length/content_filter), streaming vs. non-streaming latency, and prompt content (with PII redaction). Each LLM call deserves a rich, dedicated span.

◉

OpenTelemetry GenAI Semantic Conventions: The OTel community has published standardized attribute names for LLM spans under gen_ai.*. Use these to ensure compatibility with observability platforms that understand GenAI semantics: gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.response.finish_reasons.

Python — Instrumenting LLM Calls (OpenAI / Anthropic)

import hashlib
import time
from opentelemetry import trace
from opentelemetry.trace import SpanKind
import anthropic

tracer = trace.get_tracer("llm-router")

def hash_prompt(prompt: str) -> str:
    # Short, stable fingerprint for dedup; the raw prompt is never stored
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:16]

async def call_llm_with_tracing(
    prompt: str, model: str = "claude-3-5-sonnet-20241022",
    max_tokens: int = 1024, purpose: str = "generate",
    temperature: float = 0.7,
) -> str:
    # Async client: the sync client would block the event loop inside a coroutine
    client = anthropic.AsyncAnthropic()

    with tracer.start_as_current_span(
        f"llm.{purpose}",
        kind=SpanKind.CLIENT
    ) as span:
        # OTel GenAI semantic conventions
        span.set_attributes({
            "gen_ai.system":              "anthropic",
            "gen_ai.request.model":       model,
            "gen_ai.request.max_tokens":  max_tokens,
            "gen_ai.request.temperature": temperature,

            # Custom: purpose and prompt fingerprint (NOT the full prompt in prod)
            "llm.purpose":                purpose,
            "llm.prompt.length":           len(prompt),
            "llm.prompt.hash":             hash_prompt(prompt),  # for dedup
        })

        start = time.monotonic()
        try:
            # Pass the same temperature that is recorded on the span
            response = await client.messages.create(
                model=model, max_tokens=max_tokens, temperature=temperature,
                messages=[{"role": "user", "content": prompt}]
            )

            latency_ms = (time.monotonic() - start) * 1000

            # Record response telemetry
            span.set_attributes({
                "gen_ai.usage.input_tokens":     response.usage.input_tokens,
                "gen_ai.usage.output_tokens":    response.usage.output_tokens,
                "gen_ai.response.finish_reasons": [response.stop_reason],
                "gen_ai.response.id":            response.id,

                # Cost attribution (calculate from token counts × model pricing)
                "llm.cost.input_usd":   response.usage.input_tokens * 3e-6,
                "llm.cost.output_usd":  response.usage.output_tokens * 15e-6,
                "llm.latency_ms":       round(latency_ms, 2),
                "llm.tokens_per_second": response.usage.output_tokens / (latency_ms / 1000),
            })
            return response.content[0].text

        except anthropic.APITimeoutError as e:
            span.set_status(trace.StatusCode.ERROR, "LLM timeout")
            span.record_exception(e)
            span.set_attribute("llm.error.type", "timeout")
            raise
        except anthropic.RateLimitError as e:
            span.set_status(trace.StatusCode.ERROR, "rate_limit")
            span.set_attribute("llm.error.type", "rate_limit")
            span.record_exception(e)
            raise

What LLM Span Attributes Enable

Cost Attribution by Request (FinOps)

Sum llm.cost.input_usd + llm.cost.output_usd across all LLM spans in a trace. Aggregate by user.id, tenant.id, or feature to understand who's driving LLM costs. Build dashboards that break down inference spend by product surface.

Latency Breakdown (Performance)

Separate TTFT (time-to-first-token) from full generation time with span events. Identify whether latency is dominated by prompt processing or generation. Detect when a model has higher-than-usual latency indicating capacity issues on the provider side.

Prompt Cache Efficiency (Optimization)

Track gen_ai.usage.cache_read_tokens vs gen_ai.usage.input_tokens. Measure cache hit rate per prompt template. Guide prompt engineering decisions: long system prompts benefit from prefix caching — but only if you can measure the impact.

Failure Mode Analysis (Reliability)

Track gen_ai.response.finish_reasons across requests. High rate of length (truncation) means your max_tokens is too low. content_filter spikes indicate prompt injection attempts or edge-case inputs needing guardrails.
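
The cost attribution described above comes down to multiplying token counts by a per-token price and grouping by a business attribute. The following sketch uses the token counts from the walkthrough trace. The `PRICING` table holds assumed per-token prices for illustration only (check your provider's current rates), and `span_cost_usd` and `cost_by` are hypothetical helpers.

```python
from collections import defaultdict

# Assumed USD prices per token, for illustration only
PRICING = {
    "claude-3-5-sonnet": {"input": 3e-6, "output": 15e-6},
    "gpt-4o-mini": {"input": 0.15e-6, "output": 0.60e-6},
}


def span_cost_usd(attrs):
    """Cost of one LLM span: tokens x per-token price for its model."""
    price = PRICING[attrs["gen_ai.request.model"]]
    return (
        attrs["gen_ai.usage.input_tokens"] * price["input"]
        + attrs["gen_ai.usage.output_tokens"] * price["output"]
    )


def cost_by(span_attrs, key):
    """Sum LLM cost across spans, grouped by an attribute such as user.id."""
    totals = defaultdict(float)
    for attrs in span_attrs:
        totals[attrs[key]] += span_cost_usd(attrs)
    return dict(totals)


def llm_span(model, input_tokens, output_tokens, user_id="usr_9kx2z"):
    return {
        "user.id": user_id,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }


llm_spans = [
    llm_span("gpt-4o-mini", 156, 43),          # classify
    llm_span("claude-3-5-sonnet", 2847, 612),  # generate
    llm_span("gpt-4o-mini", 674, 124),         # postprocess
]
totals = cost_by(llm_spans, "user.id")
print({user: round(usd, 6) for user, usd in totals.items()})
```

Grouping by `tenant.id` or a feature attribute instead of `user.id` only changes the `key` argument, provided that attribute is set on every LLM span.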

Context Propagation

// PASSING THE BATON ACROSS SERVICES

Context propagation is the mechanism that links spans across service boundaries into a single trace. The calling service injects the current span context into outgoing request headers. The receiving service extracts it and creates its child span with that parent. Without correct propagation, you get disconnected single-service traces.

Python — Manual Propagation (HTTP & Kafka)

import json
from opentelemetry import trace, propagate
import httpx
from confluent_kafka import Producer

tracer = trace.get_tracer("api-gateway")
# Assumed broker address; replace with your cluster's bootstrap servers
producer = Producer({"bootstrap.servers": "kafka:9092"})

# ── HTTP outbound call ──
async def call_downstream(url: str, payload: dict) -> dict:
    with tracer.start_as_current_span("http.client.call", kind=trace.SpanKind.CLIENT) as span:
        headers = {}
        # Inject current span context into HTTP headers
        propagate.inject(headers)
        # → headers now contains: {"traceparent": "00-3fa8...-01"}

        async with httpx.AsyncClient() as client:
            resp = await client.post(url, json=payload, headers=headers)
            span.set_attribute("http.response.status_code", resp.status_code)
            return resp.json()

# ── Kafka producer (inject into message headers) ──
def publish_event(topic: str, event: dict):
    kafka_headers = {}
    propagate.inject(kafka_headers)   # Same inject API
    kafka_header_list = [(k, v.encode()) for k, v in kafka_headers.items()]

    producer.produce(topic, key=event["id"],
                     value=json.dumps(event).encode(),
                     headers=kafka_header_list)

# ── Receiving service — extract context ──
from opentelemetry.propagate import extract
from fastapi import Request

async def my_endpoint(request: Request):
    # Auto-instrumented with FastAPI middleware — but here's the manual version:
    ctx = extract(dict(request.headers))  # Extract propagated context
    with tracer.start_as_current_span(
        "process_request",
        context=ctx,           # parent = the extracted span context
        kind=trace.SpanKind.SERVER
    ) as span:
        # Attribute values must not be None, so fall back to a placeholder
        span.set_attribute("user.id", request.headers.get("x-user-id", "unknown"))
        ...

◎

Async and queues: Message queues (Kafka, SQS, RabbitMQ) break the synchronous parent-child span relationship. Use span links to connect a producer span to a consumer span when there's no direct causal chain — the consumer span has a link to the producer span, indicating relationship without implying causality. This is the correct model for fanout patterns.

Auto vs Manual Instrumentation

// WHAT YOU GET FOR FREE AND WHEN TO GO DEEPER

OTel auto-instrumentation uses bytecode manipulation (Java), monkey-patching (Python), or module wrapping (Node.js) to instrument popular libraries without code changes. It's the right starting point — but insufficient alone for business-meaningful observability.

Layer	Auto-Instrumented	Manual Required
HTTP	YES — all inbound/outbound HTTP spans, status codes, routes	Business context: `user.id`, `tenant.id`, `feature.flag`
Databases	YES — query spans with statement text, table name, duration	Slow query context: row count, index used, affected records
Redis / Memcached	YES — command + key pattern, latency	Hit/miss rate as span events, cache key namespace
gRPC	YES — method, status code, propagation	Payload size, retry count, deadline remaining
LLM APIs	PARTIAL — OTel contrib and community instrumentations exist for some SDKs (for example openai-python); verify coverage for your provider	Token counts, cost, finish reason, model, prompt hash — often manual
Message queues	PARTIAL — Kafka, RabbitMQ instrumentations available	Consumer lag, queue depth at time of processing, span links
Business logic	NO	Everything: business operations, user intent, A/B variant, recommendation algorithm used

▸

The 80/20 rule: Auto-instrumentation gives you 80% of trace coverage with 0% code changes. Add manual spans at decision points that matter for debugging: each distinct algorithmic step, each third-party API call not covered by auto-instrumentation, each conditional branch that changes the request path significantly.

Sampling Strategies

// NOT EVERY TRACE NEEDS TO LIVE FOREVER

At scale — hundreds of services, thousands of requests per second — tracing 100% of requests is prohibitively expensive. Sampling controls which traces are recorded and stored. The key insight: you need representative coverage of normal traffic, but always need to capture errors, high latency, and critical business flows.

Head-Based Sampling (Simple)

The sampling decision is made at the start of a request, at the root span. All downstream services inherit the decision via the sampled flag in the traceparent header. Simple to implement, but you decide to sample before you know if the trace will be interesting.

Probabilistic — sample X% of all traces
Rate-limiting — max N traces/second regardless of volume
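Always-on for critical paths — sample 100% for selected services or routes (errors cannot be targeted here, because the status is not yet known when the decision is made)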
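
A minimal sketch of a probabilistic head sampler, similar in spirit to the ratio-based samplers in the OTel SDKs: the decision is derived from the trace ID itself, so every service that sees the same trace ID reaches the same answer. `should_sample` and `traceparent_flags` are hypothetical names.

```python
import secrets


def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Deterministic head sampling: the same trace ID always gets the same decision."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Compare the low 64 bits of the trace ID against a threshold derived from the ratio
    bound = round(ratio * (1 << 64))
    return (int(trace_id_hex, 16) & ((1 << 64) - 1)) < bound


def traceparent_flags(sampled: bool) -> str:
    """Flags field propagated downstream in the traceparent header."""
    return "01" if sampled else "00"


# Rough check against random 128-bit trace IDs
trace_ids = [secrets.token_hex(16) for _ in range(10_000)]
kept = sum(should_sample(t, 0.10) for t in trace_ids)
print(f"kept {kept} of {len(trace_ids)} traces at a 10% ratio")
```

Because the decision depends only on the trace ID and the ratio, it can be recomputed anywhere. It still cannot see errors or latency, which is the gap tail-based sampling fills.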

Tail-Based Sampling (Advanced)

The sampling decision is made after the trace is complete: all spans are buffered, then a policy evaluates the entire trace. This captures slow, erroneous, or unusual traces that head sampling would discard. It requires a stateful collector (for example, the OTel Collector tail_sampling processor), and every span of a trace must reach the same collector instance.

Latency-based — keep all traces where total > 2000ms
Error-based — keep all traces containing error spans
Attribute-based — always keep VIP user traces
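
The policy evaluation can be sketched in a few lines of plain Python. This is an illustration of the decision logic only, not the Collector's implementation: `BufferedTrace` and `keep_trace` are hypothetical names, and the thresholds mirror the examples in the list above.

```python
import random
from dataclasses import dataclass, field


@dataclass
class BufferedTrace:
    duration_ms: float
    has_error: bool
    attributes: dict = field(default_factory=dict)


def keep_trace(
    t,
    latency_threshold_ms=2000,
    vip_tiers=("enterprise", "vip"),
    base_rate=0.10,
    rng=random.random,
):
    """Return the name of the first matching policy, or None to drop the trace."""
    if t.has_error:
        return "error"
    if t.duration_ms > latency_threshold_ms:
        return "latency"
    if t.attributes.get("user.tier") in vip_tiers:
        return "vip"
    if rng() < base_rate:
        return "probabilistic"
    return None


print(keep_trace(BufferedTrace(duration_ms=3841, has_error=False)))
print(keep_trace(BufferedTrace(duration_ms=120, has_error=True)))
```

The `rng` parameter exists only to make the probabilistic branch testable. The important property is that a trace is kept as soon as any one policy matches.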

YAML — OTel Collector Tail Sampling Config

# collector-config.yaml
processors:
  tail_sampling:
    decision_wait: 10s       # Wait up to 10s for all spans before deciding
    num_traces: 50000        # Max traces in memory buffer
    expected_new_traces_per_sec: 2000

    policies:
      # Policy 1: Always keep errors
      - name: keep-errors
        type: status_code
        status_code: { status_codes: [ERROR] }

      # Policy 2: Keep slow traces (> 2s)
      - name: keep-slow-traces
        type: latency
        latency: { threshold_ms: 2000 }

      # Policy 3: Always keep payment traces
      - name: keep-payments
        type: string_attribute
        string_attribute:
          key: service.name
          values: [payment-service]
          enabled_regex_matching: false

      # Policy 4: Always keep VIP user traces
      - name: keep-vip-users
        type: string_attribute
        string_attribute:
          key: user.tier
          values: [enterprise, vip]

      # Policy 5: Probabilistic 10% of everything else
      - name: probabilistic-base
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }

      # No composite policy is needed: top-level policies are evaluated
      # independently, and a trace is kept if ANY of them decides to sample it.

Span Attributes

// WHAT TO RECORD AND HOW

Span attributes are the metadata that make a trace actually searchable and debuggable. OTel publishes semantic conventions — standardized attribute names for common operations. Use them to ensure your traces are compatible with tracing backends and APM platforms without custom parsing.

Category	Required Attributes	Optional
All Services	`service.name`, `service.version`, `deployment.environment`	`k8s.pod.name`, `k8s.namespace.name`, `host.id`
HTTP Server	`http.request.method`, `http.route`, `http.response.status_code`	`http.request.body.size`, `client.address`, `user_agent.original`
HTTP Client	`http.request.method`, `server.address`, `http.response.status_code`	`http.request.body.size`, `server.port`
Database	`db.system`, `db.name`, `db.statement` (sanitized), `db.operation`	`db.sql.table`, `db.rows_affected`, `server.address`
LLM / GenAI	`gen_ai.system`, `gen_ai.request.model`, `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`	`gen_ai.response.finish_reasons`, `gen_ai.response.id`, `llm.cost.input_usd`, `llm.cost.output_usd`
Messaging	`messaging.system`, `messaging.destination.name`, `messaging.operation`	`messaging.message.body.size`, `messaging.kafka.offset`
Business	`user.id`, `tenant.id`	`feature.flag`, `ab.experiment`, `user.tier`, domain-specific fields

⚠

NEVER put PII in span attributes. Spans are stored long-term, indexed, and readable by anyone with access to the tracing backend. Sanitize DB statements (remove parameter values), hash user identifiers, truncate request bodies, and strip authentication headers before adding to spans. Use user.id (opaque), never user.email or user.name.
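
One way to enforce this is to pass attributes through a sanitizer before they reach the span. The sketch below uses only the standard library; `SENSITIVE_KEYS`, `opaque_id`, `sanitize_sql`, and `sanitize_attributes` are hypothetical names, the salt is a placeholder, and the regex-based SQL scrubbing is deliberately simple (it is not a SQL parser).

```python
import hashlib
import re

# Attributes that should never be recorded (illustrative list)
SENSITIVE_KEYS = {"user.email", "user.name", "http.request.header.authorization"}


def opaque_id(value: str, salt: str = "replace-with-secret-salt") -> str:
    """Stable, non-reversible identifier that still allows grouping by user."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]


def sanitize_sql(statement: str) -> str:
    """Replace string and numeric literals with '?' placeholders."""
    statement = re.sub(r"'(?:[^']|'')*'", "?", statement)
    # Leave identifiers like table1 and bind parameters like $1 untouched
    return re.sub(r"(?<![\w$])\d+(?:\.\d+)?\b", "?", statement)


def sanitize_attributes(attrs: dict) -> dict:
    """Drop sensitive keys and scrub DB statements before setting span attributes."""
    clean = {}
    for key, value in attrs.items():
        if key in SENSITIVE_KEYS:
            continue
        if key == "db.statement":
            value = sanitize_sql(value)
        clean[key] = value
    return clean


raw = {
    "user.id": opaque_id("alice@example.com"),
    "user.email": "alice@example.com",
    "db.statement": "SELECT * FROM users WHERE email = 'alice@example.com' AND age > 30",
}
print(sanitize_attributes(raw))
```

Running the sanitizer in one shared helper (or in a Collector processor) is easier to audit than relying on every call site to remember the rule.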

Debugging with Traces

// FROM ALERT TO ROOT CAUSE IN MINUTES

A trace is the shortest path from a user complaint to a root cause. This section describes the five most common debugging workflows and how to execute them in Jaeger or Zipkin.

Workflow 1 — "Request X was slow for user Y" Latency

Search Jaeger: service=api-gateway, tags user.id=usr_9kx2z, min duration 2s, lookback 1h. Open the highest-latency trace, expand the waterfall, and sort spans by duration. The longest non-root span is your starting point for investigation.
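When working from an exported trace rather than the UI, the "longest non-root span" step can be sketched as below. The span shape (`span_id`, `parent_id`, `name`, `duration_ms`) is a simplified assumption for illustration, not the Jaeger or Zipkin export format.

```python
from typing import List, Optional, TypedDict


class Span(TypedDict):
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    duration_ms: float


def slowest_non_root(spans: List[Span], top: int = 3) -> List[Span]:
    """Return the `top` longest spans, excluding the root."""
    non_root = [s for s in spans if s["parent_id"] is not None]
    return sorted(non_root, key=lambda s: s["duration_ms"], reverse=True)[:top]
```

A long parent span often just contains a long child, so walk down from the first result until you reach a span whose own children do not account for its duration.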

Workflow 2 — "We're getting 5xx errors since deploy" Error

Search: service=order-service, tags error=true, lookback 30m. Find a representative error trace. The red (error) span tells you the exact service, span name, and exception. When the instrumentation records exceptions, the span's exception event carries the stack trace (exception.stacktrace).

Workflow 3 — "Which service is causing N+1 queries" Database

Find a trace with a high database span count and filter its spans by db.system=postgresql. Thirty identical SELECT spans under one parent reveal the N+1 pattern immediately: you can see the parent span that spawned them and, where the instrumentation sets code.filepath, the calling code.
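The same check can be automated over exported spans. The sketch below groups database spans by parent and sanitized statement and flags any group above a threshold. The span dictionary shape and the default threshold of 10 are assumptions for illustration.

```python
from collections import Counter


def find_n_plus_one(spans, threshold=10):
    """Flag (parent, statement) pairs that repeat at least `threshold` times."""
    counts = Counter(
        (s["parent_id"], s["attributes"]["db.statement"])
        for s in spans
        if "db.statement" in s.get("attributes", {})
    )
    return [
        {"parent_id": parent, "statement": stmt, "count": n}
        for (parent, stmt), n in counts.items()
        if n >= threshold
    ]
```

Grouping only works if statements are sanitized first, so that `WHERE id = ?` is identical across iterations of the loop.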

Workflow 4 — "Why did this LLM call cost 10× the normal amount" LLM Cost

Find the trace by trace_id from your LLM cost anomaly alert. Inspect the llm.router/generate span. Check gen_ai.usage.input_tokens — was there context accumulation? Check the upstream context-service span for how much history was loaded. The causal chain is visible.

Workflow 5 — "Service mesh is timing out intermittently" Reliability

Search by a duration range overlapping the timeout SLA. Pick 5 slow traces and 5 fast traces and compare them pairwise in Jaeger's trace comparison view. Look for the span that is consistently longer in the slow traces: retry storms, connection pool exhaustion, and GC pauses all show distinct patterns.
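With more than a handful of traces, the slow-versus-fast comparison is easier to do numerically. A minimal sketch, assuming each trace is a list of spans with `name` and `duration_ms` fields (a simplified shape, not a backend export format):

```python
from collections import defaultdict
from statistics import mean


def mean_duration_by_span(traces):
    """Average duration per span name across a set of traces."""
    buckets = defaultdict(list)
    for trace in traces:
        for span in trace:
            buckets[span["name"]].append(span["duration_ms"])
    return {name: mean(values) for name, values in buckets.items()}


def rank_regressions(slow_traces, fast_traces):
    """Span names ordered by how much longer they run in the slow set."""
    slow = mean_duration_by_span(slow_traces)
    fast = mean_duration_by_span(fast_traces)
    # Spans that only appear in slow traces (e.g. retries) count in full.
    deltas = {name: slow[name] - fast.get(name, 0.0) for name in slow}
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)
```

The first entry of the ranking is the span to inspect first. A span name that exists only in the slow set is itself a strong hint of retries or a fallback path.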

Performance Analysis

// READING THE WATERFALL

The trace waterfall is your primary tool for understanding where time goes. Learn to read it fluently: gaps between spans reveal network overhead; spans that start late reveal queuing; parallel spans that collapse into a long sequential chain reveal concurrency bugs.

Common Latency Patterns

A. SEQUENTIAL BLOCKING — should be parallel

fetch_user_profile	200ms
fetch_user_orders	180ms
fetch_recommendations	220ms

→ Total: 600ms | Fix: asyncio.gather() → reduces to ~220ms (longest operation)

B. PARALLEL (CORRECT) — same calls, concurrent

fetch_user_profile	200ms
fetch_user_orders	180ms
fetch_recommendations	220ms

→ Total: 220ms | ~63% lower latency than sequential (600ms → 220ms)
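Patterns A and B can be reproduced with a short sketch. The `fetch` coroutine is a stand-in that only sleeps for the durations shown in the waterfalls. Real calls would be awaited HTTP or database requests.

```python
import asyncio
import time

# (name, simulated latency in seconds) taken from the waterfalls above
CALLS = [
    ("fetch_user_profile", 0.200),
    ("fetch_user_orders", 0.180),
    ("fetch_recommendations", 0.220),
]


async def fetch(name: str, delay_s: float) -> str:
    await asyncio.sleep(delay_s)  # placeholder for real I/O
    return name


async def sequential():
    # Pattern A: each call waits for the previous one to finish.
    return [await fetch(name, delay) for name, delay in CALLS]


async def parallel():
    # Pattern B: all calls start together; total is roughly the slowest call.
    return list(await asyncio.gather(*(fetch(n, d) for n, d in CALLS)))


def timed(coro_fn):
    """Run a coroutine function and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn())
    return result, time.perf_counter() - start
```

`asyncio.gather()` returns results in the order the awaitables were passed, so callers do not need to change how they consume the three values. This only helps when the calls are independent: if one needs the output of another, it has to stay sequential.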

C. GAPS — network/scheduling overhead

parent_service call	800ms
child span (actual)	440ms

→ Gap before child (18%) = DNS + TCP + TLS handshake = 144ms overhead

→ Gap after child (27%) = 216ms of response processing time not instrumented
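The gap arithmetic is simple enough to compute directly from span timestamps. A minimal sketch with illustrative numbers, assuming each span exposes `start_ms` and `end_ms` on a shared clock (hypothetical field names):

```python
def uninstrumented_gaps(parent, child):
    """Time inside the parent span that the child span does not cover."""
    total = parent["end_ms"] - parent["start_ms"]
    before = child["start_ms"] - parent["start_ms"]
    after = parent["end_ms"] - child["end_ms"]
    return {
        "before_ms": before,
        "before_pct": round(100 * before / total, 1),
        "after_ms": after,
        "after_pct": round(100 * after / total, 1),
    }
```

The percentages plus the child's share of the parent always sum to 100, which is a quick sanity check when reading a waterfall. Clock skew between hosts can distort gaps when parent and child come from different services, so treat small gaps with caution.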

◉

Connection pool exhaustion: If a child span starts late (a large gap between parent start and child start) and multiple requests show the same pattern at the same time, suspect connection pool exhaustion. The request is waiting for a database connection, not for the database to respond. To confirm, record the time spent waiting for a connection, for example as a db.wait_ms attribute on a span event.
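One way to capture the wait time is to measure how long connection acquisition blocks and attach it to the current span. The sketch below uses a `queue.Queue` as a stand-in pool. The helper name `traced_connection` and the event name `db.pool.wait` are assumptions. `span` is expected to be an OpenTelemetry span, or any object with a compatible `add_event(name, attributes)` method.

```python
import queue
import time
from contextlib import contextmanager


@contextmanager
def traced_connection(pool: "queue.Queue", span, timeout_s: float = 5.0):
    """Borrow a connection and record how long the borrow blocked."""
    start = time.perf_counter()
    # Blocks while the pool is empty; raises queue.Empty after timeout_s.
    conn = pool.get(timeout=timeout_s)
    wait_ms = (time.perf_counter() - start) * 1000.0
    span.add_event("db.pool.wait", {"db.wait_ms": round(wait_ms, 2)})
    try:
        yield conn
    finally:
        pool.put(conn)  # always return the connection to the pool
```

If `db.wait_ms` accounts for most of the gap before the query span, the fix is in pool sizing or connection hold time, not in the query.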

Deployment & Storage

// RUNNING TRACING INFRASTRUCTURE AT SCALE

Storage Backend Selection

Backend	Scale	Cost
In-memory	Dev only	Free
Badger	Single node	Free
Elasticsearch	High	Medium
OpenSearch	High	Medium
Cassandra	Very High	High ops
ClickHouse	High	Low
Grafana Tempo	Cloud-scale	Object store

OTel Collector — Production Config

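Run the Collector as a DaemonSet (one per node) for local span collection: low latency, no cross-node hop from pods
Or as a Deployment (central gateway) for batching, fan-out, and tail sampling
Use both: DaemonSet agents forward to a central gateway Deployment that does tail sampling
With more than one gateway replica, route spans by trace ID (for example with the loadbalancing exporter) so every span of a trace reaches the same tail-sampling instance
Size memory: 2–4GB per collector at 10k spans/sec with tail sampling enabled, as a starting point to validate under your own load
Always set the memory_limiter processor, first in the pipeline, to prevent OOM under spikes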

YAML — OTel Collector Production Pipeline

# collector-production.yaml
receivers:
  otlp:
    protocols:
      grpc: { endpoint: "0.0.0.0:4317" }
      http: { endpoint: "0.0.0.0:4318" }

processors:
  memory_limiter:
    limit_mib: 3000
    spike_limit_mib: 500
    check_interval: 5s

  batch:
    timeout: 5s
    send_batch_size: 1024
    send_batch_max_size: 2048

  transform/sanitize:    # Redact secrets before storage
    error_mode: ignore
    trace_statements:
      - context: span
        statements:
          # Replace key=value secrets inside the statement text
          - 'replace_pattern(attributes["db.statement"], "(?i)(password|secret|token)\\s*=\\s*\\S+", "[sanitized]")'
          - 'delete_key(attributes, "http.request.header.authorization")'

  tail_sampling:         # Keep errors + slow traces
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow
        type: latency
        latency: { threshold_ms: 2000 }
      - name: base
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }

exporters:
  otlp/jaeger:
    endpoint: jaeger-collector.observability.svc:4317
    tls: { insecure: false, ca_file: /certs/ca.pem }

  zipkin:                  # Fan out to Zipkin for legacy consumers
    endpoint: http://zipkin.observability.svc:9411/api/v2/spans

  otlp/tempo:             # And Grafana Tempo for long-term storage
    endpoint: tempo.observability.svc:4317

service:
  pipelines:
    traces:
      receivers:  [otlp]
      # batch comes after tail_sampling: batch only the spans that survive sampling
      processors: [memory_limiter, transform/sanitize, tail_sampling, batch]
      exporters:  [otlp/jaeger, zipkin, otlp/tempo]

Tracing Maturity Model

// WHERE ARE YOU TODAY

Most organizations start with ad-hoc traces on a few services and build toward full-system observability. This model shows the progression. Focus on getting all services to Level 2 before perfecting any one service at Level 4.

Level 01

Blind

No distributed tracing
Log correlation by request ID only
Manual service topology diagrams
Latency debugging = log trawling
No cross-service visibility

Level 02

Aware

OTel auto-instrumentation on HTTP
Traces visible in Jaeger/Zipkin
Basic sampling (probabilistic)
Some services missing context propagation
No LLM / queue tracing

Level 03

Operational

All services traced, propagation correct
DB, cache, queue spans included
Tail-based sampling for errors/slowness
Business attributes on key spans
OTel Collector pipeline in production

Level 04

Optimal

LLM calls as rich spans with tokens/cost
Span links for async/queue flows
Traces correlated with logs & metrics
Adaptive sampling per-service
Automated anomaly detection on traces

▸

Reference reading: Google Dapper paper (2010) — the theoretical foundation. OpenTelemetry docs at opentelemetry.io. Jaeger documentation at jaegertracing.io. W3C TraceContext spec at w3.org/TR/trace-context. OTel GenAI semantic conventions at opentelemetry.io/docs/specs/semconv/gen-ai. Grafana Tempo for cost-efficient trace storage at object-store scale.