OpenTelemetry Handbook — Traces, Metrics & Logs

📡

What is OpenTelemetry?

// otel.io — the observability framework, not a backend

OpenTelemetry (OTel) is a CNCF project that provides a unified API, SDK, and tooling for collecting observability data from your services. It's not a monitoring backend — it's the instrumentation layer that sits between your code and the backends you already use (Datadog, Grafana, Jaeger, Honeycomb, etc.).

Before OTel, every vendor had its own SDK. If you used Datadog's tracer, you were locked in. OTel breaks that coupling: instrument once with the OTel API, route telemetry anywhere.

Vendor-Neutral

Write instrumentation once against the OTel API. Switch backends by changing a config file — no code changes. Jaeger today, Honeycomb tomorrow.

Three Signals

Traces, Metrics, and Logs under one unified model with correlated context. See a metric spike, jump to the trace, read the log — all linked by trace_id.

Any Language

SDKs for Go, Python, Java, JavaScript/Node, .NET, Rust, Ruby, PHP, and more. Several of them (Java, Python, Node.js, and .NET among others) also ship zero-code auto-instrumentation for popular frameworks; coverage varies by language.

🎯

The golden rule: OTel is the pipe, not the destination. You still need a backend that stores and queries telemetry data. OTel just ensures the data gets there in a standard format (OTLP — OpenTelemetry Protocol) regardless of which backend you choose.

OTel vs. what came before

Aspect	Old World (vendor SDKs)	OpenTelemetry
Instrumentation	Per-vendor SDK in every service	One OTel SDK, any backend
Lock-in	High — migration = re-instrumentation	Zero — swap exporters, keep code
Correlation	Traces and logs siloed	Unified `trace_id` / `span_id` across all signals
Format	Proprietary per vendor	OTLP (open standard)
Auto-instrumentation	Vendor-specific agents	OTel zero-code instrumentation
Governance	Vendor-controlled	CNCF — open governance

⚡

The Three Signals

// traces · metrics · logs — what they are, what they answer

Each signal answers a different observability question. They are complementary — not interchangeable. Using all three gives you the full picture; using only one leaves blind spots.

Signal	Answers	Shape of Data	Cardinality	Storage Cost
● Trace	"What happened and how long did each step take?"	Tree of spans with timings, attributes, events	High — one trace per request	High — sampled to manage
● Metric	"How is my system performing over time?"	Numerical time-series with labels/dimensions	Low-medium — aggregated	Low — pre-aggregated
● Log	"What exactly happened at this moment?"	Timestamped text/JSON with structured fields	Very high — every event	Very high — raw events

Signal correlation — the real power

OTel injects trace_id and span_id into logs automatically when you use the OTel Logs bridge. This means every log line emitted during a request is linkable to its trace, and every metric can carry exemplars pointing to a sample trace.

🔗

Exemplars: A Prometheus metric can carry an exemplar — a pointer to a specific trace ID that caused a high-latency observation. This bridges the gap between "p99 spiked at 14:32" and "here's the actual slow request in Jaeger."

🧱

Core Concepts

// API · SDK · providers · resources · context

API

OTel API

Stable interfaces you write instrumentation against. Zero implementation — it's a no-op until an SDK is configured. Libraries should depend only on the API, never the SDK.

SDK

OTel SDK

The implementation of the API. Handles sampling, batching, and exporting. Applications configure and install the SDK at startup. Libraries must never install an SDK.

RESOURCE

Resource

Immutable description of the entity producing telemetry. Set once at startup: service.name, service.version, deployment.environment, k8s.pod.name, etc.

PROVIDER

TracerProvider / MeterProvider / LoggerProvider

The entry point for creating tracers, meters, and loggers. Configure exporters and processors here. Treat as a singleton — one per signal per process.

CONTEXT

Context & Propagation

Carries trace_id, span_id, and baggage across process boundaries via HTTP headers (W3C traceparent), gRPC metadata, or message queues.

OTLP

OTLP Protocol

OpenTelemetry Protocol: the standard wire format. Payloads are Protobuf-encoded and sent over gRPC or HTTP; the HTTP transport also accepts a JSON encoding. The OTel Collector and most major backends accept OTLP natively.

⚠️

Library authors: Only depend on opentelemetry-api. Never depend on or configure the SDK. If a library configures a TracerProvider, it will break applications that have their own SDK setup. The application owns SDK configuration — always.

🔭

Architecture Overview

// how telemetry flows from code to backend

Your Code (App / Service) → OTel SDK (API + SDK) → Collector (OTel Collector) → Processor (Filter / Enrich) → Exporter (OTLP / Native) → Backend (Jaeger / Grafana…)

The OTel Collector is optional but strongly recommended in production. Without it, your services export directly to backends, coupling them to backend availability and format. With it, you gain a single control plane: filter noisy spans, add attributes, fan out to multiple backends, and change destinations without touching application code.

Direct Export (Dev / Simple)

Service → OTLP Exporter → Backend directly. Fine for development or very simple setups. No Collector needed. Trade-off: every service is coupled to backend configuration.

Collector Sidecar (Production)

Service → local Collector → central Collector gateway → backends. Services only know about localhost:4317. The Collector fleet handles routing, retry, batching, and fanout.

🔍

Tracing Concepts ● Trace

// distributed request tracing — follow a request across services

A trace is the complete record of a single request as it travels through your system. It's a directed acyclic graph (usually visualized as a waterfall) where each node is a span — a unit of work with a start time, duration, and metadata.

Trace

The entire journey of one request. Identified by a trace_id: a random 128-bit value, usually shown as 32 hex characters (it is not a UUID). All spans sharing a trace_id belong to the same request, regardless of which service created them.

Span

A single unit of work: "DB query", "HTTP call", "message publish". Has a span_id, a reference to its parent span, a start time, duration, status, and key-value attributes.
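To make the identifiers concrete, here is a minimal sketch of how a trace_id and span_id could be generated and how a child span points back at its parent. It uses only the standard library; the helper names and the dictionary layout are illustrative assumptions, not the SDK's internal representation.

```python
import secrets


def _random_hex(bits: int) -> str:
    """Return a non-zero random value of `bits` bits as zero-padded lowercase hex."""
    value = 0
    while value == 0:  # an all-zero ID is treated as invalid by W3C Trace Context
        value = secrets.randbits(bits)
    return format(value, f"0{bits // 4}x")


def new_trace_id() -> str:
    """128-bit trace ID, rendered as 32 hex characters."""
    return _random_hex(128)


def new_span_id() -> str:
    """64-bit span ID, rendered as 16 hex characters."""
    return _random_hex(64)


# One trace, two spans: the child shares the trace_id and references its parent.
trace_id = new_trace_id()
root = {"trace_id": trace_id, "span_id": new_span_id(), "parent_span_id": None,
        "name": "POST /orders"}
child = {"trace_id": trace_id, "span_id": new_span_id(),
         "parent_span_id": root["span_id"], "name": "INSERT orders"}
```

In real code the SDK generates these IDs for you; the sketch only shows what the values look like and how the parent reference ties spans into one trace.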

Span status codes

Status	When to use	Sets Error?
`UNSET`	Default — no explicit status set. The operation completed, but hasn't been evaluated for success or error.	No
`OK`	Explicitly mark as successful. Use sparingly — only when a downstream system needs to override error status from a library.	No
`ERROR`	An error occurred. Set the `exception` event and `error.type` attribute. This will show as a red span in tracing UIs.	Yes

✅

Span status rule: Only set ERROR when your layer considers the operation failed. Under the HTTP semantic conventions, a 404 leaves the server span UNSET (the server handled the request correctly), while the client span for the same call is marked ERROR. Whether that 404 is a failure for the business-logic span above it is your call. Think in terms of responsibility boundaries.

📏

Spans & Attributes ● Trace

// naming, attributes, events, links

Span naming — the most important thing you'll get wrong

Span names are used for aggregation in tracing UIs. A bad name means you can't group related spans. A span name must be low-cardinality — do not include dynamic data like IDs or user emails in the name.
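Frameworks usually hand you the route template directly, and that is the preferred source for a span name. When none is available, a fallback along the following lines can keep names low-cardinality. This is a rough sketch: the function name and the `{id}` placeholder are assumptions, and it only recognizes numeric and UUID-shaped path segments.

```python
import re

# Matches canonical 8-4-4-4-12 hex UUIDs.
_UUID = re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")


def fallback_span_name(method: str, path: str) -> str:
    """Build '{METHOD} {path}' with ID-like segments replaced by '{id}'."""
    segments = []
    for segment in path.split("/"):
        if segment.isdigit() or _UUID.match(segment):
            segments.append("{id}")  # collapse dynamic values into one placeholder
        else:
            segments.append(segment)
    return method.upper() + " " + "/".join(segments)
```

For example, `fallback_span_name("get", "/users/12345")` yields `GET /users/{id}`, so every user lookup aggregates under one name.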

✓ DO — low cardinality, meaningful

// HTTP spans: "{method} {route pattern}"
"GET /users/{userId}"     ✓
"POST /orders"           ✓

// DB spans: "{operation} {table}"
"SELECT orders"          ✓
"INSERT users"           ✓

// Internal: "{class}.{method}"
"OrderService.PlaceOrder" ✓

✕ DON'T — high cardinality, useless aggregation

// Never include dynamic values in span name
"GET /users/12345"        ✗ user ID
"Processing order 98765"  ✗ order ID

// Too vague
"process"                 ✗ meaningless
"handler"                 ✗ meaningless

// Includes PII
"user@company.com login"  ✗ PII in name

Attributes

Attributes are key-value pairs attached to a span. They provide context for analysis. Follow semantic conventions for standard attributes — roll your own only for domain-specific data.

python · instrumentation.py

from opentelemetry import trace
from opentelemetry.semconv.trace import SpanAttributes

tracer = trace.get_tracer("my.library", "1.0.0")

with tracer.start_as_current_span("OrderService.PlaceOrder") as span:
    # Standard semantic convention attributes
    span.set_attribute(SpanAttributes.DB_SYSTEM, "postgresql")
    span.set_attribute(SpanAttributes.DB_NAME, "orders")
    span.set_attribute(SpanAttributes.DB_STATEMENT, "INSERT INTO orders ...")

    # Custom domain attributes: namespace them so they cannot clash with semantic conventions
    span.set_attribute("order.customer_id", customer_id)  # high-cardinality, which is acceptable on spans
    span.set_attribute("order.item_count", len(items))
    span.set_attribute("order.total_usd", total)

    # Record the exception as a span event. record_exception() alone does not
    # change the span status, so ERROR is set explicitly below.
    try:
        result = place_order(order)
    except Exception as e:
        span.record_exception(e)
        span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
        raise

    # Events — timestamped annotations inside a span
    span.add_event("payment.authorized", {"payment.provider": "stripe"})
    span.add_event("inventory.reserved", {"warehouse.id": "WH-03"})

⚠️

Attribute cardinality kills performance. Never use attributes with unbounded values (user IDs, request IDs, raw SQL queries) as metric dimensions. In traces it's fine — in metrics it will blow up your backend's cardinality limits and cost you thousands per month.

💻

Trace API Usage ● Trace

// creating tracers, spans, and wiring SDK

Python

python · app.py

# pip install opentelemetry-sdk opentelemetry-exporter-otlp
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource, SERVICE_NAME

# 1. Configure resource (who is sending this?)
resource = Resource.create({
    SERVICE_NAME: "order-service",
    "service.version": "2.4.1",
    "deployment.environment": "production",
})

# 2. Build the TracerProvider with exporter
provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://collector:4317"))
)
trace.set_tracer_provider(provider)

# 3. Get a tracer — do this once per module/library
tracer = trace.get_tracer("order_service.checkout", "2.4.1")

# 4. Create spans
def checkout(cart_id: str):
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("cart.id", cart_id)
        validate_cart(cart_id)    # child spans created inside here
        charge_payment(cart_id)   # are automatically children of "checkout"

# 5. Manual child span with context manager
def validate_cart(cart_id: str):
    with tracer.start_as_current_span("CartService.validate") as span:
        span.set_attribute("cart.id", cart_id)
        # ... business logic

JavaScript / Node.js

javascript · tracing.js

// npm i @opentelemetry/sdk-node @opentelemetry/exporter-trace-otlp-grpc
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes as SRA } from '@opentelemetry/semantic-conventions';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

// Call this BEFORE requiring your app code
const sdk = new NodeSDK({
  resource: new Resource({
    [SRA.SERVICE_NAME]: 'payment-service',
    [SRA.SERVICE_VERSION]: process.env.APP_VERSION,
    [SRA.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV,
  }),
  traceExporter: new OTLPTraceExporter({ url: 'http://collector:4317' }),
  instrumentations: [getNodeAutoInstrumentations()],  // Express, http, pg, redis…
});
sdk.start();

// Manual spans in application code
import { trace, SpanStatusCode } from '@opentelemetry/api';
const tracer = trace.getTracer('payment-service');

// `stripe` is assumed to be an already-configured payment client
async function chargeCard(orderId, amount) {
  return tracer.startActiveSpan('PaymentService.charge', async (span) => {
    span.setAttribute('order.id', orderId);
    span.setAttribute('payment.amount_usd', amount);
    try {
      const result = await stripe.charge({ amount, currency: 'usd' });
      span.setAttribute('payment.charge_id', result.id);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (err) {
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      throw err;
    } finally {
      span.end();  // always end spans
    }
  });
}

⛔

Always end spans. A span that is never ended will leak memory and may never be exported. In Python, the start_as_current_span context manager ends the span when the block exits. In JS, startActiveSpan only makes the span active for the duration of the callback; it does not end it, so call span.end() in a finally block.

🔗

Context Propagation ● Trace

// W3C traceparent · B3 · baggage — crossing service boundaries

Context propagation is how a trace_id flows from Service A to Service B to Service C. Without it, you get disconnected spans that each look like independent requests. With it, you get the full distributed waterfall.

Propagator	Header	Use when
W3C TraceContext	`traceparent`, `tracestate`	Default. Use this everywhere. W3C Recommendation.
B3 Single	`b3`	Legacy — services using Zipkin/older Brave instrumentation
B3 Multi	`X-B3-TraceId`, `X-B3-SpanId`…	Legacy Zipkin multi-header environments
Baggage	`baggage`	User-defined key-value pairs propagated downstream (use sparingly)
AWS X-Ray	`X-Amzn-Trace-Id`	When target is AWS Lambda/API Gateway
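To show what actually travels in the `traceparent` header, the sketch below formats and parses the version-00 layout: `version-trace_id-span_id-flags`, all lowercase hex. It is a simplified illustration (the function and class names are assumptions, and it rejects anything that is not exactly four fields); in practice the SDK's propagator does this for you.

```python
import re
from typing import NamedTuple, Optional

# version (2 hex) - trace_id (32 hex) - span_id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class ParentContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def format_traceparent(trace_id: str, span_id: str, sampled: bool) -> str:
    """Serialize a span context into a version-00 traceparent value."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header: str) -> Optional[ParentContext]:
    """Return the parent context, or None if the header is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return ParentContext(trace_id, span_id, bool(int(flags, 16) & 0x01))
```

A receiver that gets `None` back should start a new trace rather than fail the request.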

python · http_client.py

from opentelemetry import propagate, trace
import requests

tracer = trace.get_tracer(__name__)

# Inject context into outgoing HTTP headers (auto done by instrumented libs)
def call_inventory_service(product_id: str):
    headers = {}
    propagate.inject(headers)   # adds traceparent + baggage headers
    response = requests.get(
        f"http://inventory-svc/products/{product_id}",
        headers=headers
    )
    return response.json()

# Extract context from incoming request (frameworks handle this automatically)
def handle_request(request_headers: dict):
    ctx = propagate.extract(request_headers)
    # ctx now contains the parent span context
    with tracer.start_as_current_span("handle", context=ctx) as span:
        pass

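# Baggage: propagate business context downstream (use cautiously)
from opentelemetry import context
from opentelemetry.baggage import set_baggage, get_baggage

# set_baggage returns a new Context; attach it so inject() picks it up
token = context.attach(set_baggage("tenant.id", "acme-corp"))
try:
    call_inventory_service("sku-123")  # example product ID; the request now carries a baggage header
finally:
    context.detach(token)
# Downstream services can read: get_baggage("tenant.id")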

⚠️

Baggage is propagated to all downstream services and is visible in headers. Never put sensitive data (PII, secrets, tokens) in baggage. It's for low-cardinality routing hints like tenant ID or feature flags — not user data.
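The reason baggage is visible to everyone downstream is that it is just a plain header of comma-separated `key=value` pairs. The sketch below encodes and decodes that shape with the standard library. It is a simplified assumption-laden illustration (hypothetical function names, optional `;property` suffixes dropped on decode), not the SDK's propagator.

```python
from urllib.parse import quote, unquote


def encode_baggage(entries: dict) -> str:
    """Serialize entries as 'k1=v1,k2=v2' with percent-encoded values."""
    members = []
    for key, value in entries.items():
        members.append(quote(str(key), safe="") + "=" + quote(str(value), safe=""))
    return ",".join(members)


def decode_baggage(header: str) -> dict:
    """Parse a baggage header back into a dict, ignoring malformed members."""
    entries = {}
    for member in header.split(","):
        member = member.split(";", 1)[0]  # drop optional properties
        if "=" not in member:
            continue
        key, value = member.split("=", 1)
        entries[unquote(key.strip())] = unquote(value.strip())
    return entries
```

Anything placed in `entries` ends up in clear text on the wire, which is why only low-cardinality, non-sensitive hints belong there.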

📊

Metrics Concepts ● Metric

// time series, dimensions, aggregation

Metrics are aggregated numerical measurements over time. Unlike traces (one per request), metrics collapse many observations into a single time-series data point. They're cheap to store, fast to query, and ideal for dashboards and alerts.

A metric is identified by its name, unit, description, and a set of attributes (dimensions). The combination of metric name + attributes = a unique time series.

⛔

Cardinality explosion. Adding a high-cardinality attribute (like user_id or request_id) to a metric creates one time series per unique value. 1M users = 1M series. This will make your backend OOM and your bill astronomical. Attributes on metrics must be low-cardinality. Rule of thumb: < 100 unique values per attribute.
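A quick way to sanity-check a metric before shipping it is to multiply the number of distinct values each attribute can take. The sketch below does exactly that; the attribute names and counts are made-up examples, and the product is a worst-case upper bound, since not every combination necessarily occurs.

```python
from math import prod


def max_series(attribute_cardinalities: dict) -> int:
    """Upper bound on time series for one metric: product of per-attribute cardinalities."""
    return prod(attribute_cardinalities.values())


# Hypothetical attribute sets for a request-duration metric.
bounded = {"http.route": 40, "http.request.method": 5, "region": 6}
with_user_id = {**bounded, "user_id": 1_000_000}

print(max_series(bounded))       # 1200
print(max_series(with_user_id))  # 1200000000
```

One unbounded attribute turns a metric with about a thousand series into one with over a billion potential series.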

Metric data model — the four types of data points

Data Type	Description	Aggregation
Sum	Cumulative or delta sum of measurements. Monotonic (always increasing) or non-monotonic.	Sum over interval
Gauge	Instantaneous value — no aggregation across time. Current memory, CPU, queue depth.	Last value
Histogram	Distribution of values — bucket counts, sum, count. Powers p50/p99/p999 latency.	Configurable buckets
ExponentialHistogram	Adaptive-bucket histogram — higher resolution without pre-configuring buckets.	Adaptive buckets
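To illustrate what a Histogram data point holds, here is a toy explicit-bucket histogram: bucket counts plus a running sum and count. The class is a stand-in written for this handbook, not the SDK aggregation; it assumes upper-inclusive bucket boundaries, with one overflow bucket at the end.

```python
from bisect import bisect_left


class ExplicitBucketHistogram:
    """Toy aggregation: N boundaries produce N + 1 buckets."""

    def __init__(self, boundaries):
        self.boundaries = sorted(boundaries)
        self.bucket_counts = [0] * (len(self.boundaries) + 1)
        self.count = 0
        self.sum = 0.0

    def record(self, value: float) -> None:
        # bisect_left places value == boundary into the bucket that ends at
        # that boundary, i.e. buckets are (lower, upper].
        index = bisect_left(self.boundaries, value)
        self.bucket_counts[index] += 1
        self.count += 1
        self.sum += value


hist = ExplicitBucketHistogram([0.1, 0.5, 1.0])
for latency_s in (0.05, 0.1, 0.3, 2.0):
    hist.record(latency_s)
```

Percentiles such as p99 are estimated by the backend from these bucket counts, which is why bucket boundaries near your latency targets matter.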

🎚️

Instrument Types ● Metric

// counter · gauge · histogram · updown counter — which to use when

Instrument	Measurement	Default Aggregation	Use for
`Counter`	Monotonically increasing value you add to	Sum	Requests processed, bytes sent, errors, jobs completed
`UpDownCounter`	Value that can increase or decrease	Sum	Active connections, queue depth, items in cache
`Histogram`	Individual measurements — builds distribution	Explicit bucket histogram	Request latency, response size, batch sizes
`Gauge`	Snapshot of current value	Last value	CPU %, memory used, fan speed, temperature
`ObservableCounter`	Counter value read via callback	Sum	System counters you read (not control): CPU time, GC collections
`ObservableGauge`	Gauge value read via callback	Last value	System metrics: heap used, thread count, connection pool size
`ObservableUpDownCounter`	UpDownCounter read via callback	Sum	Number of goroutines, active DB connections (pull-style)

💡

Synchronous vs Observable: Synchronous instruments (Counter, Histogram) are recorded at the time of the event — "I just processed a request." Observable instruments are polled on each collection interval via a callback — "how many goroutines exist right now?" Use Observable when you're reading system state, not when you're counting events.
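The difference in control flow can be shown with a toy model. The class below is not the OpenTelemetry API; it is a hypothetical stand-in where synchronous counters are updated by the caller at event time, while observable callbacks run only when `collect()` is invoked.

```python
from typing import Callable, Dict


class ToyMeter:
    """Toy stand-in for a meter, used only to contrast the two recording styles."""

    def __init__(self) -> None:
        self._counters: Dict[str, float] = {}
        self._observables: Dict[str, Callable[[], float]] = {}

    def add(self, name: str, amount: float = 1) -> None:
        """Synchronous: the application records at the moment the event happens."""
        self._counters[name] = self._counters.get(name, 0) + amount

    def register_observable(self, name: str, callback: Callable[[], float]) -> None:
        """Observable: the meter calls back to read current state."""
        self._observables[name] = callback

    def collect(self) -> Dict[str, float]:
        """Runs once per collection interval; callbacks are invoked only here."""
        snapshot = dict(self._counters)
        for name, callback in self._observables.items():
            snapshot[name] = callback()
        return snapshot


queue = ["job-1", "job-2"]
meter = ToyMeter()
meter.register_observable("queue.depth", lambda: len(queue))
meter.add("requests")  # event happened: record it now
meter.add("requests")
snapshot = meter.collect()
```

Between collections the queue can grow and shrink many times without costing anything; only its depth at collection time is reported.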

💻

Metrics API Usage ● Metric

// creating meters and instruments in Python and JS

python · metrics.py

import time

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

# 1. Configure MeterProvider
reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="http://collector:4317"),
    export_interval_millis=60_000  # export every 60s
)
provider = MeterProvider(resource=resource, metric_readers=[reader])
metrics.set_meter_provider(provider)

# 2. Get a meter — once per library/component
meter = metrics.get_meter("order_service", "2.4.1")

# 3. Create instruments — at module load time, not per request
order_counter = meter.create_counter(
    "orders.placed",
    unit="{orders}",
    description="Total number of orders placed",
)
latency_histogram = meter.create_histogram(
    "orders.processing.duration",
    unit="s",
    description="Time to process an order end-to-end",
)
active_orders = meter.create_up_down_counter(
    "orders.in_flight",
    unit="{orders}",
    description="Orders currently being processed",
)

# 4. Record measurements with low-cardinality attributes only
def process_order(order):
    base_attrs = {"order.type": order.type, "region": order.region}
    outcome = "success"
    active_orders.add(1, base_attrs)
    start = time.monotonic()
    try:
        return do_process(order)
    except Exception:
        outcome = "error"
        raise
    finally:
        # The outcome is only known here, so it goes on the counter and histogram.
        # The up/down counter reuses base_attrs so +1 and -1 land on the same series.
        result_attrs = {**base_attrs, "outcome": outcome}
        latency_histogram.record(time.monotonic() - start, result_attrs)
        order_counter.add(1, result_attrs)
        active_orders.add(-1, base_attrs)

# 5. Observable gauge — reads system state via callback
def observe_queue_depth(options):
    yield metrics.Observation(get_queue_depth(), {"queue": "orders"})

meter.create_observable_gauge(
    "orders.queue.depth",
    callbacks=[observe_queue_depth],
    unit="{items}",
)

📐

Naming convention: Use dot-separated hierarchical names: http.server.request.duration, db.client.connections.count. Follow the OTel semantic conventions naming schema. Unit goes in the unit field, not the name ("ms", "s", "By", "{requests}").
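A small lint check can catch naming slips before they reach a dashboard. The sketch below assumes the instrument-name rule from the metrics API specification as I understand it (starts with a letter, then letters, digits, `_`, `.`, `-` or `/`, at most 255 characters) and adds a heuristic for units baked into the name. The suffix list and function name are illustrative assumptions.

```python
import re

# Assumed syntax: a letter followed by up to 254 allowed characters.
_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_.\-/]{0,254}")

# Heuristic only: common unit suffixes that belong in the unit field instead.
_UNIT_SUFFIXES = ("_ms", "_seconds", "_bytes", ".ms", ".seconds", ".bytes")


def lint_metric_name(name: str) -> list:
    """Return a list of problems; an empty list means the name looks fine."""
    problems = []
    if _NAME.fullmatch(name) is None:
        problems.append("invalid instrument name syntax")
    if name.lower().endswith(_UNIT_SUFFIXES):
        problems.append("unit belongs in the unit field, not the name")
    return problems
```

Running it in CI over the names your services register is one cheap way to keep a fleet consistent.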

📋

Log Signals ● Log

// structured logs with trace correlation

OTel Logs are the newest of the three signals (GA'd after traces and metrics). The OTel Logs API does not replace your logging library — it bridges your existing logger (loguru, structlog, winston, serilog) into the OTel pipeline, injecting trace_id and span_id automatically.

OTel Log Record

A structured event with a timestamp, severity, body (message), and a map of attributes. Plus trace_id and span_id when emitted in a trace context.

Log Bridge API

Connects your existing logging library to the OTel LoggerProvider. The bridge reads log records from the library and translates them into OTel log records — no rewrite required.

Severity levels mapping

OTel Severity	Number	Syslog	Log4j	Python logging
`TRACE`	1–4	—	TRACE	—
`DEBUG`	5–8	debug	DEBUG	DEBUG (10)
`INFO`	9–12	info, notice	INFO	INFO (20)
`WARN`	13–16	warning	WARN	WARNING (30)
`ERROR`	17–20	error, crit, alert	ERROR	ERROR (40)
`FATAL`	21–24	emerg	FATAL	CRITICAL (50)
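The Python column of the table can be expressed as a small lookup. This is a simplified sketch: it maps each stdlib level to the first number of the matching OTel range and ignores the finer steps inside a range, which a real bridge may use. The function name is an assumption.

```python
import logging

# (stdlib threshold, OTel severity number, OTel severity text), highest first.
_LEVELS = [
    (logging.CRITICAL, 21, "FATAL"),
    (logging.ERROR, 17, "ERROR"),
    (logging.WARNING, 13, "WARN"),
    (logging.INFO, 9, "INFO"),
    (logging.DEBUG, 5, "DEBUG"),
]


def to_otel_severity(levelno: int) -> tuple:
    """Map a stdlib logging level number to (severity_number, severity_text)."""
    if levelno <= 0:
        return 0, "UNSPECIFIED"  # logging.NOTSET carries no severity
    for threshold, number, text in _LEVELS:
        if levelno >= threshold:
            return number, text
    return 1, "TRACE"  # custom levels below DEBUG
```

For instance, `to_otel_severity(logging.WARNING)` gives `(13, "WARN")`.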

💻

Logs API Usage ● Log

// bridging existing loggers · trace correlation

python · logging_setup.py

# pip install opentelemetry-sdk opentelemetry-exporter-otlp opentelemetry-instrumentation-logging
import logging
from opentelemetry._logs import set_logger_provider
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry.exporter.otlp.proto.grpc._log_exporter import OTLPLogExporter

# 1. Configure the OTel LoggerProvider
log_provider = LoggerProvider(resource=resource)
log_provider.add_log_record_processor(
    BatchLogRecordProcessor(OTLPLogExporter(endpoint="http://collector:4317"))
)
set_logger_provider(log_provider)

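# 2. Bridge Python's stdlib logging → OTel
handler = LoggingHandler(level=logging.NOTSET, logger_provider=log_provider)
root_logger = logging.getLogger()
root_logger.addHandler(handler)
root_logger.setLevel(logging.INFO)  # root defaults to WARNING, which would drop the info() calls below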

# 3. Use your logger normally — trace_id injected automatically
logger = logging.getLogger("order_service.checkout")

def checkout(cart_id: str):
    with tracer.start_as_current_span("checkout"):
        # This log will automatically have trace_id + span_id attached
        logger.info("Starting checkout", extra={"cart.id": cart_id})
        try:
            result = process(cart_id)
            logger.info("Checkout complete", extra={"order.id": result.id})
        except Exception as e:
            logger.error("Checkout failed", exc_info=e,
                         extra={"cart.id": cart_id})

✓ DO — structured, low-cardinality

# Structured fields — searchable
logger.info("Order placed", extra={
    "order.id": order.id,
    "order.type": order.type,
    "order.region": order.region,
})
# Exception with full stack trace
logger.error("Payment failed",
    exc_info=True,
    extra={"payment.provider": "stripe"}
)

✕ DON'T — unstructured, unsearchable

# String interpolation — can't query by field
logger.info(
    f"Order {order.id} placed by {user.email}"
)
# PII in logs (GDPR violation!)
logger.debug(f"User card: {card_number}")

# Swallowing the exception object
except Exception as e:
    logger.error(str(e))  # loses stack trace

🔄

OTel Collector

// receive · process · export — the telemetry control plane

The OTel Collector is a standalone binary (or Docker image) that receives telemetry from your services, processes it, and exports it to backends. It's the recommended deployment pattern for production — your services only talk to a local Collector, never directly to a backend.

Receivers (OTLP · Jaeger · Prometheus · Zipkin) → Processors (Batch · Filter · Enrich · Sample) → Exporters (OTLP · Jaeger · Prometheus · Custom)

yaml · collector-config.yaml

# OTel Collector configuration
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # gRPC
      http:
        endpoint: 0.0.0.0:4318   # OTLP/HTTP (protobuf or JSON)
  prometheus:
    config:
      scrape_configs:
        - job_name: 'my-app'
          static_configs:
            - targets: ['app:8080']

processors:
  batch:
    timeout: 5s
    send_batch_size: 1000
  memory_limiter:
    limit_mib: 512
    check_interval: 1s
  resource:
    attributes:
      - key: deployment.environment
        value: "production"
        action: upsert              # add/overwrite resource attributes
  filter/drop_health_spans:       # drop noisy health-check traces
    error_mode: ignore
    traces:
      span:
        - 'attributes["http.target"] == "/health"'
  tail_sampling:                  # keep all errors + 10% of success
    decision_wait: 10s
    policies:
      - name: errors-policy
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: probabilistic-policy
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  otlp/honeycomb:
    endpoint: api.honeycomb.io:443
    headers:
      x-honeycomb-team: ${env:HONEYCOMB_API_KEY}
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
  debug:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers:  [otlp]
      processors: [memory_limiter, filter/drop_health_spans, tail_sampling, batch]  # batch goes last, after dropping and sampling
      exporters:  [otlp/jaeger, otlp/honeycomb]
    metrics:
      receivers:  [otlp, prometheus]
      processors: [memory_limiter, resource, batch]
      exporters:  [prometheusremotewrite]
    logs:
      receivers:  [otlp]
      processors: [memory_limiter, batch]
      exporters:  [otlp/honeycomb]

Collector distributions

Distribution	Use when
`otelcol` (core)	Minimal build — only OTel-maintained components. Recommended for most.
`otelcol-contrib`	Full distribution — all community components. Good for evaluating; large binary.
OCB (OTel Collector Builder)	Production — build a custom binary with only the components you need. Minimizes attack surface and binary size.
Vendor distributions	Grafana Agent, Elastic Agent, Datadog Agent — vendor-managed Collectors with extra integrations.

📤

Exporters

// how telemetry leaves the SDK and reaches backends

Exporter	Protocol	Use for
`OTLP/gRPC`	Protobuf over gRPC	Default. Best performance. Use in production.
`OTLP/HTTP`	Protobuf or JSON over HTTP	Environments where gRPC is blocked (some firewalls, browsers).
`Jaeger`	Jaeger Thrift / gRPC	Direct to Jaeger without Collector (legacy pattern).
`Zipkin`	Zipkin JSON	Legacy Zipkin backends.
`Prometheus`	Prometheus text/pull	Expose metrics on `/metrics` for Prometheus scraping.
`Console`	Stdout JSON	Development / debugging only. Never in production.
`In-memory`	In-process	Testing — assertions on exported spans/metrics in unit tests.

✅

Always use OTLP + Collector in production. Export to localhost:4317 (your Collector sidecar). Never hardcode backend-specific exporters in application code — that recreates vendor lock-in at the SDK level. The Collector handles routing.

⚙️

SDK Configuration

// environment variables · sampling · zero-code instrumentation

Environment variables — configure without code changes

bash · .env / Kubernetes env

# Service identity (strongly recommended; the SDK falls back to "unknown_service")
OTEL_SERVICE_NAME="order-service"

# Collector endpoint
OTEL_EXPORTER_OTLP_ENDPOINT="http://collector:4317"     # all signals
OTEL_EXPORTER_OTLP_TRACES_ENDPOINT="http://collector:4317"  # traces only
OTEL_EXPORTER_OTLP_PROTOCOL="grpc"                         # or http/protobuf

# Resource attributes (service.version goes here; there is no standard OTEL_SERVICE_VERSION variable)
OTEL_RESOURCE_ATTRIBUTES="service.version=2.4.1,deployment.environment=production,k8s.cluster.name=prod-us-east"

# Sampling
OTEL_TRACES_SAMPLER="parentbased_traceidratio"
OTEL_TRACES_SAMPLER_ARG="0.1"          # sample 10% of root spans

# Propagation
OTEL_PROPAGATORS="tracecontext,baggage"  # W3C TraceContext + Baggage

# Disable specific signals
OTEL_TRACES_EXPORTER="none"             # disable traces
OTEL_METRICS_EXPORTER="none"            # disable metrics
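The OTEL_RESOURCE_ATTRIBUTES value is a comma-separated list of `key=value` pairs. The sketch below shows roughly how such a string turns into resource attributes. It is a simplified illustration with a hypothetical function name (malformed pairs are skipped, values are percent-decoded); the SDK performs this parsing itself at startup.

```python
from urllib.parse import unquote


def parse_resource_attributes(raw: str) -> dict:
    """Parse 'k1=v1,k2=v2' into a dict, skipping entries without '='."""
    attributes = {}
    for pair in raw.split(","):
        if "=" not in pair:
            continue
        key, value = pair.split("=", 1)
        attributes[key.strip()] = unquote(value.strip())
    return attributes


attrs = parse_resource_attributes(
    "deployment.environment=production,k8s.cluster.name=prod-us-east"
)
```

Because commas and equals signs are delimiters, values containing them need percent-encoding in the environment variable.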

Sampling strategies

Sampler	Behavior	Use for
`always_on`	Sample 100% of traces	Development, low-traffic services
`always_off`	Drop all traces	Disabling tracing temporarily
`traceidratio`	Sample N% of all traces regardless of parent	Rarely useful — ignores parent sampling decision
`parentbased_always_on`	Respect parent sampling; always_on for roots	Dev/test environments
`parentbased_traceidratio`	Respect parent sampling; ratio for roots	Production — standard choice
Tail sampling (Collector)	Sample after seeing the full trace	Keep all errors and slow traces; sample the fast, successful ones

🎯

Tail sampling is the most powerful strategy: the Collector waits for the entire trace to arrive (configurable buffer window, e.g. 10s), then decides whether to keep or drop it based on the full picture — was there an error? Was it slow? Did it touch a specific service? This is impossible with head sampling (which decides at the start of a trace).
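The decision a tail sampler makes can be sketched as a pure function over the completed trace. The code below is an illustration, not the Collector's tail_sampling processor: the `SpanSummary` type, the 500 ms threshold, and the 10% baseline are assumptions. The baseline check compares the low 64 bits of the trace ID against a bound, which keeps the decision deterministic per trace.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class SpanSummary:
    name: str
    duration_ms: float
    is_error: bool


def keep_trace(trace_id: int, spans: Iterable[SpanSummary],
               slow_ms: float = 500.0, baseline_ratio: float = 0.1) -> bool:
    """Decide after the whole trace has arrived whether to keep it."""
    spans = list(spans)
    if any(span.is_error for span in spans):
        return True   # policy 1: always keep traces containing an error
    if any(span.duration_ms >= slow_ms for span in spans):
        return True   # policy 2: always keep slow traces
    # policy 3: keep a fixed share of the remaining (fast, successful) traces
    bound = round(baseline_ratio * (1 << 64))
    return (trace_id & ((1 << 64) - 1)) < bound
```

A head sampler only has the trace ID available when it decides, so it can implement policy 3 but never policies 1 and 2.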

📐

Semantic Conventions

// standard attribute names — the OTel vocabulary

Semantic conventions define standardized attribute names for common operations. Using them means your telemetry works out-of-the-box with dashboards, queries, and alerting rules built for OTel. Deviating from them means building everything from scratch.

Domain	Key Attributes
HTTP Server	`http.request.method`, `http.response.status_code`, `url.path`, `server.address`, `http.route`
HTTP Client	`http.request.method`, `http.response.status_code`, `url.full`, `server.address`, `server.port`
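Database	`db.system.name`, `db.namespace`, `db.operation.name`, `db.query.text`, `db.collection.name`, `server.address` (older releases: `db.system`, `db.name`, `db.operation`, `db.statement`, `db.sql.table`)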
Messaging	`messaging.system`, `messaging.destination.name`, `messaging.operation`, `messaging.message.id`
RPC	`rpc.system`, `rpc.service`, `rpc.method`, `rpc.grpc.status_code`
Exceptions	`exception.type`, `exception.message`, `exception.stacktrace`
Service	`service.name`, `service.version`, `service.instance.id`
Kubernetes	`k8s.pod.name`, `k8s.namespace.name`, `k8s.deployment.name`, `k8s.cluster.name`
Cloud	`cloud.provider`, `cloud.region`, `cloud.account.id`, `cloud.availability_zone`

📖

Stability: Semantic conventions have stability markers: stable, experimental, and deprecated. The HTTP conventions were declared stable in late 2023; the database conventions stabilized later, under renamed attributes such as db.query.text (formerly db.statement). Check opentelemetry.io/docs/specs/semconv before naming custom attributes, because the convention you want may already exist.

🎯

When to Use What

// choosing the right signal for the right question

Question / Use Case	Trace	Metric	Log
Why did this specific request fail?	PRIMARY	hints	details
What is my p99 latency over the last 24h?	too costly	PRIMARY	no
How many orders per second are we processing?	no	PRIMARY	no
Which microservice is the bottleneck in my request path?	PRIMARY	confirm	no
Did the background job succeed for tenant X?	if traced	no	PRIMARY
Alert when error rate exceeds 1%	no	PRIMARY	some tools
What SQL queries ran during a slow request?	PRIMARY	no	verbose mode
How many users triggered feature flag X today?	no	PRIMARY	if logged
Audit trail — who deleted record 12345?	no	no	PRIMARY
Is this a new deployment causing elevated errors?	sample traces	PRIMARY	confirm

🔁

The observability loop: Alerts fire on metrics. You jump to traces to find the specific bad request. You read logs to understand exactly what happened inside it. Each signal leads you to the next. This is why correlation (shared trace_id) between all three is the real value of OTel.

🏗️

Real-World Patterns

// production-grade instrumentation patterns

Pattern 1 — Auto-instrumentation first

Before writing a single line of manual instrumentation, install the auto-instrumentation package for your framework. It instruments all HTTP calls, database queries, messaging, and more with zero code. Add manual spans only for business logic that auto-instrumentation can't see.

bash · zero-code instrumentation

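# Python: auto-instrument Flask + SQLAlchemy + Redis with zero code changes
# opentelemetry-distro pulls in the SDK that opentelemetry-instrument configures
pip install opentelemetry-distro \
            opentelemetry-instrumentation-flask \
            opentelemetry-instrumentation-sqlalchemy \
            opentelemetry-instrumentation-redis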

opentelemetry-instrument \
  --traces_exporter console \
  --service_name my-flask-app \
  python app.py

# Node.js — auto-instrument Express + Postgres + Redis
npm install @opentelemetry/auto-instrumentations-node
node --require @opentelemetry/auto-instrumentations-node/register app.js

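# Java: zero-code Java agent (attach to JVM)
# Recent agent releases default to http/protobuf on port 4318, so request gRPC explicitly when targeting 4317
java -javaagent:opentelemetry-javaagent.jar \
     -Dotel.service.name=my-java-svc \
     -Dotel.exporter.otlp.protocol=grpc \
     -Dotel.exporter.otlp.endpoint=http://collector:4317 \
     -jar myapp.jar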

Pattern 2 — Kubernetes sidecar deployment

yaml · deployment.yaml

spec:
  containers:
    - name: app
      image: myapp:2.4.1
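      env:
        # Downward API values; declare them before the variables that reference them
        - name: MY_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: MY_POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: OTEL_SERVICE_NAME
          value: "order-service"
        - name: OTEL_EXPORTER_OTLP_PROTOCOL
          value: "grpc"  # matches port 4317; SDK defaults differ by language
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: "http://localhost:4317"  # sidecar on localhost
        - name: OTEL_RESOURCE_ATTRIBUTES
          value: "k8s.pod.name=$(MY_POD_NAME),k8s.namespace.name=$(MY_POD_NAMESPACE)"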

    - name: otel-collector
      image: otel/opentelemetry-collector-contrib:latest  # pin a specific version in production
      args: ["--config=/etc/otel/config.yaml"]
      ports:
        - containerPort: 4317  # gRPC
        - containerPort: 4318  # HTTP
      volumeMounts:
        - name: otel-config
          mountPath: /etc/otel
  volumes:
    - name: otel-config
      configMap:
        name: otel-collector-config

Pattern 3 — Correlation: linking metrics to traces

python · exemplar_demo.py

import time

from opentelemetry import metrics, trace

tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)

# When exemplars are enabled, the SDK attaches one if it sees a sampled,
# active span at recording time (the spec's default "trace_based" filter).
# Exemplar support varies by language SDK and version, so check yours.
# The histogram observation then carries the current trace_id.

request_duration = meter.create_histogram(
    "http.server.request.duration",
    unit="s",
)

def handle_request(path: str):
    # Assumes `path` is a route template such as /users/{id}, not a raw URL,
    # so that http.route stays low-cardinality.
    with tracer.start_as_current_span("handle_request"):
        start = time.monotonic()
        response = route(path)  # route() is the application's own dispatcher
        # Recorded inside the span, so the exemplar can point at it
        request_duration.record(
            time.monotonic() - start,
            {"http.route": path, "http.response.status_code": response.status}
        )
        return response

# In Grafana: when you see a p99 spike, click the exemplar
# to jump directly to the corresponding trace in Tempo/Jaeger.

Pattern 4 — Testing your instrumentation

python · test_instrumentation.py

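from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter
from opentelemetry.trace import StatusCode

from orders import checkout  # hypothetical module under test

def test_checkout_creates_span_with_cart_id():
    # Use in-memory exporter: no real backend needed
    exporter = InMemorySpanExporter()
    provider = TracerProvider()
    provider.add_span_processor(SimpleSpanProcessor(exporter))
    # The global provider can only be set once per process, so in a real
    # test suite this setup belongs in a session-scoped fixture.
    trace.set_tracer_provider(provider)

    checkout(cart_id="cart-123")

    spans = exporter.get_finished_spans()
    assert len(spans) == 1
    span = spans[0]
    assert span.name == "OrderService.checkout"
    assert span.attributes["cart.id"] == "cart-123"
    # Passes only if checkout() sets the status explicitly; the default is UNSET
    assert span.status.status_code == StatusCode.OK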

🖥️

Backends & Tools

// where your telemetry lands — OSS and commercial

Tool	Signals	OSS?	Notes
Jaeger	Traces	✅ CNCF	The OSS trace standard. Native OTLP support. Excellent for distributed tracing UI.
Grafana Tempo	Traces	✅ OSS	Scalable trace storage, integrates with Grafana. Tag-based search, exemplar linking with Prometheus.
Prometheus	Metrics	✅ CNCF	The OSS metrics standard. Pull-based. Use with the OTel Prometheus exporter or remote write from the Collector.
Grafana Mimir	Metrics	✅ OSS	Scalable Prometheus-compatible metrics backend. HA, long-term storage, multi-tenant.
Grafana Loki	Logs	✅ OSS	Log aggregation designed for Grafana. Label-based indexing, LogQL query language.
Elastic (APM)	Traces + Metrics + Logs	⚠️ Source-avail	Full OTel support via OTLP. Powerful search. Free self-hosted tier.
Honeycomb	Traces + Logs	❌ Commercial	Best-in-class distributed tracing UX. Wide-event model. Native OTLP. Generous free tier.
Datadog	Traces + Metrics + Logs	❌ Commercial	Full platform. Native OTel support. Premium cost; industry standard in enterprise.
Grafana Cloud	Traces + Metrics + Logs	❌ Managed OSS	Managed Tempo + Mimir + Loki + Grafana. Generous free tier. Native OTLP ingest.
AWS X-Ray / CloudWatch	Traces + Metrics + Logs	❌ Commercial	Tightly integrated with AWS. OTel support via ADOT (AWS Distro for OTel).

📚

Reference

// quick rules, links, and the most common mistakes

The ten most common OTel mistakes

MISTAKE

High-cardinality metric attributes

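Adding user_id or request_id as metric attributes. Every distinct value creates a separate time series, which can mean millions of series and an overloaded or very expensive backend. Keep metric attributes low-cardinality (as a rule of thumb, under about 100 distinct values each) and put per-request identifiers on spans or logs instead.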
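
The arithmetic behind this mistake can be sketched roughly as follows: the worst-case number of time series for one metric is the product of the distinct values each attribute can take. The attribute counts below are made-up illustrative numbers, not measurements, and worst_case_series is a hypothetical helper rather than part of any OTel SDK.

```python
from math import prod


def worst_case_series(attribute_cardinalities: dict) -> int:
    """Upper bound on time series for one metric: the product of the
    number of distinct values each attribute can take."""
    return prod(attribute_cardinalities.values())


# Illustrative numbers only.
low = worst_case_series({
    "http.request.method": 5,
    "http.route": 40,
    "http.response.status_code": 8,
})
high = worst_case_series({
    "http.request.method": 5,
    "http.route": 40,
    "user_id": 100_000,  # one unbounded attribute dominates everything else
})

print(low)   # 1600
print(high)  # 20000000
```

Swapping a single bounded attribute for user_id takes the upper bound from 1,600 series to 20 million in this example.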

MISTAKE

Forgetting to end spans

Spans that are never ended leak memory and are never exported. Always use try/finally, context managers, or the SDK's callback-style API.

MISTAKE

High-cardinality span names

"GET /users/12345" instead of "GET /users/{id}". Span names are used for aggregation — variable data goes in attributes, not names.

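When the framework exposes the matched route template, that is usually the better source for the span name. Where only the raw path is available, a fallback along these lines might work. It is a minimal sketch that assumes IDs are purely numeric or UUID-shaped; low_cardinality_span_name is a hypothetical helper, not an OTel API.

```python
import re

# Assumption: variable path segments are either all digits or UUIDs.
_NUMERIC = re.compile(r"^\d+$")
_UUID = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)


def low_cardinality_span_name(method: str, path: str) -> str:
    """Replace ID-like path segments with a placeholder so that all
    requests to the same route share one span name."""
    segments = []
    for segment in path.split("/"):
        if _NUMERIC.match(segment) or _UUID.match(segment):
            segments.append("{id}")
        else:
            segments.append(segment)
    return f"{method.upper()} {'/'.join(segments)}"


print(low_cardinality_span_name("get", "/users/12345"))  # GET /users/{id}
```

The raw value (12345 here) still belongs on the span, just as an attribute rather than in the name.
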
MISTAKE

Library code installing the SDK

Libraries that register a global TracerProvider (for example, by calling trace.set_tracer_provider() in Python) conflict with the configuration of every application using them. Libraries use the API only. Applications own SDK configuration.

MISTAKE

Exporting directly to backend from app

Hard-coding backend-specific exporters in application code recreates vendor lock-in. Export to the OTel Collector via OTLP. Let the Collector route to backends.

MISTAKE

Missing resource attributes

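Not setting service.name, service.version, and deployment.environment (deployment.environment.name in newer semantic conventions). Without service.name the SDK falls back to a placeholder such as unknown_service, and without the others you can't filter telemetry by version or environment in a backend UI.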

MISTAKE

100% sampling in production

Sampling every trace from a high-traffic service is prohibitively expensive. Use a head sampler such as parentbased_traceidratio at around 10%, or use tail sampling in the Collector when you need to keep every error or slow trace.
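
To make the idea of a parent-based ratio sampler concrete, here is a simplified sketch in plain Python. It is loosely modelled on how ratio samplers commonly work (compare part of the trace ID against a threshold) and is not the SDK's actual implementation; both function names are hypothetical.

```python
import random
from typing import Optional

TRACE_ID_MASK = (1 << 64) - 1  # look only at the low 64 bits of the 128-bit trace ID


def ratio_sampled(trace_id: int, ratio: float) -> bool:
    """Deterministic decision: the same trace_id always gets the same
    answer, so every service agrees without coordinating."""
    bound = round(ratio * (1 << 64))
    return (trace_id & TRACE_ID_MASK) < bound


def parent_based_ratio_sampled(
    trace_id: int, ratio: float, parent_sampled: Optional[bool]
) -> bool:
    """Follow the parent's decision when there is one; apply the ratio
    only to root spans (parent_sampled is None)."""
    if parent_sampled is not None:
        return parent_sampled
    return ratio_sampled(trace_id, ratio)


rng = random.Random(42)  # fixed seed so the demo is repeatable
trace_ids = [rng.getrandbits(128) for _ in range(100_000)]
kept = sum(parent_based_ratio_sampled(t, 0.10, None) for t in trace_ids)
print(f"kept {kept / len(trace_ids):.1%} of root traces")
```

Because the decision is made at the root and inherited downstream, traces are kept or dropped whole. A head sampler like this cannot know whether a trace will end in an error, which is the gap tail sampling in the Collector fills.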

MISTAKE

PII in telemetry

Logging user emails, card numbers, or passwords in span attributes or log records. Telemetry backends are rarely set up as compliant stores for personal data (for example under GDPR), and deleting individual records from them is hard. Scrub PII at the Collector or before recording.
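
Scrubbing before recording could look something like the sketch below. The key denylist and the email pattern are assumptions to adapt to your own attribute names, and scrub_attributes is a hypothetical helper, not part of the OTel API.

```python
import re

# Assumed denylist; adjust to the attribute keys your services actually use.
SENSITIVE_KEYS = {"user.email", "card.number", "password"}
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def scrub_attributes(attributes: dict) -> dict:
    """Return a copy of the attributes that is safer to attach to a
    span or log record. The input dict is left unchanged."""
    clean = {}
    for key, value in attributes.items():
        if key in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, str):
            # Catch emails embedded in free-text values.
            clean[key] = EMAIL_RE.sub("[REDACTED_EMAIL]", value)
        else:
            clean[key] = value
    return clean


print(scrub_attributes({
    "cart.id": "cart-123",
    "password": "hunter2",
    "note": "contact alice@example.com for refund",
}))
```

A pattern-based scrubber like this only catches what it has been told to look for, so it complements, rather than replaces, not recording sensitive fields in the first place.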

MISTAKE

Using Console exporter in production

The ConsoleExporter / StdoutExporter is for development only. In production it floods stdout with JSON and nothing reaches a backend you can query. Use OTLP.

MISTAKE

Ignoring propagation

Not injecting traceparent headers into outgoing HTTP calls or message metadata. Without this, each service starts a new trace, so a single request shows up as several unrelated fragments.
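
In real services the OTel propagators inject and extract this header for you. The sketch below only illustrates what travels on the wire, using the W3C Trace Context format (version 00) and the example IDs from that specification; build_traceparent and parse_traceparent are hypothetical helpers.

```python
import re

# version "00", 32 hex chars of trace ID, 16 of parent span ID, 2 of flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def build_traceparent(trace_id: int, span_id: int, sampled: bool) -> str:
    """Format the header value the caller sends with the outgoing request."""
    flags = 0x01 if sampled else 0x00
    return f"00-{trace_id:032x}-{span_id:016x}-{flags:02x}"


def parse_traceparent(header: str):
    """Return (trace_id, parent_span_id, sampled), or None if the header
    is malformed, in which case the receiver starts a new trace."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = (int(group, 16) for group in match.groups())
    if trace_id == 0 or span_id == 0:
        return None  # all-zero IDs are invalid
    return trace_id, span_id, bool(flags & 0x01)


outgoing_headers = {
    "traceparent": build_traceparent(
        0x4BF92F3577B34DA6A3CE929D0E0E4736, 0x00F067AA0BA902B7, sampled=True
    )
}
print(outgoing_headers["traceparent"])
# 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
print(parse_traceparent(outgoing_headers["traceparent"]) is not None)  # True
```

The receiving service reuses the trace ID and treats the span ID as its parent, which is what stitches the two services into one trace.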

Official resources

Core Documentation

opentelemetry.io/docs — Official docs and getting started
OTel Specification — The language-agnostic spec
Semantic Conventions — Standard attribute names
Collector repo — Core Collector
Collector contrib — Community components

Language SDKs

📡

OTel is infrastructure, not a feature. The best-instrumented services are the ones where you can jump from an alert to the root cause in under 60 seconds. OTel gives you the plumbing; good signal naming, sampling, and correlation give you the power. The investment pays for itself the first time you diagnose an incident in production with a full trace instead of grep.
