OpenTelemetry Handbook: Traces, Metrics & Logs
V1
Observability Engineering

The OpenTelemetry Handbook

The vendor-neutral industry standard for instrumentation. Learn to emit, collect, and export traces, metrics, and logs from any service — and actually understand what your systems are doing in production.

Distributed Traces · Metrics & Instruments · Structured Logs · OTLP & Collector
📡

OpenTelemetry (OTel) is a CNCF project that provides a unified API, SDK, and tooling for collecting observability data from your services. It's not a monitoring backend — it's the instrumentation layer that sits between your code and the backends you already use (Datadog, Grafana, Jaeger, Honeycomb, etc.).

Before OTel, every vendor had its own SDK. If you used Datadog's tracer, you were locked in. OTel breaks that coupling: instrument once with the OTel API, route telemetry anywhere.

Vendor-Neutral

Write instrumentation once against the OTel API. Switch backends by changing a config file — no code changes. Jaeger today, Honeycomb tomorrow.

Three Signals

Traces, Metrics, and Logs under one unified model with correlated context. See a metric spike, jump to the trace, read the log — all linked by trace_id.

Any Language

SDKs for Go, Python, Java, JavaScript/Node, .NET, Rust, Ruby, PHP, and more. Auto-instrumentation for popular frameworks requires zero code changes.

🎯
The golden rule: OTel is the pipe, not the destination. You still need a backend that stores and queries telemetry data. OTel just ensures the data gets there in a standard format (OTLP — OpenTelemetry Protocol) regardless of which backend you choose.

OTel vs. what came before

| Aspect | Old World (vendor SDKs) | OpenTelemetry |
|---|---|---|
| Instrumentation | Per-vendor SDK in every service | One OTel SDK, any backend |
| Lock-in | High: migration means re-instrumentation | Zero: swap exporters, keep code |
| Correlation | Traces and logs siloed | Unified trace_id / span_id across all signals |
| Format | Proprietary per vendor | OTLP (open standard) |
| Auto-instrumentation | Vendor-specific agents | OTel zero-code instrumentation |
| Governance | Vendor-controlled | CNCF, open governance |

Each signal answers a different observability question. They are complementary — not interchangeable. Using all three gives you the full picture; using only one leaves blind spots.

| Signal | Answers | Shape of Data | Cardinality | Storage Cost |
|---|---|---|---|---|
| Trace | "What happened and how long did each step take?" | Tree of spans with timings, attributes, events | High: one trace per request | High: sampled to manage |
| Metric | "How is my system performing over time?" | Numerical time-series with labels/dimensions | Low-medium: aggregated | Low: pre-aggregated |
| Log | "What exactly happened at this moment?" | Timestamped text/JSON with structured fields | Very high: every event | Very high: raw events |

Signal correlation — the real power

OTel injects trace_id and span_id into logs automatically when you use the OTel Logs bridge. This means every log line emitted during a request is linkable to its trace, and every metric can carry exemplars pointing to a sample trace.

🔗
Exemplars: A Prometheus metric can carry an exemplar — a pointer to a specific trace ID that caused a high-latency observation. This bridges the gap between "p99 spiked at 14:32" and "here's the actual slow request in Jaeger."
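To make the correlation idea concrete, here is a minimal stdlib-only sketch of what a logs bridge does conceptually: a logging filter stamps every record with the IDs of the currently active span. The `current_span_ids` context variable and the `TraceContextFilter` class are hypothetical stand-ins for state the SDK tracks internally; they are not OTel APIs.

```python
import contextvars
import io
import logging

# Hypothetical stand-in for the SDK's active-span context: (trace_id, span_id) in hex.
current_span_ids = contextvars.ContextVar("current_span_ids", default=None)


class TraceContextFilter(logging.Filter):
    """Copy the active trace_id/span_id onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        ids = current_span_ids.get()
        record.trace_id, record.span_id = ids if ids else ("-", "-")
        return True  # never drop the record, only annotate it


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("correlation_demo")
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"
    ))
    handler.addFilter(TraceContextFilter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


out = io.StringIO()
log = build_logger(out)

# Simulate "entering a span", logging inside it, then leaving it.
token = current_span_ids.set(("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7"))
log.info("charging card")
current_span_ids.reset(token)
log.info("idle")
```

Every line logged while the context variable is set carries the same trace_id, which is the property a backend relies on to link a log line to its trace.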
🧱
OTel API

Stable interfaces you write instrumentation against. Zero implementation: it's a no-op until an SDK is configured. Libraries should depend only on the API, never the SDK.

OTel SDK

The implementation of the API. Handles sampling, batching, and exporting. Applications configure and install the SDK at startup. Libraries must never install an SDK.

Resource

Immutable description of the entity producing telemetry. Set once at startup: service.name, service.version, deployment.environment, k8s.pod.name, etc.

TracerProvider / MeterProvider / LoggerProvider

The entry point for creating tracers, meters, and loggers. Configure exporters and processors here. Treat as a singleton: one per signal per process.

Context & Propagation

Carries trace_id, span_id, and baggage across process boundaries via HTTP headers (W3C traceparent), gRPC metadata, or message queues.

OTLP

OpenTelemetry Protocol: the standard wire format. Payloads are Protobuf over gRPC or HTTP, with a JSON encoding also available over HTTP. The OTel Collector and all major backends speak OTLP natively.
⚠️
Library authors: Only depend on opentelemetry-api. Never depend on or configure the SDK. If a library configures a TracerProvider, it will break applications that have their own SDK setup. The application owns SDK configuration — always.
🔭
Your Code (App / Service) → OTel SDK (API + SDK) → Collector (OTel Collector) → Processor (Filter / Enrich) → Exporter (OTLP / Native) → Backend (Jaeger / Grafana…)

The OTel Collector is optional but strongly recommended in production. Without it, your services export directly to backends, coupling them to backend availability and format. With it, you gain a single control plane: filter noisy spans, add attributes, fan out to multiple backends, and change destinations without touching application code.

Direct Export (Dev / Simple)

Service → OTLP Exporter → Backend directly. Fine for development or very simple setups. No Collector needed. Trade-off: every service is coupled to backend configuration.

Collector Sidecar (Production)

Service → local Collector → central Collector gateway → backends. Services only know about localhost:4317. The Collector fleet handles routing, retry, batching, and fanout.

🔍

A trace is the complete record of a single request as it travels through your system. It's a directed acyclic graph (usually visualized as a waterfall) where each node is a span — a unit of work with a start time, duration, and metadata.

Trace

The entire journey of one request. Identified by a trace_id: a 128-bit identifier, usually shown as 32 hex characters (it is not a UUID). All spans sharing a trace_id belong to the same request, regardless of which service created them.

Span

A single unit of work: "DB query", "HTTP call", "message publish". Has a span_id, a reference to its parent span, a start time, duration, status, and key-value attributes.
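The trace and span definitions above can be sketched as a small data structure. This is an illustrative model in plain Python, not the SDK's span class: the `Span` dataclass, `new_trace_id`, `new_span_id`, and `waterfall` are hypothetical names, and the only facts taken from OTel are the ID sizes (128-bit trace_id, 64-bit span_id) and the parent reference.

```python
import secrets
from dataclasses import dataclass
from typing import Dict, List, Optional


def new_trace_id() -> str:
    return secrets.token_hex(16)  # 128 bits, rendered as 32 hex characters


def new_span_id() -> str:
    return secrets.token_hex(8)   # 64 bits, rendered as 16 hex characters


@dataclass
class Span:
    name: str
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    start_ms: int
    duration_ms: int


def waterfall(spans: List[Span]) -> List[str]:
    """Render spans depth-first, indented by depth, siblings ordered by start time."""
    children: Dict[Optional[str], List[Span]] = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    lines: List[str] = []

    def visit(parent_id: Optional[str], depth: int) -> None:
        for span in sorted(children.get(parent_id, []), key=lambda s: s.start_ms):
            lines.append(f"{'  ' * depth}{span.name} [{span.start_ms}ms +{span.duration_ms}ms]")
            visit(span.span_id, depth + 1)

    visit(None, 0)
    return lines


tid = new_trace_id()
root = Span("POST /orders", tid, new_span_id(), None, 0, 120)
db = Span("INSERT orders", tid, new_span_id(), root.span_id, 10, 40)
pay = Span("PaymentService.charge", tid, new_span_id(), root.span_id, 55, 60)
rows = waterfall([pay, db, root])  # arrival order does not matter
```

Because each span only stores its parent's ID, spans can arrive from different services in any order and still be assembled into one tree by trace_id.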

Span status codes

| Status | When to use | Sets Error? |
|---|---|---|
| UNSET | Default: no explicit status set. The operation completed, but hasn't been evaluated for success or error. | No |
| OK | Explicitly mark as successful. Use sparingly: only when a downstream system needs to override error status from a library. | No |
| ERROR | An error occurred. Set the exception event and error.type attribute. This will show as a red span in tracing UIs. | Yes |
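Span status rule: Only set ERROR when your layer considers the operation failed. A 404 response is not an error in the HTTP server span: the server handled the request correctly, and the HTTP semantic conventions leave 4xx unset on server spans. The same 404 does mark the HTTP client span as ERROR under those conventions, and it may also be an error in the business logic span above it. Think in terms of responsibility boundaries.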
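The HTTP semantic conventions encode this responsibility split by span kind: 5xx is an error on both sides, 4xx is an error only for the client span, and everything else stays unset. A minimal sketch of that rule follows; the `SpanKind` and `Status` enums here are simplified stand-ins defined for the example, not the SDK's classes.

```python
from enum import Enum


class SpanKind(Enum):
    SERVER = "server"
    CLIENT = "client"


class Status(Enum):
    UNSET = "unset"
    ERROR = "error"


def http_span_status(kind: SpanKind, status_code: int) -> Status:
    """Map an HTTP response code to a span status, depending on who owns the failure."""
    if status_code >= 500:
        return Status.ERROR  # server-side failure: an error for both kinds
    if kind is SpanKind.CLIENT and status_code >= 400:
        return Status.ERROR  # the caller did not get what it asked for
    return Status.UNSET      # 1xx-3xx, and 4xx on the server side


not_found_server = http_span_status(SpanKind.SERVER, 404)
not_found_client = http_span_status(SpanKind.CLIENT, 404)
```

Auto-instrumentation normally applies this rule for you; the sketch is mainly useful when writing manual spans around an HTTP call.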
📏

Span naming — the most important thing you'll get wrong

Span names are used for aggregation in tracing UIs. A bad name means you can't group related spans. A span name must be low-cardinality — do not include dynamic data like IDs or user emails in the name.

✓ DO: low cardinality, meaningful

```text
// HTTP spans: "{method} {route pattern}"
"GET /users/{userId}"
"POST /orders"

// DB spans: "{operation} {table}"
"SELECT orders"
"INSERT users"

// Internal: "{class}.{method}"
"OrderService.PlaceOrder"
```

✕ DON'T: high cardinality, useless aggregation

```text
// Never include dynamic values in span name
"GET /users/12345"          ✗ user ID
"Processing order 98765"    ✗ order ID

// Too vague
"process"                   ✗ meaningless
"handler"                   ✗ meaningless

// Includes PII
"user@company.com login"    ✗ PII in name
```
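When a framework does not expose the matched route pattern, a fallback heuristic can still keep span names low-cardinality. The sketch below is an assumption-laden illustration, not an OTel API: `low_cardinality_path` and `span_name` are hypothetical helpers that replace numeric and UUID-looking path segments with `{id}`. Prefer the framework's real route template whenever it is available.

```python
import re

# Matches canonical 8-4-4-4-12 UUID strings.
_UUID = re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")


def low_cardinality_path(path: str) -> str:
    """Replace ID-like segments with a placeholder and drop the query string."""
    segments = []
    for segment in path.split("?")[0].split("/"):
        if segment.isdigit() or _UUID.match(segment):
            segments.append("{id}")
        else:
            segments.append(segment)
    return "/".join(segments)


def span_name(method: str, path: str) -> str:
    return f"{method.upper()} {low_cardinality_path(path)}"


name_a = span_name("get", "/users/12345")
name_b = span_name("GET", "/orders/12/items/7?expand=1")
```

The raw values are not lost: record the concrete path or ID as a span attribute, where high cardinality is acceptable.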

Attributes

Attributes are key-value pairs attached to a span. They provide context for analysis. Follow semantic conventions for standard attributes — roll your own only for domain-specific data.

instrumentation.py

```python
from opentelemetry import trace
from opentelemetry.semconv.trace import SpanAttributes

tracer = trace.get_tracer("my.library", "1.0.0")

# customer_id, items, total, order and place_order() come from the surrounding application code.
with tracer.start_as_current_span("OrderService.PlaceOrder") as span:
    # Standard semantic convention attributes
    span.set_attribute(SpanAttributes.DB_SYSTEM, "postgresql")
    span.set_attribute(SpanAttributes.DB_NAME, "orders")
    span.set_attribute(SpanAttributes.DB_STATEMENT, "INSERT INTO orders ...")

    # Custom domain attributes: use a dotted namespace
    span.set_attribute("order.customer_id", customer_id)  # high-cardinality is fine on spans
    span.set_attribute("order.item_count", len(items))
    span.set_attribute("order.total_usd", total)

    # record_exception() only adds an exception event; set ERROR status explicitly.
    # (start_as_current_span also records exceptions that escape the block by default.)
    try:
        result = place_order(order)
    except Exception as e:
        span.record_exception(e)
        span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
        raise

    # Events: timestamped annotations inside a span
    span.add_event("payment.authorized", {"payment.provider": "stripe"})
    span.add_event("inventory.reserved", {"warehouse.id": "WH-03"})
```
⚠️
Attribute cardinality kills performance. Never use attributes with unbounded values (user IDs, request IDs, raw SQL queries) as metric dimensions. In traces it's fine — in metrics it will blow up your backend's cardinality limits and cost you thousands per month.
💻

Python

app.py

```python
# pip install opentelemetry-sdk opentelemetry-exporter-otlp
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource, SERVICE_NAME

# 1. Configure resource (who is sending this?)
resource = Resource.create({
    SERVICE_NAME: "order-service",
    "service.version": "2.4.1",
    "deployment.environment": "production",
})

# 2. Build the TracerProvider with exporter
provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://collector:4317"))
)
trace.set_tracer_provider(provider)

# 3. Get a tracer: do this once per module/library
tracer = trace.get_tracer("order_service.checkout", "2.4.1")

# 4. Create spans
def checkout(cart_id: str):
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("cart.id", cart_id)
        validate_cart(cart_id)   # child spans created inside here
        charge_payment(cart_id)  # are automatically children of "checkout"

# 5. Manual child span with context manager
def validate_cart(cart_id: str):
    with tracer.start_as_current_span("CartService.validate") as span:
        span.set_attribute("cart.id", cart_id)
        # ... business logic
```

JavaScript / Node.js

tracing.js

```javascript
// npm i @opentelemetry/api @opentelemetry/sdk-node @opentelemetry/exporter-trace-otlp-grpc @opentelemetry/auto-instrumentations-node
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes as SRA } from '@opentelemetry/semantic-conventions';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

// Start the SDK BEFORE loading your app code
const sdk = new NodeSDK({
  resource: new Resource({
    [SRA.SERVICE_NAME]: 'payment-service',
    [SRA.SERVICE_VERSION]: process.env.APP_VERSION,
    [SRA.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV,
  }),
  traceExporter: new OTLPTraceExporter({ url: 'http://collector:4317' }),
  instrumentations: [getNodeAutoInstrumentations()], // Express, http, pg, redis…
});
sdk.start();

// Manual spans in application code (typically a separate module)
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('payment-service');

// `stripe` is the application's own payment client.
async function chargeCard(orderId, amount) {
  return tracer.startActiveSpan('PaymentService.charge', async (span) => {
    span.setAttribute('order.id', orderId);
    span.setAttribute('payment.amount_usd', amount);
    try {
      const result = await stripe.charge({ amount, currency: 'usd' });
      span.setAttribute('payment.charge_id', result.id);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (err) {
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      throw err;
    } finally {
      span.end(); // startActiveSpan does not end the span for you
    }
  });
}
```
Always end spans. A span that is never ended will leak memory and may never be exported. Use try/finally or context managers (Python). In JavaScript, startActiveSpan makes the span active for the duration of the callback but does not end it, so call span.end() in a finally block.
🔗

Context propagation is how a trace_id flows from Service A to Service B to Service C. Without it, you get disconnected spans that each look like independent requests. With it, you get the full distributed waterfall.
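The W3C traceparent header that carries this context has a fixed layout: version, trace_id, parent span_id, and flags, joined by dashes. A minimal parser and formatter for version 00 is sketched below to show what actually crosses the wire. `RemoteContext`, `parse_traceparent`, and `format_traceparent` are hypothetical names for this example; in real code the SDK's propagators do this work.

```python
import re
from typing import NamedTuple, Optional

# version 00: 32 hex chars of trace_id, 16 hex chars of parent span_id, 2 hex chars of flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class RemoteContext(NamedTuple):
    trace_id: str
    parent_span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[RemoteContext]:
    """Return the remote context, or None if the header is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None  # all-zero IDs are invalid
    return RemoteContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def format_traceparent(trace_id: str, span_id: str, sampled: bool) -> str:
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(incoming)
# A downstream call keeps the trace_id but sends this service's own span_id as the parent.
outgoing = format_traceparent(ctx.trace_id, "b7ad6b7169203331", ctx.sampled)
```

The trace_id stays constant across every hop while the span_id changes, which is what lets a backend stitch spans from different services into one waterfall.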

| Propagator | Header | Use when |
|---|---|---|
| W3C TraceContext | traceparent, tracestate | Default. Use this everywhere. W3C Recommendation. |
| B3 Single | b3 | Legacy: services using Zipkin/older Brave instrumentation |
| B3 Multi | X-B3-TraceId, X-B3-SpanId | Legacy Zipkin multi-header environments |
| Baggage | baggage | User-defined key-value pairs propagated downstream (use sparingly) |
| AWS X-Ray | X-Amzn-Trace-Id | When target is AWS Lambda/API Gateway |
http_client.py

```python
import requests
from opentelemetry import context, propagate, trace
from opentelemetry.baggage import get_baggage, set_baggage

tracer = trace.get_tracer(__name__)

# Inject context into outgoing HTTP headers (instrumented libraries do this automatically)
def call_inventory_service(product_id: str):
    headers = {}
    propagate.inject(headers)  # adds traceparent + baggage headers
    response = requests.get(
        f"http://inventory-svc/products/{product_id}", headers=headers
    )
    return response.json()

# Extract context from incoming request (frameworks handle this automatically)
def handle_request(request_headers: dict):
    ctx = propagate.extract(request_headers)
    # ctx now contains the parent span context
    with tracer.start_as_current_span("handle", context=ctx) as span:
        pass

# Baggage: propagate business context downstream (use cautiously).
# set_baggage() returns a new Context; attach it so later inject() calls see it.
ctx = set_baggage("tenant.id", "acme-corp")
token = context.attach(ctx)
# ... make outgoing calls here ...
context.detach(token)
# Downstream services can read: get_baggage("tenant.id")
```
⚠️
Baggage is propagated to all downstream services and is visible in headers. Never put sensitive data (PII, secrets, tokens) in baggage. It's for low-cardinality routing hints like tenant ID or feature flags — not user data.
📊

Metrics are aggregated numerical measurements over time. Unlike traces (one per request), metrics collapse many observations into a single time-series data point. They're cheap to store, fast to query, and ideal for dashboards and alerts.

A metric is identified by its name, unit, description, and a set of attributes (dimensions). The combination of metric name + attributes = a unique time series.

Cardinality explosion. Adding a high-cardinality attribute (like user_id or request_id) to a metric creates one time series per unique value. 1M users = 1M series. This will make your backend OOM and your bill astronomical. Attributes on metrics must be low-cardinality. Rule of thumb: < 100 unique values per attribute.
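The arithmetic behind this warning is a simple product: in the worst case, one metric produces one series per combination of attribute values. The helper below is a back-of-the-envelope sketch; `max_series` and the example cardinalities are made up for illustration.

```python
from math import prod
from typing import Dict


def max_series(unique_values: Dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of per-attribute cardinalities."""
    return prod(unique_values.values())


safe = max_series({"order.type": 4, "region": 5, "outcome": 2})
risky = max_series({"order.type": 4, "region": 5, "outcome": 2, "user_id": 1_000_000})
```

Here `safe` is 40 series and `risky` is 40,000,000: a single unbounded attribute multiplies everything that was already there.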

Metric data model — the four types of data points

| Data Type | Description | Aggregation |
|---|---|---|
| Sum | Cumulative or delta sum of measurements. Monotonic (always increasing) or non-monotonic. | Sum over interval |
| Gauge | Instantaneous value, with no aggregation across time. Current memory, CPU, queue depth. | Last value |
| Histogram | Distribution of values: bucket counts, sum, count. Powers p50/p99/p999 latency. | Configurable buckets |
| ExponentialHistogram | Adaptive-bucket histogram with higher resolution and no pre-configured buckets. | Adaptive buckets |
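To show what a Histogram data point actually stores, here is a small explicit-bucket aggregator in plain Python. It is a sketch of the data model, not the SDK's implementation: the `ExplicitBucketHistogram` class and the boundary values are assumptions for the example. Each boundary is treated as an inclusive upper bound, and one extra bucket catches everything above the last boundary.

```python
from bisect import bisect_left
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExplicitBucketHistogram:
    boundaries: List[float]                 # sorted upper bounds, e.g. seconds
    counts: List[int] = field(init=False)   # len(boundaries) + 1 buckets
    total: float = 0.0
    count: int = 0

    def __post_init__(self) -> None:
        self.counts = [0] * (len(self.boundaries) + 1)

    def record(self, value: float) -> None:
        # bisect_left puts a value equal to a boundary into that boundary's bucket,
        # so bucket i covers (boundaries[i-1], boundaries[i]].
        self.counts[bisect_left(self.boundaries, value)] += 1
        self.total += value
        self.count += 1


latency = ExplicitBucketHistogram(boundaries=[0.1, 0.5, 1.0])
for seconds in (0.05, 0.1, 0.3, 0.7, 2.0):
    latency.record(seconds)
```

Only the bucket counts, the sum, and the count are exported, which is why histograms stay cheap no matter how many requests are recorded, and why percentile accuracy depends on where the boundaries sit.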
🎚️
| Instrument | Measurement | Default Aggregation | Use for |
|---|---|---|---|
| Counter | Monotonically increasing value you add to | Sum | Requests processed, bytes sent, errors, jobs completed |
| UpDownCounter | Value that can increase or decrease | Sum | Active connections, queue depth, items in cache |
| Histogram | Individual measurements; builds a distribution | Explicit bucket histogram | Request latency, response size, batch sizes |
| Gauge | Snapshot of current value | Last value | CPU %, memory used, fan speed, temperature |
| ObservableCounter | Counter value read via callback | Sum | System counters you read (not control): CPU time, GC collections |
| ObservableGauge | Gauge value read via callback | Last value | System metrics: heap used, thread count, connection pool size |
| ObservableUpDownCounter | UpDownCounter read via callback | Sum | Number of goroutines, active DB connections (pull-style) |
💡
Synchronous vs Observable: Synchronous instruments (Counter, Histogram) are recorded at the time of the event — "I just processed a request." Observable instruments are polled on each collection interval via a callback — "how many goroutines exist right now?" Use Observable when you're reading system state, not when you're counting events.
💻
metrics.py

```python
import time

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

# 1. Configure MeterProvider (`resource` is the Resource built in app.py)
reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="http://collector:4317"),
    export_interval_millis=60_000,  # export every 60s
)
provider = MeterProvider(resource=resource, metric_readers=[reader])
metrics.set_meter_provider(provider)

# 2. Get a meter: once per library/component
meter = metrics.get_meter("order_service", "2.4.1")

# 3. Create instruments at module load time, not per request
order_counter = meter.create_counter(
    "orders.placed",
    unit="{orders}",
    description="Total number of orders placed",
)
latency_histogram = meter.create_histogram(
    "orders.processing.duration",
    unit="s",
    description="Time to process an order end-to-end",
)
active_orders = meter.create_up_down_counter(
    "orders.in_flight",
    unit="{orders}",
    description="Orders currently being processed",
)

# 4. Record measurements with low-cardinality attributes only
def process_order(order):
    base_attrs = {"order.type": order.type, "region": order.region}
    outcome = "error"  # overwritten on success
    active_orders.add(1, base_attrs)
    start = time.monotonic()
    try:
        result = do_process(order)
        outcome = "success"
        return result
    finally:
        # Use the same attributes for +1 and -1 so the in-flight series balances.
        active_orders.add(-1, base_attrs)
        result_attrs = {**base_attrs, "outcome": outcome}
        latency_histogram.record(time.monotonic() - start, result_attrs)
        order_counter.add(1, result_attrs)

# 5. Observable gauge: reads system state via callback
def observe_queue_depth(options):
    yield metrics.Observation(get_queue_depth(), {"queue": "orders"})

meter.create_observable_gauge(
    "orders.queue.depth",
    callbacks=[observe_queue_depth],
    unit="{items}",
)
```
📐
Naming convention: Use dot-separated hierarchical names: http.server.request.duration, db.client.connection.count. Follow the OTel semantic conventions naming schema. Unit goes in the unit field, not the name ("ms", "s", "By", "{requests}").
📋

OTel Logs are the newest of the three signals (GA'd after traces and metrics). The OTel Logs API does not replace your logging library — it bridges your existing logger (loguru, structlog, winston, serilog) into the OTel pipeline, injecting trace_id and span_id automatically.

OTel Log Record

A structured event with a timestamp, severity, body (message), and a map of attributes. Plus trace_id and span_id when emitted in a trace context.

Log Bridge API

Connects your existing logging library to the OTel LoggerProvider. The bridge reads log records from the library and translates them into OTel log records — no rewrite required.

Severity levels mapping

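| OTel Severity | Number | Syslog | Log4j | Python logging |
|---|---|---|---|---|
| TRACE | 1–4 | | TRACE | |
| DEBUG | 5–8 | debug | DEBUG | DEBUG (10) |
| INFO | 9–12 | info | INFO | INFO (20) |
| WARN | 13–16 | warning | WARN | WARNING (30) |
| ERROR | 17–20 | error | ERROR | ERROR (40) |
| FATAL | 21–24 | crit/alert/emerg | FATAL | CRITICAL (50) |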
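The Python column of this mapping can be written as a small lookup. This is a simplified sketch rather than the SDK's exact logic: `otel_severity` is a hypothetical helper that maps each standard `logging` level to the first number of its OTel range, rounds custom in-between levels down to the nearest standard level, and treats anything below DEBUG as TRACE.

```python
import logging
from typing import Tuple

# (Python threshold, OTel SeverityNumber, OTel severity text), highest first.
_BASES = [
    (logging.CRITICAL, 21, "FATAL"),
    (logging.ERROR, 17, "ERROR"),
    (logging.WARNING, 13, "WARN"),
    (logging.INFO, 9, "INFO"),
    (logging.DEBUG, 5, "DEBUG"),
]


def otel_severity(levelno: int) -> Tuple[int, str]:
    """Map a Python logging level number to an OTel (SeverityNumber, text) pair."""
    for threshold, number, text in _BASES:
        if levelno >= threshold:
            return number, text
    return 1, "TRACE"  # assumption of this sketch: below DEBUG counts as TRACE


warn = otel_severity(logging.WARNING)
fatal = otel_severity(logging.CRITICAL)
```

Backends filter and alert on the numeric value, so two services that use different logging libraries still compare cleanly once both are mapped onto the same 1 to 24 scale.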
💻
logging_setup.py

```python
# pip install opentelemetry-sdk opentelemetry-exporter-otlp opentelemetry-instrumentation-logging
import logging

from opentelemetry._logs import set_logger_provider
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry.exporter.otlp.proto.grpc._log_exporter import OTLPLogExporter

# 1. Configure the OTel LoggerProvider (`resource` is the Resource built in app.py)
log_provider = LoggerProvider(resource=resource)
log_provider.add_log_record_processor(
    BatchLogRecordProcessor(OTLPLogExporter(endpoint="http://collector:4317"))
)
set_logger_provider(log_provider)

# 2. Bridge Python's stdlib logging → OTel
handler = LoggingHandler(level=logging.NOTSET, logger_provider=log_provider)
logging.getLogger().addHandler(handler)

# 3. Use your logger normally: trace_id is injected automatically
logger = logging.getLogger("order_service.checkout")

def checkout(cart_id: str):
    # `tracer` is the tracer created in app.py; process() is application code
    with tracer.start_as_current_span("checkout"):
        # This log will automatically have trace_id + span_id attached
        logger.info("Starting checkout", extra={"cart.id": cart_id})
        try:
            result = process(cart_id)
            logger.info("Checkout complete", extra={"order.id": result.id})
        except Exception as e:
            logger.error("Checkout failed", exc_info=e, extra={"cart.id": cart_id})
            raise  # log, then let the caller (and the span) see the failure
```
✓ DO: structured, searchable fields

```python
# Structured fields: searchable
logger.info("Order placed", extra={
    "order.id": order.id,
    "order.type": order.type,
    "order.region": order.region,
})

# Exception with full stack trace
logger.error(
    "Payment failed",
    exc_info=True,
    extra={"payment.provider": "stripe"},
)
```

✕ DON'T: unstructured, unsearchable

```python
# String interpolation: can't query by field
logger.info(f"Order {order.id} placed by {user.email}")

# PII in logs (GDPR violation!)
logger.debug(f"User card: {card_number}")

# Swallowing the exception object
try:
    charge(order)
except Exception as e:
    logger.error(str(e))  # loses stack trace
```
🔄

The OTel Collector is a standalone binary (or Docker image) that receives telemetry from your services, processes it, and exports it to backends. It's the recommended deployment pattern for production — your services only talk to a local Collector, never directly to a backend.

Receivers (OTLP · Jaeger · Prometheus · Zipkin) → Processors (Batch · Filter · Enrich · Sample) → Exporters (OTLP · Jaeger · Prometheus · Custom)
collector-config.yaml

```yaml
# OTel Collector configuration
# Note: filter, tail_sampling and prometheusremotewrite ship in the contrib distribution.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317  # gRPC
      http:
        endpoint: 0.0.0.0:4318  # HTTP (Protobuf or JSON)
  prometheus:
    config:
      scrape_configs:
        - job_name: 'my-app'
          static_configs:
            - targets: ['app:8080']

processors:
  batch:
    timeout: 5s
    send_batch_size: 1000
  memory_limiter:
    limit_mib: 512
    check_interval: 1s
  resource:
    attributes:
      - key: deployment.environment
        value: "production"
        action: upsert  # add/overwrite resource attributes
  filter/drop_health_spans:  # drop noisy health-check traces
    error_mode: ignore
    traces:
      span:
        - 'attributes["http.target"] == "/health"'
  tail_sampling:  # keep all errors + 10% of success
    decision_wait: 10s
    policies:
      - name: errors-policy
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: probabilistic-policy
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  otlp/honeycomb:
    endpoint: api.honeycomb.io:443
    headers:
      x-honeycomb-team: ${env:HONEYCOMB_API_KEY}
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
  debug:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      # memory_limiter first, batch last: filter and sample before batching
      processors: [memory_limiter, filter/drop_health_spans, tail_sampling, batch]
      exporters: [otlp/jaeger, otlp/honeycomb]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, resource, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/honeycomb]
```

Collector distributions

| Distribution | Use when |
|---|---|
| otelcol (core) | Minimal build: only OTel-maintained components. Recommended for most. |
| otelcol-contrib | Full distribution: all community components. Good for evaluating; large binary. |
| OCB (OTel Collector Builder) | Production: build a custom binary with only the components you need. Minimizes attack surface and binary size. |
| Vendor distributions | Grafana Alloy (successor to Grafana Agent), Elastic Agent, Datadog Agent: vendor-managed agents with extra integrations. |
📤
| Exporter | Protocol | Use for |
|---|---|---|
| OTLP/gRPC | Protobuf over gRPC | Default in many SDKs. Best performance. Use in production. |
| OTLP/HTTP | Protobuf or JSON over HTTP | Environments where gRPC is blocked (some firewalls, browsers). |
| Jaeger | Jaeger Thrift / gRPC | Direct to Jaeger without Collector (legacy pattern; recent Jaeger versions ingest OTLP natively). |
| Zipkin | Zipkin JSON | Legacy Zipkin backends. |
| Prometheus | Prometheus text/pull | Expose metrics on /metrics for Prometheus scraping. |
| Console | Stdout JSON | Development / debugging only. Never in production. |
| In-memory | In-process | Testing: assertions on exported spans/metrics in unit tests. |
Always use OTLP + Collector in production. Export to localhost:4317 (your Collector sidecar). Never hardcode backend-specific exporters in application code — that recreates vendor lock-in at the SDK level. The Collector handles routing.
⚙️

Environment variables — configure without code changes

.env / Kubernetes env

```bash
# Service identity (strongly recommended; the SDK falls back to unknown_service)
OTEL_SERVICE_NAME="order-service"

# Collector endpoint
OTEL_EXPORTER_OTLP_ENDPOINT="http://collector:4317"          # all signals
OTEL_EXPORTER_OTLP_TRACES_ENDPOINT="http://collector:4317"   # traces only
OTEL_EXPORTER_OTLP_PROTOCOL="grpc"                           # or http/protobuf

# Resource attributes (service.version goes here; there is no OTEL_SERVICE_VERSION variable)
OTEL_RESOURCE_ATTRIBUTES="service.version=2.4.1,deployment.environment=production,k8s.cluster.name=prod-us-east"

# Sampling
OTEL_TRACES_SAMPLER="parentbased_traceidratio"
OTEL_TRACES_SAMPLER_ARG="0.1"            # sample 10% of root spans

# Propagation
OTEL_PROPAGATORS="tracecontext,baggage"  # W3C TraceContext + Baggage

# Disable specific signals
OTEL_TRACES_EXPORTER="none"              # disable traces
OTEL_METRICS_EXPORTER="none"             # disable metrics
```
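The OTEL_RESOURCE_ATTRIBUTES value is a comma-separated list of key=value pairs. A rough sketch of how such a string turns into resource attributes is shown below; `parse_resource_attributes` is a hypothetical helper written for this example (the SDK does its own parsing), and it assumes commas or equals signs inside values are percent-encoded.

```python
from typing import Dict
from urllib.parse import unquote


def parse_resource_attributes(raw: str) -> Dict[str, str]:
    """Split 'k1=v1,k2=v2' into a dict, percent-decoding the values."""
    attrs: Dict[str, str] = {}
    for pair in raw.split(","):
        if not pair.strip():
            continue  # tolerate trailing commas
        key, sep, value = pair.partition("=")
        if not sep:
            continue  # skip malformed entries without '='
        attrs[key.strip()] = unquote(value.strip())
    return attrs


parsed = parse_resource_attributes(
    "service.version=2.4.1,deployment.environment=production,team=payments%2Ccore"
)
```

Setting identity through the environment keeps the same container image usable across staging and production without a rebuild.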

Sampling strategies

| Sampler | Behavior | Use for |
|---|---|---|
| always_on | Sample 100% of traces | Development, low-traffic services |
| always_off | Drop all traces | Disabling tracing temporarily |
| traceidratio | Sample N% of all traces regardless of parent | Rarely useful: ignores parent sampling decision |
| parentbased_always_on | Respect parent sampling; always_on for roots | Dev/test environments |
| parentbased_traceidratio | Respect parent sampling; ratio for roots | Production: standard choice |
| Tail sampling (Collector) | Sample after seeing the full trace | Keep all errors and slow traces, sample fast successful traces |
🎯
Tail sampling is the most powerful strategy: the Collector waits for the entire trace to arrive (configurable buffer window, e.g. 10s), then decides whether to keep or drop it based on the full picture — was there an error? Was it slow? Did it touch a specific service? This is impossible with head sampling (which decides at the start of a trace).
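The decision a tail sampler makes once the buffer window closes can be approximated as follows. This is a plain-Python sketch of the policy logic only, not the Collector's tail_sampling processor: `SpanSummary`, `keep_trace`, the 1000 ms slow threshold, and the 10% baseline are all assumptions chosen for the example.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass
class SpanSummary:
    is_error: bool
    duration_ms: float


def keep_trace(
    trace_id: str,
    spans: List[SpanSummary],
    slow_ms: float = 1000.0,
    baseline_percent: float = 10.0,
) -> bool:
    """Decide after the whole trace has arrived: errors and slow traces always survive."""
    if any(span.is_error for span in spans):
        return True
    if max((span.duration_ms for span in spans), default=0.0) >= slow_ms:
        return True
    # Everything else: keep a stable pseudo-random share, keyed on the trace_id.
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 10_000
    return bucket < baseline_percent * 100


errored = keep_trace("trace-a", [SpanSummary(False, 12.0), SpanSummary(True, 3.0)], baseline_percent=0.0)
slow = keep_trace("trace-b", [SpanSummary(False, 1500.0)], baseline_percent=0.0)
boring = keep_trace("trace-c", [SpanSummary(False, 20.0)], baseline_percent=0.0)
```

The cost of this power is memory: every span of every in-flight trace has to be held until the decision is made, which is why the buffer window is kept short.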
📐

Semantic conventions define standardized attribute names for common operations. Using them means your telemetry works out-of-the-box with dashboards, queries, and alerting rules built for OTel. Deviating from them means building everything from scratch.

| Domain | Key Attributes |
|---|---|
| HTTP Server | http.request.method, http.response.status_code, url.path, server.address, http.route |
| HTTP Client | http.request.method, http.response.status_code, url.full, server.address, server.port |
| Database | db.system.name, db.namespace, db.operation.name, db.query.text, db.collection.name, server.address (older releases: db.system, db.name, db.operation, db.statement, db.sql.table) |
| Messaging | messaging.system, messaging.destination.name, messaging.operation, messaging.message.id |
| RPC | rpc.system, rpc.service, rpc.method, rpc.grpc.status_code |
| Exceptions | exception.type, exception.message, exception.stacktrace |
| Service | service.name, service.version, service.instance.id |
| Kubernetes | k8s.pod.name, k8s.namespace.name, k8s.deployment.name, k8s.cluster.name |
| Cloud | cloud.provider, cloud.region, cloud.account.id, cloud.availability_zone |
📖
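Stability: Semantic conventions have stability markers: stable, experimental, and deprecated. The HTTP conventions were declared stable in late 2023; the database conventions stabilized later and renamed several attributes along the way (for example, db.statement became db.query.text). Check opentelemetry.io/docs/specs/semconv before naming custom attributes: the convention you want may already exist.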
🎯
| Question / Use Case | Trace | Metric | Log |
|---|---|---|---|
| Why did this specific request fail? | PRIMARY | hints | details |
| What is my p99 latency over the last 24h? | too costly | PRIMARY | no |
| How many orders per second are we processing? | no | PRIMARY | no |
| Which microservice is the bottleneck in my request path? | PRIMARY | confirm | no |
| Did the background job succeed for tenant X? | if traced | no | PRIMARY |
| Alert when error rate exceeds 1% | no | PRIMARY | some tools |
| What SQL queries ran during a slow request? | PRIMARY | no | verbose mode |
| How many users triggered feature flag X today? | no | PRIMARY | if logged |
| Audit trail: who deleted record 12345? | no | no | PRIMARY |
| Is this a new deployment causing elevated errors? | sample traces | PRIMARY | confirm |
🔁
The observability loop: Alerts fire on metrics. You jump to traces to find the specific bad request. You read logs to understand exactly what happened inside it. Each signal leads you to the next. This is why correlation (shared trace_id) between all three is the real value of OTel.
🏗️

Pattern 1 — Auto-instrumentation first

Before writing a single line of manual instrumentation, install the auto-instrumentation package for your framework. It instruments all HTTP calls, database queries, messaging, and more with zero code. Add manual spans only for business logic that auto-instrumentation can't see.

zero-code instrumentation

```bash
# Python: auto-instrument Flask + SQLAlchemy + Redis with zero code changes
# (opentelemetry-distro configures the SDK for the opentelemetry-instrument launcher)
pip install opentelemetry-distro \
  opentelemetry-instrumentation-flask \
  opentelemetry-instrumentation-sqlalchemy \
  opentelemetry-instrumentation-redis
opentelemetry-instrument \
  --traces_exporter console \
  --service_name my-flask-app \
  python app.py

# Node.js: auto-instrument Express + Postgres + Redis
npm install @opentelemetry/api @opentelemetry/auto-instrumentations-node
node --require @opentelemetry/auto-instrumentations-node/register app.js

# Java: zero-code Java agent (attach to JVM)
# Recent agent versions default to http/protobuf, so select gRPC explicitly for port 4317.
java -javaagent:opentelemetry-javaagent.jar \
  -Dotel.service.name=my-java-svc \
  -Dotel.exporter.otlp.protocol=grpc \
  -Dotel.exporter.otlp.endpoint=http://collector:4317 \
  -jar myapp.jar
```

Pattern 2 — Kubernetes sidecar deployment

deployment.yaml

```yaml
spec:
  containers:
    - name: app
      image: myapp:2.4.1
      env:
        # Downward API values; they must be defined before they are referenced below
        - name: MY_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: MY_POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: OTEL_SERVICE_NAME
          value: "order-service"
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: "http://localhost:4317"  # sidecar on localhost
        - name: OTEL_RESOURCE_ATTRIBUTES
          value: "k8s.pod.name=$(MY_POD_NAME),k8s.namespace.name=$(MY_POD_NAMESPACE)"
    - name: otel-collector
      image: otel/opentelemetry-collector-contrib:latest  # pin a specific version in production
      args: ["--config=/etc/otel/config.yaml"]
      ports:
        - containerPort: 4317  # gRPC
        - containerPort: 4318  # HTTP
      volumeMounts:
        - name: otel-config
          mountPath: /etc/otel
  volumes:
    - name: otel-config
      configMap:
        name: otel-collector-config
```

Pattern 3 — Correlation: linking metrics to traces

exemplar_demo.py

```python
import time

from opentelemetry import metrics, trace

tracer = trace.get_tracer("order_service.http")
meter = metrics.get_meter("order_service.http")

# When the SDK supports exemplars, a measurement recorded inside a sampled
# span can carry that span's trace_id as an exemplar (the default
# trace-based exemplar filter). Support and defaults vary by SDK and version.
request_duration = meter.create_histogram(
    "http.server.request.duration",
    unit="s",
)

def handle_request(path: str, route_template: str):
    with tracer.start_as_current_span("handle_request"):
        start = time.monotonic()
        response = route(path)  # route() is the application's own dispatcher
        # Recorded inside the active span, so the observation can carry
        # an exemplar pointing at it with no manual work.
        request_duration.record(
            time.monotonic() - start,
            {
                # Use the route template ("/users/{id}"), never the raw path
                "http.route": route_template,
                "http.response.status_code": response.status,
            },
        )
        return response

# In Grafana: when you see a p99 spike, click the exemplar
# to jump directly to the corresponding trace in Tempo/Jaeger.
```

Pattern 4 — Testing your instrumentation

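test_instrumentation.py

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter
from opentelemetry.trace import StatusCode

# Assumes checkout() is importable from a module that does not install
# its own TracerProvider at import time.

# The global provider can only be set once per process, so install the
# in-memory pipeline at import time and clear the exporter between tests.
exporter = InMemorySpanExporter()
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(exporter))
trace.set_tracer_provider(provider)

def test_checkout_creates_span_with_cart_id():
    exporter.clear()

    checkout(cart_id="cart-123")

    # checkout() also creates child spans, so look the span up by name
    spans = {span.name: span for span in exporter.get_finished_spans()}
    span = spans["checkout"]
    assert span.attributes["cart.id"] == "cart-123"
    # checkout() never sets a status explicitly, so it stays UNSET
    assert span.status.status_code == StatusCode.UNSET
```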
🖥️
| Tool | Signals | OSS? | Notes |
|---|---|---|---|
| Jaeger | Traces | ✅ CNCF | The OSS trace standard. Native OTLP support. Excellent for distributed tracing UI. |
| Grafana Tempo | Traces | ✅ OSS | Scalable trace storage, integrates with Grafana. Tag-based search, exemplar linking with Prometheus. |
| Prometheus | Metrics | ✅ CNCF | The OSS metrics standard. Pull-based. Use with OTel Prometheus exporter or remote write from Collector. |
| Grafana Mimir | Metrics | ✅ OSS | Scalable Prometheus-compatible metrics backend. HA, long-term storage, multi-tenant. |
| Grafana Loki | Logs | ✅ OSS | Log aggregation designed for Grafana. Label-based indexing, LogQL query language. |
| Elastic (APM) | Traces + Metrics + Logs | ⚠️ Source-avail | Full OTel support via OTLP. Powerful search. Free self-hosted tier. |
| Honeycomb | Traces + Logs | ❌ Commercial | Best-in-class distributed tracing UX. Wide-event model. Native OTLP. Generous free tier. |
| Datadog | Traces + Metrics + Logs | ❌ Commercial | Full platform. Native OTel support. Premium cost; industry standard in enterprise. |
| Grafana Cloud | Traces + Metrics + Logs | ❌ Managed OSS | Managed Tempo + Mimir + Loki + Grafana. Generous free tier. Native OTLP ingest. |
| AWS X-Ray / CloudWatch | Traces + Metrics + Logs | ❌ Commercial | Tightly integrated with AWS. OTel support via ADOT (AWS Distro for OTel). |
📚

The ten most common OTel mistakes

Mistake 1: High-cardinality metric attributes
Adding user_id or request_id as metric attributes. This creates millions of time series and can crash backends. Metric attributes must be low-cardinality (<100 values).

Mistake 2: Forgetting to end spans
Spans that are never ended leak memory and are never exported. Use try/finally or context managers. In JavaScript, call span.end() yourself, even inside a startActiveSpan callback.

Mistake 3: High-cardinality span names
"GET /users/12345" instead of "GET /users/{id}". Span names are used for aggregation: variable data goes in attributes, not names.

Mistake 4: Library code installing the SDK
Libraries that register a global TracerProvider (for example, by calling trace.set_tracer_provider() in Python) will break every application using them. Libraries use the API only. Applications own SDK configuration.

Mistake 5: Exporting directly to backend from app
Hard-coding backend-specific exporters in application code recreates vendor lock-in. Export to the OTel Collector via OTLP. Let the Collector route to backends.

Mistake 6: Missing resource attributes
Not setting service.name, service.version, deployment.environment. Without these, you can't filter telemetry by service in any backend UI.

Mistake 7: 100% sampling in production
Sampling every trace from a high-traffic service is prohibitively expensive. Use parentbased_traceidratio at 10% or tail sampling in the Collector to keep errors.

Mistake 8: PII in telemetry
Recording user emails, card numbers, or passwords in span attributes or log records. Telemetry backends are rarely set up as compliant stores for personal data. Scrub PII at the Collector or before recording.

Mistake 9: Using Console exporter in production
The console (stdout) exporter is for development only. In production it floods stdout with JSON, and none of that telemetry reaches a backend. Use OTLP.

Mistake 10: Ignoring propagation
Not injecting traceparent headers into outgoing HTTP calls or message queue payloads. Without this, traces are broken: every service looks like a standalone request.

OTel is infrastructure, not a feature. The best-instrumented services are the ones where you can jump from an alert to the root cause in under 60 seconds. OTel gives you the plumbing; good signal naming, sampling, and correlation give you the power. The investment pays for itself the first time you diagnose an incident in production with a full trace instead of grep.