The vendor-neutral industry standard for instrumentation. Learn to emit, collect, and export traces, metrics, and logs from any service — and actually understand what your systems are doing in production.
// otel.io — the observability framework, not a backend
OpenTelemetry (OTel) is a CNCF project that provides a unified API, SDK, and tooling for collecting observability data from your services. It's not a monitoring backend — it's the instrumentation layer that sits between your code and the backends you already use (Datadog, Grafana, Jaeger, Honeycomb, etc.).
Before OTel, every vendor had its own SDK. If you used Datadog's tracer, you were locked in. OTel breaks that coupling: instrument once with the OTel API, route telemetry anywhere.
Vendor-Neutral
Write instrumentation once against the OTel API. Switch backends by changing a config file — no code changes. Jaeger today, Honeycomb tomorrow.
Three Signals
Traces, Metrics, and Logs under one unified model with correlated context. See a metric spike, jump to the trace, read the log — all linked by trace_id.
Any Language
SDKs for Go, Python, Java, JavaScript/Node, .NET, Rust, Ruby, PHP, and more. Auto-instrumentation for popular frameworks requires zero code changes.
🎯
The golden rule: OTel is the pipe, not the destination. You still need a backend that stores and queries telemetry data. OTel just ensures the data gets there in a standard format (OTLP — OpenTelemetry Protocol) regardless of which backend you choose.
OTel vs. what came before
| Aspect | Old World (vendor SDKs) | OpenTelemetry |
|---|---|---|
| Instrumentation | Per-vendor SDK in every service | One OTel SDK, any backend |
| Lock-in | High: migration = re-instrumentation | Zero: swap exporters, keep code |
| Correlation | Traces and logs siloed | Unified trace_id / span_id across all signals |
| Format | Proprietary per vendor | OTLP (open standard) |
| Auto-instrumentation | Vendor-specific agents | OTel zero-code instrumentation |
| Governance | Vendor-controlled | CNCF, open governance |
⚡
The Three Signals
// traces · metrics · logs — what they are, what they answer
Each signal answers a different observability question. They are complementary — not interchangeable. Using all three gives you the full picture; using only one leaves blind spots.
| Signal | Answers | Shape of Data | Cardinality | Storage Cost |
|---|---|---|---|---|
| ● Trace | "What happened and how long did each step take?" | Tree of spans with timings, attributes, events | High: one trace per request | High: sampled to manage |
| ● Metric | "How is my system performing over time?" | Numerical time-series with labels/dimensions | Low-medium: aggregated | Low: pre-aggregated |
| ● Log | "What exactly happened at this moment?" | Timestamped text/JSON with structured fields | Very high: every event | Very high: raw events |
Signal correlation — the real power
OTel injects trace_id and span_id into logs automatically when you use the OTel Logs bridge. This means every log line emitted during a request is linkable to its trace, and every metric can carry exemplars pointing to a sample trace.
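The mechanism behind that injection can be sketched with the standard library alone. This is not the OTel Logs bridge itself, only an illustration of the idea; `current_trace`, `TraceContextFilter`, and the ID values are hypothetical names chosen for this example.

```python
import contextvars
import io
import logging

# Hypothetical stand-in for the active span context: (trace_id, span_id) as hex strings.
current_trace = contextvars.ContextVar("current_trace", default=None)


class TraceContextFilter(logging.Filter):
    """Copy the active trace/span IDs onto every log record."""

    def filter(self, record):
        ctx = current_trace.get()
        record.trace_id, record.span_id = ctx if ctx else ("-", "-")
        return True  # never drop the record, only enrich it


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.addFilter(TraceContextFilter())
handler.setFormatter(
    logging.Formatter("%(message)s trace_id=%(trace_id)s span_id=%(span_id)s")
)

logger = logging.getLogger("correlation_demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

# Simulate a request: set the context, log, then clear it again.
token = current_trace.set(("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7"))
logger.info("charging card")  # emitted inside the "request"
current_trace.reset(token)
logger.info("idle")  # emitted outside any trace

output = stream.getvalue().splitlines()
print("\n".join(output))
```

Because the IDs travel in a context variable rather than in function arguments, application code keeps calling the logger normally and every line emitted during the request becomes joinable to its trace.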
🔗
Exemplars: A Prometheus metric can carry an exemplar — a pointer to a specific trace ID that caused a high-latency observation. This bridges the gap between "p99 spiked at 14:32" and "here's the actual slow request in Jaeger."
🧱
Core Concepts
// API · SDK · providers · resources · context
API
OTel API
Stable interfaces you write instrumentation against. Zero implementation — it's a no-op until an SDK is configured. Libraries should depend only on the API, never the SDK.
SDK
OTel SDK
The implementation of the API. Handles sampling, batching, and exporting. Applications configure and install the SDK at startup. Libraries must never install an SDK.
RESOURCE
Resource
Immutable description of the entity producing telemetry. Set once at startup: service.name, service.version, deployment.environment, k8s.pod.name, etc.
PROVIDER
TracerProvider / MeterProvider / LoggerProvider
The entry point for creating tracers, meters, and loggers. Configure exporters and processors here. Treat as a singleton — one per signal per process.
CONTEXT
Context & Propagation
Carries trace_id, span_id, and baggage across process boundaries via HTTP headers (W3C traceparent), gRPC metadata, or message queues.
OTLP
OTLP Protocol
OpenTelemetry Protocol: the standard wire format (Protobuf over gRPC or HTTP, with a JSON encoding also available over HTTP). The OTel Collector and most major backends accept OTLP natively.
⚠️
Library authors: Only depend on opentelemetry-api. Never depend on or configure the SDK. If a library configures a TracerProvider, it will break applications that have their own SDK setup. The application owns SDK configuration — always.
🔭
Architecture Overview
// how telemetry flows from code to backend
📱 Your Code (App / Service) → 📦 OTel SDK (API + SDK) → 🔄 Collector (OTel Collector) → 🔀 Processor (Filter / Enrich) → 📤 Exporter (OTLP / Native) → 🖥️ Backend (Jaeger / Grafana…)
The OTel Collector is optional but strongly recommended in production. Without it, your services export directly to backends, coupling them to backend availability and format. With it, you gain a single control plane: filter noisy spans, add attributes, fan out to multiple backends, and change destinations without touching application code.
Direct Export (Dev / Simple)
Service → OTLP Exporter → Backend directly. Fine for development or very simple setups. No Collector needed. Trade-off: every service is coupled to backend configuration.
Collector Sidecar (Production)
Service → local Collector → central Collector gateway → backends. Services only know about localhost:4317. The Collector fleet handles routing, retry, batching, and fanout.
🔍
Tracing Concepts ● Trace
// distributed request tracing — follow a request across services
A trace is the complete record of a single request as it travels through your system. It's a directed acyclic graph (usually visualized as a waterfall) where each node is a span — a unit of work with a start time, duration, and metadata.
Trace
The entire journey of one request. Identified by a trace_id (a 128-bit identifier, usually shown as 32 hex characters; it is not a UUID). All spans sharing a trace_id belong to the same request, regardless of which service created them.
Span
A single unit of work: "DB query", "HTTP call", "message publish". Has a span_id, a reference to its parent span, a start time, duration, status, and key-value attributes.
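To make the trace/span relationship concrete, here is a small standard-library sketch that rebuilds a waterfall from a flat list of spans using only their parent references. The `Span` dataclass and `waterfall` helper are hypothetical names for illustration and are not part of the OTel API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    start_ms: int
    duration_ms: int


def waterfall(spans):
    """Return indented lines: children under their parent, ordered by start time."""
    children = {}
    for s in spans:
        children.setdefault(s.parent_id, []).append(s)

    lines = []

    def walk(parent_id, depth):
        for s in sorted(children.get(parent_id, []), key=lambda s: s.start_ms):
            lines.append(f"{'  ' * depth}{s.name} [{s.start_ms}ms +{s.duration_ms}ms]")
            walk(s.span_id, depth + 1)

    walk(None, 0)
    return lines


# Spans arrive in arbitrary order, possibly from different services.
spans = [
    Span("t1", "c", "a", "DB query", 15, 30),
    Span("t1", "a", None, "POST /orders", 0, 120),
    Span("t1", "b", "a", "HTTP call inventory", 50, 60),
]
lines = waterfall(spans)
print("\n".join(lines))
```

A tracing backend does essentially this grouping (by trace_id first, then by parent reference) before it draws the waterfall view.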
Span status codes
| Status | When to use | Sets Error? |
|---|---|---|
| UNSET | Default: no explicit status set. The operation completed, but hasn't been evaluated for success or error. | No |
| OK | Explicitly mark as successful. Use sparingly: only when a downstream system needs to override error status from a library. | No |
| ERROR | An error occurred. Set the exception event and error.type attribute. This will show as a red span in tracing UIs. | Yes |
✅
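Span status rule: Only set ERROR when your layer considers the operation failed. A 404 is not an error on the HTTP server span: the server handled the request correctly, and the HTTP semantic conventions leave 4xx unset for server spans (on client spans they do mark 4xx as ERROR). Whether it counts as a failure in the business logic span above depends on what that layer was trying to do. Think in terms of responsibility boundaries.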
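One concrete encoding of responsibility boundaries is the rule the HTTP semantic conventions use, which depends on the span kind: server spans treat only 5xx as errors, while client spans treat 4xx and 5xx as errors. The sketch below expresses that rule in plain Python; `http_span_status` is a hypothetical helper, not an OTel API.

```python
def http_span_status(status_code: int, kind: str) -> str:
    """Return "ERROR" or "UNSET" for an HTTP span, depending on the span kind."""
    if kind == "server":
        # The server did its job for a 4xx; only 5xx means the server failed.
        return "ERROR" if status_code >= 500 else "UNSET"
    if kind == "client":
        # From the caller's point of view, any 4xx/5xx means the call did not succeed.
        return "ERROR" if status_code >= 400 else "UNSET"
    raise ValueError(f"unknown span kind: {kind}")


for code in (200, 404, 503):
    print(code, http_span_status(code, "server"), http_span_status(code, "client"))
```

Note that the helper never returns OK: leaving the status unset is the normal outcome for a successful operation.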
📏
Spans & Attributes ● Trace
// naming, attributes, events, links
Span naming — the most important thing you'll get wrong
Span names are used for aggregation in tracing UIs. A bad name means you can't group related spans. A span name must be low-cardinality — do not include dynamic data like IDs or user emails in the name.
// Never include dynamic values in the span name
"GET /users/12345"          ✗ user ID
"Processing order 98765"    ✗ order ID
// Too vague
"process"                   ✗ meaningless
"handler"                   ✗ meaningless
// Includes PII
"user@company.com login"    ✗ PII in name
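When a framework does not hand you the route template, a fallback is to normalize the raw path before using it as a span name. The following is a rough heuristic sketch, assuming that purely numeric segments and UUID-shaped segments are identifiers; `span_name` and `_ID_SEGMENT` are hypothetical names, and a real route template from the framework is preferable when available.

```python
import re

# Segments that look like identifiers: all digits, or a canonical UUID.
_ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
)


def span_name(method: str, path: str) -> str:
    """Build a low-cardinality span name such as 'GET /users/{id}'."""
    segments = ["{id}" if _ID_SEGMENT.match(seg) else seg for seg in path.split("/")]
    return f"{method.upper()} {'/'.join(segments)}"


print(span_name("get", "/users/12345"))
print(span_name("GET", "/orders/98765/items"))
```

The raw ID is not lost: it belongs in a span attribute, where high cardinality is acceptable.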
Attributes
Attributes are key-value pairs attached to a span. They provide context for analysis. Follow semantic conventions for standard attributes — roll your own only for domain-specific data.
python · instrumentation.py
from opentelemetry import trace
from opentelemetry.semconv.trace import SpanAttributes

tracer = trace.get_tracer("my.library", "1.0.0")

with tracer.start_as_current_span("OrderService.PlaceOrder") as span:
    # Standard semantic convention attributes
    span.set_attribute(SpanAttributes.DB_SYSTEM, "postgresql")
    span.set_attribute(SpanAttributes.DB_NAME, "orders")
    span.set_attribute(SpanAttributes.DB_STATEMENT, "INSERT INTO orders ...")

    # Custom domain attributes: use a short, consistent namespace prefix
    span.set_attribute("order.customer_id", customer_id)  # high-cardinality: fine on spans, not on metrics
    span.set_attribute("order.item_count", len(items))
    span.set_attribute("order.total_usd", total)

    # Record the exception and mark the span as failed.
    # record_exception() only adds an event, so the status is set explicitly.
    # (In Python, start_as_current_span also records the exception and sets
    # ERROR by default when it propagates out of the block.)
    try:
        result = place_order(order)
    except Exception as e:
        span.record_exception(e)
        span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
        raise

    # Events: timestamped annotations inside a span
    span.add_event("payment.authorized", {"payment.provider": "stripe"})
    span.add_event("inventory.reserved", {"warehouse.id": "WH-03"})
⚠️
Attribute cardinality kills performance. Never use attributes with unbounded values (user IDs, request IDs, raw SQL queries) as metric dimensions. In traces it's fine — in metrics it will blow up your backend's cardinality limits and cost you thousands per month.
💻
Trace API Usage ● Trace
// creating tracers, spans, and wiring SDK
Python
python · app.py
# pip install opentelemetry-sdk opentelemetry-exporter-otlp
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource, SERVICE_NAME

# 1. Configure resource (who is sending this?)
resource = Resource.create({
    SERVICE_NAME: "order-service",
    "service.version": "2.4.1",
    "deployment.environment": "production",
})

# 2. Build the TracerProvider with exporter
provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://collector:4317"))
)
trace.set_tracer_provider(provider)

# 3. Get a tracer: do this once per module/library
tracer = trace.get_tracer("order_service.checkout", "2.4.1")

# 4. Create spans
def checkout(cart_id: str):
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("cart.id", cart_id)
        validate_cart(cart_id)   # child spans created inside here
        charge_payment(cart_id)  # are automatically children of "checkout"

# 5. Manual child span with context manager
def validate_cart(cart_id: str):
    with tracer.start_as_current_span("CartService.validate") as span:
        span.set_attribute("cart.id", cart_id)
        # ... business logic
Always end spans. A span that is never ended will leak memory and may never be exported. Use try/finally, context managers (Python), or the SDK's startActiveSpan callback pattern (JS) which ends the span automatically.
Context propagation is how a trace_id flows from Service A to Service B to Service C. Without it, you get disconnected spans that each look like independent requests. With it, you get the full distributed waterfall.
| Propagator | Header | Use when |
|---|---|---|
| W3C TraceContext | traceparent, tracestate | Default. Use this everywhere. W3C Recommendation. |
| B3 Single | b3 | Legacy: services using Zipkin/older Brave instrumentation |
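For reference, the traceparent header has four dash-separated hex fields: version, trace-id (32 hex characters), parent span id (16 hex characters), and flags. The sketch below parses and formats the version 00 layout with the standard library only; `parse_traceparent` and `format_traceparent` are hypothetical helpers, and in real services the OTel propagators do this for you.

```python
import re

# version - trace-id - parent-id - flags, all lowercase hex
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str):
    """Return a dict with trace_id, parent_id, sampled; None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are explicitly invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),  # lowest bit is the sampled flag
    }


def format_traceparent(trace_id: str, span_id: str, sampled: bool) -> str:
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
parsed = parse_traceparent(incoming)
print(parsed)
```

The service that receives this header starts its own span with the same trace_id and uses parent_id as the new span's parent, which is what stitches the distributed waterfall together.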
from opentelemetry import context, propagate
from opentelemetry.baggage import get_baggage, set_baggage
import requests

# Inject context into outgoing HTTP headers (instrumented libraries do this for you)
def call_inventory_service(product_id: str):
    headers = {}
    propagate.inject(headers)  # adds traceparent (and baggage, if any is set)
    response = requests.get(
        f"http://inventory-svc/products/{product_id}",
        headers=headers,
    )
    return response.json()

# Extract context from an incoming request (frameworks handle this automatically)
def handle_request(request_headers: dict):
    ctx = propagate.extract(request_headers)
    # ctx now contains the parent span context
    with tracer.start_as_current_span("handle", context=ctx) as span:
        pass

# Baggage: propagate business context downstream (use cautiously)
ctx = set_baggage("tenant.id", "acme-corp")  # returns a new Context, does not modify the current one
token = context.attach(ctx)                  # make it current so inject() includes it
# Downstream services can read: get_baggage("tenant.id")
# Call context.detach(token) when the scope ends.
⚠️
Baggage is propagated to all downstream services and is visible in headers. Never put sensitive data (PII, secrets, tokens) in baggage. It's for low-cardinality routing hints like tenant ID or feature flags — not user data.
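To see why baggage is unsuitable for secrets, it helps to look at what actually goes on the wire: a plain-text `baggage` header of comma-separated `key=value` pairs with percent-encoded values. The sketch below approximates that encoding with the standard library; `encode_baggage` and `decode_baggage` are hypothetical helpers, not the OTel propagator.

```python
from urllib.parse import quote, unquote


def encode_baggage(entries: dict) -> str:
    """Serialize entries as a baggage-style header value."""
    return ",".join(
        f"{quote(key, safe='')}={quote(str(value), safe='')}"
        for key, value in entries.items()
    )


def decode_baggage(header: str) -> dict:
    """Parse a baggage-style header value, ignoring optional ';property' suffixes."""
    entries = {}
    for member in header.split(","):
        if "=" not in member:
            continue
        key, _, value = member.partition("=")
        value = value.split(";", 1)[0]  # drop properties such as ';ttl=30'
        entries[unquote(key.strip())] = unquote(value.strip())
    return entries


header = encode_baggage({"tenant.id": "acme-corp", "plan": "gold tier"})
print(header)
print(decode_baggage(header))
```

Every hop that forwards the request, including proxies and third-party services, can read this header, which is the reason to restrict it to non-sensitive routing hints.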
📊
Metrics Concepts ● Metric
// time series, dimensions, aggregation
Metrics are aggregated numerical measurements over time. Unlike traces (one per request), metrics collapse many observations into a single time-series data point. They're cheap to store, fast to query, and ideal for dashboards and alerts.
A metric is identified by its name, unit, description, and a set of attributes (dimensions). The combination of metric name + attributes = a unique time series.
⛔
Cardinality explosion. Adding a high-cardinality attribute (like user_id or request_id) to a metric creates one time series per unique value. 1M users = 1M series. This will make your backend OOM and your bill astronomical. Attributes on metrics must be low-cardinality. Rule of thumb: < 100 unique values per attribute.
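The arithmetic behind the explosion is a simple product: in the worst case, the number of time series for one metric is the number of unique values of each attribute multiplied together. A small sketch with made-up attribute counts (`series_count` and the numbers are illustrative assumptions):

```python
from math import prod


def series_count(unique_values_per_attribute: dict) -> int:
    """Worst-case number of time series: every combination of attribute values occurs."""
    return prod(unique_values_per_attribute.values())


safe = series_count({"http.route": 40, "http.request.method": 5, "region": 4})
exploded = series_count(
    {"http.route": 40, "http.request.method": 5, "region": 4, "user_id": 1_000_000}
)
print(f"bounded attributes: {safe:,} series")
print(f"with user_id:       {exploded:,} series")
```

Adding one unbounded attribute multiplies the whole product, so a single careless dimension is enough to take a metric from hundreds of series to hundreds of millions.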
Metric data model — the four types of data points
| Data Type | Description | Aggregation |
|---|---|---|
| Sum | Cumulative or delta sum of measurements. Monotonic (always increasing) or non-monotonic. | Sum over interval |
| Gauge | Instantaneous value, no aggregation across time. Current memory, CPU, queue depth. | Last value |
| Histogram | Distribution of values: bucket counts, sum, count. Powers p50/p99/p999 latency. | Configurable buckets |
| ExponentialHistogram | Adaptive-bucket histogram: higher resolution without pre-configuring buckets. | Adaptive buckets |
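As an illustration of what a Histogram data point contains, the sketch below folds raw measurements into explicit buckets with inclusive upper bounds, plus the running count and sum. The boundaries and the `to_histogram` helper are assumptions made for this example, not SDK defaults.

```python
from bisect import bisect_left

# Hypothetical bucket boundaries in seconds; the final bucket is (1.0, +inf).
BOUNDS = [0.1, 0.5, 1.0]


def to_histogram(measurements, bounds=BOUNDS):
    """Aggregate raw measurements into bucket counts, count, and sum."""
    counts = [0] * (len(bounds) + 1)
    for value in measurements:
        # bisect_left puts a value equal to a boundary into that boundary's bucket,
        # so each bucket covers (previous_bound, bound].
        counts[bisect_left(bounds, value)] += 1
    return {
        "count": len(measurements),
        "sum": sum(measurements),
        "bucket_counts": counts,
    }


point = to_histogram([0.05, 0.1, 0.3, 0.7, 2.5])
print(point)
```

Only these aggregates are exported, never the individual measurements, which is why percentiles computed from a histogram are estimates whose accuracy depends on where the bucket boundaries sit.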
🎚️
Instrument Types ● Metric
// counter · gauge · histogram · updown counter — which to use when
| Instrument | Measurement | Default Aggregation | Use for |
|---|---|---|---|
| Counter | Monotonically increasing value you add to | Sum | Requests processed, bytes sent, errors, jobs completed |
| UpDownCounter | Value that can increase or decrease | Sum | Active connections, queue depth, items in cache |
| Histogram | Individual measurements, builds distribution | Explicit bucket histogram | Request latency, response size, batch sizes |
| Gauge | Snapshot of current value | Last value | CPU %, memory used, fan speed, temperature |
| ObservableCounter | Counter value read via callback | Sum | System counters you read (not control): CPU time, GC collections |
| ObservableGauge | Gauge value read via callback | Last value | System metrics: heap used, thread count, connection pool size |
| ObservableUpDownCounter | UpDownCounter read via callback | Sum | Number of goroutines, active DB connections (pull-style) |
💡
Synchronous vs Observable: Synchronous instruments (Counter, Histogram) are recorded at the time of the event — "I just processed a request." Observable instruments are polled on each collection interval via a callback — "how many goroutines exist right now?" Use Observable when you're reading system state, not when you're counting events.
💻
Metrics API Usage ● Metric
// creating meters and instruments in Python and JS
python · metrics.py
import time

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

# 1. Configure MeterProvider (resource is the Resource built at startup)
reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="http://collector:4317"),
    export_interval_millis=60_000,  # export every 60s
)
provider = MeterProvider(resource=resource, metric_readers=[reader])
metrics.set_meter_provider(provider)

# 2. Get a meter, once per library/component
meter = metrics.get_meter("order_service", "2.4.1")

# 3. Create instruments at module load time, not per request
order_counter = meter.create_counter(
    "orders.placed",
    unit="{orders}",
    description="Total number of orders placed",
)
latency_histogram = meter.create_histogram(
    "orders.processing.duration",
    unit="s",
    description="Time to process an order end-to-end",
)
active_orders = meter.create_up_down_counter(
    "orders.in_flight",
    unit="{orders}",
    description="Orders currently being processed",
)

# 4. Record measurements with low-cardinality attributes only
def process_order(order):
    base_attrs = {"order.type": order.type, "region": order.region}
    active_orders.add(1, base_attrs)
    start = time.monotonic()
    outcome = "error"
    try:
        result = do_process(order)
        outcome = "success"
        return result
    finally:
        attrs = {**base_attrs, "outcome": outcome}
        latency_histogram.record(time.monotonic() - start, attrs)
        order_counter.add(1, attrs)
        # Use the same attributes as the +1 above so the series returns to zero
        active_orders.add(-1, base_attrs)

# 5. Observable gauge: reads system state via callback
def observe_queue_depth(options):
    yield metrics.Observation(get_queue_depth(), {"queue": "orders"})

meter.create_observable_gauge(
    "orders.queue.depth",
    callbacks=[observe_queue_depth],
    unit="{items}",
)
📐
Naming convention: Use dot-separated hierarchical names: http.server.request.duration, db.client.connection.count. Follow the OTel semantic conventions naming schema. Unit goes in the unit field, not the name ("ms", "s", "By", "{requests}").
📋
Log Signals ● Log
// structured logs with trace correlation
OTel Logs are the newest of the three signals (they stabilized after traces and metrics). The OTel Logs API does not replace your logging library. It bridges your existing logger (loguru, structlog, winston, serilog) into the OTel pipeline, injecting trace_id and span_id automatically.
OTel Log Record
A structured event with a timestamp, severity, body (message), and a map of attributes. Plus trace_id and span_id when emitted in a trace context.
Log Bridge API
Connects your existing logging library to the OTel LoggerProvider. The bridge reads log records from the library and translates them into OTel log records — no rewrite required.
Severity levels mapping
| OTel Severity | Number | Syslog | Log4j | Python logging |
|---|---|---|---|---|
| TRACE | 1–4 | - | TRACE | - |
| DEBUG | 5–8 | debug | DEBUG | DEBUG (10) |
| INFO | 9–12 | info | INFO | INFO (20) |
| WARN | 13–16 | warning | WARN | WARNING (30) |
| ERROR | 17–20 | error | ERROR | ERROR (40) |
| FATAL | 21–24 | crit/alert/emerg | FATAL | CRITICAL (50) |
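The Python column of the mapping can be written as a small lookup. This is a sketch that follows the table and returns the first number of each OTel range; `otel_severity` is a hypothetical helper, and treating Python levels below DEBUG as TRACE is an assumption of this example, since the standard library defines no TRACE level.

```python
import logging

# (minimum Python level, OTel severity number, OTel severity text), highest first
_THRESHOLDS = [
    (logging.CRITICAL, 21, "FATAL"),
    (logging.ERROR, 17, "ERROR"),
    (logging.WARNING, 13, "WARN"),
    (logging.INFO, 9, "INFO"),
    (logging.DEBUG, 5, "DEBUG"),
    (1, 1, "TRACE"),  # assumption: custom levels below DEBUG count as TRACE
]


def otel_severity(levelno: int):
    """Map a Python logging level number to (severity_number, severity_text)."""
    for threshold, number, text in _THRESHOLDS:
        if levelno >= threshold:
            return number, text
    return 0, "UNSPECIFIED"  # logging.NOTSET


for level in (logging.DEBUG, logging.WARNING, logging.CRITICAL):
    print(logging.getLevelName(level), otel_severity(level))
```

Custom levels that fall between two standard ones (for example 35) map to the lower neighbour, which keeps the ordering of severities intact.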
💻
Logs API Usage ● Log
// bridging existing loggers · trace correlation
python · logging_setup.py
# pip install opentelemetry-sdk opentelemetry-exporter-otlp opentelemetry-instrumentation-logging
import logging

from opentelemetry._logs import set_logger_provider
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry.exporter.otlp.proto.grpc._log_exporter import OTLPLogExporter

# 1. Configure the OTel LoggerProvider (resource is the Resource built at startup)
log_provider = LoggerProvider(resource=resource)
log_provider.add_log_record_processor(
    BatchLogRecordProcessor(OTLPLogExporter(endpoint="http://collector:4317"))
)
set_logger_provider(log_provider)

# 2. Bridge Python's stdlib logging → OTel
handler = LoggingHandler(level=logging.NOTSET, logger_provider=log_provider)
logging.getLogger().addHandler(handler)
logging.getLogger().setLevel(logging.INFO)  # the root logger defaults to WARNING and would drop info() calls

# 3. Use your logger normally: trace_id is injected automatically
logger = logging.getLogger("order_service.checkout")

def checkout(cart_id: str):
    with tracer.start_as_current_span("checkout"):
        # This log will automatically have trace_id + span_id attached
        logger.info("Starting checkout", extra={"cart.id": cart_id})
        try:
            result = process(cart_id)
            logger.info("Checkout complete", extra={"order.id": result.id})
        except Exception as e:
            logger.error("Checkout failed", exc_info=e,
                         extra={"cart.id": cart_id})

# Anti-patterns to avoid
# String interpolation: can't query by field
logger.info(
    f"Order {order.id} placed by {user.email}"
)
# PII in logs (GDPR violation!)
logger.debug(f"User card: {card_number}")
# Swallowing the exception object
try:
    process(cart_id)
except Exception as e:
    logger.error(str(e))  # loses stack trace
🔄
OTel Collector
// receive · process · export — the telemetry control plane
The OTel Collector is a standalone binary (or Docker image) that receives telemetry from your services, processes it, and exports it to backends. It's the recommended deployment pattern for production — your services only talk to a local Collector, never directly to a backend.
otelcol
Minimal build: only OTel-maintained components. Recommended for most.
otelcol-contrib
Full distribution — all community components. Good for evaluating; large binary.
OCB (OTel Collector Builder)
Production — build a custom binary with only the components you need. Minimizes attack surface and binary size.
Vendor distributions
Grafana Agent, Elastic Agent, Datadog Agent — vendor-managed Collectors with extra integrations.
📤
Exporters
// how telemetry leaves the SDK and reaches backends
| Exporter | Protocol | Use for |
|---|---|---|
| OTLP/gRPC | Protobuf over gRPC | Default. Best performance. Use in production. |
| OTLP/HTTP | Protobuf or JSON over HTTP | Environments where gRPC is blocked (some firewalls, browsers). |
| Jaeger | Jaeger Thrift / gRPC | Direct to Jaeger without Collector (legacy pattern). |
| Zipkin | Zipkin JSON | Legacy Zipkin backends. |
| Prometheus | Prometheus text/pull | Expose metrics on /metrics for Prometheus scraping. |
| Console | Stdout JSON | Development / debugging only. Never in production. |
| In-memory | In-process | Testing: assertions on exported spans/metrics in unit tests. |
✅
Always use OTLP + Collector in production. Export to localhost:4317 (your Collector sidecar). Never hardcode backend-specific exporters in application code — that recreates vendor lock-in at the SDK level. The Collector handles routing.
Environment variables — configure without code changes
bash · .env / Kubernetes env
# Service identity (strongly recommended; service.name falls back to "unknown_service" if unset)
OTEL_SERVICE_NAME="order-service"

# Collector endpoint
# all signals
OTEL_EXPORTER_OTLP_ENDPOINT="http://collector:4317"
# traces only (overrides the general endpoint for traces)
OTEL_EXPORTER_OTLP_TRACES_ENDPOINT="http://collector:4317"
# or http/protobuf
OTEL_EXPORTER_OTLP_PROTOCOL="grpc"

# Resource attributes (service.version goes here; there is no dedicated variable for it)
OTEL_RESOURCE_ATTRIBUTES="service.version=2.4.1,deployment.environment=production,k8s.cluster.name=prod-us-east"

# Sampling: sample 10% of root spans
OTEL_TRACES_SAMPLER="parentbased_traceidratio"
OTEL_TRACES_SAMPLER_ARG="0.1"

# Propagation: W3C TraceContext + Baggage
OTEL_PROPAGATORS="tracecontext,baggage"

# Disable specific signals
# disable traces
OTEL_TRACES_EXPORTER="none"
# disable metrics
OTEL_METRICS_EXPORTER="none"
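To show how these variables turn into resource attributes, here is a standard-library sketch of parsing `OTEL_RESOURCE_ATTRIBUTES` and of the rule that `OTEL_SERVICE_NAME` wins over a `service.name` entry in that list. `parse_resource_attributes` and `resolve_service_name` are hypothetical helpers; the SDK does this internally and its exact handling of edge cases may differ.

```python
from urllib.parse import unquote


def parse_resource_attributes(raw: str) -> dict:
    """Parse 'key1=value1,key2=value2' into a dict, skipping malformed pairs."""
    attrs = {}
    for pair in raw.split(","):
        key, sep, value = pair.partition("=")
        if not sep or not key.strip():
            continue
        attrs[key.strip()] = unquote(value.strip())  # values may be percent-encoded
    return attrs


def resolve_service_name(env: dict) -> str:
    """OTEL_SERVICE_NAME takes precedence over service.name in OTEL_RESOURCE_ATTRIBUTES."""
    if env.get("OTEL_SERVICE_NAME"):
        return env["OTEL_SERVICE_NAME"]
    attrs = parse_resource_attributes(env.get("OTEL_RESOURCE_ATTRIBUTES", ""))
    return attrs.get("service.name", "unknown_service")


env = {
    "OTEL_SERVICE_NAME": "order-service",
    "OTEL_RESOURCE_ATTRIBUTES": "deployment.environment=production,k8s.cluster.name=prod-us-east",
}
print(resolve_service_name(env))
print(parse_resource_attributes(env["OTEL_RESOURCE_ATTRIBUTES"]))
```

Passing a plain dict instead of reading `os.environ` directly keeps the helpers easy to test; in a real process you would pass `os.environ`.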
Sampling strategies
| Sampler | Behavior | Use for |
|---|---|---|
| always_on | Sample 100% of traces | Development, low-traffic services |
| always_off | Drop all traces | Disabling tracing temporarily |
| traceidratio | Sample N% of all traces regardless of parent | Rarely useful: ignores parent sampling decision |
| parentbased_always_on | Respect parent sampling; always_on for roots | Dev/test environments |
| parentbased_traceidratio | Respect parent sampling; ratio for roots | Production, the standard choice |
| Tail sampling (Collector) | Sample after seeing the full trace | Keep all errors and slow traces, sample the fast, healthy ones |
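The head samplers in this table can be sketched in a few lines. Ratio samplers generally compare part of the trace ID against a threshold, so every service reaches the same decision for the same trace without coordination. The code below is an illustrative approximation (which bits are compared varies by SDK), and `trace_id_ratio_sampled` and `parent_based` are hypothetical names.

```python
import random

_MASK_64 = 0xFFFFFFFFFFFFFFFF


def trace_id_ratio_sampled(trace_id: int, ratio: float) -> bool:
    """Deterministic decision: keep the trace if its low 64 bits fall under the threshold."""
    bound = round(ratio * (1 << 64))
    return (trace_id & _MASK_64) < bound


def parent_based(trace_id: int, ratio: float, parent_sampled=None) -> bool:
    """Follow the parent's decision when there is one; apply the ratio only to root spans."""
    if parent_sampled is not None:
        return parent_sampled
    return trace_id_ratio_sampled(trace_id, ratio)


# Rough check that a 0.1 ratio keeps about 10% of random trace IDs.
rng = random.Random(42)
trace_ids = [rng.getrandbits(128) for _ in range(10_000)]
kept = sum(trace_id_ratio_sampled(t, 0.1) for t in trace_ids)
print(f"kept {kept} of {len(trace_ids)} traces")
```

The parent-based wrapper is what keeps traces whole: once the root service decides, every downstream service inherits that decision instead of rolling its own dice.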
🎯
Tail sampling is the most powerful strategy: the Collector waits for the entire trace to arrive (configurable buffer window, e.g. 10s), then decides whether to keep or drop it based on the full picture — was there an error? Was it slow? Did it touch a specific service? This is impossible with head sampling (which decides at the start of a trace).
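A tail-sampling decision can be pictured as a policy function that runs once the buffered trace is complete. The sketch below is a simplified illustration, not the Collector's tail sampling processor; `SpanSummary`, `keep_trace`, and the thresholds are assumptions made for this example.

```python
import random
from dataclasses import dataclass


@dataclass
class SpanSummary:
    service: str
    duration_ms: float
    is_error: bool


def keep_trace(spans, slow_ms=1000.0, baseline_ratio=0.05, rng=random.random) -> bool:
    """Decide on a complete trace: keep errors and slow traces, sample the rest."""
    if not spans:
        return False
    if any(s.is_error for s in spans):
        return True  # always keep traces that contain an error
    if max(s.duration_ms for s in spans) >= slow_ms:
        return True  # always keep slow traces
    return rng() < baseline_ratio  # keep a small baseline of healthy traces


healthy = [SpanSummary("checkout", 80, False), SpanSummary("inventory", 20, False)]
failed = healthy + [SpanSummary("payments", 35, True)]
slow = [SpanSummary("checkout", 2400, False)]

print(keep_trace(failed), keep_trace(slow))
```

The policy needs every span of the trace in one place, which is why tail sampling is done in the Collector rather than in each service, and why the buffer window costs memory.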
📐
Semantic Conventions
// standard attribute names — the OTel vocabulary
Semantic conventions define standardized attribute names for common operations. Using them means your telemetry works out-of-the-box with dashboards, queries, and alerting rules built for OTel. Deviating from them means building everything from scratch.
Stability: Semantic conventions have stability markers: stable, experimental, and deprecated. The HTTP conventions were declared stable in late 2023; other areas, including databases, follow their own schedules, so check the status of each group you rely on. Check opentelemetry.io/docs/specs/semconv before naming custom attributes, since the convention you want may already exist.
🎯
When to Use What
// choosing the right signal for the right question
| Question / Use Case | Trace | Metric | Log |
|---|---|---|---|
| Why did this specific request fail? | PRIMARY | hints | details |
| What is my p99 latency over the last 24h? | too costly | PRIMARY | no |
| How many orders per second are we processing? | no | PRIMARY | no |
| Which microservice is the bottleneck in my request path? | PRIMARY | confirm | no |
| Did the background job succeed for tenant X? | if traced | no | PRIMARY |
| Alert when error rate exceeds 1% | no | PRIMARY | some tools |
| What SQL queries ran during a slow request? | PRIMARY | no | verbose mode |
| How many users triggered feature flag X today? | no | PRIMARY | if logged |
| Audit trail: who deleted record 12345? | no | no | PRIMARY |
| Is this a new deployment causing elevated errors? | sample traces | PRIMARY | confirm |
🔁
The observability loop: Alerts fire on metrics. You jump to traces to find the specific bad request. You read logs to understand exactly what happened inside it. Each signal leads you to the next. This is why correlation (shared trace_id) between all three is the real value of OTel.
🏗️
Real-World Patterns
// production-grade instrumentation patterns
Pattern 1 — Auto-instrumentation first
Before writing a single line of manual instrumentation, install the auto-instrumentation package for your framework. It instruments HTTP calls, database queries, messaging, and more for the libraries it supports, with zero code changes. Add manual spans only for business logic that auto-instrumentation can't see.
Pattern 3 — Correlation: linking metrics to traces
python · exemplar_demo.py
import time

from opentelemetry import trace, metrics

# With exemplar support enabled in the SDK, a measurement recorded while a
# sampled span is active can carry that span's trace_id as an exemplar.
# meter and tracer are the ones created during SDK setup.
request_duration = meter.create_histogram(
    "http.server.request.duration",
    unit="s",
)

def handle_request(path: str):
    with tracer.start_as_current_span("handle_request"):
        start = time.monotonic()
        response = route(path)
        # Recorded inside the span, so the exemplar points at the current trace.
        # http.route must be the low-cardinality route template, not the raw path
        # (response.route_template is a hypothetical field for illustration).
        request_duration.record(
            time.monotonic() - start,
            {"http.route": response.route_template, "http.response.status_code": response.status},
        )
        return response

# In Grafana: when you see a p99 spike, click the exemplar
# to jump directly to the corresponding trace in Tempo/Jaeger.
Pattern 4 — Testing your instrumentation
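python · test_instrumentation.py
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter
from opentelemetry.trace import StatusCode

import app  # module under test; assumes it exposes a module-level `tracer` and `checkout`

def test_checkout_creates_span_with_cart_id(monkeypatch):  # monkeypatch is a pytest fixture
    # Use in-memory exporter: no real backend needed
    exporter = InMemorySpanExporter()
    provider = TracerProvider()
    provider.add_span_processor(SimpleSpanProcessor(exporter))
    # Swap the module-level tracer for one backed by the test provider
    monkeypatch.setattr(app, "tracer", provider.get_tracer("test"))

    app.checkout(cart_id="cart-123")

    # checkout() may create child spans too, so look the span up by name
    spans = {s.name: s for s in exporter.get_finished_spans()}
    span = spans["checkout"]
    assert span.attributes["cart.id"] == "cart-123"
    # Status stays UNSET unless the code sets it explicitly
    assert span.status.status_code == StatusCode.UNSET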
🖥️
Backends & Tools
// where your telemetry lands — OSS and commercial
| Tool | Signals | OSS? | Notes |
|---|---|---|---|
| Jaeger | Traces | ✅ CNCF | The OSS trace standard. Native OTLP support. Excellent for distributed tracing UI. |
| Grafana Tempo | Traces | ✅ OSS | Scalable trace storage, integrates with Grafana. Tag-based search, exemplar linking with Prometheus. |
| Prometheus | Metrics | ✅ CNCF | The OSS metrics standard. Pull-based. Use with OTel Prometheus exporter or remote write from Collector. |
| Grafana Mimir | Metrics | ✅ OSS | Scalable Prometheus-compatible metrics backend. HA, long-term storage, multi-tenant. |
| Grafana Loki | Logs | ✅ OSS | Log aggregation designed for Grafana. Label-based indexing, LogQL query language. |
| Elastic (APM) | Traces + Metrics + Logs | ⚠️ Source-avail | Full OTel support via OTLP. Powerful search. Free self-hosted tier. |
Tightly integrated with AWS. OTel support via ADOT (AWS Distro for OTel).
📚
Reference
// quick rules, links, and the most common mistakes
The ten most common OTel mistakes
MISTAKE
High-cardinality metric attributes
Adding user_id or request_id as metric attributes. This creates millions of time series and crashes backends. Metrics attributes must be low-cardinality (<100 values).
MISTAKE
Forgetting to end spans
Spans that are never ended leak memory and are never exported. Always use try/finally, context managers, or the SDK's callback-style API.
MISTAKE
High-cardinality span names
"GET /users/12345" instead of "GET /users/{id}". Span names are used for aggregation — variable data goes in attributes, not names.
MISTAKE
Library code installing the SDK
Libraries that register a global TracerProvider (for example by calling trace.set_tracer_provider() in Python) can break every application using them. Libraries use the API only. Applications own SDK configuration.
MISTAKE
Exporting directly to backend from app
Hard-coding backend-specific exporters in application code recreates vendor lock-in. Export to the OTel Collector via OTLP. Let the Collector route to backends.
MISTAKE
Missing resource attributes
Not setting service.name, service.version, deployment.environment. Without these, you can't filter telemetry by service in any backend UI.
MISTAKE
100% sampling in production
Sampling every trace from a high-traffic service is prohibitively expensive. Use parentbased_traceidratio at 10% or tail sampling in the Collector to keep errors.
MISTAKE
PII in telemetry
Logging user emails, card numbers, or passwords in span attributes or log records. Telemetry backends are not GDPR-compliant stores for personal data. Scrub PII at the Collector or before recording.
MISTAKE
Using Console exporter in production
The console exporter (ConsoleSpanExporter in Python) is for development only. In production it floods stdout with JSON, and nothing reaches a backend where it can be queried. Use OTLP.
MISTAKE
Ignoring propagation
Not injecting traceparent headers into outgoing HTTP calls or message queue payloads. Without this, traces are broken — every service looks like a standalone request.
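For the PII mistake in particular, a last line of defence is to scrub attributes before they are recorded. The sketch below is a minimal standard-library illustration, assuming a key-name denylist and an email pattern are enough for your data; `scrub`, `SENSITIVE_KEYS`, and `EMAIL` are hypothetical names, and a Collector-side processor is usually the more robust place for this.

```python
import re

# Assumption: attribute keys containing these words hold sensitive values.
SENSITIVE_KEYS = re.compile(r"(password|secret|token|card|email)", re.IGNORECASE)
# Deliberately simple email pattern, good enough for redaction purposes.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def scrub(attributes: dict) -> dict:
    """Return a copy of the attributes with sensitive values redacted."""
    clean = {}
    for key, value in attributes.items():
        if SENSITIVE_KEYS.search(key):
            clean[key] = "[REDACTED]"  # drop the whole value for sensitive keys
        elif isinstance(value, str):
            clean[key] = EMAIL.sub("[EMAIL]", value)  # mask emails inside free text
        else:
            clean[key] = value
    return clean


raw = {
    "user.email": "a@b.com",
    "order.id": 42,
    "note": "contact bob@example.com now",
    "card_number": "4111",
}
print(scrub(raw))
```

Calling such a helper at the single point where attributes are set keeps the policy in one place instead of relying on every call site to remember it.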
OTel is infrastructure, not a feature. The best-instrumented services are the ones where you can jump from an alert to the root cause in under 60 seconds. OTel gives you the plumbing; good signal naming, sampling, and correlation give you the power. The investment pays for itself the first time you diagnose an incident in production with a full trace instead of grep.