Distributed Tracing with Jaeger and OpenTelemetry
Implement distributed tracing across microservices using OpenTelemetry and Jaeger to debug latency issues and understand system behavior.
When a request touches five services before returning, figuring out why it's slow becomes a nightmare. Distributed tracing connects the dots, showing you exactly where time is spent. Here's how to implement it properly with OpenTelemetry and Jaeger.
Why Distributed Tracing?
Logs tell you what happened. Metrics tell you what’s happening. Traces tell you why it’s slow.
Without tracing:
User: "The API is slow"
You: *checks logs of 12 services* 🤷
With tracing:
User: "The API is slow"
You: *opens trace* "The database query in user-service takes 2.3s"
Tracing Concepts
| Concept | Description |
|---|---|
| Trace | End-to-end journey of a request |
| Span | Single operation within a trace |
| Context | Trace metadata (trace and span IDs, plus baggage) that propagates across services |
| Parent/Child | Spans form a tree structure |
Trace: user-request-abc123
│
├── Span: api-gateway (12ms)
│ └── Span: auth-check (3ms)
│
├── Span: user-service (45ms)
│ ├── Span: db-query (38ms) ← Problem!
│ └── Span: cache-lookup (2ms)
│
└── Span: notification-service (8ms)
└── Span: send-email (6ms)
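In code, those concepts map directly onto the OpenTelemetry API. A minimal sketch follows (service and operation names are illustrative, and it assumes an SDK like the Node.js setup shown later is already loaded), showing how nested spans produce the parent/child tree above:

// concepts.js: nested spans form a parent/child tree (illustrative sketch)
const { trace } = require('@opentelemetry/api');
const tracer = trace.getTracer('api-gateway');

async function handleRequest() {
  // Root span: one per incoming request, which means one trace
  await tracer.startActiveSpan('api-gateway', async (root) => {
    // startActiveSpan makes the parent current, so this span nests under it
    await tracer.startActiveSpan('auth-check', async (child) => {
      // ... verify the token here ...
      child.end();
    });
    root.end();
  });
}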
Setting Up Jaeger
Docker Compose (Development)
# docker-compose.yml
version: "3.9"
services:
jaeger:
image: jaegertracing/all-in-one:1.53
environment:
- COLLECTOR_OTLP_ENABLED=true
ports:
- "16686:16686" # Jaeger UI
- "4317:4317" # OTLP gRPC
- "4318:4318" # OTLP HTTP
- "14268:14268" # Jaeger HTTP
Kubernetes (Production)
# jaeger-operator installation
# kubectl create namespace observability
# kubectl apply -f https://github.com/jaegertracing/jaeger-operator/releases/download/v1.53.0/jaeger-operator.yaml -n observability
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
name: jaeger
namespace: observability
spec:
strategy: production
storage:
type: elasticsearch
options:
es:
server-urls: https://elasticsearch:9200
index-prefix: jaeger
tls:
ca: /es/certificates/ca.crt
secretName: jaeger-es-secret
collector:
replicas: 2
resources:
limits:
cpu: 500m
memory: 512Mi
query:
replicas: 2
ingress:
enabled: true
hosts:
- jaeger.example.com
OpenTelemetry Setup
Node.js
npm install @opentelemetry/api \
@opentelemetry/sdk-node \
@opentelemetry/auto-instrumentations-node \
@opentelemetry/exporter-trace-otlp-grpc
// tracing.js - Load BEFORE your application code
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');
const sdk = new NodeSDK({
resource: new Resource({
[SemanticResourceAttributes.SERVICE_NAME]: 'user-service',
[SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0',
[SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV,
}),
traceExporter: new OTLPTraceExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4317',
}),
instrumentations: [
getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': {
        // ignoreIncomingPaths was removed in newer versions; use the hook instead
        ignoreIncomingRequestHook: (req) => ['/health', '/ready'].includes(req.url),
      },
'@opentelemetry/instrumentation-express': {},
'@opentelemetry/instrumentation-pg': {},
'@opentelemetry/instrumentation-redis': {},
}),
],
});
sdk.start();
process.on('SIGTERM', () => {
sdk.shutdown()
.then(() => console.log('Tracing terminated'))
.catch((error) => console.log('Error terminating tracing', error))
.finally(() => process.exit(0));
});
// app.js
require('./tracing'); // Must be first!
const express = require('express');
const { trace, SpanStatusCode } = require('@opentelemetry/api');
const app = express();
const tracer = trace.getTracer('user-service');
app.get('/users/:id', async (req, res) => {
// Auto-instrumented: Express, HTTP client, database
// Custom span for business logic
const span = tracer.startSpan('process-user-data');
try {
span.setAttribute('user.id', req.params.id);
const user = await fetchUser(req.params.id);
const enrichedUser = await enrichUserData(user);
span.setStatus({ code: SpanStatusCode.OK });
res.json(enrichedUser);
} catch (error) {
span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
span.recordException(error);
res.status(500).json({ error: error.message });
} finally {
span.end();
}
});
Python
pip install opentelemetry-api \
opentelemetry-sdk \
opentelemetry-exporter-otlp \
opentelemetry-instrumentation-flask \
opentelemetry-instrumentation-requests \
opentelemetry-instrumentation-sqlalchemy
# tracing.py
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
import os
def init_tracing(app, db_engine):
resource = Resource.create({
"service.name": "order-service",
"service.version": "1.0.0",
"deployment.environment": os.getenv("ENVIRONMENT", "development"),
})
provider = TracerProvider(resource=resource)
exporter = OTLPSpanExporter(
endpoint=os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT", "localhost:4317"),
insecure=True,
)
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
# Auto-instrument
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()
SQLAlchemyInstrumentor().instrument(engine=db_engine)
return trace.get_tracer("order-service")
# app.py
from flask import Flask, request, jsonify
from tracing import init_tracing
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode
app = Flask(__name__)
tracer = init_tracing(app, db.engine)  # db is assumed to be your SQLAlchemy instance (e.g. Flask-SQLAlchemy)
@app.route('/orders', methods=['POST'])
def create_order():
with tracer.start_as_current_span("create-order") as span:
span.set_attribute("order.items_count", len(request.json.get('items', [])))
try:
# Validate inventory
with tracer.start_as_current_span("validate-inventory"):
validate_inventory(request.json['items'])
# Process payment
with tracer.start_as_current_span("process-payment") as payment_span:
payment_span.set_attribute("payment.method", request.json['payment_method'])
process_payment(request.json)
# Create order record
with tracer.start_as_current_span("save-order"):
order = save_order(request.json)
span.set_status(Status(StatusCode.OK))
return jsonify(order), 201
except Exception as e:
span.set_status(Status(StatusCode.ERROR, str(e)))
span.record_exception(e)
return jsonify({"error": str(e)}), 500
Go
// tracing.go
package main
import (
"context"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
"go.opentelemetry.io/otel/propagation"
"go.opentelemetry.io/otel/sdk/resource"
sdktrace "go.opentelemetry.io/otel/sdk/trace"
semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
"google.golang.org/grpc"
"google.golang.org/grpc/credentials/insecure"
)
func initTracer(ctx context.Context) (*sdktrace.TracerProvider, error) {
conn, err := grpc.DialContext(ctx, "localhost:4317",
grpc.WithTransportCredentials(insecure.NewCredentials()),
grpc.WithBlock(),
)
if err != nil {
return nil, err
}
exporter, err := otlptracegrpc.New(ctx, otlptracegrpc.WithGRPCConn(conn))
if err != nil {
return nil, err
}
res, err := resource.New(ctx,
resource.WithAttributes(
semconv.ServiceName("inventory-service"),
semconv.ServiceVersion("1.0.0"),
),
)
if err != nil {
return nil, err
}
tp := sdktrace.NewTracerProvider(
sdktrace.WithBatcher(exporter),
sdktrace.WithResource(res),
sdktrace.WithSampler(sdktrace.AlwaysSample()),
)
otel.SetTracerProvider(tp)
otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
propagation.TraceContext{},
propagation.Baggage{},
))
return tp, nil
}
// main.go
package main
import (
    "context"
    "encoding/json"
    "log"
    "net/http"

    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"
)
var tracer = otel.Tracer("inventory-service")
func main() {
ctx := context.Background()
    tp, err := initTracer(ctx)
    if err != nil {
        log.Fatalf("failed to initialize tracer: %v", err)
    }
defer tp.Shutdown(ctx)
mux := http.NewServeMux()
mux.HandleFunc("/inventory/check", checkInventory)
// Wrap with OpenTelemetry middleware
handler := otelhttp.NewHandler(mux, "inventory-service")
http.ListenAndServe(":8080", handler)
}
func checkInventory(w http.ResponseWriter, r *http.Request) {
ctx, span := tracer.Start(r.Context(), "check-inventory")
defer span.End()
productID := r.URL.Query().Get("product_id")
span.SetAttributes(attribute.String("product.id", productID))
    // Database lookup (db is assumed to be your data-access layer)
    ctx, dbSpan := tracer.Start(ctx, "db-query")
    inventory, err := db.GetInventory(ctx, productID)
    if err != nil {
        dbSpan.SetStatus(codes.Error, err.Error())
        dbSpan.RecordError(err)
        dbSpan.End()
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    dbSpan.End()
// Response
span.SetAttributes(attribute.Int("inventory.quantity", inventory.Quantity))
json.NewEncoder(w).Encode(inventory)
}
Context Propagation
Traces work across services because context propagates via HTTP headers:
# W3C Trace Context (standard): version-traceid-parentspanid-traceflags
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: congo=t61rcWkgMzE
# Jaeger format (legacy)
uber-trace-id: 0af7651916cd43dd8448eb211c80319c:b7ad6b7169203331:0:1
Cross-Service Calls
// Service A: Makes HTTP call to Service B
const { context, propagation } = require('@opentelemetry/api');
async function callServiceB(userId) {
const headers = {};
// Inject trace context into headers
propagation.inject(context.active(), headers);
const response = await fetch('http://service-b/users/' + userId, {
headers: headers, // Contains traceparent header
});
return response.json();
}
// Service B: Extracts context automatically (with auto-instrumentation)
// The incoming traceparent header creates a child span
app.get('/users/:id', async (req, res) => {
// This span is automatically a child of Service A's span
const user = await db.query('SELECT * FROM users WHERE id = $1', [req.params.id]);
res.json(user);
});
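Auto-instrumentation covers HTTP, but for transports it does not instrument (a message queue, a custom RPC) you can propagate context manually with the same API. A sketch, where the consumer shape and the message.headers carrier are assumptions:

// Consuming side: manual extraction (sketch; message.headers is assumed to
// carry the traceparent injected by the producer)
const { context, propagation, trace } = require('@opentelemetry/api');
const tracer = trace.getTracer('worker-service');

function onMessage(message) {
  // Rebuild the remote context from the carrier (headers)
  const parentCtx = propagation.extract(context.active(), message.headers);
  // Start a span that becomes a child of the producer's span
  const span = tracer.startSpan('process-message', undefined, parentCtx);
  context.with(trace.setSpan(parentCtx, span), () => {
    try {
      handleMessage(message); // hypothetical business logic
    } finally {
      span.end();
    }
  });
}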
Adding Custom Attributes
const { trace } = require('@opentelemetry/api');
function processOrder(order) {
const span = trace.getActiveSpan();
// Add business context to the span
span.setAttributes({
'order.id': order.id,
'order.total': order.total,
'order.items_count': order.items.length,
'customer.id': order.customerId,
'customer.tier': order.customer.tier,
});
// Add events for key moments
span.addEvent('payment_processed', {
'payment.method': order.paymentMethod,
'payment.amount': order.total,
});
span.addEvent('order_confirmed', {
'confirmation.id': generateConfirmationId(),
});
}
Sampling Strategies
Don’t trace everything in production:
// tracing.js
const { ParentBasedSampler, TraceIdRatioBasedSampler } = require('@opentelemetry/sdk-trace-base');
const sdk = new NodeSDK({
// Sample 10% of traces
sampler: new ParentBasedSampler({
root: new TraceIdRatioBasedSampler(0.1),
}),
// ... rest of config
});
// Or environment-based: feed the rate into the sampler above,
// e.g. root: new TraceIdRatioBasedSampler(sampleRate)
const sampleRate = process.env.NODE_ENV === 'production' ? 0.01 : 1.0;
// The same ratio can also be set without code via the standard env vars:
// OTEL_TRACES_SAMPLER=parentbased_traceidratio, OTEL_TRACES_SAMPLER_ARG=0.1
Always Sample Errors
const { Sampler, SamplingDecision } = require('@opentelemetry/sdk-trace-base');
class ErrorAwareSampler {
shouldSample(context, traceId, spanName, spanKind, attributes) {
// Always sample errors
if (attributes['error'] || attributes['http.status_code'] >= 500) {
return { decision: SamplingDecision.RECORD_AND_SAMPLED };
}
// Sample 1% of normal requests
return {
decision: Math.random() < 0.01
? SamplingDecision.RECORD_AND_SAMPLED
: SamplingDecision.NOT_RECORD,
};
}
}
// Caveat: shouldSample runs when the span starts (head sampling), so attributes
// like http.status_code are usually not set yet. Reliable error-based sampling
// is typically done with tail sampling in the OpenTelemetry Collector instead.
Kubernetes Configuration
OpenTelemetry Collector
# otel-collector.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: otel-collector-config
data:
config.yaml: |
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
batch:
timeout: 10s
send_batch_size: 1024
memory_limiter:
check_interval: 1s
limit_mib: 1000
spike_limit_mib: 200
# Add Kubernetes metadata
k8sattributes:
auth_type: serviceAccount
extract:
metadata:
- k8s.pod.name
- k8s.namespace.name
- k8s.deployment.name
    exporters:
      # Recent collector releases dropped the dedicated jaeger exporter;
      # Jaeger accepts OTLP natively, so export via OTLP instead
      otlp:
        endpoint: jaeger-collector:4317
        tls:
          insecure: true
      # Optional metrics backend (requires its own metrics pipeline)
      prometheusremotewrite:
        endpoint: http://prometheus:9090/api/v1/write
service:
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, k8sattributes, batch]
          exporters: [otlp]
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: otel-collector
spec:
replicas: 2
selector:
matchLabels:
app: otel-collector
template:
metadata:
labels:
app: otel-collector
spec:
containers:
- name: collector
image: otel/opentelemetry-collector-contrib:0.91.0
args:
- --config=/etc/otel/config.yaml
volumeMounts:
- name: config
mountPath: /etc/otel
ports:
- containerPort: 4317 # OTLP gRPC
- containerPort: 4318 # OTLP HTTP
volumes:
- name: config
configMap:
name: otel-collector-config
Application Configuration
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: user-service
spec:
template:
spec:
containers:
- name: app
image: user-service:v1
env:
- name: OTEL_SERVICE_NAME
value: user-service
- name: OTEL_EXPORTER_OTLP_ENDPOINT
value: http://otel-collector:4317
- name: OTEL_RESOURCE_ATTRIBUTES
value: "deployment.environment=production"
- name: NODE_OPTIONS
value: "--require ./tracing.js" # Auto-load tracing
Debugging with Traces
Finding Slow Requests
Jaeger Query:
service: order-service
operation: POST /orders
minDuration: 2s
tags: error=true
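The same search can be scripted against the query service's HTTP API, the unofficial endpoint the UI itself uses; the host, port, and parameters below are assumptions based on that UI API and may change between Jaeger versions:

// find-slow-orders.js: sketch against Jaeger's UI-facing /api/traces endpoint
async function findSlowOrders() {
  const params = new URLSearchParams({
    service: 'order-service',
    operation: 'POST /orders',
    minDuration: '2s',
    tags: JSON.stringify({ error: 'true' }),
    limit: '20',
  });
  const res = await fetch(`http://localhost:16686/api/traces?${params}`);
  const { data } = await res.json();
  for (const t of data) {
    // The root span has no references; Jaeger reports durations in microseconds
    const root = t.spans.find((s) => (s.references || []).length === 0);
    console.log(t.traceID, root?.operationName, `${root?.duration / 1000}ms`);
  }
}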
Analyzing a Trace
Trace: order-creation-abc123 (Total: 3.2s)
│
├── api-gateway: 3.2s
│ ├── auth-middleware: 45ms ✓
│ └── route-handler: 3.1s
│ │
│ ├── order-service: 2.8s
│ │ ├── validate-request: 12ms ✓
│ │ ├── check-inventory: 1.8s ⚠️ ← Slow!
│ │ │ └── inventory-service: 1.75s
│ │ │ └── db-query: 1.7s ← Root cause
│ │ └── process-payment: 950ms
│ │ └── payment-gateway: 920ms
│ │
│ └── notification-service: 280ms ✓
Alerting on Traces
# Prometheus alerts on span-derived metrics (for example from the OTel Collector's
# spanmetrics connector; adjust the metric names below to match your setup)
groups:
- name: tracing-alerts
rules:
- alert: HighLatencyP99
expr: |
histogram_quantile(0.99,
sum(rate(traces_span_duration_bucket{service="order-service"}[5m])) by (le)
) > 2
for: 5m
labels:
severity: warning
annotations:
summary: "High P99 latency in order-service"
- alert: HighErrorRate
expr: |
sum(rate(traces_span_total{service="order-service",status="ERROR"}[5m])) /
sum(rate(traces_span_total{service="order-service"}[5m])) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "Error rate above 5% in order-service"
Key Takeaways
- Start with auto-instrumentation — covers HTTP, databases, message queues
- Add custom spans for business logic — “validate-inventory”, “process-payment”
- Include business context — order IDs, customer tiers, amounts
- Sample appropriately — 1-10% in production, always sample errors
- Use OpenTelemetry Collector — decouple apps from backends
- Propagate context — W3C Trace Context is the standard
- Connect to metrics and logs — trace IDs in logs enable correlation (see the sketch below)
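For that last point, stamping the active trace ID onto log lines is enough to jump from a log entry straight to its trace. A minimal sketch (the logger shape is illustrative; dedicated instrumentations also exist for pino and winston):

// log-correlation.js: attach trace/span IDs to structured log lines
const { trace } = require('@opentelemetry/api');

function logWithTrace(level, message, extra = {}) {
  const span = trace.getActiveSpan();
  const ids = span
    ? { trace_id: span.spanContext().traceId, span_id: span.spanContext().spanId }
    : {};
  console.log(JSON.stringify({ level, message, ...ids, ...extra }));
}

// Usage inside any instrumented request handler:
// logWithTrace('info', 'order created', { order_id: order.id });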
Distributed tracing transforms debugging from “something’s slow somewhere” to “this specific database query in this specific service is slow.” Start small, instrument the critical path, and expand from there.