When a request touches five services before returning, working out why it’s slow becomes a nightmare. Distributed tracing connects the dots, showing you exactly where the time goes. Here’s how to implement it properly with OpenTelemetry and Jaeger.

Why Distributed Tracing?

Logs tell you what happened. Metrics tell you what’s happening. Traces tell you why it’s slow.

Without tracing:
User: "The API is slow"
You: *checks logs of 12 services* 🤷

With tracing:
User: "The API is slow"
You: *opens trace* "The database query in user-service takes 2.3s"

Tracing Concepts

Concept        Description
Trace          The end-to-end journey of a request
Span           A single operation within a trace
Context        Baggage and trace identifiers that propagate across services
Parent/Child   Spans form a tree structure
Trace: user-request-abc123
│
├── Span: api-gateway (12ms)
│   └── Span: auth-check (3ms)
│
├── Span: user-service (45ms)
│   ├── Span: db-query (38ms)    ← Problem!
│   └── Span: cache-lookup (2ms)
│
└── Span: notification-service (8ms)
    └── Span: send-email (6ms)

Setting Up Jaeger

Docker Compose (Development)

# docker-compose.yml
version: "3.9"

services:
  jaeger:
    image: jaegertracing/all-in-one:1.53
    environment:
      - COLLECTOR_OTLP_ENABLED=true
    ports:
      - "16686:16686"   # Jaeger UI
      - "4317:4317"     # OTLP gRPC
      - "4318:4318"     # OTLP HTTP
      - "14268:14268"   # Jaeger HTTP

Kubernetes (Production)

# jaeger-operator installation
# kubectl create namespace observability
# kubectl apply -f https://github.com/jaegertracing/jaeger-operator/releases/download/v1.53.0/jaeger-operator.yaml -n observability

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
  namespace: observability
spec:
  strategy: production
  
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: https://elasticsearch:9200
        index-prefix: jaeger
        tls:
          ca: /es/certificates/ca.crt
    secretName: jaeger-es-secret
  
  collector:
    replicas: 2
    resources:
      limits:
        cpu: 500m
        memory: 512Mi
  
  query:
    replicas: 2
  
  ingress:
    enabled: true
    hosts:
      - jaeger.example.com

OpenTelemetry Setup

Node.js

npm install @opentelemetry/api \
  @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-grpc \
  @opentelemetry/resources \
  @opentelemetry/semantic-conventions

// tracing.js - Load BEFORE your application code
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'user-service',
    [SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0',
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV,
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4317',
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': {
        // Recent versions replaced ignoreIncomingPaths with a request hook
        ignoreIncomingRequestHook: (req) => ['/health', '/ready'].includes(req.url),
      },
      '@opentelemetry/instrumentation-express': {},
      '@opentelemetry/instrumentation-pg': {},
      '@opentelemetry/instrumentation-redis': {},
    }),
  ],
});

sdk.start();

process.on('SIGTERM', () => {
  sdk.shutdown()
    .then(() => console.log('Tracing terminated'))
    .catch((error) => console.log('Error terminating tracing', error))
    .finally(() => process.exit(0));
});

// app.js
require('./tracing'); // Must be first!

const express = require('express');
const { trace, SpanStatusCode } = require('@opentelemetry/api');

const app = express();
const tracer = trace.getTracer('user-service');

app.get('/users/:id', async (req, res) => {
  // Auto-instrumented: Express, HTTP client, database
  
  // Custom span for business logic
  const span = tracer.startSpan('process-user-data');
  try {
    span.setAttribute('user.id', req.params.id);
    
    const user = await fetchUser(req.params.id);
    const enrichedUser = await enrichUserData(user);
    
    span.setStatus({ code: SpanStatusCode.OK });
    res.json(enrichedUser);
  } catch (error) {
    span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
    span.recordException(error);
    res.status(500).json({ error: error.message });
  } finally {
    span.end();
  }
});

Python

pip install opentelemetry-api \
  opentelemetry-sdk \
  opentelemetry-exporter-otlp \
  opentelemetry-instrumentation-flask \
  opentelemetry-instrumentation-requests \
  opentelemetry-instrumentation-sqlalchemy

# tracing.py
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
import os

def init_tracing(app, db_engine):
    resource = Resource.create({
        "service.name": "order-service",
        "service.version": "1.0.0",
        "deployment.environment": os.getenv("ENVIRONMENT", "development"),
    })

    provider = TracerProvider(resource=resource)
    
    exporter = OTLPSpanExporter(
        endpoint=os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT", "localhost:4317"),
        insecure=True,
    )
    
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)

    # Auto-instrument
    FlaskInstrumentor().instrument_app(app)
    RequestsInstrumentor().instrument()
    SQLAlchemyInstrumentor().instrument(engine=db_engine)

    return trace.get_tracer("order-service")

# app.py
from flask import Flask, request, jsonify
from tracing import init_tracing
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

app = Flask(__name__)
# Assumes a SQLAlchemy setup (e.g. Flask-SQLAlchemy's `db`) defined elsewhere
tracer = init_tracing(app, db.engine)

@app.route('/orders', methods=['POST'])
def create_order():
    with tracer.start_as_current_span("create-order") as span:
        span.set_attribute("order.items_count", len(request.json.get('items', [])))
        
        try:
            # Validate inventory
            with tracer.start_as_current_span("validate-inventory"):
                validate_inventory(request.json['items'])
            
            # Process payment
            with tracer.start_as_current_span("process-payment") as payment_span:
                payment_span.set_attribute("payment.method", request.json['payment_method'])
                process_payment(request.json)
            
            # Create order record
            with tracer.start_as_current_span("save-order"):
                order = save_order(request.json)
            
            span.set_status(Status(StatusCode.OK))
            return jsonify(order), 201
            
        except Exception as e:
            span.set_status(Status(StatusCode.ERROR, str(e)))
            span.record_exception(e)
            return jsonify({"error": str(e)}), 500

Go

// tracing.go
package main

import (
    "context"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/propagation"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
)

func initTracer(ctx context.Context) (*sdktrace.TracerProvider, error) {
    conn, err := grpc.DialContext(ctx, "localhost:4317",
        grpc.WithTransportCredentials(insecure.NewCredentials()),
        grpc.WithBlock(),
    )
    if err != nil {
        return nil, err
    }

    exporter, err := otlptracegrpc.New(ctx, otlptracegrpc.WithGRPCConn(conn))
    if err != nil {
        return nil, err
    }

    res, err := resource.New(ctx,
        resource.WithAttributes(
            semconv.ServiceName("inventory-service"),
            semconv.ServiceVersion("1.0.0"),
        ),
    )
    if err != nil {
        return nil, err
    }

    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(res),
        sdktrace.WithSampler(sdktrace.AlwaysSample()),
    )

    otel.SetTracerProvider(tp)
    otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
        propagation.TraceContext{},
        propagation.Baggage{},
    ))

    return tp, nil
}

// main.go
package main

import (
    "context"
    "encoding/json" // needed for json.NewEncoder below
    "log"
    "net/http"

    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"
)

var tracer = otel.Tracer("inventory-service")

func main() {
    ctx := context.Background()
    tp, err := initTracer(ctx)
    if err != nil {
        log.Fatalf("failed to initialize tracer: %v", err)
    }
    defer tp.Shutdown(ctx)

    mux := http.NewServeMux()
    mux.HandleFunc("/inventory/check", checkInventory)

    // Wrap with OpenTelemetry middleware
    handler := otelhttp.NewHandler(mux, "inventory-service")
    log.Fatal(http.ListenAndServe(":8080", handler))
}

func checkInventory(w http.ResponseWriter, r *http.Request) {
    ctx, span := tracer.Start(r.Context(), "check-inventory")
    defer span.End()

    productID := r.URL.Query().Get("product_id")
    span.SetAttributes(attribute.String("product.id", productID))

    // Database lookup (db.GetInventory stands in for your data-access layer)
    ctx, dbSpan := tracer.Start(ctx, "db-query")
    inventory, err := db.GetInventory(ctx, productID)
    if err != nil {
        dbSpan.SetStatus(codes.Error, err.Error())
        dbSpan.RecordError(err)
        dbSpan.End()
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    dbSpan.End()

    // Response
    span.SetAttributes(attribute.Int("inventory.quantity", inventory.Quantity))
    json.NewEncoder(w).Encode(inventory)
}

Context Propagation

Traces work across services because context propagates via HTTP headers:

# W3C Trace Context (standard)
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: congo=t61rcWkgMzE

# Jaeger format (legacy)
uber-trace-id: 0af7651916cd43dd8448eb211c80319c:b7ad6b7169203331:0:1
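
The traceparent value is four dash-separated fields. A small sketch of pulling it apart (illustrative only):

// Anatomy of a W3C traceparent header: version-traceId-spanId-traceFlags
function parseTraceparent(header) {
  const [version, traceId, spanId, flags] = header.split('-');
  return {
    version,                                      // "00": current spec version
    traceId,                                      // 16 bytes, hex (32 chars)
    spanId,                                       // 8 bytes, hex (16 chars)
    sampled: (parseInt(flags, 16) & 0x01) === 1,  // sampled bit of trace-flags
  };
}

parseTraceparent('00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01');
// → { version: '00', traceId: '0af7651916cd43dd8448eb211c80319c', spanId: 'b7ad6b7169203331', sampled: true }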

Cross-Service Calls

// Service A: Makes HTTP call to Service B
const { context, propagation } = require('@opentelemetry/api');

async function callServiceB(userId) {
  const headers = {};
  
  // Inject trace context into headers
  propagation.inject(context.active(), headers);
  
  const response = await fetch('http://service-b/users/' + userId, {
    headers: headers,  // Contains traceparent header
  });
  
  return response.json();
}
// Service B: Extracts context automatically (with auto-instrumentation)
// The incoming traceparent header creates a child span
app.get('/users/:id', async (req, res) => {
  // This span is automatically a child of Service A's span
  const user = await db.query('SELECT * FROM users WHERE id = $1', [req.params.id]);
  res.json(user);
});
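
Auto-instrumentation takes care of HTTP. For transports it does not cover (a hand-rolled message queue, for example), you can extract the context manually. A sketch, with message.headers as a hypothetical carrier:

// Service B alternative: manual extraction for uninstrumented transports.
// `message.headers` is a hypothetical carrier; any string map works.
const { context, propagation, trace } = require('@opentelemetry/api');

function handleQueueMessage(message) {
  // Rebuild the remote context from the carrier's traceparent/tracestate
  const parentContext = propagation.extract(context.active(), message.headers);

  const tracer = trace.getTracer('worker-service');
  const span = tracer.startSpan('process-message', undefined, parentContext);
  try {
    // ...business logic, now linked to the producer's trace
  } finally {
    span.end();
  }
}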

Adding Custom Attributes

const { trace } = require('@opentelemetry/api');

function processOrder(order) {
  const span = trace.getActiveSpan();
  if (!span) return; // guard: no active span outside an instrumented request

  // Add business context to the span
  span.setAttributes({
    'order.id': order.id,
    'order.total': order.total,
    'order.items_count': order.items.length,
    'customer.id': order.customerId,
    'customer.tier': order.customer.tier,
  });
  
  // Add events for key moments
  span.addEvent('payment_processed', {
    'payment.method': order.paymentMethod,
    'payment.amount': order.total,
  });
  
  span.addEvent('order_confirmed', {
    'confirmation.id': generateConfirmationId(),
  });
}

Sampling Strategies

Don’t trace everything in production:

// tracing.js
const { ParentBasedSampler, TraceIdRatioBasedSampler } = require('@opentelemetry/sdk-trace-base');

const sdk = new NodeSDK({
  // Sample 10% of traces
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1),
  }),
  // ... rest of config
});

// Or derive the ratio from the environment:
const sampleRate = process.env.NODE_ENV === 'production' ? 0.01 : 1.0;
// ...and pass it in: new TraceIdRatioBasedSampler(sampleRate)

Always Sample Errors

const { SamplingDecision } = require('@opentelemetry/sdk-trace-base');

// Caveat: shouldSample runs when a span is *created*, so attributes recorded
// later (like the final http.status_code) are usually not visible here.
// Dependable error-based sampling is typically done tail-based in the
// OpenTelemetry Collector instead.
class ErrorAwareSampler {
  shouldSample(context, traceId, spanName, spanKind, attributes) {
    // Always sample spans already flagged as errors at creation time
    if (attributes['error'] || attributes['http.status_code'] >= 500) {
      return { decision: SamplingDecision.RECORD_AND_SAMPLED };
    }

    // Sample 1% of normal requests
    return {
      decision: Math.random() < 0.01
        ? SamplingDecision.RECORD_AND_SAMPLED
        : SamplingDecision.NOT_RECORD,
    };
  }

  toString() {
    return 'ErrorAwareSampler';
  }
}
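
To use it, hand the sampler to the SDK just like the ratio sampler above, wrapped in ParentBasedSampler so child spans follow the root's decision:

// Wiring the custom sampler into the same NodeSDK config as above
const { ParentBasedSampler } = require('@opentelemetry/sdk-trace-base');

const sdk = new NodeSDK({
  sampler: new ParentBasedSampler({ root: new ErrorAwareSampler() }),
  // ... rest of config
});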

Kubernetes Configuration

OpenTelemetry Collector

# otel-collector.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318

    processors:
      batch:
        timeout: 10s
        send_batch_size: 1024
      
      memory_limiter:
        check_interval: 1s
        limit_mib: 1000
        spike_limit_mib: 200
      
      # Add Kubernetes metadata
      k8sattributes:
        auth_type: serviceAccount
        extract:
          metadata:
            - k8s.pod.name
            - k8s.namespace.name
            - k8s.deployment.name

    exporters:
      # The dedicated `jaeger` exporter was removed from recent Collector
      # releases; Jaeger ingests OTLP natively, so send OTLP to it instead
      otlp/jaeger:
        endpoint: jaeger-collector:4317
        tls:
          insecure: true

      # To also derive metrics from spans, add a metrics pipeline
      # (e.g. via the spanmetrics connector) that uses this exporter
      prometheusremotewrite:
        endpoint: http://prometheus:9090/api/v1/write

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, k8sattributes, batch]
          exporters: [otlp/jaeger]
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
spec:
  replicas: 2
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: collector
          image: otel/opentelemetry-collector-contrib:0.91.0
          args:
            - --config=/etc/otel/config.yaml
          volumeMounts:
            - name: config
              mountPath: /etc/otel
          ports:
            - containerPort: 4317  # OTLP gRPC
            - containerPort: 4318  # OTLP HTTP
      volumes:
        - name: config
          configMap:
            name: otel-collector-config
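
The application manifests below send OTLP to otel-collector:4317, which assumes a Service in front of the collector Deployment, along these lines:

---
apiVersion: v1
kind: Service
metadata:
  name: otel-collector
spec:
  selector:
    app: otel-collector
  ports:
    - name: otlp-grpc
      port: 4317
      targetPort: 4317
    - name: otlp-http
      port: 4318
      targetPort: 4318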

Application Configuration

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: user-service
spec:
  selector:
    matchLabels:
      app: user-service
  template:
    metadata:
      labels:
        app: user-service
    spec:
      containers:
        - name: app
          image: user-service:v1
          env:
            - name: OTEL_SERVICE_NAME
              value: user-service
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: http://otel-collector:4317
            - name: OTEL_RESOURCE_ATTRIBUTES
              value: "deployment.environment=production"
            - name: NODE_OPTIONS
              value: "--require ./tracing.js"  # Auto-load tracing

Debugging with Traces

Finding Slow Requests

Jaeger UI search (Find Traces form):
service: order-service
operation: POST /orders
minDuration: 2s
tags: error=true       # optional: narrow to failed requests
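
The UI is backed by an HTTP API that is handy for scripting the same search; it is internal rather than a stable contract, so treat this as a sketch:

// Query Jaeger's (internal) trace-search API; parameters mirror the UI form
const params = new URLSearchParams({
  service: 'order-service',
  operation: 'POST /orders',
  minDuration: '2s',
  tags: JSON.stringify({ error: 'true' }),  // tags are a JSON-encoded map
  limit: '20',
});

const res = await fetch(`http://localhost:16686/api/traces?${params}`);
const { data: traces } = await res.json();
console.log(traces.map((t) => t.traceID));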

Analyzing a Trace

Trace: order-creation-abc123 (Total: 3.2s)

├── api-gateway: 3.2s
│   ├── auth-middleware: 45ms ✓
│   └── route-handler: 3.1s
│       │
│       ├── order-service: 2.8s
│       │   ├── validate-request: 12ms ✓
│       │   ├── check-inventory: 1.8s ⚠️ ← Slow!
│       │   │   └── inventory-service: 1.75s
│       │   │       └── db-query: 1.7s ← Root cause
│       │   └── process-payment: 950ms
│       │       └── payment-gateway: 920ms
│       │
│       └── notification-service: 280ms ✓

Alerting on Traces

# Prometheus alerts on span-derived metrics. Metric names depend on how you
# generate them (e.g. the spanmetrics connector); adjust to your setup.
groups:
  - name: tracing-alerts
    rules:
      - alert: HighLatencyP99
        expr: |
          histogram_quantile(0.99, 
            sum(rate(traces_span_duration_bucket{service="order-service"}[5m])) by (le)
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High P99 latency in order-service"
      
      - alert: HighErrorRate
        expr: |
          sum(rate(traces_span_total{service="order-service",status="ERROR"}[5m])) /
          sum(rate(traces_span_total{service="order-service"}[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% in order-service"

Key Takeaways

  1. Start with auto-instrumentation — covers HTTP, databases, message queues
  2. Add custom spans for business logic — “validate-inventory”, “process-payment”
  3. Include business context — order IDs, customer tiers, amounts
  4. Sample appropriately — 1-10% in production, always sample errors
  5. Use OpenTelemetry Collector — decouple apps from backends
  6. Propagate context — W3C Trace Context is the standard
  7. Connect to metrics and logs — trace IDs in logs enable correlation (sketch below)
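
A minimal log-correlation sketch: stamp the active span's IDs onto every log line so a log entry can be joined to its trace (the JSON logger here is illustrative):

const { trace } = require('@opentelemetry/api');

// Attach the active trace/span IDs to each structured log line
function logWithTrace(message, fields = {}) {
  const span = trace.getActiveSpan();
  const ctx = span && span.spanContext();
  console.log(JSON.stringify({
    message,
    ...fields,
    trace_id: ctx ? ctx.traceId : undefined,
    span_id: ctx ? ctx.spanId : undefined,
  }));
}

logWithTrace('payment processed', { 'order.id': 'abc123' });
// → {"message":"payment processed","order.id":"abc123","trace_id":"0af7...","span_id":"b7ad..."}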

Distributed tracing transforms debugging from “something’s slow somewhere” to “this specific database query in this specific service is slow.” Start small, instrument the critical path, and expand from there.