What is distributed-tracing?

This Claude skill is a guide to implementing distributed tracing in microservices architectures. It covers end-to-end request tracking with OpenTelemetry, Jaeger, and Zipkin, and shows how span design and context propagation expose latency bottlenecks and service dependencies.

When should I use distributed-tracing?

distributed-tracing is useful in the following scenarios:
• Troubleshooting high latency in complex microservices by visualizing the full request journey across service boundaries.
• Implementing standardized OpenTelemetry instrumentation in Node.js and Express applications for consistent observability.
• Designing and deploying production-ready Jaeger instances on Kubernetes with optimized sampling and storage configurations.
• Establishing semantic naming conventions and metadata standards for spans to improve debugging efficiency and log correlation.

name	distributed-tracing
description	Comprehensive distributed tracing with Jaeger, Zipkin, OpenTelemetry, correlation IDs, and span design.

Distributed Tracing

Overview

Distributed tracing tracks requests as they flow through multiple services, enabling debugging and performance analysis in microservices architectures.

Key Concepts

Trace Model

Trace: End-to-end request journey
Span: Single operation within a trace
Span Context: Propagated trace information
Baggage: Custom key-value pairs carried across services
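To make the model concrete, the sketch below represents spans as plain objects and rebuilds the parent/child tree of a single trace. The field names (`traceId`, `spanId`, `parentSpanId`) and the `buildTraceTree` helper are illustrative assumptions, not part of any SDK.

```javascript
// Illustrative only: spans as plain records, linked by parentSpanId.
// buildTraceTree is a hypothetical helper, not an OpenTelemetry API.
function buildTraceTree(spans) {
  const nodes = new Map();
  for (const span of spans) {
    nodes.set(span.spanId, { ...span, children: [] });
  }

  const roots = [];
  for (const node of nodes.values()) {
    const parent = node.parentSpanId ? nodes.get(node.parentSpanId) : undefined;
    if (parent) {
      parent.children.push(node);
    } else {
      // No parent in this batch: a root span, or its parent was not collected.
      roots.push(node);
    }
  }
  return roots;
}

// One trace: an HTTP request that runs a query and publishes a message.
const spans = [
  { traceId: 't1', spanId: 'a', parentSpanId: null, name: 'HTTP GET /api/orders/:id' },
  { traceId: 't1', spanId: 'b', parentSpanId: 'a', name: 'postgresql.query' },
  { traceId: 't1', spanId: 'c', parentSpanId: 'a', name: 'kafka send orders-topic' },
];

const roots = buildTraceTree(spans);
console.log(roots[0].name, roots[0].children.map((child) => child.name));
```

All spans share one trace ID; only the parent links differ, which is why losing the span context at a service boundary splits one request into several unrelated traces.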

Span Attributes

Operation Name: What the span represents
Start/End Time: Duration measurement
Tags: Indexed metadata for querying (called attributes in OpenTelemetry)
Logs: Time-stamped events within a span (called events in OpenTelemetry)
Status: Success, error, or unset
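The fields above can be pictured as one small record. This is a minimal sketch under assumed names (`createSpan` and its methods are hypothetical and only loosely mirror a real SDK span); the clock is injectable so the duration is deterministic.

```javascript
// Illustrative span record; createSpan is a hypothetical helper, not an SDK call.
function createSpan(name, now = Date.now) {
  const span = {
    name,               // operation name
    startTime: now(),   // start time
    endTime: null,      // set by end()
    attributes: {},     // "tags": key-value metadata used for querying
    events: [],         // "logs": time-stamped events within the span
    status: 'unset',    // 'unset' | 'ok' | 'error'
  };
  return {
    setAttribute(key, value) { span.attributes[key] = value; },
    addEvent(eventName) { span.events.push({ name: eventName, time: now() }); },
    setStatus(status) { span.status = status; },
    end() { span.endTime = now(); },
    // Duration is only defined once the span has ended.
    durationMs() { return span.endTime === null ? null : span.endTime - span.startTime; },
    snapshot() { return { ...span }; },
  };
}

// Fake clock that advances 5 ms on every reading.
let fakeTime = 1000;
const clock = () => (fakeTime += 5);

const span = createSpan('processOrder', clock);
span.setAttribute('order.id', 'o-42');
span.addEvent('cache.miss');
span.setStatus('ok');
span.end();
console.log(span.durationMs()); // 10 with the fake clock above
```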

OpenTelemetry Implementation

Instrumentation Setup

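// Node.js OpenTelemetry setup
// Load this file before any module that requires http or express,
// so the instrumentations can patch them.
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { SimpleSpanProcessor } = require('@opentelemetry/sdk-trace-base');
// Note: @opentelemetry/exporter-jaeger is deprecated upstream; recent Jaeger
// versions also accept OTLP, so an OTLP exporter is the usual replacement.
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');

const provider = new NodeTracerProvider();

// SimpleSpanProcessor exports each span as soon as it ends, which suits local
// development; BatchSpanProcessor (same package) is the usual production choice.
// SDK 2.x removes addSpanProcessor; there, pass { spanProcessors: [...] }
// to the NodeTracerProvider constructor instead.
provider.addSpanProcessor(
  new SimpleSpanProcessor(
    new JaegerExporter({
      endpoint: 'http://jaeger:14268/api/traces',
    })
  )
);

// Registers the provider globally, along with the default context manager
// and propagator.
provider.register();

registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation(),
  ],
});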

Manual Span Creation

const { trace, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('my-service');

// `db` is assumed to be an initialized client exposing query(), e.g. a pg Pool.
async function processOrder(orderId) {
  return tracer.startActiveSpan('processOrder', async (span) => {
    try {
      span.setAttribute('order.id', orderId);

      // Child span for the database operation. It becomes a child of
      // 'processOrder' because that span is active in the current context.
      await tracer.startActiveSpan('postgresql.query', async (dbSpan) => {
        try {
          dbSpan.setAttribute('db.system', 'postgresql');
          dbSpan.setAttribute('db.statement', 'SELECT * FROM orders WHERE id = $1');
          await db.query('SELECT * FROM orders WHERE id = $1', [orderId]);
        } finally {
          dbSpan.end(); // end the child span even if the query throws
        }
      });

      span.setStatus({ code: SpanStatusCode.OK });
    } catch (error) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      span.recordException(error);
      throw error;
    } finally {
      span.end();
    }
  });
}
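The try/catch/finally pattern above repeats for every manual span, so it can be factored out. The following is a minimal sketch, assuming a hypothetical `withSpan` helper; `tracer` is any object with `startActiveSpan(name, fn)`, and the error status code is hard-coded so the snippet runs without the SDK installed.

```javascript
// Hypothetical helper, not part of @opentelemetry/api.
// Assumed numeric value of SpanStatusCode.ERROR; in a real service,
// import SpanStatusCode from @opentelemetry/api instead of hard-coding it.
const STATUS_ERROR = 2;

async function withSpan(tracer, name, fn) {
  return tracer.startActiveSpan(name, async (span) => {
    try {
      return await fn(span);
    } catch (error) {
      span.recordException(error);
      span.setStatus({ code: STATUS_ERROR, message: error.message });
      throw error; // tracing must not swallow the caller's error
    } finally {
      span.end(); // runs on both success and failure
    }
  });
}
```

A call site then shrinks to `await withSpan(tracer, 'processOrder', async (span) => { ... })`, and no code path can leave a span open.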

Context Propagation

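const { context, propagation } = require('@opentelemetry/api');

// HttpInstrumentation already extracts and injects context for the http and
// https modules. Manual propagation like this is for transports that are
// not instrumented.

// Extract context from incoming request (`app` is assumed to be an Express app)
app.use((req, res, next) => {
  const ctx = propagation.extract(context.active(), req.headers);
  // Run the rest of the request with the extracted context active
  context.with(ctx, next);
});

// Inject context into outgoing request
// (Node's global fetch is not patched by HttpInstrumentation)
async function callService(url) {
  const headers = {};
  propagation.inject(context.active(), headers);

  return fetch(url, { headers });
}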
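With the default propagator, the injected header is the W3C Trace Context `traceparent` header. The sketch below shows what travels on the wire by formatting and parsing that header by hand. The helper names are hypothetical, and sampling new traces by default is an assumption; in practice the SDK propagator does this for you.

```javascript
const { randomBytes } = require('node:crypto');

// W3C Trace Context header: version-traceId-parentId-flags, lowercase hex.
const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function formatTraceparent({ traceId, spanId, sampled }) {
  return `00-${traceId}-${spanId}-${sampled ? '01' : '00'}`;
}

function parseTraceparent(header) {
  const match = TRACEPARENT_RE.exec(header || '');
  if (!match) return null;
  const [, version, traceId, spanId, flags] = match;
  // Version ff and all-zero IDs are invalid per the specification.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Continue an incoming trace, or start a new one when no valid header exists.
function childContext(incomingHeader) {
  const parent = parseTraceparent(incomingHeader);
  return {
    traceId: parent ? parent.traceId : randomBytes(16).toString('hex'),
    spanId: randomBytes(8).toString('hex'), // every hop gets a fresh span ID
    sampled: parent ? parent.sampled : true, // assumption: sample new traces
  };
}

const incoming = '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01';
const outgoing = formatTraceparent(childContext(incoming));
console.log(outgoing); // same trace ID as `incoming`, new span ID
```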

Jaeger Configuration

Kubernetes Deployment

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
spec:
  strategy: production
  storage:
    type: elasticsearch
    elasticsearch:
      nodeCount: 3
      resources:
        requests:
          cpu: 1
          memory: 4Gi
  collector:
    maxReplicas: 5
  query:
    replicas: 2

Sampling Strategies

# Jaeger sampling configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: jaeger-sampling
data:
  sampling: |
    {
      "service_strategies": [
        {
          "service": "order-service",
          "type": "probabilistic",
          "param": 0.5
        },
        {
          "service": "payment-service",
          "type": "ratelimiting",
          "param": 100
        }
      ],
      "default_strategy": {
        "type": "probabilistic",
        "param": 0.1
      }
    }
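The lookup rule behind this file is simple: a per-service entry wins, otherwise the default applies. Here is a minimal sketch of that rule; `strategyFor` is a hypothetical helper for illustration, not Jaeger code.

```javascript
// Mirrors the structure of the sampling document above.
const samplingConfig = {
  service_strategies: [
    { service: 'order-service', type: 'probabilistic', param: 0.5 },
    { service: 'payment-service', type: 'ratelimiting', param: 100 },
  ],
  default_strategy: { type: 'probabilistic', param: 0.1 },
};

// Hypothetical helper: resolve the strategy that applies to one service.
function strategyFor(config, serviceName) {
  const match = config.service_strategies.find((s) => s.service === serviceName);
  const { type, param } = match || config.default_strategy;
  // "probabilistic" takes a fraction in [0, 1];
  // "ratelimiting" takes a cap in traces per second.
  const unit = type === 'probabilistic' ? 'fraction of traces' : 'traces per second';
  return { type, param, unit };
}

console.log(strategyFor(samplingConfig, 'payment-service'));
console.log(strategyFor(samplingConfig, 'inventory-service')); // falls back to default
```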

Span Design Guidelines

Naming Conventions

HTTP spans:    HTTP {METHOD} {route}
               HTTP GET /api/users/:id

Database:      {db.system}.{operation}
               postgresql.query

Message:       {messaging.system} {operation} {destination}
               kafka send orders-topic

RPC:           {rpc.system}/{service}/{method}
               grpc/UserService/GetUser
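Encoding the templates in one place keeps names consistent across services. A minimal sketch, assuming a hypothetical `spanName` helper object:

```javascript
// Hypothetical helpers that encode the naming templates above.
const spanName = {
  http: (method, route) => `HTTP ${method.toUpperCase()} ${route}`,
  db: (system, operation) => `${system}.${operation}`,
  messaging: (system, operation, destination) => `${system} ${operation} ${destination}`,
  rpc: (system, service, method) => `${system}/${service}/${method}`,
};

console.log(spanName.http('get', '/api/users/:id'));
console.log(spanName.db('postgresql', 'query'));
console.log(spanName.messaging('kafka', 'send', 'orders-topic'));
console.log(spanName.rpc('grpc', 'UserService', 'GetUser'));
```

Pass the route template (`/api/users/:id`), not the concrete path (`/api/users/123`). Otherwise every user ID produces a distinct span name and the tracing backend cannot group them.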

Essential Attributes

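// Attribute names below follow older OpenTelemetry semantic conventions.
// Newer releases rename several of them (for example http.method becomes
// http.request.method, http.status_code becomes http.response.status_code,
// and http.url becomes url.full), so check the version your SDK targets.

// HTTP spans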
span.setAttribute('http.method', 'GET');
span.setAttribute('http.url', 'https://api.example.com/users/123');
span.setAttribute('http.status_code', 200);
span.setAttribute('http.request_content_length', 0);
span.setAttribute('http.response_content_length', 1234);

// Database spans
span.setAttribute('db.system', 'postgresql');
span.setAttribute('db.name', 'mydb');
span.setAttribute('db.statement', 'SELECT * FROM users WHERE id = $1');
span.setAttribute('db.operation', 'SELECT');

// Messaging spans
span.setAttribute('messaging.system', 'kafka');
span.setAttribute('messaging.destination', 'orders');
span.setAttribute('messaging.operation', 'send');

Best Practices

Consistent Naming: Follow semantic conventions
Don't Over-Trace: Sample appropriately
Meaningful Spans: Business-relevant operations
Error Recording: Always record exceptions
Context Propagation: Ensure trace continuity

Sampling Strategies

Head-Based Sampling

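Decision made at trace start and propagated to downstream services
Simpler, and every service keeps or drops the same trace
May miss interesting traces, since errors and latency are not yet known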
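One way to keep head-based decisions consistent is to derive them from the trace ID itself. The sketch below is illustrative: `shouldSample` is a hypothetical function, and real SDK samplers use their own mapping from trace ID to decision.

```javascript
// Deterministic ratio sampler: the decision depends only on the trace ID,
// so every service reaches the same verdict for the same trace.
function shouldSample(traceId, ratio) {
  if (ratio >= 1) return true;
  if (ratio <= 0) return false;
  // Treat the first 8 hex characters (32 bits) as a uniformly distributed number.
  const bucket = parseInt(traceId.slice(0, 8), 16);
  return bucket < ratio * 0x100000000;
}

const traceId = '4bf92f3577b34da6a3ce929d0e0e4736';
console.log(shouldSample(traceId, 0.5)); // same answer on every call and every host
```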

Tail-Based Sampling

Decision made after the trace completes
Can keep all errors and slow traces
More resource intensive, since spans must be buffered until the decision
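A tail-based policy can be sketched as a function over the finished trace. The span shape (`status`, `startTime`, `endTime`), the option names, and `keepTrace` itself are assumptions for illustration; production setups usually run this logic in a collector rather than in application code.

```javascript
// Hypothetical tail-sampling policy, evaluated once all spans of a trace are in.
function keepTrace(spans, { latencyThresholdMs, baselineRatio, random = Math.random }) {
  // Rule 1: always keep traces that contain an error.
  if (spans.some((span) => span.status === 'error')) {
    return { keep: true, reason: 'error' };
  }

  // Rule 2: always keep traces slower than the threshold.
  const start = Math.min(...spans.map((span) => span.startTime));
  const end = Math.max(...spans.map((span) => span.endTime));
  if (end - start > latencyThresholdMs) {
    return { keep: true, reason: 'slow' };
  }

  // Rule 3: keep a small share of unremarkable traces as a baseline.
  return random() < baselineRatio
    ? { keep: true, reason: 'baseline' }
    : { keep: false, reason: 'dropped' };
}

const policy = { latencyThresholdMs: 500, baselineRatio: 0.01 };
const slowTrace = [
  { status: 'ok', startTime: 0, endTime: 900 },
  { status: 'ok', startTime: 100, endTime: 850 },
];
console.log(keepTrace(slowTrace, policy));
```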

Adaptive Sampling

Adjusts rate based on traffic
Balances cost and coverage
Best for variable traffic
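The core of an adaptive scheme is a feedback rule that maps observed traffic to a sampling probability. This is a minimal sketch with a hypothetical `nextSamplingRatio` function and assumed bounds; real implementations typically smooth the rate over a time window as well.

```javascript
// Hypothetical controller: pick the probability that would have produced
// `targetPerSecond` sampled traces at the observed request rate.
function nextSamplingRatio(observedPerSecond, targetPerSecond, { min = 0.001, max = 1 } = {}) {
  if (observedPerSecond <= 0) return max; // no traffic: sample everything
  const ideal = targetPerSecond / observedPerSecond;
  // Clamp so quiet services are fully sampled and busy ones never reach zero.
  return Math.min(max, Math.max(min, ideal));
}

console.log(nextSamplingRatio(1000, 100)); // busy service: sample a fraction
console.log(nextSamplingRatio(50, 100));   // quiet service: sample everything
```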

Anti-Patterns

Creating spans for every function call
Not propagating context across service boundaries
Ignoring span errors
Sampling 100% in production
Not correlating traces with logs
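For the last item, correlation usually means stamping each log record with the current trace and span IDs. A minimal sketch, assuming a hypothetical `logWithTrace` helper and `trace_id`/`span_id` field names; `spanContext` has the `{ traceId, spanId }` shape that a span's `spanContext()` method returns in `@opentelemetry/api`.

```javascript
// Hypothetical helper: build a JSON log line that carries the IDs of the current span.
function logWithTrace(spanContext, level, message, fields = {}) {
  const record = {
    timestamp: new Date().toISOString(),
    level,
    message,
    ...fields,
  };
  if (spanContext) {
    // With these two fields, a log search can jump straight to the trace.
    record.trace_id = spanContext.traceId;
    record.span_id = spanContext.spanId;
  }
  return JSON.stringify(record);
}

const line = logWithTrace(
  { traceId: '4bf92f3577b34da6a3ce929d0e0e4736', spanId: '00f067aa0ba902b7' },
  'error',
  'payment declined',
  { orderId: 'o-42' }
);
console.log(line);
```

In an instrumented service, the first argument could come from `trace.getActiveSpan()?.spanContext()`, so log lines written outside any span simply omit the two fields.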

When to Use

Microservices with complex request flows
Debugging latency issues
Understanding service dependencies
Capacity planning

When NOT to Use

Monolithic applications
Very high-throughput systems without sampling
When trace storage costs cannot be bounded by sampling or retention limits
