observability

from bigdegenenergy

Observability patterns including logging, metrics, tracing, and alerting. Auto-triggers when implementing monitoring, debugging production issues, or setting up alerts.


When & Why to Use This Skill

This Claude skill provides a comprehensive framework for implementing modern observability patterns: structured logging, metric design, distributed tracing, and actionable alerting. It helps developers and SREs build observable systems, enabling faster debugging of production issues, quicker identification of performance bottlenecks, and proactive monitoring built on industry-standard methods such as RED and USE.

Use Cases

  • Case 1: Designing and implementing structured JSON logging formats to improve searchability and error investigation in distributed production environments.
  • Case 2: Setting up system metrics and dashboards using RED (Rate, Error, Duration) or USE (Utilization, Saturation, Errors) methods to monitor service health and resource capacity.
  • Case 3: Implementing distributed tracing with context propagation to map request flows across microservices and identify specific latency bottlenecks.
  • Case 4: Developing actionable alerting rules and Service Level Objectives (SLOs) to reduce alert fatigue and ensure timely incident response.

Observability Skill

Three Pillars of Observability

1. Logs

  • What happened: Discrete events with context
  • Use for: Debugging, audit trails, error investigation
  • Challenge: Volume and searchability

2. Metrics

  • How much/how often: Numeric measurements over time
  • Use for: Dashboards, alerting, capacity planning
  • Challenge: Cardinality explosion

3. Traces

  • Where time was spent: Request flow across services
  • Use for: Latency analysis, dependency mapping
  • Challenge: Sampling and storage

Structured Logging

Log Format

{
  "timestamp": "2024-01-15T10:30:45.123Z",
  "level": "error",
  "message": "Payment failed",
  "service": "payment-service",
  "trace_id": "abc123",
  "span_id": "def456",
  "user_id": "user_789",
  "error": {
    "type": "PaymentDeclined",
    "code": "INSUFFICIENT_FUNDS"
  },
  "duration_ms": 234
}
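As a minimal sketch, here is one way to emit logs in that shape with Python's standard logging module. The JsonFormatter class, the service name, and all field values are illustrative, not part of any particular library:

import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    # Context fields promoted from the record into the JSON body.
    CONTEXT_FIELDS = ("trace_id", "span_id", "user_id", "error", "duration_ms")

    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "service": "payment-service",  # illustrative service name
        }
        for key in self.CONTEXT_FIELDS:
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Context passed via `extra` becomes attributes on the log record.
logger.error("Payment failed", extra={
    "trace_id": "abc123",
    "span_id": "def456",
    "user_id": "user_789",
    "error": {"type": "PaymentDeclined", "code": "INSUFFICIENT_FUNDS"},
    "duration_ms": 234,
})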

Log Levels

Level   Use Case
ERROR   Failures requiring attention
WARN    Unexpected but recoverable
INFO    Business events, state changes
DEBUG   Development troubleshooting
TRACE   Fine-grained diagnostics

Best Practices

  • Use structured JSON format
  • Include correlation IDs (trace_id)
  • Never log sensitive data (PII, secrets)
  • Use consistent field names
  • Set appropriate log levels

Metrics Design

Types of Metrics

Type        Example                     Use Case
Counter     requests_total              Monotonically increasing count
Gauge       temperature_celsius         Value that goes up or down
Histogram   request_duration_seconds    Distribution of values
Summary     request_latency_quantiles   Quantile calculations
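A short sketch of the four types using the Python prometheus_client library (one common choice, assumed here; any client with these primitives looks similar):

from prometheus_client import Counter, Gauge, Histogram, Summary

# Counter: only ever increases (resets when the process restarts).
requests_total = Counter(
    "http_requests_total", "Total HTTP requests", ["method", "status"])
requests_total.labels(method="GET", status="200").inc()

# Gauge: a current value that can move in either direction.
temperature = Gauge("temperature_celsius", "Current temperature")
temperature.set(21.5)

# Histogram: observations bucketed server-side, for quantile queries later.
request_duration = Histogram(
    "http_request_duration_seconds", "HTTP request latency in seconds",
    buckets=[0.01, 0.05, 0.1, 0.5, 1, 5])
request_duration.observe(0.234)

# Summary: streaming count/sum aggregated in the client.
request_latency = Summary(
    "request_latency_seconds", "HTTP request latency summary")
request_latency.observe(0.234)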

Naming Convention

<namespace>_<name>_<unit>

Examples:
- http_requests_total
- http_request_duration_seconds
- db_connections_active
- queue_messages_waiting

RED Method (Services)

  • Rate: Requests per second
  • Error: Error rate
  • Duration: Latency distribution
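A sketch of RED instrumentation for a request handler, again assuming prometheus_client; the handle() wrapper and the route/status label names are illustrative:

import time
from prometheus_client import Counter, Histogram

# Rate and Errors both come from one labeled counter;
# Duration comes from a histogram of request latencies.
REQUESTS = Counter(
    "http_requests_total", "Total HTTP requests", ["route", "status"])
DURATION = Histogram(
    "http_request_duration_seconds", "HTTP request latency", ["route"])

def handle(route, fn):
    """Run fn() for a route, recording RED metrics around it."""
    start = time.perf_counter()
    try:
        result = fn()
        REQUESTS.labels(route=route, status="200").inc()
        return result
    except Exception:
        REQUESTS.labels(route=route, status="500").inc()
        raise
    finally:
        DURATION.labels(route=route).observe(time.perf_counter() - start)

On the query side, rate() over the counter gives Rate, filtering on the status label gives Errors, and histogram_quantile() over the histogram buckets gives the Duration distribution.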

USE Method (Resources)

  • Utilization: % time busy
  • Saturation: Queue depth
  • Errors: Error count

Golden Signals

  1. Latency (response time)
  2. Traffic (requests/sec)
  3. Errors (error rate)
  4. Saturation (resource utilization)

Distributed Tracing

Trace Structure

Trace (trace_id: abc123)
├── Span: HTTP Request (span_id: 001, parent: null)
│   ├── Span: Auth Check (span_id: 002, parent: 001)
│   ├── Span: DB Query (span_id: 003, parent: 001)
│   │   └── Span: Connection Pool (span_id: 004, parent: 003)
│   └── Span: External API (span_id: 005, parent: 001)
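The same shape expressed in code, assuming the OpenTelemetry Python SDK: nesting start_as_current_span blocks is what produces the parent/child links above.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Print finished spans to stdout; a real setup would export to a backend.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("payment-service")

# Each nested `with` block opens a child span of the enclosing one.
with tracer.start_as_current_span("HTTP Request"):
    with tracer.start_as_current_span("Auth Check"):
        pass  # verify credentials
    with tracer.start_as_current_span("DB Query"):
        with tracer.start_as_current_span("Connection Pool"):
            pass  # acquire a connection
    with tracer.start_as_current_span("External API"):
        pass  # call a downstream service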

Context Propagation

# W3C Trace Context HTTP headers
# traceparent: <version>-<trace-id>-<parent-span-id>-<trace-flags>
traceparent: 00-abc123-def456-01
tracestate: vendor=value
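A sketch of propagation with OpenTelemetry's propagate API (assuming a configured tracer, as in the previous sketch): inject() writes the traceparent header for the active span, and extract() reconstructs the context on the receiving side. The HTTP calls themselves are stubbed out as comments.

from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer("checkout-service")

# Client side: copy the active trace context into outgoing headers.
with tracer.start_as_current_span("call-payment-service"):
    headers = {}
    inject(headers)  # writes traceparent (and tracestate, if set)
    # http_client.post(url, headers=headers)  # hypothetical outbound call

# Server side: continue the caller's trace from incoming headers.
def handle_request(incoming_headers):
    ctx = extract(incoming_headers)
    with tracer.start_as_current_span("handle-payment", context=ctx):
        pass  # spans created here share the caller's trace_id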

Sampling Strategies

Strategy        Use Case
Always sample   Development, low traffic
Probabilistic   Production (1-10% of traces)
Rate limiting   Control trace volume
Tail-based      Capture errors and slow requests (decided after the trace completes)
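For probabilistic sampling, a sketch using the OpenTelemetry SDK's built-in samplers; the 5% ratio is an assumed value. Tail-based sampling, by contrast, typically runs in a collector after spans arrive, since errors and latency are only known once the trace finishes.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample 5% of new traces; honor the caller's decision for continued ones,
# so a trace is never recorded in one service and dropped in the next.
sampler = ParentBased(root=TraceIdRatioBased(0.05))
trace.set_tracer_provider(TracerProvider(sampler=sampler))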

Alerting

Alert Design

# Good alert (Prometheus alerting rule format)
groups:
  - name: availability
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 1% for 5 minutes"
          runbook: "https://wiki/runbooks/high-error-rate"

Alert Quality

  • Actionable: Clear remediation steps
  • Relevant: Indicates real problems
  • Timely: Fast enough to matter
  • Not noisy: Avoid alert fatigue

SLOs and Error Budgets

SLI: the measured indicator, e.g. the fraction of requests completing in < 200ms
SLO: the target set on that indicator, e.g. 99.9% over a rolling 30-day window
Error Budget: what the SLO leaves over, 100% - 99.9% = 0.1%, i.e. 43.2 minutes of downtime per 30-day month
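The budget arithmetic, worked through (30-day window assumed):

# 30-day window, in minutes
window_minutes = 30 * 24 * 60           # 43,200 minutes
slo = 0.999                             # 99.9% availability target
error_budget = 1 - slo                  # 0.1% of the window
print(round(window_minutes * error_budget, 1))  # 43.2 minutes per month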

Dashboards

Layout Principles

  1. Overview first: Key metrics at top
  2. Then details: Drill-down sections
  3. Time alignment: Consistent time ranges
  4. Annotations: Mark deployments/incidents

Essential Panels

  • Request rate (traffic)
  • Error rate (errors)
  • Latency percentiles (P50, P95, P99)
  • Resource utilization (CPU, memory)
  • Queue depths (saturation)