When & Why to Use This Skill
This Claude skill provides a comprehensive framework for implementing modern observability patterns, including structured logging, metric design, distributed tracing, and actionable alerting. It enables developers and SREs to build highly observable systems, facilitating faster debugging of production issues, performance bottleneck identification, and the establishment of proactive monitoring strategies using industry-standard methods like RED and USE.
Use Cases
- Case 1: Designing and implementing structured JSON logging formats to improve searchability and error investigation in distributed production environments.
- Case 2: Setting up system metrics and dashboards using RED (Rate, Error, Duration) or USE (Utilization, Saturation, Errors) methods to monitor service health and resource capacity.
- Case 3: Implementing distributed tracing with context propagation to map request flows across microservices and identify specific latency bottlenecks.
- Case 4: Developing actionable alerting rules and Service Level Objectives (SLOs) to reduce alert fatigue and ensure timely incident response.
| name | observability |
|---|---|
| description | Observability patterns including logging, metrics, tracing, and alerting. Auto-triggers when implementing monitoring, debugging production issues, or setting up alerts. |
Observability Skill
Three Pillars of Observability
1. Logs
- What happened: Discrete events with context
- Use for: Debugging, audit trails, error investigation
- Challenge: Volume and searchability
2. Metrics
- How much/how often: Numeric measurements over time
- Use for: Dashboards, alerting, capacity planning
- Challenge: Cardinality explosion
3. Traces
- Where time was spent: Request flow across services
- Use for: Latency analysis, dependency mapping
- Challenge: Sampling and storage
Structured Logging
Log Format
```json
{
  "timestamp": "2024-01-15T10:30:45.123Z",
  "level": "error",
  "message": "Payment failed",
  "service": "payment-service",
  "trace_id": "abc123",
  "span_id": "def456",
  "user_id": "user_789",
  "error": {
    "type": "PaymentDeclined",
    "code": "INSUFFICIENT_FUNDS"
  },
  "duration_ms": 234
}
```
Log Levels
| Level | Use Case |
|---|---|
| ERROR | Failures requiring attention |
| WARN | Unexpected but recoverable |
| INFO | Business events, state changes |
| DEBUG | Development troubleshooting |
| TRACE | Fine-grained diagnostics |
Best Practices
- Use structured JSON format
- Include correlation IDs (trace_id)
- Never log sensitive data (PII, secrets)
- Use consistent field names
- Set appropriate log levels
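The practices above can be sketched with Python's stdlib `logging` module and a small JSON formatter. This is a minimal illustration, not a full-featured library: field names like `trace_id` and the context keys merged from `extra=` are assumptions to align with your tracing setup.

```python
import datetime
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render log records as single-line JSON with consistent field names."""

    # Structured context keys we copy from `extra=...` if present (assumed names)
    CONTEXT_KEYS = ("trace_id", "span_id", "user_id", "duration_ms")

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.datetime.now(datetime.timezone.utc)
                         .isoformat(timespec="milliseconds"),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "service": record.name,
        }
        for key in self.CONTEXT_KEYS:
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        if record.exc_info and record.exc_info[0] is not None:
            entry["error"] = {"type": record.exc_info[0].__name__}
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Correlation IDs travel through `extra=`, never through string interpolation
logger.error("Payment failed", extra={"trace_id": "abc123", "duration_ms": 234})
```

Passing context via `extra=` keeps the message text stable and searchable while the structured fields carry the variable data.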
Metrics Design
Types of Metrics
| Type | Example | Use Case |
|---|---|---|
| Counter | requests_total | Monotonically increasing counts |
| Gauge | temperature_celsius | Value that goes up and down |
| Histogram | request_duration_seconds | Distribution of values |
| Summary | request_latency_quantiles | Quantile calculations |
Naming Convention
`<namespace>_<name>_<unit>`
Examples:
- http_requests_total
- http_request_duration_seconds
- db_connections_active
- queue_messages_waiting
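To make the metric-type semantics concrete, here is a toy in-process sketch of a counter, gauge, and histogram. In production you would use a client library such as `prometheus_client`; this is only an illustration of the semantics, with names following the convention above.

```python
from bisect import bisect_left

class Counter:
    """Monotonically increasing; reset only on process restart."""
    def __init__(self) -> None:
        self.value = 0.0
    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters only go up")
        self.value += amount

class Gauge:
    """A point-in-time value that can go up or down."""
    def __init__(self) -> None:
        self.value = 0.0
    def set(self, value: float) -> None:
        self.value = value

class Histogram:
    """Distribution of observations, bucketed by upper bound (le)."""
    def __init__(self, buckets=(0.05, 0.1, 0.25, 0.5, 1.0)):
        self.buckets = list(buckets)
        self.counts = [0] * (len(self.buckets) + 1)  # last slot = +Inf bucket
        self.sum = 0.0
    def observe(self, value: float) -> None:
        # First bucket whose upper bound is >= value
        self.counts[bisect_left(self.buckets, value)] += 1
        self.sum += value

http_requests_total = Counter()
http_request_duration_seconds = Histogram()

http_requests_total.inc()
http_request_duration_seconds.observe(0.234)  # lands in the 0.25s bucket
```

The histogram's fixed buckets are what keep cardinality bounded: each observation increments one of a small, predeclared set of counters instead of storing the raw value.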
RED Method (Services)
- Rate: Requests per second
- Error: Error rate
- Duration: Latency distribution
USE Method (Resources)
- Utilization: % time busy
- Saturation: Queue depth
- Errors: Error count
Golden Signals
- Latency (response time)
- Traffic (requests/sec)
- Errors (error rate)
- Saturation (resource utilization)
Distributed Tracing
Trace Structure
```
Trace (trace_id: abc123)
├── Span: HTTP Request (span_id: 001, parent: null)
│   ├── Span: Auth Check (span_id: 002, parent: 001)
│   ├── Span: DB Query (span_id: 003, parent: 001)
│   │   └── Span: Connection Pool (span_id: 004, parent: 003)
│   └── Span: External API (span_id: 005, parent: 001)
```
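A span tree like the one above can be sketched with a minimal data structure; a real system would use an OpenTelemetry SDK, so treat this purely as an illustration of the trace/span/parent relationships.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    """One timed unit of work; parent_id links spans into a tree."""
    name: str
    trace_id: str
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    start: float = field(default_factory=time.monotonic)
    duration_ms: Optional[float] = None

    def finish(self) -> None:
        self.duration_ms = (time.monotonic() - self.start) * 1000

# Every span in one request shares the trace_id; parent_id encodes the tree
trace_id = uuid.uuid4().hex
root = Span("HTTP Request", trace_id)                       # parent: null
auth = Span("Auth Check", trace_id, parent_id=root.span_id)
auth.finish()
db = Span("DB Query", trace_id, parent_id=root.span_id)
pool = Span("Connection Pool", trace_id, parent_id=db.span_id)
pool.finish()
db.finish()
root.finish()
```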
Context Propagation
```
# HTTP headers (W3C Trace Context; ids abbreviated for readability)
traceparent: 00-abc123-def456-01
tracestate: vendor=value
```
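Parsing and forwarding the `traceparent` header can be sketched as follows. Per the W3C Trace Context spec the real header carries a 32-hex-digit trace-id and a 16-hex-digit parent-id (the example above abbreviates them); `child_traceparent` is a hypothetical helper name.

```python
import re

# version - trace_id - parent_id - flags, all lowercase hex
TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)

def parse_traceparent(header: str):
    """Return the header's fields as a dict, or None if malformed."""
    m = TRACEPARENT_RE.match(header)
    return m.groupdict() if m else None

def child_traceparent(ctx: dict, new_span_id: str) -> str:
    """Build the header to send downstream: same trace, our span as parent."""
    return f"{ctx['version']}-{ctx['trace_id']}-{new_span_id}-{ctx['flags']}"

incoming = "00-" + "ab" * 16 + "-" + "cd" * 8 + "-01"
ctx = parse_traceparent(incoming)
outgoing = child_traceparent(ctx, "ef" * 8)
```

The key invariant is that the trace-id never changes as the request hops between services; only the parent-id is rewritten at each hop.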
Sampling Strategies
| Strategy | Use Case |
|---|---|
| Always sample | Development, low traffic |
| Probabilistic | Production (1-10%) |
| Rate limiting | Control volume |
| Tail-based | Capture errors/slow requests |
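A probabilistic (head-based) sampler can be sketched by hashing the trace id, so every service in the request path makes the same keep/drop decision without coordination. The function name and rate are illustrative assumptions.

```python
import hashlib

def should_sample(trace_id: str, rate: float = 0.01) -> bool:
    """Deterministic per-trace sampling: hash trace_id to [0, 1), compare to rate."""
    digest = hashlib.sha256(trace_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 2**32  # uniform-ish value in [0, 1)
    return bucket < rate
```

Because the decision is a pure function of the trace id, a trace is either fully sampled across all services or not at all, avoiding broken partial traces.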
Alerting
Alert Design
```yaml
# Good alert (Prometheus-style rule)
- alert: HighErrorRate
  expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.01
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Error rate above 1% for 5 minutes"
    runbook: "https://wiki/runbooks/high-error-rate"
```
Alert Quality
- Actionable: Clear remediation steps
- Relevant: Indicates real problems
- Timely: Fast enough to matter
- Not noisy: Avoid alert fatigue
SLOs and Error Budgets
- SLI (what you measure): the proportion of requests completing in < 200ms
- SLO (the target): 99.9% availability per month
- Error budget (what remains): 0.1% ≈ 43.2 minutes of downtime per month
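The error-budget arithmetic above is just the SLO's complement applied to the period length, assuming a 30-day month:

```python
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Allowable downtime per period: the fraction NOT covered by the SLO."""
    return (1 - slo) * days * 24 * 60

budget = error_budget_minutes(0.999)  # ≈ 43.2 minutes per 30-day month
```

A tighter SLO shrinks the budget fast: 99.99% leaves only about 4.3 minutes per month, which rules out any manual incident response.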
Dashboards
Layout Principles
- Overview first: Key metrics at top
- Then details: Drill-down sections
- Time alignment: Consistent time ranges
- Annotations: Mark deployments/incidents
Essential Panels
- Request rate (traffic)
- Error rate (errors)
- Latency percentiles (P50, P95, P99)
- Resource utilization (CPU, memory)
- Queue depths (saturation)