When & Why to Use This Skill
This Claude skill provides a comprehensive framework for implementing modern observability patterns, including structured logging, metric design, distributed tracing, and actionable alerting. It enables developers and SREs to build highly observable systems, facilitating faster debugging of production issues, performance bottleneck identification, and the establishment of proactive monitoring strategies using industry-standard methods like RED and USE.
Use Cases
- Case 1: Designing and implementing structured JSON logging formats to improve searchability and error investigation in distributed production environments.
- Case 2: Setting up system metrics and dashboards using RED (Rate, Error, Duration) or USE (Utilization, Saturation, Errors) methods to monitor service health and resource capacity.
- Case 3: Implementing distributed tracing with context propagation to map request flows across microservices and identify specific latency bottlenecks.
- Case 4: Developing actionable alerting rules and Service Level Objectives (SLOs) to reduce alert fatigue and ensure timely incident response.
| name | observability |
|---|---|
| description | Observability patterns including logging, metrics, tracing, and alerting. Auto-triggers when implementing monitoring, debugging production issues, or setting up alerts. |
Observability Skill
Three Pillars of Observability
1. Logs
- What happened: Discrete events with context
- Use for: Debugging, audit trails, error investigation
- Challenge: Volume and searchability
2. Metrics
- How much/how often: Numeric measurements over time
- Use for: Dashboards, alerting, capacity planning
- Challenge: Cardinality explosion
3. Traces
- Where time was spent: Request flow across services
- Use for: Latency analysis, dependency mapping
- Challenge: Sampling and storage
Structured Logging
Log Format
```json
{
  "timestamp": "2024-01-15T10:30:45.123Z",
  "level": "error",
  "message": "Payment failed",
  "service": "payment-service",
  "trace_id": "abc123",
  "span_id": "def456",
  "user_id": "user_789",
  "error": {
    "type": "PaymentDeclined",
    "code": "INSUFFICIENT_FUNDS"
  },
  "duration_ms": 234
}
```
Log Levels
| Level | Use Case |
|---|---|
| ERROR | Failures requiring attention |
| WARN | Unexpected but recoverable |
| INFO | Business events, state changes |
| DEBUG | Development troubleshooting |
| TRACE | Fine-grained diagnostics |
Best Practices
- Use structured JSON format
- Include correlation IDs (trace_id)
- Never log sensitive data (PII, secrets)
- Use consistent field names
- Set appropriate log levels
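The practices above can be sketched with Python's stdlib `logging` module and a small JSON formatter. This is a minimal illustration, not a full-featured library: field names like `trace_id` and the context keys merged from `extra=` are assumptions to align with your tracing setup.

```python
import datetime
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render log records as single-line JSON with consistent field names."""

    # Structured context keys we copy from `extra=...` if present (assumed names)
    CONTEXT_KEYS = ("trace_id", "span_id", "user_id", "duration_ms")

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.datetime.now(datetime.timezone.utc)
                         .isoformat(timespec="milliseconds"),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "service": record.name,
        }
        for key in self.CONTEXT_KEYS:
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        if record.exc_info and record.exc_info[0] is not None:
            entry["error"] = {"type": record.exc_info[0].__name__}
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Correlation IDs travel through `extra=`, never through string interpolation
logger.error("Payment failed", extra={"trace_id": "abc123", "duration_ms": 234})
```

Passing context via `extra=` keeps the message text stable and searchable while the structured fields carry the variable data.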
Metrics Design
Types of Metrics
| Type | Example | Use Case |
|---|---|---|
| Counter | requests_total | Monotonically increasing counts |
| Gauge | temperature_celsius | Value that goes up and down |
| Histogram | request_duration_seconds | Distribution of values |
| Summary | request_latency_quantiles | Quantile calculations |
Naming Convention
`<namespace>_<name>_<unit>`
Examples:
- http_requests_total
- http_request_duration_seconds
- db_connections_active
- queue_messages_waiting
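To make the metric-type semantics concrete, here is a toy in-process sketch of a counter, gauge, and histogram. In production you would use a client library such as `prometheus_client`; this is only an illustration of the semantics, with names following the convention above.

```python
from bisect import bisect_left

class Counter:
    """Monotonically increasing; reset only on process restart."""
    def __init__(self) -> None:
        self.value = 0.0
    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters only go up")
        self.value += amount

class Gauge:
    """A point-in-time value that can go up or down."""
    def __init__(self) -> None:
        self.value = 0.0
    def set(self, value: float) -> None:
        self.value = value

class Histogram:
    """Distribution of observations, bucketed by upper bound (le)."""
    def __init__(self, buckets=(0.05, 0.1, 0.25, 0.5, 1.0)):
        self.buckets = list(buckets)
        self.counts = [0] * (len(self.buckets) + 1)  # last slot = +Inf bucket
        self.sum = 0.0
    def observe(self, value: float) -> None:
        # First bucket whose upper bound is >= value
        self.counts[bisect_left(self.buckets, value)] += 1
        self.sum += value

http_requests_total = Counter()
http_request_duration_seconds = Histogram()

http_requests_total.inc()
http_request_duration_seconds.observe(0.234)  # lands in the 0.25s bucket
```

The histogram's fixed buckets are what keep cardinality bounded: each observation increments one of a small, predeclared set of counters instead of storing the raw value.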
RED Method (Services)
- Rate: Requests per second
- Error: Error rate
- Duration: Latency distribution
USE Method (Resources)
- Utilization: % time busy
- Saturation: Queue depth
- Errors: Error count
Golden Signals
- Latency (response time)
- Traffic (requests/sec)
- Errors (error rate)
- Saturation (resource utilization)
Distributed Tracing
Trace Structure
```
Trace (trace_id: abc123)
├── Span: HTTP Request (span_id: 001, parent: null)
│   ├── Span: Auth Check (span_id: 002, parent: 001)
│   ├── Span: DB Query (span_id: 003, parent: 001)
│   │   └── Span: Connection Pool (span_id: 004, parent: 003)
│   └── Span: External API (span_id: 005, parent: 001)
```
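A span tree like the one above can be sketched with a minimal data structure; a real system would use an OpenTelemetry SDK, so treat this purely as an illustration of the trace/span/parent relationships.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    """One timed unit of work; parent_id links spans into a tree."""
    name: str
    trace_id: str
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    start: float = field(default_factory=time.monotonic)
    duration_ms: Optional[float] = None

    def finish(self) -> None:
        self.duration_ms = (time.monotonic() - self.start) * 1000

# Every span in one request shares the trace_id; parent_id encodes the tree
trace_id = uuid.uuid4().hex
root = Span("HTTP Request", trace_id)                       # parent: null
auth = Span("Auth Check", trace_id, parent_id=root.span_id)
auth.finish()
db = Span("DB Query", trace_id, parent_id=root.span_id)
pool = Span("Connection Pool", trace_id, parent_id=db.span_id)
pool.finish()
db.finish()
root.finish()
```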
Context Propagation
```
# HTTP headers (W3C Trace Context; ids abbreviated for readability)
traceparent: 00-abc123-def456-01
tracestate: vendor=value
```
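Parsing and forwarding the `traceparent` header can be sketched as follows. Per the W3C Trace Context spec the real header carries a 32-hex-digit trace-id and a 16-hex-digit parent-id (the example above abbreviates them); `child_traceparent` is a hypothetical helper name.

```python
import re

# version - trace_id - parent_id - flags, all lowercase hex
TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)

def parse_traceparent(header: str):
    """Return the header's fields as a dict, or None if malformed."""
    m = TRACEPARENT_RE.match(header)
    return m.groupdict() if m else None

def child_traceparent(ctx: dict, new_span_id: str) -> str:
    """Build the header to send downstream: same trace, our span as parent."""
    return f"{ctx['version']}-{ctx['trace_id']}-{new_span_id}-{ctx['flags']}"

incoming = "00-" + "ab" * 16 + "-" + "cd" * 8 + "-01"
ctx = parse_traceparent(incoming)
outgoing = child_traceparent(ctx, "ef" * 8)
```

The key invariant is that the trace-id never changes as the request hops between services; only the parent-id is rewritten at each hop.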
Sampling Strategies
| Strategy | Use Case |
|---|---|
| Always sample | Development, low traffic |
| Probabilistic | Production (1-10%) |
| Rate limiting | Control volume |
| Tail-based | Capture errors/slow requests |
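A probabilistic (head-based) sampler can be sketched by hashing the trace id, so every service in the request path makes the same keep/drop decision without coordination. The function name and rate are illustrative assumptions.

```python
import hashlib

def should_sample(trace_id: str, rate: float = 0.01) -> bool:
    """Deterministic per-trace sampling: hash trace_id to [0, 1), compare to rate."""
    digest = hashlib.sha256(trace_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 2**32  # uniform-ish value in [0, 1)
    return bucket < rate
```

Because the decision is a pure function of the trace id, a trace is either fully sampled across all services or not at all, avoiding broken partial traces.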
Alerting
Alert Design
```yaml
# Good alert (Prometheus-style rule)
- alert: HighErrorRate
  expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.01
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Error rate above 1% for 5 minutes"
    runbook: "https://wiki/runbooks/high-error-rate"
```
Alert Quality
- Actionable: Clear remediation steps
- Relevant: Indicates real problems
- Timely: Fast enough to matter
- Not noisy: Avoid alert fatigue
SLOs and Error Budgets
- SLI (what you measure): the proportion of requests completing in < 200ms
- SLO (the target): 99.9% availability per month
- Error budget (what remains): 0.1% ≈ 43.2 minutes of downtime per month
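The error-budget arithmetic above is just the SLO's complement applied to the period length, assuming a 30-day month:

```python
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Allowable downtime per period: the fraction NOT covered by the SLO."""
    return (1 - slo) * days * 24 * 60

budget = error_budget_minutes(0.999)  # ≈ 43.2 minutes per 30-day month
```

A tighter SLO shrinks the budget fast: 99.99% leaves only about 4.3 minutes per month, which rules out any manual incident response.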
Dashboards
Layout Principles
- Overview first: Key metrics at top
- Then details: Drill-down sections
- Time alignment: Consistent time ranges
- Annotations: Mark deployments/incidents
Essential Panels
- Request rate (traffic)
- Error rate (errors)
- Latency percentiles (P50, P95, P99)
- Resource utilization (CPU, memory)
- Queue depths (saturation)