monitoring

from All-The-Vibes

Comprehensive observability strategy including metrics, logs, traces, and alerting. Use when setting up new applications, debugging production issues, performance optimization, SLA/SLO definition, incident response, or establishing monitoring infrastructure.


When & Why to Use This Skill

This Claude skill provides a comprehensive observability strategy encompassing metrics, logs, traces, and alerting. It guides users through implementing the Four Golden Signals, defining SLOs/SLAs, and establishing actionable monitoring infrastructure to ensure high system reliability and performance.

Use Cases

  • Setting up comprehensive monitoring and observability stacks for new microservices or cloud-native applications.
  • Diagnosing and debugging complex production incidents using distributed tracing and structured log analysis.
  • Defining Service Level Objectives (SLOs) and managing error budgets to balance development velocity with system stability.
  • Designing effective alerting strategies and runbooks to minimize alert fatigue and ensure rapid incident response.
  • Performing performance optimization and capacity planning using resource-focused monitoring methods like the USE and RED frameworks.

name: monitoring
description: Comprehensive observability strategy including metrics, logs, traces, and alerting. Use when setting up new applications, debugging production issues, performance optimization, SLA/SLO definition, incident response, or establishing monitoring infrastructure.

Monitoring Skill

Core Principle

Measure what matters, alert on what's actionable.

You can't improve what you don't measure. Observability is not about collecting all possible data—it's about collecting the right data to understand system health, debug issues quickly, and make informed decisions. Alert on symptoms that require action, not on every fluctuation.


When to Use

Use this skill when:

  • Setting up monitoring for new applications or services
  • Debugging production issues or incidents
  • Performing performance optimization
  • Defining SLAs, SLOs, and error budgets
  • Responding to incidents
  • Establishing alerting strategies
  • Implementing distributed tracing
  • Creating dashboards for system observability
  • Analyzing system performance and reliability
  • Planning capacity and scaling decisions

The Three Pillars of Observability

Modern observability is built on three complementary pillars:

┌─────────────────────────────────────────────┐
│           OBSERVABILITY                      │
├─────────────┬──────────────┬────────────────┤
│   METRICS   │     LOGS     │    TRACES      │
├─────────────┼──────────────┼────────────────┤
│ What is     │ What         │ Where is       │
│ happening?  │ happened?    │ the problem?   │
│             │              │                │
│ Time-series │ Event        │ Request flow   │
│ data        │ records      │ through system │
│             │              │                │
│ Aggregated  │ Detailed     │ Distributed    │
│ numbers     │ context      │ context        │
└─────────────┴──────────────┴────────────────┘

1. Metrics: What is Happening?

Time-series numerical data aggregated over time.

Examples:

  • Request rate (requests per second)
  • Error rate (percentage)
  • Response time (milliseconds, percentiles)
  • CPU usage (percentage)
  • Memory usage (bytes)
  • Queue depth (items)

Characteristics:

  • Cheap to collect and store
  • Efficient for alerting
  • Good for dashboards and trends
  • Limited context (numbers only)

When to Use:

  • Real-time monitoring
  • Alerting on thresholds
  • Capacity planning
  • Performance trending

2. Logs: What Happened?

Timestamped event records with contextual information.

Examples:

{
  "timestamp": "2025-01-10T14:32:15Z",
  "level": "ERROR",
  "service": "payment-service",
  "trace_id": "abc123",
  "user_id": "user_456",
  "message": "Payment processing failed",
  "error": "Gateway timeout",
  "amount": 99.99,
  "currency": "USD"
}

Characteristics:

  • Rich contextual information
  • Event-by-event detail
  • Expensive to store at scale
  • Powerful for debugging

When to Use:

  • Debugging specific issues
  • Audit trails
  • Understanding event sequences
  • Post-mortem analysis

3. Traces: Where is the Problem?

Request flow tracking across distributed systems.

User Request → API Gateway → Auth Service → User Service → Database
                    ↓
              Payment Service → Payment Gateway
                    ↓
              Email Service → Email Provider

Example Trace:

Trace ID: abc123
Total Duration: 450ms

Span 1: API Gateway        [0-450ms]   ████████████████████
Span 2: Auth Service       [10-30ms]   ██
Span 3: User Service       [35-100ms]  ██████
Span 4: Database Query     [40-95ms]   █████
Span 5: Payment Service    [105-400ms] ████████████████  ← SLOW!
Span 6: Payment Gateway    [120-390ms] ███████████████
Span 7: Email Service      [405-440ms] ███

Characteristics:

  • Shows request path through services
  • Identifies bottlenecks visually
  • Requires instrumentation
  • Can be expensive at scale

When to Use:

  • Debugging latency issues
  • Understanding microservice interactions
  • Optimizing distributed systems
  • Identifying performance bottlenecks

The Four Golden Signals

Google's SRE framework for monitoring any system.

1. Latency

How long does it take to service a request?

Key Metrics:

  • Median response time (p50)
  • 95th percentile (p95)
  • 99th percentile (p99)
  • 99.9th percentile (p999)

Why Percentiles Matter:

Average:  100ms  ← Hides outliers!
p50:       80ms  ← Half of requests
p95:      200ms  ← 95% of requests
p99:      500ms  ← 99% of requests (worst experience)
p999:    2000ms  ← Worst 0.1% (but affects real users!)

Alert Example:

IF p99_response_time > 1000ms for 5 minutes
THEN alert: "High latency affecting 1% of users"

2. Traffic

How much demand is being placed on the system?

Key Metrics:

  • Requests per second (RPS)
  • Transactions per second (TPS)
  • Concurrent users
  • Data throughput (bytes/sec)

Why It Matters:

  • Understand capacity needs
  • Detect traffic spikes (DDoS, viral events)
  • Plan scaling decisions
  • Correlate with other signals

Alert Example:

IF requests_per_second > 10000 for 10 minutes
THEN alert: "Unusual traffic spike detected"

3. Errors

What is the rate of failing requests?

Key Metrics:

  • Error rate (percentage)
  • HTTP 5xx errors
  • HTTP 4xx errors (client errors)
  • Exception rate
  • Failed transactions

Error Budget:

SLO: 99.9% availability
Error budget per month: 0.1% = 43 minutes downtime

If the service is fully down (100% errors) for 36 minutes
→ Consumed 36 minutes of error budget
→ Only ~7 minutes remaining this month!

Alert Example:

IF error_rate > 1% for 5 minutes
THEN page: "Critical: High error rate"

4. Saturation

How full is the service?

Key Metrics:

  • CPU utilization (%)
  • Memory utilization (%)
  • Disk I/O usage
  • Network bandwidth
  • Connection pool usage
  • Queue depth

Why It Matters:

  • Predict capacity issues before they happen
  • Know when to scale
  • Avoid resource exhaustion

Alert Example:

IF memory_usage > 90% for 10 minutes
THEN alert: "Memory saturation warning"

Alternative Monitoring Frameworks

RED Method (Request-focused)

Best for request-driven services (APIs, web services).

  • Rate: Requests per second
  • Errors: Failed requests per second
  • Duration: Request latency (distribution)

Example Dashboard:

Service: payment-api

Rate:     1,250 req/sec  ▲ 15%
Errors:   12.5 req/sec   ▲ 200% ⚠️
Duration:
  p50:    45ms
  p95:    120ms
  p99:    280ms

USE Method (Resource-focused)

Best for infrastructure and hardware monitoring.

  • Utilization: % time resource was busy
  • Saturation: Queue length or wait time
  • Errors: Error count

Example:

CPU:
  Utilization: 75%
  Saturation:  Load average 4.2 (4 cores)
  Errors:      0

Disk:
  Utilization: 60%
  Saturation:  I/O queue depth: 12
  Errors:      3 read errors
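
As an illustration, a small Python sketch of a USE-style snapshot using the third-party psutil package (assumed installed; error counts would come from other sources such as kernel logs or SMART data):

import psutil  # pip install psutil

# Utilization: how busy the resource is
cpu_util = psutil.cpu_percent(interval=1)
mem_util = psutil.virtual_memory().percent

# Saturation: queued work beyond what the resource can serve right now
load_1m, _, _ = psutil.getloadavg()
cores = psutil.cpu_count()
swap_used = psutil.swap_memory().percent

print(f"CPU: utilization={cpu_util:.0f}%, saturation=load {load_1m:.1f} on {cores} cores")
print(f"MEM: utilization={mem_util:.0f}%, saturation={swap_used:.0f}% swap used")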

Monitoring Stack Tools

Metrics Collection and Storage

Prometheus (Open Source)

Pull-based metrics collection and time-series database.

# prometheus.yml
scrape_configs:
  - job_name: 'myapp'
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:8000']  # must match the port where the app exposes /metrics

Application Instrumentation (Python):

from prometheus_client import Counter, Histogram, start_http_server
import time

# Define metrics (a 'status' label is included so the error-rate query below works)
request_count = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'])
request_duration = Histogram('http_request_duration_seconds', 'HTTP request duration')

# Instrument code
@request_duration.time()
def handle_request(method, endpoint):
    # Handle request...
    request_count.labels(method=method, endpoint=endpoint, status='200').inc()

# Expose metrics endpoint
start_http_server(8000)  # Metrics at http://localhost:8000/metrics

PromQL Queries:

# Request rate (requests per second)
rate(http_requests_total[5m])

# Error rate (percentage)
rate(http_requests_total{status=~"5.."}[5m])
  / rate(http_requests_total[5m]) * 100

# p95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# CPU usage by pod
container_cpu_usage_seconds_total{pod=~"myapp-.*"}

Benefits:

  • Powerful query language (PromQL)
  • Excellent for Kubernetes
  • Active ecosystem
  • Free and open source

Drawbacks:

  • Limited long-term storage (use Thanos/Cortex)
  • Pull-based (need service discovery)
  • Requires learning PromQL

Grafana (Open Source)

Visualization and dashboarding for metrics.

Features:

  • Connects to Prometheus, InfluxDB, CloudWatch, etc.
  • Beautiful, customizable dashboards
  • Alerting capabilities
  • Dashboard sharing and templates

Example Dashboard JSON:

{
  "title": "Service Overview",
  "panels": [
    {
      "title": "Request Rate",
      "targets": [{
        "expr": "rate(http_requests_total[5m])"
      }],
      "type": "graph"
    },
    {
      "title": "Error Rate",
      "targets": [{
        "expr": "rate(http_requests_total{status=~\"5..\"}[5m]) / rate(http_requests_total[5m]) * 100"
      }],
      "alert": {
        "conditions": [{
          "evaluator": {"type": "gt", "params": [1]},
          "query": {"params": ["A", "5m", "now"]}
        }]
      }
    }
  ]
}

CloudWatch (AWS)

AWS-native metrics, logs, and alarms.

# Python: Publishing custom metrics
import boto3
from datetime import datetime

cloudwatch = boto3.client('cloudwatch')

cloudwatch.put_metric_data(
    Namespace='MyApp',
    MetricData=[
        {
            'MetricName': 'RequestCount',
            'Value': 1,
            'Unit': 'Count',
            'Timestamp': datetime.utcnow(),
            'Dimensions': [
                {'Name': 'Environment', 'Value': 'production'},
                {'Name': 'Service', 'Value': 'api'}
            ]
        }
    ]
)

Benefits:

  • Native AWS integration
  • No infrastructure management
  • Automatic metrics for AWS services

Drawbacks:

  • AWS-specific (vendor lock-in)
  • Can be expensive at scale
  • Limited query capabilities

Datadog (Commercial SaaS)

All-in-one observability platform.

Features:

  • Metrics, logs, traces, and APM in one platform
  • 500+ integrations
  • Machine learning anomaly detection
  • Powerful dashboards and alerting

# Python: Datadog StatsD
from datadog import initialize, statsd

initialize(api_key='YOUR_API_KEY')

# Increment counter
statsd.increment('myapp.requests', tags=['endpoint:/api/users'])

# Record timing
statsd.timing('myapp.request_duration', 250)

# Set gauge
statsd.gauge('myapp.queue_depth', 42)

Benefits:

  • Comprehensive platform
  • Minimal setup
  • Great UX and correlation features

Drawbacks:

  • Expensive (per-host pricing)
  • Vendor lock-in
  • Less control than self-hosted

Logging

ELK Stack (Elasticsearch, Logstash, Kibana)

Open-source log aggregation and analysis.

Application → Logstash → Elasticsearch → Kibana
              (collect)  (store/index)   (visualize)

Structured Logging Example:

import logging
import json

class JsonFormatter(logging.Formatter):
    def format(self, record):
        log_data = {
            'timestamp': self.formatTime(record),
            'level': record.levelname,
            'message': record.getMessage(),
            'service': 'myapp',
            'trace_id': getattr(record, 'trace_id', None)
        }
        return json.dumps(log_data)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.root.addHandler(handler)

Logstash Configuration:

input {
  file {
    path => "/var/log/myapp/*.log"
    codec => "json"
  }
}

filter {
  if [level] == "ERROR" {
    mutate {
      add_tag => ["error"]
    }
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "myapp-logs-%{+YYYY.MM.dd}"
  }
}

Loki + Promtail (Grafana Labs)

Log aggregation designed like Prometheus (labels, not full-text indexing).

Benefits:

  • Lower storage costs than Elasticsearch
  • Native Grafana integration
  • Label-based indexing (like Prometheus)

Promtail Configuration:

server:
  http_listen_port: 9080

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: myapp
    static_configs:
      - targets:
          - localhost
        labels:
          job: myapp
          __path__: /var/log/myapp/*.log

Querying Logs in Grafana:

{job="myapp"} |= "error" | json | level="ERROR"

Splunk (Commercial)

Enterprise log management platform.

Features:

  • Powerful search and analytics
  • Machine learning for anomaly detection
  • Compliance and security use cases
  • Extensive integrations

Drawbacks:

  • Very expensive
  • Steep learning curve
  • Resource intensive

Distributed Tracing

Jaeger (Open Source)

Distributed tracing system developed by Uber.

OpenTelemetry Instrumentation (Python):

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter

# Setup
trace.set_tracer_provider(TracerProvider())
jaeger_exporter = JaegerExporter(
    agent_host_name="localhost",
    agent_port=6831,
)
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(jaeger_exporter)
)

tracer = trace.get_tracer(__name__)

# Usage
def process_order(order_id):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order_id", order_id)

        with tracer.start_as_current_span("validate_order"):
            validate(order_id)

        with tracer.start_as_current_span("charge_payment"):
            charge(order_id)

        with tracer.start_as_current_span("send_confirmation"):
            send_email(order_id)

OpenTelemetry (Standard)

Vendor-neutral instrumentation standard.

Benefits:

  • Single instrumentation for metrics, logs, traces (see the metrics sketch below)
  • Export to any backend (Jaeger, Zipkin, Datadog, etc.)
  • Language support: Python, Go, Java, JavaScript, .NET, Rust
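
The Jaeger example above covers traces; a minimal sketch of the OpenTelemetry metrics API might look like the following (package layout and exporter endpoint are assumptions that depend on the installed SDK version and collector setup):

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

# Export metrics periodically to an OTLP endpoint (e.g., an OpenTelemetry Collector)
reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="localhost:4317", insecure=True)
)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("myapp")
request_counter = meter.create_counter("http.requests", description="Total requests")
latency_ms = meter.create_histogram("http.request.duration", unit="ms")

# Record measurements with low-cardinality attributes
request_counter.add(1, {"endpoint": "/api/users", "status": "200"})
latency_ms.record(42.0, {"endpoint": "/api/users"})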

Auto-Instrumentation (Python):

# Install
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install

# Run with auto-instrumentation
opentelemetry-instrument \
  --traces_exporter otlp \
  --metrics_exporter otlp \
  --service_name myapp \
  python app.py

AWS X-Ray

AWS-native distributed tracing.

from aws_xray_sdk.core import xray_recorder
from aws_xray_sdk.core import patch_all

# Patch libraries
patch_all()

# Manual instrumentation
@xray_recorder.capture('process_order')
def process_order(order_id):
    xray_recorder.current_segment().put_annotation('order_id', order_id)
    # Process order...

Alerting Strategy

Alert Fatigue: The Silent Killer

Bad Alerting:

3:00 AM: Disk usage 71% ⚠️
3:15 AM: Memory usage 82% ⚠️
3:30 AM: CPU spike to 90% for 10 seconds ⚠️
3:45 AM: Database connection pool 70% full ⚠️
4:00 AM: Disk usage 72% ⚠️

Result: On-call engineer ignores/silences all alerts. Real issues get missed.

Good Alerting:

3:00 AM: [CRITICAL] Error rate 15% for 10 minutes - users affected!

Result: On-call engineer immediately investigates actionable issue.


Alert Severity Levels

Use 3-4 severity levels, no more.

Critical (Page On-Call)

  • User Impact: Service down or severely degraded
  • Examples:
    • Error rate > 5%
    • Service completely unavailable
    • Data loss occurring
  • Response: Immediate action required (wake up engineer)

Warning (Notify Team Channel)

  • User Impact: Potential future issue, no immediate user impact
  • Examples:
    • Error rate > 1%
    • Disk usage > 85%
    • Memory usage trending toward saturation
  • Response: Investigate during business hours

Info (Log Only)

  • User Impact: None
  • Examples:
    • Deployment completed
    • Auto-scaling triggered
    • Configuration change
  • Response: Awareness only, no action

Alert Best Practices

1. Alert on Symptoms, Not Causes

Bad: Cause-based alert

IF cpu_usage > 80% THEN alert

Problem: High CPU might not affect users. Not actionable.

Good: Symptom-based alert

IF p99_latency > 1000ms AND error_rate > 1% THEN alert

Why: Users are actually affected. Clear action needed.


2. Make Alerts Actionable

Every alert must answer: "What should I do about this?"

Bad Alert:

ALERT: Database queries slow

Good Alert:

ALERT: Database p99 query time > 5s for 10 minutes

User Impact: Checkout page timing out for 1% of users
Runbook: https://wiki.company.com/runbooks/slow-database
Dashboard: https://grafana.company.com/d/database
Recent Changes: Last deploy 2 hours ago (v1.2.3)

Immediate Actions:
1. Check slow query log
2. Check for locks: SELECT * FROM pg_locks WHERE granted=false
3. Consider rolling back if the issue started after the v1.2.3 deploy

3. Use Error Budgets, Not Arbitrary Thresholds

Error Budget Concept:

SLO: 99.9% availability = 99.9% successful requests
Error Budget: 0.1% of requests can fail

Monthly Error Budget:
- 0.1% of 100M requests = 100,000 failed requests allowed
- 43 minutes of downtime allowed

Alert when error budget consumption rate too high:
IF current_error_rate will exhaust error_budget in < 7 days
THEN alert: "Error budget burn rate too high"

4. Avoid Alert Storms

Use alert grouping and deduplication:

# Alertmanager (Prometheus)
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h

inhibit_rules:
  # Don't alert on high latency if service is down
  - source_match:
      severity: 'critical'
      alertname: 'ServiceDown'
    target_match:
      severity: 'warning'
      alertname: 'HighLatency'
    equal: ['service']

5. Alert Tuning Process

Deploy Alert → Monitor False Positive Rate
                      ↓
            Too many false positives?
                   ↙        ↘
                 YES         NO
                  ↓           ↓
         Adjust thresholds   Keep alert
         Adjust duration
         Add conditions
                  ↓
         Re-deploy alert
                  ↓
              Monitor

Target:

  • False positive rate < 5%
  • Each alert should lead to action 90%+ of the time

SLIs, SLOs, and SLAs

Service Level Indicator (SLI)

A metric that measures service quality.

Examples:

  • Request success rate
  • Request latency (p95)
  • System throughput
  • Data freshness

Good SLI:

SLI = (successful requests) / (total requests)
    = 99,500 / 100,000
    = 99.5%

Service Level Objective (SLO)

Target value or range for an SLI.

Examples:

  • 99.9% of requests succeed
  • 95% of requests complete in < 300ms
  • 99% of data processed within 1 hour

SLO Structure:

SLO: [percentage] of [what] in [time period]

Example:
SLO: 99.9% of API requests succeed in a rolling 30-day window

Service Level Agreement (SLA)

Contract with users about service levels, often with consequences.

Example:

SLA: 99.9% uptime monthly

Uptime: 99.9% - 100%  → No credit
Uptime: 99.0% - 99.9% → 10% credit
Uptime: 95.0% - 99.0% → 25% credit
Uptime: < 95.0%       → 50% credit

Relationship

SLI (actual) ≥ SLO ≥ SLA

Example:
SLA:  99.5% (contractual minimum)
SLO:  99.9% (internal target - buffer above SLA)
SLI:  99.95% (actual measurement)

Buffer = SLO - SLA = 0.4% (cushion above the contractual SLA; the error budget itself is 1 - SLO = 0.1%)

Setting Good SLOs

Don't Aim for 100%

100% is:

  • Impossible to achieve
  • Prevents taking risks (deploys, experiments)
  • Not necessary (users don't notice 99.9% vs 100%)

Instead:

  • Set realistic SLOs based on user needs
  • Use error budgets to balance reliability and velocity

Start with Current Performance

Step 1: Measure current SLI for 30 days
  Current: 99.7% success rate

Step 2: Set SLO slightly below current (buffer)
  SLO: 99.5%

Step 3: Monitor and adjust quarterly
  If easily meeting SLO → Consider raising
  If struggling → Investigate systemic issues

Dashboard Design Patterns

1. Overview Dashboard (At-a-Glance Health)

Purpose: Quickly answer "Is everything OK?"

Panels:

  • Overall health status (green/yellow/red)
  • Four Golden Signals (Latency, Traffic, Errors, Saturation)
  • Recent deployments timeline
  • Active alerts count

Layout:

┌─────────────────────────────────────────────┐
│  System Health: GREEN ✓                     │
├──────────────┬──────────────┬───────────────┤
│  Latency     │  Traffic     │  Errors       │
│  p99: 250ms  │  1.2K req/s  │  0.05%        │
│  ████░░      │  ██████░░    │  ░░░░░░       │
├──────────────┴──────────────┴───────────────┤
│  Recent Deployments                          │
│  ──●────●──────────●─── (last 24h)          │
│                                              │
│  Active Alerts: 0                            │
└──────────────────────────────────────────────┘

2. Service Dashboard (Deep-Dive)

Purpose: Understand specific service performance.

Panels:

  • Request rate (by endpoint)
  • Error rate (by endpoint and status code)
  • Latency (p50, p95, p99)
  • Resource usage (CPU, memory)
  • Dependency health (database, cache, external APIs)

3. Business Metrics Dashboard

Purpose: Connect technical metrics to business outcomes.

Examples:

  • User signups per hour
  • Successful transactions
  • Revenue per minute
  • Shopping cart abandonment rate
  • Search conversion rate

Why It Matters: Technical outages should correlate with business impact:

Error rate spike at 2:00 PM
   ↓
Transactions dropped 80%
   ↓
Revenue loss: $10,000/hour
   ↓
Priority: CRITICAL

4. Incident Response Dashboard

Purpose: Information needed during active incidents.

Panels:

  • Real-time error rate
  • Recent deployments
  • Recent alerts
  • Service dependency map
  • Key logs (errors/warnings)
  • On-call engineer contact

Monitoring Best Practices

1. Instrument Early

Add monitoring from day one, not after issues arise.

# Bad: No instrumentation
def process_payment(amount):
    return gateway.charge(amount)

# Good: Instrumented
@metrics.timer('payment_duration')
def process_payment(amount):
    with tracer.start_span('process_payment'):
        try:
            result = gateway.charge(amount)
            metrics.increment('payment_success')
            return result
        except Exception as e:
            metrics.increment('payment_failure')
            logger.error('Payment failed', extra={
                'amount': amount,
                'error': str(e),
                'trace_id': get_trace_id()
            })
            raise

2. Use Structured Logging

Bad: Unstructured

logger.info(f"User {user_id} purchased {item} for ${amount}")

Problem: Hard to parse, query, and alert on.

Good: Structured (JSON)

logger.info("purchase_completed", extra={
    'event': 'purchase',
    'user_id': user_id,
    'item_id': item,
    'amount': amount,
    'currency': 'USD',
    'trace_id': trace_id
})

Benefit: Easy to filter, aggregate, and build alerts.


3. Include Trace IDs Everywhere

Connect logs, traces, and (via exemplars or shared labels) metrics using a trace ID.

import uuid

def handle_request():
    trace_id = str(uuid.uuid4())

    # Add to logs (and attach to the current span if tracing is enabled)
    logger.info('Processing request', extra={'trace_id': trace_id})

    # Don't tag metrics with trace_id -- it has unbounded cardinality
    # (see "Avoid High-Cardinality Labels" below); use exemplars or logs instead

    # Add to response headers (for debugging)
    return response, {'X-Trace-ID': trace_id}

Debugging Flow:

1. User reports slow request
2. Find trace_id in response headers
3. Search logs: trace_id="abc123"
4. View trace in Jaeger: abc123
5. Identify slow span: database query took 5s
6. Fix query performance

4. Monitor Dependencies

Your service is only as reliable as its dependencies.

# Monitor external API
@metrics.timer('external_api_duration')
def call_external_api():
    try:
        response = requests.get('https://api.external.com', timeout=5)
        metrics.increment('external_api_success')
        return response
    except requests.Timeout:
        metrics.increment('external_api_timeout')
        raise
    except requests.RequestException:
        metrics.increment('external_api_error')
        raise

Dashboard:

Dependencies Health:
  Database:     99.9% success ✓
  Redis Cache:  99.95% success ✓
  Payment API:  97.2% success ⚠️  ← Issue!
  Email API:    99.8% success ✓

5. Tag/Label Metrics

Enable filtering and aggregation.

# Good: Tagged metrics
metrics.increment('http_requests', tags={
    'method': 'POST',
    'endpoint': '/api/users',
    'status': '200',
    'environment': 'production'
})

# Now you can query:
# - All POST requests
# - All /api/users requests
# - All 5xx errors
# - Production vs staging comparison

6. Avoid High-Cardinality Labels

Bad: Unique value per request

metrics.increment('requests', tags={
    'user_id': user_id  # Millions of unique values!
})

Problem: Creates millions of time series, explodes storage costs.

Good: Low cardinality

metrics.increment('requests', tags={
    'user_tier': 'premium'  # Only 3-4 unique values
})

Monitoring Workflow

1. Development Phase

Developer writes code
        ↓
Add instrumentation (metrics, logs, traces)
        ↓
Define local monitoring (docker-compose)
        ↓
Test monitoring locally
        ↓
Commit code with instrumentation

2. Deployment Phase

Code deployed to staging
        ↓
Verify metrics appear in Prometheus/Datadog
        ↓
Create/update Grafana dashboards
        ↓
Define alerts in Alertmanager
        ↓
Test alerts (trigger conditions manually)
        ↓
Deploy to production
        ↓
Monitor dashboards during deployment

3. Operations Phase

Monitor dashboards daily
        ↓
Alert fires → Investigate
        ↓
Use logs and traces to debug
        ↓
Fix issue
        ↓
Update runbooks
        ↓
Tune alert if false positive

Integration with Other Skills

With Deployment

  • Monitor metrics during deployment
  • Rollback if error rate increases
  • Dashboard shows deployment markers
  • Alert on deployment failures

With Performance Optimization

  • Metrics identify bottlenecks
  • Traces show slow code paths
  • Before/after performance comparison
  • Monitor optimization impact

With Infrastructure

  • Monitor infrastructure resources (CPU, memory, disk)
  • Capacity planning based on trends
  • Alert on infrastructure issues
  • Auto-scaling triggered by metrics

With CI/CD

  • CI/CD pipeline emits metrics
  • Alert on pipeline failures
  • Performance tests validate SLOs
  • Automated canary analysis

Quick Reference

Metrics to Monitor Checklist

Application:

  • Request rate (requests/sec)
  • Error rate (%)
  • Response time (p50, p95, p99)
  • Active connections/requests

Infrastructure:

  • CPU usage (%)
  • Memory usage (%)
  • Disk usage (%)
  • Network I/O

Dependencies:

  • Database query time
  • Cache hit rate
  • External API response time
  • Queue depth

Business:

  • User signups
  • Successful transactions
  • Revenue metrics
  • User engagement

Alerting Checklist

  • Alert has clear severity (critical/warning/info)
  • Alert includes user impact description
  • Runbook link included
  • Dashboard link included
  • Alert is actionable (not just informational)
  • Alert tested (triggered manually)
  • On-call knows how to respond
  • False positive rate < 5%

Logging Best Practices

  • Use structured logging (JSON)
  • Include trace ID in all logs
  • Use appropriate log levels (ERROR, WARN, INFO, DEBUG)
  • Don't log sensitive data (passwords, credit cards)
  • Include context (user_id, request_id, etc.)
  • Centralized log aggregation configured
  • Logs retained for compliance period (e.g., 90 days)

SLO Template

Service: [service-name]
SLO: [percentage]% of [what] in [time period]

Example:
Service: payment-api
SLO: 99.9% of API requests succeed in a rolling 30-day window

Measurement:
  SLI: (successful_requests / total_requests) * 100
  Window: 30 days rolling
  Target: 99.9%

Error Budget:
  Allowed failures: 0.1% = ~43 minutes downtime/month
  Current status: 99.95% (well within budget)

Alerts:
  - Error budget burn rate > 5x: CRITICAL (page on-call)
  - Error budget < 20% remaining: WARNING (notify team)

Common Monitoring Commands

# Prometheus
# Query current request rate
curl 'http://localhost:9090/api/v1/query?query=rate(http_requests_total[5m])'

# Grafana
# Create API key
curl -X POST http://admin:admin@localhost:3000/api/auth/keys \
  -H "Content-Type: application/json" \
  -d '{"name":"monitoring","role":"Viewer"}'

# Jaeger
# Query traces by service
curl 'http://localhost:16686/api/traces?service=myapp&limit=10'

# CloudWatch
# Get metric statistics
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --start-time 2025-01-10T00:00:00Z \
  --end-time 2025-01-10T23:59:59Z \
  --period 3600 \
  --statistics Average

Remember: Good monitoring is invisible when everything works, but invaluable when things break. Instrument early, alert sparingly, and always connect metrics to user impact. Measure what matters, not just what's easy to measure.