slo-alerting

from majiayu000

Define SLIs, SLOs, and implement burn-rate alerting

Updated Jan 5, 2026

When & Why to Use This Skill

This Claude skill helps SRE and DevOps teams define Service Level Indicators (SLIs) and Service Level Objectives (SLOs) and implement burn-rate alerting. Instead of alerting on noisy raw error rates, it focuses on how quickly the error budget is being consumed. It provides templates for multi-window alert strategies and a short list of dashboard essentials, with the goal of reducing alert fatigue while protecting service availability.

Use Cases

  • Designing Reliability Frameworks: Define quantitative SLIs like availability and latency for critical user journeys to set realistic and measurable SLO targets.
  • Implementing Burn-Rate Alerting: Configure multi-window alert strategies (e.g., 14.4x burn over 1h) to catch fast error budget consumption before the budget is exhausted.
  • Error Budget Management: Calculate and visualize error budgets to help product and engineering teams balance the pace of feature releases with the necessity of system stability.
  • Reducing Alert Fatigue: Transition from raw error rate triggers to sophisticated burn-rate alerts to ensure on-call engineers only respond to incidents that significantly threaten the error budget.
name: slo-alerting
description: "Define SLIs, SLOs, and implement burn-rate alerting"
priority: 2

SLO Alerting

Define SLIs, set SLO targets, alert on burn rate (not raw error rate).

Concepts

| Term | Definition | Example |
|------|------------|---------|
| SLI | Quantitative measure | % successful requests |
| SLO | Target for SLI | 99.9% success |
| Error Budget | Allowed failure | 0.1% = 43 min/month |
| Burn Rate | Budget consumption speed | 10x = exhausted in 3 days |
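The examples in the table can be reproduced with a little arithmetic. The sketch below assumes a 30-day SLO window (the table does not state one, but 43 min/month and 3 days at 10x both imply it); the function names are illustrative, not part of any library.

```python
def error_budget_fraction(slo_target: float) -> float:
    """Allowed failure fraction, e.g. a 0.999 target allows 0.001."""
    return 1.0 - slo_target


def error_budget_minutes(slo_target: float, window_days: float = 30.0) -> float:
    """Error budget expressed as minutes of full outage in the window.

    Assumption: a 30-day window unless stated otherwise.
    """
    return error_budget_fraction(slo_target) * window_days * 24 * 60


def days_to_exhaustion(burn_rate: float, window_days: float = 30.0) -> float:
    """Days until the whole budget is gone at a constant burn rate.

    A burn rate of 1x spends exactly one window's budget per window.
    """
    return window_days / burn_rate


print(round(error_budget_minutes(0.999), 1))  # 43.2
print(days_to_exhaustion(10))                 # 3.0
```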

Common SLIs

Availability: successful_requests / total_requests
Latency:      requests_under_threshold / total_requests
Error Rate:   error_requests / total_requests
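As a rough illustration, the three ratios above can be computed from request counters. The `RequestCounts` type and its field names are hypothetical, and two choices here are assumptions rather than part of the skill: error requests are taken to be every request that did not succeed, and a window with no traffic is treated as fully compliant.

```python
from dataclasses import dataclass


@dataclass
class RequestCounts:
    """Hypothetical counters for one measurement window."""
    total: int
    successful: int
    under_threshold: int  # requests served faster than the latency threshold


def availability(c: RequestCounts) -> float:
    # Assumption: no traffic counts as fully available.
    return c.successful / c.total if c.total else 1.0


def latency_sli(c: RequestCounts) -> float:
    return c.under_threshold / c.total if c.total else 1.0


def error_rate(c: RequestCounts) -> float:
    # Assumption: error_requests = total - successful.
    return (c.total - c.successful) / c.total if c.total else 0.0


counts = RequestCounts(total=1000, successful=999, under_threshold=950)
print(availability(counts), latency_sli(counts), error_rate(counts))
```

Defined this way, availability and error rate always sum to 1, so only one of the two needs its own SLO.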

Burn Rate Alerting

Alert on how fast you're consuming budget, not raw error rate:

| Alert Level | Burn Rate | Time to Exhaust |
|-------------|-----------|-----------------|
| Page (critical) | 14.4x | 2 days |
| Page (warning) | 6x | 5 days |
| Ticket (medium) | 3x | 10 days |
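One way to see where thresholds like 14.4x and 6x come from is to ask how much of the budget is spent if that burn rate holds for the whole alerting window. The sketch below assumes a 30-day SLO period and pairs each threshold with the 1h and 6h long windows used in the multi-window rules; `budget_consumed` is an illustrative helper, not a library function.

```python
def budget_consumed(burn_rate: float, window_hours: float,
                    slo_period_days: float = 30.0) -> float:
    """Fraction of the period's error budget spent if burn_rate
    is sustained for window_hours (assumes a 30-day SLO period)."""
    return burn_rate * window_hours / (slo_period_days * 24)


for level, rate, hours in [("critical", 14.4, 1), ("warning", 6.0, 6)]:
    print(f"{level}: {budget_consumed(rate, hours):.0%} of budget in {hours}h")
# critical: 2% of budget in 1h
# warning: 5% of budget in 6h
```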

Multi-Window Strategy

Use long + short windows to balance speed and noise:

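# Pseudocode: rate_<window> is the error ratio measured over that window;
# budget is the allowed error ratio (1 - SLO target, e.g. 0.001 for 99.9%).

# Critical: fast burn (14.4x over both 1h and 5m)
- alert: HighBurnRate_Critical
  expr: (rate_1h / budget > 14.4) and (rate_5m / budget > 14.4)
  severity: critical

# Warning: slower burn (6x over both 6h and 30m)
- alert: HighBurnRate_Warning
  expr: (rate_6h / budget > 6) and (rate_30m / budget > 6)
  severity: warning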
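The logic of these rules can be sketched in Python to show why both windows are required. The long window shows that a meaningful share of budget has been burned, and the short window shows the problem is still ongoing, so the alert clears soon after recovery. The function names and input values are hypothetical; the error ratios would come from your own metrics backend.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the allowed error ratio."""
    return error_ratio / (1.0 - slo_target)


def should_alert(long_window_ratio: float, short_window_ratio: float,
                 slo_target: float, threshold: float) -> bool:
    """Fire only if BOTH windows exceed the burn-rate threshold."""
    return (burn_rate(long_window_ratio, slo_target) > threshold
            and burn_rate(short_window_ratio, slo_target) > threshold)


SLO = 0.999  # hypothetical 99.9% target

# Ongoing incident: 2% errors over both 1h and 5m (about 20x burn).
print(should_alert(0.02, 0.02, SLO, threshold=14.4))    # True

# Recovered incident: the 1h window is still elevated, the 5m window is healthy.
print(should_alert(0.02, 0.0005, SLO, threshold=14.4))  # False
```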

Dashboard Essentials

  • Current burn rate
  • Error budget remaining (%)
  • Time until exhaustion at current rate
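A minimal sketch of how the last two dashboard values might be derived. It assumes a request-based budget over a 30-day window, and that `total_requests` is the expected request count for the whole window; both are assumptions, and the function names are illustrative.

```python
def budget_remaining_pct(total_requests: int, failed_requests: int,
                         slo_target: float) -> float:
    """Percent of error budget left; negative once the budget is overspent.

    Assumption: total_requests covers the whole SLO window.
    """
    allowed_failures = (1.0 - slo_target) * total_requests
    return (1.0 - failed_requests / allowed_failures) * 100.0


def days_until_exhaustion(remaining_fraction: float, current_burn_rate: float,
                          window_days: float = 30.0) -> float:
    """Days until the remaining budget is gone at the current burn rate."""
    if current_burn_rate <= 0:
        return float("inf")  # nothing is being burned
    return remaining_fraction * window_days / current_burn_rate


remaining = budget_remaining_pct(1_000_000, 600, slo_target=0.999)
print(round(remaining, 1))                                   # 40.0
print(round(days_until_exhaustion(remaining / 100, 6.0), 2)) # 2.0
```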

Anti-Patterns

  • Too many SLOs → SLO per user journey, not per endpoint
  • Alerting on raw error rate → Noisy, doesn't account for budget
  • No budget visualization → Teams don't understand burn rate

References

  • references/methodology/sli-slo-framework.md