slo-alerting
Define SLIs, SLOs, and implement burn-rate alerting
When & Why to Use This Skill
This Claude skill helps SRE and DevOps teams define Service Level Indicators (SLIs) and Service Level Objectives (SLOs), then alert on error budget consumption (burn rate) instead of noisy raw error rates. It provides templates for multi-window alert rules and a checklist of dashboard essentials, with the goal of catching real threats to availability while reducing alert fatigue.
Use Cases
- Designing Reliability Frameworks: Define quantitative SLIs like availability and latency for critical user journeys to set realistic and measurable SLO targets.
- Implementing Burn-Rate Alerting: Configure multi-window alert strategies (e.g., 14.4x burn over 1h) to detect fast error budget consumption before the budget is exhausted.
- Error Budget Management: Calculate and visualize error budgets to help product and engineering teams balance the pace of feature releases with the necessity of system stability.
- Reducing Alert Fatigue: Replace raw error rate triggers with burn-rate alerts so that on-call engineers respond only to incidents that significantly threaten the error budget.
| name | slo-alerting |
|---|---|
| description | "Define SLIs, SLOs, and implement burn-rate alerting" |
| priority | 2 |
SLO Alerting
Define SLIs, set SLO targets, alert on burn rate (not raw error rate).
Concepts
| Term | Definition | Example |
|---|---|---|
| SLI | Quantitative measure | % successful requests |
| SLO | Target for SLI | 99.9% success |
| Error Budget | Allowed failure (1 - SLO) | 0.1% ≈ 43 min per 30 days |
| Burn Rate | Budget consumption speed (1x = budget lasts exactly the SLO window) | 10x = 30-day budget exhausted in 3 days |
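The arithmetic behind the example column can be sketched as follows. This is a minimal illustration that assumes a 30-day SLO window (the source does not state the window length); the function names are hypothetical.

```python
def error_budget_fraction(slo_target: float) -> float:
    """Allowed failure fraction, e.g. 0.999 -> 0.001."""
    return 1.0 - slo_target


def budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Error budget expressed as minutes of full downtime in the window."""
    return error_budget_fraction(slo_target) * window_days * 24 * 60


def days_to_exhaustion(burn_rate: float, window_days: int = 30) -> float:
    """Days until a full budget is gone at a constant burn rate."""
    return window_days / burn_rate


print(round(budget_minutes(0.999), 1))  # 43.2
print(days_to_exhaustion(10))           # 3.0
```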
Common SLIs
Availability: successful_requests / total_requests
Latency: requests_under_threshold / total_requests
Error Rate: error_requests / total_requests
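As a rough sketch, the three ratios above could be computed from per-window request counters like this. The `RequestCounts` type and its field names are hypothetical, and two assumptions are made that the source does not state: every non-successful request counts as an error, and a window with no traffic is treated as healthy.

```python
from dataclasses import dataclass


@dataclass
class RequestCounts:
    """Hypothetical per-window counters; field names are illustrative."""
    total: int
    successful: int
    under_threshold: int  # requests faster than the latency threshold


def availability(c: RequestCounts) -> float:
    # Assumption: no traffic means nothing failed, so report 1.0.
    return c.successful / c.total if c.total > 0 else 1.0


def latency_sli(c: RequestCounts) -> float:
    return c.under_threshold / c.total if c.total > 0 else 1.0


def error_rate(c: RequestCounts) -> float:
    # Assumption: every request that did not succeed is an error.
    return (c.total - c.successful) / c.total if c.total > 0 else 0.0


counts = RequestCounts(total=10_000, successful=9_990, under_threshold=9_900)
print(availability(counts), latency_sli(counts), error_rate(counts))
```

Under these assumptions availability and error rate always sum to 1, so in practice only one of the two needs an SLO.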
Burn Rate Alerting
Alert on how fast you're consuming budget, not raw error rate:
| Alert Level | Burn Rate | Time to Exhaust (30-day window) |
|---|---|---|
| Page (critical) | 14.4x | ~2 days |
| Page (warning) | 6x | 5 days |
| Ticket (medium) | 3x | 10 days |
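One common way to arrive at thresholds like these is to decide what fraction of the budget may be consumed within an alert window, then convert that to a burn rate. The sketch below assumes a 30-day SLO window, and the budget fractions (2% in 1h, 5% in 6h, 10% in 24h) are assumptions chosen because they reproduce the table's values; the source does not say how its thresholds were derived.

```python
def burn_rate_threshold(budget_fraction: float,
                        alert_window_hours: float,
                        slo_window_days: int = 30) -> float:
    """Burn rate at which `budget_fraction` of the error budget is
    consumed within `alert_window_hours`."""
    slo_window_hours = slo_window_days * 24
    return budget_fraction * slo_window_hours / alert_window_hours


# (budget fraction, alert window in hours): assumed pairings, see lead-in.
for fraction, hours in [(0.02, 1), (0.05, 6), (0.10, 24)]:
    print(f"{fraction:.0%} of budget in {hours}h -> "
          f"{burn_rate_threshold(fraction, hours):.1f}x")
```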
Multi-Window Strategy
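Pair a long window with a short one to balance speed and noise. The long window shows the burn is sustained; the short window shows it is still happening, so the alert stops firing soon after recovery: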
# rate_<window> is the error ratio over that window; budget is 1 - SLO target.
# Critical: fast burn (14.4x over both 1h and 5m)
- alert: HighBurnRate_Critical
  expr: (rate_1h / budget > 14.4) and (rate_5m / budget > 14.4)
  labels:
    severity: critical
# Warning: slower burn (6x over both 6h and 30m)
- alert: HighBurnRate_Warning
  expr: (rate_6h / budget > 6) and (rate_30m / budget > 6)
  labels:
    severity: warning
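To see why both windows are required, the same condition can be modeled in a few lines of Python. This is only an illustrative sketch of the rule logic, not a replacement for the alert rules; the function names and the sample error ratios are hypothetical.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Error ratio divided by the error budget (1 - SLO target)."""
    return error_ratio / (1.0 - slo_target)


def should_fire(long_window_errors: float, short_window_errors: float,
                slo_target: float, threshold: float) -> bool:
    """Fire only if BOTH windows exceed the burn-rate threshold."""
    return (burn_rate(long_window_errors, slo_target) > threshold
            and burn_rate(short_window_errors, slo_target) > threshold)


SLO = 0.999  # assumed target, matching the Concepts table example

# Sustained incident: 2% errors over both 1h and 5m (about 20x burn).
print(should_fire(0.02, 0.02, SLO, 14.4))    # True
# Already recovered: 1h is still high, but the last 5m are healthy.
print(should_fire(0.02, 0.0005, SLO, 14.4))  # False
# Brief spike: 5m is high, but the 1h window shows no sustained burn.
print(should_fire(0.001, 0.05, SLO, 14.4))   # False
```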
Dashboard Essentials
- Current burn rate
- Error budget remaining (%)
- Time until exhaustion at current rate
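The last two panels can be derived from event counts and the current burn rate. The sketch below assumes a 30-day SLO window and roughly steady traffic, so that the share of bad events approximates the share of budget consumed; all names are hypothetical.

```python
def budget_remaining_fraction(bad_events: int, total_events: int,
                              slo_target: float) -> float:
    """Fraction of the error budget left, from counts over the SLO window."""
    if total_events == 0:
        return 1.0  # assumption: no traffic consumes no budget
    consumed = (bad_events / total_events) / (1.0 - slo_target)
    return max(0.0, 1.0 - consumed)


def days_until_exhaustion(remaining_fraction: float, current_burn_rate: float,
                          window_days: int = 30) -> float:
    """Days left if the current burn rate stays constant."""
    if current_burn_rate <= 0:
        return float("inf")  # nothing is being consumed
    return remaining_fraction * window_days / current_burn_rate


remaining = budget_remaining_fraction(400, 1_000_000, slo_target=0.999)
print(f"Budget remaining: {remaining:.0%}")
print(f"Days until exhaustion at 6x: {days_until_exhaustion(remaining, 6):.1f}")
```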
Anti-Patterns
- Too many SLOs → Hard to prioritize; define one SLO per user journey, not per endpoint
- Alerting on raw error rate → Noisy, doesn't account for budget
- No budget visualization → Teams don't understand burn rate
References
references/methodology/sli-slo-framework.md