error-analysis
This skill should be used when analyzing errors, stack traces, and logs to identify root causes and implement fixes.
## When & Why to Use This Skill
This Claude skill provides a structured workflow for software troubleshooting: reading stack traces, interpreting error logs, and tracing failures back to a root cause. It covers the path from error detection to a code-level fix, with the goal of reducing Mean Time to Repair (MTTR) and improving system stability.
### Use Cases
- Production Incident Debugging: Analyzing real-time stack traces and environment context to identify and resolve critical runtime errors like null pointer exceptions or race conditions.
- Log Pattern Recognition: Correlating distributed logs using request IDs to reconstruct event timelines and identify the source of intermittent failures.
- Root Cause Analysis (RCA): Conducting structured investigations into system regressions to document why an error occurred and generate comprehensive post-mortem reports.
- Automated Error Categorization: Classifying incoming errors by severity and type (e.g., Infrastructure vs. Application) to prioritize bug fixes and resource allocation.
# Error Analysis Skill
Systematically analyze errors and logs to find root causes.
## When to Use
- Debugging production errors
- Analyzing stack traces
- Investigating log patterns
- Performing root cause analysis
- Categorizing error types
## Reference Documents
- Log Patterns - Common error patterns by type
- Root Cause Analysis - RCA techniques
- Error Categorization - Classifying errors
- Fix Patterns - Common fixes by error type
## Analysis Workflow
### 1. Gather Information
## Error Report
### Error Message

    TypeError: Cannot read property 'id' of undefined
        at UserService.getUser (src/services/user.ts:45:23)
        at async Router.handle (src/api/routes.ts:67:12)

### Context
- **Environment:** Production
- **Time:** 2024-01-15 14:30 UTC
- **Frequency:** 15 occurrences in last hour
- **Affected Users:** ~5% of requests
- **Recent Changes:** Deploy at 14:00 UTC
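The report fields above can be captured in a typed record so that later steps work from the same data. The sketch below is one possible shape, not a prescribed schema: the `ErrorReport` interface, its field names, and the `minutesSinceDeploy` helper are all assumptions made for illustration.

```typescript
// Hypothetical shape for the report fields listed above; names are illustrative.
interface ErrorReport {
  message: string;
  environment: "production" | "staging" | "development";
  observedAt: Date;             // when the report was taken
  occurrencesLastHour: number;
  affectedRequestShare: number; // fraction of requests, 0..1
  lastDeployAt?: Date;          // most recent deploy, if known
}

// Minutes between the last deploy and the observation,
// or null when no deploy is recorded.
function minutesSinceDeploy(report: ErrorReport): number | null {
  if (!report.lastDeployAt) return null;
  const ms = report.observedAt.getTime() - report.lastDeployAt.getTime();
  return Math.round(ms / 60000);
}

const sampleReport: ErrorReport = {
  message: "TypeError: Cannot read property 'id' of undefined",
  environment: "production",
  observedAt: new Date("2024-01-15T14:30:00Z"),
  occurrencesLastHour: 15,
  affectedRequestShare: 0.05,
  lastDeployAt: new Date("2024-01-15T14:00:00Z"),
};

console.log(minutesSinceDeploy(sampleReport)); // 30
```

A short gap between deploy and first report is a lead for the root cause step, not proof that the deploy is at fault.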
### 2. Categorize Error
## Error Classification
**Type:** Runtime Error
**Category:** Null/Undefined Reference
**Severity:** High (user-facing)
### Common Causes
1. Missing data validation
2. Race condition
3. API contract change
4. Data migration issue
### 3. Root Cause Analysis
## Root Cause Investigation
### Timeline
- 14:00 - Deploy to production
- 14:25 - First error reported
- 14:30 - Error rate increased
### Hypothesis 1: Deploy introduced bug
- Check: git diff between versions
- Result: New code path added
### Hypothesis 2: Data issue
- Check: Query for affected users
- Result: All have specific condition
### Root Cause
Deploy introduced code that assumes `user.profile` exists,
but 5% of users don't have profiles (legacy accounts).
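The "all affected users have a specific condition" check in Hypothesis 2 can be sketched as a search for attributes shared by every affected record. This is a minimal sketch under assumptions: the `UserFlags` type, the `sharedFlags` helper, and the sample flags are hypothetical, and only boolean attributes are compared.

```typescript
// Hypothetical user record; only boolean flags are compared in this sketch.
type UserFlags = Record<string, boolean>;

// Returns the flags whose value is identical across every affected user.
function sharedFlags(affected: UserFlags[]): UserFlags {
  const shared: UserFlags = {};
  const first = affected[0];
  if (first === undefined) return shared;
  for (const key of Object.keys(first)) {
    const value = first[key];
    if (affected.every((u) => u[key] === value)) {
      shared[key] = value;
    }
  }
  return shared;
}

// Illustrative data, not taken from a real system.
const affectedUsers: UserFlags[] = [
  { hasProfile: false, isAdmin: false, emailVerified: true },
  { hasProfile: false, isAdmin: true, emailVerified: true },
  { hasProfile: false, isAdmin: false, emailVerified: false },
];

console.log(sharedFlags(affectedUsers)); // { hasProfile: false }
```

A shared flag only narrows the search. Before treating it as the cause, compare against unaffected users to confirm they do not share the same value.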
### 4. Implement Fix
## Fix Implementation
### Short-term (Hotfix)
```typescript
// Fall back to the account id when the profile is missing (legacy accounts)
const userId = user.profile?.id ?? user.id;
```
### Long-term
- Add data validation at API boundary
- Migrate legacy accounts
- Add regression test
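The regression test from the long-term list might look like the following. It is a sketch under assumptions: `UserAccount`, `UserProfile`, `resolveUserId`, and `check` are hypothetical names, and the checks use a plain helper rather than any particular test framework.

```typescript
// Hypothetical types: legacy accounts have no `profile`.
interface UserProfile {
  id: string;
}

interface UserAccount {
  id: string;
  profile?: UserProfile;
}

// The hotfix's fallback logic: prefer the profile id, else the account id.
function resolveUserId(user: UserAccount): string {
  return user.profile?.id ?? user.id;
}

// Minimal assertion helper so the sketch runs without a test framework.
function check(actual: string, expected: string): void {
  if (actual !== expected) {
    throw new Error(`expected ${expected}, got ${actual}`);
  }
}

check(resolveUserId({ id: "u1", profile: { id: "p1" } }), "p1");
check(resolveUserId({ id: "u2" }), "u2"); // legacy account without a profile
```

The second check reproduces the failing case from the root cause, so it would have failed before the hotfix and guards against the same regression later.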
## Error Categories
### By Origin
| Category | Description | Example |
|----------|-------------|---------|
| Client | User/browser errors | Invalid input |
| Server | Application errors | Null reference |
| Infrastructure | System errors | Connection timeout |
| External | Third-party errors | API rate limit |
### By Severity
| Level | Impact | Response |
|-------|--------|----------|
| Critical | System down | Immediate |
| High | Major feature broken | Hours |
| Medium | Degraded experience | This sprint |
| Low | Minor inconvenience | Backlog |
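The two tables can be encoded as types and used to order a triage queue. In this sketch the response targets are copied from the severity table, while the type names, the `triage` function, and the severity assigned to each sample error are assumptions for illustration.

```typescript
type ErrorOrigin = "client" | "server" | "infrastructure" | "external";
type Severity = "critical" | "high" | "medium" | "low";

// Response targets copied from the severity table above.
const RESPONSE_TARGET: Record<Severity, string> = {
  critical: "Immediate",
  high: "Hours",
  medium: "This sprint",
  low: "Backlog",
};

// Lower rank means more urgent.
const SEVERITY_RANK: Record<Severity, number> = {
  critical: 0,
  high: 1,
  medium: 2,
  low: 3,
};

interface TriagedError {
  message: string;
  origin: ErrorOrigin;
  severity: Severity;
}

// Returns a new array ordered from most to least urgent; the input is not mutated.
function triage(errors: TriagedError[]): TriagedError[] {
  return errors
    .slice()
    .sort((a, b) => SEVERITY_RANK[a.severity] - SEVERITY_RANK[b.severity]);
}

// Severities here are illustrative; real ones depend on impact.
const triageQueue = triage([
  { message: "Invalid input", origin: "client", severity: "low" },
  { message: "Connection timeout", origin: "infrastructure", severity: "critical" },
  { message: "Null reference", origin: "server", severity: "high" },
]);

for (const e of triageQueue) {
  console.log(`${e.severity} (${RESPONSE_TARGET[e.severity]}): ${e.message}`);
}
```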
## Stack Trace Analysis
### Reading Stack Traces
```markdown
## Stack Trace Components

Error: Connection refused          ← Error type and message
    at Database.connect            ← Where error was thrown
      (src/db/connection.ts:23)    ← File and line
    at async initialize            ← Call chain
      (src/server.ts:45)
    at async main                  ← Entry point
      (src/index.ts:12)

### Key Questions
1. Where was the error thrown? (top of stack)
2. What called that code? (stack trace)
3. What was the input/state? (logs)
4. What changed recently? (git history)
```
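The first two key questions can be answered mechanically by splitting a trace into frames. The sketch below assumes V8-style frames of the form `at name (file:line:column)`, as in the traces shown in this document; `StackFrame`, `FRAME_PATTERN`, and `parseStackTrace` are hypothetical names, and frames in other formats are skipped.

```typescript
interface StackFrame {
  fn: string;            // function name, e.g. "Database.connect"
  file: string;
  line: number;
  column: number | null; // null when the frame has no column
}

// Matches V8-style frames: "at [async ]name (file:line[:column])".
// Frames without a parenthesized location are skipped in this sketch.
const FRAME_PATTERN = /^\s*at (?:async )?(.+?) \((.+?):(\d+)(?::(\d+))?\)\s*$/;

function parseStackTrace(trace: string): { message: string; frames: StackFrame[] } {
  const lines = trace.split("\n");
  const frames: StackFrame[] = [];
  for (const raw of lines.slice(1)) {
    const m = FRAME_PATTERN.exec(raw);
    if (m === null) continue;
    frames.push({
      fn: m[1],
      file: m[2],
      line: Number(m[3]),
      column: m[4] === undefined ? null : Number(m[4]),
    });
  }
  return { message: lines[0].trim(), frames };
}

const parsedTrace = parseStackTrace(
  [
    "TypeError: Cannot read property 'id' of undefined",
    "    at UserService.getUser (src/services/user.ts:45:23)",
    "    at async Router.handle (src/api/routes.ts:67:12)",
  ].join("\n"),
);

// frames[0] is the top of the stack (where the error was thrown);
// the remaining frames are the call chain.
console.log(parsedTrace.frames[0]);
```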
### Common Patterns
## Null Reference
`TypeError: Cannot read property 'x' of undefined`

**Check:** Variable existence, API response shape

## Connection Error
`Error: ECONNREFUSED 127.0.0.1:5432`

**Check:** Service running, network config, credentials

## Timeout
`Error: Operation timed out after 30000ms`

**Check:** Service health, query performance, network

## Memory
`FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed`

**Check:** Memory leaks, large data processing
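The four patterns above can be turned into a small lookup that maps an error message to its checklist. This is a sketch, assuming simple regular expressions are enough to tell the patterns apart; `PatternRule`, `RULES`, and `classifyError` are hypothetical names, and the expressions are illustrative rather than exhaustive.

```typescript
interface PatternRule {
  name: string;
  pattern: RegExp;
  checks: string[];
}

// Rules mirror the four patterns above. The first matching rule wins.
const RULES: PatternRule[] = [
  {
    name: "Null Reference",
    pattern: /Cannot read propert(?:y|ies)\b.* of (?:undefined|null)/,
    checks: ["Variable existence", "API response shape"],
  },
  {
    name: "Connection Error",
    pattern: /ECONNREFUSED/,
    checks: ["Service running", "Network config", "Credentials"],
  },
  {
    name: "Timeout",
    pattern: /timed out/i,
    checks: ["Service health", "Query performance", "Network"],
  },
  {
    name: "Memory",
    pattern: /Allocation failed|heap out of memory/i,
    checks: ["Memory leaks", "Large data processing"],
  },
];

// Returns the first rule that matches, or null for an unrecognized message.
function classifyError(message: string): PatternRule | null {
  for (const rule of RULES) {
    if (rule.pattern.test(message)) return rule;
  }
  return null;
}

const matched = classifyError("Error: ECONNREFUSED 127.0.0.1:5432");
console.log(matched === null ? "unclassified" : `${matched.name}: ${matched.checks.join(", ")}`);
```

Messages that match no rule return `null`, which is a prompt to investigate by hand and possibly add a new rule.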
## Log Analysis
### Correlation
## Correlating Logs
### Using Request ID
```bash
grep "req-12345" application.log
```
### Timeline Reconstruction
```text
14:30:00.123 [req-12345] User login started
14:30:00.456 [req-12345] Fetching user profile
14:30:00.789 [req-12345] ERROR: Profile not found
14:30:00.790 [req-12345] TypeError: Cannot read 'id'
```
### Pattern Identification
```bash
# Both commands assume the line format shown in the timeline above.
# Count error types (message text after "ERROR: ")
grep "ERROR" app.log | sed 's/.*ERROR: //' | sort | uniq -c | sort -rn
# Find error spikes (errors per minute, from the HH:MM prefix)
grep "ERROR" app.log | cut -c1-5 | uniq -c
```
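The same correlation and counting can be done in code when logs are already loaded in memory. The sketch below assumes the line format from the timeline (`HH:MM:SS.mmm [request-id] message`) and a single day of logs; `LogEntry`, `parseLogLine`, `timelineFor`, and `errorsPerMinute` are hypothetical names, and the last sample line is invented for illustration.

```typescript
interface LogEntry {
  time: string; // "HH:MM:SS.mmm"; fixed width, so string order is time order
  requestId: string;
  text: string;
}

// Assumed line format: "14:30:00.123 [req-12345] message".
const LOG_LINE = /^(\d{2}:\d{2}:\d{2}\.\d{3}) \[([^\]]+)\] (.*)$/;

function parseLogLine(line: string): LogEntry | null {
  const m = LOG_LINE.exec(line);
  return m === null ? null : { time: m[1], requestId: m[2], text: m[3] };
}

// All entries for one request, ordered by timestamp.
function timelineFor(lines: string[], requestId: string): LogEntry[] {
  const entries: LogEntry[] = [];
  for (const line of lines) {
    const entry = parseLogLine(line);
    if (entry !== null && entry.requestId === requestId) entries.push(entry);
  }
  return entries.sort((a, b) => (a.time < b.time ? -1 : a.time > b.time ? 1 : 0));
}

// Number of ERROR entries per minute ("HH:MM"), to spot spikes.
function errorsPerMinute(lines: string[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const line of lines) {
    const entry = parseLogLine(line);
    if (entry === null || entry.text.indexOf("ERROR") === -1) continue;
    const minute = entry.time.slice(0, 5);
    counts[minute] = (counts[minute] ?? 0) + 1;
  }
  return counts;
}

const logLines = [
  "14:30:00.123 [req-12345] User login started",
  "14:30:00.456 [req-12345] Fetching user profile",
  "14:30:00.789 [req-12345] ERROR: Profile not found",
  "14:30:00.790 [req-12345] TypeError: Cannot read 'id'",
  "14:31:02.000 [req-67890] ERROR: Profile not found", // invented second request
];

console.log(timelineFor(logLines, "req-12345").map((e) => e.text));
console.log(errorsPerMinute(logLines)); // { "14:30": 1, "14:31": 1 }
```

Note that the match on `ERROR` is case-sensitive, so the `TypeError` line is not counted; whether that is desirable depends on how the application logs exceptions.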
## Resolution Tracking
### Fix Verification
```markdown
## Fix Verification Checklist
- [ ] Error no longer reproducible locally
- [ ] Unit test added for fix
- [ ] Integration test added
- [ ] Deployed to staging
- [ ] Verified in staging
- [ ] Deployed to production
- [ ] Monitoring error rate
- [ ] Error rate returned to baseline
```
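The last checklist item, "error rate returned to baseline", can be made concrete with a simple threshold comparison. This is a sketch under assumptions: the `backToBaseline` helper, the 10% default tolerance, and the sample numbers are all hypothetical, and a real check would also account for how long the rate has stayed low.

```typescript
// Returns true when the observed error rate is at or below the baseline rate
// plus a relative tolerance (default 10%).
function backToBaseline(
  errors: number,
  requests: number,
  baselineRate: number,
  tolerance: number = 0.1,
): boolean {
  if (requests <= 0) return false; // no traffic, nothing can be concluded
  const rate = errors / requests;
  return rate <= baselineRate * (1 + tolerance);
}

// Illustrative numbers: 15 errors in 300 requests against a 0.1% baseline.
console.log(backToBaseline(15, 300, 0.001)); // false
// After the fix: 1 error in 1000 requests.
console.log(backToBaseline(1, 1000, 0.001)); // true
```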
### Post-mortem Template
# Incident Post-mortem: [Title]
## Summary
Brief description of what happened.
## Timeline
- HH:MM - Event
- HH:MM - Detection
- HH:MM - Resolution
## Impact
- Users affected: X
- Duration: Y hours
- Revenue impact: $Z
## Root Cause
Detailed explanation.
## Resolution
What was done to fix it.
## Lessons Learned
1. What went well
2. What went poorly
3. Where we got lucky
## Action Items
- [ ] Prevent: [Action]
- [ ] Detect: [Action]
- [ ] Respond: [Action]
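The durations in the template's Timeline and Impact sections can be derived from the recorded times. The sketch below assumes `HH:MM` timestamps on a single day; `toMinutes`, `IncidentTimes`, and `incidentDurations` are hypothetical names, and the resolution time in the example is invented.

```typescript
// Converts "HH:MM" to minutes after midnight.
function toMinutes(hhmm: string): number {
  const m = /^(\d{1,2}):(\d{2})$/.exec(hhmm);
  if (m === null) throw new Error(`invalid time: ${hhmm}`);
  return Number(m[1]) * 60 + Number(m[2]);
}

interface IncidentTimes {
  started: string;  // when the triggering event happened
  detected: string; // when the problem was first noticed
  resolved: string; // when the fix took effect
}

// Assumes all three times fall on the same day. Results are in minutes.
function incidentDurations(t: IncidentTimes): { timeToDetect: number; timeToResolve: number } {
  const start = toMinutes(t.started);
  return {
    timeToDetect: toMinutes(t.detected) - start,
    timeToResolve: toMinutes(t.resolved) - start,
  };
}

// 14:00 deploy and 14:25 first report, as in the earlier timeline;
// the 15:10 resolution time is invented for this example.
console.log(incidentDurations({ started: "14:00", detected: "14:25", resolved: "15:10" }));
// { timeToDetect: 25, timeToResolve: 70 }
```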