error-analysis

from jayteealao

This skill should be used when analyzing errors, stack traces, and logs to identify root causes and implement fixes.

Updated Jan 11, 2026

When & Why to Use This Skill

This Claude skill provides a structured workflow for troubleshooting software: reading stack traces, correlating error logs, and tracing failures back to their root cause. It covers the path from error detection to a code-level fix, with the aim of reducing Mean Time to Repair (MTTR) and improving system stability.

Use Cases

  • Production Incident Debugging: Analyzing real-time stack traces and environment context to identify and resolve critical runtime errors like null pointer exceptions or race conditions.
  • Log Pattern Recognition: Correlating distributed logs using request IDs to reconstruct event timelines and identify the source of intermittent failures.
  • Root Cause Analysis (RCA): Conducting structured investigations into system regressions to document why an error occurred and generate comprehensive post-mortem reports.
  • Automated Error Categorization: Classifying incoming errors by severity and type (e.g., Infrastructure vs. Application) to prioritize bug fixes and resource allocation.
name: error-analysis
description: This skill should be used when analyzing errors, stack traces, and logs to identify root causes and implement fixes.

Error Analysis Skill

Systematically analyze errors and logs to find root causes.

When to Use

  • Debugging production errors
  • Analyzing stack traces
  • Investigating log patterns
  • Performing root cause analysis
  • Categorizing error types

Reference Documents

Analysis Workflow

1. Gather Information

## Error Report

### Error Message

    TypeError: Cannot read property 'id' of undefined
        at UserService.getUser (src/services/user.ts:45:23)
        at async Router.handle (src/api/routes.ts:67:12)


### Context
- **Environment:** Production
- **Time:** 2024-01-15 14:30 UTC
- **Frequency:** 15 occurrences in last hour
- **Affected Users:** ~5% of requests
- **Recent Changes:** Deploy at 14:00 UTC

2. Categorize Error

## Error Classification

**Type:** Runtime Error
**Category:** Null/Undefined Reference
**Severity:** High (user-facing)

### Common Causes
1. Missing data validation
2. Race condition
3. API contract change
4. Data migration issue

3. Root Cause Analysis

## Root Cause Investigation

### Timeline
- 14:00 - Deploy to production
- 14:25 - First error reported
- 14:30 - Error rate increased

### Hypothesis 1: Deploy introduced bug
- Check: git diff between versions
- Result: New code path added

### Hypothesis 2: Data issue
- Check: Query for affected users
- Result: All affected users share one condition (legacy accounts with no profile)

### Root Cause
Deploy introduced code that assumes `user.profile` exists,
but 5% of users don't have profiles (legacy accounts).

4. Implement Fix

## Fix Implementation

### Short-term (Hotfix)
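```typescript
// Add null check: fall back to the account id when the profile is missing.
// `user?.id` (not `user.id`) keeps the fallback itself from throwing.
const userId = user?.profile?.id ?? user?.id;
```

### Long-term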

  1. Add data validation at API boundary
  2. Migrate legacy accounts
  3. Add regression test
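
The hotfix and the regression test from the long-term list can be sketched together. This is a minimal sketch under assumptions: the `User` and `UserProfile` shapes and the `resolveUserId` helper are hypothetical names introduced for illustration, and the real types in `src/services/user.ts` may differ.

```typescript
// Hypothetical shapes for illustration; the real types may differ.
interface UserProfile {
  id: string;
}

interface User {
  id: string;
  profile?: UserProfile; // legacy accounts have no profile
}

// Prefer the profile id, fall back to the account id, and return
// undefined instead of throwing when the user itself is missing.
function resolveUserId(user: User | undefined): string | undefined {
  return user?.profile?.id ?? user?.id;
}

// Regression cases: a user with a profile, a legacy user without one,
// and a missing user.
const withProfile: User = { id: "u1", profile: { id: "p1" } };
const legacyUser: User = { id: "u2" };

console.log(resolveUserId(withProfile)); // "p1"
console.log(resolveUserId(legacyUser)); // "u2"
console.log(resolveUserId(undefined)); // undefined
```

Wrapping the expression in a small function is a design choice made here so the legacy-account case can be tested in isolation; in a real test suite the three calls would become assertions.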

## Error Categories

### By Origin

| Category | Description | Example |
|----------|-------------|---------|
| Client | User/browser errors | Invalid input |
| Server | Application errors | Null reference |
| Infrastructure | System errors | Connection timeout |
| External | Third-party errors | API rate limit |

### By Severity

| Level | Impact | Response |
|-------|--------|----------|
| Critical | System down | Immediate |
| High | Major feature broken | Hours |
| Medium | Degraded experience | This sprint |
| Low | Minor inconvenience | Backlog |
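
The two tables above can be encoded as types so that a queue of errors is ordered by severity. The sketch below is illustrative only: `TriagedError`, `RESPONSE_TARGET`, `SEVERITY_RANK`, and `triage` are hypothetical names, not part of the skill.

```typescript
// Categories and levels copied from the tables above.
type ErrorOrigin = "Client" | "Server" | "Infrastructure" | "External";
type Severity = "Critical" | "High" | "Medium" | "Low";

interface TriagedError {
  message: string;
  origin: ErrorOrigin;
  severity: Severity;
}

// Response column of the severity table.
const RESPONSE_TARGET: Record<Severity, string> = {
  Critical: "Immediate",
  High: "Hours",
  Medium: "This sprint",
  Low: "Backlog",
};

// Lower rank means more urgent.
const SEVERITY_RANK: Record<Severity, number> = {
  Critical: 0,
  High: 1,
  Medium: 2,
  Low: 3,
};

// Return a new array ordered from most to least severe.
function triage(errors: TriagedError[]): TriagedError[] {
  return [...errors].sort(
    (a, b) => SEVERITY_RANK[a.severity] - SEVERITY_RANK[b.severity],
  );
}

const queue = triage([
  { message: "Invalid input", origin: "Client", severity: "Low" },
  { message: "Connection timeout", origin: "Infrastructure", severity: "Critical" },
  { message: "Null reference", origin: "Server", severity: "High" },
]);

for (const e of queue) {
  console.log(`${e.severity} (${RESPONSE_TARGET[e.severity]}): ${e.message}`);
}
```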

## Stack Trace Analysis

### Reading Stack Traces

```markdown
## Stack Trace Components

Error: Connection refused            ← Error type and message
    at Database.connect              ← Where error was thrown
      (src/db/connection.ts:23)      ← File and line
    at async initialize              ← Call chain
      (src/server.ts:45)
    at async main                    ← Entry point
      (src/index.ts:12)
```


### Key Questions
1. Where was the error thrown? (top of stack)
2. What called that code? (stack trace)
3. What was the input/state? (logs)
4. What changed recently? (git history)
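
The first two questions can be answered mechanically by splitting a trace into frames. The following is a rough sketch, assuming Node/V8-style frames of the form `at fn (file:line:column)`; `StackFrame`, `FRAME_RE`, and `parseStack` are hypothetical names, and frames in other shapes (for example, without parentheses) are simply skipped.

```typescript
interface StackFrame {
  fn: string; // function or method name
  file: string;
  line: number;
  column: number | undefined; // not every trace includes a column
}

// Matches frames such as "at async Router.handle (src/api/routes.ts:67:12)".
const FRAME_RE = /^\s*at\s+(?:async\s+)?(.+?)\s+\((.+?):(\d+)(?::(\d+))?\)\s*$/;

function parseStack(trace: string): { message: string; frames: StackFrame[] } {
  const lines = trace.split("\n").filter((l) => l.trim() !== "");
  const message = lines[0]?.trim() ?? "";
  const frames: StackFrame[] = [];
  for (const line of lines.slice(1)) {
    const m = FRAME_RE.exec(line);
    if (!m) continue; // skip lines that are not recognizable frames
    frames.push({
      fn: m[1] ?? "",
      file: m[2] ?? "",
      line: Number(m[3]),
      column: m[4] !== undefined ? Number(m[4]) : undefined,
    });
  }
  return { message, frames };
}

const parsed = parseStack(
  [
    "TypeError: Cannot read property 'id' of undefined",
    "    at UserService.getUser (src/services/user.ts:45:23)",
    "    at async Router.handle (src/api/routes.ts:67:12)",
  ].join("\n"),
);

const thrownAt = parsed.frames[0]; // question 1: top of stack
const callers = parsed.frames.slice(1); // question 2: the call chain
console.log(thrownAt?.fn, thrownAt?.file, thrownAt?.line);
console.log(callers.map((f) => f.fn));
```

Questions 3 and 4 still need logs and git history; the parser only narrows down where to look.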

### Common Patterns

## Null Reference

TypeError: Cannot read property 'x' of undefined

**Check:** Variable existence, API response shape

## Connection Error

Error: ECONNREFUSED 127.0.0.1:5432

**Check:** Service running, network config, credentials

## Timeout

Error: Operation timed out after 30000ms

**Check:** Service health, query performance, network

## Memory

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory

**Check:** Memory leaks, large data processing
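
The four patterns above lend themselves to a small lookup table that maps an error message to its suggested checks. This is only a sketch: `KnownPattern`, `KNOWN_PATTERNS`, and `suggestChecks` are hypothetical names, and the regular expressions are assumptions tuned to the example messages in this section (plus the newer V8 wording "Cannot read properties of undefined").

```typescript
interface KnownPattern {
  name: string;
  pattern: RegExp;
  checks: string[];
}

// Patterns and checks mirror the list above; the regexes are assumptions.
const KNOWN_PATTERNS: KnownPattern[] = [
  {
    name: "Null Reference",
    pattern: /Cannot read propert(?:y|ies) .*of (?:undefined|null)/,
    checks: ["Variable existence", "API response shape"],
  },
  {
    name: "Connection Error",
    pattern: /ECONNREFUSED/,
    checks: ["Service running", "Network config", "Credentials"],
  },
  {
    name: "Timeout",
    pattern: /timed out/i,
    checks: ["Service health", "Query performance", "Network"],
  },
  {
    name: "Memory",
    pattern: /Allocation failed|heap out of memory/,
    checks: ["Memory leaks", "Large data processing"],
  },
];

// Return the first known pattern that matches, or undefined for
// messages that need manual analysis.
function suggestChecks(message: string): KnownPattern | undefined {
  return KNOWN_PATTERNS.find((p) => p.pattern.test(message));
}

const hit = suggestChecks("Error: ECONNREFUSED 127.0.0.1:5432");
console.log(hit?.name, hit?.checks);
```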

## Log Analysis

### Correlation

## Correlating Logs

### Using Request ID
```bash
grep "req-12345" application.log
```

### Timeline Reconstruction

14:30:00.123 [req-12345] User login started
14:30:00.456 [req-12345] Fetching user profile
14:30:00.789 [req-12345] ERROR: Profile not found
14:30:00.790 [req-12345] TypeError: Cannot read 'id'
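
A timeline like the one above can also be rebuilt programmatically when log lines arrive out of order or interleaved with other requests. The sketch below assumes every line has the shape `HH:MM:SS.mmm [request-id] message` and that all lines fall on the same day; `LogEntry`, `parseLine`, and `timelineFor` are hypothetical names.

```typescript
interface LogEntry {
  time: string; // "HH:MM:SS.mmm"
  requestId: string;
  message: string;
}

// Assumed line shape: "14:30:00.123 [req-12345] User login started"
const LINE_RE = /^(\d{2}:\d{2}:\d{2}\.\d{3})\s+\[([^\]]+)\]\s+(.*)$/;

function parseLine(line: string): LogEntry | undefined {
  const m = LINE_RE.exec(line);
  if (!m) return undefined; // ignore lines in any other format
  return { time: m[1] ?? "", requestId: m[2] ?? "", message: m[3] ?? "" };
}

// Keep only the entries for one request, ordered by timestamp.
// Fixed-width timestamps sort correctly as plain strings.
function timelineFor(requestId: string, lines: string[]): LogEntry[] {
  return lines
    .map((line) => parseLine(line))
    .filter((e): e is LogEntry => e !== undefined && e.requestId === requestId)
    .sort((a, b) => (a.time < b.time ? -1 : a.time > b.time ? 1 : 0));
}

const timeline = timelineFor("req-12345", [
  "14:30:00.789 [req-12345] ERROR: Profile not found",
  "14:30:00.300 [req-67890] Health check",
  "14:30:00.123 [req-12345] User login started",
  "14:30:00.456 [req-12345] Fetching user profile",
]);

for (const entry of timeline) {
  console.log(entry.time, entry.message);
}
```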

### Pattern Identification

    # Count error types (assumes lines contain "ERROR: <message>")
    grep -o "ERROR: .*" app.log | sort | uniq -c | sort -rn

    # Find error spikes: errors per minute (assumes lines start with HH:MM:SS)
    grep "ERROR" app.log | cut -c1-5 | uniq -c
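
The same spike detection can be sketched in code, which makes it easier to apply a threshold. This is a minimal sketch assuming each log line starts with an `HH:MM:SS` timestamp and that error lines contain the literal `ERROR`; `errorsPerMinute`, `findSpikes`, and the threshold value are hypothetical.

```typescript
// Count lines containing "ERROR", bucketed by the leading "HH:MM".
function errorsPerMinute(lines: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const line of lines) {
    if (!line.includes("ERROR")) continue;
    const minute = line.slice(0, 5); // "HH:MM"
    counts.set(minute, (counts.get(minute) ?? 0) + 1);
  }
  return counts;
}

// Minutes whose error count reaches the threshold.
function findSpikes(counts: Map<string, number>, threshold: number): string[] {
  return Array.from(counts.entries())
    .filter(([, n]) => n >= threshold)
    .map(([minute]) => minute);
}

const perMinute = errorsPerMinute([
  "14:29:58.100 [req-12001] ERROR: Profile not found",
  "14:30:00.789 [req-12345] ERROR: Profile not found",
  "14:30:01.002 [req-12346] User login started",
  "14:30:02.417 [req-12347] ERROR: Profile not found",
]);

console.log(findSpikes(perMinute, 2));
```

A sensible threshold depends on the service's baseline error rate, so the value `2` here is purely illustrative.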

## Resolution Tracking

### Fix Verification

```markdown
## Fix Verification Checklist

- [ ] Error no longer reproducible locally
- [ ] Unit test added for fix
- [ ] Integration test added
- [ ] Deployed to staging
- [ ] Verified in staging
- [ ] Deployed to production
- [ ] Monitoring error rate
- [ ] Error rate returned to baseline
```

### Post-mortem Template

# Incident Post-mortem: [Title]

## Summary
Brief description of what happened.

## Timeline
- HH:MM - Event
- HH:MM - Detection
- HH:MM - Resolution

## Impact
- Users affected: X
- Duration: Y hours
- Revenue impact: $Z

## Root Cause
Detailed explanation.

## Resolution
What was done to fix it.

## Lessons Learned
1. What went well
2. What went poorly
3. Where we got lucky

## Action Items
- [ ] Prevent: [Action]
- [ ] Detect: [Action]
- [ ] Respond: [Action]