## When & Why to Use This Skill

The Systematic Debugging skill provides a structured 4-phase Root Cause Analysis (RCA) framework designed to identify, isolate, and resolve complex software issues efficiently. By following a rigorous process of reproduction, evidence gathering, hypothesis formulation, and systematic testing, it helps developers avoid trial-and-error approaches and implement long-term fixes for logic errors, performance bottlenecks, and system failures.

### Use Cases

- Case 1: Identifying and fixing elusive logic errors such as off-by-one mistakes, null pointer exceptions, and incorrect state handling in various programming languages.
- Case 2: Troubleshooting concurrency and timing issues, including race conditions in multi-threaded applications and async/await errors in modern JavaScript environments.
- Case 3: Diagnosing system-level performance degradation and memory leaks using profiling tools, log analysis, and resource monitoring.
- Case 4: Resolving API integration failures and database bottlenecks by analyzing request/response cycles and optimizing slow SQL queries.
- Case 5: Implementing defensive programming and observability strategies to prevent bug recurrence and improve system reliability through structured logging and health checks.
| name | Systematic Debugging |
|---|---|
| description | This skill should be used when the user asks to "debug this issue", "find the root cause", "investigate the bug", "systematic debugging", "troubleshoot the problem", or needs help with systematic problem-solving and root cause analysis. |
| version | 1.0.0 |
# Systematic Debugging: 4-Phase Root Cause Analysis

## Overview

Systematic debugging follows a structured 4-phase approach to identify, isolate, and resolve issues efficiently. This methodology prevents the common debugging trap of random changes and ensures comprehensive problem-solving with reproducible results.

## The 4-Phase Debugging Process

### Phase 1: REPRODUCE - Isolate the Problem
Establish a reliable way to reproduce the issue consistently.

**Objectives:**

- Create a minimal reproduction case
- Document exact steps to trigger the bug
- Identify environmental factors
- Establish success/failure criteria

**Key Questions:**

- What exactly happens vs. what should happen?
- Under what conditions does it occur?
- Can you reproduce it consistently?
- What's the minimal case that shows the problem?
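A reproduction is most useful when it is deterministic and minimal. As an illustrative sketch (the `summarize` function and its off-by-one bug are hypothetical), shrinking the input to the smallest case that still fails makes the defect obvious:

```python
def summarize(values):
    """Hypothetical buggy function: the loop bound drops the last element."""
    total = 0
    for i in range(len(values) - 1):  # off-by-one: should be range(len(values))
        total += values[i]
    return total

# Minimal failing case: two elements are enough to expose the discrepancy.
expected = 3
actual = summarize([1, 2])
print(f"expected={expected} actual={actual} reproduced={actual != expected}")
```

A two-element list is the whole reproduction here; anything larger only obscures which step is wrong.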
### Phase 2: GATHER - Collect Evidence

Systematically collect all available information about the issue.

**Data Sources:**

- Error messages and stack traces
- Log files and application output
- System metrics and performance data
- User reports and behavioral patterns
- Code changes and deployment history

**Evidence Types:**

- **Direct evidence:** Error messages, exceptions, failures
- **Circumstantial evidence:** Timing, environment, patterns
- **Historical evidence:** When did it start? What changed?
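Log evidence becomes much easier to reason about once the first failure is pinned to a timestamp, since that anchors the "when did it start?" question. A minimal sketch, assuming a hypothetical `TIMESTAMP LEVEL message` log layout:

```python
import re

def first_error(lines):
    """Return the first line whose level field is ERROR, or None."""
    for line in lines:
        if re.match(r"\S+ ERROR\b", line):
            return line
    return None

# Illustrative log excerpt; real entries would come from a file.
log = [
    "2024-05-01T10:00:00 INFO service started",
    "2024-05-01T10:05:12 ERROR payment failed: timeout",
    "2024-05-01T10:05:13 ERROR retry failed",
]
print(first_error(log))
```

The first error's timestamp can then be correlated against deploys, configuration changes, and traffic patterns from the other data sources above.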
### Phase 3: HYPOTHESIZE - Generate Theories

Develop testable theories about the root cause based on evidence.

**Hypothesis Framework:**

- **Input hypothesis:** Problem in data or user input
- **Logic hypothesis:** Bug in business logic or algorithms
- **Environment hypothesis:** System, infrastructure, or configuration issue
- **Integration hypothesis:** Problem in external dependencies or APIs

**Validation Criteria:**

- Each hypothesis must be testable
- Evidence should support or refute the theory
- Prioritize hypotheses by probability and impact
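One lightweight way to apply the prioritization rule is to record each theory with an estimated probability and a concrete test, then sort before testing. The entries and field names below are illustrative, not prescribed:

```python
# Track candidate root causes and test them highest-probability-first.
hypotheses = [
    {"name": "stale cache entry", "probability": 0.2, "test": "flush cache, retry"},
    {"name": "race in job queue", "probability": 0.5, "test": "run single-threaded"},
    {"name": "malformed input row", "probability": 0.3, "test": "replay captured input"},
]

ordered = sorted(hypotheses, key=lambda h: h["probability"], reverse=True)
for h in ordered:
    print(f"{h['probability']:.0%}  {h['name']}  ->  {h['test']}")
```

Writing the test down next to the hypothesis forces each theory to be falsifiable before any code is touched.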
### Phase 4: TEST - Validate and Fix

Test each hypothesis systematically and implement verified solutions.

**Testing Approach:**

- Test hypotheses in order of likelihood
- Change one variable at a time
- Document test results
- Verify the fix resolves the original issue
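The one-variable-at-a-time rule can be enforced mechanically: derive every experiment from a fixed baseline with exactly one field changed, and record each outcome so results stay attributable. The configuration keys and the stand-in harness below are hypothetical:

```python
baseline = {"pool_size": 10, "timeout_s": 5, "retries": 3}

def run_experiment(config):
    # Stand-in for the real test harness; here the pretend bug
    # appears only when the timeout is too short.
    return "fail" if config["timeout_s"] < 2 else "pass"

results = []
for key, value in [("timeout_s", 1), ("pool_size", 1), ("retries", 0)]:
    trial = {**baseline, key: value}  # exactly one change from baseline
    results.append((key, value, run_experiment(trial)))

for key, value, outcome in results:
    print(f"changed {key}={value}: {outcome}")
```

Because each trial differs from the baseline in a single field, any change in outcome is unambiguously attributable to that field.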
## Debugging Toolbox

### Code-Level Debugging

**Print/Log Debugging:**

```python
# Strategic print statements
print(f"DEBUG: Variable x = {x}, type = {type(x)}")
print(f"DEBUG: Function entry - params: {locals()}")
```

**Interactive Debuggers:**

```bash
# Python
python -m pdb script.py
# ...or insert breakpoint() in code (Python 3.7+)

# JavaScript/Node.js
node --inspect-brk script.js
# ...or insert a `debugger;` statement in code
```

**Assertion Debugging:**

```python
# Validate assumptions
assert user_id is not None, "User ID should not be None at this point"
assert len(items) > 0, f"Items list should not be empty: {items}"
```
### System-Level Debugging

**Log Analysis:**

```bash
# Search for patterns
grep -i "error" /var/log/application.log
tail -f /var/log/application.log | grep "user_id=123"

# Analyze timing patterns
awk '{print $4}' access.log | sort | uniq -c
```

**Performance Analysis:**

```bash
# CPU and memory
top -p $(pgrep python)
ps aux | grep "my_application"

# Network debugging
netstat -tulpn | grep :8080
curl -v http://localhost:8080/api/health
```
**Database Debugging (PostgreSQL):**

```sql
-- Query performance
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'test@example.com';

-- Lock analysis
SELECT * FROM pg_locks WHERE NOT granted;

-- Slowest statements (requires the pg_stat_statements extension;
-- the column is named mean_exec_time on PostgreSQL 13+)
SELECT query, mean_time, calls FROM pg_stat_statements ORDER BY mean_time DESC;
```
## Common Bug Patterns

### Logic Errors

**Off-by-One Errors:**

```python
# Bug: misses the last element
for i in range(len(array) - 1):  # should be range(len(array))
    process(array[i])

# Fix: include all elements
for i in range(len(array)):
    process(array[i])
```

**Null/Undefined Handling:**

```javascript
// Bug: doesn't handle the null case
function processUser(user) {
  return user.name.toUpperCase(); // Crashes if user is null
}

// Fix: add null checks
function processUser(user) {
  return user?.name?.toUpperCase() || 'Unknown';
}
```
### Timing and Concurrency Issues

**Race Conditions:**

```python
# Bug: read-modify-write on shared state is not atomic
class Counter:
    def __init__(self):
        self.count = 0

    def increment(self):
        temp = self.count
        temp += 1
        self.count = temp  # another thread may have updated count in between

# Fix: use proper synchronization
import threading

class Counter:
    def __init__(self):
        self.count = 0
        self.lock = threading.Lock()

    def increment(self):
        with self.lock:
            self.count += 1
```

**Async/Await Issues:**

```javascript
// Bug: not awaiting an async function
async function fetchData() {
  const result = api.getData(); // Missing await
  return result.id; // Tries to access a property on a Promise
}

// Fix: proper async handling
async function fetchData() {
  const result = await api.getData();
  return result.id;
}
```
### Resource Management Issues

**Memory Leaks:**

```python
# Bug: parent <-> child reference cycle; the objects linger until the
# cycle collector runs, and leak outright if gc is disabled
class Parent:
    def __init__(self):
        self.children = []

    def add_child(self, child):
        child.parent = self  # circular reference
        self.children.append(child)

# Fix: hold the back-reference weakly
import weakref

class Parent:
    def __init__(self):
        self.children = []

    def add_child(self, child):
        child.parent = weakref.ref(self)  # call child.parent() to dereference
        self.children.append(child)
```
## Debugging Strategies by Context

### Web Application Debugging

**Client-Side Issues:**

- Check the browser console for JavaScript errors
- Inspect the network tab for failed requests
- Validate form data and API payloads
- Test across different browsers and devices

**Server-Side Issues:**

- Check application logs for errors
- Monitor database query performance
- Validate API request/response cycles
- Check server resource utilization
### API Debugging

**Request/Response Debugging:**

```bash
# Test API endpoints
curl -X POST http://api.example.com/users \
  -H "Content-Type: application/json" \
  -d '{"name": "Test User"}' \
  -v

# Check authentication
curl -H "Authorization: Bearer token123" \
  http://api.example.com/protected \
  -v
```

**Database Integration:**

```python
# Add query logging (SQLAlchemy)
import logging

logging.basicConfig(level=logging.DEBUG)
logging.getLogger('sqlalchemy.engine').setLevel(logging.INFO)
```
### Performance Debugging

**Profiling Code:**

```python
# Whole-program profiling (standard library)
import cProfile
cProfile.run('main()')

# Line-by-line profiling (third-party line_profiler package)
from line_profiler import LineProfiler
profiler = LineProfiler()
profiler.add_function(my_function)
profiler.run('main()')
profiler.print_stats()
```

**Memory Profiling:**

```python
# Memory usage tracking
import tracemalloc

tracemalloc.start()
# ... code under investigation ...
current, peak = tracemalloc.get_traced_memory()
print(f"Current memory usage: {current / 1024 / 1024:.1f} MB")
print(f"Peak memory usage: {peak / 1024 / 1024:.1f} MB")
tracemalloc.stop()
```
## Debugging Checklist

### Phase 1: REPRODUCE
- Document exact error message or unexpected behavior
- Identify steps to reproduce consistently
- Note environmental factors (OS, browser, data)
- Create minimal test case
- Verify issue exists in different environments
### Phase 2: GATHER
- Collect all error messages and stack traces
- Review relevant log files
- Check system metrics (CPU, memory, disk)
- Identify recent changes (code, configuration, data)
- Gather user reports and patterns
### Phase 3: HYPOTHESIZE
- List possible root causes
- Prioritize hypotheses by likelihood
- Define tests for each hypothesis
- Consider both direct and indirect causes
- Review similar past issues
### Phase 4: TEST
- Test hypotheses systematically
- Change only one variable at a time
- Document test results
- Verify fix resolves original issue
- Test for regression in other areas
## Advanced Debugging Techniques

### Binary Search Debugging

When a regression hides in a large codebase or a long commit history, binary search narrows the culprit quickly:

```bash
# Git bisect for finding a regression
git bisect start
git bisect bad HEAD
git bisect good v1.0.0
# Git checks out commits for you to test; automate with a script
# that exits 0 for a good commit and non-zero for a bad one
git bisect run ./test_script.sh
```
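`git bisect run` classifies each candidate commit by the script's exit status: 0 means good, 125 means skip, and any other code from 1-127 means bad. The script can be any executable; a hypothetical Python equivalent of `test_script.sh` might look like:

```python
import sys

def behavior_is_correct():
    # Stand-in for the real regression check; replace with the command
    # or assertion that distinguishes good commits from bad ones.
    return sorted([3, 1, 2]) == [1, 2, 3]

def bisect_exit_code():
    return 0 if behavior_is_correct() else 1

if __name__ == "__main__":
    sys.exit(bisect_exit_code())
```

Invoked as `git bisect run python test_script.py`, git reads the exit code after each checkout and continues bisecting automatically until the first bad commit is found.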
### Rubber Duck Debugging

Explain the problem step-by-step to:

- Identify assumptions and gaps
- Clarify understanding
- Generate new hypotheses
- Spot overlooked details

### Collaborative Debugging

**Pair Debugging:**

- Fresh perspective on the problem
- Knowledge sharing and learning
- Faster hypothesis generation
- Reduced debugging tunnel vision

**Debug Sessions:**

- Screen sharing for real-time collaboration
- Systematic walkthrough of the issue
- Collective problem-solving
## Prevention Strategies

### Defensive Programming

**Input Validation:**

```python
def process_user_data(data):
    if not isinstance(data, dict):
        raise ValueError(f"Expected dict, got {type(data)}")
    if 'email' not in data:
        raise ValueError("Missing required field: email")
    if not data['email'] or '@' not in data['email']:
        raise ValueError(f"Invalid email format: {data['email']}")
```

**Error Handling:**

```python
import requests

# Assumes api_client and logger are configured elsewhere
def fetch_user_profile(user_id):
    try:
        response = api_client.get(f"/users/{user_id}")
        return response.json()
    except requests.exceptions.ConnectionError:
        logger.error(f"Failed to connect to API for user {user_id}")
        raise
    except requests.exceptions.Timeout:
        logger.error(f"API timeout for user {user_id}")
        raise
    except Exception as e:
        logger.error(f"Unexpected error fetching user {user_id}: {e}")
        raise
```
### Monitoring and Observability

**Structured Logging:**

```python
import structlog

logger = structlog.get_logger()

def process_order(order_id):
    logger.info("Processing order", order_id=order_id)
    try:
        # Process the order here
        logger.info("Order processed successfully", order_id=order_id)
    except Exception as e:
        logger.error("Order processing failed",
                     order_id=order_id,
                     error=str(e))
        raise
```

**Health Checks:**

```python
def health_check():
    checks = {
        "database": check_database_connection(),
        "cache": check_cache_connection(),
        "external_api": check_external_api(),
    }
    all_healthy = all(checks.values())
    return {
        "status": "healthy" if all_healthy else "unhealthy",
        "checks": checks,
    }
```
## Additional Resources

### Reference Files

For detailed debugging patterns and advanced techniques, consult:

- `references/debugging-patterns.md` - Common debugging patterns and anti-patterns
- `references/tool-specific-guides.md` - Debugging guides for specific tools and frameworks
- `references/performance-debugging.md` - Performance debugging and profiling techniques

### Example Files

Working debugging examples in `examples/`:

- `examples/web-app-debugging.py` - Complete web application debugging workflow
- `examples/api-debugging-session.py` - API debugging scenarios
- `examples/performance-issue-analysis.py` - Performance debugging example

### Scripts

Debugging utility scripts in `scripts/`:

- `scripts/debug-session-logger.sh` - Automated debugging session logging
- `scripts/log-analyzer.py` - Log file analysis and pattern detection
- `scripts/system-health-check.sh` - Comprehensive system health validation
## Success Metrics

### Debugging Efficiency
- Time to identify root cause
- Number of hypotheses tested
- Accuracy of initial hypothesis
- Resolution time
### Quality Improvement
- Reduced bug recurrence
- Improved error handling
- Better monitoring coverage
- Enhanced system reliability
Follow the 4-phase systematic approach to debug issues efficiently and build more robust systems through better understanding of failure modes and prevention strategies.