## When & Why to Use This Skill

The Systematic Debugging skill provides a structured 4-phase Root Cause Analysis (RCA) framework designed to identify, isolate, and resolve complex software issues efficiently. By following a rigorous process of reproduction, evidence gathering, hypothesis formulation, and systematic testing, it helps developers avoid trial-and-error approaches and implement long-term fixes for logic errors, performance bottlenecks, and system failures.

### Use Cases

- Case 1: Identifying and fixing elusive logic errors such as off-by-one mistakes, null pointer exceptions, and incorrect state handling in various programming languages.
- Case 2: Troubleshooting concurrency and timing issues, including race conditions in multi-threaded applications and async/await errors in modern JavaScript environments.
- Case 3: Diagnosing system-level performance degradation and memory leaks using profiling tools, log analysis, and resource monitoring.
- Case 4: Resolving API integration failures and database bottlenecks by analyzing request/response cycles and optimizing slow SQL queries.
- Case 5: Implementing defensive programming and observability strategies to prevent bug recurrence and improve system reliability through structured logging and health checks.
| name | Systematic Debugging |
|---|---|
| description | This skill should be used when the user asks to "debug this issue", "find the root cause", "investigate the bug", "systematic debugging", "troubleshoot the problem", or needs help with systematic problem-solving and root cause analysis. |
| version | 1.0.0 |
# Systematic Debugging: 4-Phase Root Cause Analysis

## Overview

Systematic debugging follows a structured 4-phase approach to identify, isolate, and resolve issues efficiently. This methodology prevents the common debugging trap of random changes and ensures comprehensive problem-solving with reproducible results.

## The 4-Phase Debugging Process

### Phase 1: REPRODUCE - Isolate the Problem
Establish a reliable way to reproduce the issue consistently.

**Objectives:**

- Create a minimal reproduction case
- Document exact steps to trigger the bug
- Identify environmental factors
- Establish success/failure criteria

**Key Questions:**

- What exactly happens vs. what should happen?
- Under what conditions does it occur?
- Can you reproduce it consistently?
- What's the minimal case that shows the problem?
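A reproduction is most useful when it is deterministic and minimal. As an illustrative sketch (the `summarize` function and its off-by-one bug are hypothetical), shrinking the input to the smallest case that still fails makes the defect obvious:

```python
def summarize(values):
    """Hypothetical buggy function: the loop bound drops the last element."""
    total = 0
    for i in range(len(values) - 1):  # off-by-one: should be range(len(values))
        total += values[i]
    return total

# Minimal failing case: two elements are enough to expose the discrepancy.
expected = 3
actual = summarize([1, 2])
print(f"expected={expected} actual={actual} reproduced={actual != expected}")
```

A two-element list is the whole reproduction here; anything larger only obscures which step is wrong.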
### Phase 2: GATHER - Collect Evidence

Systematically collect all available information about the issue.

**Data Sources:**

- Error messages and stack traces
- Log files and application output
- System metrics and performance data
- User reports and behavioral patterns
- Code changes and deployment history

**Evidence Types:**

- **Direct evidence:** Error messages, exceptions, failures
- **Circumstantial evidence:** Timing, environment, patterns
- **Historical evidence:** When did it start? What changed?
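Log evidence becomes much easier to reason about once the first failure is pinned to a timestamp, since that anchors the "when did it start?" question. A minimal sketch, assuming a hypothetical `TIMESTAMP LEVEL message` log layout:

```python
import re

def first_error(lines):
    """Return the first line whose level field is ERROR, or None."""
    for line in lines:
        if re.match(r"\S+ ERROR\b", line):
            return line
    return None

# Illustrative log excerpt; real entries would come from a file.
log = [
    "2024-05-01T10:00:00 INFO service started",
    "2024-05-01T10:05:12 ERROR payment failed: timeout",
    "2024-05-01T10:05:13 ERROR retry failed",
]
print(first_error(log))
```

The first error's timestamp can then be correlated against deploys, configuration changes, and traffic patterns from the other data sources above.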
### Phase 3: HYPOTHESIZE - Generate Theories

Develop testable theories about the root cause based on evidence.

**Hypothesis Framework:**

- **Input hypothesis:** Problem in data or user input
- **Logic hypothesis:** Bug in business logic or algorithms
- **Environment hypothesis:** System, infrastructure, or configuration issue
- **Integration hypothesis:** Problem in external dependencies or APIs

**Validation Criteria:**

- Each hypothesis must be testable
- Evidence should support or refute the theory
- Prioritize hypotheses by probability and impact
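One lightweight way to apply the prioritization rule is to record each theory with an estimated probability and a concrete test, then sort before testing. The entries and field names below are illustrative, not prescribed:

```python
# Track candidate root causes and test them highest-probability-first.
hypotheses = [
    {"name": "stale cache entry", "probability": 0.2, "test": "flush cache, retry"},
    {"name": "race in job queue", "probability": 0.5, "test": "run single-threaded"},
    {"name": "malformed input row", "probability": 0.3, "test": "replay captured input"},
]

ordered = sorted(hypotheses, key=lambda h: h["probability"], reverse=True)
for h in ordered:
    print(f"{h['probability']:.0%}  {h['name']}  ->  {h['test']}")
```

Writing the test down next to the hypothesis forces each theory to be falsifiable before any code is touched.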
### Phase 4: TEST - Validate and Fix

Test each hypothesis systematically and implement verified solutions.

**Testing Approach:**

- Test hypotheses in order of likelihood
- Change one variable at a time
- Document test results
- Verify the fix resolves the original issue
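The one-variable-at-a-time rule can be enforced mechanically: derive every experiment from a fixed baseline with exactly one field changed, and record each outcome so results stay attributable. The configuration keys and the stand-in harness below are hypothetical:

```python
baseline = {"pool_size": 10, "timeout_s": 5, "retries": 3}

def run_experiment(config):
    # Stand-in for the real test harness; here the pretend bug
    # appears only when the timeout is too short.
    return "fail" if config["timeout_s"] < 2 else "pass"

results = []
for key, value in [("timeout_s", 1), ("pool_size", 1), ("retries", 0)]:
    trial = {**baseline, key: value}  # exactly one change from baseline
    results.append((key, value, run_experiment(trial)))

for key, value, outcome in results:
    print(f"changed {key}={value}: {outcome}")
```

Because each trial differs from the baseline in a single field, any change in outcome is unambiguously attributable to that field.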
## Debugging Toolbox

### Code-Level Debugging

**Print/Log Debugging:**

```python
# Strategic print statements
print(f"DEBUG: Variable x = {x}, type = {type(x)}")
print(f"DEBUG: Function entry - params: {locals()}")
```

**Interactive Debuggers:**

```bash
# Python
python -m pdb script.py
# ...or insert breakpoint() in code (Python 3.7+)

# JavaScript/Node.js
node --inspect-brk script.js
# ...or insert a `debugger;` statement in code
```

**Assertion Debugging:**

```python
# Validate assumptions
assert user_id is not None, "User ID should not be None at this point"
assert len(items) > 0, f"Items list should not be empty: {items}"
```
### System-Level Debugging

**Log Analysis:**

```bash
# Search for patterns
grep -i "error" /var/log/application.log
tail -f /var/log/application.log | grep "user_id=123"

# Analyze timing patterns
awk '{print $4}' access.log | sort | uniq -c
```

**Performance Analysis:**

```bash
# CPU and memory
top -p $(pgrep python)
ps aux | grep "my_application"

# Network debugging
netstat -tulpn | grep :8080
curl -v http://localhost:8080/api/health
```
**Database Debugging (PostgreSQL):**

```sql
-- Query performance
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'test@example.com';

-- Lock analysis
SELECT * FROM pg_locks WHERE NOT granted;

-- Slowest statements (requires the pg_stat_statements extension;
-- the column is named mean_exec_time on PostgreSQL 13+)
SELECT query, mean_time, calls FROM pg_stat_statements ORDER BY mean_time DESC;
```
## Common Bug Patterns

### Logic Errors

**Off-by-One Errors:**

```python
# Bug: misses the last element
for i in range(len(array) - 1):  # should be range(len(array))
    process(array[i])

# Fix: include all elements
for i in range(len(array)):
    process(array[i])
```

**Null/Undefined Handling:**

```javascript
// Bug: doesn't handle the null case
function processUser(user) {
  return user.name.toUpperCase(); // Crashes if user is null
}

// Fix: add null checks
function processUser(user) {
  return user?.name?.toUpperCase() || 'Unknown';
}
```
### Timing and Concurrency Issues

**Race Conditions:**

```python
# Bug: read-modify-write on shared state is not atomic
class Counter:
    def __init__(self):
        self.count = 0

    def increment(self):
        temp = self.count
        temp += 1
        self.count = temp  # another thread may have updated count in between

# Fix: use proper synchronization
import threading

class Counter:
    def __init__(self):
        self.count = 0
        self.lock = threading.Lock()

    def increment(self):
        with self.lock:
            self.count += 1
```

**Async/Await Issues:**

```javascript
// Bug: not awaiting an async function
async function fetchData() {
  const result = api.getData(); // Missing await
  return result.id; // Tries to access a property on a Promise
}

// Fix: proper async handling
async function fetchData() {
  const result = await api.getData();
  return result.id;
}
```
### Resource Management Issues

**Memory Leaks:**

```python
# Bug: parent <-> child reference cycle; the objects linger until the
# cycle collector runs, and leak outright if gc is disabled
class Parent:
    def __init__(self):
        self.children = []

    def add_child(self, child):
        child.parent = self  # circular reference
        self.children.append(child)

# Fix: hold the back-reference weakly
import weakref

class Parent:
    def __init__(self):
        self.children = []

    def add_child(self, child):
        child.parent = weakref.ref(self)  # call child.parent() to dereference
        self.children.append(child)
```
## Debugging Strategies by Context

### Web Application Debugging

**Client-Side Issues:**

- Check the browser console for JavaScript errors
- Inspect the network tab for failed requests
- Validate form data and API payloads
- Test across different browsers and devices

**Server-Side Issues:**

- Check application logs for errors
- Monitor database query performance
- Validate API request/response cycles
- Check server resource utilization
### API Debugging

**Request/Response Debugging:**

```bash
# Test API endpoints
curl -X POST http://api.example.com/users \
  -H "Content-Type: application/json" \
  -d '{"name": "Test User"}' \
  -v

# Check authentication
curl -H "Authorization: Bearer token123" \
  http://api.example.com/protected \
  -v
```

**Database Integration:**

```python
# Add query logging (SQLAlchemy)
import logging

logging.basicConfig(level=logging.DEBUG)
logging.getLogger('sqlalchemy.engine').setLevel(logging.INFO)
```
### Performance Debugging

**Profiling Code:**

```python
# Whole-program profiling (standard library)
import cProfile
cProfile.run('main()')

# Line-by-line profiling (third-party line_profiler package)
from line_profiler import LineProfiler
profiler = LineProfiler()
profiler.add_function(my_function)
profiler.run('main()')
profiler.print_stats()
```

**Memory Profiling:**

```python
# Memory usage tracking
import tracemalloc

tracemalloc.start()
# ... code under investigation ...
current, peak = tracemalloc.get_traced_memory()
print(f"Current memory usage: {current / 1024 / 1024:.1f} MB")
print(f"Peak memory usage: {peak / 1024 / 1024:.1f} MB")
tracemalloc.stop()
```
## Debugging Checklist

### Phase 1: REPRODUCE
- Document exact error message or unexpected behavior
- Identify steps to reproduce consistently
- Note environmental factors (OS, browser, data)
- Create minimal test case
- Verify issue exists in different environments
### Phase 2: GATHER
- Collect all error messages and stack traces
- Review relevant log files
- Check system metrics (CPU, memory, disk)
- Identify recent changes (code, configuration, data)
- Gather user reports and patterns
### Phase 3: HYPOTHESIZE
- List possible root causes
- Prioritize hypotheses by likelihood
- Define tests for each hypothesis
- Consider both direct and indirect causes
- Review similar past issues
### Phase 4: TEST
- Test hypotheses systematically
- Change only one variable at a time
- Document test results
- Verify fix resolves original issue
- Test for regression in other areas
## Advanced Debugging Techniques

### Binary Search Debugging

When a regression hides in a large codebase or a long commit history, binary search narrows the culprit quickly:

```bash
# Git bisect for finding a regression
git bisect start
git bisect bad HEAD
git bisect good v1.0.0
# Git checks out commits for you to test; automate with a script
# that exits 0 for a good commit and non-zero for a bad one
git bisect run ./test_script.sh
```
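`git bisect run` classifies each candidate commit by the script's exit status: 0 means good, 125 means skip, and any other code from 1-127 means bad. The script can be any executable; a hypothetical Python equivalent of `test_script.sh` might look like:

```python
import sys

def behavior_is_correct():
    # Stand-in for the real regression check; replace with the command
    # or assertion that distinguishes good commits from bad ones.
    return sorted([3, 1, 2]) == [1, 2, 3]

def bisect_exit_code():
    return 0 if behavior_is_correct() else 1

if __name__ == "__main__":
    sys.exit(bisect_exit_code())
```

Invoked as `git bisect run python test_script.py`, git reads the exit code after each checkout and continues bisecting automatically until the first bad commit is found.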
### Rubber Duck Debugging

Explain the problem step-by-step to:

- Identify assumptions and gaps
- Clarify understanding
- Generate new hypotheses
- Spot overlooked details

### Collaborative Debugging

**Pair Debugging:**

- Fresh perspective on the problem
- Knowledge sharing and learning
- Faster hypothesis generation
- Reduced debugging tunnel vision

**Debug Sessions:**

- Screen sharing for real-time collaboration
- Systematic walkthrough of the issue
- Collective problem-solving
## Prevention Strategies

### Defensive Programming

**Input Validation:**

```python
def process_user_data(data):
    if not isinstance(data, dict):
        raise ValueError(f"Expected dict, got {type(data)}")
    if 'email' not in data:
        raise ValueError("Missing required field: email")
    if not data['email'] or '@' not in data['email']:
        raise ValueError(f"Invalid email format: {data['email']}")
```

**Error Handling:**

```python
import requests

# Assumes api_client and logger are configured elsewhere
def fetch_user_profile(user_id):
    try:
        response = api_client.get(f"/users/{user_id}")
        return response.json()
    except requests.exceptions.ConnectionError:
        logger.error(f"Failed to connect to API for user {user_id}")
        raise
    except requests.exceptions.Timeout:
        logger.error(f"API timeout for user {user_id}")
        raise
    except Exception as e:
        logger.error(f"Unexpected error fetching user {user_id}: {e}")
        raise
```
### Monitoring and Observability

**Structured Logging:**

```python
import structlog

logger = structlog.get_logger()

def process_order(order_id):
    logger.info("Processing order", order_id=order_id)
    try:
        # Process the order here
        logger.info("Order processed successfully", order_id=order_id)
    except Exception as e:
        logger.error("Order processing failed",
                     order_id=order_id,
                     error=str(e))
        raise
```

**Health Checks:**

```python
def health_check():
    checks = {
        "database": check_database_connection(),
        "cache": check_cache_connection(),
        "external_api": check_external_api(),
    }
    all_healthy = all(checks.values())
    return {
        "status": "healthy" if all_healthy else "unhealthy",
        "checks": checks,
    }
```
## Additional Resources

### Reference Files

For detailed debugging patterns and advanced techniques, consult:

- `references/debugging-patterns.md` - Common debugging patterns and anti-patterns
- `references/tool-specific-guides.md` - Debugging guides for specific tools and frameworks
- `references/performance-debugging.md` - Performance debugging and profiling techniques

### Example Files

Working debugging examples in `examples/`:

- `examples/web-app-debugging.py` - Complete web application debugging workflow
- `examples/api-debugging-session.py` - API debugging scenarios
- `examples/performance-issue-analysis.py` - Performance debugging example

### Scripts

Debugging utility scripts in `scripts/`:

- `scripts/debug-session-logger.sh` - Automated debugging session logging
- `scripts/log-analyzer.py` - Log file analysis and pattern detection
- `scripts/system-health-check.sh` - Comprehensive system health validation
## Success Metrics

### Debugging Efficiency
- Time to identify root cause
- Number of hypotheses tested
- Accuracy of initial hypothesis
- Resolution time
### Quality Improvement
- Reduced bug recurrence
- Improved error handling
- Better monitoring coverage
- Enhanced system reliability
Follow the 4-phase systematic approach to debug issues efficiently and build more robust systems through better understanding of failure modes and prevention strategies.