k8s-troubleshooter
Use this skill when users report Kubernetes cluster issues, pod failures, or need incident response. Comprehensive troubleshooting for diagnosing cluster, workload, networking, storage, and Helm issues. Invoke for: pods not starting (Pending, CrashLoopBackOff, ImagePull), service connectivity problems, DNS resolution failures, storage/PVC issues, node health problems, CNI/Calico networking, Helm release failures, or cluster-wide performance degradation. Provides systematic diagnostic workflows, standardized investigation reports with severity-based depth (Executive Triage Cards for rapid decision-making), and incident response playbooks with phased triage (baseline → inspect → correlate → deep dive). All commands are read-only by default for production safety.
When & Why to Use This Skill
The Kubernetes Troubleshooter skill is a comprehensive diagnostic framework designed to accelerate incident response and root cause analysis within K8s environments. It solves the complexity of cluster debugging by providing automated triage scripts, systematic investigation workflows, and standardized reporting templates. By focusing on production safety with read-only defaults, it empowers SREs and DevOps engineers to resolve pod failures, networking bottlenecks, and storage issues efficiently while maintaining high-quality documentation for stakeholders.
Use Cases
- Diagnosing pod startup and lifecycle failures such as CrashLoopBackOff, ImagePullBackOff, and OOMKilled events.
- Troubleshooting cluster-wide networking issues, including DNS resolution failures and service connectivity bottlenecks.
- Investigating storage and persistence problems related to PVC/PV binding, CSI driver errors, and volume attachments.
- Generating standardized 'Executive Triage Cards' and detailed incident reports to facilitate rapid decision-making during outages.
- Debugging failed Helm releases by analyzing hooks, template rendering, and configuration drift.
- Performing automated cluster health assessments to identify node resource pressure and control plane degradation.
| name | k8s-troubleshooter |
|---|---|
| version | "0.1.0" |
| description | Comprehensive troubleshooting for diagnosing cluster, workload, networking, storage, and Helm issues. Invoke for: pods not starting (Pending, CrashLoopBackOff, ImagePull), service connectivity problems, DNS resolution failures, storage/PVC issues, node health problems, CNI/Calico networking, Helm release failures, or cluster-wide performance degradation. |
Kubernetes Troubleshooter Skill
Overview
This skill encodes expert Kubernetes troubleshooting workflows for diagnosing complex cluster issues. It provides systematic investigation methodologies, production-safe diagnostic patterns, and incident response playbooks that guide you through identifying root causes rather than just treating symptoms.
Core Capabilities:
- Pod lifecycle diagnostics (Pending, CrashLoopBackOff, OOMKilled, ImagePull failures)
- Service connectivity and DNS troubleshooting
- Storage/CSI and PVC/PV issues
- Node health and resource pressure
- Network policy and CNI (Calico) debugging
- Helm chart and release troubleshooting
- Cluster-wide health checks and control plane diagnostics
Safety First: All diagnostic commands are read-only unless explicitly marked as remediation. This prevents making production incidents worse while investigating.
Progressive Disclosure: Core workflows are in this file. Deep dives available in references/ when needed.
Incident Response (Start Here for Production Issues)
When to use: Any production incident, outage, or unexpected cluster behavior.
Automated Incident Triage (Primary Method)
# Run comprehensive incident triage
~/.claude/skills/k8s-troubleshooter/scripts/incident_triage.sh
# Quick triage without full cluster dump (faster, recommended for urgent incidents)
~/.claude/skills/k8s-troubleshooter/scripts/incident_triage.sh --skip-dump
# Scope to specific namespace
~/.claude/skills/k8s-troubleshooter/scripts/incident_triage.sh --namespace production
# Custom output directory
~/.claude/skills/k8s-troubleshooter/scripts/incident_triage.sh --output-dir /tmp/incident-20231201
The incident triage script:
- Captures evidence - Preserves cluster state before investigation (nodes, pods, events, optional cluster-info dump)
- Checks control plane - Uses /readyz?verbose for component-level health status
- Assesses blast radius - Classifies impact: single pod, namespace, multiple namespaces, or cluster-wide
- Classifies symptoms - Detects crash loops, OOM, scheduling failures, DNS/network issues, storage problems
- Recommends workflows - Provides specific diagnostic scripts and commands based on detected symptoms
- Generates report - Creates markdown report with executive summary and text summary for quick reference
Output: Triage report with blast radius, symptoms, recommended next steps, and captured evidence.
Use this FIRST when responding to production incidents. It provides the systematic assessment needed to guide investigation without making the situation worse.
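The component-level control-plane check the script relies on can also be run by hand if you want to see it directly:

```bash
# Component-level API server readiness (the /readyz?verbose endpoint mentioned above)
kubectl get --raw '/readyz?verbose'
```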
Decision Tree Quick Reference
See references/incident-response.md for complete triage decision tree and investigation workflows.
Quick symptom-to-workflow mapping:
- Pods Pending → pod_diagnostics.sh (check scheduling, resources, taints)
- CrashLoopBackOff → pod_diagnostics.sh -l -p (check logs, exit codes)
- OOMKilled → pod_diagnostics.sh (check memory limits, usage)
- DNS/Network issues → network_debug.sh (check CoreDNS, endpoints, policies)
- Storage failures → storage_check.sh (check PVC, CSI driver, attachments)
- Node problems → cluster_health_check.sh (check conditions, pressure)
- Control plane degraded → cluster_health_check.sh (check components, API server)
First 5 Minutes Checklist
When a production incident is reported:
- Assess urgency - Can you access the cluster? Are users impacted?
- Run incident triage - incident_triage.sh --skip-dump for fast assessment
- Stabilize if critical - Consider immediate actions (scale, rollback) only if necessary
- Preserve evidence - Triage script captures this automatically
- Follow recommendations - Use triage report to guide investigation
Remember: Evidence first, action second. The triage script preserves state before changes.
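If the triage script is unavailable, a minimal read-only baseline sweep in the same spirit might look like this (illustrative, not exhaustive):

```bash
# Read-only baseline: reachability, node health, unhealthy pods, recent warnings
kubectl cluster-info
kubectl get nodes -o wide
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
kubectl get pods --all-namespaces --field-selector=status.phase=Failed
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -30
```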
Standardized Report Generation
After completing an investigation, generate a structured report that enables rapid decision-making, clear auditing, and trustworthy communication. This format separates verified facts from inferences, ranks hypotheses with falsification tests, and provides explicit proof-of-work documentation.
Report Depth Guidelines
| Severity | Executive Card | Other Sections |
|---|---|---|
| P1/P2 | Full detail | Full detail |
| P3 | Full detail | Concise (primary hypothesis only, abbreviated evidence) |
| P4 | Abbreviated | Minimal (summary only) |
For P3/P4 incidents:
- Low-confidence hypotheses can be 1-line bullets
- Supporting Evidence can reference earlier sections instead of repeating
- "What Changed" can be omitted if unknown
Report Template Overview
Every investigation report should follow this 7-section structure:
| Section | Purpose | Time to Read |
|---|---|---|
| 0. Executive Triage Card | 30-60 second decision summary | 30 sec |
| 1. Problem Statement | What's broken and success criteria | 1 min |
| 2. Assessment & Findings | Facts, inferences, and scope | 2-3 min |
| 3. Root Cause Analysis | Ranked hypotheses with evidence | 2-3 min |
| 4. Remediation Plan | Actions, validation, rollback | 2 min |
| 5. Proof of Work | Commands executed, sources consulted | 1 min |
| 6. Supporting Evidence | Log excerpts, kubectl output | Reference |
Report Template - MANDATORY STRUCTURE
CRITICAL: This is a literal template. Fill in the placeholders but DO NOT:
- Add, remove, or rename sections
- Add subsections beyond those shown
- Rearrange section order
- Create alternative formats
Copy this template and fill in the bracketed placeholders:
---BEGIN MANDATORY TEMPLATE---
Kubernetes Incident Triage Report
0. Executive Triage Card
| Field | Value |
|---|---|
| Status | [Choose ONE: 🔴 Active Incident / 🟡 Degraded / 🟢 Resolved] |
| Severity | [P1/P2/P3/P4] |
| Impact | [Single Pod / Namespace / Multi-Namespace / Cluster-Wide] |
| Duration | Started: [timestamp], Ongoing: [duration] |
Primary Hypothesis
[One clear sentence] — Confidence: [High/Medium/Low]
⚠️ Most Dangerous Assumption: [What assumption, if wrong, would invalidate this approach]
Top 3 Recommended Actions
| Priority | Action | Expected Outcome | Risk |
|---|---|---|---|
| 1 | [Specific kubectl command or action] | [What this fixes] | [Low/Med/High] |
| 2 | [Specific kubectl command or action] | [What this fixes] | [Low/Med/High] |
| 3 | [Specific kubectl command or action] | [What this fixes] | [Low/Med/High] |
Escalation Triggers
- Escalate if: [specific measurable condition]
- Escalate if: [specific measurable condition]
Alternative Hypotheses Considered
- [H2 name] — Confidence: [X%], Why deprioritized: [brief reason]
- [H3 name] — Confidence: [X%], Why deprioritized: [brief reason]
1. Problem Statement
Symptoms
- [Observable symptom 1]
- [Observable symptom 2]
- [Observable symptom 3]
Timeline
- First detected: [timestamp, detection method]
- Reported by: [source: alert/user/automated]
- Current duration: [time since detection]
Exit Criteria Checklist
- [Specific measurable outcome]
- [Specific measurable outcome]
- [Specific measurable outcome]
2. Assessment & Findings
Classification
| Aspect | Value |
|---|---|
| Category | [Pod/Service/Storage/Node/Network/Helm/Cluster] |
| Blast Radius | [Single pod / Namespace / Multi-namespace / Cluster-wide] |
| User Impact | [None / Degraded / Partial outage / Full outage] |
Scope
| Status | Resources |
|---|---|
| Confirmed Affected | [list pods/services/nodes] |
| Suspected Affected | [list resources] |
| Confirmed Unaffected | [list key resources verified healthy] |
Observed Facts
Facts are directly verifiable from kubectl output, logs, or metrics.
- [FACT-1] [Description] — Source: [kubectl command or log reference]
- [FACT-2] [Description] — Source: [kubectl command or log reference]
- [FACT-3] [Description] — Source: [kubectl command or log reference]
Derived Inferences
Inferences are conclusions drawn from facts.
- [INF-1] [Inference statement] — Confidence: [High/Medium/Low]
- Based on: FACT-1, FACT-2
- Could be wrong if: [falsification condition]
- [INF-2] [Inference statement] — Confidence: [High/Medium/Low]
- Based on: FACT-3
- Could be wrong if: [falsification condition]
What Changed
| When | What Changed | Who/What | Correlation |
|---|---|---|---|
| [timestamp] | [change description] | [actor/system] | [likely/possible/unlikely] |
Constraints Encountered
- [Access limitation, missing data, or tool constraint]
- [Additional constraint if applicable]
3. Root Cause Analysis
H1: [Primary Hypothesis Name] — Confidence: [High/Medium/Low]
Statement: [Clear, specific hypothesis about root cause]
Evidence For:
- [Supporting evidence] (references FACT-n)
- [Supporting evidence] (references FACT-n)
Evidence Against:
- [Contradicting evidence or gaps]
Falsification Test:
- Command: [specific kubectl command]
- Expected if TRUE: [specific result]
- Expected if FALSE: [alternative result]
H2: [Secondary Hypothesis Name] — Confidence: [Medium/Low]
Statement: [Alternative root cause]
Evidence For:
- [Supporting evidence]
Evidence Against:
- [Why less likely than H1]
Falsification Test:
- Command: [specific command]
- Expected if TRUE: [result]
- Expected if FALSE: [result]
Remaining Unknowns
- [Unknown 1]: [Why this matters for the investigation]
- [Unknown 2]: [What additional access/data would resolve this]
4. Remediation Plan
Immediate Mitigation
| Step | Action | Validation | Rollback |
|---|---|---|---|
| 1 | [Specific action/command] | [How to verify success] | [How to undo] |
| 2 | [Specific action/command] | [How to verify success] | [How to undo] |
Fix Forward
| Step | Action | Validation | Owner |
|---|---|---|---|
| 1 | [Permanent fix description] | [Success criteria] | [Team/person] |
| 2 | [Follow-up action] | [Success criteria] | [Team/person] |
Prevention Improvements
| Category | Recommendation | Priority |
|---|---|---|
| Monitoring | [Specific alert or metric to add] | High/Med/Low |
| Process | [Process change description] | High/Med/Low |
| Architecture | [Infrastructure improvement] | High/Med/Low |
5. Proof of Work
Inputs Consulted
| Source Type | Source | Findings |
|---|---|---|
| Logs | [pod/container name] | [Key finding from logs] |
| Events | [kubectl events scope] | [Key finding from events] |
| Metrics | [prometheus/grafana] | [Key observation] |
Commands Executed
# [Category description]
[command 1]
[command 2]
[command 3]
# [Next category if applicable]
[command 4]
Constraints Documented
- [Time constraint: e.g., investigation limited to X minutes]
- [Access constraint: e.g., no access to node logs]
- [Data constraint: e.g., logs older than Y not available]
6. Supporting Evidence
Log Excerpts
Pod logs: [pod-name] (click to expand)
[paste relevant log lines with timestamps]
kubectl Output
kubectl describe pod [pod-name]
[paste kubectl output]
Additional Evidence
[Description of evidence type]
[paste output or data]
---END MANDATORY TEMPLATE---
Template Compliance Rules
Section Headers:
- MUST use exact section numbers and names (0-6)
- MUST include all sections even if data unavailable
- If data unavailable, state: "Not available due to [reason]"
Formatting:
- MUST preserve table structures
- MUST use [FACT-n] labels for all facts
- MUST use [INF-n] labels for all inferences
- MUST use <details> tags for collapsible evidence
Content:
- Keep sections concise per P3/P4 depth guidelines
- Use bullet points, not paragraphs, where shown
- Include source attribution for all facts
- Provide falsification tests for all hypotheses
Common Violations to Avoid:
- ❌ Creating sections named "Incident Overview", "Primary Issue", "Key Findings"
- ❌ Mixing facts and inferences without labels
- ❌ Providing hypotheses without Evidence For/Against structure
- ❌ Omitting the "Most Dangerous Assumption"
- ❌ Adding prose paragraphs instead of structured tables
- ❌ Skipping Supporting Evidence section
Section 0: Executive Triage Card
Purpose: Enable incident commanders to make decisions in 30-60 seconds.
## Executive Triage Card
| Field | Value |
|-------|-------|
| **Status** | 🔴 Active Incident / 🟡 Degraded / 🟢 Resolved |
| **Severity** | P1/P2/P3/P4 (see severity table) |
| **Impact** | [Blast radius: pod/namespace/multi-namespace/cluster-wide] |
| **Duration** | Started: [timestamp], Ongoing: [duration] |
### Primary Hypothesis
**[Hypothesis statement]** — Confidence: [High/Medium/Low]
⚠️ **Most Dangerous Assumption**: [What assumption, if wrong, would most change our approach]
### Top 3 Recommended Actions
| Priority | Action | Expected Outcome | Risk |
|----------|--------|------------------|------|
| 1 | [Immediate action] | [What it fixes] | [Low/Med/High] |
| 2 | [Secondary action] | [What it fixes] | [Low/Med/High] |
| 3 | [Tertiary action] | [What it fixes] | [Low/Med/High] |
### Escalation Triggers
- [ ] Escalate if: [condition 1]
- [ ] Escalate if: [condition 2]
### Alternative Hypotheses Considered
1. [H2: Alternative hypothesis] — Confidence: [X%], Why deprioritized: [reason]
2. [H3: Another alternative] — Confidence: [X%], Why deprioritized: [reason]
Section 1: Problem Statement
Purpose: Define what's broken, when it started, and what success looks like.
## Problem Statement
### Symptoms
- [Observable symptom 1]
- [Observable symptom 2]
- [Observable symptom 3]
### Timeline
- **First detected**: [timestamp, detection method]
- **Reported by**: [source: alert, user, automated check]
- **Current duration**: [time since first detection]
### Exit Criteria Checklist
- [ ] [Specific measurable outcome 1]
- [ ] [Specific measurable outcome 2]
- [ ] [Specific measurable outcome 3]
Section 2: Assessment & Findings
Purpose: Separate verified facts from derived inferences with explicit confidence levels.
## Assessment & Findings
### Classification
| Aspect | Value |
|--------|-------|
| **Category** | [Pod/Service/Storage/Node/Network/Helm/Cluster] |
| **Blast Radius** | [Single pod/Namespace/Multi-namespace/Cluster-wide] |
| **User Impact** | [None/Degraded/Partial outage/Full outage] |
### Scope
| Status | Resources |
|--------|-----------|
| **Confirmed Affected** | [pod-1, pod-2, service-x] |
| **Suspected Affected** | [deployment-y, namespace-z] |
| **Confirmed Unaffected** | [namespace-a, service-b] |
### Observed Facts
> Facts are directly verifiable from kubectl output, logs, or metrics.
- **[FACT-1]** [Description] — Source: `kubectl get pods -n namespace`
- **[FACT-2]** [Description] — Source: Pod logs, timestamp
- **[FACT-3]** [Description] — Source: `kubectl describe node`
### Derived Inferences
> Inferences are conclusions drawn from facts. Each includes confidence level.
- **[INF-1]** [Inference statement] — Confidence: [High/Medium/Low]
- Based on: FACT-1, FACT-2
- Could be wrong if: [falsification condition]
- **[INF-2]** [Inference statement] — Confidence: [High/Medium/Low]
- Based on: FACT-3
- Could be wrong if: [falsification condition]
### What Changed
> Recent changes that may be correlated with the incident.
| When | What Changed | Who/What | Correlation |
|------|-------------|----------|-------------|
| [timestamp] | [change description] | [actor] | [likely/possible/unlikely] |
### Constraints Encountered
- [Access limitation 1]
- [Missing data 1]
- [Tool limitation 1]
Section 3: Root Cause Analysis
Purpose: Present ranked hypotheses with evidence and explicit falsification tests.
## Root Cause Analysis
### H1: [Primary Hypothesis] — Confidence: [High/Medium/Low]
**Statement**: [Clear, specific hypothesis about root cause]
**Evidence For**:
- [Supporting evidence 1] (FACT-1)
- [Supporting evidence 2] (FACT-2)
**Evidence Against**:
- [Contradicting evidence, if any]
**Falsification Test**: [Specific test that would disprove this hypothesis]
- Command: `[kubectl command or check]`
- If result is [X], hypothesis is falsified
---
### H2: [Secondary Hypothesis] — Confidence: [Medium/Low]
**Statement**: [Alternative root cause hypothesis]
**Evidence For**:
- [Supporting evidence]
**Evidence Against**:
- [Why this is less likely than H1]
**Falsification Test**: [How to disprove]
---
### H3: [Tertiary Hypothesis] — Confidence: [Low]
**Statement**: [Third alternative]
**Why Deprioritized**: [Specific reason]
---
### Remaining Unknowns
- [ ] [Unknown 1]: Would change analysis if discovered
- [ ] [Unknown 2]: Requires additional access/data
Section 4: Remediation Plan
Purpose: Provide actionable steps with validation and rollback procedures.
## Remediation Plan
### Immediate Mitigation
| Step | Action | Validation | Rollback |
|------|--------|------------|----------|
| 1 | [Action command] | [How to verify success] | [How to undo] |
| 2 | [Action command] | [How to verify success] | [How to undo] |
### Fix Forward
| Step | Action | Validation | Owner |
|------|--------|------------|-------|
| 1 | [Permanent fix action] | [Success criteria] | [Team/person] |
| 2 | [Follow-up action] | [Success criteria] | [Team/person] |
### Prevention Improvements
| Category | Recommendation | Priority |
|----------|---------------|----------|
| Monitoring | [Add alert for X] | High |
| Process | [Change Y procedure] | Medium |
| Architecture | [Consider Z improvement] | Low |
Section 5: Proof of Work
Purpose: Document what was investigated for auditing and future reference.
## Proof of Work
### Inputs Consulted
| Source Type | Source | Findings |
|-------------|--------|----------|
| Logs | [pod/container logs] | [Key finding] |
| Events | [kubectl events] | [Key finding] |
| Metrics | [prometheus/grafana] | [Key finding] |
| Documentation | [runbook, wiki] | [Relevant info] |
| Team Input | [person/channel] | [Key insight] |
### Commands Executed
```bash
# Cluster state assessment
kubectl get pods -n namespace -o wide
kubectl describe pod problem-pod -n namespace
kubectl logs problem-pod -n namespace --tail=100
# Event correlation
kubectl get events -n namespace --sort-by='.lastTimestamp'
# Resource analysis
kubectl top pods -n namespace
kubectl describe node affected-node
```
### Constraints Documented
- [Time constraint: investigation limited to X minutes]
- [Access constraint: no access to Y]
- [Data constraint: logs older than Z not available]
Section 6: Supporting Evidence
Purpose: Provide raw output for verification and future reference.
## Supporting Evidence
### Log Excerpts
<details>
<summary>Pod logs: problem-pod (click to expand)</summary>
[timestamp] ERROR: [error message]
[timestamp] WARN: [warning message]
[timestamp] INFO: [relevant context]
</details>
### kubectl Output
<details>
<summary>kubectl describe pod problem-pod (click to expand)</summary>
Name:         problem-pod
Namespace:    production
...
Events:
  Type     Reason   Age  From     Message
  Warning  BackOff  2m   kubelet  Back-off restarting failed container
</details>
### Metrics Snapshots
<details>
<summary>Resource usage at incident time (click to expand)</summary>
[Include relevant graphs, metrics, or data points]
</details>
Severity Guidelines
Use these criteria to assign incident severity:
| Severity | Criteria | Response Time | Examples |
|---|---|---|---|
| P1 - Critical | Complete service outage, data loss risk, security breach | Immediate (< 15 min) | Cluster unreachable, all pods crashing, data corruption |
| P2 - High | Major feature unavailable, significant user impact | < 1 hour | Key service degraded, 50%+ pods affected, persistent failures |
| P3 - Medium | Minor feature impact, workaround available | < 4 hours | Single pod issues, intermittent errors, non-critical service |
| P4 - Low | Minimal impact, cosmetic issues | < 24 hours | Warning events, minor config drift, documentation gaps |
Blast Radius Categories
Use these categories to describe incident scope:
| Blast Radius | Description | Typical Causes |
|---|---|---|
| Single Pod | One pod affected, no broader impact | App bug, resource limits, image issues |
| Namespace | Multiple pods in one namespace affected | Namespace quota, shared secret/configmap, network policy |
| Multi-Namespace | Multiple namespaces impacted | Node failure, CNI issues, shared infrastructure |
| Cluster-Wide | Entire cluster degraded or unavailable | Control plane issues, etcd problems, CNI failure, node exhaustion |
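As a rough, read-only way to gauge blast radius, you can count unhealthy pods per namespace; this is a sketch, and the thresholds for each category remain a judgment call:

```bash
# Namespaces ranked by pods not in Running/Completed state (column 4 of `kubectl get pods -A` is STATUS)
kubectl get pods --all-namespaces --no-headers \
  | awk '$4 != "Running" && $4 != "Completed" {print $1}' \
  | sort | uniq -c | sort -rn
```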
Fact vs Inference Methodology
Facts are directly observable and independently verifiable:
- Command output (kubectl get, describe, logs)
- Timestamps from events or logs
- Metric values at specific points in time
- Configuration values from manifests
Inferences are conclusions drawn from facts:
- Root cause determinations
- Correlations between events
- Predictions about behavior
- Assessments of impact
Labeling Convention:
- Prefix facts with [FACT-n] for traceability
- Prefix inferences with [INF-n] and include:
  - Confidence level (High/Medium/Low)
  - Supporting facts by reference
  - Falsification condition ("Could be wrong if...")
Confidence Levels:
- High: Multiple corroborating facts, matches known patterns, no contradicting evidence
- Medium: Some supporting facts, plausible explanation, minor gaps in evidence
- Low: Limited evidence, speculative, alternative explanations equally likely
Hypothesis Ranking
When presenting root cause hypotheses:
- Rank by confidence: Present highest confidence hypothesis first (H1)
- Provide evidence both ways: List evidence for AND against each hypothesis
- Include falsification tests: Specify what would disprove each hypothesis
- Identify the most dangerous assumption: What assumption, if wrong, would most change the recommended approach?
Falsification Test Format:
**Falsification Test**: [Description of test]
- Command: `[specific command to run]`
- Expected if hypothesis is TRUE: [result]
- Expected if hypothesis is FALSE: [result]
Example:
### H1: OOM Kill due to memory leak — Confidence: High
**Falsification Test**: Check if memory usage grows over time
- Command: `kubectl top pod app-pod -n prod --containers`
- Expected if TRUE: Memory usage near limit, increasing trend
- Expected if FALSE: Memory usage stable, well below limit
Report Generation Checklist
Before finalizing a report, verify:
- Executive Triage Card can be understood in 30-60 seconds
- All facts are labeled and include source references
- All inferences include confidence levels and falsification conditions
- At least 2-3 hypotheses are presented with ranking
- "Most dangerous assumption" is explicitly identified
- Remediation steps include validation AND rollback procedures
- Proof of Work documents all commands executed
- Constraints and limitations are acknowledged
Automation First
Priority: Use automation scripts before manual command workflows. Scripts provide production-tested diagnostic flows and generate structured output.
Quick Reference: Task to Script Mapping
| Task | Script | Invocation |
|---|---|---|
| Production incident triage | incident_triage.sh | ~/.claude/skills/k8s-troubleshooter/scripts/incident_triage.sh --skip-dump |
| Incident triage (with cluster dump) | incident_triage.sh | ~/.claude/skills/k8s-troubleshooter/scripts/incident_triage.sh |
| Incident triage (namespace-scoped) | incident_triage.sh | ~/.claude/skills/k8s-troubleshooter/scripts/incident_triage.sh --namespace <NAMESPACE> |
| Cluster health check | cluster_health_check.sh | ~/.claude/skills/k8s-troubleshooter/scripts/cluster_health_check.sh |
| Cluster assessment report | cluster_assessment.sh | ~/.claude/skills/k8s-troubleshooter/scripts/cluster_assessment.sh |
| Cluster assessment (custom output) | cluster_assessment.sh | ~/.claude/skills/k8s-troubleshooter/scripts/cluster_assessment.sh -o custom-report.md |
| Cluster assessment (custom kubeconfig) | cluster_assessment.sh | ~/.claude/skills/k8s-troubleshooter/scripts/cluster_assessment.sh -c ~/.kube/prod-config |
| Pod diagnostics | pod_diagnostics.sh | ~/.claude/skills/k8s-troubleshooter/scripts/pod_diagnostics.sh <POD_NAME> <NAMESPACE> |
| Network debugging | network_debug.sh | ~/.claude/skills/k8s-troubleshooter/scripts/network_debug.sh <NAMESPACE> |
| Storage check | storage_check.sh | ~/.claude/skills/k8s-troubleshooter/scripts/storage_check.sh <NAMESPACE> |
| Helm release debug (basic) | helm_release_debug.sh | ~/.claude/skills/k8s-troubleshooter/scripts/helm_release_debug.sh <RELEASE_NAME> <NAMESPACE> |
| Helm release with chart validation | helm_release_debug.sh | ~/.claude/skills/k8s-troubleshooter/scripts/helm_release_debug.sh <RELEASE_NAME> <NAMESPACE> --chart <CHART_PATH> --values <VALUES_FILE> |
| Helm release with tests | helm_release_debug.sh | ~/.claude/skills/k8s-troubleshooter/scripts/helm_release_debug.sh <RELEASE_NAME> <NAMESPACE> --run-tests |
Script Location: All scripts are in ~/.claude/skills/k8s-troubleshooter/scripts/
Getting Help: Run any script with -h flag for usage details and parameters.
When to Use Manual Commands: Use manual workflows when:
- Script is not available for your specific task
- You need to understand the diagnostic methodology
- Script fails and you need to debug the underlying commands
Diagnostic Decision Tree
Start with the symptom that best matches your issue:
Issue Category → Entry Point → Workflow Phase
├─ Pod Issues
│ ├─ Not starting (Pending) → Pod Lifecycle → Baseline
│ ├─ Crashing (CrashLoop) → Pod Lifecycle → Inspect
│ ├─ Image pull failures → Pod Lifecycle → Correlate
│ └─ Resource issues (OOM) → Pod Lifecycle → Deep Dive
├─ Service/Network Issues
│ ├─ DNS resolution → Service Connectivity → Baseline
│ ├─ Endpoint mismatches → Service Connectivity → Inspect
│ ├─ Network policies → Network Debugging → Correlate
│ └─ Ingress/LoadBalancer → Service Connectivity → Deep Dive
├─ Storage Issues
│ ├─ PVC pending → Storage Diagnostics → Baseline
│ ├─ Mount failures → Storage Diagnostics → Inspect
│ ├─ CSI driver errors → Storage Diagnostics → Correlate
│ └─ Cloud provider issues → Storage Diagnostics → Deep Dive
├─ Node Issues
│ ├─ NotReady → Node Health → Baseline
│ ├─ Resource pressure → Node Health → Inspect
│ ├─ Kubelet issues → Node Health → Correlate
│ └─ CNI failures → Node Health → Deep Dive
├─ Helm Issues
│ ├─ Install/upgrade failures → Helm Debugging → Baseline
│ ├─ Stuck releases → Helm Debugging → Inspect
│ └─ Template errors → Helm Debugging → Correlate
└─ Cluster-Wide Issues
├─ API server problems → Cluster Health → Baseline
├─ Control plane → Cluster Health → Inspect
└─ Authentication/RBAC → Cluster Health → Correlate
Workflow 1: Pod Troubleshooting
Phase 1: Baseline Assessment
Objective: Determine pod current state and recent events
# Get pod status and basic info
kubectl get pod <POD_NAME> -n <NAMESPACE> -o wide
# Check recent events
kubectl get events -n <NAMESPACE> --field-selector involvedObject.name=<POD_NAME> --sort-by='.lastTimestamp'
# Get pod details
kubectl describe pod <POD_NAME> -n <NAMESPACE>
Common States:
- Pending: Scheduling or resource issues
- CrashLoopBackOff: Container repeatedly failing
- ImagePullBackOff/ErrImagePull: Image retrieval problems
- Running but unhealthy: Readiness/liveness probe failures
- Terminating stuck: Finalizers or graceful shutdown issues
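To pull the phase and per-container state reasons in a single read-only call (placeholders follow the convention used elsewhere in this skill):

```bash
# Pod phase plus each container's current state (waiting/running/terminated and reason)
kubectl get pod <POD_NAME> -n <NAMESPACE> \
  -o jsonpath='{.status.phase}{"\n"}{range .status.containerStatuses[*]}{.name}{": "}{.state}{"\n"}{end}'
```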
Phase 2: Inspect Container Status
For Pending Pods:
# Check node resources
kubectl describe nodes | grep -A 5 "Allocated resources"
# Check pod resource requests
kubectl get pod <POD_NAME> -n <NAMESPACE> -o jsonpath='{.spec.containers[*].resources}'
# Check for taints and tolerations
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
kubectl get pod <POD_NAME> -n <NAMESPACE> -o jsonpath='{.spec.tolerations}'
For CrashLoopBackOff:
# Get current logs
kubectl logs <POD_NAME> -n <NAMESPACE> -c <CONTAINER_NAME>
# Get previous container logs (after crash)
kubectl logs <POD_NAME> -n <NAMESPACE> -c <CONTAINER_NAME> --previous
# Check exit code and reason
kubectl get pod <POD_NAME> -n <NAMESPACE> -o jsonpath='{.status.containerStatuses[*].lastState.terminated}'
For Image Pull Failures:
# Verify image name and tag
kubectl get pod <POD_NAME> -n <NAMESPACE> -o jsonpath='{.spec.containers[*].image}'
# Check imagePullSecrets
kubectl get pod <POD_NAME> -n <NAMESPACE> -o jsonpath='{.spec.imagePullSecrets}'
# Test image pull on node
kubectl debug node/<NODE_NAME> -it --image=<TEST_IMAGE>
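If registry access from the node itself is in doubt, a node debug session can attempt the pull with the node's own runtime. This assumes crictl is installed on the host, which is typical for containerd-based nodes but not guaranteed:

```bash
# Attempt the pull the same way the kubelet's runtime would
kubectl debug node/<NODE_NAME> -it --image=ubuntu -- chroot /host crictl pull <IMAGE>
```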
Phase 3: Correlate with Dependencies
# Check ConfigMaps and Secrets
kubectl get configmaps,secrets -n <NAMESPACE>
kubectl describe pod <POD_NAME> -n <NAMESPACE> | grep -A 5 "Environment"
# Check PVC bindings
kubectl get pvc -n <NAMESPACE>
kubectl describe pvc <PVC_NAME> -n <NAMESPACE>
# Check service account and RBAC
kubectl get serviceaccount <SA_NAME> -n <NAMESPACE>
kubectl auth can-i --list --as=system:serviceaccount:<NAMESPACE>:<SA_NAME>
Phase 4: Deep Dive (Advanced)
See references/pod-troubleshooting.md for:
- Container startup probe debugging
- Init container sequencing
- Resource quota and limit ranges
- Pod disruption budgets
- Security context issues
- Image pull through proxies
Stop Conditions: Pod running and passing readiness checks, or root cause identified.
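A quick read-only check of that stop condition (expected output looks like `Running True`):

```bash
# Confirm the pod is Running and the Ready condition is True
kubectl get pod <POD_NAME> -n <NAMESPACE> \
  -o jsonpath='{.status.phase} {.status.conditions[?(@.type=="Ready")].status}{"\n"}'
```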
Workflow 2: Service Connectivity
Phase 1: Baseline Assessment
# Check service definition
kubectl get svc <SERVICE_NAME> -n <NAMESPACE> -o wide
# Check endpoints
kubectl get endpoints <SERVICE_NAME> -n <NAMESPACE>
# Verify selector matches pods
kubectl get pods -n <NAMESPACE> -l <SELECTOR> --show-labels
Phase 2: DNS Verification
# Test DNS resolution from within cluster
kubectl run -it --rm debug --image=nicolaka/netshoot --restart=Never -- \
nslookup <SERVICE_NAME>.<NAMESPACE>.svc.cluster.local
# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns
# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
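When resolution fails, it is also worth confirming that the DNS Service itself has endpoints; this assumes the standard kube-dns Service name used by typical CoreDNS installs:

```bash
# CoreDNS is usually exposed through the kube-dns Service in kube-system
kubectl get svc,endpoints kube-dns -n kube-system
```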
Phase 3: Endpoint Investigation
# Detailed endpoint info
kubectl describe endpoints <SERVICE_NAME> -n <NAMESPACE>
# Check pod readiness
kubectl get pods -n <NAMESPACE> -l <SELECTOR> -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'
# Test direct pod connectivity
kubectl run -it --rm debug --image=nicolaka/netshoot --restart=Never -- \
curl <POD_IP>:<PORT>
Phase 4: Network Policy Check
# List network policies affecting namespace
kubectl get networkpolicies -n <NAMESPACE>
# Describe relevant policies
kubectl describe networkpolicy <POLICY_NAME> -n <NAMESPACE>
# Check pod labels against policy selectors
kubectl get pods -n <NAMESPACE> --show-labels
Deep Dive: See references/service-networking.md for ingress troubleshooting, LoadBalancer issues, ExternalDNS, and service mesh integration.
Workflow 3: Storage Troubleshooting
Phase 1: PVC Status Check
# Check PVC status
kubectl get pvc -n <NAMESPACE>
# Detailed PVC info
kubectl describe pvc <PVC_NAME> -n <NAMESPACE>
# Check PV binding
kubectl get pv
kubectl describe pv <PV_NAME>
Phase 2: StorageClass and Provisioner
# Check StorageClass
kubectl get storageclass
kubectl describe storageclass <SC_NAME>
# Check CSI driver pods
kubectl get pods -n kube-system | grep csi
# Check CSI controller logs
kubectl logs -n kube-system <CSI_CONTROLLER_POD> -c csi-provisioner
Phase 3: Volume Attachment
# Check volume attachments
kubectl get volumeattachments
# Describe specific attachment
kubectl describe volumeattachment <ATTACHMENT_NAME>
# Check node volume health
kubectl describe node <NODE_NAME> | grep -A 10 "Volumes"
Deep Dive: See references/storage-csi.md for cloud-specific CSI troubleshooting (EBS, Azure Disk, GCE PD), mount options, and performance issues.
Workflow 4: Node Health
Phase 1: Node Status
# Check node status
kubectl get nodes -o wide
# Check node conditions
kubectl describe node <NODE_NAME> | grep -A 10 "Conditions"
# Check resource pressure
kubectl top nodes
kubectl describe node <NODE_NAME> | grep -A 5 "Allocated resources"
Phase 2: Kubelet and System Services
# Check kubelet logs (node access required)
kubectl debug node/<NODE_NAME> -it --image=ubuntu -- \
chroot /host journalctl -u kubelet -n 100
# Check container runtime
kubectl debug node/<NODE_NAME> -it --image=ubuntu -- \
chroot /host systemctl status containerd
# Check CNI health
kubectl debug node/<NODE_NAME> -it --image=ubuntu -- \
chroot /host ls -la /etc/cni/net.d/
Phase 3: Pod Distribution and Eviction
# List pods on node
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<NODE_NAME>
# Check for evicted pods
kubectl get pods --all-namespaces --field-selector status.phase=Failed
# Check pod disruption budgets
kubectl get pdb --all-namespaces
Deep Dive: See references/calico-cni.md for Calico-specific troubleshooting.
Workflow 5: Helm Debugging
AUTOMATED SCRIPT AVAILABLE: Use helm_release_debug.sh for comprehensive Helm release diagnostics.
Automated Helm Debug (Primary Method)
# Basic release diagnostics
~/.claude/skills/k8s-troubleshooter/scripts/helm_release_debug.sh <RELEASE_NAME> <NAMESPACE>
# Include chart validation (lint, template, dry-run)
~/.claude/skills/k8s-troubleshooter/scripts/helm_release_debug.sh <RELEASE_NAME> <NAMESPACE> \
--chart <CHART_PATH> --values <VALUES_FILE>
# Run post-install tests with the release
~/.claude/skills/k8s-troubleshooter/scripts/helm_release_debug.sh <RELEASE_NAME> <NAMESPACE> --run-tests
# Full validation: chart lint, template, dry-run, and tests
~/.claude/skills/k8s-troubleshooter/scripts/helm_release_debug.sh <RELEASE_NAME> <NAMESPACE> \
--chart <CHART_PATH> --values <VALUES_FILE> --run-tests --run-dry-run
# Compare current release with proposed changes
~/.claude/skills/k8s-troubleshooter/scripts/helm_release_debug.sh <RELEASE_NAME> <NAMESPACE> \
--chart <CHART_PATH> --values <VALUES_FILE> --diff
# Custom output format (json, yaml, table)
~/.claude/skills/k8s-troubleshooter/scripts/helm_release_debug.sh <RELEASE_NAME> <NAMESPACE> --output json
The script automatically provides:
Core Diagnostics (always included):
- Release status, history, and current values
- Deployed resource status and health
- Hook execution details (jobs, pods, logs)
- Recent events and error messages
- Release state clarity (last_error reporting, empty manifest detection)
- Helm secret validation
- Pod status with standard Helm labels
Hook-Aware Diagnostics:
- Lists all hooks defined in the release (helm get hooks)
- Shows hook job and pod details
- Displays hook execution logs
- Identifies failed hook phases (pre-install, post-install, pre-upgrade, post-upgrade, pre-delete, post-delete)
- Reports aged-out events for hooks
Optional Chart Validation (with --chart flag):
- Chart linting (helm lint) with detailed error reporting
- Template rendering (helm template) to catch syntax errors
- Dry-run installation/upgrade (helm upgrade --dry-run --debug) with the --run-dry-run flag
- Value file validation and override testing with --set* flags
- Diff comparison between current and proposed release with the --diff flag
Optional Post-Install Tests (with --run-tests flag):
- Executes helm test for the release
- Provides failure summaries with test pod details
- Shows test logs for failed tests
- Reports test execution timeout issues
When to Use Each Flag
Basic Diagnostics (no flags):
- Release is stuck or failed
- Need to understand current release state
- Checking deployed resource health
- Initial troubleshooting phase
--chart <PATH> --values <FILE> (Chart Validation):
- Before performing an upgrade
- Template rendering errors suspected
- Validating chart syntax changes
- Testing value overrides
- Use when you have access to chart files
--run-dry-run (with --chart):
- Want to see what would change without applying
- Testing complex upgrades
- Validating cluster-side mutations (admission webhooks, mutating webhooks)
- Checking API compatibility
--diff (with --chart):
- Need to see exact changes between current and proposed release
- Reviewing impact of value changes
- Understanding configuration drift
- Pre-upgrade review
--run-tests:
- Release deployed but functionality uncertain
- Post-upgrade validation
- Smoke testing after remediation
- Continuous validation in CI/CD
--set* flags (--set, --set-string, --set-file):
- Quick value overrides without modifying values file
- Testing specific configuration changes
- Override single values for validation
--output <format> (json, yaml, table):
- Integrating with automation or CI/CD
- Parsing output programmatically
- Generating structured reports
Usage Examples
Scenario 1: Failed Helm Upgrade
# Start with basic diagnostics
~/.claude/skills/k8s-troubleshooter/scripts/helm_release_debug.sh myapp production
# If template issues suspected, validate chart
~/.claude/skills/k8s-troubleshooter/scripts/helm_release_debug.sh myapp production \
--chart ./charts/myapp --values values-prod.yaml
Scenario 2: Release Stuck in Pending-Upgrade
# Check release state and hooks
~/.claude/skills/k8s-troubleshooter/scripts/helm_release_debug.sh myapp production
# Review what would happen on retry with dry-run
~/.claude/skills/k8s-troubleshooter/scripts/helm_release_debug.sh myapp production \
--chart ./charts/myapp --values values-prod.yaml --run-dry-run
Scenario 3: Post-Upgrade Validation
# Run tests to verify functionality
~/.claude/skills/k8s-troubleshooter/scripts/helm_release_debug.sh myapp production --run-tests
# Check difference from previous version
~/.claude/skills/k8s-troubleshooter/scripts/helm_release_debug.sh myapp production \
--chart ./charts/myapp --values values-prod.yaml --diff
Scenario 4: Pre-Upgrade Review
# Full validation before upgrade
~/.claude/skills/k8s-troubleshooter/scripts/helm_release_debug.sh myapp production \
--chart ./charts/myapp-v2 --values values-prod.yaml \
--run-dry-run --diff --run-tests
Scenario 5: Hook Failure Investigation
# Script automatically shows hook details including:
# - Hook definitions from release
# - Hook job/pod status
# - Hook execution logs
# - Failed hook phases
~/.claude/skills/k8s-troubleshooter/scripts/helm_release_debug.sh myapp production
Manual Fallback (when script unavailable)
Phase 1: Release Status
# List releases
helm list -n <NAMESPACE>
# Get release status
helm status <RELEASE_NAME> -n <NAMESPACE>
# Check release history
helm history <RELEASE_NAME> -n <NAMESPACE>
# Check for release errors
helm status <RELEASE_NAME> -n <NAMESPACE> -o json | jq '.info.status, .info.description'
Phase 2: Template Validation
# Lint chart
helm lint <CHART_PATH>
# Render templates without installing
helm template <RELEASE_NAME> <CHART_PATH> -n <NAMESPACE> --values <VALUES_FILE>
# Dry-run install/upgrade
helm upgrade --install <RELEASE_NAME> <CHART_PATH> -n <NAMESPACE> --dry-run --debug
# Show diff between releases
helm diff upgrade <RELEASE_NAME> <CHART_PATH> -n <NAMESPACE> --values <VALUES_FILE>
Phase 3: Hook Diagnostics
# Get hooks for release
helm get hooks <RELEASE_NAME> -n <NAMESPACE>
# Check hook job status
kubectl get jobs -n <NAMESPACE> -l app.kubernetes.io/managed-by=Helm
# Get hook pod logs
kubectl logs -n <NAMESPACE> -l app.kubernetes.io/managed-by=Helm --tail=100
Phase 4: Stuck Release Recovery
# Check for pending upgrades
helm list -n <NAMESPACE> --pending
# Check release secrets
kubectl get secrets -n <NAMESPACE> -l owner=helm
# Check deployed resources
helm get manifest <RELEASE_NAME> -n <NAMESPACE> | kubectl get -f -
# Run post-install tests
helm test <RELEASE_NAME> -n <NAMESPACE>
Release State Clarity
The script provides clear reporting on:
Empty Manifest Detection:
- Warns when release has no deployed resources
- Helps identify chart configuration issues
- Indicates potential template rendering problems
Last Error Reporting:
- Shows error from most recent failed operation
- Includes error messages from Helm's perspective
- Helps diagnose why releases are stuck
Aged-Out Events:
- Identifies when events have expired (default 1 hour TTL)
- Prevents confusion from missing event data
- Recommends checking Helm release history instead
Hook State Tracking:
- Shows which hooks succeeded/failed
- Reports hook execution order
- Identifies hanging or incomplete hooks
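If the script is unavailable, rough manual equivalents of these checks might look like this (standard Helm 3 and kubectl options only):

```bash
# Empty manifest check: zero or near-zero lines suggests nothing was rendered/deployed
helm get manifest <RELEASE_NAME> -n <NAMESPACE> | wc -l
# Release history with recorded status/description per revision (shows last error context)
helm history <RELEASE_NAME> -n <NAMESPACE>
# Raw release records (Helm 3 stores state in Secrets labeled owner=helm)
kubectl get secrets -n <NAMESPACE> -l owner=helm --sort-by=.metadata.creationTimestamp
```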
Deep Dive: See references/helm-debugging.md for upgrade failures, rollback procedures, and secret/state cleanup.
Workflow 6: Cluster Health
AUTOMATED SCRIPT AVAILABLE: Use cluster_health_check.sh for quick baseline health checks.
Automated Health Check (Primary Method)
# Run automated cluster health check
~/.claude/skills/k8s-troubleshooter/scripts/cluster_health_check.sh
This script automatically checks:
- Control plane health (API server, components)
- Node status and resource pressure
- System pod health (kube-system namespace)
- Recent cluster events and errors
- Failed or pending pods
Use this first for fast, consistent health checks.
Manual Fallback (for understanding or when script unavailable)
Phase 1: Control Plane Check
# Check control plane components
kubectl get componentstatuses # Deprecated in 1.19+, use below
# Check control plane pods
kubectl get pods -n kube-system
# Check API server health
kubectl get --raw /healthz
kubectl get --raw /livez
kubectl get --raw /readyz
Phase 2: Cluster Resource Overview
# Node summary
kubectl get nodes
# Cluster resource usage
kubectl top nodes
kubectl top pods --all-namespaces | head -20
# Check for resource quotas
kubectl get resourcequotas --all-namespaces
Phase 3: System Pod Health
# Check critical system pods
kubectl get pods -n kube-system -o wide
# Check for recent events
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -50
# Check for failed pods cluster-wide
kubectl get pods --all-namespaces --field-selector status.phase=Failed
Workflow 7: Cluster Assessment
AUTOMATED SCRIPT AVAILABLE: Use cluster_assessment.sh to generate comprehensive assessment reports.
Overview
Generate a comprehensive, documented cluster assessment report for audits, capacity planning, and documentation. Unlike the quick health check above, this produces a detailed markdown report with analysis and recommendations.
When to Use
- Initial cluster evaluation: Baseline assessment of new clusters
- Capacity planning: Understanding resource utilization and growth
- Audit and compliance: Documentation for security reviews
- Quarterly reviews: Regular operational health assessments
- Handoff documentation: Transferring cluster ownership
Automated Assessment (Primary Method)
# Generate report with default name (cluster-assessment-TIMESTAMP.md)
~/.claude/skills/k8s-troubleshooter/scripts/cluster_assessment.sh
# Specify output file
~/.claude/skills/k8s-troubleshooter/scripts/cluster_assessment.sh -o my-cluster-report.md
# Use specific kubeconfig
~/.claude/skills/k8s-troubleshooter/scripts/cluster_assessment.sh -c ~/.kube/prod-config -o prod-report.md
The script automatically:
- Collects comprehensive cluster data (control plane, nodes, workloads, storage, networking)
- Analyzes resource overcommitment, failed workloads, node pressure, security posture
- Generates structured markdown report with executive summary and health score
- Provides prioritized recommendations (High/Medium/Low) with specific action items
Use this first for consistent, thorough assessments with actionable recommendations.
Manual Fallback (for understanding methodology)
Phase 1: Data Collection
# Control plane health
kubectl get --raw /healthz
kubectl get --raw /readyz
kubectl get --raw /livez
# Comprehensive node data
kubectl get nodes -o wide
kubectl describe nodes
kubectl top nodes
# All workloads
kubectl get pods,deployments,statefulsets,daemonsets --all-namespaces
# Storage infrastructure
kubectl get pvc,pv,storageclass --all-namespaces
# Networking
kubectl get svc,endpoints,networkpolicies --all-namespaces
# Recent events
kubectl get events --all-namespaces --sort-by='.lastTimestamp'
Phase 2: Analysis
The assessment analyzes:
- Resource overcommitment (CPU/Memory limits vs capacity)
- Failed or pending workloads
- Node pressure conditions
- Security posture (network policies, RBAC, authentication)
- Storage capacity and health
- Platform component status
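For example, one way to spot-check the security-posture item above (namespaces with no NetworkPolicy at all) — a sketch that assumes bash with process substitution:

```bash
# Namespaces that define no NetworkPolicy (candidates for review)
comm -23 \
  <(kubectl get ns -o jsonpath='{.items[*].metadata.name}' | tr ' ' '\n' | sort) \
  <(kubectl get networkpolicies --all-namespaces -o jsonpath='{.items[*].metadata.namespace}' | tr ' ' '\n' | sort -u)
```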
Phase 3: Report Generation
Generates structured markdown report with:
- Executive summary with health score
- Detailed analysis of all cluster aspects
- Prioritized recommendations (High/Medium/Low)
- Comparison against best practices
Phase 4: Recommendations
Each finding includes:
- Problem statement
- Impact assessment
- Specific action items
- Documentation references
Report Sections
- Executive Summary - Health score, critical findings
- Control Plane Health - API server, components
- Node Infrastructure - Status, capacity, conditions
- Resource Allocation - Overcommitment analysis
- Workload Status - Pods, deployments, health
- Storage Infrastructure - PVC/PV, storage classes
- Network Configuration - CNI, services, policies
- Security Posture - RBAC, authentication, policies
- Recent Events - Warnings, errors, alerts
- Recommendations - Prioritized action items
Comparison: Health Check vs Assessment
| Feature | Health Check | Cluster Assessment |
|---|---|---|
| Speed | 30 seconds | 2-5 minutes |
| Output | Terminal | Markdown report |
| Scope | Critical issues | Full analysis |
| Recommendations | None | Prioritized |
| Use Case | Quick status | Documentation |
Deep Dive: See references/cluster-assessment.md for detailed assessment methodology, automation, and best practices.
Network Debugging Workflow
CNI and Connectivity
# Check CNI pods
kubectl get pods -n kube-system -l k8s-app=calico-node # For Calico
kubectl get pods -n kube-system | grep cni
# Test pod-to-pod connectivity
kubectl run -it --rm netshoot-1 --image=nicolaka/netshoot --restart=Never -- ping <TARGET_POD_IP>
# Check network policies
kubectl get networkpolicies --all-namespaces
# Test service connectivity
kubectl run -it --rm netshoot-2 --image=nicolaka/netshoot --restart=Never -- \
curl <SERVICE_NAME>.<NAMESPACE>.svc.cluster.local:<PORT>
Deep Dive: See references/calico-cni.md for Calico-specific debugging and references/service-networking.md for advanced networking.
Incident Response Playbooks
For structured incident response workflows, see references/incident-response.md:
- Incident Triage Decision Tree: Symptom-based classification and workflow selection
- Evidence Preservation: Best practices for capturing cluster state
- Investigation Workflows by Symptom: Detailed steps for each common failure pattern
- Pods Pending: Scheduling and resource constraints
- CrashLoopBackOff: Application crash recovery
- OOMKilled: Memory pressure investigation
- DNS/Network Issues: CoreDNS and connectivity troubleshooting
- Storage Failures: PVC and CSI driver diagnostics
- Node Problems: NotReady nodes and resource pressure
- Stabilization Techniques: Safe remediation actions for production
- Common Incident Patterns: Real-world scenarios and responses
- Post-Incident Review: Documentation and improvement process
For additional common scenarios, see references/incident-playbooks.md:
- ImagePullBackOff: Registry access and authentication
- Stuck Terminating: Finalizer and graceful shutdown issues
Production Safety Guidelines
Read-Only Commands (Safe)
- kubectl get, describe, logs, top
- kubectl auth can-i --list (RBAC check)
- kubectl debug node/<NODE> with read-only operations
- helm list, status, history, get
Remediation Commands (Require Explicit Approval)
- kubectl delete, apply, edit, patch
- kubectl scale, rollout restart
- kubectl drain, cordon, uncordon
- helm upgrade, rollback, uninstall
Always confirm before running remediation commands in production.
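A read-only way to confirm, before proposing a remediation, that the current identity could even perform it (verbs and resources shown are examples):

```bash
# RBAC pre-checks for common remediation actions
kubectl auth can-i delete pod -n <NAMESPACE>
kubectl auth can-i patch deployment -n <NAMESPACE>
kubectl auth can-i patch nodes   # cordon/uncordon patch the Node object
```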
MCP Integration
For automated kubectl access, see references/mcp-integration.md for:
- MCP server setup with read-only access
- RBAC configuration for diagnostic service accounts
- Security and token hygiene
- Integration patterns with Claude
Scripts Reference
Use these scripts first before running manual command sequences. Scripts provide production-tested workflows with consistent output formatting.
Script Location: ~/.claude/skills/k8s-troubleshooter/scripts/
Getting Help: All scripts support -h flag for detailed usage information and parameter descriptions.
Available Scripts
Incident Triage (incident_triage.sh) [NEW - USE FIRST FOR INCIDENTS]
- Production incident response and triage workflow
- Captures evidence, assesses blast radius, classifies symptoms, recommends workflows
- Output: Markdown report with executive summary and captured evidence
~/.claude/skills/k8s-troubleshooter/scripts/incident_triage.sh --skip-dump
~/.claude/skills/k8s-troubleshooter/scripts/incident_triage.sh --namespace production
~/.claude/skills/k8s-troubleshooter/scripts/incident_triage.sh --output-dir /tmp/incident-123
Cluster Health Check (cluster_health_check.sh)
- Automated baseline cluster health check
- Checks: control plane, nodes, system pods, recent events
~/.claude/skills/k8s-troubleshooter/scripts/cluster_health_check.sh
Cluster Assessment (cluster_assessment.sh)
- Generate comprehensive assessment report with recommendations
- Output: Markdown report with executive summary and action items
~/.claude/skills/k8s-troubleshooter/scripts/cluster_assessment.sh
~/.claude/skills/k8s-troubleshooter/scripts/cluster_assessment.sh -o custom-report.md
~/.claude/skills/k8s-troubleshooter/scripts/cluster_assessment.sh -c ~/.kube/prod-config
Pod Diagnostics (pod_diagnostics.sh)
- Comprehensive pod state analysis and debugging
- Includes: status, events, logs, resource usage, restart history
~/.claude/skills/k8s-troubleshooter/scripts/pod_diagnostics.sh <POD_NAME> <NAMESPACE>
Network Debug (network_debug.sh)
- DNS, endpoints, and connectivity testing
- Tests: DNS resolution, service endpoints, network policies, pod-to-pod connectivity
~/.claude/skills/k8s-troubleshooter/scripts/network_debug.sh <NAMESPACE>
Storage Check (storage_check.sh)
- PVC/PV and CSI driver diagnostics
- Checks: PVC status, bindings, provisioner health, volume attachments
~/.claude/skills/k8s-troubleshooter/scripts/storage_check.sh <NAMESPACE>
Helm Release Debug (helm_release_debug.sh)
- Comprehensive Helm release diagnostics and troubleshooting
- Core: release status, history, deployed resources, hook diagnostics
- Optional: chart validation (lint, template, dry-run), post-install tests, diff comparison
- Flags: --chart, --values, --set*, --run-tests, --run-dry-run, --diff, --output
# Basic diagnostics
~/.claude/skills/k8s-troubleshooter/scripts/helm_release_debug.sh <RELEASE_NAME> <NAMESPACE>
# With chart validation
~/.claude/skills/k8s-troubleshooter/scripts/helm_release_debug.sh <RELEASE_NAME> <NAMESPACE> \
--chart <CHART_PATH> --values <VALUES_FILE>
# Full validation with tests and dry-run
~/.claude/skills/k8s-troubleshooter/scripts/helm_release_debug.sh <RELEASE_NAME> <NAMESPACE> \
--chart <CHART_PATH> --values <VALUES_FILE> --run-tests --run-dry-run --diff
Additional Resources
References:
- references/incident-response.md: [NEW] Incident triage decision tree and response workflows
- references/pod-troubleshooting.md: Pod lifecycle deep dive
- references/service-networking.md: Service and ingress troubleshooting
- references/storage-csi.md: Storage and CSI driver diagnostics
- references/helm-debugging.md: Helm operations and recovery
- references/calico-cni.md: Calico CNI troubleshooting
- references/incident-playbooks.md: Common failure scenarios
- references/mcp-integration.md: MCP server integration
External Documentation:
Workflow Summary
- Identify Symptom: Use decision tree to find entry point
- Run Baseline: Execute Phase 1 diagnostic commands
- Inspect Details: Phase 2 deep dive into specific component
- Correlate Events: Phase 3 check dependencies and relationships
- Deep Dive: Load relevant reference for advanced diagnostics
- Identify Root Cause: Stop when cause is clear
- Propose Remediation: Suggest fixes (requires approval)
- Verify Resolution: Re-run baseline to confirm fix
Remember: The goal is to identify root causes, not just symptoms. Follow the phases systematically and use references for domain-specific deep dives.