k8s-troubleshoot

from mhalder

Debug Kubernetes pods, services, and cluster issues. Use when the user says "pod not starting", "CrashLoopBackOff", "service not reachable", "kubectl debug", "pod stuck pending", or asks about Kubernetes problems.

Updated Jan 5, 2026

When & Why to Use This Skill

This Claude skill acts as a Kubernetes SRE assistant for diagnosing and resolving problems with pods, services, and deployments. It uses kubectl to gather state, events, and logs, then identifies root causes such as resource constraints, networking misconfigurations, and container lifecycle errors.

Use Cases

  • Diagnosing Pod startup failures: Identifying causes for 'CrashLoopBackOff', 'ImagePullBackOff', or 'Pending' states by analyzing events and logs.
  • Troubleshooting Service connectivity: Verifying endpoint mapping, selector matches, and ingress configurations when services are unreachable.
  • Cluster Health Monitoring: Checking node pressure, resource allocation (CPU/Memory), and taints/tolerations that prevent workload scheduling.
  • Rapid Incident Remediation: Providing step-by-step instructions to fix configuration errors in ConfigMaps, Secrets, or Deployment manifests based on real-time diagnostic data.

name: k8s-troubleshoot
description: Debug Kubernetes pods, services, and cluster issues. Use when the user says "pod not starting", "CrashLoopBackOff", "service not reachable", "kubectl debug", "pod stuck pending", or asks about Kubernetes problems.
allowed-tools: Bash, Read, Grep

Kubernetes Troubleshoot

Debug pods, services, deployments, and networking issues in Kubernetes.

Instructions

  1. Identify the affected resource (pod, service, deployment)
  2. Get current state with kubectl get and kubectl describe
  3. Check logs if applicable
  4. Diagnose based on status/events
  5. Provide specific remediation steps
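
The steps above could be scripted roughly as follows. This is a minimal sketch, not part of the skill itself: the function name `triage_pod` and its argument order are hypothetical, and the `--tail=50` limit is an arbitrary choice. It uses only kubectl subcommands already listed in this document.

```shell
#!/usr/bin/env bash
# Hypothetical helper: collect state, events, and logs for one pod.
# Usage: triage_pod <pod> [namespace]
triage_pod() {
  local pod="$1" ns="${2:-default}"
  if [ -z "$pod" ]; then
    echo "usage: triage_pod <pod> [namespace]" >&2
    return 2
  fi

  echo "== state =="
  kubectl get pod "$pod" -n "$ns" -o wide

  echo "== describe (includes events) =="
  kubectl describe pod "$pod" -n "$ns"

  echo "== logs (current container instance) =="
  kubectl logs "$pod" -n "$ns" --tail=50

  echo "== logs (previous instance, if the container restarted) =="
  # --previous fails when there was no earlier instance; treat that as informational.
  kubectl logs "$pod" -n "$ns" --previous --tail=50 2>/dev/null \
    || echo "(no previous container instance)"
}
```

For a multi-container pod, the `kubectl logs` calls would likely need `-c <container>` added. Diagnosis and remediation (steps 4 and 5) are left to the reader of the output.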

Diagnostic commands

# Pod debugging
kubectl get pods -o wide
kubectl describe pod <pod>
kubectl logs <pod> [--previous] [-c <container>]
kubectl get events --sort-by=.lastTimestamp

# Service/networking
kubectl get svc,endpoints
kubectl describe svc <service>
kubectl get ingress

# Resource issues
kubectl top pods
kubectl describe node <node> | grep -A5 "Allocated resources"

# Debug pod (ephemeral container)
kubectl debug -it <pod> --image=busybox --target=<container>

Common issues

| Status | Cause | Solution |
| --- | --- | --- |
| Pending | No resources | Check node capacity, resource requests |
| Pending | No matching node | Check nodeSelector, taints/tolerations |
| ImagePullBackOff | Bad image/auth | Verify image name, imagePullSecrets |
| CrashLoopBackOff | App crashing | Check logs, entrypoint, health probes |
| CreateContainerConfigError | Bad configmap/secret | Verify referenced configs exist |
| Evicted | Node pressure | Check node conditions, resource limits |
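
The status-to-solution mapping above can be expressed as a small lookup, which may be handy when scanning many pods. This is an illustrative sketch: `suggest_check` is a hypothetical name, and the `ErrImagePull` case is an addition not found in the table (it is the status Kubernetes reports before backing off to ImagePullBackOff).

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a pod status (the STATUS column of
# `kubectl get pods`) to the first check suggested by the table above.
suggest_check() {
  case "$1" in
    Pending)
      echo "Check node capacity and resource requests, then nodeSelector and taints/tolerations" ;;
    ImagePullBackOff|ErrImagePull)  # ErrImagePull is an assumption, not in the table
      echo "Verify image name and imagePullSecrets" ;;
    CrashLoopBackOff)
      echo "Check logs (including --previous), entrypoint, and health probes" ;;
    CreateContainerConfigError)
      echo "Verify referenced ConfigMaps and Secrets exist" ;;
    Evicted)
      echo "Check node conditions and resource limits" ;;
    *)
      echo "No table entry for '$1'; start with kubectl describe" ;;
  esac
}
```

Calling `suggest_check CrashLoopBackOff` prints the corresponding row's advice; any status outside the table falls through to the generic `kubectl describe` hint.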

Service not reachable checklist

  1. Pod running? kubectl get pods -l app=<app>
  2. Pod ready? Check readiness probe
  3. Endpoints exist? kubectl get endpoints <svc>
  4. Service selector matches pod labels?
  5. Port/targetPort correct?
  6. NetworkPolicy blocking traffic?
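
Item 3 of the checklist is often the quickest signal: a Service with no ready endpoint addresses usually points to a selector mismatch or failing readiness probes. A minimal sketch of that check, assuming the hypothetical function name `svc_has_endpoints` and a namespace argument that defaults to `default`:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the Service has at least one ready endpoint address.
# Usage: svc_has_endpoints <service> [namespace]
svc_has_endpoints() {
  local svc="$1" ns="${2:-default}" ips
  # .subsets[*].addresses holds ready pod IPs; not-ready pods are listed
  # under notReadyAddresses and are deliberately not counted here.
  ips="$(kubectl get endpoints "$svc" -n "$ns" \
    -o jsonpath='{.subsets[*].addresses[*].ip}')" || return 2
  if [ -z "$ips" ]; then
    echo "no ready endpoints for $svc: check selector, pod labels, readiness probes"
    return 1
  fi
  echo "ready endpoints for $svc: $ips"
}
```

The return codes are a design choice for this sketch: 0 when endpoints exist, 1 when the list is empty, 2 when the kubectl call itself fails (for example, the Service does not exist).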

Rules

  • MUST check events with kubectl describe before diagnosing
  • MUST check logs for CrashLoopBackOff
  • Never delete pods/resources without user approval
  • Never apply changes without showing the diff first
  • Always specify namespace if not default: -n <namespace>
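
The "show the diff first" and "user approval" rules could be enforced with a wrapper along these lines. This is a sketch under stated assumptions: `diff_then_apply` is a hypothetical name, the y/N prompt is one possible approval mechanism, and it relies on `kubectl diff` exiting with 0 for no differences, 1 for differences, and a higher code on error.

```shell
#!/usr/bin/env bash
# Hypothetical helper: show `kubectl diff`, then apply only after explicit approval.
# Usage: diff_then_apply <manifest> [namespace]
diff_then_apply() {
  local manifest="$1" ns="${2:-default}" rc=0 answer=""
  kubectl diff -f "$manifest" -n "$ns" || rc=$?
  # kubectl diff exit codes: 0 = no differences, 1 = differences found, >1 = error.
  case "$rc" in
    0) echo "no changes to apply"; return 0 ;;
    1) ;;
    *) echo "kubectl diff failed (exit $rc)" >&2; return "$rc" ;;
  esac
  # An empty or missing answer (including EOF on stdin) counts as "no".
  read -r -p "Apply these changes? [y/N] " answer || true
  if [ "$answer" = "y" ]; then
    kubectl apply -f "$manifest" -n "$ns"
  else
    echo "aborted; nothing applied"
  fi
}
```

Defaulting to "no" keeps the wrapper safe in non-interactive runs, where nothing is applied unless a `y` is supplied on stdin.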