troubleshoot

from sagaragas

Diagnose and fix issues with the Kubernetes cluster, Flux reconciliation, or deployed applications

Updated Jan 11, 2026

When & Why to Use This Skill

This Claude skill collects diagnostic commands and common fixes for a Kubernetes cluster managed with Flux CD, written for a k3s homelab environment. It covers Flux reconciliation status, pod failures, network connectivity, certificate issues, and node health, so SREs and DevOps engineers can narrow down an incident quickly.

Use Cases

  • Automating the diagnosis of Flux CD reconciliation failures and forcing HelmRelease updates to restore service.
  • Troubleshooting Kubernetes pod crashes, restarts, and image pull errors with detailed log and event analysis.
  • Performing network connectivity tests and CNI status checks using Cilium diagnostics to resolve routing issues.
  • Identifying and resolving SSL/TLS certificate issues, including failed certificate requests and expiration problems.
  • Checking Talos node health, service state, and kernel logs (dmesg) to debug node-level and hardware issues.
name: troubleshoot
description: Diagnose and fix issues with the Kubernetes cluster, Flux reconciliation, or deployed applications

Troubleshoot Skill

Diagnose and resolve issues in the k3s homelab cluster.

When to Use

  • User reports that an application is not working
  • Flux reconciliation failures
  • Pod crashes or restarts
  • Network connectivity issues
  • Certificate problems

Diagnostic Commands

Check Flux Status

flux get ks -A                    # Kustomization status
flux get hr -A                    # HelmRelease status
flux get sources all -A           # All sources (git, helm, oci)
flux logs --follow                # Flux controller logs

Check Pod Status

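kubectl get pods -A | grep -Ev 'Running|Completed'   # Pods in any other state (header kept)
kubectl describe pod <pod> -n <namespace>            # Conditions, container states, events
kubectl logs <pod> -n <namespace> --previous         # Logs from the previous (crashed) container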
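A filter on the STATUS column alone still passes pods that report Running while some of their containers are not ready. A minimal sketch of a stricter filter follows, assuming the default column layout of `kubectl get pods -A` (NAMESPACE, NAME, READY, STATUS, ...). The function name `unhealthy_pods` is hypothetical and not part of the original skill.

```shell
# unhealthy_pods: read `kubectl get pods -A` output on stdin and print every
# pod that is neither Completed nor Running with all containers ready.
# Assumption: default columns NAMESPACE NAME READY STATUS RESTARTS AGE.
unhealthy_pods() {
  awk '
    NR == 1 { next }               # skip the header row
    $4 == "Completed" { next }     # finished Job pods are expected
    {
      split($3, ready, "/")        # READY looks like "1/2"
      if ($4 != "Running" || ready[1] != ready[2])
        print $1 "/" $2 " " $4 " " $3
    }'
}
```

Typical use would be `kubectl get pods -A | unhealthy_pods`, followed by `kubectl describe pod` on each pod it prints.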

Check Events

# 20 most recent events across all namespaces
kubectl get events -A --sort-by='.lastTimestamp' | tail -20
# Warning events only, in one namespace
kubectl get events -n <namespace> --field-selector type=Warning

Network Diagnostics

cilium status                     # CNI status
cilium connectivity test          # Network connectivity
kubectl get svc -A                # Services
kubectl get httproute -A          # Gateway API HTTPRoutes (ingress routing)

Certificate Issues

kubectl get certificates -A                          # Certificate readiness
kubectl get certificaterequests -A                   # Pending or failed requests
kubectl describe certificate <name> -n <namespace>   # Conditions and events for one certificate
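On a cluster with many certificates, it helps to list only the ones that are not ready before describing them one by one. A minimal sketch follows, assuming READY is the third column of `kubectl get certificates -A` output (NAMESPACE, NAME, READY, ...). The function name `not_ready_certs` is hypothetical.

```shell
# not_ready_certs: read `kubectl get certificates -A` output on stdin and
# print namespace/name for every certificate whose READY column is not True.
# Assumption: columns are NAMESPACE NAME READY ...; adjust the field
# numbers if your output differs.
not_ready_certs() {
  awk 'NR > 1 && $3 != "True" { print $1 "/" $2 }'
}
```

Typical use would be `kubectl get certificates -A | not_ready_certs`, then `kubectl describe certificate` for each entry it prints.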

Common Issues & Fixes

HelmRelease Stuck

flux suspend hr <name> -n <namespace>
flux resume hr <name> -n <namespace>
# Or force reconcile:
flux reconcile hr <name> -n <namespace> --force
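The suspend and resume pair can be wrapped in a small helper so that both steps always target the same release. This is a sketch only: the function name `kick_hr` and the `FLUX_BIN` override are hypothetical and not part of the original skill.

```shell
# FLUX_BIN is a hypothetical override, e.g. FLUX_BIN=echo for a dry run.
FLUX_BIN="${FLUX_BIN:-flux}"

# kick_hr NAME NAMESPACE: suspend a HelmRelease, then resume it.
# Resume runs only if suspend succeeded.
kick_hr() {
  name="$1"
  ns="$2"
  if [ -z "$name" ] || [ -z "$ns" ]; then
    echo "usage: kick_hr <name> <namespace>" >&2
    return 2
  fi
  "$FLUX_BIN" suspend hr "$name" -n "$ns" &&
    "$FLUX_BIN" resume hr "$name" -n "$ns"
}
```

Running it as `FLUX_BIN=echo kick_hr <name> <namespace>` prints the two commands without touching the cluster. If the release is still stuck afterwards, the forced reconcile is the next step.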

Image Pull Errors

  • Check that the image exists and the tag is correct
  • Verify imagePullSecrets if the image comes from a private registry
  • Check Spegel for cached images: kubectl logs -n kube-system -l app.kubernetes.io/name=spegel

SOPS Decryption Failures

kubectl get secret -n flux-system sops-age          # Age key secret must exist
flux logs --kind=Kustomization --name=<ks-name>     # Controller logs for one Kustomization

Node Issues (Talos)

talosctl -n <node-ip> health              # Cluster health checks
talosctl -n <node-ip> dmesg | tail -50    # Last 50 kernel log lines
talosctl -n <node-ip> services            # Talos service states