restart-crashloop
Restart a pod stuck in CrashLoopBackOff. Use when a pod has crashed 3+ times and a restart might resolve transient issues. Keywords: crashloop, restart, pod failure, container crash, pod stuck, pod crashing.
When & Why to Use This Skill
This Claude skill automates the remediation of Kubernetes pods stuck in a CrashLoopBackOff state. It restarts the failing containers by deleting the affected pod, so that the pod's owning controller (for a Deployment, its ReplicaSet) creates a fresh replacement. The skill first validates preconditions such as restart count and pod type, so automated restarts are applied only to appropriate workloads. This reduces manual SRE intervention and shortens service recovery time.
Use Cases
- Transient Issue Resolution: Automatically restarting microservices that fail to initialize due to temporary network timeouts or external dependency unavailability.
- SRE Runbook Automation: Serving as an automated step in an incident response workflow to handle 'stuck' pods without requiring manual kubectl commands.
- Environment Maintenance: Quickly clearing failing pods in development or staging environments to ensure resource availability and clean state transitions.
- First-Line Defense: Acting as a primary automated response for non-critical pod failures before escalating to human engineers.
| name | restart-crashloop |
|---|---|
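| description | Restart a pod stuck in CrashLoopBackOff. Use when a pod has crashed 3+ times and a restart might resolve transient issues. Keywords: crashloop, restart, pod failure, container crash, pod stuck, pod crashing. |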
| domain | k8s |
| category | remediation |
| requires-approval | false |
| confidence | 0.85 |
Restart CrashLoopBackOff Pod
Preconditions
Before applying this skill, verify:
- Pod status is CrashLoopBackOff
- Pod has restarted more than 3 times
- Pod is NOT part of a Job or CronJob
- No OOMKilled events in the last 10 minutes
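The checks above can be sketched as a single function over a pod object in the shape returned by the Kubernetes API as JSON. This is a minimal illustration, not the skill's actual implementation: the function name `check_preconditions`, the constants, and the choice to detect OOM kills from each container's `lastState.terminated` record (rather than from Event objects) are all assumptions. The threshold follows the "more than 3" wording of the list; the skill description's "3+" would make it inclusive.

```python
from datetime import datetime, timedelta, timezone

RESTART_THRESHOLD = 3               # hypothetical name; "more than 3 times"
OOM_WINDOW = timedelta(minutes=10)  # hypothetical name; "last 10 minutes"


def _parse_time(value):
    # Kubernetes serializes timestamps as e.g. "2024-01-01T12:00:00Z".
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def check_preconditions(pod, now):
    """Return the failed preconditions; an empty list means all four hold."""
    failures = []
    statuses = pod.get("status", {}).get("containerStatuses", [])

    # 1. Some container is waiting with reason CrashLoopBackOff.
    if not any(
        s.get("state", {}).get("waiting", {}).get("reason") == "CrashLoopBackOff"
        for s in statuses
    ):
        failures.append("pod is not in CrashLoopBackOff")

    # 2. The most-restarted container exceeds the threshold.
    restarts = max((s.get("restartCount", 0) for s in statuses), default=0)
    if restarts <= RESTART_THRESHOLD:
        failures.append("pod has not restarted more than 3 times")

    # 3. Pods created for a CronJob are owned by a Job, so checking the
    #    direct owner kind covers both cases.
    owners = pod.get("metadata", {}).get("ownerReferences", [])
    if any(o.get("kind") in ("Job", "CronJob") for o in owners):
        failures.append("pod belongs to a Job or CronJob")

    # 4. Assumption: an OOM kill shows up as the container's last termination.
    for s in statuses:
        term = s.get("lastState", {}).get("terminated", {})
        if term.get("reason") == "OOMKilled" and "finishedAt" in term:
            if now - _parse_time(term["finishedAt"]) <= OOM_WINDOW:
                failures.append("container was OOMKilled in the last 10 minutes")
                break

    return failures
```

Returning the list of failed checks, rather than a single boolean, lets the caller report why a restart was skipped.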
Actions
1. Delete Pod to Trigger Recreation
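Use the kubernetes-mcp-server to delete the pod. The pod's owning controller (for a Deployment, its ReplicaSet) will automatically create a replacement pod. A bare pod with no owning controller is not recreated after deletion.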
mcp_tool: kubernetes-mcp-server/pods_delete
params:
  name: $pod_name
  namespace: $namespace
timeout: 30s
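How the `$pod_name` and `$namespace` placeholders are filled is not specified here. One plausible reading is plain variable substitution from the input context. The sketch below illustrates that reading with Python's standard `string.Template`; the `ACTION` dictionary, the `render_action` helper, and the placement of `timeout` alongside `params` rather than inside it are assumptions made for the example.

```python
from string import Template

# The action definition from above, expressed as a Python dictionary.
ACTION = {
    "mcp_tool": "kubernetes-mcp-server/pods_delete",
    "params": {"name": "$pod_name", "namespace": "$namespace"},
    "timeout": "30s",
}


def render_action(action, context):
    """Return a copy of the action with $placeholders in params filled in.

    Raises KeyError if the context lacks a variable that params refers to,
    so a missing pod name fails loudly instead of deleting the wrong thing.
    """
    params = {
        key: Template(value).substitute(context)
        for key, value in action["params"].items()
    }
    return {**action, "params": params}


context = {"pod_name": "nginx-deployment-abc123", "namespace": "default"}
rendered = render_action(ACTION, context)
print(rendered["params"])
```

Using `substitute` rather than `safe_substitute` is deliberate: an unresolved placeholder should stop the action before any delete request is sent.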
Success Criteria
The skill succeeds when:
- New pod created within 30 seconds
- New pod reaches Running state within 2 minutes
- No CrashLoopBackOff within 5 minutes of restart
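The three criteria above can be checked against a series of status samples taken after the delete. The sketch below is one hedged way to express that. The `Observation` type and `evaluate` function are hypothetical, all three time limits are measured from the moment of deletion (an assumption, since the text does not say where the clocks start), and `status` stands for the same kind of value as the `status` field in the input context.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    seconds_since_delete: float
    status: Optional[str]  # None while no replacement pod exists yet


def evaluate(observations):
    """Return (ok, reason) for a list of samples ordered by time."""
    # Criterion 1: a replacement pod exists within 30 seconds.
    created = [o.seconds_since_delete for o in observations if o.status is not None]
    if not created or min(created) > 30:
        return False, "no replacement pod within 30 seconds"

    # Criterion 2: the replacement is Running within 2 minutes.
    running = [o.seconds_since_delete for o in observations if o.status == "Running"]
    if not running or min(running) > 120:
        return False, "pod not Running within 2 minutes"

    # Criterion 3: no CrashLoopBackOff during the first 5 minutes.
    if any(
        o.status == "CrashLoopBackOff" and o.seconds_since_delete <= 300
        for o in observations
    ):
        return False, "pod re-entered CrashLoopBackOff within 5 minutes"

    # Success can only be declared once the full window has been observed.
    if max(o.seconds_since_delete for o in observations) < 300:
        return False, "observation window shorter than 5 minutes"

    return True, "ok"
```

A caller would sample the pod at a fixed interval for five minutes and pass the collected samples to `evaluate`.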
Failure Handling
If the pod does not reach Running state:
- Check events for the new pod
- Check logs from the new pod
- Escalate to a human if the pattern repeats 3 times
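The escalation rule above can be sketched as a bounded retry loop. This is an illustration under stated assumptions: "repeats 3 times" is read as three consecutive failed restart attempts, and `remediate` and its `attempt` callback are hypothetical names. The callback is expected to perform one delete-and-verify cycle and return `(ok, diagnostics)`, where `diagnostics` holds whatever events and logs were collected from the new pod.

```python
def remediate(attempt, max_repeats=3):
    """Run attempt() until it succeeds or has failed max_repeats times.

    Returns a summary dict whose "outcome" is either "recovered" or
    "escalate". Diagnostics from every failed attempt are kept so that a
    human receiving the escalation can see the full failure pattern.
    """
    history = []
    for _ in range(max_repeats):
        ok, diagnostics = attempt()
        if ok:
            return {
                "outcome": "recovered",
                "attempts": len(history) + 1,
                "diagnostics": history,
            }
        history.append(diagnostics)
    return {"outcome": "escalate", "attempts": max_repeats, "diagnostics": history}
```

Capping the loop stops the skill from deleting a pod indefinitely when the crash has a persistent cause such as a bad image or a configuration error.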
Examples
Input Context:
{
  "pod_name": "nginx-deployment-abc123",
  "namespace": "default",
  "restart_count": 5,
  "status": "CrashLoopBackOff"
}
Expected Outcome: Pod deleted, new pod reaches Running within 2 minutes.