restart-crashloop

from X-McKay

Restart a pod stuck in CrashLoopBackOff. Use when pod has crashed 3+ times and a restart might resolve transient issues. Keywords: crashloop, restart, pod failure, container crash, pod stuck, pod crashing.

1 star | 0 forks | Updated Jan 11, 2026

When & Why to Use This Skill

This Claude skill automates the remediation of Kubernetes pods stuck in a CrashLoopBackOff state. It restarts the failing containers by deleting the affected pod, so that the pod's owning controller (for a Deployment, its ReplicaSet) creates a fresh replacement. Because it first validates preconditions such as restart count and pod type, automated restarts are applied only to appropriate workloads, which reduces manual SRE intervention and shortens service recovery time.

Use Cases

  • Transient Issue Resolution: Automatically restarting microservices that fail to initialize due to temporary network timeouts or external dependency unavailability.
  • SRE Runbook Automation: Serving as an automated step in an incident response workflow to handle 'stuck' pods without requiring manual kubectl commands.
  • Environment Maintenance: Quickly clearing failing pods in development or staging environments to ensure resource availability and clean state transitions.
  • First-Line Defense: Acting as a primary automated response for non-critical pod failures before escalating to human engineers.
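name: restart-crashloop
description: >
  Restart a pod stuck in CrashLoopBackOff. Use when pod has crashed 3+ times
  and a restart might resolve transient issues. Keywords: crashloop, restart,
  pod failure, container crash, pod stuck, pod crashing.
domain: k8s
category: remediation
requires-approval: false
confidence: 0.85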

Restart CrashLoopBackOff Pod

Preconditions

Before applying this skill, verify:

  • Pod status is CrashLoopBackOff
  • Pod has restarted more than 3 times
  • Pod is NOT part of a Job or CronJob
  • No OOMKilled events in last 10 minutes
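
The four checks above can be sketched as a single predicate. This is a minimal illustration, not the skill's actual implementation: the function name, the `owner_kind` and `last_oom_killed_at` parameters, and the way the caller obtains them are all assumptions. The threshold follows the "more than 3 times" wording of this list.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MIN_RESTARTS = 3                     # pod must have restarted MORE than this
OOM_WINDOW = timedelta(minutes=10)   # look-back window for OOMKilled events
EXCLUDED_OWNERS = {"Job", "CronJob"}  # workloads this skill must not touch


def preconditions_met(
    status: str,
    restart_count: int,
    owner_kind: Optional[str],               # hypothetical: kind of the owning object
    last_oom_killed_at: Optional[datetime],  # hypothetical: None if never OOMKilled
    now: datetime,
) -> bool:
    """Return True only if all four preconditions hold."""
    if status != "CrashLoopBackOff":
        return False
    if restart_count <= MIN_RESTARTS:
        return False
    if owner_kind in EXCLUDED_OWNERS:
        return False
    # A recent OOMKilled event suggests a resource problem, not a transient one.
    if last_oom_killed_at is not None and now - last_oom_killed_at <= OOM_WINDOW:
        return False
    return True
```

Passing `now` explicitly keeps the check deterministic, which makes it easy to exercise the 10-minute window without mocking the clock.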

Actions

1. Delete Pod to Trigger Recreation

Use the kubernetes-mcp-server to delete the pod. The pod's owning controller (for a Deployment, its ReplicaSet) will automatically create a replacement pod.

mcp_tool: kubernetes-mcp-server/pods_delete
params:
  name: $pod_name
  namespace: $namespace
timeout: 30s
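
How the `$pod_name` and `$namespace` placeholders get filled in is not specified here. One plausible reading is plain `$`-style substitution from the input context, sketched below with Python's standard `string.Template`. The `ACTION` dictionary mirrors the action definition above; `render_action` is a hypothetical helper, not part of kubernetes-mcp-server.

```python
from string import Template

# Mirror of the action definition above, as a plain dictionary.
ACTION = {
    "mcp_tool": "kubernetes-mcp-server/pods_delete",
    "params": {"name": "$pod_name", "namespace": "$namespace"},
    "timeout": "30s",
}


def render_action(action: dict, context: dict) -> dict:
    """Return a copy of the action with $placeholders in params resolved.

    Raises KeyError if the context lacks a referenced placeholder, so a
    delete is never issued with an unresolved name or namespace.
    """
    params = {
        key: Template(value).substitute(context)
        for key, value in action["params"].items()
    }
    return {**action, "params": params}
```

Using `substitute` rather than `safe_substitute` is deliberate: a missing key fails loudly instead of sending the literal string `$pod_name` to the delete call.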

Success Criteria

The skill succeeds when:

  • New pod created within 30 seconds
  • New pod reaches Running state within 2 minutes
  • No CrashLoopBackOff within 5 minutes of restart
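
The three criteria above can be expressed as one check over observed timings. This is a sketch under stated assumptions: every duration is measured in seconds from the moment the old pod was deleted (the text does not say which event starts each clock), and the function and parameter names are hypothetical.

```python
from typing import Optional

CREATE_DEADLINE_S = 30    # new pod created within 30 seconds
RUNNING_DEADLINE_S = 120  # new pod Running within 2 minutes
STABLE_WINDOW_S = 300     # no CrashLoopBackOff within 5 minutes


def restart_succeeded(
    created_after_s: Optional[float],
    running_after_s: Optional[float],
    crashloop_after_s: Optional[float],
) -> bool:
    """Evaluate the success criteria.

    Each argument is seconds since the delete, or None if the event
    was never observed during the monitoring period.
    """
    if created_after_s is None or created_after_s > CREATE_DEADLINE_S:
        return False
    if running_after_s is None or running_after_s > RUNNING_DEADLINE_S:
        return False
    # A CrashLoopBackOff inside the stability window counts as failure.
    if crashloop_after_s is not None and crashloop_after_s <= STABLE_WINDOW_S:
        return False
    return True
```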

Failure Handling

If the pod does not reach Running state:

  1. Check events for the new pod
  2. Check logs from the new pod
  3. Escalate to a human if the pattern repeats 3 times
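
The escalation rule above can be sketched as a small decision function. It assumes that "repeats 3 times" means three failed restart attempts for the same pod, which is one interpretation of the text; the step names and the `failure_steps` helper are illustrative only.

```python
from typing import List

ESCALATE_AFTER = 3  # assumed: failed attempts before a human is involved


def failure_steps(failed_attempts: int) -> List[str]:
    """Return the follow-up steps after a restart that did not reach Running."""
    # Events and logs are always collected first, so that an eventual
    # escalation carries diagnostic context with it.
    steps = ["check_events", "check_logs"]
    if failed_attempts >= ESCALATE_AFTER:
        steps.append("escalate_to_human")
    return steps
```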

Examples

Input Context:

{
  "pod_name": "nginx-deployment-abc123",
  "namespace": "default",
  "restart_count": 5,
  "status": "CrashLoopBackOff"
}

Expected Outcome: Pod deleted, new pod reaches Running within 2 minutes.
