investigate-pod-failure
Deep investigation of a failing pod. Gathers logs, events, and resource status to identify root cause. Keywords: investigate, debug, pod failure, troubleshoot, root cause, logs, events.
When & Why to Use This Skill
This Claude skill automates the deep investigation of failing Kubernetes pods by aggregating logs, events, and resource statuses. It streamlines the troubleshooting process for SREs and DevOps engineers, enabling rapid root cause identification for common issues like OOM kills, configuration errors, and connection failures within a cluster.
Use Cases
- Rapidly diagnosing 'CrashLoopBackOff' or 'Error' states in production namespaces to minimize service downtime.
- Identifying the root cause of pod failures by correlating container logs with cluster-level events and resource specifications.
- Troubleshooting failed deployments where pods are created but fail to reach a 'Running' state due to application errors.
- Investigating resource-related terminations, such as OOM (Out of Memory) kills, through detailed pod status and event history inspection.
| name | investigate-pod-failure |
|---|---|
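| description | Deep investigation of a failing pod. Gathers logs, events, and resource status to identify root cause. Keywords: investigate, debug, pod failure, troubleshoot, root cause, logs, events. |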
| domain | k8s |
| category | diagnostic |
| requires-approval | false |
| confidence | 0.95 |
Investigate Pod Failure
Preconditions
Before applying this skill, verify:
- Pod is in a failed or error state
- Pod has been created (not stuck in Pending)
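Both preconditions can be checked mechanically against the pod object. The following is a minimal sketch in Python, assuming the pod is available as a plain dict shaped like a Kubernetes Pod resource (`status.phase`, `status.containerStatuses`). The helper name `meets_preconditions` and the list of waiting reasons are illustrative assumptions, not part of the skill.

```python
# Illustrative, non-exhaustive set of waiting reasons that indicate a failure.
FAILURE_WAIT_REASONS = {
    "CrashLoopBackOff",
    "CreateContainerConfigError",
    "ErrImagePull",
    "ImagePullBackOff",
}


def meets_preconditions(pod: dict) -> bool:
    """Return True if the pod is past Pending and shows a failure.

    Hypothetical helper: `pod` is assumed to be a dict shaped like a
    Kubernetes Pod object.
    """
    status = pod.get("status") or {}
    phase = status.get("phase")

    # The pod must have been created and must not be stuck in Pending.
    if phase in (None, "Pending"):
        return False

    # The pod as a whole has failed.
    if phase == "Failed":
        return True

    # Otherwise look for a container that is waiting with a failure reason
    # or that has terminated with a non-zero exit code.
    for cs in status.get("containerStatuses") or []:
        state = cs.get("state") or {}
        waiting = state.get("waiting") or {}
        terminated = state.get("terminated") or {}
        if waiting.get("reason") in FAILURE_WAIT_REASONS:
            return True
        if terminated.get("exitCode", 0) != 0:
            return True
    return False
```

A pod that is `Running` with every container healthy returns `False`, so the investigation is skipped for pods that do not need it.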
Actions
1. Get Pod Status and Details
Retrieve the full pod specification and current status.
mcp_tool: kubernetes-mcp-server/pods_get
params:
  name: $pod_name
  namespace: $namespace
timeout: 30s
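Once the pod object is retrieved, the most useful fields are usually the per-container states. The sketch below shows one way to condense them, assuming the pod is a dict shaped like a Kubernetes Pod resource; the function name `summarize_container_states` is hypothetical, and the actual output format of `pods_get` is not specified here.

```python
def summarize_container_states(pod: dict) -> list[dict]:
    """Condense status.containerStatuses into one summary per container.

    Hypothetical helper; assumes `pod` follows the Kubernetes Pod schema.
    """
    summaries = []
    for cs in (pod.get("status") or {}).get("containerStatuses") or []:
        state = cs.get("state") or {}
        last = cs.get("lastState") or {}
        waiting = state.get("waiting") or {}
        # A crash-looping container is usually "waiting" right now, so the
        # real failure is recorded under lastState.terminated.
        terminated = state.get("terminated") or last.get("terminated") or {}
        summaries.append({
            "container": cs.get("name"),
            "restarts": cs.get("restartCount", 0),
            "waiting_reason": waiting.get("reason"),
            "terminated_reason": terminated.get("reason"),
            "exit_code": terminated.get("exitCode"),
        })
    return summaries
```

A `terminated_reason` of `OOMKilled` or a high `restarts` count narrows the search before any logs are read.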
2. Get Pod Logs
Retrieve the most recent 100 log lines from the pod's container.
mcp_tool: kubernetes-mcp-server/pods_log
params:
  name: $pod_name
  namespace: $namespace
  tail: 100
timeout: 30s
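The retrieved log tail can be scanned for common failure signatures before reading it line by line. This is a rough sketch, assuming the logs arrive as a single string; the pattern table `LOG_SIGNATURES` and the function `scan_logs` are illustrative assumptions, since applications word these errors in many different ways.

```python
import re

# Illustrative patterns only: this table is an assumption, not a standard.
LOG_SIGNATURES = {
    "connection refused": re.compile(r"connection refused|ECONNREFUSED", re.I),
    "missing configuration": re.compile(
        r"(missing|not set|undefined).*(config|env|variable)", re.I
    ),
    "out of memory": re.compile(
        r"out of memory|OutOfMemoryError|Cannot allocate memory", re.I
    ),
}


def scan_logs(log_text: str) -> dict[str, list[str]]:
    """Group log lines by the failure signature they match."""
    hits: dict[str, list[str]] = {}
    for line in log_text.splitlines():
        for label, pattern in LOG_SIGNATURES.items():
            if pattern.search(line):
                hits.setdefault(label, []).append(line.strip())
    return hits
```

An empty result does not mean the logs are clean; it only means none of the assumed patterns matched.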
3. Get Recent Events
Retrieve recent events in the pod's namespace and keep those that reference the pod.
mcp_tool: kubernetes-mcp-server/events_list
params:
  namespace: $namespace
timeout: 30s
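Because the call is scoped to a namespace rather than a single pod, the result usually needs filtering. The sketch below assumes the events have already been parsed into dicts shaped like Kubernetes `Event` objects (`type`, `involvedObject`, `lastTimestamp`); the function name `pod_warning_events` is hypothetical, and the raw output format of `events_list` is not specified here.

```python
def pod_warning_events(events: list[dict], pod_name: str) -> list[dict]:
    """Return Warning events that reference the given pod, newest first.

    Hypothetical helper; assumes each event follows the Kubernetes Event schema.
    """
    matches = []
    for event in events:
        involved = event.get("involvedObject") or {}
        if (
            event.get("type") == "Warning"
            and involved.get("kind") == "Pod"
            and involved.get("name") == pod_name
        ):
            matches.append(event)
    # RFC 3339 UTC timestamps sort correctly as strings; events without a
    # timestamp sort last.
    return sorted(matches, key=lambda e: e.get("lastTimestamp") or "", reverse=True)
```

Reasons such as `BackOff`, `Unhealthy`, or `FailedMount` in the filtered list often point at the failing component more directly than the application logs do.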
Success Criteria
The skill succeeds when:
- Root cause identified from logs or events
- Investigation produces actionable findings
Failure Handling
If investigation is inconclusive:
- Check previous container logs (--previous flag)
- Describe the pod for more context
- Check node events if pod failed to schedule
- Escalate with all gathered data
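For an operator following up by hand, the first two fallback steps map onto standard `kubectl` commands, and a pod-scoped event query is a useful third. The sketch below only builds the argument lists and does not run anything; the function name `fallback_commands` is hypothetical, and using `kubectl` instead of the MCP tools is an assumption made for illustration.

```python
import shlex


def fallback_commands(pod_name: str, namespace: str, tail: int = 100) -> list[list[str]]:
    """Build kubectl argument lists for the manual follow-up steps."""
    return [
        # Logs from the previous container instance (before the last restart).
        ["kubectl", "logs", pod_name, "-n", namespace, "--previous", f"--tail={tail}"],
        # Full description, including conditions and recent events.
        ["kubectl", "describe", "pod", pod_name, "-n", namespace],
        # Events that reference this pod only.
        ["kubectl", "get", "events", "-n", namespace,
         "--field-selector", f"involvedObject.name={pod_name}"],
    ]


# Print the commands in copy-and-paste form.
for argv in fallback_commands("backend-api-abc123", "production"):
    print(shlex.join(argv))
```

Returning argument lists instead of shell strings keeps pod and namespace names from being interpreted by a shell if the commands are later passed to `subprocess.run`.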
Examples
Input Context:
{
"pod_name": "backend-api-abc123",
"namespace": "production",
"status": "Error",
"exit_code": 1
}
Expected Outcome: Investigation reveals root cause (e.g., missing config, connection refused, OOM killed) with specific actionable recommendations.
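The exit code in the input context already gives a first hint before any tool is called. The sketch below is a minimal illustration using the example input above; the hint table `EXIT_CODE_HINTS` and the function `triage` are assumptions for this example and cover only a few well-known codes.

```python
# Small, non-exhaustive table of well-known container exit codes.
EXIT_CODE_HINTS = {
    1: "generic application error; check the logs for a startup failure or stack trace",
    137: "killed by SIGKILL (128 + 9); often an OOM kill, confirm via reason OOMKilled",
    143: "terminated by SIGTERM (128 + 15); usually a shutdown request, not a crash",
}


def triage(context: dict) -> dict:
    """Attach a first-pass hint to the input context (hypothetical helper)."""
    code = context.get("exit_code")
    return {
        "pod": f'{context["namespace"]}/{context["pod_name"]}',
        "status": context.get("status"),
        "exit_code": code,
        "hint": EXIT_CODE_HINTS.get(code, "no specific hint; inspect logs and events"),
    }


example = {
    "pod_name": "backend-api-abc123",
    "namespace": "production",
    "status": "Error",
    "exit_code": 1,
}
print(triage(example)["hint"])
```

The hint only sets the direction of the investigation; the root cause still has to be confirmed from the logs and events gathered in the actions above.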