investigate-pod-failure

from X-McKay

Deep investigation of a failing pod. Gathers logs, events, and resource status to identify root cause. Keywords: investigate, debug, pod failure, troubleshoot, root cause, logs, events.

1 star · 0 forks · Updated Jan 11, 2026

When & Why to Use This Skill

This Claude skill automates the deep investigation of failing Kubernetes pods by aggregating logs, events, and resource statuses. It streamlines the troubleshooting process for SREs and DevOps engineers, enabling rapid root cause identification for common issues like OOM kills, configuration errors, and connection failures within a cluster.

Use Cases

  • Rapidly diagnosing 'CrashLoopBackOff' or 'Error' states in production namespaces to minimize service downtime.
  • Identifying the root cause of pod failures by correlating container logs with cluster-level events and resource specifications.
  • Troubleshooting failed deployments where pods are created but fail to reach a 'Running' state due to application errors.
  • Investigating resource-related terminations, such as OOM (Out of Memory) kills, through detailed pod status and event history inspection.
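name: investigate-pod-failure
description: >
  Deep investigation of a failing pod. Gathers logs, events, and resource
  status to identify root cause. Keywords: investigate, debug, pod failure,
  troubleshoot, root cause, logs, events.
domain: k8s
category: diagnostic
requires-approval: false
confidence: 0.95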

Investigate Pod Failure

Preconditions

Before applying this skill, verify:

  • Pod is in a failed or error state
  • Pod has been created (not stuck in Pending)

Actions

1. Get Pod Status and Details

Retrieve the full pod specification and current status.

mcp_tool: kubernetes-mcp-server/pods_get
params:
  name: $pod_name
  namespace: $namespace
timeout: 30s
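
The container statuses in the returned pod object usually carry the most direct evidence of why a container stopped. The following is a minimal sketch of how they could be summarized, assuming the tool response has already been parsed into a dict shaped like a Kubernetes Pod API object. The helper name `summarize_container_states` is hypothetical and not part of the skill.

```python
def summarize_container_states(pod: dict) -> list[dict]:
    """Collect waiting/terminated details for each container in a Pod object."""
    summaries = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        state = cs.get("state", {})
        last = cs.get("lastState", {})
        # Prefer the current terminated state; fall back to the previous
        # termination, which is where a restarting container records it.
        terminated = state.get("terminated") or last.get("terminated") or {}
        waiting = state.get("waiting") or {}
        summaries.append({
            "container": cs.get("name"),
            "restarts": cs.get("restartCount", 0),
            "waiting_reason": waiting.get("reason"),
            "terminated_reason": terminated.get("reason"),
            "exit_code": terminated.get("exitCode"),
        })
    return summaries
```

A `terminated_reason` of `OOMKilled` or a `waiting_reason` of `CrashLoopBackOff` narrows down which of the later steps deserves the most attention.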

2. Get Pod Logs

Retrieve the last 100 log lines from the pod's container.

mcp_tool: kubernetes-mcp-server/pods_log
params:
  name: $pod_name
  namespace: $namespace
  tail: 100
timeout: 30s
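
Scanning the retrieved lines for known error signatures is one way to turn raw logs into findings. The sketch below assumes the logs arrive as a single string. The pattern table and the name `find_error_signatures` are illustrative assumptions, not something the skill defines, and the patterns would need tuning for a real application's log format.

```python
import re

# Illustrative patterns only; extend them for the application's log format.
ERROR_PATTERNS = {
    "connection_refused": re.compile(r"connection refused", re.IGNORECASE),
    "missing_config": re.compile(
        r"(no such file or directory|environment variable .* not set)",
        re.IGNORECASE,
    ),
    "out_of_memory": re.compile(r"(out of memory|OutOfMemoryError)", re.IGNORECASE),
}


def find_error_signatures(log_text: str) -> dict[str, list[str]]:
    """Return the log lines that match each known error pattern."""
    matches: dict[str, list[str]] = {}
    for line in log_text.splitlines():
        for label, pattern in ERROR_PATTERNS.items():
            if pattern.search(line):
                matches.setdefault(label, []).append(line.strip())
    return matches
```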

3. Get Recent Events

Retrieve recent events in the pod's namespace, then keep those that reference the pod.

mcp_tool: kubernetes-mcp-server/events_list
params:
  namespace: $namespace
timeout: 30s
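
Because the call above takes only a namespace, its result can include events for unrelated objects. A minimal sketch of narrowing it down follows, assuming each event has been parsed into a dict with the standard Kubernetes Event fields `involvedObject` and `type`. The actual response shape of the tool is not documented here, and `events_for_pod` is a hypothetical helper.

```python
def events_for_pod(events: list[dict], pod_name: str) -> list[dict]:
    """Keep events whose involved object is the named Pod, warnings first."""
    related = [
        e for e in events
        if e.get("involvedObject", {}).get("kind") == "Pod"
        and e.get("involvedObject", {}).get("name") == pod_name
    ]
    # sorted() is stable, so events keep their original order within each group.
    return sorted(related, key=lambda e: e.get("type") != "Warning")
```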

Success Criteria

The skill succeeds when:

  • Root cause identified from logs or events
  • Investigation produces actionable findings

Failure Handling

If investigation is inconclusive:

  1. Check previous container logs (--previous flag)
  2. Describe the pod for more context
  3. Check node events if pod failed to schedule
  4. Escalate with all gathered data
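
For reference, the first three fallback steps map onto standard `kubectl` commands. The sketch below only builds the argument lists and does not run them. The skill itself works through MCP tools, so treating `kubectl` as the manual equivalent is an assumption, and `fallback_commands` is a hypothetical helper.

```python
def fallback_commands(pod_name: str, namespace: str) -> list[list[str]]:
    """Build kubectl invocations for the first three fallback steps."""
    return [
        # Step 1: logs from the previous container instance.
        ["kubectl", "logs", pod_name, "-n", namespace, "--previous"],
        # Step 2: full description of the pod, including its events.
        ["kubectl", "describe", "pod", pod_name, "-n", namespace],
        # Step 3: events recorded against Node objects, in any namespace.
        ["kubectl", "get", "events", "-A",
         "--field-selector", "involvedObject.kind=Node"],
    ]
```

Each list can be passed to `subprocess.run` as-is, and the collected output attached when escalating.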

Examples

Input Context:

{
  "pod_name": "backend-api-abc123",
  "namespace": "production",
  "status": "Error",
  "exit_code": 1
}

Expected Outcome: Investigation reveals root cause (e.g., missing config, connection refused, OOM killed) with specific actionable recommendations.
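
The `exit_code` in the input context already hints at where to look. By convention, an exit code above 128 means the process was ended by a signal numbered `exit_code - 128`, so 137 corresponds to SIGKILL, which is what an OOM kill produces. A minimal sketch of that interpretation follows. The function name and the wording of the hints are illustrative assumptions.

```python
def interpret_exit_code(exit_code: int) -> str:
    """Map a container exit code to a coarse hint about where to look next."""
    if exit_code == 0:
        return "exited normally; check why the container was expected to keep running"
    if exit_code > 128:
        signal_number = exit_code - 128
        if signal_number == 9:
            return "killed by SIGKILL (signal 9); check the pod status for OOMKilled"
        if signal_number == 15:
            return "terminated by SIGTERM (signal 15); check for evictions or rollouts"
        return f"terminated by signal {signal_number}"
    return "application error; inspect the container logs"
```

For the example input above, `interpret_exit_code(1)` points to the container logs as the next place to look.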