🚨 Incident Response Skills
Browse skills in the Incident Response category.
incident-responder
Expert incident responder specializing in security and operational incident management. Masters evidence collection, forensic analysis, and coordinated response with focus on minimizing impact and preventing future incidents.
smith-postmortem
Incident postmortem methodology and templates. Use when conducting incident postmortems, writing postmortem reports, establishing postmortem processes, or performing post-incident analysis.
azure-ops-triage
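Use when performing operational checks on Azure subscriptions and resources. The goals are to (1) verify the authentication and subscription context, (2) find running and abnormal resources, (3) surface hints of possible cost increases or spikes, and (4) summarize recommended actions as a checklist.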
incident-responder
Handles production incidents with urgency and precision. Use IMMEDIATELY when production issues occur. Coordinates debugging, implements fixes, and documents post-mortems.
rca-verification
Methods for validating root cause analyses. Provides checklists for 5 Whys depth, execution path accuracy, and fix strategy soundness. Use when reviewing RCA reports.
devops-troubleshooter
Debug production issues, analyze logs, and fix deployment failures. Masters monitoring tools, incident response, and root cause analysis. Use PROACTIVELY for production debugging or system outages.
incident-responder
Manage production incidents with structured response, debugging, and post-mortem documentation
error-detective
Search logs and codebases for error patterns, stack traces, and anomalies. Correlates errors across systems and identifies root causes. Use PROACTIVELY when debugging issues, analyzing logs, or investigating production errors.
k8s-troubleshooter
Use this skill when users report Kubernetes cluster issues, pod failures, or need incident response. Comprehensive troubleshooting for diagnosing cluster, workload, networking, storage, and Helm issues. Invoke for: pods not starting (Pending, CrashLoopBackOff, ImagePull), service connectivity problems, DNS resolution failures, storage/PVC issues, node health problems, CNI/Calico networking, Helm release failures, or cluster-wide performance degradation. Provides systematic diagnostic workflows, standardized investigation reports with severity-based depth (Executive Triage Cards for rapid decision-making), and incident response playbooks with phased triage (baseline → inspect → correlate → deep dive). All commands are read-only by default for production safety.
artemis-debug-secure
Database investigation skill for Jira tickets with secure credential handling. Multi-Agent Swarm for 3x faster parallel execution. Auto-learns from investigations, searches similar tickets, integrates with Jira, and detects anomalies.
postmortem
Distill a failure into a reusable principle and preflight check.
rca
Performs root cause analysis for Jenkins pipeline failures using MCP tools with evidence-backed citations, guided workflow, and concrete remediation steps.
tracing-root-causes
AI agent performs systematic root cause analysis using 5 Whys, Fishbone diagrams, and evidence-based investigation. Use when debugging, conducting post-mortems, or investigating incidents.
error-recovery
Classify workflow failures and attempt automatic recovery. Use when sprint/feature fails during implementation to determine if auto-fix is possible or manual intervention required.
aws-staging-cli
AWS CLI reference for Vessel staging environment (account 643610656178, ca-central-1). Use when troubleshooting staging infrastructure, checking logs, monitoring RDS/EC2/ECS, investigating slow queries, or debugging application errors. Covers CloudWatch Logs, RDS, EC2, ALB, Lambda, ECS, S3, SNS/SQS, Secrets Manager, WAF.
k8s-troubleshoot
Debug Kubernetes pods, services, and cluster issues. Use when the user says "pod not starting", "CrashLoopBackOff", "service not reachable", "kubectl debug", "pod stuck pending", or asks about Kubernetes problems.
performance-debug
Diagnose system performance issues including CPU, memory, disk, and network. Use when the user says "server is slow", "high CPU", "out of memory", "disk full", "performance issues", or asks to debug system performance.
log-analyze
Parse and analyze system and application logs. Use when the user says "find errors in logs", "analyze logs", "check journalctl", "what's in the logs", "debug from logs", or asks to investigate log files.
reliability-engineering
SRE principles, observability, and incident management
rca-copilot-agent
AI-powered RCA Copilot for root cause analysis and incident explanation. Use when: (1) Building incident context retrieval from Neptune and DynamoDB, (2) Implementing evidence ranking and root cause candidate generation, (3) Creating natural language incident explanations, (4) Generating recommended remediation actions. Triggers: "explain incident", "find root cause", "diagnose data issue", "what caused the alert", "RCA for incident".
kaizen
Continuous improvement methodology for SignalRoom. Use after incidents, when reviewing processes, or when looking for ways to prevent repeat problems. Implements structured retrospectives and improvement cycles.
incident-analysis
Analyze and resolve production incidents using systematic investigation, root cause analysis, and autonomous remediation
troubleshooting-notifications
Investigates Mission Control notifications to identify root causes and provide remediation. Use when users mention notification IDs, ask about alerts or notifications, request help understanding "why did I get this notification", want to troubleshoot a specific alert, or ask about notification patterns and history. This skill retrieves notification details, analyzes historical patterns, routes to resource-specific troubleshooting (config items or health checks), correlates findings, and delivers actionable remediation steps with prevention recommendations.
aws-troubleshoot
Troubleshoot AWS services using tool-first access (via MCP when available), falling back to AWS CLI when necessary. Focus on EKS, S3, ECR, EC2, SSM, networking, site-to-site VPNs, IAM Identity Center, and IAM.
bayes-reasoner
An internal cognitive engine for quantitative root cause analysis. Use this autonomously when you need to weigh competing hypotheses, prevent anchoring bias, or determine the most efficient next diagnostic step.
rca-analyst
Structured root cause analysis methodology with three-test isolation and prevention analysis
azure-troubleshoot
Troubleshoot Azure using tool-first access, falling back to Azure CLI when necessary. Focus on Virtual Machines, AKS, Azure Container Registry, Storage Accounts, and Log Analytics.
troubleshoot
Diagnose and fix issues with the Kubernetes cluster, Flux reconciliation, or deployed applications
rollback-deployment
Rollback a deployment to its previous revision. Use when a new deployment causes issues and needs to be reverted quickly. Keywords: rollback, revert, deployment, previous version, undo deployment.
diagnose-storage-issue
Diagnose storage and persistent volume issues. Checks PVC binding, storage class configuration, and volume mount status. Use when pods are stuck pending due to volume issues. Keywords: PVC, PV, storage, volume, pending, mount, storage class.
investigate-dns-config
Investigate DNSConfigForming warnings on pods. Analyzes DNS policy configuration, host network settings, and nameserver limits. Use when pods show DNSConfigForming events or DNS-related warnings. Keywords: DNS, DNSConfigForming, nameserver limit, resolv.conf, dnsPolicy, hostNetwork.
responding-to-security-incidents
Guide security incident response, investigation, and remediation processes. Use when you need to handle security breaches, classify incidents, develop response playbooks, gather forensic evidence, or coordinate remediation efforts. Trigger with phrases like "security incident response", "ransomware attack response", "data breach investigation", "incident playbook", or "security forensics".
detect-recurrence-pattern
Detect recurring patterns in issues and events. Identifies temporal, resource-based, cluster, and cascading patterns. Suggests prevention strategies. Keywords: recurrence, pattern, detection, trending, recurring, prevention, issue, analysis.
investigate-pod-failure
Deep investigation of a failing pod. Gathers logs, events, and resource status to identify root cause. Keywords: investigate, debug, pod failure, troubleshoot, root cause, logs, events.
infrastructure-maintainer
You are a reliable and proactive Infrastructure Maintainer or Site Reliability Engineer (SRE). You are an expert in cloud infrastructure (AWS, GCP, etc.), monitoring, and incident response. Your primary responsibility is to keep the lights on—ensuring the production application is stable, performant, and available.
cordon-node
Mark a node as unschedulable to prevent new pods from being scheduled. Use when a node is experiencing issues and needs maintenance. Existing pods continue running. Keywords: cordon, node maintenance, unschedulable, node issues.
diagnose-network-issue
Diagnose network connectivity issues for pods. Checks DNS resolution, service connectivity, and network policies. Use when pods cannot communicate with other services. Keywords: network issue, DNS, connectivity, service unreachable, network policy, CNI.
restart-imagepullbackoff
Handle a pod stuck in ImagePullBackOff state. First investigates the image pull error, then restarts to retry. Keywords: imagepull, image pull backoff, ErrImagePull, registry, container image, pull failed.
kubectl-debugging
Debug Kubernetes pods, nodes, and workloads using kubectl debug. Covers ephemeral containers, pod copying, node debugging, debug profiles, and interactive troubleshooting sessions. Use when user mentions kubectl debug, debugging pods, ephemeral containers, node debugging, or interactive troubleshooting in Kubernetes clusters.
restart-crashloop
Restart a pod stuck in CrashLoopBackOff. Use when pod has crashed 3+ times and a restart might resolve transient issues. Keywords: crashloop, restart, pod failure, container crash, pod stuck, pod crashing.
network-diagnostics
Connectivity troubleshooting with modern tools: trippy (traceroute/mtr) and gping (graphical ping), both Rust-based, plus ss (socket statistics). Preferred over legacy netstat/traceroute.
gemini-ssh
AI-assisted SSH operations with Gemini
datadog
Datadog CLI for debugging and triaging. Use this skill when you need to: search Datadog logs, query metrics, tail logs in real-time, trace distributed requests, investigate errors, compare time periods, find log patterns, check service health, or export observability data. Trigger phrases include "search logs", "tail logs", "query metrics", "check Datadog", "find errors", "trace request", "compare errors", "what services exist", "log patterns", "CPU usage", "service health".
troubleshooting
Kubernetes debugging, problem diagnosis, and issue resolution
sentry
Queries Sentry for issues, events, traces, and error analysis. Use when debugging production errors, searching issues, analyzing traces, or getting AI root cause analysis with Seer.
incident-response
Structured approach to handling production incidents, from detection through resolution and post-mortem analysis
usasoc
Activate USASOC for time-critical or mission-critical operations. User-invoked special operations deputy with wide lateral authority.
coord-intel
Invoke COORD_INTEL for investigations, forensics, and intelligence gathering
kubernetes-debugger
Kubernetes debugging and troubleshooting best practices using MCP kubernetes tools. Use when: (1) Pods are failing, pending, or in CrashLoopBackOff/ImagePullBackOff states, (2) Services are unreachable or DNS resolution fails, (3) Deployments aren't rolling out, (4) Nodes are unhealthy or unschedulable, (5) Resource issues (OOM, CPU throttling), (6) Any "why isn't my Kubernetes workload working?" questions. Provides systematic debugging workflows using kubectl_get, kubectl_describe, kubectl_logs, exec_in_pod, and other MCP kubernetes tools.
incident-response
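Systematic support for system failures and incidents, from detection through resolution to post-incident analysis. Based on ITIL and SRE principles, it aims for rapid recovery and prevention of recurrence. Anchors: • The Site Reliability Workbook (Google) / Applies to: postmortem culture / Purpose: blameless post-incident analysis and learning • ITIL 4 / Applies to: incident and problem management / Purpose: structured escalation • The Phoenix Project (Kim, Behr) / Applies to: change management / Purpose: preventing change-induced incidents. Trigger: Use when responding to system outages, handling alerts, writing incident reports, conducting postmortems, or analyzing root causes. Keywords: incident, outage, postmortem, RCA, 5 whys, rollback, escalation, severity, on-call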