Incident Response Skills

incident-responder

Expert incident responder specializing in security and operational incident management. Masters evidence collection, forensic analysis, and coordinated response with focus on minimizing impact and preventing future incidents.

[Incident Response]

smith-postmortem

from tianjianjiang

Incident postmortem methodology and templates. Use when conducting incident postmortems, writing postmortem reports, establishing postmortem processes, or performing post-incident analysis.

[Incident Response]

azure-ops-triage

from zer0big

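Use when performing operational checks of Azure subscriptions and resources. The goals are to (1) verify the authentication and subscription context, (2) find running and abnormal resources, (3) derive hints about possible cost spikes, and (4) compile recommended actions into a checklist.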

[Incident Response]

incident-responder

from sidetoolco

Handles production incidents with urgency and precision. Use IMMEDIATELY when production issues occur. Coordinates debugging, implements fixes, and documents post-mortems.

[Incident Response]

rca-verification

from handresc1127

Methods for validating root cause analyses. Provides checklists for 5 Whys depth, execution path accuracy, and fix strategy soundness. Use when reviewing RCA reports.

[Incident Response]

devops-troubleshooter

from sidetoolco

Debug production issues, analyze logs, and fix deployment failures. Masters monitoring tools, incident response, and root cause analysis. Use PROACTIVELY for production debugging or system outages.

[Incident Response]

incident-responder

from eddiebe147

Manage production incidents with structured response, debugging, and post-mortem documentation

[Incident Response]

error-detective

from sidetoolco

Search logs and codebases for error patterns, stack traces, and anomalies. Correlates errors across systems and identifies root causes. Use PROACTIVELY when debugging issues, analyzing logs, or investigating production errors.

[Incident Response]

k8s-troubleshooter

from randybias

Use this skill when users report Kubernetes cluster issues, pod failures, or need incident response. Comprehensive troubleshooting for diagnosing cluster, workload, networking, storage, and Helm issues. Invoke for: pods not starting (Pending, CrashLoopBackOff, ImagePull), service connectivity problems, DNS resolution failures, storage/PVC issues, node health problems, CNI/Calico networking, Helm release failures, or cluster-wide performance degradation. Provides systematic diagnostic workflows, standardized investigation reports with severity-based depth (Executive Triage Cards for rapid decision-making), and incident response playbooks with phased triage (baseline → inspect → correlate → deep dive). All commands are read-only by default for production safety.

[Incident Response]

artemis-debug-secure

from RithyTep

Database investigation skill for Jira tickets with secure credential handling. Multi-Agent Swarm for 3x faster parallel execution. Auto-learns from investigations, searches similar tickets, integrates with Jira, and detects anomalies.

[Incident Response]

postmortem

from isymchych

Distill a failure into a reusable principle and preflight check.

[Incident Response]

rca

from StevenBuglione

Performs root cause analysis for Jenkins pipeline failures using MCP tools with evidence-backed citations, guided workflow, and concrete remediation steps.

[Incident Response]

tracing-root-causes

from doanchienthangdev

AI agent performs systematic root cause analysis using 5 Whys, Fishbone diagrams, and evidence-based investigation. Use when debugging, conducting post-mortems, or investigating incidents.

[Incident Response]

error-recovery

from Sjdjdiejdrirhdkjej

Classify workflow failures and attempt automatic recovery. Use when a sprint or feature fails during implementation to determine whether an auto-fix is possible or manual intervention is required.

[Incident Response]

aws-staging-cli

from robBowes

AWS CLI reference for Vessel staging environment (account 643610656178, ca-central-1). Use when troubleshooting staging infrastructure, checking logs, monitoring RDS/EC2/ECS, investigating slow queries, or debugging application errors. Covers CloudWatch Logs, RDS, EC2, ALB, Lambda, ECS, S3, SNS/SQS, Secrets Manager, WAF.

[Incident Response]

k8s-troubleshoot

from mhalder

Debug Kubernetes pods, services, and cluster issues. Use when the user says "pod not starting", "CrashLoopBackOff", "service not reachable", "kubectl debug", "pod stuck pending", or asks about Kubernetes problems.

[Incident Response]

performance-debug

from mhalder

Diagnose system performance issues including CPU, memory, disk, and network. Use when the user says "server is slow", "high CPU", "out of memory", "disk full", "performance issues", or asks to debug system performance.

[Incident Response]

log-analyze

from mhalder

Parse and analyze system and application logs. Use when the user says "find errors in logs", "analyze logs", "check journalctl", "what's in the logs", "debug from logs", or asks to investigate log files.

[Incident Response]

reliability-engineering

from miles990

SRE principles, observability, and incident management

[Incident Response]

rca-copilot-agent

from Kart-rc

AI-powered RCA Copilot for root cause analysis and incident explanation. Use when: (1) Building incident context retrieval from Neptune and DynamoDB, (2) Implementing evidence ranking and root cause candidate generation, (3) Creating natural language incident explanations, (4) Generating recommended remediation actions. Triggers: "explain incident", "find root cause", "diagnose data issue", "what caused the alert", "RCA for incident".

[Incident Response]

kaizen

from majiayu000

Continuous improvement methodology for SignalRoom. Use after incidents, when reviewing processes, or when looking for ways to prevent repeat problems. Implements structured retrospectives and improvement cycles.

[Incident Response]

incident-analysis

from majiayu000

Analyze and resolve production incidents using systematic investigation, root cause analysis, and autonomous remediation

[Incident Response]

troubleshooting-notifications

from majiayu000

Investigates Mission Control notifications to identify root causes and provide remediation. Use when users mention notification IDs, ask about alerts or notifications, request help understanding "why did I get this notification", want to troubleshoot a specific alert, or ask about notification patterns and history. This skill retrieves notification details, analyzes historical patterns, routes to resource-specific troubleshooting (config items or health checks), correlates findings, and delivers actionable remediation steps with prevention recommendations.

[Incident Response]

aws-troubleshoot

from majiayu000

Troubleshoot AWS services using tool-first access (via MCP when available), falling back to AWS CLI when necessary. Focus on EKS, S3, ECR, EC2, SSM, networking, site-to-site VPNs, IAM Identity Center, and IAM.

[Incident Response]

bayes-reasoner

from majiayu000

An internal cognitive engine for quantitative root cause analysis. Use this autonomously when you need to weigh competing hypotheses, prevent anchoring bias, or determine the most efficient next diagnostic step.

[Incident Response]

rca-analyst

from violetio

Structured root cause analysis methodology with three-test isolation and prevention analysis

[Incident Response]

azure-troubleshoot

from majiayu000

Troubleshoot Azure using tool-first access, falling back to Azure CLI when necessary. Focus on Virtual Machines, AKS, Azure Container Registry, Storage Accounts, and Log Analytics.

[Incident Response]

troubleshoot

from sagaragas

Diagnose and fix issues with the Kubernetes cluster, Flux reconciliation, or deployed applications

[Incident Response]

rollback-deployment

from X-McKay

Rollback a deployment to its previous revision. Use when a new deployment causes issues and needs to be reverted quickly. Keywords: rollback, revert, deployment, previous version, undo deployment.

[Incident Response]

diagnose-storage-issue

from X-McKay

Diagnose storage and persistent volume issues. Checks PVC binding, storage class configuration, and volume mount status. Use when pods are stuck pending due to volume issues. Keywords: PVC, PV, storage, volume, pending, mount, storage class.

[Incident Response]

investigate-dns-config

from X-McKay

Investigate DNSConfigForming warnings on pods. Analyzes DNS policy configuration, host network settings, and nameserver limits. Use when pods show DNSConfigForming events or DNS-related warnings. Keywords: DNS, DNSConfigForming, nameserver limit, resolv.conf, dnsPolicy, hostNetwork.

[Incident Response]

responding-to-security-incidents

from BbgnsurfTech

Guide security incident response, investigation, and remediation processes. Use when you need to handle security breaches, classify incidents, develop response playbooks, gather forensic evidence, or coordinate remediation efforts. Trigger with phrases like "security incident response", "ransomware attack response", "data breach investigation", "incident playbook", or "security forensics".

[Incident Response]

detect-recurrence-pattern

from X-McKay

Detect recurring patterns in issues and events. Identifies temporal, resource-based, cluster, and cascading patterns. Suggests prevention strategies. Keywords: recurrence, pattern, detection, trending, recurring, prevention, issue, analysis.

[Incident Response]

investigate-pod-failure

from X-McKay

Deep investigation of a failing pod. Gathers logs, events, and resource status to identify root cause. Keywords: investigate, debug, pod failure, troubleshoot, root cause, logs, events.

[Incident Response]

infrastructure-maintainer

from aibangjuxin

You are a reliable and proactive Infrastructure Maintainer or Site Reliability Engineer (SRE). You are an expert in cloud infrastructure (AWS, GCP, etc.), monitoring, and incident response. Your primary responsibility is to keep the lights on—ensuring the production application is stable, performant, and available.

[Incident Response]

cordon-node

from X-McKay

Mark a node as unschedulable to prevent new pods from being scheduled. Use when a node is experiencing issues and needs maintenance. Existing pods continue running. Keywords: cordon, node maintenance, unschedulable, node issues.

[Incident Response]

diagnose-network-issue

from X-McKay

Diagnose network connectivity issues for pods. Checks DNS resolution, service connectivity, and network policies. Use when pods cannot communicate with other services. Keywords: network issue, DNS, connectivity, service unreachable, network policy, CNI.

[Incident Response]

restart-imagepullbackoff

from X-McKay

Handle a pod stuck in ImagePullBackOff state. First investigates the image pull error, then restarts to retry. Keywords: imagepull, image pull backoff, ErrImagePull, registry, container image, pull failed.

[Incident Response]

kubectl-debugging

from laurigates

Debug Kubernetes pods, nodes, and workloads using kubectl debug. Covers ephemeral containers, pod copying, node debugging, debug profiles, and interactive troubleshooting sessions. Use when user mentions kubectl debug, debugging pods, ephemeral containers, node debugging, or interactive troubleshooting in Kubernetes clusters.

[Incident Response]

restart-crashloop

from X-McKay

Restart a pod stuck in CrashLoopBackOff. Use when pod has crashed 3+ times and a restart might resolve transient issues. Keywords: crashloop, restart, pod failure, container crash, pod stuck, pod crashing.

[Incident Response]

network-diagnostics

from laurigates

Connectivity troubleshooting with modern tools: trippy (traceroute/mtr) and gping (graphical ping), both Rust-based, plus ss (socket statistics). Preferred over legacy netstat/traceroute.

[Incident Response]

gemini-ssh

from oimiragieo

AI-assisted SSH operations with Gemini

[Incident Response]

datadog

from MichaelVessia

Datadog CLI for debugging and triaging. Use this skill when you need to: search Datadog logs, query metrics, tail logs in real-time, trace distributed requests, investigate errors, compare time periods, find log patterns, check service health, or export observability data. Trigger phrases include "search logs", "tail logs", "query metrics", "check Datadog", "find errors", "trace request", "compare errors", "what services exist", "log patterns", "CPU usage", "service health".

[Incident Response]

troubleshooting

from pluginagentmarketplace

Kubernetes debugging, problem diagnosis, and issue resolution

[Incident Response]

sentry

from Uzaaft

Queries Sentry for issues, events, traces, and error analysis. Use when debugging production errors, searching issues, analyzing traces, or getting AI root cause analysis with Seer.

[Incident Response]

incident-response

from cyperx84

Structured approach to handling production incidents, from detection through resolution and post-mortem analysis

[Incident Response]

usasoc

from Euda1mon1a

Activate USASOC for time-critical or mission-critical operations. User-invoked special operations deputy with wide lateral authority.

[Incident Response]

coord-intel

from Euda1mon1a

Invoke COORD_INTEL for investigations, forensics, and intelligence gathering

[Incident Response]

kubernetes-debugger

from rodrigodelmonte

Kubernetes debugging and troubleshooting best practices using MCP kubernetes tools. Use when: (1) Pods are failing, pending, or in CrashLoopBackOff/ImagePullBackOff states, (2) Services are unreachable or DNS resolution fails, (3) Deployments aren't rolling out, (4) Nodes are unhealthy or unschedulable, (5) Resource issues (OOM, CPU throttling), (6) Any "why isn't my Kubernetes workload working?" questions. Provides systematic debugging workflows using kubectl_get, kubectl_describe, kubectl_logs, exec_in_pod, and other MCP kubernetes tools.

[Incident Response]

incident-response

from daishiman

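Systematically supports the handling of system failures and incidents, from detection through resolution and post-incident analysis. Based on ITIL and SRE principles, it aims for rapid recovery and prevention of recurrence.
Anchors:
• The Site Reliability Workbook (Google) / Applies to: postmortem culture / Purpose: blameless post-incident analysis and learning
• ITIL 4 / Applies to: incident and problem management / Purpose: structured escalation
• The Phoenix Project (Kim, Behr) / Applies to: change management / Purpose: preventing change-induced incidents
Trigger: Use when responding to system outages, handling alerts, writing incident reports, conducting postmortems, or analyzing root causes. incident, outage, postmortem, RCA, 5 whys, rollback, escalation, severity, on-call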

[Incident Response]
