planning-disaster-recovery
Use when you need to work with backup and recovery. This skill provides guidance and automation for backups and disaster recovery. Trigger with phrases like "create backups", "automate backups", or "implement disaster recovery".
When & Why to Use This Skill
This Claude skill supports backup automation and disaster recovery with guidance and automated workflows. It helps users implement data protection strategies with standard tools (tar, rsync, and AWS S3) through structured planning and execution, with the goals of system resilience, data integrity, and fast recovery from critical failures.
Use Cases
- Automating Cloud Backups: Schedule and manage off-site data storage by integrating AWS S3 for secure, scalable backup solutions.
- Disaster Recovery Planning: Generate detailed assessment reports and implementation plans to minimize downtime during infrastructure or system outages.
- Server Data Synchronization: Utilize rsync and tar to create efficient, compressed archives of local file systems and synchronize them across remote environments.
- Operational Runbook Generation: Automatically produce step-by-step documentation and scripts for maintenance, troubleshooting, and emergency data restoration procedures.
- System State Archiving: Implement automated scripts to capture system configurations and baseline metrics for consistent recovery points.
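As an illustration of the archiving use cases above, here is a minimal sketch that writes a gzip-compressed tar archive and returns a checksum that can be stored next to it. It uses Python's standard `tarfile` module rather than the `tar` command line, and the function name `create_archive` and its parameters are hypothetical, not part of this skill.

```python
import hashlib
import tarfile
from pathlib import Path


def create_archive(source_dir: str, archive_path: str) -> str:
    """Write a gzip-compressed tar archive of source_dir and return its SHA-256.

    Assumption: archive_path lies outside source_dir, so the archive
    does not try to include itself.
    """
    source = Path(source_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname stores entries under the source folder's own name
        # instead of its full absolute path.
        tar.add(str(source), arcname=source.name)

    # Hash the finished archive in chunks so large files are not
    # loaded into memory at once.
    digest = hashlib.sha256()
    with open(archive_path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Recording the digest alongside the archive gives a later restore something to check against before the archive is trusted.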
| name | planning-disaster-recovery |
|---|---|
| description | |
| allowed-tools | Bash(tar:*, rsync:*, aws:s3:*) |
| version | 1.0.0 |
| license | MIT |
Prerequisites
Before using this skill, make sure you have:
- Required credentials and permissions for the operations
- Understanding of the system architecture and dependencies
- Backup of critical data before making structural changes
- Access to relevant documentation and configuration files
- Monitoring tools configured for observability
- Development or staging environment available for testing
Instructions
Step 1: Assess Current State
- Review current configuration, setup, and baseline metrics
- Identify specific requirements, goals, and constraints
- Document existing patterns, issues, and pain points
- Analyze dependencies and integration points
- Validate all prerequisites are met before proceeding
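One way to make the "baseline metrics" in this step concrete is to record how much data a backup would have to cover. The sketch below is only an assumption about what such a baseline might contain (file count and total size); `snapshot_baseline` is a hypothetical helper name.

```python
import os


def snapshot_baseline(root: str) -> dict:
    """Count regular files under root and sum their sizes in bytes."""
    file_count = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so link targets are not counted as data here.
            if os.path.islink(path):
                continue
            file_count += 1
            total_bytes += os.path.getsize(path)
    return {"root": root, "file_count": file_count, "total_bytes": total_bytes}
```

The returned dictionary can be serialized (for example with `json.dumps`) and kept with the assessment notes, so later runs can be compared against it.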
Step 2: Design Solution
- Define optimal approach based on best practices
- Create detailed implementation plan with clear steps
- Identify potential risks and mitigation strategies
- Document expected outcomes and success criteria
- Review plan with team or stakeholders if needed
Step 3: Implement Changes
- Execute implementation in non-production environment first
- Verify changes work as expected with thorough testing
- Monitor for any issues, errors, or performance impacts
- Document all changes, decisions, and configurations
- Prepare rollback plan and recovery procedures
Step 4: Validate Implementation
- Run comprehensive tests to verify all functionality
- Compare performance metrics against baseline
- Confirm no unintended side effects or regressions
- Update all relevant documentation
- Obtain approval before production deployment
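For a backup, the most direct validation test is a trial restore. The following sketch extracts an archive into a scratch directory and compares per-file SHA-256 digests with the source. It assumes the archive contains a single top-level directory named after the source folder, and that the archive is trusted; `file_digests` and `verify_restore` are hypothetical names.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def file_digests(root: Path) -> dict:
    """Map each regular file's path (relative to root) to its SHA-256 hex digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def verify_restore(archive_path: str, source_dir: str) -> bool:
    """Restore the archive into a scratch directory and compare it with the source."""
    source = Path(source_dir)
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive_path, "r:*") as tar:
            # Only extract archives you created yourself; extraction of
            # untrusted archives needs additional path checks.
            tar.extractall(scratch)
        restored = Path(scratch) / source.name
        if not restored.is_dir():
            return False
        return file_digests(restored) == file_digests(source)
```

A `False` result means the restored tree differs from the current source, which is expected if the source changed after the backup was taken. Reading whole files into memory keeps the sketch short but is not suitable for very large files.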
Step 5: Deploy to Production
- Schedule deployment during appropriate maintenance window
- Execute implementation with real-time monitoring
- Watch closely for any issues or anomalies
- Verify successful deployment and functionality
- Document completion, metrics, and lessons learned
Output
This skill produces:
- Implementation Artifacts: Scripts, configuration files, code, and automation tools
- Documentation: Comprehensive documentation of changes, procedures, and architecture
- Test Results: Validation reports, test coverage, and quality metrics
- Monitoring Configuration: Dashboards, alerts, metrics, and observability setup
- Runbooks: Operational procedures for maintenance, troubleshooting, and incident response
Error Handling
Permission and Access Issues:
- Verify credentials and permissions for all operations
- Request elevated access if required for specific tasks
- Document all permission requirements for automation
- Use separate service accounts for privileged operations
- Implement least-privilege access principles
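A pre-flight check can surface permission problems before a backup run starts. This is a minimal sketch for local paths only, assuming the job reads from one directory and writes to another; `check_access` is a hypothetical helper, and it says nothing about remote or cloud credentials.

```python
import os


def check_access(source_dir: str, dest_dir: str) -> list:
    """Return a list of human-readable permission problems (empty if none found)."""
    problems = []
    # Reading a directory tree needs read and traverse (execute) permission.
    if not os.access(source_dir, os.R_OK | os.X_OK):
        problems.append(f"cannot read source: {source_dir}")
    # Creating files in a directory needs write and traverse permission.
    if not os.access(dest_dir, os.W_OK | os.X_OK):
        problems.append(f"cannot write to destination: {dest_dir}")
    return problems
```

`os.access` checks the current user's permissions at the moment of the call, so the actual backup should still handle `PermissionError` in case things change in between.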
Connection and Network Failures:
- Check network connectivity, firewalls, and security groups
- Verify service endpoints, DNS resolution, and routing
- Test connections using diagnostic and troubleshooting tools
- Review network policies, ACLs, and security configurations
- Implement retry logic with exponential backoff
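The retry advice above can be sketched as follows. The delay doubles after each failed attempt up to a cap, with a little random jitter added so that several clients do not retry in lockstep. The function name, default values, and the choice to retry only on `OSError` (which includes `ConnectionError` and `TimeoutError`) are assumptions for illustration.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=1.0,
                       max_delay=30.0, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached.

    The `sleep` parameter exists so callers can substitute a fake
    sleeper when exercising the function.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except OSError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # 1x, 2x, 4x, ... the base delay, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Add up to 50% random jitter on top of the computed delay.
            sleep(delay + random.uniform(0, delay / 2))
```

Typical usage would wrap a single network step, for example `retry_with_backoff(lambda: upload(archive))`, where `upload` stands for whatever transfer function the backup uses.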
Resource Constraints:
- Monitor resource usage (CPU, memory, disk, network)
- Implement throttling, rate limiting, or queue mechanisms
- Schedule resource-intensive tasks during low-traffic periods
- Scale infrastructure resources if consistently hitting limits
- Optimize queries, code, or configurations for efficiency
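For backups, disk space is usually the first resource to run out. A small guard like the one below can refuse to start a run when the target volume looks too full. The name `has_free_space` and the 20% safety margin are assumptions, not values prescribed by this skill.

```python
import shutil


def has_free_space(target_dir: str, required_bytes: int,
                   safety_factor: float = 1.2) -> bool:
    """Return True if target_dir's filesystem has room for required_bytes plus a margin."""
    free = shutil.disk_usage(target_dir).free
    return free >= required_bytes * safety_factor
```

The required size could come from a previous run or from summing the source files. Compressed archives are often smaller than the source, so this check errs on the cautious side.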
Configuration and Syntax Errors:
- Validate all configuration syntax before applying changes
- Test configurations thoroughly in non-production first
- Implement automated configuration validation checks
- Maintain version control for all configuration files
- Keep previous working configuration for quick rollback
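The validate-then-apply pattern with a saved previous version might look like the sketch below. It assumes the configuration is a JSON file, which the source does not specify; the `.bak` and `.tmp` suffixes and both function names are likewise hypothetical.

```python
import json
import os
import shutil


def apply_config(new_text: str, live_path: str) -> None:
    """Validate new_text as JSON, keep the current file as .bak, then swap it in."""
    # Raises json.JSONDecodeError (a ValueError) before any file is touched.
    json.loads(new_text)

    # Keep the previous working configuration for quick rollback.
    if os.path.exists(live_path):
        shutil.copy2(live_path, live_path + ".bak")

    # Write to a temporary file first, then rename over the live file,
    # so readers never see a half-written configuration.
    tmp_path = live_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        fh.write(new_text)
    os.replace(tmp_path, live_path)


def rollback_config(live_path: str) -> None:
    """Restore the configuration saved by the last successful apply_config call."""
    os.replace(live_path + ".bak", live_path)
```

This only checks syntax. Validating the meaning of individual settings, and keeping the files under version control, would be separate steps.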
Resources
- Configuration Templates: {baseDir}/templates/disaster-recovery-planner/
- Documentation and Guides: {baseDir}/docs/disaster-recovery-planner/
- Example Scripts and Code: {baseDir}/examples/disaster-recovery-planner/
- Troubleshooting Guide: {baseDir}/docs/disaster-recovery-planner-troubleshooting.md
- Best Practices: {baseDir}/docs/disaster-recovery-planner-best-practices.md
- Monitoring Setup: {baseDir}/monitoring/disaster-recovery-planner-dashboard.json