image-backup-restore
When & Why to Use This Skill
This skill provides technical guidance for implementing and debugging PostgreSQL image-based backup and restore workflows within Kubernetes (AKS) environments. It covers automating database snapshots, managing Azure Blob Storage transfers via azcopy with workload identity, and orchestrating restore flows, and it addresses common infrastructure issues such as DNS propagation delays, secure credential handling, and pg_hba.conf configuration.
Use Cases
- Automating PostgreSQL backups: Implementing logical backups with pg_dump and transferring them to Azure Blob Storage using azcopy with workload identity.
- Disaster Recovery & Provisioning: Orchestrating the restoration of PostgreSQL instances from existing images, including PVC creation, secret management, and data loading.
- Infrastructure Troubleshooting: Debugging common provisioning failures such as azcopy 403 authorization errors, connection timeouts during DNS propagation, and identity mismatches.
- Security & Configuration Optimization: Configuring secure password handling during restores and fine-tuning pg_hba.conf to support internal background workers like pg_durable.
| name | image-backup-restore |
|---|---|
| description | Implementing and debugging PostgreSQL image backup and restore features. Use when working with database snapshots, backup jobs, restore operations, azcopy blob transfers, or troubleshooting image-related provisioning failures. |
PostgreSQL Image Backup & Restore
Overview
Image backup creates point-in-time snapshots of PostgreSQL instances by:
- Running pg_dump to create a logical backup
- Uploading to Azure Blob Storage via azcopy
- Recording metadata in the CMS database
Image restore provisions new instances from existing backups by:
- Creating the PVC, Secret, and ConfigMap for the new instance
- Running pg_restore to load data from blob storage
- Starting PostgreSQL with recovery configuration
Key Files
toygres-orchestrations/src/
├── orchestrations/
│ ├── create_image.rs # Backup orchestration
│ └── create_instance.rs # Restore flow (v1.0.2+)
├── activities/
│ ├── run_backup_job.rs # pg_dump + azcopy upload
│ ├── run_restore_job.rs # azcopy download + pg_restore
│ ├── deploy_postgres_from_pvc.rs # Start from existing PVC
│ ├── get_instance_password.rs # Retrieve source password
│ └── cms/
│ └── image_ops.rs # Image CMS operations
├── templates/
│ ├── backup-job.yaml # K8s Job for backup
│ ├── restore-job.yaml # K8s Job for restore
│ ├── postgres-config.yaml # ConfigMap with pg_hba.conf
│ └── postgres-secret.yaml # Secret with password
└── activity_types.rs # ImageOperation enum
Backup Flow
// create_image.rs orchestration
1. CreateImageRecord // Reserve ID in CMS
2. GetInstanceByK8sName // Get source instance details
3. RunBackupJob // pg_dump + azcopy to blob
4. UpdateImageStatus // Mark as available
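A minimal sketch of how these steps might be chained in create_image.rs, using the same schedule_activity_typed call shape shown later in this document. The ImageOperation variants other than GetSourcePassword, the activity module paths, and the input/output type names (GetInstanceInput, RunBackupJobInput, and so on) are assumptions for illustration, not the project's verified API:
// Hypothetical orchestration body; type names and fields are assumed.
let image_id = match ctx
    .schedule_activity_typed::<ImageOperationInput, ImageOperationResult>(
        activities::cms::image_ops::NAME,
        &ImageOperationInput {
            operation: ImageOperation::CreateImageRecord {
                instance_id: input.instance_id,
                name: input.name.clone(),
            },
        },
    )
    .await?
{
    ImageOperationResult::RecordCreated { id } => id,
    _ => return Err("Failed to create image record".into()),
};
// Resolve the source instance, run pg_dump + azcopy, then mark the image available.
let instance = ctx
    .schedule_activity_typed::<GetInstanceInput, InstanceDetails>(
        activities::get_instance_by_k8s_name::NAME,
        &GetInstanceInput { k8s_name: input.k8s_name.clone() },
    )
    .await?;
ctx.schedule_activity_typed::<RunBackupJobInput, RunBackupJobOutput>(
    activities::run_backup_job::NAME,
    &RunBackupJobInput { image_id, source_host: instance.host, database: instance.database },
)
.await?;
ctx.schedule_activity_typed::<ImageOperationInput, ImageOperationResult>(
    activities::cms::image_ops::NAME,
    &ImageOperationInput {
        operation: ImageOperation::UpdateImageStatus { id: image_id, status: "available".into() },
    },
)
.await?;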
Backup Job Template
# backup-job.yaml key commands
command:
- /bin/sh
- -c
- |
# Login with workload identity (NOT --identity!)
azcopy login --login-type=workload
# Create backup
PGPASSWORD=$POSTGRES_PASSWORD pg_dump -h $SOURCE_HOST -U postgres -Fc $DATABASE > /backup/backup.dump
# Upload to blob
azcopy copy /backup/backup.dump "$BLOB_URL"
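run_backup_job.rs presumably renders this template and submits it as a Kubernetes Job. A sketch of how that submission could look, assuming kube-rs and serde_yaml are used; the document does not show the project's actual Kubernetes client code:
use k8s_openapi::api::batch::v1::Job;
use kube::{api::PostParams, Api, Client};

// Parse the rendered backup-job.yaml and create the Job; completion
// polling and log collection are omitted here.
async fn submit_backup_job(rendered_yaml: &str) -> anyhow::Result<()> {
    let job: Job = serde_yaml::from_str(rendered_yaml)?;
    let client = Client::try_default().await?;
    let jobs: Api<Job> = Api::namespaced(client, "toygres-managed");
    jobs.create(&PostParams::default(), &job).await?;
    Ok(())
}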
Restore Flow
// create_instance.rs v1.0.2 restore path
1. CreatePvc // Create empty PVC
2. GetSourcePassword // Get password from source image
3. CreateImageRecord // Reserve in CMS (with correct image_type!)
4. RunRestoreJob // azcopy download + pg_restore
5. DeployPostgresFromPvc // Start PostgreSQL
6. TestConnection // Verify connectivity (120s timeout for DNS)
7. UpdateInstanceStatus // Mark as running
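A sketch of how steps 4-5 might be scheduled; the input/output types (RunRestoreJobInput, DeployPostgresFromPvcInput) and their fields are assumed for illustration:
// Hypothetical: download the dump and load it into the fresh PVC.
ctx.schedule_activity_typed::<RunRestoreJobInput, RunRestoreJobOutput>(
    activities::run_restore_job::NAME,
    &RunRestoreJobInput {
        pvc_name: pvc_name.clone(),
        source_image_id: image_id, // blob to fetch with azcopy
    },
)
.await?;
// Hypothetical: start PostgreSQL on the restored PVC, reusing the
// source image's password (see the next section).
ctx.schedule_activity_typed::<DeployPostgresFromPvcInput, DeployPostgresFromPvcOutput>(
    activities::deploy_postgres_from_pvc::NAME,
    &DeployPostgresFromPvcInput {
        pvc_name,
        password: source_password.clone(),
    },
)
.await?;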
Critical: Password Handling
The restore flow must use the source image's password, not a new password:
// activity_types.rs
pub enum ImageOperation {
GetSourcePassword { id: Uuid }, // Get password from image record
// ...
}
pub enum ImageOperationResult {
PasswordFound { password: String },
// ...
}
// image_ops.rs
ImageOperation::GetSourcePassword { id } => {
let encrypted = sqlx::query_scalar::<_, Vec<u8>>(
"SELECT source_password_encrypted FROM cms.images WHERE id = $1"
)
.bind(id)
.fetch_one(pool)
.await?;
let password = String::from_utf8(encrypted)?;
Ok(ImageOperationResult::PasswordFound { password })
}
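On the orchestration side, the restore path can resolve the password up front and reuse it when rendering postgres-secret.yaml. A sketch using the types above; the surrounding variable names are assumed:
let source_password = match ctx
    .schedule_activity_typed::<ImageOperationInput, ImageOperationResult>(
        activities::cms::image_ops::NAME,
        &ImageOperationInput {
            operation: ImageOperation::GetSourcePassword { id: source_image_id },
        },
    )
    .await?
{
    ImageOperationResult::PasswordFound { password } => password,
    _ => return Err("Failed to get source password".into()),
};
// Use source_password (not input.password) for the instance Secret.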
Critical: Image Type Preservation
When restoring, fetch the source image's details BEFORE creating the CMS record:
// create_instance.rs v1.0.2
let (effective_image_type, effective_postgres_version) = if let Some(ref image_id) = input.source_image_id {
ctx.trace_info(format!("[v1.0.2] Fetching source image details: {}", image_id));
let details = ctx.schedule_activity_typed::<ImageOperationInput, ImageOperationResult>(
activities::cms::image_ops::NAME,
&ImageOperationInput {
operation: ImageOperation::GetImageDetails { id: image_id.parse()? },
},
).await?;
match details {
ImageOperationResult::ImageDetails { image_type, postgres_version, .. } => {
(image_type, postgres_version)
}
_ => return Err("Failed to get image details".into()),
}
} else {
(input.image_type.clone(), postgres_version.clone())
};
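The effective values should then be the ones recorded in CMS. A sketch of the follow-up call; the CreateImageRecord variant and its fields are assumed rather than taken from the project:
// Hypothetical: persist the record with the source image's type, not
// the caller-supplied input.image_type.
ctx.schedule_activity_typed::<ImageOperationInput, ImageOperationResult>(
    activities::cms::image_ops::NAME,
    &ImageOperationInput {
        operation: ImageOperation::CreateImageRecord {
            name: input.name.clone(),
            image_type: effective_image_type.clone(),
            postgres_version: effective_postgres_version.clone(),
        },
    },
)
.await?;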
azcopy Workload Identity
CRITICAL: On AKS, always use --login-type=workload:
# WRONG - Uses IMDS (VM managed identity), fails with 403
azcopy login --identity
# CORRECT - Uses federated token from pod
azcopy login --login-type=workload
Debug workload identity issues:
# Check env vars are injected
kubectl exec <pod> -- env | grep AZURE_
# Should see:
# AZURE_CLIENT_ID=...
# AZURE_TENANT_ID=...
# AZURE_FEDERATED_TOKEN_FILE=/var/run/secrets/azure/tokens/azure-identity-token
pg_hba.conf for pg_durable
The pg_durable background worker connects over TCP (127.0.0.1), not a Unix socket, so pg_hba.conf must use trust authentication for localhost connections:
# postgres-config.yaml
pg_hba.conf: |
# IPv4 local connections (trust for pg_durable background worker)
host all all 127.0.0.1/32 trust
# IPv6 local connections (trust for pg_durable background worker)
host all all ::1/128 trust
# Remote connections require password
host all all 10.0.0.0/8 scram-sha-256
Common Issues
test_connection Timeout
Symptom: Instance provisioning fails at test_connection step.
Root cause: Azure LoadBalancer DNS propagation can take 60-90+ seconds.
Fix: Use a 120-second timeout:
ctx.schedule_activity_with_retry_typed::<TestConnectionInput, TestConnectionOutput>(
activities::test_connection::NAME,
&test_input,
RetryPolicy::new(5)
.with_backoff(BackoffStrategy::Exponential {
base: Duration::from_secs(2),
multiplier: 2.0,
max: Duration::from_secs(30),
})
.with_timeout(Duration::from_secs(120)), // Not 60s!
)
.await?;
pg_durable Worker Not Starting
Symptom: PostgreSQL starts but pg_durable background worker fails with "password authentication failed".
Root cause: pg_hba.conf uses scram-sha-256 for localhost TCP connections.
Fix: Change to trust for 127.0.0.1 and ::1 (see pg_hba.conf section above).
Restore Shows Wrong image_type
Symptom: Restored instance shows "stock" instead of "pg_durable" in CMS.
Root cause: Using input.image_type instead of the source image's image_type.
Fix: Fetch source image details before creating CMS record (see Image Type Preservation section).
azcopy 403 AuthorizationPermissionMismatch
Symptom: azcopy login --identity succeeds but operations fail with 403.
Root cause: --identity uses IMDS, not AKS workload identity.
Fix: Use --login-type=workload (see azcopy section above).
Debugging Commands
# Watch backup/restore job logs
kubectl logs -n toygres-managed job/<job-name> -f
# Check blob storage
az storage blob list --account-name <acct> --container-name images --auth-mode login
# Test PostgreSQL connection
kubectl exec -it <postgres-pod> -n toygres-managed -- psql -U postgres -c "SELECT 1"
# Check pg_durable worker
kubectl exec -it <postgres-pod> -n toygres-managed -- psql -U postgres -c "SELECT * FROM pg_stat_activity WHERE backend_type = 'pg_durable background worker'"
# View job completion status
kubectl get jobs -n toygres-managed
API Endpoints
// Create backup image
POST /api/images
{
"name": "my-backup",
"instance_id": "uuid",
"description": "Optional description"
}
// Create instance from image
POST /api/instances
{
"name": "restored-instance",
"password": "ignored-uses-source",
"source_image_id": "image-uuid",
"dns_label": "restored-instance"
}
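The corresponding request types on the server side might look like this; the struct names and the use of Option are assumptions based only on the JSON shapes above:
use serde::Deserialize;
use uuid::Uuid;

// POST /api/images
#[derive(Debug, Deserialize)]
struct CreateImageRequest {
    name: String,
    instance_id: Uuid,
    description: Option<String>,
}

// POST /api/instances
#[derive(Debug, Deserialize)]
struct CreateInstanceRequest {
    name: String,
    password: String,              // ignored when source_image_id is set
    source_image_id: Option<Uuid>, // restore path when present
    dns_label: String,
}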
Orchestration Versioning
Use start_orchestration, which automatically picks the latest version:
// api.rs - uses latest version (currently 1.0.2 with restore support)
runtime
.start_orchestration(
&orchestration_id,
orchestrations::CREATE_INSTANCE,
input_json,
)
.await?;
Development Workflow: Test Then Deploy
When making changes to backup/restore functionality, follow this workflow:
1. Build Release Version
cargo build --release
2. Test Locally with Azure PostgreSQL
Connect to the same CMS database used by AKS:
# Get the database URL from AKS secret
kubectl get secret toygres-secrets -n toygres-system -o jsonpath='{.data.DATABASE_URL}' | base64 -D
# Start local server
DATABASE_URL="postgresql://azureuser:xxx@xxx.postgres.azure.com:5432/postgres?sslmode=require" \
RUST_LOG=info,toygres_server=debug,toygres_orchestrations=debug \
./target/release/toygres-server standalone
# Test endpoints
curl http://localhost:8080/health
curl -X POST http://localhost:8080/login -H "Content-Type: application/x-www-form-urlencoded" -d "username=$TOYGRES_ADMIN_USERNAME&password=$TOYGRES_ADMIN_PASSWORD" -c /tmp/cookies.txt
curl -b /tmp/cookies.txt http://localhost:8080/api/instances
curl -b /tmp/cookies.txt http://localhost:8080/api/images
3. Deploy to AKS
After local testing succeeds:
./deploy/deploy-to-aks.sh
4. Verify AKS Deployment
# Check health
curl http://4.249.80.96/health
# View logs
kubectl logs -n toygres-system -l app.kubernetes.io/component=server -f
# Check pods
kubectl get pods -n toygres-system
Key Points
- Local testing uses the same Azure PostgreSQL as AKS (not local Docker)
- This ensures CMS state and duroxide orchestrations are consistent
- The standalone mode runs the API and workers in a single process