image-backup-restore


Implementing and debugging PostgreSQL image backup and restore features. Use when working with database snapshots, backup jobs, restore operations, azcopy blob transfers, or troubleshooting image-related provisioning failures.


When & Why to Use This Skill

This Claude skill provides comprehensive technical expertise for implementing and debugging PostgreSQL image-based backup and restore workflows within Kubernetes (AKS) environments. It specializes in automating database snapshots, managing Azure Blob Storage transfers via azcopy with workload identity, and orchestrating complex restoration flows. By addressing critical infrastructure challenges like DNS propagation delays, secure credential handling, and pg_hba.conf configuration, it supports reliable disaster recovery and database lifecycle management.

Use Cases

  • Automating PostgreSQL backups: Implementing logical backups using pg_dump and securing data transfers to Azure Blob Storage using azcopy with workload identity.
  • Disaster Recovery & Provisioning: Orchestrating the restoration of PostgreSQL instances from existing images, including PVC creation, secret management, and data loading.
  • Infrastructure Troubleshooting: Debugging common provisioning failures such as azcopy 403 authorization errors, connection timeouts during DNS propagation, and identity mismatches.
  • Security & Configuration Optimization: Configuring secure password handling during restores and fine-tuning pg_hba.conf to support internal background workers like pg_durable.

PostgreSQL Image Backup & Restore

Overview

Image backup creates point-in-time snapshots of PostgreSQL instances by:

  1. Running pg_dump to create a logical backup
  2. Uploading to Azure Blob Storage via azcopy
  3. Recording metadata in the CMS database

Image restore provisions new instances from existing backups by:

  1. Creating PVC, Secret, ConfigMap for the new instance
  2. Running pg_restore to load data from blob storage
  3. Starting PostgreSQL with recovery configuration

Key Files

toygres-orchestrations/src/
├── orchestrations/
│   ├── create_image.rs        # Backup orchestration
│   └── create_instance.rs     # Restore flow (v1.0.2+)
├── activities/
│   ├── run_backup_job.rs      # pg_dump + azcopy upload
│   ├── run_restore_job.rs     # azcopy download + pg_restore
│   ├── deploy_postgres_from_pvc.rs  # Start from existing PVC
│   ├── get_instance_password.rs     # Retrieve source password
│   └── cms/
│       └── image_ops.rs       # Image CMS operations
├── templates/
│   ├── backup-job.yaml        # K8s Job for backup
│   ├── restore-job.yaml       # K8s Job for restore
│   ├── postgres-config.yaml   # ConfigMap with pg_hba.conf
│   └── postgres-secret.yaml   # Secret with password
└── activity_types.rs          # ImageOperation enum

Backup Flow

// create_image.rs orchestration
1. CreateImageRecord        // Reserve ID in CMS
2. GetInstanceByK8sName     // Get source instance details
3. RunBackupJob             // pg_dump + azcopy to blob
4. UpdateImageStatus        // Mark as available
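
Each numbered step is one scheduled activity. A minimal sketch of step 3, using the same schedule_activity_typed API shown in the v1.0.2 excerpt further below (RunBackupJobInput/RunBackupJobOutput and their fields are illustrative, not the exact types in run_backup_job.rs):

// Sketch only - input/output type names and fields are illustrative
let result = ctx.schedule_activity_typed::<RunBackupJobInput, RunBackupJobOutput>(
    activities::run_backup_job::NAME,
    &RunBackupJobInput {
        source_host: instance.host.clone(), // from step 2 (GetInstanceByK8sName)
        blob_url: image.blob_url.clone(),   // destination derived from step 1's image ID
    },
)
.await?;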

Backup Job Template

# backup-job.yaml key commands
command:
  - /bin/sh
  - -c
  - |
    # Login with workload identity (NOT --identity!)
    azcopy login --login-type=workload

    # Create backup
    PGPASSWORD=$POSTGRES_PASSWORD pg_dump -h $SOURCE_HOST -U postgres -Fc $DATABASE > /backup/backup.dump

    # Upload to blob
    azcopy copy /backup/backup.dump "$BLOB_URL"
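
The $POSTGRES_PASSWORD and $BLOB_URL values are injected as Job environment variables. A sketch of that wiring, assuming a Secret named postgres-secret with a password key (the secret name, key, and URL shape here are illustrative):

# backup-job.yaml env wiring (sketch - names are illustrative)
env:
  - name: POSTGRES_PASSWORD
    valueFrom:
      secretKeyRef:
        name: postgres-secret
        key: password
  - name: BLOB_URL
    value: "https://<account>.blob.core.windows.net/images/<image-id>/backup.dump"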

Restore Flow

// create_instance.rs v1.0.2 restore path
1. CreatePvc                // Create empty PVC
2. GetSourcePassword        // Get password from source image
3. CreateImageRecord        // Reserve in CMS (with correct image_type!)
4. RunRestoreJob            // azcopy download + pg_restore
5. DeployPostgresFromPvc    // Start PostgreSQL
6. TestConnection           // Verify connectivity (120s timeout for DNS)
7. UpdateInstanceStatus     // Mark as running
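
The restore Job mirrors the backup Job in reverse. A sketch of the key commands in restore-job.yaml, assuming the same env var conventions as the backup template (exact flags in the real template may differ):

# restore-job.yaml key commands (sketch - mirrors backup-job.yaml)
command:
  - /bin/sh
  - -c
  - |
    # Same rule as backup: workload identity, NOT --identity
    azcopy login --login-type=workload

    # Download the dump from blob storage
    azcopy copy "$BLOB_URL" /restore/backup.dump

    # Load the custom-format dump (pg_dump -Fc) into the new instance
    PGPASSWORD=$POSTGRES_PASSWORD pg_restore -h $TARGET_HOST -U postgres -d $DATABASE /restore/backup.dump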

Critical: Password Handling

The restore flow must use the source image's password, not a new password:

// activity_types.rs
pub enum ImageOperation {
    GetSourcePassword { id: Uuid },  // Get password from image record
    // ...
}

pub enum ImageOperationResult {
    PasswordFound { password: String },
    // ...
}

// image_ops.rs
ImageOperation::GetSourcePassword { id } => {
    let encrypted = sqlx::query_scalar::<_, Vec<u8>>(
        "SELECT source_password_encrypted FROM cms.images WHERE id = $1"
    )
    .bind(id)
    .fetch_one(pool)
    .await?;

    let password = String::from_utf8(encrypted)?;
    Ok(ImageOperationResult::PasswordFound { password })
}

Critical: Image Type Preservation

When restoring, fetch image details BEFORE creating the CMS record:

// create_instance.rs v1.0.2
let (effective_image_type, effective_postgres_version) = if let Some(ref image_id) = input.source_image_id {
    ctx.trace_info(format!("[v1.0.2] Fetching source image details: {}", image_id));

    let details = ctx.schedule_activity_typed::<ImageOperationInput, ImageOperationResult>(
        activities::cms::image_ops::NAME,
        &ImageOperationInput {
            operation: ImageOperation::GetImageDetails { id: image_id.parse()? },
        },
    ).await?;

    match details {
        ImageOperationResult::ImageDetails { image_type, postgres_version, .. } => {
            (image_type, postgres_version)
        }
        _ => return Err("Failed to get image details".into()),
    }
} else {
    (input.image_type.clone(), postgres_version.clone())
};

azcopy Workload Identity

CRITICAL: On AKS, always use --login-type=workload:

# WRONG - Uses IMDS (VM managed identity), fails with 403
azcopy login --identity

# CORRECT - Uses federated token from pod
azcopy login --login-type=workload

Debug workload identity issues:

# Check env vars are injected
kubectl exec <pod> -- env | grep AZURE_

# Should see:
# AZURE_CLIENT_ID=...
# AZURE_TENANT_ID=...
# AZURE_FEDERATED_TOKEN_FILE=/var/run/secrets/azure/tokens/azure-identity-token
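
Those env vars only appear if the Job's pod opts in to workload identity. A sketch of the standard AKS wiring (the service account name is illustrative; the label and annotation keys are the standard workload identity ones):

# ServiceAccount bound to the managed identity that has blob access
apiVersion: v1
kind: ServiceAccount
metadata:
  name: toygres-backup            # illustrative name
  annotations:
    azure.workload.identity/client-id: "<managed-identity-client-id>"
---
# In the Job's pod template:
metadata:
  labels:
    azure.workload.identity/use: "true"   # triggers token + env var injection
spec:
  serviceAccountName: toygres-backup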

pg_hba.conf for pg_durable

The pg_durable background worker connects via TCP (127.0.0.1), not a Unix socket, so pg_hba.conf must use trust for localhost connections:

# postgres-config.yaml
pg_hba.conf: |
  # IPv4 local connections (trust for pg_durable background worker)
  host    all             all             127.0.0.1/32            trust
  # IPv6 local connections (trust for pg_durable background worker)
  host    all             all             ::1/128                 trust
  # Remote connections require password
  host    all             all             10.0.0.0/8              scram-sha-256

Common Issues

test_connection Timeout

Symptom: Instance provisioning fails at test_connection step.

Root cause: Azure LoadBalancer DNS propagation can take 60-90+ seconds.

Fix: Use 120s timeout:

ctx.schedule_activity_with_retry_typed::<TestConnectionInput, TestConnectionOutput>(
    activities::test_connection::NAME,
    &test_input,
    RetryPolicy::new(5)
        .with_backoff(BackoffStrategy::Exponential {
            base: Duration::from_secs(2),
            multiplier: 2.0,
            max: Duration::from_secs(30),
        })
        .with_timeout(Duration::from_secs(120)), // Not 60s!
)
.await?;

pg_durable Worker Not Starting

Symptom: PostgreSQL starts but pg_durable background worker fails with "password authentication failed".

Root cause: pg_hba.conf uses scram-sha-256 for localhost TCP connections.

Fix: Change to trust for 127.0.0.1 and ::1 (see pg_hba.conf section above).

Restore Shows Wrong image_type

Symptom: Restored instance shows "stock" instead of "pg_durable" in CMS.

Root cause: Using input.image_type instead of the source image's image_type.

Fix: Fetch source image details before creating CMS record (see Image Type Preservation section).

azcopy 403 AuthorizationPermissionMismatch

Symptom: azcopy login --identity succeeds but operations fail with 403.

Root cause: --identity uses IMDS, not AKS workload identity.

Fix: Use --login-type=workload (see azcopy section above).

Debugging Commands

# Watch backup/restore job logs
kubectl logs -n toygres-managed job/<job-name> -f

# Check blob storage
az storage blob list --account-name <acct> --container-name images --auth-mode login

# Test PostgreSQL connection
kubectl exec -it <postgres-pod> -n toygres-managed -- psql -U postgres -c "SELECT 1"

# Check pg_durable worker
kubectl exec -it <postgres-pod> -n toygres-managed -- psql -U postgres -c "SELECT * FROM pg_stat_activity WHERE backend_type = 'pg_durable background worker'"

# View job completion status
kubectl get jobs -n toygres-managed

API Endpoints

// Create backup image
POST /api/images
{
    "name": "my-backup",
    "instance_id": "uuid",
    "description": "Optional description"
}

// Create instance from image
POST /api/instances
{
    "name": "restored-instance",
    "password": "ignored-uses-source",
    "source_image_id": "image-uuid",
    "dns_label": "restored-instance"
}
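
For example, triggering a restore from the command line (reusing the session cookie from the login flow shown under Development Workflow below):

# Create an instance from an existing image
curl -b /tmp/cookies.txt -X POST http://localhost:8080/api/instances \
  -H "Content-Type: application/json" \
  -d '{"name": "restored-instance", "password": "ignored-uses-source", "source_image_id": "image-uuid", "dns_label": "restored-instance"}'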

Orchestration Versioning

Use start_orchestration, which automatically picks the latest version:

// api.rs - uses latest version (currently 1.0.2 with restore support)
runtime
    .start_orchestration(
        &orchestration_id,
        orchestrations::CREATE_INSTANCE,
        input_json,
    )
    .await?;

Development Workflow: Test Then Deploy

When making changes to backup/restore functionality, follow this workflow:

1. Build Release Version

cargo build --release

2. Test Locally with Azure PostgreSQL

Connect to the same CMS database used by AKS:

# Get the database URL from AKS secret
kubectl get secret toygres-secrets -n toygres-system -o jsonpath='{.data.DATABASE_URL}' | base64 -d  # on macOS, use base64 -D

# Start local server
DATABASE_URL="postgresql://azureuser:xxx@xxx.postgres.azure.com:5432/postgres?sslmode=require" \
RUST_LOG=info,toygres_server=debug,toygres_orchestrations=debug \
./target/release/toygres-server standalone

# Test endpoints
curl http://localhost:8080/health
curl -X POST http://localhost:8080/login -H "Content-Type: application/x-www-form-urlencoded" -d "username=$TOYGRES_ADMIN_USERNAME&password=$TOYGRES_ADMIN_PASSWORD" -c /tmp/cookies.txt
curl -b /tmp/cookies.txt http://localhost:8080/api/instances
curl -b /tmp/cookies.txt http://localhost:8080/api/images

3. Deploy to AKS

After local testing succeeds:

./deploy/deploy-to-aks.sh

4. Verify AKS Deployment

# Check health
curl http://4.249.80.96/health

# View logs
kubectl logs -n toygres-system -l app.kubernetes.io/component=server -f

# Check pods
kubectl get pods -n toygres-system

Key Points

  • Local testing uses the same Azure PostgreSQL as AKS (not local Docker)
  • This ensures CMS state and duroxide orchestrations are consistent
  • The standalone mode runs API + workers in a single process