# composable-rust-production

Expert knowledge for deploying and operating Composable Rust applications in production. Use when setting up database migrations, configuring connection pools, implementing backup/restore procedures, tuning performance, setting up monitoring and observability, or handling operational concerns like disaster recovery and production database management.
## When & Why to Use This Skill
This Claude skill provides expert-level guidance and automated assistance for the production lifecycle of Composable Rust applications. It covers critical operational tasks including idempotent database migrations using sqlx, optimized connection pool configuration, automated backup and restoration strategies, PostgreSQL performance tuning, and comprehensive monitoring setups. It also includes detailed disaster recovery runbooks and troubleshooting guides to ensure high availability and system reliability for Rust-based backend services.
## Use Cases
- Automating idempotent database migrations and schema changes within a CI/CD pipeline for Rust services.
- Configuring and right-sizing PostgreSQL connection pools to handle high-traffic production loads efficiently using PgPoolOptions.
- Implementing robust database backup, compression, and cloud storage upload scripts with defined retention policies.
- Tuning PostgreSQL parameters and creating optimized indexes specifically for event-sourcing workloads to improve query performance.
- Setting up observability dashboards, health check endpoints, and alert thresholds to track database metrics and application health.
- Executing disaster recovery procedures and restoration tests to ensure system resilience and meet RTO/RPO targets.
## When to Use This Skill
Automatically apply when:
- Setting up database migrations or schema changes
- Configuring connection pools for production
- Implementing backup and restore procedures
- Performance tuning PostgreSQL or application code
- Setting up monitoring, metrics, or observability
- Implementing disaster recovery procedures
- Production database operations
- Troubleshooting production issues
## Database Migrations (sqlx)

### Running Migrations

**Option 1: Helper Function (Deployment Scripts)**

```rust
use composable_rust_postgres::run_migrations;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let database_url = std::env::var("DATABASE_URL")?;

    // Run migrations during startup
    run_migrations(&database_url).await?;

    println!("Database ready!");
    Ok(())
}
```
**Option 2: EventStore Method (When Store Exists)**

```rust
use composable_rust_postgres::PostgresEventStore;

let store = PostgresEventStore::new(&database_url).await?;
store.run_migrations().await?;
```
**Option 3: sqlx CLI (Development)**

```bash
# Install CLI
cargo install sqlx-cli --no-default-features --features postgres

# Run migrations
sqlx migrate run --database-url postgres://localhost/mydb

# Revert last migration
sqlx migrate revert --database-url postgres://localhost/mydb
```
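If the application already manages its own `PgPool`, sqlx can also embed the migrations directory into the binary at compile time via the `migrate!` macro. A minimal sketch, assuming the migrations live in `./migrations` at the crate root:

```rust
use sqlx::postgres::PgPoolOptions;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let pool = PgPoolOptions::new()
        .connect(&std::env::var("DATABASE_URL")?)
        .await?;

    // Embedded at compile time; applies only migrations not yet
    // recorded in the _sqlx_migrations tracking table.
    sqlx::migrate!("./migrations").run(&pool).await?;
    Ok(())
}
```

This keeps the deployable artifact self-contained: no separate CLI step is needed in the container image.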
### Creating Migrations

**Step 1: Create a SQL file with a sequential number**

```bash
# migrations/003_add_user_context.sql
```

**Step 2: Write idempotent SQL**

```sql
-- Add user_context column to events table
ALTER TABLE events
ADD COLUMN IF NOT EXISTS user_context JSONB;

-- Add index
CREATE INDEX IF NOT EXISTS idx_events_user_context
ON events USING GIN (user_context);
```
**Critical Rules:**

- ✅ Always use `IF NOT EXISTS` for idempotency
- ✅ One logical change per migration
- ✅ Test on a copy of production data first
- ❌ Never edit existing migrations (create a new one)
- ❌ Never delete migrations (breaks version tracking)
### Migration Best Practices

```sql
-- ✅ GOOD: Idempotent
CREATE TABLE IF NOT EXISTS orders (
    id UUID PRIMARY KEY,
    customer_id UUID NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- ✅ GOOD: Safe column addition
ALTER TABLE orders
ADD COLUMN IF NOT EXISTS status TEXT DEFAULT 'pending';

-- ❌ BAD: Not idempotent
CREATE TABLE orders (...); -- Fails on second run

-- ❌ BAD: Destructive
DROP TABLE old_orders; -- Can't be undone
```
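A cheap way to enforce the idempotency rule in CI is to run the migration entry point twice against a disposable database and require both runs to succeed. A sketch using the `run_migrations` helper from above; the `TEST_DATABASE_URL` variable is an assumption for illustration:

```rust
use composable_rust_postgres::run_migrations;

#[tokio::test]
async fn migrations_apply_twice_without_error() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical env var pointing at a throwaway test database.
    let url = std::env::var("TEST_DATABASE_URL")?;

    // Whether the runner tracks applied migrations or re-executes the
    // SQL, a second run must complete cleanly for deploys to be safely
    // retryable.
    run_migrations(&url).await?;
    run_migrations(&url).await?;
    Ok(())
}
```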
## Connection Pooling

### Production Pool Configuration

```rust
use sqlx::postgres::PgPoolOptions;
use std::time::Duration;

let pool = PgPoolOptions::new()
    // Connection limits
    .max_connections(20)                        // Max concurrent
    .min_connections(5)                         // Keep warm
    // Timeouts
    .acquire_timeout(Duration::from_secs(10))   // Wait for conn
    .idle_timeout(Duration::from_secs(600))     // 10 min idle
    .max_lifetime(Duration::from_secs(1800))    // 30 min recycle
    // Health
    .test_before_acquire(true)                  // Validate conn
    .connect(&database_url)
    .await?;

let store = PostgresEventStore::from_pool(pool);
```
### Sizing Guidelines

**Formula (Little's law):** `connections ≈ req/sec × connection_hold_time + buffer`

**Example:**

- Load: 1000 req/sec
- Request time: 50 ms end to end
- Connection hold: 10 ms (the connection is held for only part of each request)
- Calculation: `1000 × 0.010 + 5 = 15 connections`

Short hold times keep this number surprisingly small; the table below gives more conservative rules of thumb that leave headroom for spikes and slow queries.
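The same arithmetic as a small helper, useful for sanity-checking pool config against measured load (the function name and parameters are illustrative):

```rust
/// Estimate pool size from measured load using Little's law:
/// concurrent connections ≈ arrival rate × average hold time.
fn estimate_pool_size(req_per_sec: f64, conn_hold_secs: f64, buffer: u32) -> u32 {
    (req_per_sec * conn_hold_secs).ceil() as u32 + buffer
}

fn main() {
    // 1000 req/sec, each holding a connection for 10 ms, plus headroom.
    assert_eq!(estimate_pool_size(1000.0, 0.010, 5), 15);
}
```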
**Recommendations by Load:**

| Environment | Max Connections | Use Case |
|---|---|---|
| Development | 5 | Minimal overhead |
| Staging | 10-20 | Simulate production |
| Low traffic | 20-50 | < 100 req/sec |
| Medium traffic | 50-100 | 100-1000 req/sec |
| High traffic | 100-200 | > 1000 req/sec |
**PostgreSQL Limits:**

```ini
# postgresql.conf
max_connections = 200   # Reserve some for admin/monitoring
```
### Pool Monitoring

```rust
// Check pool health
let size = pool.size();        // Current connections
let idle = pool.num_idle();    // Available connections

tracing::info!(
    pool_size = size,
    pool_idle = idle,
    "Connection pool status"
);

// Metrics to track:
// - pool.size() < max_connections (not exhausted)
// - pool.num_idle() > 0 (connections available)
// - Acquire time < 100ms (fast acquisition)
// - Connection errors ≈ 0 (healthy pool)
```
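To get this continuously rather than at a single point in time, the check can run on an interval in a background task. A sketch assuming a tokio runtime and a cloned `PgPool` handle (the function name and interval are illustrative):

```rust
use std::time::Duration;
use sqlx::PgPool;

/// Spawn a background task that logs pool statistics every 30 seconds.
fn spawn_pool_monitor(pool: PgPool) {
    tokio::spawn(async move {
        let mut tick = tokio::time::interval(Duration::from_secs(30));
        loop {
            tick.tick().await;
            tracing::info!(
                pool_size = pool.size(),
                pool_idle = pool.num_idle(),
                "Connection pool status"
            );
        }
    });
}
```

`PgPool` is a cheap reference-counted handle, so cloning one into the task does not create a second pool.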
## Backup and Restore

### Full Database Backup

```bash
# Backup with compression
pg_dump -h localhost -U postgres -d mydb \
  --format=custom \
  --compress=9 \
  --file=backup_$(date +%Y%m%d_%H%M%S).dump

# Restore
pg_restore -h localhost -U postgres \
  --clean --create \
  --dbname=postgres \
  backup_20250110_143022.dump
```
### Events-Only Backup (Faster)

```bash
# Backup events table (source of truth)
pg_dump -h localhost -U postgres -d mydb \
  --table=events \
  --format=custom \
  --file=events_backup_$(date +%Y%m%d_%H%M%S).dump

# Restore
pg_restore -h localhost -U postgres \
  --dbname=mydb \
  --table=events \
  events_backup_20250110_143022.dump
```
### Automated Backup Script

```bash
#!/bin/bash
# daily-backup.sh
set -e

BACKUP_DIR="/backups/postgres"
DATE=$(date +%Y%m%d_%H%M%S)
DB_NAME="mydb"
RETENTION_DAYS=30

# Create backup
pg_dump -h localhost -U postgres -d $DB_NAME \
  --format=custom --compress=9 \
  --file=$BACKUP_DIR/backup_$DATE.dump

# Upload to S3 (optional)
aws s3 cp $BACKUP_DIR/backup_$DATE.dump \
  s3://my-backups/postgres/backup_$DATE.dump

# Delete old backups
find $BACKUP_DIR -name "backup_*.dump" \
  -mtime +$RETENTION_DAYS -delete

echo "Backup completed: backup_$DATE.dump"
```

**Cron Schedule:**

```bash
# Daily at 2 AM
0 2 * * * /usr/local/bin/daily-backup.sh >> /var/log/backup.log 2>&1
```
### Testing Backups

**Critical: Test restoration monthly!**

```bash
# 1. Create test database
createdb -h localhost -U postgres mydb_test

# 2. Restore backup
pg_restore -h localhost -U postgres \
  --dbname=mydb_test \
  backup_latest.dump

# 3. Verify data
psql -h localhost -U postgres -d mydb_test \
  -c "SELECT COUNT(*) FROM events;"

# 4. Cleanup
dropdb -h localhost -U postgres mydb_test
```
## Performance Tuning

### PostgreSQL Configuration

**For Event Sourcing Workloads** (`postgresql.conf`):

```ini
# Memory (assumes a 16GB RAM server)
shared_buffers = 4GB                  # 25% of RAM
effective_cache_size = 12GB           # 75% of RAM
work_mem = 64MB                       # Per-operation memory

# Write Performance
wal_buffers = 16MB
checkpoint_completion_target = 0.9
max_wal_size = 4GB
min_wal_size = 1GB

# Query Performance (SSDs)
random_page_cost = 1.1                # Default of 4.0 suits HDDs
effective_io_concurrency = 200

# Logging
log_min_duration_statement = 1000     # Log queries > 1s
log_checkpoints = on
```
### Index Optimization

**Default Event Store Indexes:**

```sql
-- Primary key (automatic)
PRIMARY KEY (stream_id, version)

-- Query by time
CREATE INDEX idx_events_created ON events(created_at);

-- Query by type
CREATE INDEX idx_events_type ON events(event_type);
```

**Custom Indexes for Your Workload:**

```sql
-- Query by metadata fields
CREATE INDEX idx_events_user_id
ON events ((metadata->>'user_id'));

-- Query by correlation ID
CREATE INDEX idx_events_correlation
ON events ((metadata->>'correlation_id'));

-- Partial index for recent events. Index predicates must be
-- immutable, so NOW() is not allowed here; use a fixed cutoff
-- and recreate the index periodically.
CREATE INDEX idx_events_recent
ON events(created_at)
WHERE created_at > '2025-01-01';
```
### Table Maintenance

```bash
# Regular maintenance
psql -c "VACUUM ANALYZE events;"

# Full vacuum (locks the table - use a maintenance window)
psql -c "VACUUM FULL events;"

# Reindex if bloated
psql -c "REINDEX TABLE events;"
```

**Automate with autovacuum** (`postgresql.conf`):

```ini
autovacuum = on
autovacuum_vacuum_scale_factor = 0.1
autovacuum_analyze_scale_factor = 0.05
```
## Monitoring and Observability

### Health Check Endpoint

```rust
use std::sync::Arc;

use axum::{Json, extract::State};
use composable_rust_postgres::PostgresEventStore;
use serde::Serialize;

#[derive(Serialize)]
struct HealthStatus {
    status: String,
    database: String,
    event_count: i64,
}

async fn health_check(
    State(store): State<Arc<PostgresEventStore>>,
) -> Json<HealthStatus> {
    // Test database connectivity
    let db_status = match sqlx::query("SELECT 1").execute(store.pool()).await {
        Ok(_) => "healthy",
        Err(_) => "unhealthy",
    };

    // Get event count
    let event_count: (i64,) = sqlx::query_as("SELECT COUNT(*) FROM events")
        .fetch_one(store.pool())
        .await
        .unwrap_or((0,));

    Json(HealthStatus {
        status: (if db_status == "healthy" { "ok" } else { "error" }).to_string(),
        database: db_status.to_string(),
        event_count: event_count.0,
    })
}
```
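Wiring the handler into an axum router might look like the following sketch (the route path, listener address, and `serve` function name are illustrative; `axum::serve` assumes axum 0.7 or later):

```rust
use std::sync::Arc;

use axum::{Router, routing::get};
use composable_rust_postgres::PostgresEventStore;

async fn serve(store: PostgresEventStore) -> Result<(), Box<dyn std::error::Error>> {
    let app = Router::new()
        .route("/health", get(health_check))
        .with_state(Arc::new(store));

    let listener = tokio::net::TcpListener::bind("0.0.0.0:8080").await?;
    axum::serve(listener, app).await?;
    Ok(())
}
```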
### Key Metrics to Track

**Database Metrics:**

```sql
-- Active connections
SELECT count(*) FROM pg_stat_activity WHERE state = 'active';

-- Database size
SELECT pg_size_pretty(pg_database_size('mydb'));

-- Table size
SELECT pg_size_pretty(pg_total_relation_size('events'));

-- Slow queries (requires pg_stat_statements)
SELECT query, calls, total_exec_time, mean_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
```
**Application Metrics:**

```rust
use std::time::Instant;

use metrics::{counter, gauge, histogram};

// Track reducer execution time
let start = Instant::now();
let effects = reducer.reduce(&mut state, action, &env);
histogram!("reducer_duration_ms").record(start.elapsed().as_millis() as f64);

// Track effect execution
counter!("effects_executed_total").increment(effects.len() as u64);

// Track store state
gauge!("store_state_size_bytes").set(state_size as f64);
```
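These macros only record data once a metrics recorder is installed at startup. One common choice is the `metrics-exporter-prometheus` crate (an assumption here, not something this project mandates), which exposes a Prometheus scrape endpoint:

```rust
use metrics_exporter_prometheus::PrometheusBuilder;

fn init_metrics() -> Result<(), Box<dyn std::error::Error>> {
    // Installs a global recorder and serves Prometheus-format metrics
    // over HTTP on the given address. Must be called from within the
    // tokio runtime when using the HTTP listener.
    PrometheusBuilder::new()
        .with_http_listener(([0, 0, 0, 0], 9090))
        .install()?;
    Ok(())
}
```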
### Alert Thresholds

Set up alerts for:

- **Connection pool exhausted**: `pool.num_idle() == 0` for > 1 minute
- **High query latency**: p99 > 1 second
- **Disk space low**: < 20% free
- **Replication lag**: > 60 seconds (if using replication)
- **Failed backups**: last successful backup > 24 hours ago (see the sketch below)
- **Error rate**: > 1% of requests failing
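The backup-age alert is easy to drive from the filesystem the backup script writes to. A minimal sketch, assuming the `/backups/postgres` layout used by `daily-backup.sh` above (the function name is illustrative):

```rust
use std::{fs, path::Path, time::Duration};

/// Returns true if the newest file in `dir` is older than `max_age`,
/// meaning the backup job has likely been failing.
fn backups_are_stale(dir: &Path, max_age: Duration) -> std::io::Result<bool> {
    let newest = fs::read_dir(dir)?
        .filter_map(|entry| entry.ok()?.metadata().ok()?.modified().ok())
        .max();

    Ok(match newest {
        Some(modified) => modified.elapsed().unwrap_or_default() > max_age,
        None => true, // no backups at all counts as stale
    })
}
```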
## Disaster Recovery

### Recovery Time Objective (RTO)

**Target: < 4 hours to full recovery**

Steps:

1. Provision a new PostgreSQL instance (30 min)
2. Restore from backup (1-2 hours, depending on size)
3. Verify data integrity (30 min)
4. Reconfigure applications (30 min)

### Recovery Point Objective (RPO)

**Target: < 1 hour of data loss**

Strategies:

- **Tier 1 (no data loss)**: continuous WAL archiving + PITR
- **Tier 2 (< 1 hour)**: hourly backups
- **Tier 3 (< 24 hours)**: daily backups
### Disaster Recovery Runbook

**Step 1: Assess the situation**

```bash
# Check PostgreSQL logs
tail -100 /var/log/postgresql/postgresql.log

# Check disk space
df -h

# Check system resources
top
iostat
```

**Step 2: Attempt quick recovery**

```bash
# Restart PostgreSQL
systemctl restart postgresql

# Verify
systemctl status postgresql
psql -c "SELECT 1"
```

**Step 3: If the restart fails, restore from backup**

```bash
# Stop application
systemctl stop myapp

# Restore latest backup
pg_restore -h localhost -U postgres \
  --clean --create --dbname=postgres \
  /backups/postgres/backup_latest.dump

# Run migrations
./myapp migrate

# Start application
systemctl start myapp
```

**Step 4: Verify recovery**

```bash
# Check event count
psql -c "SELECT COUNT(*) FROM events;"

# Check application health
curl http://localhost:8080/health
```
## Production Checklist
Before deploying to production, verify:
- Migrations automated in deployment pipeline
- Connection pool sized for expected load
- Daily automated backups configured
- Backup restoration tested (within last 30 days)
- Monitoring and alerting set up
- Performance tuning applied (postgresql.conf)
- Disaster recovery runbook documented and tested
- Database access secured (SSL, firewall, strong passwords)
- High availability configured (if needed: replication, failover)
- Resource limits set (connection limits, rate limiting)
## Troubleshooting Guide

### Connection Pool Exhausted

**Symptom:** "Timeout acquiring connection from pool"

**Diagnosis:**

```rust
let size = pool.size();
let idle = pool.num_idle();
tracing::error!(size, idle, "Pool exhausted");
```

**Solutions:**

- Increase `max_connections` in the pool config
- Reduce connection hold time (optimize queries)
- Add connection timeout handling (see the sketch below)
- Scale horizontally (more app instances)
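For the timeout-handling point, sqlx surfaces pool exhaustion as `sqlx::Error::PoolTimedOut` once `acquire_timeout` elapses. Catching it turns an opaque hang into a fast, observable failure; a sketch:

```rust
use sqlx::PgPool;

async fn with_connection(pool: &PgPool) -> Result<(), sqlx::Error> {
    match pool.acquire().await {
        Ok(mut conn) => {
            // Hold the connection as briefly as possible.
            sqlx::query("SELECT 1").execute(&mut *conn).await?;
            Ok(())
        }
        Err(sqlx::Error::PoolTimedOut) => {
            // Exceeded acquire_timeout: shed load instead of queueing forever.
            tracing::warn!("pool exhausted; rejecting request");
            Err(sqlx::Error::PoolTimedOut)
        }
        Err(e) => Err(e),
    }
}
```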
### Slow Queries

**Symptom:** High latency on database operations

**Diagnosis:**

```sql
-- Enable query logging
SET log_min_duration_statement = 100;

-- Check slow queries
SELECT query, mean_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC;
```

**Solutions:**

- Add missing indexes
- Optimize query patterns
- Use `EXPLAIN ANALYZE`
- Consider read replicas for queries
### High Memory Usage

**Symptom:** PostgreSQL using excessive memory

**Diagnosis:**

```sql
SELECT pg_size_pretty(pg_database_size('mydb'));
SELECT pg_size_pretty(pg_total_relation_size('events'));
```

**Solutions:**

- Run `VACUUM ANALYZE`
- Reduce `shared_buffers` if set too high
- Tune `work_mem` to match query complexity
- Archive old events to a separate table
## See Also

- **Architecture**: see the `composable-rust-architecture` skill for core patterns
- **Event Sourcing**: see the `composable-rust-event-sourcing` skill for event design
- **Testing**: see the `composable-rust-testing` skill for integration tests
- **Documentation**: `docs/production-database.md` (800+ lines)
- **Observability**: `docs/observability.md` for tracing and metrics