infrastructure
Infrastructure as Code (IaC) practices for provisioning and managing cloud resources. Use when provisioning cloud resources, managing infrastructure changes, setting up environments, disaster recovery planning, or infrastructure audits.
When & Why to Use This Skill
This Claude skill provides comprehensive expertise in Infrastructure as Code (IaC) to automate the provisioning, management, and scaling of cloud environments. It offers actionable guidance on industry-standard tools like Terraform, Pulumi, and AWS CloudFormation, helping teams implement version-controlled, modular, and secure infrastructure patterns that eliminate manual configuration drift.
Use Cases
- Automated Resource Provisioning: Rapidly deploying compute, storage, and networking resources across multi-cloud environments (AWS, Azure, GCP) using declarative code.
- Environment Consistency: Managing infrastructure changes across development, staging, and production to ensure parity and prevent configuration drift.
- Disaster Recovery & High Availability: Designing and implementing multi-region failover architectures and automated backup strategies for business continuity.
- Security & Compliance as Code: Establishing least-privilege IAM policies, network security groups, and encryption standards directly within the infrastructure definition.
- Cost Optimization & Management: Implementing resource tagging, right-sizing strategies, and scheduled scaling to monitor and reduce cloud expenditure.
- CI/CD for Infrastructure: Integrating infrastructure testing and deployment into automated pipelines with plan previews and policy-as-code validation.
| name | infrastructure |
|---|---|
| description | Infrastructure as Code (IaC) practices for provisioning and managing cloud resources. Use when provisioning cloud resources, managing infrastructure changes, setting up environments, disaster recovery planning, or infrastructure audits. |
Infrastructure as Code
Core Principle
Infrastructure is code. Version it, test it, review it, automate it.
Infrastructure should be defined declaratively, stored in version control, reviewed like code, and deployed automatically. Manual infrastructure changes lead to drift, inconsistency, and production issues. Treat your infrastructure with the same rigor as application code.
When to Use
Use this skill when:
- Provisioning new cloud resources (compute, storage, networking)
- Managing infrastructure changes across environments
- Setting up new environments (dev, staging, production)
- Implementing disaster recovery and business continuity
- Conducting infrastructure audits and cost optimization
- Migrating infrastructure between cloud providers
- Establishing infrastructure standards and patterns
- Scaling infrastructure to meet demand
- Implementing security and compliance requirements
Infrastructure as Code Tools
Terraform (Multi-Cloud)
Best for: Multi-cloud deployments, complex infrastructure, mature ecosystems
# Example: Terraform AWS EC2 instance
provider "aws" {
region = "us-west-2"
}
resource "aws_instance" "web_server" {
ami = "ami-0c55b159cbfafe1f0"
instance_type = "t3.micro"
tags = {
Name = "web-server"
Environment = "production"
ManagedBy = "terraform"
}
vpc_security_group_ids = [aws_security_group.web_sg.id]
user_data = <<-EOF
#!/bin/bash
yum update -y
yum install -y httpd
systemctl start httpd
systemctl enable httpd
EOF
}
resource "aws_security_group" "web_sg" {
name = "web-security-group"
description = "Allow HTTP and HTTPS traffic"
ingress {
from_port = 80
to_port = 80
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
}
ingress {
from_port = 443
to_port = 443
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
}
Key Features:
- Declarative configuration language (HCL)
- Multi-cloud support (AWS, Azure, GCP, 1000+ providers)
- State management (tracks actual vs desired state)
- Plan preview before applying changes
- Module system for reusability
- Large community and ecosystem
Pulumi (Code-First IaC)
Best for: Teams preferring general-purpose languages, complex logic, existing code reuse
// Example: Pulumi AWS S3 bucket with TypeScript
import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";
// Create an S3 bucket with versioning and encryption
const bucket = new aws.s3.Bucket("my-bucket", {
versioning: {
enabled: true,
},
serverSideEncryptionConfiguration: {
rule: {
applyServerSideEncryptionByDefault: {
sseAlgorithm: "AES256",
},
},
},
tags: {
Environment: "production",
ManagedBy: "pulumi",
},
});
// Create a bucket policy
const bucketPolicy = new aws.s3.BucketPolicy("my-bucket-policy", {
bucket: bucket.id,
policy: bucket.arn.apply(arn => JSON.stringify({
Version: "2012-10-17",
Statement: [{
Effect: "Allow",
Principal: "*",
Action: "s3:GetObject",
Resource: `${arn}/*`,
Condition: {
IpAddress: {
"aws:SourceIp": ["203.0.113.0/24"]
}
}
}],
})),
});
// Export the bucket name
export const bucketName = bucket.id;
export const bucketArn = bucket.arn;
Key Features:
- Use familiar programming languages (TypeScript, Python, Go, C#)
- Full programming language features (loops, conditionals, functions)
- Strong typing and IDE support
- Component-based architecture
- Automatic state management
- Built-in secrets management
AWS CloudFormation
Best for: AWS-only infrastructure, AWS-native tooling
# Example: CloudFormation VPC with subnets
AWSTemplateFormatVersion: '2010-09-09'
Description: 'VPC with public and private subnets'
Parameters:
EnvironmentName:
Type: String
Default: production
Description: Environment name prefix
Resources:
VPC:
Type: AWS::EC2::VPC
Properties:
CidrBlock: 10.0.0.0/16
EnableDnsHostnames: true
EnableDnsSupport: true
Tags:
- Key: Name
Value: !Sub '${EnvironmentName}-vpc'
PublicSubnet1:
Type: AWS::EC2::Subnet
Properties:
VpcId: !Ref VPC
AvailabilityZone: !Select [0, !GetAZs '']
CidrBlock: 10.0.1.0/24
MapPublicIpOnLaunch: true
Tags:
- Key: Name
Value: !Sub '${EnvironmentName}-public-subnet-1'
PrivateSubnet1:
Type: AWS::EC2::Subnet
Properties:
VpcId: !Ref VPC
AvailabilityZone: !Select [0, !GetAZs '']
CidrBlock: 10.0.11.0/24
Tags:
- Key: Name
Value: !Sub '${EnvironmentName}-private-subnet-1'
InternetGateway:
Type: AWS::EC2::InternetGateway
Properties:
Tags:
- Key: Name
Value: !Sub '${EnvironmentName}-igw'
AttachGateway:
Type: AWS::EC2::VPCGatewayAttachment
Properties:
VpcId: !Ref VPC
InternetGatewayId: !Ref InternetGateway
Outputs:
VPCId:
Description: VPC ID
Value: !Ref VPC
Export:
Name: !Sub '${EnvironmentName}-VPC-ID'
Azure Resource Manager (ARM) / Bicep
Best for: Azure-only infrastructure, Azure-native tooling
// Example: Bicep - Azure App Service with SQL Database
param location string = resourceGroup().location
param appName string = 'myapp-${uniqueString(resourceGroup().id)}'
resource appServicePlan 'Microsoft.Web/serverfarms@2021-02-01' = {
name: '${appName}-plan'
location: location
sku: {
name: 'B1'
tier: 'Basic'
capacity: 1
}
kind: 'linux'
properties: {
reserved: true
}
}
resource webApp 'Microsoft.Web/sites@2021-02-01' = {
name: appName
location: location
properties: {
serverFarmId: appServicePlan.id
siteConfig: {
linuxFxVersion: 'NODE|18-lts'
appSettings: [
{
name: 'WEBSITE_NODE_DEFAULT_VERSION'
value: '18-lts'
}
]
}
}
}
resource sqlServer 'Microsoft.Sql/servers@2021-11-01' = {
name: '${appName}-sql'
location: location
properties: {
administratorLogin: 'sqladmin'
administratorLoginPassword: 'P@ssw0rd123!' // Use Key Vault in production!
}
}
resource sqlDatabase 'Microsoft.Sql/servers/databases@2021-11-01' = {
parent: sqlServer
name: 'appdb'
location: location
sku: {
name: 'Basic'
tier: 'Basic'
}
}
output webAppUrl string = webApp.properties.defaultHostName
output sqlServerFqdn string = sqlServer.properties.fullyQualifiedDomainName
Google Cloud Deployment Manager
Best for: GCP-only infrastructure
# Example: GCP Deployment Manager - Compute Instance
resources:
- name: web-server
type: compute.v1.instance
properties:
zone: us-central1-a
machineType: zones/us-central1-a/machineTypes/n1-standard-1
disks:
- deviceName: boot
type: PERSISTENT
boot: true
autoDelete: true
initializeParams:
sourceImage: projects/debian-cloud/global/images/family/debian-11
networkInterfaces:
- network: global/networks/default
accessConfigs:
- name: External NAT
type: ONE_TO_ONE_NAT
metadata:
items:
- key: startup-script
value: |
#!/bin/bash
apt-get update
apt-get install -y nginx
systemctl start nginx
systemctl enable nginx
tags:
items:
- http-server
- https-server
IaC Best Practices
1. Version Control Everything
All infrastructure code must be in version control.
# Standard IaC repository structure
infrastructure/
├── README.md
├── .gitignore # Ignore: .terraform/, *.tfstate, secrets
├── environments/
│ ├── dev/
│ │ ├── main.tf
│ │ ├── variables.tf
│ │ └── terraform.tfvars
│ ├── staging/
│ │ ├── main.tf
│ │ ├── variables.tf
│ │ └── terraform.tfvars
│ └── production/
│ ├── main.tf
│ ├── variables.tf
│ └── terraform.tfvars
└── modules/
├── networking/
│ ├── main.tf
│ ├── variables.tf
│ └── outputs.tf
├── compute/
│ ├── main.tf
│ ├── variables.tf
│ └── outputs.tf
└── database/
├── main.tf
├── variables.tf
└── outputs.tf
What to ignore:
# Terraform
.terraform/
*.tfstate
*.tfstate.backup
*.tfvars # If contains secrets
# Pulumi
.pulumi/
Pulumi.*.yaml # If contains secrets
# Secrets
*.pem
*.key
secrets.yaml
credentials.json
2. Modular Infrastructure
Create reusable modules for common patterns.
# modules/web-app/main.tf
variable "app_name" {
type = string
}
variable "environment" {
type = string
}
variable "instance_count" {
type = number
default = 2
}
resource "aws_launch_template" "app" {
name_prefix = "${var.app_name}-"
image_id = data.aws_ami.amazon_linux.id
instance_type = "t3.micro"
vpc_security_group_ids = [aws_security_group.app.id]
tag_specifications {
resource_type = "instance"
tags = {
Name = var.app_name
Environment = var.environment
}
}
}
resource "aws_autoscaling_group" "app" {
name = "${var.app_name}-asg"
vpc_zone_identifier = var.subnet_ids
min_size = var.instance_count
max_size = var.instance_count * 2
desired_capacity = var.instance_count
launch_template {
id = aws_launch_template.app.id
version = "$Latest"
}
target_group_arns = [aws_lb_target_group.app.arn]
}
# Usage in production environment
module "web_app" {
source = "../../modules/web-app"
app_name = "myapp"
environment = "production"
instance_count = 4
subnet_ids = module.networking.private_subnet_ids
}
3. Environment Separation
Separate environments to prevent accidents.
Strategy 1: Separate directories (Terraform)
environments/
├── dev/
│ └── main.tf # Points to dev AWS account
├── staging/
│ └── main.tf # Points to staging AWS account
└── production/
└── main.tf # Points to prod AWS account (requires approval)
Strategy 2: Separate projects (Pulumi)
pulumi/
├── dev/ # Pulumi stack: dev
├── staging/ # Pulumi stack: staging
└── production/ # Pulumi stack: production
Strategy 3: Workspaces (Terraform)
terraform workspace list
default
dev
staging
* production
Environment-specific configuration:
# variables.tf
variable "environment_config" {
type = map(object({
instance_type = string
instance_count = number
db_instance_class = string
}))
default = {
dev = {
instance_type = "t3.micro"
instance_count = 1
db_instance_class = "db.t3.micro"
}
staging = {
instance_type = "t3.small"
instance_count = 2
db_instance_class = "db.t3.small"
}
production = {
instance_type = "t3.medium"
instance_count = 4
db_instance_class = "db.r5.large"
}
}
}
locals {
config = var.environment_config[var.environment]
}
resource "aws_instance" "app" {
instance_type = local.config.instance_type
count = local.config.instance_count
# ...
}
4. State Management
Terraform state must be stored remotely and locked.
# backend.tf - S3 backend with DynamoDB locking
terraform {
backend "s3" {
bucket = "mycompany-terraform-state"
key = "production/myapp/terraform.tfstate"
region = "us-west-2"
dynamodb_table = "terraform-state-lock"
encrypt = true
}
}
Creating the S3 backend resources:
# bootstrap/main.tf - Run this once to create backend
resource "aws_s3_bucket" "terraform_state" {
bucket = "mycompany-terraform-state"
lifecycle {
prevent_destroy = true
}
}
resource "aws_s3_bucket_versioning" "terraform_state" {
bucket = aws_s3_bucket.terraform_state.id
versioning_configuration {
status = "Enabled"
}
}
resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
bucket = aws_s3_bucket.terraform_state.id
rule {
apply_server_side_encryption_by_default {
sse_algorithm = "AES256"
}
}
}
resource "aws_dynamodb_table" "terraform_locks" {
name = "terraform-state-lock"
billing_mode = "PAY_PER_REQUEST"
hash_key = "LockID"
attribute {
name = "LockID"
type = "S"
}
}
Pulumi state:
# Pulumi automatically manages state in Pulumi Cloud
# Or configure self-hosted backend:
pulumi login s3://my-pulumi-state-bucket
5. Plan Before Apply
Always preview changes before applying.
# Terraform workflow
terraform init # Initialize providers and modules
terraform validate # Validate configuration syntax
terraform fmt # Format code consistently
terraform plan # Preview changes
terraform plan -out=plan.tfplan # Save plan
terraform apply plan.tfplan # Apply saved plan
# Never run: terraform apply -auto-approve (in production)
# Pulumi workflow
pulumi preview # Preview changes
pulumi up # Apply changes (with confirmation)
pulumi up --yes # Auto-approve (CI/CD only)
6. Code Review for Infrastructure
Review infrastructure changes like application code.
# .github/workflows/terraform-pr.yml
name: Terraform PR Check
on:
pull_request:
paths:
- 'infrastructure/**'
jobs:
terraform:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Setup Terraform
uses: hashicorp/setup-terraform@v2
- name: Terraform Init
run: terraform init
working-directory: infrastructure/production
- name: Terraform Format Check
run: terraform fmt -check
working-directory: infrastructure/production
- name: Terraform Validate
run: terraform validate
working-directory: infrastructure/production
- name: Terraform Plan
run: terraform plan -no-color
working-directory: infrastructure/production
env:
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
- name: Comment PR with Plan
uses: actions/github-script@v6
with:
script: |
const output = `#### Terraform Plan
\`\`\`
${{ steps.plan.outputs.stdout }}
\`\`\`
`;
github.rest.issues.createComment({
issue_number: context.issue.number,
owner: context.repo.owner,
repo: context.repo.repo,
body: output
});
Infrastructure Patterns
Multi-Tier Web Application
Classic 3-tier architecture: Web → Application → Database
# Example: AWS 3-tier architecture with Terraform
module "networking" {
source = "./modules/networking"
vpc_cidr = "10.0.0.0/16"
availability_zones = ["us-west-2a", "us-west-2b", "us-west-2c"]
public_subnet_cidrs = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
private_subnet_cidrs = ["10.0.11.0/24", "10.0.12.0/24", "10.0.13.0/24"]
db_subnet_cidrs = ["10.0.21.0/24", "10.0.22.0/24", "10.0.23.0/24"]
}
# Web tier - ALB in public subnets
module "web_tier" {
source = "./modules/alb"
name = "web-alb"
vpc_id = module.networking.vpc_id
subnet_ids = module.networking.public_subnet_ids
security_groups = [module.networking.web_sg_id]
certificate_arn = aws_acm_certificate.app.arn
}
# Application tier - EC2 instances in private subnets
module "app_tier" {
source = "./modules/ec2-asg"
name = "app-servers"
vpc_id = module.networking.vpc_id
subnet_ids = module.networking.private_subnet_ids
security_groups = [module.networking.app_sg_id]
target_group_arns = [module.web_tier.target_group_arn]
min_size = 2
max_size = 10
desired_size = 4
instance_type = "t3.medium"
}
# Database tier - RDS in database subnets
module "db_tier" {
source = "./modules/rds"
identifier = "app-db"
engine = "postgres"
engine_version = "15.4"
instance_class = "db.r5.large"
allocated_storage = 100
db_name = "appdb"
username = "dbadmin"
password = data.aws_secretsmanager_secret_version.db_password.secret_string
vpc_id = module.networking.vpc_id
subnet_ids = module.networking.db_subnet_ids
vpc_security_group_ids = [module.networking.db_sg_id]
backup_retention_period = 7
multi_az = true
storage_encrypted = true
}
Kubernetes Cluster Infrastructure
EKS cluster with managed node groups:
# modules/eks/main.tf
module "eks" {
source = "terraform-aws-modules/eks/aws"
version = "~> 19.0"
cluster_name = var.cluster_name
cluster_version = "1.28"
vpc_id = var.vpc_id
subnet_ids = var.subnet_ids
# Cluster access
cluster_endpoint_public_access = true
cluster_endpoint_private_access = true
# Managed node groups
eks_managed_node_groups = {
general = {
desired_size = 3
min_size = 2
max_size = 10
instance_types = ["t3.medium"]
capacity_type = "ON_DEMAND"
labels = {
role = "general"
}
tags = {
NodeGroup = "general"
}
}
compute = {
desired_size = 2
min_size = 1
max_size = 5
instance_types = ["c5.xlarge"]
capacity_type = "SPOT"
labels = {
role = "compute"
}
taints = [{
key = "workload"
value = "compute"
effect = "NO_SCHEDULE"
}]
}
}
# Cluster addons
cluster_addons = {
coredns = {
most_recent = true
}
kube-proxy = {
most_recent = true
}
vpc-cni = {
most_recent = true
}
}
}
Serverless Infrastructure
AWS Lambda with API Gateway and DynamoDB:
// Pulumi - TypeScript serverless stack
import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";
import * as awsx from "@pulumi/awsx";
// DynamoDB table
const table = new aws.dynamodb.Table("app-table", {
attributes: [
{ name: "id", type: "S" },
],
hashKey: "id",
billingMode: "PAY_PER_REQUEST",
});
// Lambda function
const lambda = new aws.lambda.Function("api-handler", {
runtime: aws.lambda.Runtime.NodeJS18dX,
handler: "index.handler",
role: lambdaRole.arn,
code: new pulumi.asset.AssetArchive({
".": new pulumi.asset.FileArchive("./lambda"),
}),
environment: {
variables: {
TABLE_NAME: table.name,
},
},
});
// API Gateway
const api = new awsx.apigateway.API("api", {
routes: [
{
path: "/items",
method: "GET",
eventHandler: lambda,
},
{
path: "/items",
method: "POST",
eventHandler: lambda,
},
{
path: "/items/{id}",
method: "GET",
eventHandler: lambda,
},
],
});
// IAM role for Lambda
const lambdaRole = new aws.iam.Role("lambda-role", {
assumeRolePolicy: aws.iam.assumeRolePolicyForPrincipal({
Service: "lambda.amazonaws.com",
}),
});
// Attach policies
new aws.iam.RolePolicyAttachment("lambda-basic-execution", {
role: lambdaRole,
policyArn: aws.iam.ManagedPolicy.AWSLambdaBasicExecutionRole,
});
new aws.iam.RolePolicy("lambda-dynamodb-policy", {
role: lambdaRole,
policy: pulumi.all([table.arn]).apply(([tableArn]) => JSON.stringify({
Version: "2012-10-17",
Statement: [{
Effect: "Allow",
Action: [
"dynamodb:GetItem",
"dynamodb:PutItem",
"dynamodb:Query",
"dynamodb:Scan",
],
Resource: tableArn,
}],
})),
});
export const apiUrl = api.url;
export const tableName = table.name;
Security Best Practices
1. Secrets Management
Never hardcode secrets in IaC.
❌ Bad:
resource "aws_db_instance" "db" {
username = "admin"
password = "SuperSecret123!" # NEVER DO THIS
}
✅ Good: Use secrets manager
# Store secret in AWS Secrets Manager
resource "aws_secretsmanager_secret" "db_password" {
name = "production/db/password"
}
resource "aws_secretsmanager_secret_version" "db_password" {
secret_id = aws_secretsmanager_secret.db_password.id
secret_string = var.db_password # Passed via environment variable
}
# Reference secret in RDS
data "aws_secretsmanager_secret_version" "db_password" {
secret_id = aws_secretsmanager_secret.db_password.id
}
resource "aws_db_instance" "db" {
username = "admin"
password = data.aws_secretsmanager_secret_version.db_password.secret_string
}
Pulumi with encrypted secrets:
# Pulumi encrypts secrets automatically
pulumi config set --secret dbPassword "SuperSecret123!"
// Access in code
const config = new pulumi.Config();
const dbPassword = config.requireSecret("dbPassword");
2. Least Privilege IAM Policies
Grant minimum required permissions.
# Good: Specific permissions for specific resources
resource "aws_iam_role_policy" "lambda_policy" {
name = "lambda-specific-access"
role = aws_iam_role.lambda.id
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Action = [
"dynamodb:GetItem",
"dynamodb:PutItem",
"dynamodb:Query"
]
Resource = aws_dynamodb_table.users.arn
},
{
Effect = "Allow"
Action = [
"s3:GetObject",
"s3:PutObject"
]
Resource = "${aws_s3_bucket.uploads.arn}/*"
}
]
})
}
# Bad: Overly permissive
# policy = "arn:aws:iam::aws:policy/AdministratorAccess" # NEVER
3. Network Security
Implement defense in depth.
# Security groups with minimal access
resource "aws_security_group" "app_sg" {
name = "app-servers"
description = "Security group for application servers"
vpc_id = aws_vpc.main.id
# Only allow traffic from ALB
ingress {
from_port = 8080
to_port = 8080
protocol = "tcp"
security_groups = [aws_security_group.alb_sg.id]
}
# Allow outbound traffic
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
}
resource "aws_security_group" "db_sg" {
name = "database"
description = "Security group for database"
vpc_id = aws_vpc.main.id
# Only allow traffic from app servers
ingress {
from_port = 5432
to_port = 5432
protocol = "tcp"
security_groups = [aws_security_group.app_sg.id]
}
# No outbound internet access needed for DB
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = []
}
}
4. Encryption at Rest and in Transit
# S3 bucket with encryption
resource "aws_s3_bucket" "data" {
bucket = "app-data-bucket"
}
resource "aws_s3_bucket_server_side_encryption_configuration" "data" {
bucket = aws_s3_bucket.data.id
rule {
apply_server_side_encryption_by_default {
sse_algorithm = "aws:kms"
kms_master_key_id = aws_kms_key.s3.id
}
}
}
# RDS with encryption
resource "aws_db_instance" "db" {
storage_encrypted = true
kms_key_id = aws_kms_key.rds.arn
# ...
}
# EBS volumes encrypted by default
resource "aws_ebs_encryption_by_default" "enabled" {
enabled = true
}
Cost Management
1. Resource Tagging
Tag all resources for cost allocation.
# Standard tags for all resources
locals {
common_tags = {
Environment = var.environment
Project = "myapp"
ManagedBy = "terraform"
CostCenter = "engineering"
Owner = "devops-team"
}
}
resource "aws_instance" "app" {
instance_type = "t3.medium"
ami = data.aws_ami.amazon_linux.id
tags = merge(
local.common_tags,
{
Name = "app-server-${count.index + 1}"
Role = "application"
}
)
}
2. Right-Sizing Resources
Use appropriate instance sizes.
# Bad: Over-provisioned
resource "aws_instance" "app" {
instance_type = "m5.4xlarge" # 16 vCPU, 64 GB RAM
# App only uses 2 vCPU, 4 GB RAM
}
# Good: Right-sized with auto-scaling
resource "aws_autoscaling_group" "app" {
launch_template {
id = aws_launch_template.app.id
version = "$Latest"
}
min_size = 2
max_size = 10
desired_capacity = 4
target_tracking_configuration {
predefined_metric_specification {
predefined_metric_type = "ASGAverageCPUUtilization"
}
target_value = 70.0
}
}
3. Spot Instances for Non-Critical Workloads
resource "aws_launch_template" "batch_processing" {
instance_type = "c5.xlarge"
instance_market_options {
market_type = "spot"
spot_options {
max_price = "0.10" # 50-70% cost savings
}
}
}
4. Scheduled Scaling
Scale down during off-hours.
# Scale down non-production at night
resource "aws_autoscaling_schedule" "scale_down" {
count = var.environment != "production" ? 1 : 0
scheduled_action_name = "scale-down-night"
min_size = 0
max_size = 0
desired_capacity = 0
recurrence = "0 22 * * *" # 10 PM daily
autoscaling_group_name = aws_autoscaling_group.app.name
}
resource "aws_autoscaling_schedule" "scale_up" {
count = var.environment != "production" ? 1 : 0
scheduled_action_name = "scale-up-morning"
min_size = 2
max_size = 10
desired_capacity = 2
recurrence = "0 8 * * 1-5" # 8 AM weekdays
autoscaling_group_name = aws_autoscaling_group.app.name
}
Disaster Recovery
Backup Strategy
# RDS automated backups
resource "aws_db_instance" "db" {
backup_retention_period = 7
backup_window = "03:00-04:00" # UTC
maintenance_window = "sun:04:00-sun:05:00"
# Enable automated backups to S3
copy_tags_to_snapshot = true
# Multi-AZ for HA
multi_az = true
}
# S3 versioning for data protection
resource "aws_s3_bucket_versioning" "data" {
bucket = aws_s3_bucket.data.id
versioning_configuration {
status = "Enabled"
}
}
# S3 lifecycle policy for old versions
resource "aws_s3_bucket_lifecycle_configuration" "data" {
bucket = aws_s3_bucket.data.id
rule {
id = "archive-old-versions"
status = "Enabled"
noncurrent_version_transition {
noncurrent_days = 30
storage_class = "GLACIER"
}
noncurrent_version_expiration {
noncurrent_days = 90
}
}
}
# AWS Backup for automated backups
resource "aws_backup_plan" "main" {
name = "daily-backup-plan"
rule {
rule_name = "daily-backups"
target_vault_name = aws_backup_vault.main.name
schedule = "cron(0 2 * * ? *)" # 2 AM daily
lifecycle {
delete_after = 30
}
}
}
resource "aws_backup_selection" "main" {
plan_id = aws_backup_plan.main.id
name = "backup-resources"
iam_role_arn = aws_iam_role.backup.arn
selection_tag {
type = "STRINGEQUALS"
key = "Backup"
value = "true"
}
}
Multi-Region Deployment
# Provider configuration for multi-region
provider "aws" {
alias = "primary"
region = "us-west-2"
}
provider "aws" {
alias = "secondary"
region = "us-east-1"
}
# Primary region resources
module "primary_region" {
source = "./modules/regional-infrastructure"
providers = {
aws = aws.primary
}
region = "us-west-2"
environment = var.environment
}
# Secondary region resources
module "secondary_region" {
source = "./modules/regional-infrastructure"
providers = {
aws = aws.secondary
}
region = "us-east-1"
environment = var.environment
}
# Route53 health checks with failover
resource "aws_route53_health_check" "primary" {
fqdn = module.primary_region.alb_dns_name
port = 443
type = "HTTPS"
resource_path = "/health"
failure_threshold = 3
request_interval = 30
}
resource "aws_route53_record" "app" {
zone_id = aws_route53_zone.main.zone_id
name = "app.example.com"
type = "A"
set_identifier = "primary"
failover_routing_policy {
type = "PRIMARY"
}
health_check_id = aws_route53_health_check.primary.id
alias {
name = module.primary_region.alb_dns_name
zone_id = module.primary_region.alb_zone_id
evaluate_target_health = true
}
}
Infrastructure Testing
1. Validation Tests
# Terraform validation
terraform validate
terraform fmt -check
terraform plan -detailed-exitcode # Exit code 2 if changes needed
2. Policy as Code (Sentinel/OPA)
# Open Policy Agent (OPA) policy
package terraform.required_tags
required_tags = ["Environment", "Project", "ManagedBy"]
deny[msg] {
resource := input.resource_changes[_]
resource.type == "aws_instance"
tags := resource.change.after.tags
missing_tags := [tag | tag := required_tags[_]; not tags[tag]]
count(missing_tags) > 0
msg := sprintf("Instance %s is missing required tags: %v",
[resource.address, missing_tags])
}
3. Integration Tests (Terratest)
// terratest example
package test
import (
"testing"
"github.com/gruntwork-io/terratest/modules/terraform"
"github.com/stretchr/testify/assert"
)
func TestTerraformWebApp(t *testing.T) {
terraformOptions := &terraform.Options{
TerraformDir: "../examples/web-app",
Vars: map[string]interface{}{
"environment": "test",
},
}
defer terraform.Destroy(t, terraformOptions)
terraform.InitAndApply(t, terraformOptions)
// Validate outputs
albDnsName := terraform.Output(t, terraformOptions, "alb_dns_name")
assert.NotEmpty(t, albDnsName)
// Test ALB responds
url := fmt.Sprintf("http://%s/health", albDnsName)
http.Get(url) // Should return 200 OK
}
Terraform Workflow
Daily Workflow
# 1. Pull latest changes
git pull origin main
# 2. Initialize (if first time or modules changed)
terraform init
# 3. Select workspace (if using workspaces)
terraform workspace select production
# 4. Plan changes
terraform plan -out=tfplan
# 5. Review plan carefully
# Check: What's being created, modified, destroyed?
# 6. Apply if plan looks good
terraform apply tfplan
# 7. Commit state lock (if not using remote backend)
git add terraform.tfstate
git commit -m "chore: update terraform state"
git push
Making Changes
# 1. Create feature branch
git checkout -b infra/add-cache-layer
# 2. Make infrastructure changes
vim main.tf
# 3. Format code
terraform fmt -recursive
# 4. Validate
terraform validate
# 5. Plan
terraform plan
# 6. Commit changes
git add .
git commit -m "feat: add Redis cache layer"
git push origin infra/add-cache-layer
# 7. Create PR for review
gh pr create --title "Add Redis cache layer" --body "Adds ElastiCache Redis for session storage"
# 8. After approval, apply
terraform apply
# 9. Merge PR
gh pr merge
Quick Reference
Infrastructure Checklist
Before deploying infrastructure:
- Code in version control
- Remote state configured (S3/Cloud Storage)
- State locking enabled (DynamoDB/Cloud Storage)
- All secrets in secrets manager (not hardcoded)
- Resources tagged appropriately
- IAM policies follow least privilege
- Encryption enabled (at rest and in transit)
- Backups configured
- Multi-AZ/region if needed for HA
- Cost estimates reviewed
- Plan reviewed by team
- Runbook documented for changes
Common Terraform Commands
# Initialization
terraform init # Initialize working directory
terraform init -upgrade # Upgrade providers
# Planning
terraform plan # Show execution plan
terraform plan -out=plan.tfplan # Save plan to file
terraform show plan.tfplan # View saved plan
# Applying
terraform apply # Apply changes
terraform apply plan.tfplan # Apply saved plan
terraform apply -auto-approve # Skip confirmation (CI/CD)
# State
terraform state list # List resources in state
terraform state show aws_instance.web # Show resource details
terraform state mv # Move resource in state
terraform state rm # Remove resource from state
# Outputs
terraform output # Show all outputs
terraform output alb_dns_name # Show specific output
# Workspaces
terraform workspace list # List workspaces
terraform workspace new prod # Create workspace
terraform workspace select prod # Switch workspace
# Maintenance
terraform fmt -recursive # Format all files
terraform validate # Validate configuration
terraform refresh # Update state with real resources
terraform destroy # Destroy all resources
Common Pulumi Commands
# Project setup
pulumi new aws-typescript # Create new project
pulumi stack init production # Create new stack
# Deployment
pulumi preview # Preview changes
pulumi up # Deploy changes
pulumi up --yes # Deploy without confirmation
# State
pulumi stack output # Show all outputs
pulumi stack output apiUrl # Show specific output
pulumi stack export > state.json # Export state
pulumi stack import < state.json # Import state
# Stack management
pulumi stack ls # List stacks
pulumi stack select production # Switch stack
pulumi stack rm dev # Delete stack
# Configuration
pulumi config set aws:region us-west-2 # Set config
pulumi config set --secret dbPassword pass123 # Set secret
# Maintenance
pulumi refresh # Sync state with real resources
pulumi destroy # Destroy all resources
Integration with Other Skills
With Deployment
- Infrastructure provisioned before application deployment
- IaC creates compute, networking, storage resources
- Deployment tools (kubectl, docker) deploy applications to infrastructure
- Infrastructure changes deployed separately from application code
With CI/CD
- Infrastructure changes tested in CI pipeline
- Terraform/Pulumi plan runs on every PR
- Automated deployment to dev/staging environments
- Manual approval required for production infrastructure changes
- State drift detection in CI
With Monitoring
- Infrastructure metrics collected (CPU, memory, disk, network)
- Cloud provider metrics integrated (CloudWatch, Azure Monitor)
- Infrastructure alerts for resource exhaustion
- Cost monitoring and budget alerts
With Security
- IAM policies defined as code
- Security groups and network ACLs in IaC
- Secrets stored in secrets management services
- Compliance scanning for infrastructure code (Checkov, tfsec)
Remember: Infrastructure is code. Treat it with the same discipline as application code: version control, code review, automated testing, and continuous deployment. Never make manual changes in production—always update the code and redeploy. Infrastructure drift is a bug.