DevOps

Terraform Best Practices for Multi-Cloud Infrastructure

December 5, 2024
8 min read
Amar Sohail
Terraform, AWS, Azure, GCP, Cloud Native, CI/CD Pipelines, Automation, FinOps for AI

TL;DR

We run workloads on AWS, Azure, and GCP, and Terraform is our single control plane across all three. This post covers the patterns that make that sustainable: a three-layer module architecture (primitives, composites, environment configurations), one state file per deployment unit stored in the same cloud as its resources, CI/CD guardrails that scope plans and forbid manual applies, variable validation that blocks unapproved instance types before they hit the bill, and hub-and-spoke cross-cloud networking. Eight platform engineers, 1,400+ resources, three clouds.

The Multi-Cloud Reality

I will be honest: nobody sets out to be multi-cloud on purpose. For us, it happened organically. Our primary workloads ran on AWS, but an acquisition brought in a team deeply invested in Azure, and our ML/AI pipeline team had standardized on GCP's Vertex AI. Within six months, we were managing infrastructure across three clouds with a patchwork of ClickOps, cloud-specific CLIs, and a handful of CloudFormation templates that nobody wanted to touch.

When I took ownership of our infrastructure platform, the first decision was straightforward: Terraform would be our single control plane. Not Pulumi, not Crossplane, not cloud-native IaC tools. Terraform, because it had the broadest provider ecosystem and the largest pool of engineers who already knew it.

What follows is not a Terraform tutorial. It is the set of patterns and guardrails that let a team of 8 platform engineers manage 1,400+ resources across three clouds without losing their minds.

Module Design: The Layered Approach

Early on, we made the classic mistake of writing monolithic Terraform configurations -- one massive main.tf per environment with hundreds of resources. Changing a security group rule required a terraform plan that took 9 minutes and touched resources it had no business evaluating.

We restructured around a three-layer module architecture:

Layer 1: Primitive Modules

These wrap individual cloud resources with our organizational defaults baked in. They are thin, opinionated, and cloud-specific.

# modules/aws/vpc/main.tf
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = var.vpc_name
  cidr = var.cidr_block

  azs             = var.availability_zones
  private_subnets = var.private_subnet_cidrs
  public_subnets  = var.public_subnet_cidrs

  enable_nat_gateway     = true
  single_nat_gateway     = var.environment == "dev"
  enable_dns_hostnames   = true
  enable_dns_support     = true

  tags = merge(var.tags, {
    ManagedBy   = "terraform"
    Environment = var.environment
    CostCenter  = var.cost_center  # FinOps tag — mandatory
  })
}

Notice the CostCenter tag. We enforce this through a Terraform validation rule -- no resource can be created without a cost allocation tag. This was a FinOps decision that paid for itself within two months when we discovered a dev environment running GPU instances that were costing us $14,000/month.
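The enforcement itself is a plain variable validation; a minimal sketch of the guardrail (the exact condition and file layout are illustrative, not our literal code):

# modules/aws/vpc/variables.tf (illustrative sketch)
variable "cost_center" {
  type        = string
  description = "Cost allocation tag, mandatory on every resource."

  validation {
    condition     = length(trimspace(var.cost_center)) > 0
    error_message = "cost_center must be set -- no resource ships without a cost allocation tag."
  }
}

Because every primitive module requires this variable and merges it into tags, there is no path to an untagged resource short of bypassing the modules entirely.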

Layer 2: Composite Modules

These compose primitive modules into logical platform components. A "backend service" module, for example, provisions a VPC, an EKS node group, an RDS instance, and the IAM roles to glue them together.

# modules/platform/backend-service/main.tf
module "networking" {
  source      = "../../aws/vpc"
  vpc_name    = "${var.service_name}-${var.environment}"
  cidr_block  = var.cidr_block
  environment = var.environment
  cost_center = var.cost_center
  tags        = local.common_tags
}

module "database" {
  source                 = "../../aws/rds-postgres"
  instance_class         = var.db_instance_class
  allocated_storage      = var.db_storage_gb
  subnet_ids             = module.networking.private_subnet_ids
  vpc_security_group_ids = [module.networking.database_sg_id]
  environment            = var.environment
  cost_center            = var.cost_center
}

module "kubernetes" {
  source          = "../../aws/eks-nodegroup"
  cluster_name    = var.eks_cluster_name
  node_group_name = var.service_name
  subnet_ids      = module.networking.private_subnet_ids
  instance_types  = var.node_instance_types
  desired_size    = var.environment == "prod" ? 3 : 1
  cost_center     = var.cost_center
}

Layer 3: Environment Configurations

These are the actual deployments -- thin wrappers that call composite modules with environment-specific variables. Almost no resource definitions live here, just module calls and variable assignments.

# environments/production/us-east-1/backend/main.tf
module "order_service" {
  source = "../../../../modules/platform/backend-service"

  service_name        = "order-service"
  environment         = "prod"
  cost_center         = "platform-eng"
  cidr_block          = "10.1.0.0/16"
  eks_cluster_name    = "prod-us-east-1"
  db_instance_class   = "db.r6g.xlarge"
  db_storage_gb       = 500
  node_instance_types = ["m6i.xlarge"]
}

This layering means a developer deploying a new service does not need to understand VPC CIDR planning or IAM policy syntax. They fill in a module call, open a PR, and the platform team reviews the plan output.

State Management: The One That Bites Everyone

Terraform state is where most multi-cloud setups fall apart. We went through three state management strategies before finding one that works at scale.

What failed: A single S3 backend with path-based workspaces. State file locking conflicts were constant, plan times ballooned as the state grew, and a corrupted state file for one service could block deployments for everything.

What works: One state file per deployment unit, stored in the same cloud as the resources it manages.

# environments/production/us-east-1/backend/backend.tf
terraform {
  backend "s3" {
    bucket         = "company-terraform-state-prod"
    key            = "us-east-1/backend/order-service/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}

# environments/production/azure-eastus/ml-pipeline/backend.tf
terraform {
  backend "azurerm" {
    resource_group_name  = "rg-terraform-state"
    storage_account_name = "companyterraformstate"
    container_name       = "tfstate"
    key                  = "azure-eastus/ml-pipeline/terraform.tfstate"
  }
}

Each cloud's resources have their state stored in that cloud's native storage. AWS resources use S3, Azure resources use Azure Blob Storage, GCP resources use GCS. This eliminates cross-cloud dependencies in the state layer and means a cloud provider outage only affects deployments to that cloud.
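The GCP side follows the same pattern; the equivalent GCS backend looks along these lines (bucket and prefix are illustrative):

# environments/production/gcp-us-central1/ml-training/backend.tf (illustrative)
terraform {
  backend "gcs" {
    bucket = "company-terraform-state-prod"
    prefix = "gcp-us-central1/ml-training" # state object lives under this prefix
  }
}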

We currently manage around 60 independent state files. That sounds like a lot, but each one is small, fast to plan, and independently lockable. Our average terraform plan dropped from 9 minutes to under 40 seconds.

CI/CD Pipeline: The Guardrails

Our CI/CD pipeline for Terraform runs in GitHub Actions, and we treat terraform apply like a production deployment -- because it is one.

# .github/workflows/terraform.yml
name: Terraform
on:
  pull_request:
    paths: ["environments/**", "modules/**"]
  push:
    branches: [main]
    paths: ["environments/**", "modules/**"]

jobs:
  detect-changes:
    runs-on: ubuntu-latest
    outputs:
      changed_dirs: ${{ steps.changes.outputs.dirs }}
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0 # full history so origin/main is available for the diff below
      - id: changes
        run: |
          dirs=$(git diff --name-only origin/main... | \
            grep '^environments/' | \
            xargs -I {} dirname {} | \
            sort -u | jq -R -s -c 'split("\n") | map(select(. != ""))')
          echo "dirs=$dirs" >> $GITHUB_OUTPUT

  plan:
    needs: detect-changes
    if: needs.detect-changes.outputs.changed_dirs != '[]' # skip when no environment dirs changed
    runs-on: ubuntu-latest
    strategy:
      matrix:
        dir: ${{ fromJson(needs.detect-changes.outputs.changed_dirs) }}
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init
        working-directory: ${{ matrix.dir }}
      - run: terraform plan -out=tfplan -no-color
        working-directory: ${{ matrix.dir }}
      - uses: actions/upload-artifact@v4
        with:
          name: plan-${{ hashFiles(format('{0}/**', matrix.dir)) }} # glob so the hash is non-empty and unique per dir
          path: ${{ matrix.dir }}/tfplan

Key guardrails in this pipeline:

  1. Change detection scopes plans. We only run terraform plan on directories that actually changed. This keeps PR feedback loops under 2 minutes.
  2. Plan artifacts are saved and reused for apply. The apply job downloads the exact plan artifact that was reviewed, so there is no drift between what was approved and what gets applied.
  3. No manual applies. Engineers cannot run terraform apply from their laptops. This is enforced by IAM policies that restrict write permissions to the CI service account.
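The apply side of guardrail 2 can be sketched as follows. This is illustrative and simplified to a single deployment unit; the job name, environment gate, and artifact name are assumptions, and the real job mirrors the plan job's matrix:

# Sketch of the apply job (illustrative, single deployment unit)
  apply:
    if: github.ref == 'refs/heads/main'
    needs: plan
    runs-on: ubuntu-latest
    environment: production # assumes a required-reviewers gate in repo settings
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init
        working-directory: environments/production/us-east-1/backend
      - uses: actions/download-artifact@v4
        with:
          name: plan-backend # must match the name the plan job uploaded
          path: environments/production/us-east-1/backend
      # Apply exactly the plan that was reviewed -- never a fresh plan
      - run: terraform apply -input=false tfplan
        working-directory: environments/production/us-east-1/backend

Applying a saved plan file needs no -auto-approve; Terraform only executes the actions recorded in the artifact.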

FinOps: Making Costs Visible Before They Hit the Bill

This is where most infrastructure teams drop the ball. Writing Terraform to provision resources is easy. Understanding what those resources will cost before you provision them is where the real value lies.

We integrated Infracost into our CI pipeline so every PR that modifies infrastructure shows a cost estimate in the PR comment:
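The wiring is a few extra steps in the plan job; a sketch (the flags follow Infracost's documented CLI, but treat the exact steps as illustrative):

      - uses: infracost/actions/setup@v3
        with:
          api-key: ${{ secrets.INFRACOST_API_KEY }}
      - run: infracost breakdown --path=${{ matrix.dir }} --format=json --out-file=/tmp/infracost.json
      # Posts (or updates) a single cost-estimate comment on the PR
      - run: |
          infracost comment github --path=/tmp/infracost.json \
            --repo=$GITHUB_REPOSITORY \
            --pull-request=${{ github.event.pull_request.number }} \
            --github-token=${{ secrets.GITHUB_TOKEN }} \
            --behavior=update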

# We also use Terraform variables with validation to prevent cost blowups
variable "db_instance_class" {
  type = string
  validation {
    condition = contains([
      "db.t3.medium", "db.t3.large",
      "db.r6g.large", "db.r6g.xlarge", "db.r6g.2xlarge"
    ], var.db_instance_class)
    error_message = "Instance class not in approved list. File a request in #platform-eng for exceptions."
  }
}

variable "node_instance_types" {
  type = list(string)
  validation {
    condition = alltrue([
      for t in var.node_instance_types :
      !can(regex("^(p[0-9]|g[0-9]|x[0-9])", t))
    ])
    error_message = "GPU and high-memory instances require FinOps approval. See runbook/finops-exceptions."
  }
}

That second validation rule prevents anyone from accidentally provisioning GPU instances (the p, g, and x families) without going through our FinOps approval process. This single validation saved us over $40,000 in the first quarter after we added it, mostly from dev/staging environments where engineers were testing ML workloads on p3.2xlarge instances and forgetting to tear them down.

Cross-Cloud Networking: The Hard Part

The least-discussed challenge in multi-cloud Terraform is networking. Our AWS workloads need to talk to Azure-hosted ML services, and our GCP-hosted AI training pipelines need to pull data from AWS S3.

We standardized on a hub-and-spoke model where each cloud has a transit gateway or equivalent, and cross-cloud connectivity goes through dedicated VPN tunnels managed by Terraform:

# modules/cross-cloud/aws-azure-vpn/main.tf
resource "aws_vpn_gateway" "main" {
  vpc_id = var.aws_vpc_id
  tags   = { Name = "vpn-to-azure-${var.environment}" }
}

resource "azurerm_virtual_network_gateway" "main" {
  name                = "vpn-to-aws-${var.environment}"
  location            = var.azure_location
  resource_group_name = var.azure_resource_group

  type     = "Vpn"
  vpn_type = "RouteBased"
  sku      = "VpnGw2"

  ip_configuration {
    public_ip_address_id = azurerm_public_ip.vpn.id # public IP resource defined elsewhere in the module
    subnet_id            = var.azure_gateway_subnet_id
  }
}

The trick is managing the shared secrets and IP allocations across providers. We store VPN pre-shared keys in HashiCorp Vault and reference them through the Vault Terraform provider rather than hardcoding them in variable definitions. (Values read from Vault still land in state, which is one more reason we encrypt state at rest.)
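Concretely, the Vault reference looks something like this (the Vault path, key names, and the customer gateway resource are illustrative):

# modules/cross-cloud/aws-azure-vpn/secrets.tf (illustrative)
data "vault_generic_secret" "vpn_psk" {
  path = "secret/network/aws-azure-vpn/${var.environment}" # illustrative Vault path
}

resource "aws_vpn_connection" "to_azure" {
  vpn_gateway_id      = aws_vpn_gateway.main.id
  customer_gateway_id = aws_customer_gateway.azure.id # defined elsewhere in the module
  type                = "ipsec.1"

  # Pre-shared key comes from Vault, never from a variable or a committed file
  tunnel1_preshared_key = data.vault_generic_secret.vpn_psk.data["tunnel1_psk"]
}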

Lessons After 18 Months

Three things I wish I had known at the start:

Module versioning is essential, not optional. We use a private Terraform module registry backed by our Git monorepo. Every module has a semantic version, and environment configurations pin to specific versions. Unpinned modules are a ticking time bomb.
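In practice a pinned module call in an environment configuration looks like this (the registry address and version are illustrative):

module "order_service" {
  source  = "registry.company.internal/platform/backend-service/aws" # illustrative private registry address
  version = "~> 2.3.0" # pinned; bumped deliberately via PR, never floating
  # ... variables as in the environment example above
}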

terraform import is your best friend during migration. When we brought the Azure infrastructure under Terraform management, we imported existing resources rather than recreating them. It took two weeks of tedious import commands, but zero downtime.
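A typical import from that migration looked roughly like this (the module address and Azure resource ID are illustrative):

# Adopt an existing Azure VNet into state without recreating it (illustrative IDs)
terraform import 'module.networking.azurerm_virtual_network.main' \
  '/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-ml-prod/providers/Microsoft.Network/virtualNetworks/vnet-ml-prod'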

Invest in developer experience. We built a CLI wrapper called infra that handles backend initialization, workspace selection, and plan formatting. Engineers run infra plan order-service prod instead of navigating directory trees and running raw Terraform commands. Adoption tripled after we shipped it.

Multi-cloud Terraform is a marathon, not a sprint. The patterns I have described here took us 18 months to refine, and we are still iterating. If you are running workloads on Kubernetes across these clouds, I wrote about the operational patterns we use in our Kubernetes scaling post. And if your event-driven systems are part of what you are provisioning, our approach to Kafka infrastructure management covers how the application and infrastructure layers connect.
