Terraform State Strategy for Multi-Cloud Teams

State strategy is the part of Terraform that teams get wrong early and pay for late. The mistakes compound — a bad state layout at week two is a painful refactor at month twelve when the team has grown and the infrastructure has gotten more complex.

I've managed Terraform state across OCI, AWS, and Azure simultaneously. What follows is the strategy I'd apply from day one if I were starting again, and the specific failure patterns I've seen in production that I'd want to avoid.

Why state matters more than people expect

Terraform state is not a cache. It's the authoritative record of what Terraform believes it has created. When state diverges from reality — through manual changes, failed applies, or state file corruption — you lose the ability to reason about your infrastructure.

In a single-cloud, single-team setup, you can paper over state problems. In multi-cloud, you can't. You have multiple backend configurations, multiple teams touching different parts of the infrastructure, and cross-cloud dependencies that don't fit neatly into one state file. The surface area for state problems is just larger.

The three most common failure modes I've seen:

Collision on shared state. Two engineers run terraform apply against the same state file concurrently. Without locking, one apply overwrites the other's state. Most backends support locking (S3 with DynamoDB, OCI Object Storage with a custom locking mechanism, Azure Blob Storage with lease-based locking) but I've seen teams not configure it because it seemed like extra setup. It isn't optional.

Blast radius from monolithic state. A single state file for "all infrastructure" means a mistake in the networking module can lock the entire state while you fix it. Everything in that file is blocked. Split state files exist precisely to limit blast radius.

Promotion debt. When dev, staging, and prod share a state file (or consume the same modules without version pins), a change that breaks something in dev takes prod state down with it. Environments should have strict boundaries.

The layout that scales

The layout I reach for starts with this directory structure:

infrastructure/
  modules/          # reusable, cloud-specific modules
    oci/
      network/
      kubernetes/
      iam/
    aws/
      vpc/
      eks/
      iam/
    azure/
      vnet/
      aks/
      rbac/
  environments/
    dev/
      oci/
      aws/
      azure/
    staging/
      oci/
      aws/
      azure/
    prod/
      oci/
      aws/
      azure/

Each environments/<env>/<cloud>/ directory is a separate Terraform root. Separate state file, separate backend configuration, separate apply scope.

# environments/prod/oci/backend.tf
terraform {
  backend "s3" {
    # OCI Object Storage with S3-compatible API
    bucket   = "tf-state-prod"
    key      = "oci/prod.tfstate"
    region   = "ap-mumbai-1"
    endpoint = "https://<namespace>.compat.objectstorage.ap-mumbai-1.oraclecloud.com"

    skip_credentials_validation = true
    skip_metadata_api_check     = true
    skip_region_validation      = true
    force_path_style            = true
  }
}

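This config works with Terraform 1.6+ and the OCI provider. The three skip_* flags and force_path_style are required: without them, Terraform tries to validate the endpoint against AWS conventions and fails. On 1.6 and later the S3 backend also flags endpoint and force_path_style as deprecated in favor of endpoints.s3 and use_path_style, so expect warnings until those are renamed.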

For AWS the backend is straightforward S3 + DynamoDB. For Azure it's Azure Blob Storage, where the azurerm backend locks by acquiring a lease on the state blob. The principle is the same across all three: remote state in a managed storage service, locking enabled, versioning enabled.
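
As a minimal sketch of the AWS root, assuming placeholder bucket, region, and table names (none of them come from a real account), the backend could look like the block below. The DynamoDB table is assumed to exist already with a string partition key named LockID, which is what the S3 backend expects for locking.

```hcl
# environments/prod/aws/backend.tf
# All names here are illustrative placeholders.
terraform {
  backend "s3" {
    bucket         = "tf-state-prod-aws" # assumed bucket; versioning is enabled on the bucket itself
    key            = "aws/prod.tfstate"
    region         = "ap-south-1"        # assumed region
    dynamodb_table = "tf-state-locks"    # assumed table with partition key "LockID" (string)
    encrypt        = true                # server-side encryption for the state object
  }
}
```

The Azure root, under the same kind of assumptions, needs no separate lock resource because the lease is taken on the state blob itself:

```hcl
# environments/prod/azure/backend.tf
# Resource group, storage account, and container names are placeholders.
terraform {
  backend "azurerm" {
    resource_group_name  = "rg-tfstate-prod"
    storage_account_name = "tfstateprodexample"
    container_name       = "tfstate"
    key                  = "azure/prod.tfstate"
  }
}
```

Each block lives in its own root, since a Terraform configuration accepts only one backend.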

Module versioning across clouds

The second decision is how to version modules. In a multi-cloud setup, you typically have cloud-specific modules (an OCI network module, an AWS VPC module) and shared modules (a tagging convention module, a monitoring stack module). Version pinning keeps these from surprising you.

I use a private Terraform registry backed by a simple artifact store, or module source pinned to a git tag when the registry overhead isn't justified:

module "oci_network" {
  source = "git::https://github.com/your-org/tf-modules.git//oci/network?ref=v1.4.2"

  vcn_cidr_block = "10.0.0.0/16"
  subnets        = var.subnets
}

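The ?ref=v1.4.2 pin is the critical part. Without it, Terraform tracks the repository's default branch, so any push to that module path reaches every environment using it the next time modules are fetched: a fresh terraform init in CI, or terraform init -upgrade locally. That's how "I just fixed a small thing in the module" turns into an unintended change in production.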

When updating a module version, the promotion path is dev first, staging second, prod third — with a plan review at each stage.
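
A rough illustration of the first promotion step, assuming a hypothetical candidate tag v1.5.0 and a loosely typed subnets variable: only the dev root moves to the new tag, while staging and prod keep their existing v1.4.2 pin until the dev plan has been reviewed and applied.

```hcl
# environments/dev/oci/main.tf
# v1.5.0 is a hypothetical candidate version used for illustration.

variable "subnets" {
  description = "Subnet definitions passed through to the network module."
  type        = any # assumed; the real module may expect a stricter type
}

module "oci_network" {
  source = "git::https://github.com/your-org/tf-modules.git//oci/network?ref=v1.5.0"

  vcn_cidr_block = "10.0.0.0/16"
  subnets        = var.subnets
}
```

Because every root carries its own pin, the version bump is a one-line diff per environment, and each diff gets its own plan review.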

State per bounded domain, not per team

The failure mode I've seen most in large teams is letting state file ownership follow team boundaries rather than infrastructure domain boundaries. The networking team owns the networking state, the platform team owns the Kubernetes state, the security team owns the IAM state. Seems logical.

The problem: cross-domain dependencies. The Kubernetes state needs a subnet ID from the networking state. The IAM state needs a cluster principal from the Kubernetes state. Terraform handles this with terraform_remote_state data sources:

data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "tf-state-prod"
    key    = "oci/network/prod.tfstate"
    region = "ap-mumbai-1"
    # ... same backend config
  }
}

resource "oci_containerengine_cluster" "main" {
  # ...
  vcn_id = data.terraform_remote_state.network.outputs.vcn_id
}

The terraform_remote_state approach works but creates coupling between state files. A breaking change to an output in the network state can block an apply in the Kubernetes state. That coupling should be explicit and managed — document the output contract, version outputs alongside modules, and treat breaking output changes like API breaking changes.
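
One way to make that contract visible, sketched here under the assumption that the network root defines its VCN as oci_core_vcn.main (the file path, variable name, and display name are illustrative), is to document each consumed output where it is declared:

```hcl
# environments/prod/oci/network/outputs.tf (hypothetical path)

variable "compartment_ocid" {
  description = "OCID of the compartment that holds the network resources."
  type        = string
}

resource "oci_core_vcn" "main" {
  compartment_id = var.compartment_ocid
  cidr_blocks    = ["10.0.0.0/16"]
  display_name   = "prod-vcn" # placeholder name
}

output "vcn_id" {
  # Part of the published contract: the Kubernetes root reads this via
  # terraform_remote_state, so renaming or removing it is a breaking change.
  description = "OCID of the production VCN. Consumed by the Kubernetes root."
  value       = oci_core_vcn.main.id
}
```

The description does not enforce anything, but it puts the dependency in front of whoever edits the output next.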

An alternative that avoids the coupling: store cross-domain values in a parameter store (OCI Vault, AWS SSM Parameter Store, Azure Key Vault) and read them directly rather than through remote state. This adds an operational step (the value has to be written somewhere) but removes the direct state dependency.
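
A minimal sketch of that pattern using AWS SSM Parameter Store, assuming a hypothetical parameter naming scheme and an aws_subnet.private resource that already exists in the producing root. The producer publishes the value:

```hcl
# Producer root (networking). The parameter path is an assumed convention.
resource "aws_ssm_parameter" "private_subnet_id" {
  name  = "/infra/prod/network/private_subnet_id"
  type  = "String"
  value = aws_subnet.private.id # assumes aws_subnet.private is defined in this root
}
```

The consumer reads it by name, with no reference to the producer's state file:

```hcl
# Consumer root (for example, the Kubernetes root).
data "aws_ssm_parameter" "private_subnet_id" {
  name = "/infra/prod/network/private_subnet_id"
}

locals {
  # The provider marks this value as sensitive; passing it to a
  # resource argument works as usual.
  private_subnet_id = data.aws_ssm_parameter.private_subnet_id.value
}
```

The parameter path now plays the role of the output contract, so it deserves the same care as a published output name.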

CI/CD for Terraform: the guardrails that matter

The four controls that actually catch problems before they reach production:

Plan review on every PR. terraform plan runs in CI on every pull request touching infrastructure code, and the plan output is posted as a PR comment. Engineers review the plan before merge, not after. In my current engagements this runs on Jenkins with a Terraform plugin. The same pattern applies on GitHub Actions with terraform-github-actions; the mechanics differ but the principle doesn't.

Strict apply permissions. The CI/CD service account that runs terraform apply has the minimum permissions needed — and those permissions differ by environment. The dev account can apply freely. The staging account requires a manual approval gate. The prod account requires two approvals and is time-gated (applies only permitted during business hours). Over-broad apply permissions in CI are an incident waiting to happen.

Drift detection. A scheduled pipeline runs terraform plan -detailed-exitcode against production state on a daily cadence and alerts when the exit code is 2, which means the plan contains changes (0 means no changes, 1 means the plan itself failed). Manual changes that bypassed Terraform show up as drift. The alert isn't "fix it now"; it's "someone changed something outside the process, let's understand why."

State backup and versioning. Backend versioning means every state file change is captured. In practice: OCI Object Storage versioning, S3 versioning, Azure Blob Storage versioning — all enabled, with a retention policy. Recovering from accidental state deletion without versioning is painful; with it, it's a restore operation.
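
For the S3 case, a sketch of what "enabled, with a retention policy" can look like. It assumes the state bucket is managed from a small separate bootstrap root, and the bucket name and 90-day retention window are placeholders rather than recommendations from a specific setup.

```hcl
# bootstrap/state-bucket.tf (hypothetical bootstrap root)

resource "aws_s3_bucket" "state" {
  bucket = "tf-state-prod-aws" # placeholder bucket name
}

# Keep every previous version of each state object.
resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Expire old, non-current versions so history does not grow without bound.
resource "aws_s3_bucket_lifecycle_configuration" "state" {
  bucket = aws_s3_bucket.state.id

  # Lifecycle rules for non-current versions only make sense once versioning is on.
  depends_on = [aws_s3_bucket_versioning.state]

  rule {
    id     = "expire-old-state-versions"
    status = "Enabled"

    filter {} # apply to every object in the bucket

    noncurrent_version_expiration {
      noncurrent_days = 90 # assumed retention window
    }
  }
}
```

OCI Object Storage and Azure Blob Storage have equivalent versioning settings; only the resource types differ.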

The workspace question

Terraform workspaces get recommended as a solution for environment separation. I don't use them for that.

Workspaces share a backend configuration and a module tree. The dev and prod workspaces in a single root share the same main.tf. That makes it easy to accidentally apply prod-intended changes to dev, or vice versa — a single terraform workspace select wrong_env && terraform apply command away from a mistake. Separate roots with separate state files are less convenient but much harder to accidentally cross.

Workspaces are useful for short-lived feature environments spun up and torn down as part of a preview deployment workflow. For long-lived dev/staging/prod separation, the separate-root pattern is more explicit and safer.
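
For the preview-environment case, a small sketch of how a root might use the workspace name, assuming a hypothetical "preview-" naming prefix. The guard resource is one possible way to stop this root from being applied in a workspace named after a long-lived environment; it relies on terraform_data, which needs Terraform 1.4 or later.

```hcl
locals {
  # terraform.workspace is "default" unless another workspace is selected.
  env_name = "preview-${terraform.workspace}"
}

# Fails the plan if someone selects a workspace that looks like a
# long-lived environment, or forgets to select one at all.
resource "terraform_data" "workspace_guard" {
  lifecycle {
    precondition {
      condition     = !contains(["default", "dev", "staging", "prod"], terraform.workspace)
      error_message = "This root is for short-lived preview workspaces only."
    }
  }
}

output "env_name" {
  description = "Name prefix for resources in this preview environment."
  value       = local.env_name
}
```

A typical flow would be terraform workspace new pr-123, apply, and later terraform destroy followed by terraform workspace delete pr-123, where pr-123 is an illustrative name.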

What to do when state is already a mess

If you're inheriting a Terraform codebase with monolithic state and no environment separation, the path forward is incremental:

1. Enable locking and versioning on the existing backend immediately. This is non-breaking and stops the bleeding.
2. Use terraform state mv to migrate resources into smaller state files, one domain at a time. Pull a backup of the state first, and confirm afterward that terraform plan shows no changes in both the old and the new root.
3. Add module version pins to whatever is currently unpinned. Commit to a specific version before the next change.
4. Introduce the plan-in-CI pattern even if the existing code is imperfect. A bad plan review is still better than no plan review.
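
For the state-splitting step, recent Terraform releases offer a declarative alternative to terraform state mv: a removed block in the old root (Terraform 1.7 or later) and an import block in the new one (1.5 or later). The sketch below assumes the resource being moved is oci_core_vcn.main and uses a placeholder OCID; both are illustrative. In the old monolithic root, delete the resource block and add:

```hcl
# Old root: stop tracking the VCN without destroying it.
removed {
  from = oci_core_vcn.main

  lifecycle {
    destroy = false # forget the resource in this state, leave the real VCN in place
  }
}
```

In the new, smaller root, keep the resource block and adopt the existing VCN:

```hcl
# New root: adopt the existing VCN into this state.
# A matching resource "oci_core_vcn" "main" block must also exist in this root.
import {
  to = oci_core_vcn.main
  id = "ocid1.vcn.oc1.ap-mumbai-1.<unique-id>" # placeholder OCID
}
```

Both changes go through a normal plan and apply, so the move shows up in plan review like any other change.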

The temptation is to stop everything and do a clean rewrite. That almost never works — the "clean rewrite" takes months, the existing infrastructure still needs to change during that time, and you end up maintaining two parallel systems. Incremental migration is slower but survivable.

The boring answer to Terraform state strategy is: use separate roots per environment per domain, lock everything, version everything, and make plan review mandatory in CI. None of this is novel. The teams I've seen get into trouble are almost always the ones who skipped one of these four things because it felt like overhead before the team was big enough to need it.

The team is always big enough to need it sooner than you expect.