Why I Wrote a Go CLI Instead of Bash for Kubernetes Provisioning (And What Idempotency Actually Required)

The platform team was managing Kubernetes clusters across three environments on a mix of OCI Kubernetes Engine and AWS EKS. Provisioning a new cluster was a multi-hour manual process: Terraform runs, post-provisioning kubectl configuration steps, add-on installation in a specific sequence, environment-specific variable injection. There was a runbook. It was perpetually out of date.

One incident settled whether to fix the runbook or replace the process. A staging cluster had drifted from the runbook over several weeks of incremental manual changes. When a production incident required spinning up a temporary cluster to reproduce the issue, the team discovered the runbook no longer matched reality. The reproduction took several hours longer than it should have because every step needed manual verification. By then the provisioning bash script had grown to several hundred lines, with conditional branches for cloud, environment, and optional component sets. The right response was not a better bash script.

Why Go and not a better shell script

The argument for staying in bash: the team already knows bash, the existing script is mostly working, and converting to Go is weeks of work for something that works today.

The argument for Go: bash conditional logic does not compose cleanly once you are past a few hundred lines. You end up with nested if/else, case statements with fall-through, and functions that return exit codes that the caller has to check correctly every time. The type system is whatever discipline you impose on yourself. Error handling is either checking $? after every command or using set -e and losing visibility into which command failed. Debugging a failed run means reading through shell execution trace output.

The provisioning workflow had conditional logic that was only going to grow: different add-on sets per cloud, environment-specific resource sizes, optional monitoring components, cluster networking choices that depended on the target environment's existing VPC configuration. That complexity is manageable in Go because the type system and error handling make each conditional path explicit and testable. It is not manageable in bash at that scale.

I also wanted something that could be tested. Bash scripts are difficult to unit test in any practical sense. The Go CLI has a test suite that covers the conditional provisioning paths and the idempotency checks.
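To make "explicit and testable" concrete, here is a minimal sketch of one conditional path written as a pure function. It is not the CLI's actual code: ClusterSpec, addOnsFor, and the add-on names "oci-only-addon", "aws-only-addon", and "monitoring-stack" are hypothetical placeholders chosen for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// Cloud is a hypothetical type for this sketch.
type Cloud string

const (
	OCI Cloud = "oci"
	AWS Cloud = "aws"
)

// ClusterSpec is a hypothetical input describing what to provision.
type ClusterSpec struct {
	Cloud      Cloud
	Monitoring bool
}

// addOnsFor returns the sorted add-on names for a spec, or an error
// for a cloud this sketch does not know about.
func addOnsFor(spec ClusterSpec) ([]string, error) {
	addons := []string{"cert-manager", "cluster-autoscaler", "ingress-controller", "metrics-server"}
	switch spec.Cloud {
	case OCI:
		addons = append(addons, "oci-only-addon") // placeholder name
	case AWS:
		addons = append(addons, "aws-only-addon") // placeholder name
	default:
		return nil, fmt.Errorf("unsupported cloud %q", spec.Cloud)
	}
	if spec.Monitoring {
		addons = append(addons, "monitoring-stack") // placeholder name
	}
	sort.Strings(addons)
	return addons, nil
}

func main() {
	for _, spec := range []ClusterSpec{
		{Cloud: OCI},
		{Cloud: AWS, Monitoring: true},
		{Cloud: "gcp"},
	} {
		addons, err := addOnsFor(spec)
		fmt.Println(spec.Cloud, addons, err)
	}
}
```

Because the function takes a value and returns a value, each branch can be covered by a table-driven test without touching a cluster, which is the property that is hard to get from a bash script.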

What the CLI wraps

The CLI is three things composed together:

Terraform for infrastructure. Cluster creation, node pool configuration, VPC/VCN setup, IAM roles. The CLI drives Terraform through the hashicorp/terraform-exec Go library rather than hand-rolled os/exec calls. terraform-exec still runs the terraform binary under the hood, but it returns Go errors and parsed output, so the CLI does not have to scrape stdout.

Cloud SDKs for post-provisioning. Some cluster configuration steps cannot be done cleanly through Terraform because they depend on state that does not exist until the cluster is running. Node pool tagging on OCI, OIDC provider registration on AWS, kubeconfig generation. These use the OCI Go SDK and the AWS SDK for Go (v2).

Kubernetes client-go for add-on installation. Cert-manager, the cluster autoscaler, the metrics server, the ingress controller -- these install via the Kubernetes API rather than Terraform because their lifecycle should be managed separately from the cluster infrastructure. The k8s.io/client-go library handles the API calls.

The CLI binary is the orchestrator. It knows the order: Terraform first, cloud SDK configuration second, Kubernetes add-ons third. Each phase waits for the previous phase to succeed before starting.

What idempotency actually required

This was the hardest design constraint to get right. The goal: running the provision command on an existing cluster should either complete without changes (if the cluster is already in the target state) or apply only the delta. It should never destroy and recreate a resource that already exists in the target configuration.

Terraform handles idempotency for the infrastructure layer -- terraform apply is idempotent by design. The Kubernetes add-on installation layer required explicit handling.

The naive approach: call kubectl apply -f manifest.yaml for each add-on. This is mostly idempotent because apply diffs the manifest against the live object (a client-side three-way merge by default, server-side apply only when --server-side is passed) and does not recreate resources that already match the desired state. The problem cases: CRDs that need to be installed before the deployment that uses them, add-ons with Helm charts that have hooks, version upgrades where the existing resource needs annotation changes before the upgrade will succeed.

The approach I ended up with: every add-on installation is preceded by an existence and version check. If the add-on is not installed, install it. If the add-on is installed at the target version, skip it. If the add-on is installed at a different version, apply the upgrade path for that specific version delta.

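import (
    "context"
    "fmt"

    apierrors "k8s.io/apimachinery/pkg/api/errors"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// AddOnState describes what is installed relative to the target version.
type AddOnState int

const (
    AddOnNotInstalled AddOnState = iota
    AddOnInstalledCurrentVersion
    AddOnInstalledOutdatedVersion
    AddOnInstalledIncompatibleVersion
)

// checkAddOnState returns the add-on's state plus the installed version read
// from the Deployment's app.kubernetes.io/version label (empty if absent).
func checkAddOnState(ctx context.Context, client *kubernetes.Clientset, addon AddOn) (AddOnState, string, error) {
    deployment, err := client.AppsV1().Deployments(addon.Namespace).Get(
        ctx, addon.DeploymentName, metav1.GetOptions{},
    )
    if apierrors.IsNotFound(err) {
        return AddOnNotInstalled, "", nil
    }
    if err != nil {
        return 0, "", fmt.Errorf("checking addon %s: %w", addon.Name, err)
    }

    currentVersion, ok := deployment.Labels["app.kubernetes.io/version"]
    if !ok {
        // No version label: treat as outdated so the upgrade path sets it.
        return AddOnInstalledOutdatedVersion, "", nil
    }
    if currentVersion == addon.TargetVersion {
        return AddOnInstalledCurrentVersion, currentVersion, nil
    }

    compatible, err := isVersionCompatible(currentVersion, addon.TargetVersion)
    if err != nil {
        return 0, currentVersion, fmt.Errorf("comparing versions for addon %s: %w", addon.Name, err)
    }
    if compatible {
        return AddOnInstalledOutdatedVersion, currentVersion, nil
    }
    return AddOnInstalledIncompatibleVersion, currentVersion, nil
}

// provisionAddOn installs, skips, or upgrades the add-on based on its state.
func provisionAddOn(ctx context.Context, client *kubernetes.Clientset, addon AddOn) error {
    state, installedVersion, err := checkAddOnState(ctx, client, addon)
    if err != nil {
        return err
    }
    switch state {
    case AddOnNotInstalled:
        return installAddOn(ctx, client, addon)
    case AddOnInstalledCurrentVersion:
        return nil // already at target state, no-op
    case AddOnInstalledOutdatedVersion:
        return upgradeAddOn(ctx, client, addon)
    case AddOnInstalledIncompatibleVersion:
        // Report the version observed on the cluster, not a field on the
        // by-value addon argument that the state check never populates.
        return fmt.Errorf(
            "addon %s at version %s is incompatible with target %s: manual intervention required",
            addon.Name, installedVersion, addon.TargetVersion,
        )
    default:
        return fmt.Errorf("addon %s: unhandled state %d", addon.Name, state)
    }
}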

The AddOnInstalledIncompatibleVersion path does not attempt an automated upgrade. Incompatible version upgrades (major version changes, breaking API changes) require human review because the failure modes are cluster-disrupting. The CLI surfaces the incompatibility and exits cleanly rather than attempting something that might partially succeed.
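The body of isVersionCompatible is not shown above. As a sketch of one plausible rule, and only under the assumption that add-on versions look like semantic versions and that "incompatible" means "different major version", it could look like this. The majorOf helper is a hypothetical name, and this version ignores downgrades and breaking changes within a major version.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// majorOf extracts the major component from a version such as
// "v1.14.2" or "1.14.2". It is a hypothetical helper for this sketch.
func majorOf(version string) (int, error) {
	v := strings.TrimPrefix(strings.TrimSpace(version), "v")
	head, _, _ := strings.Cut(v, ".")
	major, err := strconv.Atoi(head)
	if err != nil {
		return 0, fmt.Errorf("parsing major version of %q: %w", version, err)
	}
	if major < 0 {
		return 0, fmt.Errorf("negative major version in %q", version)
	}
	return major, nil
}

// isVersionCompatible treats two versions as compatible when they share
// a major version. This rule is an assumption, not the CLI's actual logic.
func isVersionCompatible(current, target string) (bool, error) {
	cur, err := majorOf(current)
	if err != nil {
		return false, err
	}
	tgt, err := majorOf(target)
	if err != nil {
		return false, err
	}
	return cur == tgt, nil
}

func main() {
	fmt.Println(isVersionCompatible("v1.13.0", "v1.14.2")) // true <nil>
	fmt.Println(isVersionCompatible("1.9.0", "2.0.0"))     // false <nil>
	_, err := isVersionCompatible("latest", "v1.0.0")
	fmt.Println(err != nil) // true
}
```

Returning an error for unparseable versions, rather than guessing, keeps an unlabeled or oddly tagged deployment from being silently upgraded.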

This state-check-before-act pattern is what makes the CLI safe to re-run after a partial failure. If Terraform succeeded but cert-manager installation failed, re-running the CLI runs the Terraform phase again, which finds no changes to apply, and then retries the cert-manager installation that failed. Add-ons that already reached their target version are skipped.
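The re-run behavior can be shown with an in-memory stand-in for the cluster. Everything here is hypothetical scaffolding for illustration (fakeCluster, addOn, provisionAll); it mirrors the check-then-act shape, not the real Kubernetes calls.

```go
package main

import "fmt"

// fakeCluster is an in-memory stand-in for a cluster in this sketch.
type fakeCluster struct {
	installed map[string]string // add-on name -> installed version
	failOnce  map[string]bool   // add-ons whose next install fails
	installs  int               // successful install calls
}

func (c *fakeCluster) install(name, version string) error {
	if c.failOnce[name] {
		delete(c.failOnce, name)
		return fmt.Errorf("installing %s: simulated failure", name)
	}
	c.installed[name] = version
	c.installs++
	return nil
}

type addOn struct{ name, version string }

// provisionAll checks state before acting: add-ons already at the
// target version are skipped, so a re-run only does the remaining work.
func provisionAll(c *fakeCluster, addons []addOn) error {
	for _, a := range addons {
		if c.installed[a.name] == a.version {
			continue
		}
		if err := c.install(a.name, a.version); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	c := &fakeCluster{
		installed: map[string]string{},
		failOnce:  map[string]bool{"metrics-server": true},
	}
	addons := []addOn{
		{"cert-manager", "v1"}, {"metrics-server", "v2"}, {"ingress-controller", "v3"},
	}
	fmt.Println(provisionAll(c, addons), c.installs) // installing metrics-server: simulated failure 1
	fmt.Println(provisionAll(c, addons), c.installs) // <nil> 3
	fmt.Println(provisionAll(c, addons), c.installs) // <nil> 3
}
```

The second run installs only the two add-ons that were missing, and the third run does nothing, which is the property the real CLI relies on.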

The OCI and AWS divergence

The two clouds require different handling in several places that cannot be abstracted cleanly.

Node pool configuration. OCI Kubernetes Engine node pools use shapes (VM.Standard.E4.Flex, BM.Standard.E5.192, etc.); flexible shapes take explicit OCPU and memory values. AWS EKS node groups use instance types (m6i.xlarge, c6i.2xlarge, etc.), each with a fixed vCPU count and memory size. The CLI configuration model uses a normalized spec (cpu, memory, count) that maps to the appropriate cloud-specific resource. The mapping is in a configuration file rather than hardcoded, which allows adding new shape/instance type mappings without recompiling.
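A minimal sketch of what a file-driven mapping could look like. The JSON layout, field names, and the NodeSpec, sizeMapping, loadMappings, and resolve identifiers are all assumptions made for illustration, and the two entries are example values rather than the CLI's real mapping table.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// NodeSpec is the normalized (cpu, memory, count) request.
type NodeSpec struct {
	CPU      int
	MemoryGB int
	Count    int
}

// sizeMapping is one hypothetical row of the mapping file.
type sizeMapping struct {
	CPU             int    `json:"cpu"`
	MemoryGB        int    `json:"memory_gb"`
	OCIShape        string `json:"oci_shape"`
	AWSInstanceType string `json:"aws_instance_type"`
}

// exampleConfig stands in for the contents of the mapping file.
const exampleConfig = `[
  {"cpu": 4, "memory_gb": 16, "oci_shape": "VM.Standard.E4.Flex", "aws_instance_type": "m6i.xlarge"},
  {"cpu": 8, "memory_gb": 16, "oci_shape": "VM.Standard.E4.Flex", "aws_instance_type": "c6i.2xlarge"}
]`

func loadMappings(raw []byte) ([]sizeMapping, error) {
	var mappings []sizeMapping
	if err := json.Unmarshal(raw, &mappings); err != nil {
		return nil, fmt.Errorf("parsing size mappings: %w", err)
	}
	return mappings, nil
}

// resolve returns the cloud-specific shape or instance type for a spec.
func resolve(mappings []sizeMapping, cloud string, spec NodeSpec) (string, error) {
	for _, m := range mappings {
		if m.CPU != spec.CPU || m.MemoryGB != spec.MemoryGB {
			continue
		}
		switch cloud {
		case "oci":
			return m.OCIShape, nil
		case "aws":
			return m.AWSInstanceType, nil
		default:
			return "", fmt.Errorf("unsupported cloud %q", cloud)
		}
	}
	return "", fmt.Errorf("no mapping for %d cpu / %d GB", spec.CPU, spec.MemoryGB)
}

func main() {
	mappings, err := loadMappings([]byte(exampleConfig))
	if err != nil {
		panic(err)
	}
	spec := NodeSpec{CPU: 8, MemoryGB: 16, Count: 3}
	fmt.Println(resolve(mappings, "aws", spec)) // c6i.2xlarge <nil>
	fmt.Println(resolve(mappings, "oci", spec)) // VM.Standard.E4.Flex <nil>
}
```

On OCI a flexible shape would still need its OCPU and memory values set from the spec; the sketch only resolves the shape name.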

Kubeconfig generation. OCI uses oci ce cluster create-kubeconfig with a cluster OCID. AWS uses aws eks update-kubeconfig with a cluster name and region. Both produce a kubeconfig entry that authenticates through an exec credential plugin, but the token command differs: OCI kubeconfigs call oci ce cluster generate-token, and AWS kubeconfigs call aws eks get-token. The CLI handles both paths by detecting the cloud provider from the cluster configuration and calling the appropriate SDK method.
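As a sketch of the provider dispatch, the function below builds the exec credential command that a kubeconfig user entry would carry for each cloud. The clusterRef and execCredential types and the execCredentialFor function are hypothetical; the command arguments follow what the two vendor CLIs write into generated kubeconfigs as far as I know, so treat them as an assumption to verify against your CLI versions.

```go
package main

import "fmt"

// clusterRef is a hypothetical description of a target cluster.
type clusterRef struct {
	Provider string // "oci" or "aws"
	ID       string // cluster OCID on OCI, cluster name on AWS
	Region   string
}

// execCredential holds the command a kubeconfig exec plugin entry runs.
type execCredential struct {
	Command string
	Args    []string
}

// execCredentialFor picks the token command for the cluster's provider.
func execCredentialFor(c clusterRef) (execCredential, error) {
	switch c.Provider {
	case "oci":
		return execCredential{
			Command: "oci",
			Args:    []string{"ce", "cluster", "generate-token", "--cluster-id", c.ID, "--region", c.Region},
		}, nil
	case "aws":
		return execCredential{
			Command: "aws",
			Args:    []string{"eks", "get-token", "--cluster-name", c.ID, "--region", c.Region},
		}, nil
	default:
		return execCredential{}, fmt.Errorf("unsupported provider %q", c.Provider)
	}
}

func main() {
	cred, err := execCredentialFor(clusterRef{Provider: "aws", ID: "staging", Region: "us-east-1"})
	fmt.Println(cred.Command, cred.Args, err)

	_, err = execCredentialFor(clusterRef{Provider: "gcp"})
	fmt.Println(err) // unsupported provider "gcp"
}
```

Keeping the provider switch in one function means an unsupported provider fails early with a clear error instead of producing a kubeconfig that cannot authenticate.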

IAM for pod identity. Kubernetes workloads that need cloud credentials follow different patterns: OCI Workload Identity uses IAM policies whose conditions match the workload's cluster, namespace, and Kubernetes service account. AWS uses IRSA (IAM Roles for Service Accounts) with an OIDC provider. The add-on installation step for any add-on that requires cloud credentials includes the appropriate IAM configuration for the target cloud.

What the numbers looked like

Before the CLI: 3-4 hours for a standard environment provisioning in the common case. After: 12-15 minutes. The variance exists because complex environments with non-standard add-on configurations still require manual review at decision points that the CLI surfaces but does not resolve automatically.

The improvement that mattered more operationally than the average case: provisioning a temporary cluster for incident response dropped from hours to minutes. The common case improvement reduces waste in planned work. The incident response improvement reduces pressure during production incidents, which has a different cost profile.

The CLI has been used to provision 11 clusters since deployment with no unexpected provisioning failure. The only runs that needed a human were two that stopped on add-on compatibility errors during upgrades, which is the intended behavior.

What the CLI does not handle and why

Kubernetes version upgrades across a fleet are intentionally outside the CLI scope. An upgrade is modifying a running system rather than creating from a known state. The failure modes are more consequential (a botched upgrade can take a production cluster down, not just a new provisioning attempt), and the state-check logic for "is this cluster ready to upgrade from 1.29 to 1.30" is significantly more complex than "does this add-on exist at this version."

I have a prototype of upgrade handling for the OCI path. It is not in the main CLI because the test coverage is insufficient. Running cluster upgrades without sufficient test coverage of the version-specific upgrade paths is a way to cause production incidents. I would rather have a manual process with good documentation than an automated one I do not trust on production clusters.

The Go CLI described here is open source at github.com/mohak72/GP-k8s. The full case study with results is on the case studies page.