Kubernetes Cost Optimization Playbook

Most Kubernetes cost optimization advice is vague. "Right-size your pods." "Use spot instances." These are directionally correct but not actionable without the specifics — what to measure first, which levers move the number most, and where the reliability risks actually are.

I've done cost optimization on OKE, EKS, and AKS clusters across client engagements. The patterns are consistent across clouds. Here's what actually moves the needle and in what order.

Measure before you optimize

The single most common mistake is starting with a solution before you understand the problem. "Use spot instances" is not a cost strategy — it's a tactic that may or may not apply to your situation.

The three questions that determine where your money is going:

What are the top namespaces by spend? In most clusters I've looked at, two or three namespaces account for 60–80% of resource consumption. Optimize those first. The rest is noise.

Which workloads are over-requesting? Pod resource requests determine how the scheduler places pods and how node pools are sized. An app that requests 4 CPU and actually uses 0.3 is costing you roughly 13x more than it needs to in compute allocation. This is the most common waste pattern I've seen.

Where is cluster-level waste coming from? Idle nodes, oversized node pools, and "forgotten" workloads (old preview environments, test namespaces that never got cleaned up) are typically the easiest wins. Idle nodes are pure waste — you're paying for capacity that serves no requests.

The tool I start with is kubectl top pods --all-namespaces --sort-by=cpu alongside historical utilization from Prometheus. The kubectl top snapshot shows current state; the Prometheus history shows the range. In most engagements I'm working with Prometheus + Grafana — occasionally OCI Monitoring on OKE or CloudWatch Container Insights on EKS, but Prometheus gives the most flexibility for the queries that matter here.
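To turn the kubectl top snapshot into a per-namespace ranking, a rough sketch like the one below can help. It assumes the usual column layout of kubectl top pods -A --no-headers (namespace, pod name, CPU, memory). The function names and sample rows are made up for illustration, and current usage is only a proxy for spend, since requests are what actually size the nodes.

```python
from collections import defaultdict


def cpu_millicores(value: str) -> int:
    """Convert a kubectl CPU quantity ('250m' or '2') to millicores."""
    if value.endswith("m"):
        return int(value[:-1])
    return int(float(value) * 1000)


def cpu_by_namespace(top_output: str) -> list[tuple[str, int]]:
    """Sum CPU per namespace from `kubectl top pods -A --no-headers` text."""
    totals: dict[str, int] = defaultdict(int)
    for line in top_output.strip().splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        namespace, cpu = fields[0], fields[2]
        totals[namespace] += cpu_millicores(cpu)
    # Largest consumers first
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sample rows, not taken from a real cluster
sample = """
payments   api-7d9f       850m   512Mi
payments   worker-5c2a    400m   256Mi
search     indexer-9b1e   1200m  2048Mi
preview-42 web-1a2b       3m     40Mi
"""

for namespace, millicores in cpu_by_namespace(sample):
    print(f"{namespace:12s} {millicores:>6d}m")
```

The same aggregation over Prometheus history, rather than a single snapshot, is what the ranking should ultimately be based on.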

The request/limit problem

Kubernetes resource management has three numbers that matter:

Request: what the scheduler uses for placement, and what drives node pool sizing
Limit: the ceiling the container cannot exceed
Actual usage: what the workload actually needs

The common patterns I find in over-provisioned clusters:

Requests set by default without measurement. Engineers set requests.cpu: 500m because that's what the Helm chart defaulted to, or because someone suggested it three years ago. Nobody has looked at whether that matches reality.

Limits set far above requests as a "safety margin." A pod with requests.cpu: 200m, limits.cpu: 4000m will be scheduled as if it needs 200m, but can consume 4000m in a burst. If several such pods burst simultaneously, the node is overcommitted and each pod gets squeezed back toward its requested share. Like throttling at the limit, that contention shows up as latency, not as a clean error, making it hard to diagnose.

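No limits set at all. A pod with no limits is a potential noisy neighbor. One misbehaving workload can exhaust node memory and trigger evictions or OOM kills, or take every spare CPU cycle and squeeze its neighbors down to their requested share.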

The right baseline: set requests at approximately the 90th percentile of actual usage. Set limits at approximately 2x requests for most workloads — enough headroom for bursts, not so much that a runaway process takes down the node.
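As a minimal sketch of that baseline, assuming you already have per-workload CPU usage samples in millicores: the helper names and sample values below are hypothetical, and the nearest-rank percentile is a simplification of what a metrics backend would compute.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile; pct is in the range (0, 100]."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def suggest_cpu(samples_millicores: list[float], limit_factor: float = 2.0) -> dict[str, int]:
    """Request at ~p90 of observed usage, limit at limit_factor x request."""
    request = math.ceil(percentile(samples_millicores, 90))
    return {"request_m": request, "limit_m": math.ceil(request * limit_factor)}


# Hypothetical usage samples in millicores
usage = [80, 95, 100, 105, 110, 120, 130, 150, 180, 400]
print(suggest_cpu(usage))
```

Note how the single 400m spike does not drag the request up; it is covered by the limit instead. Workloads with frequent, legitimate bursts may need a higher percentile or a larger limit factor.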

Getting to these numbers requires a utilization data source. In practice I use a Grafana dashboard querying:

# 90th percentile CPU usage per pod over 7 days
# (sum the per-container series so the result is one series per pod)
quantile_over_time(0.90,
  sum by (namespace, pod) (
    rate(container_cpu_usage_seconds_total{container!=""}[5m])
  )[7d:5m]
)

Subqueries need Prometheus 2.7 or later. In the [7d:5m] range:step suffix, the step sets how often the inner expression is evaluated across the 7-day window, so it trades precision for query performance.

The VPA (Vertical Pod Autoscaler) can recommend request values based on historical usage. I use it in recommendation mode first (updateMode: "Off"), where it tells you what the requests should be without changing them automatically. After validating the recommendations, I either apply them manually or enable VPA in Auto mode for workloads where dynamic resizing is safe. For stateful workloads and anything with strict memory requirements, I stay manual.
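A recommendation-only VPA object can be sketched as follows. This assumes the VPA components and the autoscaling.k8s.io/v1 CRD are installed in the cluster, and that the target is a Deployment; the workload name and namespace are hypothetical.

```python
import json


def vpa_recommend_only(name: str, namespace: str) -> dict:
    """Build a VerticalPodAutoscaler that only records recommendations."""
    return {
        "apiVersion": "autoscaling.k8s.io/v1",
        "kind": "VerticalPodAutoscaler",
        "metadata": {"name": f"{name}-vpa", "namespace": namespace},
        "spec": {
            "targetRef": {"apiVersion": "apps/v1", "kind": "Deployment", "name": name},
            # "Off" means: compute recommendations, never evict or resize pods
            "updatePolicy": {"updateMode": "Off"},
        },
    }


# Hypothetical Deployment name and namespace
print(json.dumps(vpa_recommend_only("checkout-api", "payments"), indent=2))
```

The printed JSON can be piped into kubectl apply -f -, and the recommendations then appear in the object's status once the recommender has seen enough history.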

Benchmark: what right-sizing actually saves

On a recent OKE cluster audit, the before state across 40 workloads:

Average CPU request: 620m per pod
Average CPU actual (p90): 110m per pod
Cluster had 8 nodes × 8 vCPU = 64 vCPU allocated

After setting requests at p90 actual usage + 20% buffer:

Average CPU request: 135m per pod
Same 40 workloads fit comfortably on 4 nodes
Node count: 8 → 4
Total cluster bill: reduced by 33%. Node spend halved, but nodes are the largest line item, not the only one; load balancer, storage, and egress costs stayed constant.

The reliability story didn't change. The workloads were consuming the same compute; we just stopped paying for headroom that was never used.
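The gap between halving the node count and a 33% drop in the total bill is worth working through. A small sketch, assuming all nodes are priced identically and every non-node cost stayed flat:

```python
def node_share_of_bill(node_reduction: float, bill_reduction: float) -> float:
    """Fraction of the original bill spent on nodes, assuming other costs are constant."""
    return bill_reduction / node_reduction


nodes_before, nodes_after = 8, 4
node_reduction = (nodes_before - nodes_after) / nodes_before
share = node_share_of_bill(node_reduction, bill_reduction=0.33)
print(f"node spend cut by {node_reduction:.0%}, nodes were ~{share:.0%} of the bill")
```

In other words, under those assumptions roughly two thirds of the original bill was node compute, which is why a 50% node reduction lands at about a third off the total.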

Node pool strategy

Most default Kubernetes setups use a single node pool with a fixed instance type. That's fine for small clusters. At scale, a single node pool means every workload runs on the same hardware — batch jobs sit next to latency-sensitive APIs, memory-intensive data processing shares nodes with lightweight web servers.

The node pool split I reach for:

General pool: medium-sized nodes (8–16 vCPU), on-demand/reserved, for latency-sensitive workloads and anything that can't tolerate eviction. This is where your production APIs run.

Batch/spot pool: larger nodes (16–32 vCPU for density), spot/preemptible pricing, for workloads that are tolerant of eviction and can be retried. Data processing jobs, batch reconciliation, model inference jobs that aren't user-facing.

Memory-optimized pool (add when needed): for workloads with high memory-to-CPU ratios — caches, in-memory databases, heavy JVM workloads. Running these on general-purpose nodes overpays for CPU you don't use.

Node pools in OKE use OCI instance shapes. On EKS, Karpenter has largely replaced manual node groups in greenfield setups: it provisions an instance type that fits the pending pods rather than rounding up to the next node group size. In my current engagements we're using managed node groups rather than Karpenter, partly because the clusters predate Karpenter becoming stable, and migrating autoscalers on a live cluster carries risk. For new EKS clusters, Karpenter is worth starting with.

Spot/preemptible instances: the actual risk model

Spot instances are frequently oversold as a cost lever. The pitch is "save 60–90% on compute." The reality is that you have to design your workloads for eviction tolerance to realize that saving, and not all workloads qualify.
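One way to sanity-check the headline number: the discount only applies to the share of compute that can actually run on spot. The sketch below assumes a flat 70% discount and hypothetical eligible fractions; neither figure comes from a real cluster.

```python
def blended_savings(spot_eligible_fraction: float, spot_discount: float) -> float:
    """Cluster-wide compute savings when only part of the fleet can run on spot."""
    return spot_eligible_fraction * spot_discount


# Hypothetical eligible fractions at an assumed 70% spot discount
for eligible in (0.2, 0.4, 0.6):
    saved = blended_savings(eligible, 0.70)
    print(f"{eligible:.0%} eligible at 70% discount -> {saved:.0%} saved")
```

If only a fifth of the compute is eviction-tolerant, the realistic saving is far below the pitch, before accounting for retries and fallback to on-demand capacity.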

Eviction-tolerant workloads (good candidates for spot):

Batch processing jobs with checkpointing
Stateless workers pulling from a queue
CI/CD build runners
Model inference jobs (with retry logic)

Poor candidates for spot:

Stateful databases (Postgres, Redis, Kafka) — eviction risks data loss or corruption
Latency-sensitive APIs where a pod eviction during peak traffic affects user-facing response time
Anything with long startup time or complex initialization that makes restarting expensive

The correct implementation uses node affinity and taints/tolerations to keep spot-tolerant workloads on spot nodes and everything else on on-demand:

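# On the spot node pool (set via node group configuration)
# Node label: node.kubernetes.io/lifecycle=spot
# Node taint: node.kubernetes.io/lifecycle=spot:NoSchedule (keeps non-tolerating pods off spot)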

# On batch workload pods
tolerations:
  - key: "node.kubernetes.io/lifecycle"
    operator: "Equal"
    value: "spot"
    effect: "NoSchedule"
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
            - key: "node.kubernetes.io/lifecycle"
              operator: "In"
              values: ["spot"]

With preferredDuringSchedulingIgnoredDuringExecution (rather than the required variant), the workload falls back to on-demand nodes if spot capacity is unavailable. This keeps the cluster from failing to schedule when spot pools are constrained.

Cluster autoscaler vs. Karpenter

Cluster Autoscaler (CA) scales node groups up and down based on pending pods and underutilized nodes. It works, but it has two limitations at scale: it scales in pre-defined node group increments, and it's slow — scale-up decisions take 10–60 seconds by default.

Karpenter (AWS-native, also available for other clouds via community providers) provisions individual nodes to fit pending pods, picking the cheapest instance type that satisfies the request. The result is better bin-packing and faster scale-up.

In practice: on EKS clusters with variable workloads, Karpenter typically reduces node count by 15–25% compared to CA with fixed node groups — the improvement comes from picking smaller instances that fit actual pod requirements rather than rounding up to the nearest pre-defined node group size. I haven't measured this directly in my current engagements (we're on managed node groups), but the bin-packing improvement is well-documented in AWS's own benchmarks.

For OKE, the native cluster autoscaler handles node pool scaling. OCI's Instance Pools provide the underlying mechanism. The configuration is simpler than the AWS ecosystem but also less flexible — you're working with node pools rather than individual instance selection.

The cluster overhead you're paying for but rarely measure

Two categories of spend that don't show up in "right-size your pods" analysis:

DaemonSet overhead. Every node runs DaemonSet pods: monitoring agents, log collectors, security scanners, network plugins. On a 16-vCPU node, a 200m CPU request per DaemonSet pod across five DaemonSets is 1 vCPU of overhead (about 6% of the node) before any workload runs. On small nodes, DaemonSet overhead is a large fraction of total capacity. This is an argument for larger nodes when you have many DaemonSets.
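The same arithmetic across node sizes, as a quick sketch. It uses the five DaemonSets at 200m each from the example above and, as a simplification, compares against raw node vCPU rather than allocatable capacity.

```python
def daemonset_overhead(node_vcpu: int, daemonsets: int = 5, request_m: int = 200) -> float:
    """Fraction of a node's CPU reserved by DaemonSet requests."""
    return (daemonsets * request_m) / (node_vcpu * 1000)


for vcpu in (4, 8, 16, 32):
    print(f"{vcpu:>2d} vCPU node: {daemonset_overhead(vcpu):.1%} reserved by DaemonSets")
```

On a 4-vCPU node the same five agents reserve a quarter of the machine, which is where the case for larger nodes comes from.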

Namespace sprawl. Teams that create a new namespace per feature branch for preview environments, and then don't clean them up, end up with 50+ namespaces running near-idle workloads. A simple policy: namespaces created by CI/CD are labeled or annotated with a TTL, and a cleanup job removes them after 72 hours if not renewed. This is operational hygiene, but it consistently shows up in cost audits.
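The selection logic for such a cleanup job can be sketched like this. It assumes the job has already read a last-renewed timestamp for each CI-created namespace (for example from an annotation); the namespace names and timestamps are hypothetical, and the actual deletion step is left out.

```python
from datetime import datetime, timedelta, timezone

TTL = timedelta(hours=72)


def expired_namespaces(
    last_renewed: dict[str, datetime], now: datetime, ttl: timedelta = TTL
) -> list[str]:
    """Return namespaces whose last renewal is older than the TTL."""
    return sorted(name for name, renewed in last_renewed.items() if now - renewed > ttl)


now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
# Hypothetical preview namespaces and their last-renewed timestamps
namespaces = {
    "preview-pr-101": now - timedelta(hours=100),
    "preview-pr-102": now - timedelta(hours=5),
    "preview-pr-103": now - timedelta(hours=73),
}
print(expired_namespaces(namespaces, now))
```

Keeping the decision as a pure function makes it easy to dry-run against a namespace listing before the job is allowed to delete anything.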

The reliability constraint

Cost optimization that regresses reliability is not optimization — it's trading one problem for another.

The guardrails I apply before any cost-reduction change:

Define SLOs before touching resource configuration. If you don't have a written SLO, you don't know what "reliability" means for that workload, which means you can't tell if you've broken it.
Change one thing at a time. Right-sizing a pod's requests, moving it to a spot node pool, and enabling VPA simultaneously makes it impossible to attribute a reliability regression.
Monitor for 48–72 hours after each change before declaring success. Latency problems from CPU throttling often take time to surface.

The actual cost reductions across a cluster optimization engagement typically break down like this: request right-sizing accounts for the largest share (40–60% of reduction), node pool splitting and spot adoption account for another 20–35%, and cleanup of idle workloads accounts for the remainder. The exact numbers depend heavily on the baseline state of the cluster.

If you're running Kubernetes on OCI, AWS, or Azure and want a structured audit — what's wasting spend, what the reliability risks are, and a prioritized remediation list — that's something I do as a standalone engagement. Contact me to talk through the specifics.