Start with measurement
Before optimizing anything, make sure you can answer:
- What are the top namespaces by spend?
- Which workloads are over-requesting CPU/memory?
- Where is cluster waste coming from (idle nodes, oversized pools, noisy neighbors)?
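As a rough illustration of the first two questions, here is a minimal Python sketch that ranks namespaces by spend and flags workloads whose CPU request is far above observed usage. The `WorkloadSample` record, its field names, and the 2x threshold are assumptions made for this example, not the schema or defaults of any particular cost tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class WorkloadSample:
    # Hypothetical record; field names are illustrative only.
    namespace: str
    workload: str
    cpu_request_cores: float
    cpu_used_p95_cores: float
    monthly_cost_usd: float


def top_namespaces_by_spend(samples, n=3):
    """Sum cost per namespace and return the n most expensive, highest first."""
    totals = defaultdict(float)
    for s in samples:
        totals[s.namespace] += s.monthly_cost_usd
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]


def over_requesting(samples, max_ratio=2.0):
    """Return (namespace, workload) pairs whose request/usage ratio exceeds max_ratio."""
    flagged = []
    for s in samples:
        if s.cpu_request_cores <= 0:
            continue  # nothing requested, nothing to reclaim
        if s.cpu_used_p95_cores <= 0:
            ratio = float("inf")  # requested but never used
        else:
            ratio = s.cpu_request_cores / s.cpu_used_p95_cores
        if ratio > max_ratio:
            flagged.append((s.namespace, s.workload))
    return flagged
```

The same ratio check could be repeated for memory. The threshold is a judgment call and should reflect how bursty the workload is.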
High-leverage changes
- right-size requests/limits using observed utilization
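- enable HPA/VPA where it makes sense, with safety bounds (min/max replicas, min/max allowed resources), and avoid having both act on the same CPU/memory metric for one workload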
- split node pools by workload type (batch vs latency-sensitive)
- use spot/preemptible for tolerant workloads
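To make the first two items concrete, the sketch below proposes a CPU request from observed usage and computes a bounded replica count. It assumes usage samples in millicores. The 1.2 headroom factor and the floor/ceiling values are illustrative choices, not recommendations. `desired_replicas` follows the basic ratio rule described in the Kubernetes HPA documentation (`ceil(currentReplicas * currentMetric / targetMetric)`) but omits the controller's tolerance band and stabilization window.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile; p is a fraction in (0, 1]."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered)))
    return ordered[rank - 1]


def recommend_request_millicores(usage_millicores, headroom=1.2,
                                 floor=50, ceiling=4000):
    """Propose a CPU request: p95 usage plus headroom, clamped to safety bounds."""
    p95 = percentile(usage_millicores, 0.95)
    proposed = int(round(p95 * headroom))
    return max(floor, min(ceiling, proposed))


def desired_replicas(current_replicas, current_util, target_util,
                     min_replicas, max_replicas):
    """Simplified HPA-style ratio rule, clamped to [min_replicas, max_replicas]."""
    raw = math.ceil(current_replicas * current_util / target_util)
    return max(min_replicas, min(max_replicas, raw))
```

The clamps are the "safety bounds": a noisy sample window can neither starve a workload nor scale it without limit.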
Guardrails
Cost work must not regress reliability:
- define SLOs first
- ship changes incrementally
- add rollback strategies
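One way these guardrails could fit together is to gate each incremental step on the remaining error budget. The sketch below assumes a request-based availability SLO. The function names and the 25% minimum-budget threshold are hypothetical and chosen only for illustration.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget left in the window (negative if overspent)."""
    if total_requests <= 0:
        return 1.0  # no traffic observed, so no budget consumed
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures <= 0:
        return 0.0 if failed_requests > 0 else 1.0
    return 1.0 - failed_requests / allowed_failures


def rollout_decision(budget_remaining, min_budget=0.25):
    """Continue the cost change only while enough error budget remains."""
    return "proceed" if budget_remaining >= min_budget else "rollback"
```

For example, with a 99.9% SLO, 100,000 requests, and 50 failures, about half the budget remains, so the next step proceeds. At 90 failures the check returns "rollback".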