Every tutorial for deploying ML models on Kubernetes follows the same path: create a Deployment, set up a Service, maybe wire in an HPA on CPU utilization, call it done. That path is fine for getting something running in an afternoon. It is not fine for production inference workloads, and the gap between the two is where engineers spend weeks debugging things that should have been designed correctly from the start.
I have set up inference clusters on EKS and OKE across several engagements, mostly for transformer-based models in the 7B--70B parameter range and for embedding models serving RAG pipelines. This is what I actually do, including the pieces that took me multiple failures to get right.
GPU node pool configuration: the decisions that matter before you write a single deployment manifest
The first decision is instance type, and it is more consequential than it looks. For inference (not training), the bottleneck is almost always memory bandwidth, not raw FLOPS. A GPU with 80GB HBM3 memory will outperform a GPU with higher peak FLOPS but slower memory for most inference workloads. For 7B models running at FP16, you need roughly 14GB of VRAM just for weights -- before the KV cache, before activation memory. For 70B models at FP16, that is 140GB, which means you are either sharding across multiple GPUs or quantizing down.
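The weight-memory arithmetic above can be sketched in a few lines. This is a rough illustration rather than a sizing tool: the helper name is hypothetical, it uses decimal gigabytes, and it counts weights only (no KV cache, activations, or framework overhead).

```python
# Rough VRAM needed for model weights alone (hypothetical helper, weights only).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_vram_gb(params_billion: float, dtype: str = "fp16") -> float:
    """Return decimal GB needed to hold the weights at the given precision."""
    # (params_billion * 1e9 params) * bytes_per_param / 1e9 bytes-per-GB
    return params_billion * BYTES_PER_PARAM[dtype]


if __name__ == "__main__":
    for size in (7, 70):
        for dtype in ("fp16", "int8", "int4"):
            print(f"{size}B @ {dtype}: {weight_vram_gb(size, dtype):.0f} GB")
```

Running it for a 70B model makes the trade-off concrete: FP16 does not fit on any single GPU discussed here, while 4-bit quantization brings the weights down to a size that fits on one 80GB card or a 4x A10G node.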
On AWS, my default for latency-sensitive inference serving is g5.12xlarge (4x A10G, 96GB total VRAM) or p3.8xlarge (4x V100, 64GB total VRAM) depending on model size. For batch inference where throughput matters more than latency, I have used g4dn.12xlarge (4x T4) -- the T4 is older hardware but the cost per token for offline workloads is much better than the A-series. On OCI, the BM.GPU.A10.4 bare metal shape is the one I reach for -- its A10s are good for inference, and OCI's bare metal pricing for GPUs is substantially lower than AWS on-demand for equivalent hardware.
The node pool setup itself:
# EKS managed node group for GPU inference
# This is the NodeGroup configuration, not the workload YAML
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: inference-cluster
  region: us-east-1
managedNodeGroups:
  - name: gpu-inference-ondemand
    instanceType: g5.12xlarge
    minSize: 1
    maxSize: 10
    desiredCapacity: 2
    labels:
      workload-type: inference
      gpu-tier: ondemand
    taints:
      - key: nvidia.com/gpu
        value: "true"
        effect: NoSchedule
    tags:
      cost-center: inference-prod
    iam:
      withAddonPolicies:
        cloudWatch: true
  - name: gpu-inference-spot
    instanceType: g5.12xlarge
    spot: true
    minSize: 0
    maxSize: 20
    desiredCapacity: 0
    labels:
      workload-type: inference-batch
      gpu-tier: spot
    taints:
      - key: nvidia.com/gpu
        value: "true"
        effect: NoSchedule
      - key: spot-instance
        value: "true"
        effect: NoSchedule
Two separate node groups -- on-demand for latency-sensitive requests, spot for batch inference jobs. The taint on GPU nodes forces explicit opt-in via tolerations on every workload. Without that taint, you will eventually have a non-GPU workload scheduled onto a GPU node, wasting expensive capacity on work that has nothing to do with inference.
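To make the opt-in concrete, here is a minimal sketch of the scheduling fragment a workload would carry to land on one of these node groups. It builds the pod-spec fields as a plain Python dict and prints them as JSON; the function name is hypothetical, and the label and taint keys are the ones from the node group config above.

```python
import json


def gpu_scheduling(spot: bool) -> dict:
    """Pod-spec fragment that opts a workload in to one of the GPU node groups."""
    # Every GPU workload must tolerate the nvidia.com/gpu taint.
    tolerations = [
        {"key": "nvidia.com/gpu", "operator": "Equal",
         "value": "true", "effect": "NoSchedule"},
    ]
    if spot:
        # Batch workloads additionally tolerate the spot-instance taint.
        tolerations.append(
            {"key": "spot-instance", "operator": "Equal",
             "value": "true", "effect": "NoSchedule"}
        )
    return {
        # nodeSelector pins the pod to the matching gpu-tier label.
        "nodeSelector": {"gpu-tier": "spot" if spot else "ondemand"},
        "tolerations": tolerations,
    }


if __name__ == "__main__":
    print(json.dumps(gpu_scheduling(spot=True), indent=2))
```

A toleration only permits scheduling onto a tainted node; it is the nodeSelector that actually steers the pod there, which is why the sketch emits both.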
The NVIDIA device plugin is what makes nvidia.com/gpu available as a schedulable resource. On EKS, I use the Helm chart rather than the DaemonSet manifest directly:
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace kube-system \
  --create-namespace \
  --version 0.16.2 \
  --set failOnInitError=false
The failOnInitError=false flag is important -- on mixed clusters where not every node has a GPU, the DaemonSet will fail to initialize on CPU-only nodes without it. That seems obvious, but I have seen it cause confusing DaemonSet restart loops more than once on clusters that started GPU-only and later added CPU node pools.
One thing worth stating explicitly: the NVIDIA device plugin allocates whole GPU devices, not fractional ones. If your model fits in 10GB and your GPU has 40GB, you are consuming a 40GB GPU for a 10GB workload. Time-slicing, configured through the device plugin's sharing.timeSlicing settings (which can advertise the shared replicas under the nvidia.com/gpu.shared resource name), lets you pack multiple inference replicas onto a single GPU, which changes the cost math significantly for smaller models. I have used time-slicing in staging environments but have not run it in production for latency-sensitive workloads -- the latency variance from contention is unpredictable enough that I want dedicated GPU allocation for anything user-facing. For batch workloads on spot nodes, time-slicing is worth evaluating.
Autoscaling: why CPU-based HPA does not work and what to use instead
The default horizontal pod autoscaling story in Kubernetes is CPU utilization. For inference, CPU is the wrong signal. GPU inference pods can sit at 5% CPU while their GPU is saturated at 100%. Scaling on CPU means you never scale up when you need to and occasionally scale down when the GPU is still handling requests.
There are three metrics that actually matter for inference autoscaling:
Queue depth / pending requests. If you are serving through a queue (Celery, SQS, Kafka, Redis Streams), the length of that queue is a direct signal of demand. A queue depth above your target means you need more replicas. This is the cleanest signal for batch inference.
Inference latency (p95 or p99). For synchronous request-response serving, latency is the signal. If p95 latency crosses a threshold, you are under-provisioned. This requires your serving layer to expose latency metrics -- vLLM, TGI, and Triton all do.
GPU memory utilization. Less useful as a primary autoscaling signal (memory utilization does not directly map to throughput), but important as a guard rail -- if GPU memory is above 90%, adding tokens to the KV cache will fail and requests will error, not just slow down.
I use KEDA (Kubernetes Event-Driven Autoscaling) for all three. KEDA can scale on external metrics from Prometheus, SQS queue depth, Redis list length, and a dozen other sources that HPA cannot touch without custom metrics adapter gymnastics. On EKS, KEDA runs as a Helm release:
helm repo add kedacore https://kedacore.github.io/charts
helm install keda kedacore/keda \
  --namespace keda \
  --create-namespace \
  --version 2.15.1
A ScaledObject for a vLLM inference deployment scaling on Prometheus-scraped request queue length:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-inference-scaler
  namespace: inference
spec:
  scaleTargetRef:
    name: vllm-llama3-deployment
  minReplicaCount: 1
  maxReplicaCount: 8
  cooldownPeriod: 120
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-operated.monitoring.svc.cluster.local:9090
        metricName: vllm_pending_requests
        threshold: "5"
        query: sum(vllm_request_queue_size{deployment="llama3-8b"})
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-operated.monitoring.svc.cluster.local:9090
        metricName: vllm_p95_latency_seconds
        threshold: "2.0"
        # Aggregate buckets across replicas so the query returns a single series
        query: histogram_quantile(0.95, sum by (le) (rate(vllm_request_duration_seconds_bucket{deployment="llama3-8b"}[5m])))
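Two triggers. KEDA scales up when either one breaches its threshold. The cooldownPeriod: 120 is deliberate -- scaling down too fast after a traffic burst is a problem I will get to in the pre-warming section. Note that KEDA applies cooldownPeriod only when scaling to zero; scale-down between N and 1 replicas follows the HPA stabilization window (300 seconds by default).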
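How the two triggers combine into a replica count can be sketched as follows. This is a simplification, assuming KEDA's default AverageValue metric type, under which the underlying HPA proposes roughly ceil(metric / threshold) replicas per trigger and takes the largest proposal. The function name is hypothetical, and the sketch ignores the HPA tolerance band and stabilization windows.

```python
import math


def desired_replicas(metrics, min_replicas=1, max_replicas=8):
    """metrics: iterable of (current_value, threshold) pairs, one per trigger."""
    # Each trigger proposes ceil(value / threshold); the largest proposal wins.
    proposals = [math.ceil(value / threshold) for value, threshold in metrics]
    wanted = max(proposals, default=min_replicas)
    # Clamp to the ScaledObject's minReplicaCount / maxReplicaCount.
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    # 23 queued requests against a threshold of 5, p95 latency 1.2s against 2.0s
    print(desired_replicas([(23, 5), (1.2, 2.0)]))
```

One consequence worth noticing: under this formula a latency trigger proposes a replica count that depends only on the ratio of latency to threshold, not on how many replicas are already running, so the queue-depth trigger tends to do most of the real scaling work.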
The failure mode I hit with this configuration: during off-hours with zero requests, the Prometheus query returned no data and KEDA correctly scaled down to minReplicaCount. When traffic resumed after the quiet period, the single remaining replica could not absorb the burst, the pending-requests metric spiked, and KEDA triggered a scale-up. The new replica took 4--6 minutes to load the model, which meant a multi-minute degradation period every time traffic returned after a quiet spell. That led me directly to the pre-warming work.
Model loading time: the problem tutorials skip entirely
A 7B model in FP16 is roughly 14GB. Loading that from EBS or S3 into GPU VRAM takes time. On a g5.12xlarge with NVMe-backed instance storage, I measure 90--120 seconds from pod start to first successful inference request. On a pod that reads the model from an S3-backed persistent volume with no caching, I have seen 4--7 minutes. That number is what kills your p99 latency during scale-up events.
Three mechanisms I use to address this:
Node-level model caching. Mount the model weights directory as a hostPath volume and pre-populate it on each GPU node at node bootstrap time. When the inference pod starts, the model load comes from the local NVMe cache rather than S3. This requires custom node bootstrap scripts (user data on EKS managed nodes) and a mechanism to repopulate the cache whenever the node pool scales up from zero. I use an EKS user data script that pulls the model weights from S3 to a local path at instance initialization time -- the GPU node is unusable until the model is present anyway, so the added initialization time is unavoidable, and the cold-start cost is paid once per node rather than once per pod.
#!/bin/bash
# Snippet from EKS node group user data (runs at node boot)
# Assumes AWS CLI and appropriate IAM role on the instance profile
set -euo pipefail  # stop on errors, including a failed sync in the pipeline below

MODEL_S3_PATH="s3://my-model-bucket/llama3-8b-instruct-fp16/"
LOCAL_CACHE="/mnt/models/llama3-8b-instruct"
mkdir -p "$LOCAL_CACHE"

# aws s3 sync with no-progress flag to avoid flooding cloud-init logs
aws s3 sync "$MODEL_S3_PATH" "$LOCAL_CACHE" \
  --no-progress \
  --exact-timestamps \
  2>&1 | tee /var/log/model-sync.log

# Only reached if the sync succeeded
echo "Model sync complete: $(date)" >> /var/log/model-sync.log
This script is embedded in the launch template. It adds 3--5 minutes to node bootstrap time, but once the node is ready, every subsequent pod start on that node loads in under 2 minutes.
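Because the pod reads from a hostPath, it is worth verifying that the cache is actually populated before the server starts. The check below is a minimal sketch, not part of the setup described above: the function name is hypothetical, and the file names (config.json, *.safetensors) assume a Hugging Face style model directory.

```python
from pathlib import Path


def cache_is_ready(cache_dir, required=("config.json",), weight_glob="*.safetensors"):
    """Return True if the local model cache looks complete enough to load from."""
    root = Path(cache_dir)
    if not root.is_dir():
        return False
    # Every required metadata file must exist.
    if any(not (root / name).is_file() for name in required):
        return False
    # At least one non-empty weight shard must be present.
    return any(p.stat().st_size > 0 for p in root.glob(weight_glob))


if __name__ == "__main__":
    print(cache_is_ready("/mnt/models/llama3-8b-instruct"))
```

Run as an init container or at the top of the server entrypoint, a check like this turns a half-synced cache into an immediate, readable failure instead of a slow model-load error.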
PodDisruptionBudget and minAvailable. Keep at least one replica always running, even during cluster upgrades and node evictions:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: vllm-pdb
  namespace: inference
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: vllm-llama3
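Combined with the KEDA minReplicaCount: 1, this means the cluster will maintain at least one warm replica at all times. The PDB prevents voluntary disruptions (node drains, rolling upgrades) from evicting the last inference pod. Be aware that with a single replica, minAvailable: 1 blocks the drain outright rather than waiting for a replacement, so scale to two replicas (or roll the Deployment onto another node) before draining.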
Readiness probes with realistic thresholds. This is where a lot of implementations go wrong. The default readiness probe checks HTTP 200 on /health. vLLM's /health endpoint returns 200 before the model is loaded -- it means the server process started, not that inference is ready. The correct check is /v1/models, which only returns the model listing after the model is fully loaded into VRAM:
readinessProbe:
  httpGet:
    path: /v1/models
    port: 8000
  initialDelaySeconds: 60
  periodSeconds: 10
  failureThreshold: 30
  timeoutSeconds: 5
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 90
  periodSeconds: 30
  failureThreshold: 3
  timeoutSeconds: 10
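The failureThreshold: 30 on readiness with periodSeconds: 10 tolerates 300 seconds (5 minutes) of failed checks on top of the 60-second initial delay. That is enough time for the 7B model to load from a warm cache. Readiness probes never restart a container: a failing readiness probe only keeps the pod out of the Service endpoints. Restarts come from the liveness probe. Tighten the liveness probe, or point it at an endpoint that does not respond until the model is loaded, and Kubernetes will kill your pod during model loading and restart it, producing an infinite restart loop that is genuinely confusing to debug the first time you see it.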
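The probe timing arithmetic is easy to get wrong by a factor, so here is a small sketch of it. The helper is hypothetical and deliberately approximate: it ignores probe timeouts and scheduling jitter, and simply adds the initial delay to the period multiplied by the failure threshold.

```python
def probe_window_seconds(initial_delay: int, period: int, failure_threshold: int) -> int:
    """Approximate seconds from container start until the failure budget is spent."""
    return initial_delay + period * failure_threshold


if __name__ == "__main__":
    # Values taken from the probe configuration above
    print("readiness window:", probe_window_seconds(60, 10, 30), "seconds")
    print("liveness window:", probe_window_seconds(90, 30, 3), "seconds")
```

Comparing the two windows against your measured model load time is a quick sanity check before rollout: the liveness window in particular needs to exceed the time the liveness endpoint takes to start answering.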
Monitoring for model drift before users notice
Infra monitoring (GPU utilization, memory, latency, error rates) is the obvious layer and I will not dwell on it -- the vLLM and TGI metrics endpoints expose everything you need for Prometheus to scrape. The less obvious layer is model-level observability: are the outputs degrading in quality before the error rate climbs?
The monitoring work that actually catches problems:
Output length distribution. Track the p50 and p95 of token count per response over time. A sudden shift in output length distribution -- responses getting systematically shorter or longer -- is often the first signal that something changed upstream (model update, prompt template regression, tokenizer issue). I publish this as a custom metric from the inference service:
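# In the inference service wrapper (FastAPI over vLLM)
# app, CompletionRequest, vllm_engine, and sampling_params are defined elsewhere in the service
from prometheus_client import Histogram

response_tokens = Histogram(
    "inference_response_tokens",
    "Token count of inference responses",
    buckets=[10, 25, 50, 100, 200, 400, 800, 1600],
)

@app.post("/v1/completions")
async def completions(request: CompletionRequest):
    # vllm_engine is assumed to be a wrapper that awaits the final output;
    # the raw AsyncLLMEngine.generate() is an async generator and cannot be awaited directly
    result = await vllm_engine.generate(request.prompt, sampling_params)
    # Count generated tokens for the first (only) completion and record it
    token_count = len(result.outputs[0].token_ids)
    response_tokens.observe(token_count)
    return result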
Alert on: histogram_quantile(0.50, sum by (le) (rate(inference_response_tokens_bucket[1h]))) deviating more than 20% from the 7-day baseline (the sum by (le) aggregates buckets across replicas). This fires occasionally on legitimate input distribution shifts, but it has also caught two real problems -- one prompt template regression and one model weight corruption -- before user complaints came in.
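The deviation rule itself is a relative comparison against the baseline. A minimal sketch, assuming the current and baseline p50 values have already been fetched from Prometheus (the function name and the way the baseline is obtained are hypothetical):

```python
def deviates(current: float, baseline: float, tolerance: float = 0.20) -> bool:
    """True if current differs from baseline by more than the given fraction."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return abs(current - baseline) / baseline > tolerance


if __name__ == "__main__":
    # Baseline p50 of 100 tokens: 115 is within tolerance, 130 and 75 are not
    for value in (115, 130, 75):
        print(value, deviates(value, 100))
```

The check is symmetric on purpose: responses getting systematically shorter are as much a signal as responses getting longer.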
Embedding drift (for RAG inference). For embedding models serving a RAG pipeline, I track cosine similarity of embeddings for a fixed synthetic probe set. Every 15 minutes, the monitoring sidecar sends 10 fixed sentences through the embedding endpoint and computes the similarity matrix against a reference snapshot taken at deployment time. A meaningful drift (I use a threshold of 0.05 average cosine distance shift) triggers an alert. This is the technique Chip Huyen describes in "Designing Machine Learning Systems" for detecting embedding model regression, and it works -- on a client RAG deployment, it caught a library version conflict that was subtly affecting embedding quality and would otherwise have been invisible until retrieval quality degraded noticeably.
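One way to compute the drift score is sketched below. It is one interpretation of the probe check, not the exact sidecar code: it averages the cosine distance between each probe sentence's current embedding and its reference embedding, uses only the standard library, and all function names are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("zero-length embedding")
    return 1.0 - dot / norm


def mean_probe_drift(reference, current):
    """Average cosine distance between paired reference and current embeddings."""
    if not reference or len(reference) != len(current):
        raise ValueError("probe sets must be non-empty and the same size")
    return sum(cosine_distance(r, c) for r, c in zip(reference, current)) / len(reference)


def drift_alert(reference, current, threshold=0.05):
    """True when the average drift exceeds the alerting threshold."""
    return mean_probe_drift(reference, current) > threshold


if __name__ == "__main__":
    ref = [[1.0, 0.0], [0.0, 1.0]]
    print(drift_alert(ref, ref))
```

Because the probe sentences are fixed, any non-trivial score comes from the serving stack itself (model weights, tokenizer, library versions) rather than from changes in user traffic.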
Structured output failure rate. If the inference workload is generating JSON via structured output (constrained decoding in vLLM, or schema enforcement in Outlines), track the fraction of requests that require retries or fail schema validation. A rising failure rate against a fixed prompt distribution means the model is drifting or the constrained generation is running into edge cases introduced by a new request pattern.
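The failure-rate metric can be sketched with nothing more than JSON parsing and a key check. This is a simplified stand-in for real schema validation: the function names are hypothetical, and "valid" here only means the response parses as a JSON object containing the required keys.

```python
import json


def is_valid(raw: str, required_keys) -> bool:
    """True if raw parses as a JSON object that contains every required key."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and all(key in obj for key in required_keys)


def failure_rate(responses, required_keys) -> float:
    """Fraction of responses that fail validation (0.0 for an empty batch)."""
    if not responses:
        return 0.0
    failures = sum(1 for raw in responses if not is_valid(raw, required_keys))
    return failures / len(responses)


if __name__ == "__main__":
    batch = ['{"answer": "yes"}', "not json", '{"other": 1}', "[1, 2]"]
    print(failure_rate(batch, ["answer"]))
```

Exported as a gauge or computed from two counters, this gives the rising-failure-rate signal described above without coupling the monitoring code to any particular constrained-decoding library.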
The Grafana dashboard I run has four panels in the first row: GPU memory utilization, p95 inference latency, request queue depth, and output token p50. Those four numbers tell me the state of the system faster than anything else. The second row has the model-level signals: response length distribution, structured output failure rate, and embedding drift score where applicable.
Cost: Spot for batch, On-Demand for latency-sensitive
The cost model for inference is different from general Kubernetes workloads. GPU instances are expensive -- a g5.12xlarge is roughly $5.67/hour on-demand as of mid-2026. Spot pricing for the same instance type is typically 60--70% lower, around $1.80--$2.20/hour, but with the eviction risk.
The split I use:
On-demand node pool handles synchronous user-facing inference. The SLA requires responding within a latency budget -- a spot eviction mid-request, where the model needs 4 minutes to reload, is not acceptable. On-demand here is not optional.
Spot node pool handles everything else: async batch jobs (generating embeddings for a new document corpus, running inference over historical data, offline evaluation runs), non-latency-sensitive background summarization. These jobs are designed with eviction tolerance from the start -- checkpointing, idempotent task units, retry logic.
For the spot pool, I use KEDA to scale to zero when there are no pending batch jobs and back up from zero when jobs arrive. A spot pool sitting at zero desiredCapacity costs nothing. The 5-minute cold start cost (model load time) is acceptable for a batch job that runs for 30--60 minutes.
The approximate cost breakdown on a recent engagement: a cluster handling 2,000 synchronous inference requests/day (averaging 200 output tokens) and 50,000 embedding generations/day (for a RAG pipeline re-indexing cycle) ran at roughly $380/month on EKS -- $290 for the on-demand GPU nodes (1--2 replicas of g5.12xlarge running ~12 hours/day with overnight scale-down via a scheduled ScaledObject) and ~$90 for spot-based batch embedding jobs. Without the spot/on-demand split and overnight scale-down, the same workload would have been closer to $900/month.
The scheduled scale-down is worth implementing explicitly:
# Note: KEDA allows only one ScaledObject per scale target; merge these triggers
# into the existing ScaledObject if one already manages this Deployment
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scheduled-scaler
  namespace: inference
spec:
  scaleTargetRef:
    name: vllm-llama3-deployment
  minReplicaCount: 0
  maxReplicaCount: 4
  triggers:
    - type: cron
      metadata:
        timezone: "Asia/Kolkata"
        start: "30 7 * * 1-5"
        end: "0 22 * * 1-5"
        desiredReplicas: "1"
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-operated.monitoring.svc.cluster.local:9090
        metricName: vllm_pending_requests
        threshold: "5"
        query: sum(vllm_request_queue_size{deployment="llama3-8b"})
The cron trigger ensures one warm replica during business hours. Outside that window, if the Prometheus trigger is also zero (no pending requests), KEDA scales to zero. The minReplicaCount: 0 at the ScaledObject level allows this -- but only works if you are genuinely comfortable with cold starts during off-hours. For internal tooling, I am. For a public-facing product, I keep minReplicaCount: 1.
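The interaction between the cron floor and the metric-driven count can be sketched as below. It is a simplified model of the behavior described, not KEDA's implementation: the function names are hypothetical, and `now` is assumed to already be expressed in the trigger's timezone.

```python
from datetime import datetime, time


def cron_floor(now: datetime, desired: int = 1) -> int:
    """Replicas requested by the cron trigger at local time `now`."""
    in_week = now.weekday() < 5  # Monday=0 .. Friday=4, matching cron's 1-5
    in_hours = time(7, 30) <= now.time() < time(22, 0)
    return desired if (in_week and in_hours) else 0


def target_replicas(now: datetime, metric_replicas: int, max_replicas: int = 4) -> int:
    """The larger of the cron floor and the metric-driven count, capped at the max."""
    return min(max_replicas, max(cron_floor(now), metric_replicas))


if __name__ == "__main__":
    weekday_morning = datetime(2024, 1, 3, 9, 0)   # a Wednesday
    weekend_morning = datetime(2024, 1, 6, 9, 0)   # a Saturday
    print(target_replicas(weekday_morning, 0), target_replicas(weekend_morning, 0))
```

The sketch shows why the off-hours cold start exists: outside the window, the only thing that can lift the count above zero is the metric trigger.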
Failure modes the tutorials do not cover
GPU OOM during KV cache growth. vLLM pre-allocates KV cache at startup based on gpu_memory_utilization (default 0.90, meaning vLLM may use up to 90% of the GPU's VRAM for weights, activations, and KV cache combined). On a 24GB A10G with a 7B model consuming 14GB of weights, that leaves at most about 7.6GB for KV cache, and less once activation memory is accounted for. Under high concurrency with long context requests, that cache fills and vLLM starts rejecting requests with RESOURCE_EXHAUSTED errors rather than queuing them. The error looks like an OOM but it is a KV cache exhaustion -- it shows up in vLLM logs as Cannot schedule request: out of kv cache memory. The fix is to tune max_num_seqs (concurrent sequences) and max_model_len (maximum context length) to match your actual traffic distribution, not the theoretical model maximum. For most production request distributions, you are not serving max-context requests -- sizing as if you are wastes KV cache capacity.
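A back-of-the-envelope KV cache budget helps when picking those two settings. The sketch below is an estimate under stated assumptions, not vLLM's own accounting: it treats gpu_memory_utilization as the fraction of total VRAM available for weights plus cache, uses the standard per-token KV size (keys and values for every layer), and the example dimensions (32 layers, 8 KV heads, head size 128, 16GB of FP16 weights on a 24GB card) are assumptions for an 8B-class grouped-query-attention model. Function names are hypothetical.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # Factor of 2 covers the separate key and value tensors in each layer.
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def kv_token_capacity(total_vram_gb, weights_gb, gpu_memory_utilization,
                      bytes_per_token, overhead_gb=0.0) -> int:
    """Approximate number of tokens the KV cache can hold across all sequences."""
    budget_gb = total_vram_gb * gpu_memory_utilization - weights_gb - overhead_gb
    if budget_gb <= 0:
        return 0
    return int(budget_gb * 1e9 // bytes_per_token)


if __name__ == "__main__":
    per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
    capacity = kv_token_capacity(24, 16, 0.90, per_token)
    print(f"{per_token} bytes/token, roughly {capacity} tokens of KV cache")
```

Dividing the token capacity by your typical prompt-plus-output length gives a rough ceiling on concurrent sequences, which is a more defensible starting point for the concurrency limit than the library default.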
Rolling update evicting the only warm replica. A Deployment rolling update with maxUnavailable: 1 will evict the current pod before the new one's readiness probe passes. With a 90-second model load time and a readiness probe that takes up to 300 seconds to clear, you have a period where zero replicas are ready. I set maxUnavailable: 0, maxSurge: 1 for inference deployments -- the rollout creates a new pod, waits for it to pass readiness (which may take 5 minutes), then terminates the old one. Slower rollout, zero downtime.
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
NVIDIA driver / CUDA version mismatch after node pool AMI update. On EKS managed node groups, AWS periodically updates the AMI for the node group. If the new AMI ships with a different driver and CUDA version than the one your model server container was built against, the container can silently fall back to CPU execution -- no error at startup, just 100x slower inference and confused application-level latency alerts going off. I learned this one the hard way. The fix: pin the EKS optimized GPU AMI version explicitly in the launch template and upgrade it intentionally rather than accepting automatic AMI rollout. It is also worth adding a startup check that validates CUDA is actually being used:
import logging

import torch

logger = logging.getLogger(__name__)

# Refuse to start rather than silently serving CPU-speed inference
if not torch.cuda.is_available():
    logger.error("CUDA not available -- inference will run on CPU. Refusing to start.")
    raise RuntimeError("CUDA unavailable at startup")

# total_memory is the device's total VRAM, not the currently free amount
logger.info(f"CUDA device: {torch.cuda.get_device_name(0)}, "
            f"CUDA version: {torch.version.cuda}, "
            f"Total VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f}GB")
Fail loud at startup. Do not let a misconfigured GPU node serve CPU-speed inference silently.
Node pool scale-up during GPU capacity crunch. When AWS or OCI is running low on GPU instances in a given availability zone, spot and on-demand requests may both fail. Cluster Autoscaler will keep trying to provision nodes and failing, logging repeated errors. The inference deployment sits with pending pods and the queue builds. The fix is multi-AZ node groups with GPU capacity in multiple AZs -- when one AZ is out of a specific instance type, the scheduler can land the pod in another. Also worth setting up instance type diversification: specifying g5.12xlarge and g4dn.12xlarge as acceptable types in the same node group allows the autoscaler to substitute a slightly older GPU if the preferred type is unavailable.
The honest state of this in mid-2026
GPU node pool management, autoscaling on inference-specific metrics, and model loading time are solvable problems. They require more engineering than a tutorial covers, but none of it is magic. The harder problem that I do not have a clean solution for: multi-tenant GPU sharing across teams with different models and different SLA requirements on the same cluster. GPU time-slicing and MIG (Multi-Instance GPU on A100/H100 class hardware) both work, but the scheduling semantics interact with Kubernetes resource management in ways that produce surprising behavior -- a "low priority" batch job holding a MIG partition blocking a high-priority inference request is a real scenario I have hit. If you are building a shared inference platform for multiple teams and models, that problem deserves its own post.
For the setup I have described -- dedicated cluster, one to three production models, batch + synchronous workloads -- this configuration is what I would start with. It is not minimal, but it does not have the failure modes I have spent cycles debugging.
Most of the inference infrastructure work I do now involves building evaluation and observability pipelines on top of these serving setups -- connecting the inference layer to prompt evaluation, drift detection, and CI/CD for model updates. That work is part of what I am building into Pipeshift. If you are setting up inference infrastructure and want a review of your architecture before it becomes a production problem, reach out.