plementation follows a strict dependency chain: right-size workloads → scale pods horizontally → scale nodes → enforce stability constraints.
Step 1: Baseline Resource Requests
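HPA calculates resource utilization as a percentage of resources.requests. If requests are missing, HPA cannot compute utilization for that metric; if they are misaligned, it scales at the wrong load level. Run VPA in recommendation mode (updateMode: "Off") to gather historical usage, then apply the suggested values. Avoid VPA Auto mode in production: it evicts and recreates pods to apply new requests, and it conflicts with an HPA that scales on the same CPU or memory metric.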
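Once the VPA has observed the workload for a while, it publishes its recommendation in the object's status. The fragment below sketches the shape of that status as it would appear in `kubectl get vpa <name> -o yaml`. The container name and every number are hypothetical placeholders, not measured values.

```yaml
# Illustrative VPA status fragment; all values are hypothetical.
status:
  recommendation:
    containerRecommendations:
      - containerName: app          # hypothetical container name
        lowerBound:                 # below this, the container is likely under-provisioned
          cpu: 150m
          memory: 200Mi
        target:                     # candidate value for resources.requests
          cpu: 250m
          memory: 300Mi
        uncappedTarget:             # target before minAllowed/maxAllowed are applied
          cpu: 250m
          memory: 300Mi
        upperBound:                 # above this, requests are likely wasteful
          cpu: 600m
          memory: 700Mi
```

The `target` values are the ones to copy into the workload's `resources.requests`, ideally after comparing them against load test results.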
Step 2: Configure HPA Metrics
HPA should target metrics that reflect actual load. For web services, requests per second or active connections usually track demand better than CPU. For async workers, queue depth or consumer lag is appropriate. Define stabilization windows to prevent thrashing during traffic volatility.
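For the async worker case, a queue-depth target can be expressed with an External metric. This is a minimal sketch that assumes an external metrics adapter already exposes a metric for the queue; the metric name, label, and workload names are hypothetical.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: queue-worker-hpa                # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: queue-worker                  # hypothetical Deployment
  minReplicas: 2
  maxReplicas: 30
  metrics:
    - type: External
      external:
        metric:
          name: queue_messages_ready    # hypothetical metric served by an external metrics adapter
          selector:
            matchLabels:
              queue: orders             # hypothetical label identifying the queue
        target:
          type: AverageValue
          averageValue: "30"            # aim for roughly 30 queued messages per replica
```

With `AverageValue`, the HPA divides the total metric value by the current replica count, so the target reads as backlog per worker rather than as a cluster-wide total.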
Step 3: Integrate Cluster Autoscaler
CA monitors pending pods and scales node groups accordingly. It requires cloud provider integration (AWS ASG, GCE MIG, Azure VMSS) and proper node group tagging. CA respects PodDisruptionBudgets and node affinity, but misconfigured taints or labels will block scaling.
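As an illustration of the node group tagging requirement, the fragment below sketches how Cluster Autoscaler is commonly started on AWS with tag-based auto-discovery. It is only the container portion of the Deployment; `<cluster-name>` and `<version>` are placeholders, and the tuning values are assumptions to adjust per cluster.

```yaml
# Fragment of the cluster-autoscaler container spec (AWS example).
# <cluster-name> and <version> are placeholders to fill in.
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:<version>
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      # Manage only Auto Scaling groups that carry both of these tags.
      - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<cluster-name>
      # Keep node groups with identical instance shapes at similar sizes.
      - --balance-similar-node-groups
      # Prefer the node group that leaves the least idle capacity after scale-up.
      - --expander=least-waste
      # How long a node must be unneeded before it becomes a removal candidate.
      - --scale-down-unneeded-time=10m
```

An Auto Scaling group missing either tag is invisible to CA in this setup, which is one common reason pods stay pending even though a suitable node group exists.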
Step 4: Deploy Event-Driven Scaling (Optional)
For workloads triggered by external systems (Kafka, RabbitMQ, SQS, HTTP webhooks), KEDA replaces or supplements a hand-written HPA. KEDA polls the external source through its scalers, exposes the result as external metrics, and manages an HPA that drives the replica count; KEDA itself handles activation between zero and one replica. This aligns scaling with event velocity rather than with resource usage.
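Because KEDA manages the HPA on your behalf, HPA tuning is set on the ScaledObject rather than on a separate HPA manifest. The sketch below assumes a Kafka consumer; the workload, broker, topic, and consumer group names are illustrative.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-consumer-scaledobject
spec:
  scaleTargetRef:
    name: kafka-consumer                # illustrative Deployment name
  minReplicaCount: 2
  maxReplicaCount: 50
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:                         # passed through to the HPA that KEDA creates
        scaleDown:
          stabilizationWindowSeconds: 150
          policies:
            - type: Percent
              value: 20
              periodSeconds: 120
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka-broker:9092   # illustrative broker address
        consumerGroup: processing-group
        topic: events
        lagThreshold: "100"             # target consumer lag per replica
```

Creating a second, hand-written HPA for the same Deployment would conflict with the one KEDA generates, so only one of the two should exist per workload.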
Step 5: Enforce Stability & Safety
Define PodDisruptionBudgets to limit how many pods are evicted at once when nodes are drained during scale-down. Use behavior.scaleDown.stabilizationWindowSeconds to delay removal of idle pods, so traffic bursts can reuse existing capacity without cold starts.
Architecture Rationale
- Metric Selection Over CPU: CPU is a proxy for compute, not demand. Network or queue metrics directly correlate with user impact.
- Stabilization Windows: The default scale-down window of 300s is too slow for many modern SLAs, while scale-up defaults to 0s (no stabilization). A scale-up window of 30-60s and a scale-down window of 120-180s dampens noise while still absorbing returning traffic.
- VPA + HPA Coupling: VPA sets accurate requests; HPA scales replicas. Using VPA Auto alone causes pod churn and breaks stateful assumptions.
- CA Node Group Alignment: CA scales node pools, not individual nodes. Node groups must have matching labels, taints, and instance types to avoid scheduling deadlocks.
Code Examples
HPA with Custom Metric (Prometheus Adapter)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-frontend-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-frontend
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "500"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 45
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 150
      policies:
        - type: Percent
          value: 20
          periodSeconds: 120
VPA in Recommendation Mode
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-frontend-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-frontend
  updatePolicy:
    updateMode: "Off" # Recommendation only
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: "100m"
          memory: "128Mi"
        maxAllowed:
          cpu: "2"
          memory: "2Gi"
KEDA ScaledObject for Kafka
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-consumer-scaledobject
spec:
  scaleTargetRef:
    name: kafka-consumer
  pollingInterval: 15
  cooldownPeriod: 30
  minReplicaCount: 2
  maxReplicaCount: 50
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka-broker:9092
        consumerGroup: processing-group
        topic: events
        lagThreshold: "100"
Pitfall Guide
1. Missing or Misaligned Resource Requests
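HPA calculates utilization as (current usage / request) * 100. If any container in a pod lacks the relevant request, utilization is undefined for that pod and the HPA takes no scaling action on that metric (the target shows as <unknown>). Misaligned requests cause premature scaling when set too low, and wasted capacity or unschedulable pods when set too high. Always define requests; use VPA recommendations to calibrate.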
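The requirement applies to every container in the pod, including sidecars. The pod template fragment below is a minimal sketch of that; the container names, images, and request values are hypothetical.

```yaml
# Pod template fragment; container names, images, and values are hypothetical.
spec:
  containers:
    - name: app
      image: myregistry/demo-app:1.0.0
      resources:
        requests:
          cpu: "250m"
          memory: "256Mi"
    - name: log-shipper               # sidecar: also needs requests for utilization to be defined
      image: myregistry/log-shipper:1.0.0
      resources:
        requests:
          cpu: "50m"
          memory: "64Mi"
```

A sidecar injected without requests, for example by a service mesh or logging agent, is a frequent cause of an HPA that reports `<unknown>` for CPU.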
2. Aggressive Scale-Up Thresholds
Setting the target averageValue too low creates artificial saturation. If a pod handles 500 req/s comfortably, targeting 200 req/s forces unnecessary replica creation. Calibrate thresholds against load testing data, not theoretical capacity.
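The effect of the target on replica count follows directly from the HPA's scaling formula. The worked example below assumes 4 replicas each serving 400 req/s; those load figures are hypothetical.

```latex
\[
\text{desiredReplicas} =
\left\lceil \text{currentReplicas} \times
\frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \right\rceil
\]

% Hypothetical load: 4 replicas, each at 400 req/s.
\[
\text{target } 200:\quad \left\lceil 4 \times \tfrac{400}{200} \right\rceil = 8
\qquad\qquad
\text{target } 500:\quad \left\lceil 4 \times \tfrac{400}{500} \right\rceil = \lceil 3.2 \rceil = 4
\]
```

With the 200 req/s target the HPA doubles the Deployment, while the 500 req/s target leaves it at 4 replicas under the same traffic.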
3. Ignoring Pod Startup Latency
Scale-up requests new pods, but initialization (image pull, init containers, health checks) adds 10-40s. If the metric spikes faster than pods can start, the existing pods saturate before new capacity is ready. Use readiness probes (or readinessGates for external conditions) so traffic reaches only ready pods, and align stabilizationWindowSeconds with actual boot time.
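One way to make boot time explicit is to pair a startup probe with a readiness probe. This is a sketch under assumed values: the image, port, endpoint paths, and timings are hypothetical and should come from measured startup times.

```yaml
# Container fragment; image, port, paths, and timings are hypothetical.
containers:
  - name: app
    image: myregistry/demo-app:1.0.0
    ports:
      - containerPort: 8080
    startupProbe:                 # allows up to 30 x 2s = 60s for the process to boot
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 2
      failureThreshold: 30
    readinessProbe:               # gates Service traffic once the startup probe has succeeded
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5
      failureThreshold: 3
```

The readiness probe only begins after the startup probe passes, so a slow boot does not cause restarts and the pod receives no Service traffic until it reports ready.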
4. VPA Auto Mode in Production
VPA Auto recreates pods to apply new resource requests. This breaks in-memory caches, active connections, and stateful assumptions. Use Off or Initial mode. Apply recommendations manually or via GitOps pipelines.
5. Cluster Autoscaler Node Group Misconfiguration
CA scales node groups, not individual nodes. If node groups lack proper labels, taints, or instance type diversity, pending pods remain unschedulable. CA will not provision nodes that violate affinity rules or exceed cluster resource limits. Verify node group tags and scheduling constraints.
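The scheduling constraints on the pod have to match what the node group's nodes actually carry. The fragment below sketches a pod pinned to a dedicated node group; the label key, taint key, and workload names are hypothetical.

```yaml
# Pod template fragment; label and taint keys are hypothetical and must match
# the labels and taints applied to the target node group's nodes.
spec:
  nodeSelector:
    workload-type: batch
  tolerations:
    - key: dedicated
      operator: Equal
      value: batch
      effect: NoSchedule
  containers:
    - name: worker
      image: myregistry/worker:1.0.0
      resources:
        requests:
          cpu: "500m"
          memory: "512Mi"
```

If no node group can produce a node with `workload-type=batch` that this pod tolerates, CA concludes that scaling up would not help and the pod stays pending. On AWS, a node group that scales from zero typically also needs `k8s.io/cluster-autoscaler/node-template/label/<key>` and `k8s.io/cluster-autoscaler/node-template/taint/<key>` tags so CA can predict what a new node would look like.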
6. Custom Metric Pipeline Bottlenecks
Prometheus scrape intervals, adapter latency, and metric cardinality directly impact HPA responsiveness. A 30s scrape interval + 15s adapter delay = 45s feedback lag. Align scrape intervals with workload volatility. Avoid high-cardinality labels in scaling metrics.
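The adapter's rule configuration is where the rate window and label handling are decided. The sketch below assumes the application exports a counter named `http_requests_total` with `namespace` and `pod` labels; that series name and the 2m window are assumptions, not values from a real deployment.

```yaml
# Prometheus Adapter rule fragment; the series name and rate window are assumptions.
rules:
  - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}   # map the label to the Kubernetes resource
        pod: {resource: "pod"}
    name:
      matches: "^(.*)_total$"
      as: "${1}_per_second"                  # exposed to the HPA as http_requests_per_second
    # Summing by the requested resource drops any other, high-cardinality labels.
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```

The rate window should span several scrape intervals. A shorter window reacts faster but is noisier, so it trades feedback lag against scaling stability.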
7. Scale-Down Thrashing Without Stabilization
The default 300s scale-down stabilization window is often set to 0, which lets the HPA remove pods during every brief traffic dip. Subsequent spikes recreate those pods and pay the cold-start cost again. Keep scaleDown.stabilizationWindowSeconds non-zero, and watch kubectl get hpa -w and the events in kubectl describe hpa to detect oscillation.
Best Practices from Production
- Right-size before autoscaling. VPA recommendations eliminate guesswork.
- Use custom metrics for I/O-bound workloads. CPU is a lagging indicator.
- Align stabilization windows with actual startup/shutdown times.
- Test scaling behavior under controlled load injection before production rollout.
- Monitor kube_pod_status_phase and scheduler_pending_pods to detect CA bottlenecks.
- Never disable PDBs during scaling events. They prevent cascading failures.
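The pending-pod monitoring above can be turned into an alert. The rule file below is a minimal sketch that assumes kube-state-metrics is being scraped; the group name, alert name, threshold, and duration are hypothetical starting points.

```yaml
# Prometheus alerting rule file; names, threshold, and duration are hypothetical.
groups:
  - name: autoscaling-capacity
    rules:
      - alert: PodsPendingTooLong
        # kube_pod_status_phase is exported by kube-state-metrics.
        expr: sum by (namespace) (kube_pod_status_phase{phase="Pending"}) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pods pending for more than 5 minutes in {{ $labels.namespace }}"
```

The `for: 5m` clause ignores the short pending period that is normal while CA provisions a node, so the alert fires only when capacity is not arriving.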
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Stateful web service with predictable traffic | HPA + Custom Metrics (Prometheus) | Aligns scaling with actual request load, reduces idle replicas | -25% waste |
| Async job processor with bursty queues | KEDA + Queue Depth Trigger | Event-driven scaling eliminates polling overhead and cold starts | -35% waste |
| Variable memory footprint workloads | VPA (Recommendation) + HPA | Right-sizes requests, prevents OOMKills and scheduler starvation | -20% waste |
| Multi-tenant cluster with strict SLOs | HPA + PDB + CA + Stabilization Windows | Prevents cascading failures, maintains capacity during scale-down | Neutral (stability gain) |
| Low-traffic internal tools | Static provisioning + VPA (Off) | Autoscaling overhead exceeds benefit; right-sizing suffices | -15% waste |
Configuration Template
# namespace: autoscaling-demo
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      containers:
        - name: app
          image: myregistry/demo-app:latest
          resources:
            requests:
              cpu: "250m"
              memory: "256Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"
---
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: demo-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: demo-app
  updatePolicy:
    updateMode: "Off"
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: demo-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: demo-app
  minReplicas: 3
  maxReplicas: 15
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 45
    scaleDown:
      stabilizationWindowSeconds: 150
      policies:
        - type: Percent
          value: 25
          periodSeconds: 60
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: demo-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: demo-app
Quick Start Guide
- Apply resource requests: Ensure all Deployments define resources.requests. Deploy VPA in Off mode and collect metrics for 7 days.
- Create HPA manifest: Use the template above. Set scaleTargetRef.name to your Deployment name. Adjust minReplicas, maxReplicas, and metric thresholds based on load test data.
- Deploy PDB: Apply the PodDisruptionBudget so that node scale-down cannot evict all replicas simultaneously.
- Verify scaling behavior: Run kubectl get hpa -w and inject traffic using hey or k6. Confirm replica count increases within 45-60s and stabilizes.
- Monitor and tune: Check kubectl describe hpa <name> for metric readings and scaling events. Adjust stabilization windows and thresholds if oscillation occurs. Enable Cluster Autoscaler node group tagging if pods remain pending.