must be tuned to prevent cascading failures. Key parameters include maxUnavailable, maxSurge, minReadySeconds, and PodDisruptionBudgets.
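Architecture Decision: Use minReadySeconds to require that a new pod stays ready for a set period before it counts as available. The readiness probe decides when a pod receives traffic; minReadySeconds decides when the rollout may proceed. A pod that passes its probe and then crashes or degrades during warm-up therefore stalls the rollout instead of letting it replace more healthy pods.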
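apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
spec:
  replicas: 4
  selector:                # required for apps/v1 Deployments
    matchLabels:
      app: api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1    # at most 1 pod below the desired count during the rollout
      maxSurge: 1          # at most 1 pod above the desired count during the rollout
  minReadySeconds: 30      # a pod must stay ready this long to count as available
  revisionHistoryLimit: 5
  template:
    metadata:
      labels:
        app: api           # must match spec.selector
    spec:
      containers:
        - name: api
          image: api:v1.2.0
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 3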
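PodDisruptionBudgets are listed among the key parameters but are not part of the manifest above. The following is a minimal sketch of one, assuming the pods carry the label `app: api`; the name `api-service-pdb` is hypothetical.

```yaml
# Sketch only: assumes the api-service pods are labelled app: api.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-service-pdb      # hypothetical name
spec:
  maxUnavailable: 1          # allow at most one voluntary eviction at a time
  selector:
    matchLabels:
      app: api
```

A PDB limits voluntary evictions such as node drains. It does not throttle the Deployment controller's own rolling update, which is governed by maxUnavailable and maxSurge, but pods that are down because of a rollout still count against the budget.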
2. Blue/Green Deployment
Blue/Green maintains two identical environments. The Service selector switches traffic from the active (Blue) version to the idle (Green) version only after validation.
Architecture Decision: This pattern requires double the compute resources. It is best suited for critical services where downtime is unacceptable and resource costs are secondary to stability. Validation should be automated via smoke tests against the Green deployment before switching.
# Active Service pointing to Blue
apiVersion: v1
kind: Service
metadata:
  name: api-service
spec:
  selector:
    app: api
    version: blue
  ports:
    - port: 80
      targetPort: 8080
---
# Blue Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
      version: blue
  template:
    metadata:
      labels:
        app: api
        version: blue
    spec:
      containers:
        - name: api
          image: api:v1.1.0
---
# Green Deployment (Idle)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
      version: green
  template:
    metadata:
      labels:
        app: api
        version: green
    spec:
      containers:
        - name: api
          image: api:v1.2.0
Switch Mechanism:
To promote Green, update the Service selector:
kubectl patch service api-service -p '{"spec":{"selector":{"version":"green"}}}'
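Smoke tests need a stable address for Green before the switch. One common way to provide that is a second Service that always selects the idle version. This is a sketch under that assumption; the name `api-service-preview` is hypothetical and not part of the original setup.

```yaml
# Sketch only: a preview Service so smoke tests can reach Green before promotion.
apiVersion: v1
kind: Service
metadata:
  name: api-service-preview   # hypothetical name
spec:
  selector:
    app: api
    version: green            # always targets the idle (Green) pods
  ports:
    - port: 80
      targetPort: 8080
```

Tests run against the preview Service inside the cluster, and the production Service selector is patched only once they pass.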
3. Canary Deployment with Service Mesh
Canary requires traffic splitting capabilities beyond standard Kubernetes Services. A Service Mesh (Istio, Linkerd) or advanced Ingress Controller is required to route percentages of traffic based on headers, weights, or metrics.
Architecture Decision: Canary is mandatory for high-risk releases. It must be coupled with automated analysis. Tools like Argo Rollouts or Flagger can automate traffic shifting based on error rates and latency. Manual canary promotion is error-prone and slow.
# Istio VirtualService for Canary Weighting
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: api-canary
spec:
  hosts:
    - api-service
  http:
    - route:
        - destination:
            host: api-service
            subset: stable
          weight: 90
        - destination:
            host: api-service
            subset: canary
          weight: 10
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: api-destination
spec:
  host: api-service
  subsets:
    - name: stable
      labels:
        version: stable
    - name: canary
      labels:
        version: canary
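Automated traffic shifting with Argo Rollouts could look roughly like the sketch below. It assumes Argo Rollouts is installed in the cluster and reuses the `api-canary` VirtualService and `api-destination` DestinationRule defined above; the Rollout name, the `app: api` label, and the step weights and durations are illustrative choices, not values from the original setup.

```yaml
# Sketch only: an Argo Rollouts canary that adjusts the Istio weights step by step.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-rollout              # hypothetical name
spec:
  replicas: 4
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: api:v1.2.0
  strategy:
    canary:
      trafficRouting:
        istio:
          virtualService:
            name: api-canary     # VirtualService whose weights are rewritten
          destinationRule:
            name: api-destination
            stableSubsetName: stable
            canarySubsetName: canary
      steps:
        - setWeight: 10          # send 10% of traffic to the canary
        - pause: {duration: 5m}  # observe before widening
        - setWeight: 50
        - pause: {duration: 10m}
```

After the last step completes, the controller promotes the canary to receive all traffic. In this mode the Rollout replaces the Deployment as the workload object, so the two should not manage the same pods.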
4. Shadow Deployment
Shadowing duplicates traffic to a new version. The response from the shadow version is discarded. This validates performance and side effects without impacting user experience.
Architecture Decision: Use Shadow for database migration testing, latency profiling, and integration validation. Mirrored requests are executed for real, so if the shadow version writes to external systems, make those writes idempotent or point it at a shadow database.
# Istio Traffic Mirroring
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: api-shadow
spec:
  hosts:
    - api-service
  http:
    - route:
        - destination:
            host: api-service
            subset: stable
      mirror:
        host: api-service
        subset: shadow
      mirrorPercentage:
        value: 100.0
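The mirror target refers to a `shadow` subset, which has to be defined in a DestinationRule for the same host. A minimal sketch follows, assuming the shadow pods carry the label `version: shadow`; that label is an assumption, not something the original manifests define.

```yaml
# Sketch only: defines the shadow subset referenced by the mirror block.
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: api-destination
spec:
  host: api-service
  subsets:
    - name: stable
      labels:
        version: stable
    - name: shadow
      labels:
        version: shadow   # assumed pod label on the shadow Deployment
```

If canary and shadow are used together, the `canary` subset from the earlier DestinationRule would need to stay in the same list, since only one rule per host should define subsets.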
Pitfall Guide
- Ignoring minReadySeconds: Without it, Kubernetes counts a pod as available the moment its readiness probe passes, and the rollout moves straight on to replacing the next old pod. If the application requires warm-up time (e.g., loading caches, establishing connections), the rollout can finish with a fleet of cold pods, causing latency spikes.
  - Best Practice: Set minReadySeconds to exceed the application warm-up duration, and keep the readiness probe failing until the pod can actually serve traffic.
- Database Schema Incompatibility: Deploying a new version with a schema change that breaks the old version prevents rollback. If the new schema removes a column, the old version cannot function.
  - Best Practice: Enforce backward-compatible schema changes. Use the expand/contract pattern: add new columns/tables first, deploy code to use them, then remove old artifacts in a subsequent release.
- Resource Starvation During Surges: Configuring maxSurge without calculating cluster capacity can lead to FailedScheduling events. If the cluster is near capacity, the surge pods stay Pending and the rollout stalls.
  - Best Practice: Implement Cluster Autoscaling and calculate maxSurge based on available buffer resources. Use PodDisruptionBudgets to prevent voluntary disruptions from compounding resource pressure.
- Canary Without Metrics: Promoting a canary based on intuition or static time intervals defeats the purpose. If error rates spike in the canary pod, manual promotion will propagate the issue.
  - Best Practice: Integrate canary analysis with Prometheus/Grafana. Automate promotion and rollback based on thresholds for error rate, latency, and saturation.
- Sticky Sessions Breaking Canary: If a load balancer uses sticky sessions based on IP or headers, traffic splitting at the Service Mesh level may be ineffective. Users may remain pinned to the old version despite weight changes.
  - Best Practice: Disable sticky sessions during canary deployments or ensure traffic splitting occurs after session affinity is resolved.
- Misconfigured revisionHistoryLimit: By default a Deployment retains 10 old ReplicaSets. In high-frequency deployment environments, a limit that is too high leaves many stale objects in etcd, while a limit that is too low removes the revision you need to roll back to.
  - Best Practice: Set revisionHistoryLimit explicitly based on compliance requirements and storage constraints. Use GitOps to maintain history outside the cluster.
- Blue/Green Resource Leakage: Failing to scale down the inactive environment after a successful switch wastes resources. Conversely, scaling down too early prevents instant rollback if the new version fails hours later.
  - Best Practice: Automate the teardown of the inactive environment after a defined stabilization period. Use labels and garbage collection policies to manage lifecycle.
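For the "Canary Without Metrics" pitfall, threshold-based analysis could be expressed as an Argo Rollouts AnalysisTemplate along the following lines. This is a sketch: the Prometheus address, the `http_requests_total` metric with its `app`, `version`, and `code` labels, and the 99% threshold are all assumptions that depend on how the service is instrumented.

```yaml
# Sketch only: fails the canary when the success rate drops below 99%.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: api-success-rate                              # hypothetical name
spec:
  metrics:
    - name: success-rate
      interval: 1m                                    # evaluate once per minute
      failureLimit: 3                                 # abort after 3 failed measurements
      successCondition: result[0] >= 0.99
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090   # assumed address
          query: |
            sum(rate(http_requests_total{app="api",version="canary",code!~"5.."}[5m]))
            /
            sum(rate(http_requests_total{app="api",version="canary"}[5m]))
```

A Rollout can reference such a template from an analysis step or as background analysis, so that a failed measurement aborts the rollout and shifts traffic back to the stable version.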
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Low-risk internal tool | RollingUpdate | Simplicity and low resource overhead outweigh risk. | Low |
| Financial transaction service | Blue/Green | Zero downtime and instant rollback are critical. | High (2x resources) |
| Customer-facing API update | Canary | Minimizes blast radius; allows data-driven promotion. | Medium (Mesh + extra pods) |
| Performance optimization | Shadowing | Validates impact on production traffic without user risk. | Medium (Mirror traffic cost) |
| Database migration | Canary + Expand/Contract | Allows gradual shift to new schema version safely. | Medium |
| Emergency fix | Blue/Green | Fastest path to restore service if current version is broken. | High |
Configuration Template
This template provides a production-ready Blue/Green setup with Service selector management and PDB protection.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service-blue
  labels:
    app: payment-service
    version: blue
spec:
  replicas: 3
  revisionHistoryLimit: 5
  selector:
    matchLabels:
      app: payment-service
      version: blue
  template:
    metadata:
      labels:
        app: payment-service
        version: blue
    spec:
      containers:
        - name: payment
          image: payment-service:v2.1.0
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "1000m"
              memory: "1Gi"
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 15
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service-green
  labels:
    app: payment-service
    version: green
spec:
  replicas: 3
  revisionHistoryLimit: 5
  selector:
    matchLabels:
      app: payment-service
      version: green
  template:
    metadata:
      labels:
        app: payment-service
        version: green
    spec:
      containers:
        - name: payment
          image: payment-service:v2.2.0
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "1000m"
              memory: "1Gi"
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: payment-service
spec:
  selector:
    app: payment-service
    version: blue
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-service-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: payment-service
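The PDB in the template selects on `app: payment-service` only, so it matches Blue and Green pods together, and `minAvailable: 2` can be satisfied by any two of the six. If the intent is to protect the version that is serving traffic, a variant scoped to one colour may be closer to what is wanted. The sketch below assumes Blue is active; the name `payment-service-blue-pdb` is hypothetical.

```yaml
# Sketch only: a PDB scoped to the Blue pods instead of both colours.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-service-blue-pdb   # hypothetical name
spec:
  minAvailable: 2                  # keep at least 2 of the 3 Blue pods during drains
  selector:
    matchLabels:
      app: payment-service
      version: blue
```

A matching budget for Green would be needed once Green becomes the active version.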
Quick Start Guide
- Initialize Cluster Access: Ensure kubectl is configured and you have cluster-admin or namespace-level permissions.
  kubectl cluster-info
- Apply Baseline Manifest: Deploy the Blue/Green template provided above.
  kubectl apply -f blue-green-deployment.yaml
- Verify Active Version: Check that the Service routes traffic to the Blue version.
  kubectl get svc payment-service -o jsonpath='{.spec.selector}'
- Simulate Promotion: Patch the Service selector to switch traffic to Green.
  kubectl patch svc payment-service -p '{"spec":{"selector":{"version":"green"}}}'
- Validate Rollback: Revert selector to Blue to confirm rollback capability.
  kubectl patch svc payment-service -p '{"spec":{"selector":{"version":"blue"}}}'
- Monitor: Observe pod status and service endpoints during transitions.
  kubectl get endpoints payment-service -w