ive Delivery Controller
Argo Rollouts is a widely used controller for this pattern. It extends the Kubernetes API with a Rollout custom resource that manages stable/canary services, traffic routing, and metric analysis.
kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml
Step 2: Define the Rollout and Traffic Services
The architecture uses two services: stable-svc (production traffic) and canary-svc (test traffic). The controller shifts traffic between them based on analysis results.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-service
spec:
  replicas: 10
  strategy:
    canary:
      # Services the controller points at the canary and stable ReplicaSets.
      # Without a trafficRouting block (e.g. alb or istio), weights are
      # approximated by the ratio of canary to stable pods.
      canaryService: canary-svc
      stableService: stable-svc
      steps:
        - setWeight: 10
        - pause: {duration: 60s}
        - setWeight: 25
        - pause: {duration: 60s}
        - analysis:
            templates:
              - templateName: error-rate-check
        - setWeight: 50
        - pause: {duration: 60s}
        - setWeight: 100
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
    spec:
      containers:
        - name: api
          image: registry.internal/api-service:v2.4.1
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: stable-svc
spec:
  # Select only on the pod label; the controller adds a
  # rollouts-pod-template-hash selector at runtime to pin the stable ReplicaSet.
  selector:
    app: api-service
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: canary-svc
spec:
  # Same base selector; the controller pins it to the canary ReplicaSet.
  selector:
    app: api-service
  ports:
    - port: 80
      targetPort: 8080
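To see what the step list above implies for a 10-replica Rollout, the following TypeScript sketch walks the weights and estimates canary pod counts. It assumes replica-based weighting (no traffic router), with the canary count approximated as the weight share of replicas rounded up. The `Step` type and the function names are illustrative and are not part of Argo Rollouts.

```typescript
// Illustrative model of the canary steps; these names are hypothetical, not an Argo Rollouts API.
type Step =
  | { kind: 'setWeight'; weight: number }
  | { kind: 'pause'; seconds: number }
  | { kind: 'analysis'; template: string };

const steps: Step[] = [
  { kind: 'setWeight', weight: 10 },
  { kind: 'pause', seconds: 60 },
  { kind: 'setWeight', weight: 25 },
  { kind: 'pause', seconds: 60 },
  { kind: 'analysis', template: 'error-rate-check' },
  { kind: 'setWeight', weight: 50 },
  { kind: 'pause', seconds: 60 },
  { kind: 'setWeight', weight: 100 },
];

// Assumption: canary pods are roughly ceil(replicas * weight / 100).
function canaryReplicas(replicas: number, weight: number): number {
  return Math.ceil((replicas * weight) / 100);
}

// Sum of the fixed pauses: a lower bound on rollout time (analysis time is extra).
function minimumPauseSeconds(plan: Step[]): number {
  return plan.reduce((total, s) => (s.kind === 'pause' ? total + s.seconds : total), 0);
}

// Canary pod count after each setWeight step, in order.
function canarySchedule(replicas: number, plan: Step[]): number[] {
  return plan
    .filter((s): s is Extract<Step, { kind: 'setWeight' }> => s.kind === 'setWeight')
    .map((s) => canaryReplicas(replicas, s.weight));
}
```

Under these assumptions, the 25% step rounds up to three canary pods out of ten, so the real traffic share can drift from the nominal weight unless a traffic router enforces it.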
Native K8s cannot evaluate business SLOs. A TypeScript-based controller extension or CI/CD hook queries Prometheus and gates promotion. This satisfies the requirement for application-layer validation while keeping K8s manifests declarative.
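import { KubeConfig, CustomObjectsApi } from '@kubernetes/client-node';

// Subset of the Prometheus HTTP API instant-query response (/api/v1/query)
interface PromQueryResponse {
  data: { result: Array<{ value: [number, string] }> };
}

export class RolloutValidator {
  private k8s: CustomObjectsApi;
  private prometheusUrl: string;

  constructor() {
    const kc = new KubeConfig();
    kc.loadFromCluster();
    // Rollout is a custom resource, so it is patched via CustomObjectsApi rather than AppsV1Api
    this.k8s = kc.makeApiClient(CustomObjectsApi);
    this.prometheusUrl = process.env.PROMETHEUS_URL ?? 'http://prometheus.monitoring:9090';
  }

  async validateCanary(namespace: string, rolloutName: string): Promise<boolean> {
    // Fetch error rate for canary pods over last 5 minutes
    const labels = `namespace="${namespace}", rollout="${rolloutName}-canary"`;
    const query = `sum(rate(http_requests_total{${labels}, status=~"5.."}[5m])) / sum(rate(http_requests_total{${labels}}[5m]))`;
    // Global fetch requires Node.js 18+
    const res = await fetch(`${this.prometheusUrl}/api/v1/query?query=${encodeURIComponent(query)}`);
    if (!res.ok) throw new Error(`Prometheus query failed: HTTP ${res.status}`);
    const body = (await res.json()) as PromQueryResponse;

    // Sample values are strings; a missing or NaN sample means there is no canary traffic to judge
    const errorRate = Number(body.data.result[0]?.value[1] ?? NaN);
    const threshold = 0.02; // 2% max error rate

    // Fail closed: no data is treated the same as a breached threshold
    if (Number.isNaN(errorRate) || errorRate > threshold) {
      console.warn(`Canary validation failed: error rate ${errorRate} (threshold ${threshold})`);
      await this.pauseRollout(namespace, rolloutName);
      return false;
    }
    console.log(`Canary validated: error rate ${errorRate}`);
    return true;
  }

  // Pauses the Rollout so promotion stops. This does not revert traffic by itself;
  // a full abort is `kubectl argo rollouts abort <name>`.
  private async pauseRollout(namespace: string, name: string): Promise<void> {
    const body = { spec: { paused: true } };
    // Positional signature of @kubernetes/client-node 0.x; 1.x takes a single request object.
    // Custom resources accept merge patches, not strategic merge patches.
    await this.k8s.patchNamespacedCustomObject(
      'argoproj.io', 'v1alpha1', namespace, 'rollouts', name, body,
      undefined, undefined, undefined,
      { headers: { 'Content-Type': 'application/merge-patch+json' } }
    );
  }
}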
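The promotion decision itself can be kept separate from the network calls so it is easy to unit test. This is a minimal sketch, assuming the Prometheus HTTP API instant-query response shape, where each sample is `[timestamp, "value-as-string"]`. The `decide` helper and the `Verdict` type are hypothetical names introduced for illustration.

```typescript
// Subset of the Prometheus HTTP API instant-query response used here.
interface InstantVector {
  data: { result: Array<{ value: [number, string] }> };
}

type Verdict = 'promote' | 'halt' | 'inconclusive';

// Hypothetical helper: turns a query response into a promotion verdict.
function decide(response: InstantVector, threshold: number): Verdict {
  const raw = response.data.result[0]?.value[1];
  // Sample values arrive as strings; "NaN" appears when the denominator is zero.
  const errorRate = raw === undefined ? NaN : Number(raw);
  if (Number.isNaN(errorRate)) return 'inconclusive'; // no canary traffic to judge
  return errorRate > threshold ? 'halt' : 'promote';
}
```

Returning a separate `inconclusive` verdict leaves the caller free to choose between failing closed and waiting for more traffic, instead of silently treating missing data as a 0% error rate.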
Architecture Decisions & Rationale
- CRD over Deployment: Rollout maintains separate stable/canary service selectors, enabling traffic controllers (ALB, Istio, Nginx) to route based on service endpoints rather than pod labels.
- Pause Steps: Fixed-duration pauses prevent rapid promotion before metrics stabilize. Production systems require 60 to 120s windows for APM data to propagate.
- Analysis Templates: Tying promotion to AnalysisTemplate resources externalizes metric definitions, allowing reuse across services and version control.
- TypeScript Validation Layer: K8s controllers operate on state reconciliation, not time-series evaluation. A lightweight TS/Node process bridges Prometheus metrics to rollout state, enabling SLO-gated promotion without bloating the control plane.
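To make the Analysis Templates point concrete, the sketch below simulates how a series of measurements could be judged against a success condition and a failure limit. It assumes the semantics that an analysis fails once the number of failed measurements exceeds the limit. `judge` and its parameters are hypothetical and do not call any Argo Rollouts API.

```typescript
type Outcome = 'Successful' | 'Failed';

// Hypothetical simulation of metric analysis, not an Argo Rollouts API.
// Assumption: the run fails as soon as failures exceed `failureLimit`.
function judge(
  measurements: number[],
  successCondition: (value: number) => boolean,
  failureLimit: number,
): Outcome {
  let failures = 0;
  for (const value of measurements) {
    if (!successCondition(value)) failures += 1;
    if (failures > failureLimit) return 'Failed';
  }
  return 'Successful';
}

// Mirrors a "below 2% error rate" success condition.
const under2Percent = (rate: number): boolean => rate < 0.02;
```

With a limit of 2, two isolated spikes are tolerated and a third failing measurement ends the run, which is the kind of trade-off worth tuning per service.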
Pitfall Guide
- Treating replicas: 1 as a canary
  A single new pod does not isolate traffic. Without a dedicated canary service and a routing controller, requests reach the new binary wherever kube-proxy load balancing happens to send them. Result: unpredictable failure distribution.
- Missing or misconfigured readiness probes
  If readiness probes return 200 OK before application initialization completes, the controller marks pods as ready and shifts traffic to unready instances. Always validate downstream dependencies (DB pools, cache connections, auth tokens) in readiness checks.
- Ignoring PodDisruptionBudgets during traffic shifts
  PDBs protect against voluntary disruptions such as node drains. Controller scale-downs do not go through the eviction API, so a drain that overlaps a traffic shift removes pods on top of the rollout's own churn and drops capacity. Define minAvailable: 2 or maxUnavailable: 1 per service.
- Deploying mutable tags instead of immutable tags or digests
  Using :latest or other mutable tags breaks rollback determinism. Once a tag is overwritten, it no longer identifies the previous image, and that image's digest is hard to recover unless it was recorded. Enforce immutable tags or SHA256 digests in CI/CD. Store tag-to-digest mappings in a registry manifest.
- Relying on manual promotion without metric thresholds
  Human-driven canary promotion introduces latency and inconsistency. Operators promote too early (missed errors) or too late (wasted canary capacity). Bind setWeight steps to AnalysisTemplate thresholds that evaluate latency, error rate, and saturation.
- Mixing service mesh and ingress controller responsibilities
  Istio, Linkerd, and cloud ALBs all support traffic splitting. Running multiple routing layers creates conflicting weight assignments and header routing loops. Choose one traffic control plane and route all progressive delivery through it.
- Not testing rollback paths in staging
  Rollbacks fail when configuration drift, secret rotation, or database migrations are not reversible. Validate rollback procedures by simulating canary failure in staging. Ensure down migrations are idempotent and feature flags can disable new behavior without redeployment.
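The readiness-probe pitfall above is easier to avoid when the probe handler aggregates explicit dependency checks. The sketch below shows one way to do that. The `readiness` helper, the `Check` type, and the check names are hypothetical; wiring the result into an HTTP handler for /healthz is left to the application framework.

```typescript
// A dependency check resolves to true when the dependency is usable.
type Check = () => Promise<boolean>;

interface ReadinessResult {
  status: 200 | 503;
  failing: string[];
}

// Hypothetical helper: run all checks in parallel and report which ones failed.
async function readiness(checks: Record<string, Check>): Promise<ReadinessResult> {
  const results = await Promise.all(
    Object.entries(checks).map(async ([name, check]) => {
      try {
        return { name, ok: await check() };
      } catch {
        return { name, ok: false }; // a throwing check counts as not ready
      }
    }),
  );
  const failing = results.filter((r) => !r.ok).map((r) => r.name);
  return { status: failing.length === 0 ? 200 : 503, failing };
}
```

A handler would respond with `status` and, for debugging, the `failing` list, so the pod only receives traffic once every dependency reports healthy.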
Production Best Practices:
- Define SLOs per service before deployment automation.
- Use canary analysis automation, not timer-only steps.
- Enforce image signing (Cosign/Notary) and SBOM generation.
- Run chaos engineering tests on deployment pipelines quarterly.
- Separate control plane (Argo/Flux) from data plane (Ingress/Mesh).
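The mutable-tag pitfall and the image-signing practice above can be backed by a small CI gate that rejects unpinned image references. This is a rough sketch: `isPinnedRef` is a hypothetical helper, and a versioned tag only counts as immutable if the registry is configured to forbid overwrites.

```typescript
// Hypothetical CI check: accept only digest-pinned or explicitly versioned image references.
function isPinnedRef(image: string): boolean {
  // Pinned by digest, e.g. repo/name@sha256:<64 hex chars>
  if (/@sha256:[0-9a-f]{64}$/.test(image)) return true;
  // Look for a tag only in the last path segment, so a registry port is not mistaken for a tag.
  const lastSegment = image.slice(image.lastIndexOf('/') + 1);
  const colon = lastSegment.indexOf(':');
  if (colon === -1) return false; // no tag means an implicit :latest
  const tag = lastSegment.slice(colon + 1);
  return tag.length > 0 && tag !== 'latest';
}
```

Running this over every image field in rendered manifests before deployment keeps :latest and untagged references out of the Rollout spec.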
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Low-risk internal tooling | RollingUpdate + PDB | Simplicity outweighs traffic control needs; failure impact is contained | Low infra overhead, minimal CI/CD complexity |
| High-traffic customer API | Progressive Canary + ALB/Istio | Requires sub-minute rollback, dynamic weight shifting, and SLO-gated promotion | Moderate control plane cost, high reliability ROI |
| Compliance/financial workloads | Blue/Green + Immutable Audit | Binary state changes simplify compliance verification; traffic split is less critical than deterministic rollback | High infra duplication cost, low incident risk |
| AI/ML inference endpoints | Canary with latency/error thresholds | Model drift and GPU saturation require metric-driven promotion, not replica scaling | GPU cost scales with canary weight, but prevents silent accuracy degradation |
Configuration Template
Copy this bundle to implement progressive delivery with metric analysis. Adjust thresholds to match your SLOs.
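apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-check
spec:
  args:
    # Referenced as {{args.namespace}} in the query.
    # The default below is a placeholder; set it to your namespace.
    - name: namespace
      value: default
  metrics:
    - name: error-rate
      interval: 30s
      # Bounded number of measurements so the analysis step can finish;
      # the value 5 is an example, tune it to your SLO window.
      count: 5
      failureLimit: 2
      successCondition: result[0] < 0.02
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{namespace="{{args.namespace}}", status=~"5.."}[2m]))
            /
            sum(rate(http_requests_total{namespace="{{args.namespace}}"}[2m]))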
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-ingress
  annotations:
    alb.ingress.kubernetes.io/actions.canary-routing: |
      {
        "Type": "forward",
        "ForwardConfig": {
          "TargetGroups": [
            {"ServiceName": "stable-svc", "ServicePort": "80", "Weight": 90},
            {"ServiceName": "canary-svc", "ServicePort": "80", "Weight": 10}
          ]
        }
      }
spec:
  ingressClassName: alb
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: canary-routing
                port:
                  name: use-annotation
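The forward action in the annotation above is plain JSON whose two weights must add up to 100. The sketch below builds that value for a given canary weight. `forwardAction` is a hypothetical helper that mirrors the field names used in the annotation; when a controller manages ALB traffic routing, it is expected to rewrite this annotation itself, so the code only illustrates how the weights relate.

```typescript
interface TargetGroup {
  ServiceName: string;
  ServicePort: string;
  Weight: number;
}

interface ForwardAction {
  Type: 'forward';
  ForwardConfig: { TargetGroups: TargetGroup[] };
}

// Hypothetical helper: builds the annotation value for a given canary weight (0 to 100).
function forwardAction(stable: string, canary: string, port: string, canaryWeight: number): ForwardAction {
  if (!Number.isInteger(canaryWeight) || canaryWeight < 0 || canaryWeight > 100) {
    throw new RangeError(`canaryWeight must be an integer from 0 to 100, got ${canaryWeight}`);
  }
  return {
    Type: 'forward',
    ForwardConfig: {
      TargetGroups: [
        { ServiceName: stable, ServicePort: port, Weight: 100 - canaryWeight },
        { ServiceName: canary, ServicePort: port, Weight: canaryWeight },
      ],
    },
  };
}

// Serialized form matching the 90/10 split shown in the template.
const annotationValue = JSON.stringify(forwardAction('stable-svc', 'canary-svc', '80', 10));
```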
Quick Start Guide
- Install Argo Rollouts:
  kubectl create namespace argo-rollouts
  kubectl apply -n argo-rollouts -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml
- Apply the Rollout and dual-service manifests to your namespace.
- Verify controller recognition:
  kubectl get rollout api-service -w
- Update the image field in the Rollout spec and commit. The controller will create canary replicas, shift traffic to 10%, pause for 60s, move to 25%, and run the error-rate-check analysis before proceeding to 50% and 100%.
- Monitor promotion:
  kubectl argo rollouts get rollout api-service
  or use the Argo Rollouts dashboard. Trigger a manual rollback with kubectl argo rollouts abort api-service if metrics breach thresholds.