cy against defined thresholds.
3. Database Compatibility: Deployments must support backward and forward compatibility. The application must handle schema versions gracefully during the transition window.
Step-by-Step Implementation
1. Application Instrumentation (TypeScript)
The deployment controller requires metrics to make promotion decisions. The backend service must expose a metrics endpoint compatible with Prometheus.
// metrics.ts
import { Counter, Histogram, register } from 'prom-client';

// Request latency, labeled so it can be sliced by method, route, and status code.
export const httpRequestsDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.1, 0.3, 0.5, 1, 3, 5],
});

// Count of responses with a status code of 400 or above.
export const httpErrorsTotal = new Counter({
  name: 'http_errors_total',
  help: 'Total number of HTTP errors',
  labelNames: ['method', 'route', 'status_code'],
});

// Wrapper to instrument Express/Fastify routes
export const instrumentRoute = (method: string, route: string) => {
  return (req: any, res: any, next: any) => {
    const end = httpRequestsDuration.startTimer({ method, route });
    res.on('finish', () => {
      // Record the duration once the response has been sent.
      end({ status_code: res.statusCode.toString() });
      if (res.statusCode >= 400) {
        httpErrorsTotal.inc({ method, route, status_code: res.statusCode.toString() });
      }
    });
    next();
  };
};

// Expose metrics endpoint
export const getMetrics = async (req: any, res: any) => {
  res.setHeader('Content-Type', register.contentType);
  res.send(await register.metrics());
};
2. Kubernetes Rollout Definition
Argo Rollouts provides a Rollout custom resource that takes the place of the Kubernetes Deployment and adds canary-specific fields. This manifest defines the traffic strategy and analysis steps.
# rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: backend-api-rollout
spec:
  replicas: 10
  revisionHistoryLimit: 2
  selector:
    matchLabels:
      app: backend-api
  template:
    metadata:
      labels:
        app: backend-api
    spec:
      containers:
        - name: backend-api
          image: registry/backend-api:stable
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {duration: 60s}
        - setWeight: 25
        - pause: {duration: 60s}
        - analysis:
            templates:
              - templateName: error-rate-analysis
        - setWeight: 50
        - pause: {duration: 120s}
      trafficRouting:
        nginx:
          stableIngress: backend-api-ingress
      stableService: backend-api-stable
      canaryService: backend-api-canary
3. Analysis Template
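Define the success criteria. A measurement fails when the 5xx error rate exceeds 1% or p95 latency exceeds 500 ms. Once a metric accumulates more failed measurements than its failureLimit allows, the rollout aborts and traffic returns to the stable version.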
# analysis-template.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-analysis
spec:
  metrics:
    - name: error-rate
      interval: 30s
      # count bounds the run so the inline analysis step can complete;
      # the value 5 is an assumption, tune it to your release cadence.
      count: 5
      failureLimit: 2
      successCondition: result[0] <= 0.01
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          # The denominator uses the histogram's _count series, because
          # metrics.ts does not define an http_requests_total counter.
          query: |
            sum(rate(http_errors_total{status_code=~"5.."}[2m]))
            /
            sum(rate(http_request_duration_seconds_count[2m]))
    - name: latency-p95
      interval: 30s
      count: 5
      failureLimit: 3
      successCondition: result[0] <= 0.5
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.95,
              sum(rate(http_request_duration_seconds_bucket[2m])) by (le))
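To make the interaction between successCondition and failureLimit concrete, here is a minimal TypeScript sketch of the decision logic. The function names are hypothetical and this is not Argo Rollouts' source code; it assumes a metric is marked failed once its failed measurements exceed failureLimit.

```typescript
type MetricVerdict = 'Successful' | 'Failed';

// Error ratio as the PromQL query expresses it: 5xx rate over total request rate.
// The zero guard is a simplification; PromQL itself yields NaN for 0/0.
function errorRatio(errorsPerSecond: number, requestsPerSecond: number): number {
  return requestsPerSecond === 0 ? 0 : errorsPerSecond / requestsPerSecond;
}

// A measurement passes when it satisfies "result[0] <= threshold".
// Assumption: the metric fails once failures exceed failureLimit.
function evaluateMetric(
  measurements: number[],
  threshold: number,
  failureLimit: number,
): MetricVerdict {
  let failures = 0;
  for (const value of measurements) {
    if (!(value <= threshold)) {
      failures += 1;
    }
    if (failures > failureLimit) {
      return 'Failed';
    }
  }
  return 'Successful';
}

// Two breaches are tolerated with failureLimit 2; the third aborts.
const tolerated = evaluateMetric([0.004, 0.02, 0.03, 0.005], 0.01, 2);
const aborted = evaluateMetric([0.004, 0.02, 0.03, 0.005, 0.05], 0.01, 2);
console.log(tolerated, aborted);
```

A higher failureLimit absorbs short spikes at the cost of slower rollback, so it is worth choosing together with the measurement interval.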
Rationale
This architecture shifts traffic only in increments. The pause steps leave room for manual verification or integration with external systems (e.g., triggering load tests). The analysis step gates promotion: if a metric degrades past its failure limit, the controller aborts the rollout and returns traffic to the stable service. The TypeScript instrumentation supplies request-level error and latency data, so promotion decisions rest on observed service behavior rather than on health checks alone.
Pitfall Guide
1. Breaking Database Schema Compatibility
Mistake: Deploying a migration that removes a column or changes a type while old application instances are still running.
Impact: Runtime errors, data corruption, or service crashes during the overlap window.
Best Practice: Use the Expand/Contract pattern. Phase 1: Expand schema (add columns, make nullable). Deploy code that handles both old and new schemas. Phase 2: Backfill data if needed. Phase 3: Deploy code that uses new schema exclusively. Phase 4: Contract schema (remove old columns).
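The "code that handles both old and new schemas" phase can be sketched as follows. The column names (full_name, first_name, last_name) and the helper names are hypothetical, chosen only to illustrate a reader that tolerates both shapes and a writer that populates both during the transition window.

```typescript
// Row shape during the transition: the old column and the new columns coexist.
interface UserRow {
  id: number;
  full_name?: string | null;  // old column, dropped in the Contract phase
  first_name?: string | null; // new columns, nullable during the Expand phase
  last_name?: string | null;
}

interface AppUser {
  id: number;
  firstName: string;
  lastName: string;
}

// Prefer the new columns; fall back to the old one for rows not yet backfilled.
function readUser(row: UserRow): AppUser {
  if (row.first_name != null && row.last_name != null) {
    return { id: row.id, firstName: row.first_name, lastName: row.last_name };
  }
  const [first = '', ...rest] = (row.full_name ?? '').trim().split(/\s+/);
  return { id: row.id, firstName: first, lastName: rest.join(' ') };
}

// Dual-write so that old application instances can still read the row.
function writeUser(user: AppUser): UserRow {
  return {
    id: user.id,
    full_name: `${user.firstName} ${user.lastName}`.trim(),
    first_name: user.firstName,
    last_name: user.lastName,
  };
}

console.log(readUser({ id: 1, full_name: 'Ada Lovelace' }));
```

Once every row is backfilled and no old instances remain, the fallback branch and the full_name write can be removed before the column is dropped.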
2. Ignoring Session Affinity in Blue/Green
Mistake: Switching traffic from Blue to Green without accounting for sticky sessions or in-memory caches.
Impact: Users lose session state, resulting in forced logouts or cart abandonment.
Best Practice: Externalize session state to Redis or a database. If sticky sessions are unavoidable, implement a "drain" period or cookie-based migration strategy before the traffic switch.
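Where sticky sessions cannot be avoided, the drain period can be expressed as a small routing rule. The sketch below is one possible shape, assuming a hypothetical cookie that records which environment created the session and a fixed drain deadline; neither comes from a specific load balancer's API.

```typescript
type BackendColor = 'blue' | 'green';

interface RoutingInput {
  // Value of a hypothetical "backend" cookie set when the session was created.
  pinnedColor?: BackendColor;
  nowMs: number;
  // After this instant, Blue stops receiving traffic entirely.
  drainDeadlineMs: number;
}

// Existing Blue sessions stay on Blue until the deadline; everything else goes to Green.
function chooseBackend(input: RoutingInput): BackendColor {
  const draining = input.nowMs < input.drainDeadlineMs;
  if (input.pinnedColor === 'blue' && draining) {
    return 'blue';
  }
  return 'green';
}

const deadline = 60_000;
console.log(chooseBackend({ pinnedColor: 'blue', nowMs: 10_000, drainDeadlineMs: deadline }));
console.log(chooseBackend({ nowMs: 10_000, drainDeadlineMs: deadline }));
```

The deadline bounds how long both environments must stay up, which is the main cost of this approach compared with externalized session state.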
3. Cold Start Latency Skewing Metrics
Mistake: Canary analysis triggers a rollback because new pods have high latency during initialization, not due to code defects.
Impact: False positive rollbacks, deployment churn.
Best Practice: Configure analysis to ignore the first N seconds of a pod's life or use warm-up probes. Ensure metrics queries account for pod age.
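The effect of excluding warm-up samples can be illustrated with a short calculation. This sketch assumes each latency sample carries the age of the pod that served it; the types and function names are hypothetical, and the percentile uses the nearest-rank method rather than Prometheus' bucket interpolation.

```typescript
interface LatencySample {
  podAgeSeconds: number;
  durationSeconds: number;
}

// Nearest-rank percentile; returns NaN when there are no values.
function nearestRankPercentile(values: number[], q: number): number {
  if (values.length === 0) {
    return NaN;
  }
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil(q * sorted.length) - 1;
  const index = Math.min(sorted.length - 1, Math.max(0, rank));
  return sorted[index] ?? NaN;
}

// p95 over samples served by pods that are at least warmupSeconds old.
function warmP95(samples: LatencySample[], warmupSeconds: number): number {
  const warm = samples
    .filter((s) => s.podAgeSeconds >= warmupSeconds)
    .map((s) => s.durationSeconds);
  return nearestRankPercentile(warm, 0.95);
}

const observed: LatencySample[] = [
  { podAgeSeconds: 5, durationSeconds: 2.5 },   // cold start
  { podAgeSeconds: 10, durationSeconds: 1.8 },  // cold start
  { podAgeSeconds: 120, durationSeconds: 0.1 },
  { podAgeSeconds: 130, durationSeconds: 0.12 },
  { podAgeSeconds: 140, durationSeconds: 0.2 },
  { podAgeSeconds: 150, durationSeconds: 0.3 },
];
console.log(warmP95(observed, 0), warmP95(observed, 60));
```

With no warm-up window the two cold-start samples dominate the percentile; with a 60-second window the result reflects steady-state behavior. In Argo Rollouts, a comparable effect may be achievable by delaying the first measurement with the metric's initialDelay field.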
4. Dependency Version Mismatch
Mistake: Deploying a microservice that calls a downstream service with an incompatible API version.
Impact: Cascading failures across the service mesh.
Best Practice: Implement Contract Testing (e.g., Pact) in the CI pipeline. Use versioned APIs and ensure backward compatibility for consumers before deploying producers.
5. Manual Rollback Bottlenecks
Mistake: Relying on an engineer to manually trigger a rollback when alerts fire.
Impact: Extended outage duration due to human reaction time and decision latency.
Best Practice: Automate rollback triggers based on SLO breaches. The deployment controller should be the source of truth for rollback actions.
6. Testing in Production Without Isolation
Mistake: Canary traffic includes internal test bots or non-representative user segments.
Impact: Metrics are polluted, leading to incorrect promotion decisions.
Best Practice: Filter internal traffic from analysis metrics. Use header-based routing for internal testing if needed, but exclude these requests from canary success calculations.
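One way to keep internal traffic out of the canary calculation is to tag it with a header and skip tagged requests when recording. The header name and the class below are hypothetical; in a Prometheus setup the same idea could be applied by adding a label and excluding it in the query.

```typescript
// Hypothetical header that internal test clients are assumed to send.
const INTERNAL_HEADER = 'x-internal-test';

// Header names are expected in lower case, as Node's HTTP server provides them.
type HeaderMap = Record<string, string | string[] | undefined>;

function isInternalRequest(headers: HeaderMap): boolean {
  const value = headers[INTERNAL_HEADER];
  return value !== undefined && value.length > 0;
}

// Tracks the 5xx error rate over external requests only.
class CanaryStats {
  private total = 0;
  private errors = 0;

  record(headers: HeaderMap, statusCode: number): void {
    if (isInternalRequest(headers)) {
      return; // internal traffic never influences promotion decisions
    }
    this.total += 1;
    if (statusCode >= 500) {
      this.errors += 1;
    }
  }

  errorRate(): number {
    return this.total === 0 ? 0 : this.errors / this.total;
  }
}

const stats = new CanaryStats();
stats.record({}, 200);
stats.record({ [INTERNAL_HEADER]: 'load-test' }, 500); // ignored
console.log(stats.errorRate());
```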
7. Stateful Service Deployment
Mistake: Applying stateless deployment patterns to stateful workloads without partitioning.
Impact: Data loss or consistency violations.
Best Practice: For stateful backends, use Rolling updates with partition strategy or migrate state to external storage. Never use Blue/Green for stateful services unless you have a dual-write replication strategy.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High-Traffic E-Commerce API | Canary with Automated Analysis | Minimizes blast radius; protects revenue; handles traffic spikes gracefully. | Medium (Incremental infra) |
| Critical DB Migration | Blue/Green + Expand/Contract | Ensures clean switch; allows instant rollback if migration fails; separates schema risk. | High (2x infra during switch) |
| Internal Admin Tool | Blue/Green | Low traffic reduces cost of duplication; instant rollback simplifies ops; low complexity. | Medium (Low absolute cost) |
| Legacy Monolith on VMs | Rolling Update | No native traffic splitting available; cost constraints; acceptable risk for low-criticality. | Low |
| Feature Experimentation | Feature Flags + Canary | Decouples deployment from release; allows A/B testing; reduces deployment risk. | Low (Code complexity cost) |
Configuration Template
Copy this template to implement a production-grade Canary Rollout with Argo Rollouts and Prometheus analysis. Adjust thresholds based on your SLOs.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: production-service
spec:
  replicas: 10
  selector:
    matchLabels:
      app: production-service
  template:
    metadata:
      labels:
        app: production-service
    spec:
      containers:
        - name: app
          image: registry/app:v1.0.0
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
  strategy:
    canary:
      maxSurge: "25%"
      maxUnavailable: 0
      steps:
        - setWeight: 5
        - pause: {} # Manual checkpoint for critical releases
        - analysis:
            templates:
              - templateName: production-analysis
        - setWeight: 20
        - pause: {duration: 30s}
        - setWeight: 50
        - pause: {duration: 60s}
      trafficRouting:
        nginx:
          stableIngress: production-ingress
      stableService: production-stable
      canaryService: production-canary
---
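apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: production-analysis
spec:
  metrics:
    - name: error-rate
      interval: 15s
      # count bounds the run so the inline analysis step can complete;
      # the value 8 is an assumption, adjust it to your SLO window.
      count: 8
      failureLimit: 3
      successCondition: result[0] <= 0.02
      provider:
        prometheus:
          # Same address as the analysis template above; adjust to your cluster.
          address: http://prometheus.monitoring:9090
          # Uses the series exported by metrics.ts; the histogram's _count
          # series serves as the total request count.
          query: |
            sum(rate(http_errors_total{status_code=~"5.."}[1m]))
            /
            sum(rate(http_request_duration_seconds_count[1m]))
    - name: p99-latency
      interval: 15s
      count: 8
      failureLimit: 2
      successCondition: result[0] <= 1.0
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket[1m])) by (le))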
Quick Start Guide
- Install Argo Rollouts: Deploy the controller to your cluster with `kubectl create namespace argo-rollouts`, then `kubectl apply -n argo-rollouts -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml`.
- Instrument Your Service: Add the TypeScript metrics middleware to your backend and expose the `/metrics` endpoint. Ensure Prometheus scrapes this endpoint.
- Apply the Rollout: Replace `Deployment` resources with the Rollout manifest provided in the Configuration Template. Update image references and service names.
- Verify Traffic Routing: Confirm that your Ingress controller is configured to support canary routing. Check that the Services referenced by `stableService` and `canaryService` exist.
- Trigger a Release: Update the image in the Rollout spec. Monitor progress with `kubectl argo rollouts get rollout production-service` (requires the Argo Rollouts kubectl plugin). Verify that traffic shifts incrementally and metrics are analyzed automatically.