tion pattern**.
Architecture Decisions:
- Service Mesh (Istio): Chosen for granular traffic splitting based on headers and weights. Allows dynamic adjustment of canary percentage without redeploying pods.
- Feature Flag Service (LaunchDarkly/Unleash): Decouples deployment from release. Allows new code paths to be deployed but disabled, enabling safe database expansions.
- Expand/Contract Pattern: Ensures zero downtime during schema changes by maintaining backward compatibility throughout the migration lifecycle.
Step-by-Step Implementation:
1. Database Migration: Expand/Contract Pattern
Never drop columns or rename tables in a single deployment. The migration must span multiple deployments.
- Phase 1: Expand. Add new column, keep old column. Dual-write to both.
- Phase 2: Backfill. Migrate data from old to new column.
- Phase 3: Switch. Read from new column. Stop writing to old column.
- Phase 4: Contract. Remove old column and dual-write logic.
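The four phases can be read as a small state machine that tells the application which column to read and which columns to write. The sketch below is one way to encode that, assuming hypothetical names (`Phase`, `ColumnPlan`, `columnPlan`, `canAdvance`); only the phase names and their read/write rules come from the list above.

```typescript
// Hypothetical names throughout; only the four phase names come from the text.
type Phase = 'expand' | 'backfill' | 'switch' | 'contract';

interface ColumnPlan {
  readFrom: 'old' | 'new';
  writeTo: Array<'old' | 'new'>;
}

// Maps each phase to the column the application reads and the columns it writes.
function columnPlan(phase: Phase): ColumnPlan {
  switch (phase) {
    case 'expand':
    case 'backfill':
      // Old column stays the source of truth; writes are mirrored to the new one.
      return { readFrom: 'old', writeTo: ['old', 'new'] };
    case 'switch':
    case 'contract':
      // Reads and writes use only the new column.
      return { readFrom: 'new', writeTo: ['new'] };
  }
}

// A phase may only advance to its direct successor, one deployment at a time.
const PHASE_ORDER: Phase[] = ['expand', 'backfill', 'switch', 'contract'];

function canAdvance(from: Phase, to: Phase): boolean {
  return PHASE_ORDER.indexOf(to) === PHASE_ORDER.indexOf(from) + 1;
}
```

Keeping the plan in one function makes it harder for a deployment to skip a phase, for example jumping from Expand straight to Contract.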
TypeScript Implementation of Dual-Write Migration Manager:
import { Pool, PoolClient } from 'pg';

export class MigrationManager {
  constructor(private pool: Pool) {}

  async expandSchema(client: PoolClient): Promise<void> {
    // Phase 1: Expand
    // Add the new column as nullable to maintain backward compatibility.
    // old_payment_status already exists and is left untouched.
    await client.query(`
      ALTER TABLE orders
        ADD COLUMN IF NOT EXISTS new_payment_status VARCHAR(50);
    `);
    // Create index on the new column for performance.
    // CONCURRENTLY avoids blocking writes while the index builds, but it
    // cannot run inside a transaction block, so call this outside BEGIN/COMMIT.
    await client.query(`
      CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_new_payment_status
        ON orders(new_payment_status);
    `);
  }

  async dualWriteOrder(client: PoolClient, orderId: string, status: string): Promise<void> {
    // Application logic must write to both columns during the Expand phase
    await client.query(`
      UPDATE orders
      SET old_payment_status = $1,
          new_payment_status = $1
      WHERE id = $2
    `, [status, orderId]);
  }

  async backfillData(client: PoolClient): Promise<void> {
    // Phase 2: Backfill
    // Migrate existing data to the new column in batches of 1000 so each
    // statement holds row locks only briefly. PostgreSQL's UPDATE has no
    // LIMIT clause, so the batch is selected in a subquery.
    let updated: number;
    do {
      const result = await client.query(`
        UPDATE orders
        SET new_payment_status = old_payment_status
        WHERE id IN (
          SELECT id FROM orders
          WHERE new_payment_status IS NULL
            AND old_payment_status IS NOT NULL
          LIMIT 1000
        )
      `);
      updated = result.rowCount ?? 0;
    } while (updated > 0);
  }

  async switchReads(client: PoolClient): Promise<void> {
    // Phase 3: Switch
    // Application code changes to read from new_payment_status
    // Feature flag controls the switch
    console.log('Switching reads to new_payment_status');
  }

  async contractSchema(client: PoolClient): Promise<void> {
    // Phase 4: Contract
    // Remove old column and dual-write logic
    // Only safe after all instances are running the new code.
    // PostgreSQL drops any index on the column together with the column.
    await client.query(`
      ALTER TABLE orders
        DROP COLUMN IF EXISTS old_payment_status;
    `);
  }
}
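The backfill phase is easier to reason about when the batching loop is separated from the SQL. The following is a minimal sketch of such a loop, assuming a hypothetical `runBatch` callback that performs one bounded UPDATE and resolves to the number of rows it changed; `fakeTable` is an in-memory stand-in used only to exercise the loop, and `maxBatches` is an assumed safety cap.

```typescript
// runBatch is a hypothetical callback: one bounded UPDATE, resolving to rows changed.
type RunBatch = (batchSize: number) => Promise<number>;

interface BackfillResult {
  batches: number;
  rows: number;
}

// Repeats bounded batches until a batch changes nothing or the cap is reached.
async function backfillInBatches(
  runBatch: RunBatch,
  batchSize: number,
  maxBatches: number
): Promise<BackfillResult> {
  let batches = 0;
  let rows = 0;
  while (batches < maxBatches) {
    const changed = await runBatch(batchSize);
    if (changed === 0) break; // nothing left to migrate
    batches += 1;
    rows += changed;
  }
  return { batches, rows };
}

// In-memory stand-in for the orders table: tracks rows still awaiting backfill.
function fakeTable(pending: number): RunBatch {
  return async (batchSize) => {
    const changed = Math.min(batchSize, pending);
    pending -= changed;
    return changed;
  };
}
```

Calling `backfillInBatches(fakeTable(2500), 1000, 10)` resolves after three non-empty batches covering 2500 rows. In a real run, a short pause between batches may also be worth adding so replicas can keep up.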
2. Canary Traffic Splitting with Istio
An Istio VirtualService defines the traffic routing. The canary weight is adjusted via API or a GitOps pipeline based on metrics. The stable and canary subsets it references must be declared in a matching DestinationRule.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: payment-service-vs
spec:
  hosts:
    - payment-service
  http:
    - route:
        - destination:
            host: payment-service
            subset: stable
          weight: 90
        - destination:
            host: payment-service
            subset: canary
          weight: 10
      retries:
        attempts: 3
        perTryTimeout: 2s
        retryOn: 5xx
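A pipeline that adjusts the canary weight needs two small pieces of logic: choosing the next percentage and producing the pair of route weights to write into the manifest. The sketch below illustrates both. The promotion ladder in `STEPS` and the function names are assumptions for illustration, not part of Istio.

```typescript
interface WeightedRoute {
  subset: 'stable' | 'canary';
  weight: number;
}

// Hypothetical promotion ladder; the manifest above starts the canary at 10%.
const STEPS = [10, 25, 50, 100];

// Returns the next canary percentage, or the current one if already at the top.
function nextCanaryWeight(current: number, steps: number[] = STEPS): number {
  for (const step of steps) {
    if (step > current) return step;
  }
  return current;
}

// Builds the two route entries, keeping the weights summing to 100
// as in the manifest above.
function routesFor(canaryWeight: number): WeightedRoute[] {
  if (!Number.isInteger(canaryWeight) || canaryWeight < 0 || canaryWeight > 100) {
    throw new RangeError('canary weight must be an integer between 0 and 100');
  }
  return [
    { subset: 'stable', weight: 100 - canaryWeight },
    { subset: 'canary', weight: canaryWeight },
  ];
}
```

For example, `routesFor(nextCanaryWeight(10))` yields a 75/25 split between stable and canary, which a GitOps job could then render into the VirtualService.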
3. Feature Flag Integration
Feature flags allow the new code path to be deployed but disabled. This enables the Expand phase to occur without changing application behavior immediately.
import { Pool } from 'pg';
import * as LaunchDarkly from 'launchdarkly-node-server-sdk';
// MigrationManager is the dual-write class defined in step 1.

const ldClient = LaunchDarkly.init('sdk-key');

export class PaymentService {
  constructor(
    private pool: Pool,
    private migrationManager: MigrationManager
  ) {}

  async processPayment(orderId: string, amount: number) {
    const userKey = `user_${orderId}`;
    // Check feature flag for new payment flow; falls back to false (legacy)
    const isNewFlowEnabled = await ldClient.variation(
      'payment-new-flow',
      { key: userKey },
      false
    );
    if (isNewFlowEnabled) {
      // New logic with expanded schema
      return this.processNewFlow(orderId, amount);
    } else {
      // Legacy logic
      return this.processLegacyFlow(orderId, amount);
    }
  }

  private async processNewFlow(orderId: string, amount: number) {
    const client = await this.pool.connect();
    try {
      await client.query('BEGIN');
      // Write to both columns during dual-write phase
      await this.migrationManager.dualWriteOrder(client, orderId, 'processing');
      // New business logic using new_payment_status
      const result = await this.executeNewGateway(client, orderId, amount);
      await client.query('COMMIT');
      return result;
    } catch (err) {
      await client.query('ROLLBACK');
      throw err;
    } finally {
      client.release();
    }
  }

  // processLegacyFlow and executeNewGateway are omitted from this excerpt.
}
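The same flag mechanism can guard the read switch in Phase 3. Below is a minimal sketch, assuming a hypothetical `FlagStore` interface in front of whichever flag SDK is used and a hypothetical flag name `read-new-payment-status`. It falls back to the old column when a row has not been backfilled yet.

```typescript
// Hypothetical flag-store interface; a real SDK client would sit behind it.
interface FlagStore {
  isEnabled(flag: string, key: string): boolean;
}

interface OrderRow {
  old_payment_status: string | null;
  new_payment_status: string | null;
}

// Reads the payment status from the column selected by the flag.
// While the flag is on, a NULL in the new column (row not yet backfilled)
// falls back to the old column.
function readPaymentStatus(row: OrderRow, flags: FlagStore, orderId: string): string | null {
  if (flags.isEnabled('read-new-payment-status', orderId)) {
    return row.new_payment_status !== null ? row.new_payment_status : row.old_payment_status;
  }
  return row.old_payment_status;
}

// In-memory store used for the example: a flag is on if it appears in the list.
function staticFlags(enabled: string[]): FlagStore {
  return { isEnabled: (flag) => enabled.indexOf(flag) !== -1 };
}
```

Turning the flag off returns reads to the old column without a redeploy, which is only safe while dual-writes are still keeping that column current.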
4. Automated Canary Analysis
Promotion of the canary is driven by metrics, not time. A pipeline step analyzes error rates and latency.
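// Pseudo-code for CI/CD pipeline validation.
// prometheusClient and rollbackCanary are placeholders for your own tooling;
// query() is assumed to resolve to a Prometheus instant-vector result.
async function validateCanary(canaryVersion: string): Promise<boolean> {
  // Ratio of 5xx responses to all responses over the last 5 minutes.
  // "or vector(0)" keeps the ratio defined when there are no 5xx series at all.
  const errorResult = await prometheusClient.query(
    '(sum(rate(http_requests_total{status=~"5..", version="canary"}[5m])) or vector(0))' +
    ' / sum(rate(http_requests_total{version="canary"}[5m]))'
  );
  // Prometheus returns each sample as [timestamp, "value"]; the value is a string
  const errorRate = Number(errorResult.result[0]?.value[1]);
  // No data (NaN) fails closed: a canary with no traffic is not promoted
  if (!Number.isFinite(errorRate) || errorRate > 0.01) { // > 1% error rate
    await rollbackCanary(canaryVersion);
    return false;
  }
  const latencyResult = await prometheusClient.query(
    'histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{version="canary"}[5m])))'
  );
  const latencyP99 = Number(latencyResult.result[0]?.value[1]);
  if (!Number.isFinite(latencyP99) || latencyP99 > 0.5) { // > 500ms
    await rollbackCanary(canaryVersion);
    return false;
  }
  return true;
}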
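The promote-or-rollback decision itself can be kept as a pure function, separate from the metrics client, so it can be unit-tested in CI. The sketch below assumes the samples have already been fetched. The names `Sample`, `Thresholds`, and `decide` are illustrative, and the thresholds mirror the 1% and 500 ms limits used above.

```typescript
// Prometheus encodes each sample as [unixSeconds, "value"]; the value is a string.
type Sample = [number, string];

interface Thresholds {
  maxErrorRatio: number;
  maxP99Seconds: number;
}

type Decision = 'promote' | 'rollback';

// Mean of the numeric sample values, or NaN when there are no usable samples.
function mean(samples: Sample[]): number {
  const values = samples.map((s) => Number(s[1])).filter((v) => Number.isFinite(v));
  if (values.length === 0) return NaN;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

function decide(errorRatio: Sample[], p99Seconds: Sample[], t: Thresholds): Decision {
  const err = mean(errorRatio);
  const p99 = mean(p99Seconds);
  // Missing data is treated as a failure so that an idle canary is never promoted.
  if (!Number.isFinite(err) || !Number.isFinite(p99)) return 'rollback';
  return err > t.maxErrorRatio || p99 > t.maxP99Seconds ? 'rollback' : 'promote';
}
```

Converting the string values with `Number` before summing matters here: adding the raw strings would concatenate them instead of producing a numeric average.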
Pitfall Guide
Production deployments fail due to subtle interactions between components. The following pitfalls are derived from ScaleRetail's incident reports.
- Database Schema Incompatibility:
  - Mistake: Removing a column or changing a type without backward compatibility.
  - Impact: Immediate 500 errors on read/write. Rollback requires database restoration.
  - Best Practice: Enforce Expand/Contract pattern. Never drop columns in the same deployment as the switch.
- Connection Pool Exhaustion:
  - Mistake: New pods start before old pods terminate, causing a spike in database connections.
  - Impact: Database rejects new connections; service becomes unresponsive.
  - Best Practice: Configure maxSurge and maxUnavailable in Deployment specs carefully. Implement connection pooling with max limits. Use preStop sleep hooks to allow in-flight requests to drain.
- Incomplete Health Checks:
  - Mistake: Readiness probes only check HTTP 200, not dependency health (DB, Cache, External APIs).
  - Impact: Traffic routed to pods that cannot process requests, causing cascading failures.
  - Best Practice: Implement deep health checks that verify connectivity to critical dependencies.
- Session Affinity Loss:
  - Mistake: Blue-Green or Canary deployments disrupt sticky sessions for stateful apps.
  - Impact: Users forced to re-authenticate; cart data lost.
  - Best Practice: Externalize session state to Redis. Avoid IP-based affinity.
- Rollback Blindness:
  - Mistake: Manual rollback process or lack of automated triggers.
  - Impact: Extended downtime while engineers diagnose and react.
  - Best Practice: Automate rollback based on error rate and latency thresholds. Ensure rollback is a one-click or automatic action.
- Configuration Drift:
  - Mistake: New version requires environment variables or secrets not present in the cluster.
  - Impact: Pods crash loop; deployment hangs.
  - Best Practice: Validate configuration completeness in CI. Use ConfigMaps and Secrets versioning.
- DNS Propagation Delays:
  - Mistake: Switching DNS records without considering TTL.
  - Impact: Clients continue routing to old version; inconsistent behavior.
  - Best Practice: Use low TTLs during deployment windows. Prefer service mesh routing over DNS switching for internal traffic.
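The connection pool pitfall comes down to simple arithmetic: during a rolling update the pod count peaks at the desired replicas plus the surge, and each pod may open up to its pool maximum. The sketch below works through that calculation. All field names and the numbers in the usage note are illustrative assumptions, and it ignores pods that still hold connections while draining in a preStop hook.

```typescript
interface RolloutPlan {
  replicas: number;        // desired pod count
  maxSurgePercent: number; // Deployment maxSurge, expressed as a percentage
  poolMaxPerPod: number;   // max connections each pod's pool may open
}

// Peak pod count during a rolling update: desired replicas plus the surge.
// Kubernetes rounds a percentage maxSurge up to a whole pod.
function peakPods(plan: RolloutPlan): number {
  return plan.replicas + Math.ceil((plan.replicas * plan.maxSurgePercent) / 100);
}

// Worst-case connections if every pod fills its pool at the same time.
function peakConnections(plan: RolloutPlan): number {
  return peakPods(plan) * plan.poolMaxPerPod;
}

// reserved covers connections kept aside for admin sessions, migrations, etc.
function fitsBudget(plan: RolloutPlan, dbMaxConnections: number, reserved: number): boolean {
  return peakConnections(plan) <= dbMaxConnections - reserved;
}
```

With 20 replicas, a 25% surge, and a pool of 10 per pod, the peak is 25 pods and 250 connections, which would not fit a database capped at 200 connections. Lowering the pool to 7 per pod brings the peak down to 175.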
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Stateless API Service | Blue-Green | Simplest implementation; instant rollback; no state to coordinate. | High (100% infra spike) |
| DB-Heavy Migration | Canary + Expand/Contract | Minimizes risk of data corruption; allows gradual validation of schema changes. | Low (+15% infra) |
| Frontend SPA | Canary with CDN | Users can be routed by cookie or header; easy to invalidate cache. | Low |
| Critical Payment Service | Canary + Feature Flags | Maximum control; can disable specific features instantly without rollback. | Low (+15% infra) |
| Legacy Monolith | Rolling Update with Feature Flags | Blue-Green may be too expensive; rolling updates reduce cost while flags mitigate risk. | Low |
Configuration Template
Istio VirtualService for Canary with Auto-Promotion Hook:
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: api-gateway-vs
  annotations:
    # Hook for CI/CD to trigger canary promotion
    deployment.kubernetes.io/canary-promotion: "true"
spec:
  hosts:
    - api.scale-retail.com
  gateways:
    - api-gateway
  http:
    - match:
        - headers:
            x-canary:
              exact: "true"
      route:
        - destination:
            host: api-service
            subset: canary
    - route:
        - destination:
            host: api-service
            subset: stable
          weight: 95
        - destination:
            host: api-service
            subset: canary
          weight: 5
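The template above relies on rule order: http rules are evaluated top to bottom and the first match wins, so a request carrying x-canary: true is pinned to the canary before the weighted split is considered. The following is a simplified model of that behavior for reasoning and testing, not Istio's implementation. `pickSubset` and the `roll` parameter (a number in the range 0 to 100, exclusive of 100) are assumptions.

```typescript
interface Route {
  subset: string;
  weight: number;
}

// Mirrors the two http rules in the template above: a header match that pins
// the request to canary, then a weighted default. roll must be in [0, 100).
function pickSubset(
  headers: { [name: string]: string },
  weighted: Route[],
  roll: number
): string {
  // Rule 1: exact header match wins regardless of weights.
  if (headers['x-canary'] === 'true') return 'canary';
  // Rule 2: walk the cumulative weights until the roll falls inside a bucket.
  let cumulative = 0;
  for (const route of weighted) {
    cumulative += route.weight;
    if (roll < cumulative) return route.subset;
  }
  throw new RangeError('roll is outside the total route weight');
}
```

A model like this is mainly useful for checking a planned rule set, for example confirming that internal testers sending the header always reach the canary while ordinary traffic follows the 95/5 split.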
GitHub Actions Pipeline Snippet:
name: Canary Deployment
on:
  push:
    branches: [main]
jobs:
  deploy-canary:
    runs-on: ubuntu-latest
    steps:
      # Check out the repo so the manifests and scripts below exist on the runner
      - uses: actions/checkout@v4
      # Assumes kubectl is already authenticated against the target cluster
      - name: Deploy Canary
        run: |
          kubectl set image deployment/api-service canary=registry.io/api:${{ github.sha }}
          kubectl apply -f istio/virtual-service-canary.yaml
      - name: Wait for Stabilization
        run: sleep 120
      - name: Validate Metrics
        run: |
          # Call validation API or script
          ./scripts/validate-canary.sh
      - name: Promote Canary
        if: success()
        run: |
          kubectl apply -f istio/virtual-service-promote.yaml
Quick Start Guide
- Install Service Mesh: Deploy Istio to your Kubernetes cluster using istioctl install.
- Define VirtualService: Create a VirtualService resource with canary routing rules and weight distribution.
- Add Health Checks: Implement deep health checks in your application that verify database and cache connectivity. Expose a /healthz endpoint.
- Run Initial Deployment: Deploy the canary subset with 5% traffic weight. Monitor error rates and latency for 5 minutes.
- Promote or Rollback: If metrics are healthy, update weights to 100% canary. If errors occur, trigger automatic rollback to stable version.
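For the health check step, a deep check typically runs one probe per dependency and reports unhealthy if any probe fails. The sketch below shows that aggregation only. `DependencyCheck`, `deepHealth`, and the probe functions are hypothetical, and wiring the result to a /healthz handler is left to the web framework in use.

```typescript
// Each probe is a hypothetical async function that resolves on success and
// rejects on failure (for example, running SELECT 1 against the database).
interface DependencyCheck {
  name: string;
  probe: () => Promise<void>;
}

interface HealthReport {
  status: 200 | 503;
  failed: string[];
}

// Runs all probes in parallel and collects the names of those that failed.
async function deepHealth(checks: DependencyCheck[]): Promise<HealthReport> {
  const outcomes = await Promise.all(
    checks.map(async (check) => {
      try {
        await check.probe();
        return null;
      } catch {
        return check.name;
      }
    })
  );
  const failed = outcomes.filter((name): name is string => name !== null);
  return { status: failed.length === 0 ? 200 : 503, failed };
}
```

Probes should be cheap and bounded by a timeout in practice, since a readiness probe that hangs on a slow dependency can itself take pods out of rotation. The timeout is omitted here to keep the sketch short.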
Zero-downtime deployment requires discipline in database migrations, rigorous monitoring, and automated validation. By adopting Canary deployments with Feature Flags and the Expand/Contract pattern, teams can achieve high velocity without compromising reliability. The investment in these practices pays off in reduced outage risk and faster recovery times.