s or breaking running instances.
- Expand: Add new schema elements (columns, tables) without removing old ones. Both old and new code must function.
- Dual-Write: Deploy application version V2 that writes to both old and new schema elements.
- Backfill: Migrate existing data from old schema to new schema asynchronously.
- Cutover: Deploy application version V3 that reads from the new schema and stops writing to the old schema.
- Contract: Remove old schema elements and dead code.
- Cleanup: Remove feature flags and dual-write logic.
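The ordering of the phases above matters: each one assumes the previous one has finished. As a minimal sketch, a deployment gate could refuse to skip a phase. The names `PHASES`, `Phase`, and `canAdvance` are hypothetical and exist only for illustration.

```typescript
// Hypothetical names for illustration; the phases mirror the list above.
const PHASES = ['expand', 'dual-write', 'backfill', 'cutover', 'contract', 'cleanup'] as const;
type Phase = (typeof PHASES)[number];

// A migration may only move to the phase that immediately follows the current one.
function canAdvance(current: Phase, next: Phase): boolean {
  return PHASES.indexOf(next) === PHASES.indexOf(current) + 1;
}
```

A pipeline step might call `canAdvance('backfill', 'cutover')` before deploying V3 and fail the build when it returns false.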
Implementation Details
1. Feature Flag Service
Feature flags decouple deployment from release. They allow V2 code to run in production while keeping new logic disabled until the cutover phase.
// feature-flag.service.ts
import { Redis } from 'ioredis';

export class FeatureFlagService {
  private redis: Redis;

  constructor(redisUrl: string) {
    this.redis = new Redis(redisUrl);
  }

  async isEnabled(flagKey: string, context: Record<string, any> = {}): Promise<boolean> {
    // Production implementation should include caching and consistent hashing
    const value = await this.redis.get(`ff:${flagKey}`);
    if (value === 'true') return true;
    if (value === 'false') return false;
    // Fallback to default or percentage rollout
    return this.evaluateRollout(flagKey, context);
  }

  private async evaluateRollout(key: string, context: Record<string, any>): Promise<boolean> {
    // Implement percentage-based rollout logic here
    return false;
  }
}
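The `evaluateRollout` stub leaves the percentage logic open. One common approach, sketched here under assumptions, is to hash the flag key together with a stable user identifier into a bucket from 0 to 99 and enable the flag when the bucket falls below the configured percentage. The helpers `fnv1a`, `bucketFor`, and `inRollout` are hypothetical, and the hash is a simple non-cryptographic FNV-1a style function chosen to keep the example dependency-free.

```typescript
// FNV-1a style 32-bit hash over UTF-16 code units: deterministic, not cryptographic.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Map a (flag, user) pair to a stable bucket in the range [0, 100).
function bucketFor(flagKey: string, userId: string): number {
  return fnv1a(`${flagKey}:${userId}`) % 100;
}

// A user is in the rollout when their bucket falls below the percentage.
function inRollout(flagKey: string, userId: string, percentage: number): boolean {
  return bucketFor(flagKey, userId) < percentage;
}
```

Because the bucket depends only on the flag key and the user identifier, a given user keeps the same answer across requests and instances, and raising the percentage only ever adds users to the rollout.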
2. Database Migration Helper
The migration helper wraps the Expand, Backfill, and Contract steps. It does not itself verify that the application is ready for a column or table to be dropped; that check must be enforced by a CI/CD or deployment gate.
// db-migration.helper.ts
import { Pool, QueryResult } from 'pg';

export class MigrationHelper {
  private pool: Pool;

  constructor(pool: Pool) {
    this.pool = pool;
  }

  // Phase 1: Expand - Add column safely
  // Table, column, and type are interpolated directly: pass trusted values only
  async expandAddColumn(tableName: string, columnName: string, type: string): Promise<void> {
    const sql = `
      ALTER TABLE ${tableName}
      ADD COLUMN IF NOT EXISTS ${columnName} ${type};
    `;
    await this.pool.query(sql);
  }

  // Phase 3: Backfill - Migrate data in batches to keep each lock short
  async backfillData(
    tableName: string,
    oldCol: string,
    newCol: string,
    batchSize: number = 1000
  ): Promise<void> {
    let affected = 0;
    do {
      // PostgreSQL's UPDATE has no LIMIT clause, so the batch is selected
      // in a subquery. Assumes the table has an id primary key.
      const sql = `
        UPDATE ${tableName}
        SET ${newCol} = ${oldCol}
        WHERE id IN (
          SELECT id FROM ${tableName}
          WHERE ${newCol} IS NULL
            AND ${oldCol} IS NOT NULL
          LIMIT ${batchSize}
        );
      `;
      const result: QueryResult = await this.pool.query(sql);
      affected = result.rowCount || 0;
      // Yield to the event loop between batches
      await new Promise(resolve => setImmediate(resolve));
    } while (affected > 0);
  }

  // Phase 5: Contract - Drop column only when safe
  async contractDropColumn(tableName: string, columnName: string): Promise<void> {
    // Verify no application instances reference this column
    // This check should be enforced by CI/CD or deployment gate
    const sql = `ALTER TABLE ${tableName} DROP COLUMN IF EXISTS ${columnName};`;
    await this.pool.query(sql);
  }
}
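The helper interpolates table and column names directly into SQL, because identifiers cannot be passed as query parameters. A minimal sketch of guarding those values before they reach the query string is shown below. The names `assertIdentifier`, `quoteIdentifier`, and `buildDropColumnSql` are hypothetical, and the allow-list is deliberately stricter than what PostgreSQL itself accepts.

```typescript
// Accept only letters, digits, and underscores, not starting with a digit.
const IDENTIFIER = /^[A-Za-z_][A-Za-z0-9_]*$/;

// Throws on anything that could break out of the identifier position.
function assertIdentifier(name: string): string {
  if (!IDENTIFIER.test(name)) {
    throw new Error(`Unsafe SQL identifier: ${JSON.stringify(name)}`);
  }
  return name;
}

// Wrap a validated name in double quotes, the standard SQL identifier quoting.
function quoteIdentifier(name: string): string {
  return `"${assertIdentifier(name)}"`;
}

// Build the Contract-phase statement from validated, quoted identifiers.
function buildDropColumnSql(table: string, column: string): string {
  return `ALTER TABLE ${quoteIdentifier(table)} DROP COLUMN IF EXISTS ${quoteIdentifier(column)};`;
}
```

Note that quoted identifiers are case-sensitive in PostgreSQL, so this sketch assumes tables and columns were created with lowercase names.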
3. Graceful Shutdown and Health Checks
Zero-downtime deployment requires the application to handle SIGTERM and to stop receiving traffic from the load balancer before it terminates, so that in-flight requests are not dropped.
// server.ts
import express from 'express';
import { Redis } from 'ioredis';
import { Pool } from 'pg';

const app = express();
// Assumption: connection settings come from the environment (PG* variables, REDIS_URL)
const pool = new Pool();
const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');
let isShuttingDown = false;

// Dependency checks used by the readiness endpoint
const checkDatabase = async (): Promise<boolean> => {
  try {
    await pool.query('SELECT 1');
    return true;
  } catch {
    return false;
  }
};

const checkCache = async (): Promise<boolean> => {
  try {
    return (await redis.ping()) === 'PONG';
  } catch {
    return false;
  }
};

// Liveness check: keeps passing while draining so the pod is not restarted
app.get('/health', (req, res) => {
  res.status(200).json({ status: 'healthy' });
});

// Readiness check for load balancer: fails as soon as shutdown begins
app.get('/ready', async (req, res) => {
  if (isShuttingDown) {
    res.status(503).json({ status: 'draining' });
    return;
  }
  // Check dependencies
  const dbOk = await checkDatabase();
  const cacheOk = await checkCache();
  if (dbOk && cacheOk) {
    res.status(200).json({ status: 'ready' });
  } else {
    res.status(503).json({ status: 'not_ready' });
  }
});

const server = app.listen(3000, () => {
  console.log('Server running on port 3000');
});

// Graceful shutdown handler
const shutdown = async () => {
  if (isShuttingDown) return;
  isShuttingDown = true;
  console.log('Shutting down gracefully...');
  // Stop accepting new connections
  // In K8s, the failing /ready probe lets the Service remove this endpoint from rotation
  server.close();
  // Wait for in-flight requests (timeout protection)
  await new Promise(resolve => setTimeout(resolve, 5000));
  // Close connections
  await pool.end();
  redis.disconnect();
  process.exit(0);
};

process.on('SIGTERM', shutdown);
process.on('SIGINT', shutdown);
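The shutdown handler above waits a fixed five seconds, which is either too long when the server is idle or too short for a slow request. A possible refinement, sketched here with a hypothetical `InFlightTracker` class, is to count active requests and resolve as soon as the count reaches zero, keeping the timeout only as an upper bound.

```typescript
// Hypothetical helper: counts active requests and lets shutdown wait for zero.
class InFlightTracker {
  private active = 0;
  private waiters: Array<() => void> = [];

  // Call when a request starts.
  begin(): void {
    this.active++;
  }

  // Call when a request's response has completed.
  end(): void {
    this.active = Math.max(0, this.active - 1);
    if (this.active === 0) {
      this.waiters.forEach(wake => wake());
      this.waiters = [];
    }
  }

  get count(): number {
    return this.active;
  }

  // Resolves true once no requests are active, or false if the timeout elapses first.
  drain(timeoutMs: number): Promise<boolean> {
    if (this.active === 0) return Promise.resolve(true);
    return new Promise<boolean>(resolve => {
      const timer = setTimeout(() => resolve(false), timeoutMs);
      this.waiters.push(() => {
        clearTimeout(timer);
        resolve(true);
      });
    });
  }
}
```

In the server, a middleware would call `begin()` when a request arrives and `end()` once its response has completed, and the shutdown handler would replace the fixed wait with `await tracker.drain(5000)`.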
4. Dual-Write Pattern
During the dual-write phase, V2 must write to both schemas so that the new schema stays consistent with the old one while the backfill runs and until reads are cut over.
// user.repository.ts
import { FeatureFlagService } from './feature-flag.service';

export class UserRepository {
  constructor(
    private db: any,
    private flags: FeatureFlagService
  ) {}

  async saveUser(user: any) {
    const useNewSchema = await this.flags.isEnabled('user_new_schema');
    // Always write to old schema for backward compatibility
    await this.db.query(
      'INSERT INTO users (id, name, email) VALUES ($1, $2, $3)',
      [user.id, user.name, user.email]
    );
    // Dual-write if flag is enabled
    // Note: the two inserts are not atomic; wrap them in a transaction
    // if a partial write is unacceptable
    if (useNewSchema) {
      await this.db.query(
        'INSERT INTO users_v2 (id, name, email, metadata) VALUES ($1, $2, $3, $4)',
        [user.id, user.name, user.email, JSON.stringify(user.metadata)]
      );
    }
  }

  async getUser(id: string) {
    // Reads are switched by a separate flag so the cutover can happen after the backfill
    const useNewSchema = await this.flags.isEnabled('user_new_schema_read');
    if (useNewSchema) {
      return this.db.query('SELECT * FROM users_v2 WHERE id = $1', [id]);
    }
    return this.db.query('SELECT * FROM users WHERE id = $1', [id]);
  }
}
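Before the read flag is flipped, it helps to confirm that dual-writes and the backfill have left both tables in agreement. The sketch below compares rows that have already been fetched from `users` and `users_v2`. The `UserRow` shape and the `findMismatches` function are hypothetical, and only the columns shared by both tables are compared.

```typescript
// Columns common to both tables in the repository example.
interface UserRow {
  id: string;
  name: string;
  email: string;
}

// Returns the ids of old rows that are missing from, or differ in, the new table.
function findMismatches(oldRows: UserRow[], newRows: UserRow[]): string[] {
  const newById = new Map<string, UserRow>();
  for (const row of newRows) {
    newById.set(row.id, row);
  }
  const mismatched: string[] = [];
  for (const oldRow of oldRows) {
    const newRow = newById.get(oldRow.id);
    if (!newRow || newRow.name !== oldRow.name || newRow.email !== oldRow.email) {
      mismatched.push(oldRow.id);
    }
  }
  return mismatched;
}
```

An empty result over a representative sample is one reasonable precondition for enabling `user_new_schema_read`.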
Architecture Rationale
- Backward Compatibility: Every deployment must be backward compatible with the previous version. V2 must work with V1 data; V1 must not break if V2 writes new data.
- Atomic Cutover: Feature flags provide a single switch for the read path. Once the backfill is complete, flipping the flag moves read traffic to the new schema without a deployment; instances that cache flag values pick up the change when their cache expires.
- Stateless Design: Applications must not store session state in memory. Sessions must be externalized to Redis or cookies to support instance rotation.
- Health Check Granularity: Separate /health (liveness) and /ready (readiness) endpoints. The load balancer should only route traffic to pods passing /ready. During shutdown, /ready fails immediately, stopping traffic routing, while /health keeps passing until termination.
Pitfall Guide
1. Breaking Database Migrations
Mistake: Running ALTER TABLE DROP COLUMN or renaming columns in a single deployment.
Impact: V1 instances crash when querying the missing column.
Fix: Always use Expand/Contract. Add new columns, dual-write, backfill, cutover reads, then drop old columns in a subsequent deployment.
2. Sticky Sessions and In-Memory State
Mistake: Keeping sessions or caches in process memory while assuming instances can be rotated freely.
Impact: Users lose sessions or see stale data during rolling updates.
Fix: Externalize state to Redis, DynamoDB, or distributed caches. Implement session affinity only if unavoidable, and ensure the load balancer handles session transfer.
3. Health Check Misconfiguration
Mistake: Using a single health endpoint or checking only process uptime.
Impact: Load balancer routes traffic to pods that are starting up or failing dependencies.
Fix: Implement startup probes, liveness probes, and readiness probes. Readiness should check database connectivity and cache status. Configure drain timeout to allow in-flight requests to complete.
4. Feature Flag Leakage
Mistake: Leaving feature flags enabled indefinitely without cleanup.
Impact: Code complexity increases, testing matrix explodes, and dead code causes performance degradation.
Fix: Treat feature flags as technical debt. Set expiration dates. Automate flag cleanup in CI/CD pipelines. Monitor flag usage and remove unused flags.
5. Rollback Strategy Absence
Mistake: Deploying V2 without a tested rollback path.
Impact: If V2 fails, rolling back to V1 causes downtime or data loss due to schema incompatibility.
Fix: Ensure V1 can run alongside V2 data. Test rollback procedures in staging. Automate rollback triggers based on error rate thresholds.
6. DNS Propagation Delays
Mistake: Using DNS-based routing with high TTL for Blue/Green switches.
Impact: Users experience downtime during DNS propagation.
Fix: Use low TTL values (e.g., 60 seconds) for deployment switches. Prefer load balancer-level routing over DNS for critical switches.
7. CI/CD Pipeline Fragility
Mistake: Manual gates or untested deployment scripts.
Impact: Human error introduces downtime; deployments are too slow to support frequent releases.
Fix: Automate all deployment steps. Use infrastructure as code. Implement automated rollback on failure. Test deployment strategies in staging with production-like data.
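Pitfall 5 recommends rollback triggers based on error rate thresholds. A minimal sketch of such a trigger follows. The `WindowStats` shape, the `shouldRollBack` function, and the threshold values in the usage note are hypothetical, and a real pipeline would read these numbers from its monitoring system.

```typescript
// Request and error counts observed over one monitoring window.
interface WindowStats {
  requests: number;
  errors: number;
}

// Roll back when the error rate exceeds the threshold, but only once
// enough traffic has been observed to make the rate meaningful.
function shouldRollBack(stats: WindowStats, maxErrorRate: number, minRequests: number): boolean {
  if (stats.requests < minRequests) return false;
  return stats.errors / stats.requests > maxErrorRate;
}
```

For example, `shouldRollBack({ requests: 1000, errors: 60 }, 0.05, 100)` evaluates a 6% error rate against a 5% threshold and returns true.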
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Stateless microservice, config change | Rolling Update | Low complexity, safe for stateless apps. | Low |
| Database schema migration | Expand/Contract | Only pattern safe for breaking schema changes. | Medium |
| High-risk feature, A/B testing | Canary + Feature Flags | Granular control, instant kill switch, traffic splitting. | High |
| Multi-region deployment | Blue/Green per region | Isolation, fast rollback, region-level safety. | High |
| Legacy monolith, infrequent deploys | Blue/Green | Simplifies rollback, reduces risk for large changes. | Medium |
Configuration Template
Kubernetes Deployment with Rolling Update strategy, readiness probes, and graceful shutdown configuration.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: zero-downtime-app
spec:
  replicas: 4
  selector:                  # Required by apps/v1; must match the template labels
    matchLabels:
      app: zero-downtime-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1            # Allow 1 extra pod during update
      maxUnavailable: 0      # Never reduce available pods
  template:
    metadata:
      labels:
        app: zero-downtime-app
    spec:
      terminationGracePeriodSeconds: 30  # Time for graceful shutdown
      containers:
        - name: app
          image: myapp:latest
          ports:
            - containerPort: 3000
          readinessProbe:
            httpGet:
              path: /ready
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 15
            periodSeconds: 20
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 10"]  # Delay SIGTERM
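The timings in this template interact: Kubernetes counts the preStop hook against terminationGracePeriodSeconds, so the preStop sleep plus the application's own drain time must fit inside the grace period or the container is killed mid-shutdown. The check below is a small sketch of that arithmetic, with `fitsGracePeriod` as a hypothetical helper name.

```typescript
// True when the preStop delay and the application's drain time both fit
// inside the pod's termination grace period, with some time to spare.
function fitsGracePeriod(preStopSeconds: number, drainSeconds: number, graceSeconds: number): boolean {
  return preStopSeconds + drainSeconds < graceSeconds;
}
```

With the values used in this article (a 10 second preStop sleep, a 5 second drain wait in the shutdown handler, and a 30 second grace period), the check passes with 15 seconds of headroom.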
Quick Start Guide
- Add Health Endpoints: Implement /health and /ready endpoints in your application. /ready must check database and cache connectivity.
- Configure Graceful Shutdown: Add SIGTERM handler to stop accepting requests, wait for in-flight requests, and close connections. Configure orchestrator drain timeout.
- Implement Feature Flags: Integrate a feature flag service. Wrap all new logic in flag checks. Ensure flags can be toggled without redeployment.
- Set Up Database Migrations: Use the Expand/Contract pattern for all schema changes. Write migration scripts that add columns, dual-write, backfill, and cutover reads safely.
- Automate CI/CD: Configure your pipeline to run health checks, validate database compatibility, and trigger rollback on error rate spikes. Test the deployment strategy in staging.