// Copy backend source to versioned path
execSync(`rsync -a --delete src/ ${BACKEND_DEST}/`, { stdio: 'inherit' });
console.log(`Artifacts prepared: ${TIMESTAMP}`);
```
**Rationale:** Timestamped directories guarantee that no two deployments share the same path. This eliminates file-lock contention and allows instant rollback by reverting the symlink.
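Because every release lives in its own timestamped directory, rollback reduces to picking the previous directory and re-pointing the symlink, and old releases can be pruned by age. The following is a minimal sketch of that bookkeeping. It assumes the `<root>/<unix-timestamp>/` layout used in this guide, and the helper names (`listReleases`, `previousRelease`, `releasesToPrune`) are hypothetical.

```typescript
import * as fs from 'node:fs';

// Hypothetical helpers for a releases root laid out as <root>/<unix-timestamp>/ plus a `current` symlink
function listReleases(root: string): string[] {
  // Keep only numeric directory names; the `current` symlink is not reported as a directory
  return fs
    .readdirSync(root, { withFileTypes: true })
    .filter((entry) => entry.isDirectory() && /^\d+$/.test(entry.name))
    .map((entry) => entry.name)
    .sort((a, b) => Number(a) - Number(b)); // oldest first
}

// The release deployed immediately before `current`, or undefined if there is none
function previousRelease(releases: string[], current: string): string | undefined {
  const index = releases.indexOf(current);
  return index > 0 ? releases[index - 1] : undefined;
}

// Releases that can be deleted: everything except the newest `keep` and the active one
function releasesToPrune(releases: string[], current: string, keep: number): string[] {
  const retained = new Set(releases.slice(Math.max(releases.length - keep, 0)));
  retained.add(current);
  return releases.filter((name) => !retained.has(name));
}
```

To roll back, point `current` at the directory returned by `previousRelease(...)` using the same symlink swap used for deploys. Because `releasesToPrune` always retains the active release, pruning never deletes the directory that `current` points to, even after a rollback.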
### 2. Clustered Process Management with Rolling Reloads
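PM2's cluster mode uses the Node.js `cluster` module to run multiple worker processes that share the same port. When `pm2 reload` is invoked, PM2 replaces workers one at a time. It starts a new worker, waits until that worker is online (the `listening` event, or a `process.send('ready')` message when `wait_ready` is enabled), and only then gracefully stops the corresponding old worker. At least one process therefore handles traffic at all times.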
```bash
# Start clustered API
pm2 start src/server.js -i 4 --name "platform-api" --max-memory-restart 512M
# Trigger rolling reload
pm2 reload platform-api --update-env
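```
**Rationale:** `-i 4` starts four workers, so match it to the number of CPU cores on the host (`-i max` does this automatically). `--max-memory-restart 512M` restarts any worker whose memory use exceeds 512 MB. The `--update-env` flag makes `pm2 reload` pick up environment variables exported during deployment, which propagates the new values to the workers without a full restart.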
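The replacement order described above can be modeled as a small async loop. This sketch shows only the sequencing and is not PM2's implementation. `rollingReload`, `spawn`, and `stop` are hypothetical stand-ins for what the process manager does internally.

```typescript
type WorkerId = number;

// Hypothetical hooks standing in for the process manager
interface WorkerHooks {
  spawn: () => Promise<WorkerId>; // resolves once the new worker is ready
  stop: (id: WorkerId) => Promise<void>; // resolves once the old worker has drained and exited
}

async function rollingReload(oldWorkers: WorkerId[], hooks: WorkerHooks): Promise<WorkerId[]> {
  const replacements: WorkerId[] = [];
  for (const oldId of oldWorkers) {
    // Start the replacement first, so capacity never drops below the original worker count
    const newId = await hooks.spawn();
    replacements.push(newId);
    // Only then retire the old worker
    await hooks.stop(oldId);
  }
  return replacements;
}
```

Awaiting `spawn` before `stop` for each worker is the key design choice. It trades a brief period of N+1 processes (extra memory) for never dropping below N.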
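### 3. Atomic Frontend Routing via Symlinks
Nginx should never read directly from a build directory. Instead, maintain a `current` symlink that points to the active version, and switch it only after a new build has finished. The switch should be done by staging the new link under a temporary name and renaming it over `current`. This is atomic because `rename(2)` replaces the old link in a single step.
```bash
# Initial setup
ln -sfn /opt/apps/platform/frontend/1715000000 /opt/apps/platform/frontend/current
# After the new build completes: stage a temporary link, then rename it over `current`
ln -sfn /opt/apps/platform/frontend/1715000060 /opt/apps/platform/frontend/current.tmp
mv -T /opt/apps/platform/frontend/current.tmp /opt/apps/platform/frontend/current
```
Nginx configuration:
```nginx
location / {
    root /opt/apps/platform/frontend/current;
    try_files $uri $uri/ /index.html;
}
```
**Rationale:** Depending on the `ln` implementation, `ln -sfn` may replace an existing link by unlinking it and creating a new one. That leaves a brief window in which `current` does not exist. Renaming a staged link with `mv -T` (GNU coreutils) avoids that window. Nginx opens files by path on every request, so it serves the new version without a reload or restart. The exception is when `open_file_cache` is enabled, because cached lookups can lag until they are revalidated. Combined with content-hashed chunk filenames from Vite, this helps prevent stale asset delivery.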
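A deploy script written in TypeScript can perform the symlink swap itself. The sketch below stages the new link under a temporary name in the same directory and renames it over `current`. It assumes a POSIX filesystem, where `rename(2)` replaces the destination in one step. The temporary link lives next to `current` because a rename is only atomic within a single filesystem. `swapSymlink` is a hypothetical helper name.

```typescript
import * as fs from 'node:fs';
import * as path from 'node:path';

// Points `linkPath` at `target` atomically: readers see either the old or the new target
function swapSymlink(target: string, linkPath: string): void {
  // Stage the link in the same directory so the rename stays on one filesystem
  const tmpLink = path.join(path.dirname(linkPath), `.${path.basename(linkPath)}.tmp-${process.pid}`);
  try {
    fs.unlinkSync(tmpLink); // remove a leftover from an interrupted run
  } catch (err) {
    if ((err as NodeJS.ErrnoException).code !== 'ENOENT') throw err;
  }
  fs.symlinkSync(target, tmpLink);
  // rename(2) replaces the old link itself (it does not follow it) in a single step
  fs.renameSync(tmpLink, linkPath);
}
```

Calling `swapSymlink('/opt/apps/platform/frontend/1715000060', '/opt/apps/platform/frontend/current')` after a build is equivalent to the staged `ln` plus `mv -T` pair.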
### 4. Graceful Termination Handling
PM2 sends SIGINT during reloads. The Express server must intercept this signal, stop accepting new connections, allow in-flight requests to complete, and then exit.
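```typescript
import express from 'express';
import http from 'http';

const app = express();
const server = http.createServer(app);
const PORT = Number(process.env.API_PORT) || 3000;

// Health endpoint polled by the reload test in the Quick Start Guide
app.get('/api/health', (_req, res) => {
  res.json({ status: 'ok' });
});

server.listen(PORT, () => {
  console.log(`API listening on port ${PORT}`);
  // With `wait_ready: true`, PM2 waits for this message before retiring the old worker
  process.send?.('ready');
});

let shuttingDown = false;

const gracefulShutdown = (signal: string) => {
  // Ignore repeated signals (e.g. SIGINT followed by SIGTERM)
  if (shuttingDown) return;
  shuttingDown = true;
  console.log(`Received ${signal}. Initiating graceful shutdown...`);

  // Stop accepting new connections; the callback runs once in-flight requests finish
  server.close((err) => {
    if (err) {
      console.error('Error while closing server:', err);
      process.exit(1);
    }
    console.log('All connections drained. Exiting.');
    process.exit(0);
  });
  // Close idle keep-alive sockets so they do not hold the server open (Node.js >= 18.2)
  server.closeIdleConnections();

  // Safety valve: force exit after 10 seconds (keep this below PM2's kill_timeout)
  setTimeout(() => {
    console.error('Shutdown timeout exceeded. Forcing exit.');
    process.exit(1);
  }, 10000).unref();
};

process.on('SIGINT', () => gracefulShutdown('SIGINT'));
process.on('SIGTERM', () => gracefulShutdown('SIGTERM'));
```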
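**Rationale:** Handling both SIGINT and SIGTERM ensures compatibility with PM2, systemd, and container orchestrators (the latter two send SIGTERM by default). The 10-second timeout keeps a hung connection from blocking deployments indefinitely. PM2 sends SIGKILL once its `kill_timeout` elapses (1600 ms by default), so raise `kill_timeout` above the in-app timeout. Otherwise the drain is cut short.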
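The shutdown sequence above has three parts: stop accepting connections, exit when drained, and force an exit after a deadline. This is easier to unit-test when the server, timer, and exit call are injected. The sketch below shows that refactoring. `createShutdownHandler` and the `ShutdownDeps` seams are hypothetical names; in production, `server.close`, an unref'd `setTimeout`, and `process.exit` would be the values passed in.

```typescript
// Hypothetical seams so the logic can run without real signals or sockets
interface ShutdownDeps {
  close: (done: (err?: Error) => void) => void; // e.g. (done) => server.close(done)
  exit: (code: number) => void; // e.g. process.exit
  schedule: (fn: () => void, ms: number) => void; // e.g. an unref'd setTimeout
  timeoutMs: number;
}

function createShutdownHandler(deps: ShutdownDeps): (signal: string) => void {
  let started = false;
  let exited = false;

  // Whichever finishes first (drain or deadline) decides the exit code; later calls are ignored
  const exitOnce = (code: number): void => {
    if (exited) return;
    exited = true;
    deps.exit(code);
  };

  return (signal: string): void => {
    if (started) return; // a second SIGINT/SIGTERM must not restart the drain
    started = true;
    console.log(`Received ${signal}. Draining connections...`);
    deps.close((err) => exitOnce(err ? 1 : 0));
    deps.schedule(() => exitOnce(1), deps.timeoutMs); // safety valve
  };
}
```

In production, the handler would be wired up with `close: (done) => server.close(done)`, `exit: (code) => process.exit(code)`, and `schedule: (fn, ms) => { setTimeout(fn, ms).unref(); }`. It would then be registered for both `SIGINT` and `SIGTERM`.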
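### 5. Externalized Session Management
The default in-memory store (express-session's `MemoryStore`) lives inside a single worker process. In cluster mode, consecutive requests from the same user can reach different workers, and a reload discards the store entirely. Externalizing sessions to Redis ensures continuity across workers and rolling updates.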
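```typescript
import session from 'express-session';
// connect-redis v7 default export; the import style differs between major versions
import RedisStore from 'connect-redis';
import { createClient } from 'redis';
// `app` is the Express instance created in the server module
const redisClient = createClient({ url: process.env.REDIS_URL });
// Without an 'error' listener, node-redis connection errors are thrown and crash the worker
redisClient.on('error', (err) => console.error('Redis client error:', err));
redisClient.connect().catch((err) => {
  console.error('Could not connect to Redis:', err);
  process.exit(1);
});
// Nginx sits in front: trust X-Forwarded-Proto so `secure` cookies work when TLS ends at the proxy
app.set('trust proxy', 1);
app.use(
  session({
    store: new RedisStore({ client: redisClient }),
    secret: process.env.SESSION_SECRET!,
    resave: false,
    saveUninitialized: false,
    // 24-hour cookie, HTTPS-only in production and hidden from client-side scripts
    cookie: { secure: process.env.NODE_ENV === 'production', httpOnly: true, maxAge: 86400000 }
  })
);
```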
**Rationale:** Redis acts as a single source of truth for session state. When PM2 rotates workers, new processes read existing sessions from Redis, preserving authentication and user context.
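A toy model makes it clear why an external store survives a reload. In this model, a plain `Map` stands in for Redis and each "worker" is an object that is either given its own store or a shared one. `ApiWorker` and `SessionStore` are illustrative names, not part of express-session or connect-redis.

```typescript
// A plain Map stands in for Redis; the value shape is illustrative only
type SessionStore = Map<string, { userId: string }>;

class ApiWorker {
  private readonly store: SessionStore;

  constructor(store: SessionStore) {
    this.store = store;
  }

  login(sessionId: string, userId: string): void {
    this.store.set(sessionId, { userId });
  }

  currentUser(sessionId: string): string | undefined {
    return this.store.get(sessionId)?.userId;
  }
}

// In-memory sessions: the replacement worker starts with an empty store
const oldLocal = new ApiWorker(new Map());
oldLocal.login('sid-1', 'alice');
const newLocal = new ApiWorker(new Map());

// Externalized sessions: old and new workers read the same store
const shared: SessionStore = new Map();
const oldShared = new ApiWorker(shared);
oldShared.login('sid-1', 'alice');
const newShared = new ApiWorker(shared);

console.log(newLocal.currentUser('sid-1')); // undefined: the user must log in again
console.log(newShared.currentUser('sid-1')); // alice: the session survived the reload
```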
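### 6. Database Migration Strategy
Schema changes cannot be applied atomically alongside code deployments, and during a rolling reload the old and new workers run side by side. Use the expand/contract pattern:
- Expand: deploy code that supports both the old and the new schema
- Expand: run migrations that only add columns/tables (new columns nullable or with defaults)
- Contract: deploy code that removes deprecated schema references, then drop the unused schema in a later migration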
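```typescript
// migration-runner.ts
// runMigrations and checkSchemaCompatibility are project helpers (assumed to live in ./db/migrate)
import { runMigrations, checkSchemaCompatibility } from './db/migrate';
async function preDeployValidation(): Promise<void> {
  const isCompatible = await checkSchemaCompatibility();
  if (!isCompatible) {
    console.warn('Schema mismatch detected. Running safe migrations...');
    // lockTimeout: give up after 30 s if another runner holds the migration lock
    await runMigrations({ direction: 'up', lockTimeout: 30000 });
  }
}
// Exit non-zero so the CI/CD pipeline aborts the deploy on failure
preDeployValidation().catch((err) => {
  console.error('Pre-deploy migration failed:', err);
  process.exit(1);
});
```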
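**Rationale:** The deployed code already tolerates both schemas, so running the migrations afterwards cannot cause runtime errors from missing columns or type mismatches. Assuming `runMigrations` takes a database lock, concurrent migration processes cannot apply the same change twice. The lock timeout then makes a blocked run fail instead of stalling the deployment indefinitely.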
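A pre-deploy step can enforce the expand/contract split. It does this by running only additive migrations automatically and holding back anything destructive until the contract phase. The sketch below is minimal. The `Migration` shape and operation names are hypothetical, not the API of `./db/migrate` or any particular migration tool.

```typescript
// Hypothetical migration descriptor; real tools (Knex, Prisma, Flyway, ...) model migrations differently
type MigrationOp =
  | 'createTable'
  | 'addColumn'
  | 'addIndex'
  | 'renameColumn'
  | 'alterColumnType'
  | 'dropColumn'
  | 'dropTable';

interface Migration {
  name: string;
  ops: MigrationOp[];
}

// Additive operations leave the old release working (assuming new columns are nullable or defaulted)
const EXPAND_OPS: ReadonlySet<MigrationOp> = new Set<MigrationOp>(['createTable', 'addColumn', 'addIndex']);

function phaseOf(migration: Migration): 'expand' | 'contract' {
  return migration.ops.every((op) => EXPAND_OPS.has(op)) ? 'expand' : 'contract';
}

// Pending migrations are ordered, so stop at the first contract step rather than skipping past it
function planPreDeploy(pending: Migration[]): { runNow: string[]; deferred: string[] } {
  const firstContract = pending.findIndex((m) => phaseOf(m) === 'contract');
  const cut = firstContract === -1 ? pending.length : firstContract;
  return {
    runNow: pending.slice(0, cut).map((m) => m.name),
    deferred: pending.slice(cut).map((m) => m.name),
  };
}
```

Stopping at the first contract migration keeps the migration history linear. A later additive step may depend on an earlier destructive one, so it must wait as well.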
## Pitfall Guide
### 1. In-Memory Session Storage
**Explanation:** Storing sessions in Node.js process memory means every rolling reload invalidates active user sessions. Users are forced to re-authenticate, triggering support tickets and trust erosion.
**Fix:** Externalize session state to Redis, Memcached, or a managed session service. Configure connect-redis or equivalent adapters with connection pooling and retry logic.
### 2. Ignoring SIGTERM vs SIGINT
**Explanation:** PM2 sends SIGINT during reloads, but cloud platforms and orchestrators (AWS, GCP, Kubernetes) send SIGTERM during scaling events or after failed health checks. Node.js exits immediately on an unhandled SIGINT or SIGTERM, so handling only one signal lets the other drop in-flight requests.
**Fix:** Register handlers for both SIGINT and SIGTERM. Ensure the drain logic is identical and includes a hard timeout to prevent deployment hangs.
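### 3. Stale Nginx Cache Delivery
**Explanation:** Without explicit cache headers, browsers may reuse a cached `index.html`. If the symlink has moved to a new build but the client still holds the old `index.html`, it requests chunk filenames that exist only in the previous build directory. Those requests fall through `try_files` to `index.html`, so the browser receives HTML where it expected JavaScript, causing runtime errors or missing features.
**Fix:** Set `Cache-Control: no-cache` for `index.html` and rely on content-hash filenames for JS/CSS assets. Vite hashes build output filenames by default, while Webpack needs `[contenthash]` in `output.filename`. Verify that the Nginx config does not send long-lived cache headers for `index.html`.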
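The header policy from this fix can be written down as a small function, which is handy for tests or for a Node.js static handler. The sketch below assumes Vite's default output layout: content-hashed files in the `assets` directory, with `index.html` and unhashed files copied from `public/` at the site root. `cacheControlFor` is a hypothetical name.

```typescript
// Assumes Vite's default layout: hashed build output under /assets/, everything else unhashed
const IMMUTABLE = 'public, max-age=31536000, immutable';
const REVALIDATE = 'no-cache';

function cacheControlFor(urlPath: string): string {
  // Hashed build output: the filename changes whenever the content does, so cache it for a year
  if (urlPath.startsWith('/assets/')) {
    return IMMUTABLE;
  }
  // index.html, SPA routes that fall back to it, and unhashed public files: revalidate every load
  return REVALIDATE;
}
```

The same split can be expressed in Nginx with a separate `location /assets/` block that adds the long-lived header.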
### 4. Database Schema Incompatibility
**Explanation:** Deploying code that expects a new column before the migration runs causes immediate 500 errors. Conversely, running a migration that renames or drops a column before backward-compatible code is deployed breaks the version still serving traffic.
**Fix:** Adopt the expand/contract pattern. Always deploy compatible code first, run migrations, then deploy the cleanup version. Use feature flags to toggle new schema usage.
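### 5. WebSocket Connection Drops
**Explanation:** Rolling reloads terminate the TCP connections held by each old worker, including long-lived WebSocket connections. Clients using raw WebSockets are disconnected abruptly and do not reconnect unless they implement their own retry logic.
**Fix:** Implement client-side reconnection logic with exponential backoff. Socket.IO clients reconnect automatically, and the Socket.IO Redis adapter broadcasts events across workers. PM2's cluster mode does not provide sticky sessions, so either restrict Socket.IO clients to the WebSocket transport or use the `@socket.io/pm2` package.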
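Raw WebSocket clients typically compute the reconnect delay with exponential backoff plus random jitter. The jitter prevents many clients disconnected by the same reload from reconnecting in lockstep. The sketch below uses the "full jitter" variant. The 500 ms base and 30 s cap are assumptions to tune, and `reconnectDelay` is a hypothetical name.

```typescript
// Full jitter: wait a random time in [0, min(cap, base * 2^attempt))
// `random` is injectable so the schedule can be tested deterministically
function reconnectDelay(
  attempt: number,
  baseMs = 500,
  capMs = 30_000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}
```

A client would call `setTimeout(connect, reconnectDelay(attempt))` from its `close` handler. It would increment `attempt` after each failed connection and reset it to 0 after a successful `open`.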
### 6. Environment Variable Staleness
**Explanation:** PM2 caches each app's environment variables when the app is first started, and a plain `pm2 reload` reuses that cached environment. Updating system variables or the deploy shell's environment without telling PM2 leaves workers using outdated configuration.
**Fix:** Always use `pm2 reload --update-env`, or define variables in an ecosystem file (`ecosystem.config.js`) and reload with `pm2 reload ecosystem.config.js --env production --update-env`. Validate variable propagation in CI/CD logs before routing traffic.
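Validating configuration at startup makes a missing variable fail loudly at reload time, instead of surfacing later as runtime errors. In the sketch below, the variable list mirrors the ecosystem file in this guide, and `requireEnv` is a hypothetical helper.

```typescript
// Variables the example configuration relies on; adjust to your app
const REQUIRED_ENV = ['NODE_ENV', 'API_PORT', 'REDIS_URL', 'SESSION_SECRET'] as const;

// Throws with every missing name at once; empty strings count as missing
function requireEnv(
  env: Record<string, string | undefined>,
  names: readonly string[] = REQUIRED_ENV,
): Record<string, string> {
  const missing = names.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(', ')}`);
  }
  return Object.fromEntries(names.map((name) => [name, env[name] as string]));
}
```

Calling `requireEnv(process.env)` at the top of the server module, before `server.listen()`, makes a worker with incomplete configuration exit before it signals ready. The error then shows up immediately in `pm2 logs`.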
### 7. Insufficient Grace Period
**Explanation:** The 10-second shutdown timeout used above may be too short for long-running requests (file uploads, report generation, third-party API calls). PM2's own `kill_timeout`, after which it sends SIGKILL, defaults to only 1600 ms. Premature termination causes data loss and client errors.
**Fix:** Tune `kill_timeout` in the PM2 ecosystem configuration to match your longest expected request, and keep the in-app timeout below it. Size both from tail latency (p99 or maximum request duration) rather than the average, and add a 20% buffer.
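The sizing rule can be applied mechanically to request durations exported from access logs. In the sketch below, the 20% buffer follows the guidance above. The 5-second gap between the in-app drain timeout and PM2's `kill_timeout` is an assumption that mirrors the example configuration. The function names are hypothetical.

```typescript
// Nearest-rank percentile: smallest sample with at least p% of samples at or below it
function percentile(sortedMs: number[], p: number): number {
  const rank = Math.ceil((p / 100) * sortedMs.length);
  return sortedMs[Math.max(rank, 1) - 1];
}

function recommendTimeouts(
  durationsMs: number[],
  bufferPercent = 20,
  killMarginMs = 5000,
): { shutdownTimeoutMs: number; killTimeoutMs: number } {
  if (durationsMs.length === 0) throw new Error('need at least one duration sample');
  const sorted = [...durationsMs].sort((a, b) => a - b);
  const p99 = percentile(sorted, 99);
  // Integer arithmetic avoids floating-point surprises from multiplying by 1.2
  const shutdownTimeoutMs = Math.ceil((p99 * (100 + bufferPercent)) / 100);
  return { shutdownTimeoutMs, killTimeoutMs: shutdownTimeoutMs + killMarginMs };
}
```

For example, samples of 1, 2, and 8 seconds give a 9600 ms drain timeout and a 14600 ms `kill_timeout`. The first value would replace the hard-coded 10000 in the shutdown handler, and the second would go into `ecosystem.config.js`.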
## Production Bundle
### Action Checklist
### Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Small team, single server | PM2 cluster + Nginx symlink | Low operational overhead, proven reliability | Minimal (disk I/O increase) |
| High traffic, multi-core | PM2 cluster + Redis sessions + HAProxy | Distributes load, preserves state across nodes | Moderate (Redis instance, LB config) |
| Containerized/Kubernetes | Rolling updates + readiness probes | Native orchestration, no process manager needed | Higher (cluster resources, monitoring) |
| Strict compliance/audit | Blue-green deployment + immutable artifacts | Instant rollback, full version traceability | High (duplicate infrastructure, storage) |
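### Configuration Template
```javascript
// ecosystem.config.js
module.exports = {
  apps: [{
    name: 'platform-api',
    script: 'src/server.js',
    instances: 'max',           // one worker per CPU core
    exec_mode: 'cluster',
    max_memory_restart: '512M', // restart a worker whose memory exceeds 512 MB
    kill_timeout: 15000,        // ms before PM2 sends SIGKILL; keep above the app's 10 s drain timeout
    wait_ready: true,           // wait for process.send('ready') instead of the 'listening' event
    listen_timeout: 5000,       // ms to wait for the ready signal before PM2 proceeds anyway
    env_production: {
      NODE_ENV: 'production',
      API_PORT: 3000,
      REDIS_URL: 'redis://127.0.0.1:6379',
      // Taken from the shell that runs `pm2 start`; keep the secret itself out of version control
      SESSION_SECRET: process.env.SESSION_SECRET
    }
  }]
};
```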
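```nginx
# /etc/nginx/sites-available/platform.conf
server {
    # Plain HTTP for brevity; production needs `listen 443 ssl` plus certificates,
    # because `secure` session cookies are only sent over HTTPS
    listen 80;
    server_name api.example.com;

    location / {
        root /opt/apps/platform/frontend/current;
        try_files $uri $uri/ /index.html;
        # Prevent HTML caching
        if ($uri ~* \.html$) {
            add_header Cache-Control "no-cache, no-store, must-revalidate";
        }
    }

    location /api/ {
        proxy_pass http://127.0.0.1:3000;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection 'upgrade';
        proxy_set_header Host $host;
        # Let Express (with `trust proxy`) see the client IP and original scheme
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_cache_bypass $http_upgrade;
        proxy_read_timeout 60s;
    }
}
```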
### Quick Start Guide
- Initialize versioned directories: Create `/opt/apps/platform/frontend` and `/opt/apps/platform/backend`. Set up a cron job or CI step to build artifacts into timestamped subdirectories.
- Configure PM2 ecosystem: Place `ecosystem.config.js` in your project root. Run `pm2 start ecosystem.config.js --env production` to launch clustered workers.
- Set up Nginx symlink: Create the `current` symlink pointing to your initial build. Update the Nginx config to serve from `/opt/apps/platform/frontend/current`, then validate and reload Nginx (for example, `nginx -t && systemctl reload nginx`).
- Test graceful reload: Run `pm2 reload platform-api --update-env` while sending continuous requests (`while true; do curl -fsS http://localhost/api/health > /dev/null || echo FAIL; sleep 0.1; done`). Verify that no `FAIL` lines appear and the logs show zero errors.
- Deploy pipeline integration: Wrap artifact preparation, symlink swap, and PM2 reload into a single CI/CD script. Add pre-deploy migration checks and post-deploy health verification.
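The shell loop in the reload test can be replaced by a small Node.js probe that counts failed requests explicitly, which makes the zero-error check scriptable in CI. The sketch below uses only `node:http`. The health URL, request count, and 2-second per-request timeout are assumptions, and `probe` is a hypothetical name.

```typescript
import * as http from 'node:http';

interface ProbeResult {
  ok: number;
  failures: number;
}

// Sends `requests` sequential GETs to `url`, pausing `intervalMs` between them.
// Any network error, timeout, or non-2xx status counts as a failure.
function probe(url: string, requests: number, intervalMs = 100): Promise<ProbeResult> {
  return new Promise((resolve) => {
    const result: ProbeResult = { ok: 0, failures: 0 };
    let sent = 0;

    const next = (): void => {
      if (sent === requests) {
        resolve(result);
        return;
      }
      sent++;
      let settled = false;
      // Record each request exactly once, even if several events fire for it
      const settle = (success: boolean): void => {
        if (settled) return;
        settled = true;
        if (success) result.ok++;
        else result.failures++;
        setTimeout(next, intervalMs);
      };

      const req = http.get(url, (res) => {
        const status = res.statusCode ?? 0;
        res.resume(); // drain the body so the socket is released
        res.on('end', () => settle(status >= 200 && status < 300));
        res.on('error', () => settle(false));
      });
      req.setTimeout(2000, () => req.destroy(new Error('request timed out')));
      req.on('error', () => settle(false));
    };

    next();
  });
}

// Usage while `pm2 reload platform-api --update-env` runs:
// probe('http://localhost/api/health', 300).then((r) => console.log(r));
```

A deploy script can fail the pipeline whenever `failures` is non-zero.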