Add new columns, tables, or indexes without altering the old structure. This guarantees that existing application code continues to function while new code targets the expanded schema.
-- Migration v001: Add new column with default
ALTER TABLE users ADD COLUMN email_verified_at TIMESTAMPTZ DEFAULT NULL;
CREATE INDEX CONCURRENTLY idx_users_email_verified ON users(email_verified_at);
CONCURRENTLY is mandatory for production indexes. It prevents table locks by building the index incrementally while allowing concurrent reads and writes.
Step 2: Dual-Write Implementation
Deploy application code that writes to both the old and new schema locations. Use a feature flag or configuration toggle to control the routing logic without requiring a database migration deployment.
interface UserRepository {
save(user: User): Promise<void>;
}
export class DualWriteUserRepository implements UserRepository {
constructor(
private readonly legacyRepo: LegacyUserRepository,
private readonly modernRepo: ModernUserRepository,
private readonly config: MigrationConfig
) {}
async save(user: User): Promise<void> {
await Promise.all([
this.legacyRepo.save(user),
this.config.enableModernWrite ? this.modernRepo.save(user) : Promise.resolve()
]);
}
}
The dual-write layer must handle partial failures gracefully. If the modern write fails, log the discrepancy and queue a reconciliation job. Never block the legacy write.
Step 3: Batched Backfill
Migrate existing data in controlled batches to avoid replication lag and connection pool exhaustion. Use cursor-based pagination with explicit WHERE clauses to prevent full table scans.
export async function backfillEmailVerification(
client: PoolClient,
batchSize: number = 1000,
delayMs: number = 50
): Promise<void> {
let lastId = 0;
let rowsAffected = 0;
do {
const result = await client.query(
`UPDATE users
SET email_verified_at = COALESCE(email_verified_at, created_at)
WHERE id > $1
ORDER BY id ASC
LIMIT $2
RETURNING id`,
[lastId, batchSize]
);
rowsAffected = result.rowCount ?? 0;
if (rowsAffected > 0) {
lastId = result.rows[result.rows.length - 1].id;
}
await new Promise(resolve => setTimeout(resolve, delayMs));
} while (rowsAffected === batchSize);
}
Rate-limiting and explicit ORDER BY id ensure predictable execution plans and prevent MVCC bloat in PostgreSQL.
Step 4: Read Switch & Verification
Once backfill completes and dual-write consistency is verified, switch read operations to the new schema. Validate data integrity using checksums or row counts before disabling the legacy path.
export async function verifyMigrationConsistency(
client: PoolClient
): Promise<boolean> {
const [legacyCount, modernCount] = await Promise.all([
client.query('SELECT COUNT(*) FROM users_legacy'),
client.query('SELECT COUNT(*) FROM users_modern')
]);
const legacy = parseInt(legacyCount.rows[0].count, 10);
const modern = parseInt(modernCount.rows[0].count, 10);
return legacy === modern;
}
Step 5: Schema Contraction
After confirming zero read/write traffic targets the old schema for a defined stabilization period, remove the deprecated structures.
-- Migration v002: Contract phase
DROP INDEX IF EXISTS idx_users_legacy_email;
ALTER TABLE users DROP COLUMN IF EXISTS legacy_email;
Contraction must never run until all application instances are running the post-switch code. Use deployment manifests to enforce version alignment.
Architecture Decisions & Rationale
- Idempotent migrations: Every migration must be safe to run multiple times. Use
IF NOT EXISTS, DO $$ ... $$ LANGUAGE plpgsql; blocks, or explicit version tracking tables.
- Transaction boundaries: DDL in PostgreSQL auto-commits, breaking transactional guarantees. Split schema changes and data migrations into separate scripts with explicit dependency ordering.
- State tracking: Maintain a
schema_migrations table with version, checksum, and execution timestamp. Prevent drift by rejecting out-of-order or duplicate runs.
- Connection pooling: Migrations must use a dedicated connection pool with higher timeouts and lower concurrency limits to avoid starving application traffic.
Pitfall Guide
- Non-idempotent migration scripts: Running a migration twice due to deployment retries or pipeline flakiness causes duplicate constraints, orphaned indexes, or data corruption. Always wrap DDL in conditional checks or use migration runners that track execution state.
- Online DDL without
CONCURRENTLY: Creating indexes without the concurrent flag blocks writes for the duration of the build. On tables with millions of rows, this triggers connection queue saturation and cascading timeouts.
- Coupling schema changes to code deployments: Tying a migration to a specific application release forces rollback synchronization. If the code fails, the database remains in a transitional state. Decouple them using feature flags and backward-compatible schema expansions.
- Skipping backfill verification: Assuming row counts match after a migration ignores data type mismatches, trigger side effects, and replication lag. Always run checksum validation or sample audits before switching read paths.
- Ignoring replication lag in distributed clusters: In multi-node PostgreSQL or MySQL setups,
ALTER TABLE statements replicate asynchronously. Switching reads before lag clears returns stale data. Monitor pg_stat_replication or show slave status before cutover.
- Unbounded UPDATE statements: Migrating data without
LIMIT and ORDER BY causes full table scans, MVCC bloat, and lock escalation. Batch updates with explicit cursors and rate limiting.
- Missing rollback automation: Manual rollback procedures increase MTTR during incidents. Bake
down migrations into the same version-controlled file, and test them against production-like data volumes before deployment.
Production Best Practices:
- Run migrations against a restored production snapshot before targeting live clusters.
- Use advisory locks (
pg_advisory_lock) to prevent concurrent migration executions across multiple application instances.
- Monitor
pg_stat_activity and pg_locks during migration windows to detect blocking sessions early.
- Set
statement_timeout and lock_timeout in migration sessions to prevent runaway operations.
- Maintain a migration runbook with explicit cutover criteria, rollback triggers, and escalation paths.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|
| Single-region PostgreSQL, <10M rows | Big Bang with maintenance window | Fast execution, low complexity, acceptable downtime | Low operational cost, moderate downtime cost |
| Multi-region PostgreSQL/CockroachDB | Expand/Contract with dual-write | Replication lag requires decoupled schema evolution | Higher engineering cost, near-zero downtime cost |
| Zero-downtime SLA required | Dual-write + feature flag cutover | Eliminates lock contention, enables instant rollback | Moderate infrastructure cost, high reliability ROI |
| Legacy monolith with tight coupling | Strangler Fig pattern | Gradual service extraction avoids monolithic schema locks | High initial refactoring cost, long-term scalability gain |
| Small team, limited CI/CD | Backward-compatible expand/contract | Reduces rollback complexity, simplifies deployment coordination | Low tooling cost, reduced incident response overhead |
Configuration Template
// migrations/config.ts
import { PoolConfig } from 'pg';
export interface MigrationConfig {
db: PoolConfig;
migrationsDir: string;
table: string;
lockTimeoutMs: number;
statementTimeoutMs: number;
batchSize: number;
backfillDelayMs: number;
enableModernWrite: boolean;
}
export const defaultMigrationConfig: MigrationConfig = {
db: {
host: process.env.DB_HOST || 'localhost',
port: parseInt(process.env.DB_PORT || '5432', 10),
database: process.env.DB_NAME || 'app_db',
user: process.env.DB_USER || 'postgres',
password: process.env.DB_PASSWORD || '',
max: 2,
idleTimeoutMillis: 5000,
},
migrationsDir: './migrations',
table: 'schema_migrations',
lockTimeoutMs: 30000,
statementTimeoutMs: 60000,
batchSize: 1000,
backfillDelayMs: 50,
enableModernWrite: process.env.ENABLE_MODERN_WRITE === 'true',
};
// migrations/runner.ts
import { Pool, PoolClient } from 'pg';
import { MigrationConfig } from './config';
export class MigrationRunner {
private pool: Pool;
constructor(private readonly config: MigrationConfig) {
this.pool = new Pool(config.db);
}
async acquireClient(): Promise<PoolClient> {
const client = await this.pool.connect();
await client.query(`SET lock_timeout = '${this.config.lockTimeoutMs}ms'`);
await client.query(`SET statement_timeout = '${this.config.statementTimeoutMs}ms'`);
return client;
}
async ensureMigrationTable(): Promise<void> {
const client = await this.acquireClient();
try {
await client.query(`
CREATE TABLE IF NOT EXISTS ${this.config.table} (
version VARCHAR(255) PRIMARY KEY,
checksum VARCHAR(64) NOT NULL,
executed_at TIMESTAMPTZ DEFAULT NOW()
)
`);
} finally {
client.release();
}
}
async isExecuted(version: string): Promise<boolean> {
const client = await this.acquireClient();
try {
const res = await client.query(
`SELECT 1 FROM ${this.config.table} WHERE version = $1`,
[version]
);
return res.rowCount > 0;
} finally {
client.release();
}
}
async recordExecution(version: string, checksum: string): Promise<void> {
const client = await this.acquireClient();
try {
await client.query(
`INSERT INTO ${this.config.table} (version, checksum) VALUES ($1, $2)`,
[version, checksum]
);
} finally {
client.release();
}
}
async close(): Promise<void> {
await this.pool.end();
}
}
Quick Start Guide
- Initialize the migration table: Run
MigrationRunner.ensureMigrationTable() against your target database. This creates the version tracking schema required for idempotent execution.
- Configure environment variables: Set
DB_HOST, DB_NAME, DB_USER, DB_PASSWORD, and ENABLE_MODERN_WRITE=false in your deployment environment. Adjust batchSize and backfillDelayMs based on your cluster's write capacity.
- Execute expansion migration: Run your first migration script (e.g.,
ALTER TABLE ... ADD COLUMN). The runner verifies idempotency, records the version, and applies the schema change without blocking application traffic.
- Deploy dual-write code: Release the application update with the dual-write repository and feature flag disabled. Monitor write latency and error rates for 24 hours before proceeding.
- Run backfill and verify: Execute the batched backfill job, then run
verifyMigrationConsistency(). Once checksums match, toggle ENABLE_MODERN_WRITE=true and switch read paths. Schedule contraction after 7 days of stable traffic.