rations must run in small batches with delays to prevent impacting production query latency.
4. Separation of Concerns: Schema changes (DDL) and data changes (DML) should be managed distinctly. DDL should be non-blocking where possible; DML should be throttled.
Step-by-Step Implementation
Phase 1: Expand
Add the new schema element without removing the old one. Implement dual-write logic in the application.
Scenario: Migrating user.profile JSONB column to a normalized user_settings table.
1. Create New Table (Migration Script):
// migrations/001_create_user_settings.ts
import { MigrationBuilder } from 'node-pg-migrate';
export async function up(pgm: MigrationBuilder): Promise<void> {
// Use CONCURRENTLY for indexes to avoid locking
pgm.createTable('user_settings', {
id: { type: 'uuid', primaryKey: true },
user_id: { type: 'uuid', notNull: true },
theme: { type: 'varchar(50)' },
notifications_enabled: { type: 'boolean', default: true },
created_at: { type: 'timestamp', default: pgm.func('CURRENT_TIMESTAMP') },
});
pgm.createIndex('user_settings', 'user_id', {
unique: false,
name: 'idx_user_settings_user_id'
});
}
export async function down(pgm: MigrationBuilder): Promise<void> {
pgm.dropTable('user_settings');
}
2. Implement Dual-Write Service:
The application must write to both the legacy and new structures.
// services/UserService.ts
import { db } from '../db/client';
import { FeatureFlags } from '../feature-flags';
export class UserService {
async updateProfile(userId: string, updates: Partial<UserProfileDto>): Promise<void> {
// Always write to legacy schema for backward compatibility
await db.user.update({
where: { id: userId },
data: { profile: updates },
});
// Dual-write to new schema
// Check feature flag to enable gradual rollout if needed,
// but for migration, dual-write should be always on.
await db.userSettings.upsert({
where: { user_id: userId },
update: { ...updates },
create: { user_id: userId, ...updates },
});
}
}
Phase 2: Backfill
Migrate existing data from the old schema to the new schema in the background.
// jobs/BackfillUserSettingsJob.ts
import { db } from '../db/client';
import { sleep } from '../utils';
const BATCH_SIZE = 500;
const DELAY_MS = 200; // Throttle to reduce load
export async function runBackfill(): Promise<void> {
let offset = 0;
let hasMore = true;
while (hasMore) {
const users = await db.user.findMany({
select: { id: true, profile: true },
where: { profile: { not: null } },
skip: offset,
take: BATCH_SIZE,
});
if (users.length === 0) {
hasMore = false;
break;
}
// Upsert in batches
await db.$transaction(
users.map((user) =>
db.userSettings.upsert({
where: { user_id: user.id },
update: {
theme: user.profile.theme,
notifications_enabled: user.profile.notifications_enabled,
},
create: {
user_id: user.id,
theme: user.profile.theme,
notifications_enabled: user.profile.notifications_enabled,
},
})
)
);
offset += users.length;
// Log progress and throttle
console.log(`Backfilled ${offset} records...`);
await sleep(DELAY_MS);
}
console.log('Backfill complete.');
}
Phase 3: Switch
Update the application to read from the new schema. Use a feature flag to control the switch.
// services/UserService.ts
export class UserService {
async getProfile(userId: string): Promise<UserProfileDto> {
// Feature flag controls read path
if (await FeatureFlags.isEnabled('read_user_settings_v2', userId)) {
const settings = await db.userSettings.findUnique({
where: { user_id: userId },
});
return this.transformToDto(settings);
}
// Fallback to legacy
const user = await db.user.findUnique({
where: { id: userId },
select: { profile: true },
});
return this.transformToDto(user.profile);
}
}
Phase 4: Contract
Once the new schema is stable and traffic is fully migrated:
- Remove dual-write logic.
- Remove legacy column/table via migration.
- Clean up feature flags.
// migrations/002_drop_user_profile_column.ts
export async function up(pgm: MigrationBuilder): Promise<void> {
// Ensure application code no longer references this column
pgm.dropColumn('user', 'profile');
}
Pitfall Guide
1. Executing Schema Changes Without CONCURRENTLY
Mistake: Running CREATE INDEX or ALTER TABLE without options that avoid exclusive locks.
Impact: The database table becomes inaccessible to reads and writes for the duration of the operation. On large tables, this causes API timeouts and cascading failures.
Best Practice: Always use CONCURRENTLY for index creation. For column additions, add columns as nullable first; populate data; then add constraints. Use online schema change tools (like gh-ost or pt-online-schema-change) for MySQL if native online DDL is insufficient.
2. Ignoring Replication Lag
Mistake: Writing to the primary and immediately reading from a replica during migration.
Impact: Read-your-writes consistency violations. The application may fail to find data that was just inserted, leading to logic errors or user-facing "not found" errors.
Best Practice: During migrations, route critical reads to the primary database. If using replicas, implement read-after-write consistency checks or disable replica routing for migrated entities until lag is verified to be zero.
3. Lack of Idempotent Backfill Jobs
Mistake: Writing backfill scripts that crash on duplicates or partial states.
Impact: If a backfill job crashes after processing 50% of data, restarting it fails or creates duplicate records. This halts migration progress and corrupts data integrity.
Best Practice: Use UPSERT or INSERT ... ON CONFLICT DO UPDATE semantics. Design jobs to be restartable. Log the last processed ID and resume from that checkpoint.
4. Testing Migrations Only on Seed Data
Mistake: Running migration scripts against a small, synthetic dataset in staging.
Impact: Performance characteristics differ drastically with data volume. A migration that takes seconds on 1,000 rows may take hours on 100 million rows, causing timeouts or lock escalation in production.
Best Practice: Use data masking to restore a production snapshot to staging. Validate migration duration and lock behavior against production-scale data.
5. Hardcoded Assumptions About Data Types
Mistake: Assuming all data fits the new schema constraints.
Impact: Migration fails midway due to constraint violations (e.g., string truncation, null values in non-nullable columns). This leaves the database in a partially migrated state.
Best Practice: Run pre-migration validation queries to identify data that violates new constraints. Implement data cleansing scripts to fix anomalies before applying structural changes.
6. Deploying Code and Schema Simultaneously Without Flags
Mistake: Deploying application code that expects the new schema at the exact same time as the migration.
Impact: If the migration takes longer than expected, or if code deployment rolls out incrementally, instances running old code will encounter missing columns, and instances running new code will encounter missing data.
Best Practice: Decouple deployments. Apply schema changes first. Deploy code with feature flags disabled. Enable flags gradually. Never rely on deployment timing for consistency.
7. No Automated Rollback Strategy
Mistake: Assuming "we can just revert the git commit."
Impact: Reverting code does not revert the database. If the migration has already run, the database schema is ahead of the code, causing immediate crashes upon rollback.
Best Practice: Maintain down migrations that are tested as rigorously as up migrations. In the Expand/Contract pattern, rollback is safe because you simply disable the feature flag and the old schema remains valid. Never drop legacy structures until the new schema is fully verified and stable.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|
| Adding Nullable Column | Online DDL + Backfill | Low risk; online DDL avoids locks. Backfill populates data safely. | Low |
| Changing Column Type | Expand/Contract with Dual-Write | Type changes often require data transformation. Dual-write ensures consistency during transition. | Medium |
| Splitting Table | Expand/Contract + CDC | Complex data movement requires background sync. CDC ensures no data loss during split. | High |
| High-Volume Delete | Batch Delete with Throttling | Large deletes cause bloat and lock contention. Batching minimizes impact. | Low |
| Engine Migration | CDC/Sync + Cutover | Changing engines requires full data sync. CDC handles continuous replication until cutover. | Very High |
Configuration Template
Template for a robust migration runner configuration in TypeScript. This enforces safety constraints.
// config/migration-runner.ts
export interface MigrationConfig {
// Database connection
databaseUrl: string;
// Safety constraints
maxExecutionTimeMs: number; // Fail migration if it takes too long
lockTimeoutMs: number; // Abort if lock cannot be acquired quickly
batchSize: number; // Batch size for backfill jobs
backfillDelayMs: number; // Delay between batches to throttle load
// Rollback settings
enableAutoRollback: boolean; // Automatically run down migration on failure
rollbackTimeoutMs: number; // Max time allowed for rollback
// Monitoring
metricsEndpoint: string; // URL to report migration progress
alertingWebhook: string; // Slack/PagerDuty webhook for failures
}
export const productionMigrationConfig: MigrationConfig = {
databaseUrl: process.env.DATABASE_URL!,
maxExecutionTimeMs: 300_000, // 5 minutes max for DDL
lockTimeoutMs: 5_000, // 5 seconds lock wait
batchSize: 1000,
backfillDelayMs: 100,
enableAutoRollback: false, // Manual approval recommended for prod rollback
rollbackTimeoutMs: 60_000,
metricsEndpoint: 'https://metrics.internal/api/v1/migrations',
alertingWebhook: process.env.SLACK_WEBHOOK!,
};
// Usage in migration script
import { runMigration } from '@codcompass/migration-core';
import { productionMigrationConfig } from './config/migration-runner';
runMigration({
config: productionMigrationConfig,
up: async () => { /* migration logic */ },
down: async () => { /* rollback logic */ },
});
Quick Start Guide
-
Initialize Migration Tooling:
Install a migration library (e.g., node-pg-migrate, Drizzle, or Prisma) and configure the connection string.
npm install node-pg-migrate
npx node-pg-migrate init
-
Create First Migration:
Generate a migration file and define the schema change using the Expand pattern.
npx node-pg-migrate create add_user_settings_table
# Edit the generated file with up/down logic
-
Run Dry-Run:
Execute the migration in a dry-run mode to verify SQL generation and syntax.
npx node-pg-migrate up --dry-run
-
Apply to Staging:
Run the migration against the staging environment and validate application behavior.
npx node-pg-migrate up --env staging
-
Deploy Backfill Job:
Deploy the backfill worker with rate limiting enabled. Monitor progress via logs or metrics dashboard.
kubectl apply -f k8s/backfill-job.yaml
This structure provides a complete, production-ready guide to database migrations, emphasizing resilience, zero-downtime patterns, and operational safety. By adhering to these practices, engineering teams can evolve their data models with confidence and minimize risk to system availability.