mended architecture for database migration at scale is the Expand and Contract pattern, orchestrated via a migration runner that supports feature flags, batching, and observability.
Phase 1: Expand
Add the new schema elements (columns, tables, indexes) without removing or altering existing ones. This ensures backward compatibility.
Technical Implementation:
- Online Schema Change: For MySQL, use tools like gh-ost or pt-online-schema-change to avoid long blocking locks. Both build a ghost copy of the table, copy existing rows in chunks, and keep the copy in sync with ongoing writes (gh-ost by tailing the binary log, pt-online-schema-change with triggers) before swapping the tables.
- Feature Flags: Wrap all new schema access in feature flags. The new code path should be disabled by default.
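As a minimal sketch of "disabled by default", the helper below treats a missing flag as off, so the new column is only touched once the flag is explicitly set. The `FlagStore` type, the `display_name` column, and the function names are illustrative assumptions, not part of any specific flag library.

```typescript
// Hypothetical flag store: a missing key means the flag is off.
type FlagStore = ReadonlyMap<string, boolean>;

function isFlagEnabled(flags: FlagStore, key: string): boolean {
  // Only an explicit `true` enables the new path
  return flags.get(key) === true;
}

// Chooses which columns a query may reference.
// 'display_name' stands in for a column added during the Expand phase.
function selectUserColumns(flags: FlagStore): string[] {
  const legacyColumns = ['id', 'full_name'];
  return isFlagEnabled(flags, 'user.new_schema_read')
    ? legacyColumns.concat('display_name')
    : legacyColumns;
}
```

With an empty flag store the function returns only the legacy columns, which is the behavior a freshly deployed Expand phase should have.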
Phase 2: Dual-Write and Backfill
Update the application to write to both the old and new schemas. Simultaneously, backfill existing data to the new schema in batches.
Architecture Decision:
- Dual-Write: Implemented in the repository layer. Writes to the new schema should be best-effort or queued to avoid impacting primary latency.
- Backfill Strategy: Use a cursor-based approach with configurable batch sizes and concurrency. Implement exponential backoff on errors.
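The exponential backoff mentioned for the backfill could look roughly like the sketch below. `retryWithBackoff` and its parameters are hypothetical names chosen for illustration; the `sleep` function is injectable so the delay schedule can be checked without waiting.

```typescript
// Retries `operation`, doubling the delay after each failure: base, 2x base, 4x base, ...
async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  maxAttempts: number,
  baseDelayMs: number,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  if (maxAttempts < 1) throw new RangeError('maxAttempts must be at least 1');
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      // No delay after the final failed attempt
      if (attempt < maxAttempts - 1) {
        await sleep(baseDelayMs * Math.pow(2, attempt));
      }
    }
  }
  // All attempts failed: surface the last error to the caller
  throw lastError;
}
```

A backfill worker would wrap each batch write in this helper, for example `retryWithBackoff(() => writeBatch(rows), 3, 1000)`, where `writeBatch` is whatever function persists the batch.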
Phase 3: Dual-Read and Cutover
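Validate data consistency between the two schemas, then switch reads to the new schema behind a flag, falling back to the legacy schema when a row is missing. Once confident, make the new schema the sole write target and stop dual-writes.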
Phase 4: Contract
Remove the old schema elements. This is the cleanup phase and can be done in a subsequent deployment.
Code Implementation: TypeScript Repository with Dual-Write
The following example demonstrates a repository pattern that handles dual-write logic safely, including error isolation so a failure in the new schema does not block the critical path.
```typescript
import { FeatureFlagService } from './feature-flags';
import { MetricsClient } from './metrics';
import { DatabaseClient } from './db-client';
import type { User } from './models'; // assumed location of the User type

interface MigrationConfig {
  batchSize: number;
  maxConcurrency: number;
  backfillRateLimit: number; // ms between batches
}

export class UserRepository {
  constructor(
    private readonly legacyDb: DatabaseClient,
    private readonly newDb: DatabaseClient,
    private readonly featureFlags: FeatureFlagService,
    private readonly metrics: MetricsClient,
    private readonly config: MigrationConfig
  ) {}

  // Phase 2: Dual-Write Implementation
  async saveUser(user: User): Promise<void> {
    // 1. Write to Legacy (Critical Path)
    await this.legacyDb.users.save(user);
    this.metrics.increment('db.legacy.write.success');

    // 2. Write to New Schema (Best Effort / Flagged)
    const isNewSchemaActive = await this.featureFlags.isEnabled('user.new_schema_write');
    if (isNewSchemaActive) {
      // Fire-and-forget: not awaited, so new-schema latency and errors stay off the critical path
      this.writeToNewSchema(user).catch((err) => {
        this.metrics.increment('db.new.write.error');
        // Log error for alerting; do not throw to preserve critical path
        console.error('New schema write failed:', err);
      });
    }
  }

  private async writeToNewSchema(user: User): Promise<void> {
    const transformedUser = this.transformToNewFormat(user);
    await this.newDb.users.save(transformedUser);
    this.metrics.increment('db.new.write.success');
  }

  // Phase 2: Backfill Implementation
  async runBackfill(): Promise<void> {
    let lastId = 0;
    let batchCount = 0;

    while (true) {
      // Batch query with cursor
      const users = await this.legacyDb.users.findGreaterThan(lastId, this.config.batchSize);
      if (users.length === 0) break;

      // Process the page in groups of at most maxConcurrency parallel writes
      for (const group of this.chunk(users, this.config.maxConcurrency)) {
        const results = await Promise.allSettled(
          group.map((user) => this.newDb.users.save(this.transformToNewFormat(user)))
        );
        // Surface failed rows instead of silently dropping them
        for (const result of results) {
          if (result.status === 'rejected') {
            this.metrics.increment('migration.backfill.row.error');
            console.error('Backfill write failed:', result.reason);
          }
        }
        // Rate limiting to protect DB
        await this.sleep(this.config.backfillRateLimit);
      }

      // Advance the cursor only after the whole page has been attempted
      lastId = users[users.length - 1].id;
      batchCount++;
      this.metrics.gauge('migration.backfill.progress', { processed: lastId, batches: batchCount });
    }
  }

  // Phase 3: Dual-Read Implementation
  async getUser(id: User['id']): Promise<User> {
    const isNewSchemaRead = await this.featureFlags.isEnabled('user.new_schema_read');
    if (isNewSchemaRead) {
      try {
        const newUser = await this.newDb.users.findById(id);
        if (newUser) return this.transformToLegacyFormat(newUser);
        // Not found in the new schema (e.g., not yet backfilled): fall back to legacy
      } catch (err) {
        // Record the failure, then fall through to legacy
        this.metrics.increment('db.new.read.error');
        console.error('New schema read failed:', err);
      }
    }
    return this.legacyDb.users.findById(id);
  }

  // Placeholder mappings (not in the original listing): replace with the real field mapping
  private transformToNewFormat(user: User): User {
    return { ...user };
  }

  private transformToLegacyFormat(user: User): User {
    return { ...user };
  }

  private chunk<T>(array: T[], size: number): T[][] {
    return Array.from({ length: Math.ceil(array.length / size) }, (_, i) =>
      array.slice(i * size, i * size + size)
    );
  }

  private sleep(ms: number): Promise<void> {
    return new Promise((resolve) => setTimeout(resolve, ms));
  }
}
```
Architecture Rationale
- Error Isolation: Dual-write failures are caught and logged. A failure in the migration path never impacts the legacy path.
- Feature Flags: Every phase is controlled by flags. This allows instant rollback by toggling flags without redeploying code.
- Observability: Metrics track write success rates, backfill progress, and latency deltas. Alerting should be configured on db.new.write.error spikes.
- Idempotency: The backfill script must be idempotent. It should handle cases where the row already exists in the new schema (e.g., via upserts or unique constraints).
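To illustrate the idempotency requirement, here is a small sketch that uses an in-memory map as a stand-in for the new table. `InMemoryUserTable`, `NewUserRow`, and `backfillBatch` are hypothetical names; in a real database the same effect would come from an upsert statement such as PostgreSQL's `INSERT ... ON CONFLICT ... DO UPDATE` or MySQL's `INSERT ... ON DUPLICATE KEY UPDATE`.

```typescript
interface NewUserRow {
  id: number;
  displayName: string;
}

// In-memory stand-in for the new table, keyed by primary key
class InMemoryUserTable {
  private readonly rows = new Map<number, NewUserRow>();

  // Insert-or-replace by primary key, so repeating a write changes nothing
  upsert(row: NewUserRow): void {
    this.rows.set(row.id, { ...row });
  }

  get(id: number): NewUserRow | undefined {
    return this.rows.get(id);
  }

  count(): number {
    return this.rows.size;
  }
}

function backfillBatch(table: InMemoryUserTable, batch: NewUserRow[]): void {
  for (const row of batch) table.upsert(row);
}

const userTable = new InMemoryUserTable();
const firstBatch: NewUserRow[] = [
  { id: 1, displayName: 'Ada' },
  { id: 2, displayName: 'Grace' },
];
backfillBatch(userTable, firstBatch);
backfillBatch(userTable, firstBatch); // re-run after a crash or restart
console.log(userTable.count()); // 2
```

Because the second run overwrites rows rather than appending them, restarting the backfill from an earlier cursor position leaves the table in the same state.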
Pitfall Guide
Based on production post-mortems, these are the most common failures during large-scale migrations and how to avoid them.
1. Running Blocking DDL Directly on Production
- Mistake: Executing ALTER TABLE directly on a busy production database.
- Impact: Depending on the engine and the operation, the statement can hold a table lock that blocks writes, causing request timeouts and failed writes when transactions are aborted.
- Best Practice: Use online schema change tools (gh-ost or pt-online-schema-change for MySQL, or an equivalent for your database) that build a ghost table and keep it in sync until the swap.
2. Ignoring Foreign Key Constraints During Migration
- Mistake: Migrating a parent table without ensuring child references are updated or compatible.
- Impact: Referential integrity violations cause write failures or orphaned records.
- Best Practice: Analyze the dependency graph. Migrate child tables first if the schema change affects foreign keys, or, where the database supports them (PostgreSQL does; MySQL's InnoDB does not), use deferred constraints during the migration window.
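One way to analyze the dependency graph is a depth-first topological sort over foreign-key references. The sketch below assumes the graph has already been extracted into a plain object (`ForeignKeyGraph` and `referencedFirstOrder` are illustrative names). It returns referenced (parent) tables before the tables that point at them; reverse the result for a child-first order.

```typescript
// Maps each table to the tables it references through foreign keys
type ForeignKeyGraph = { [table: string]: string[] };

function referencedFirstOrder(graph: ForeignKeyGraph): string[] {
  const order: string[] = [];
  const state = new Map<string, 'visiting' | 'done'>();

  const visit = (table: string): void => {
    if (state.get(table) === 'done') return;
    if (state.get(table) === 'visiting') {
      throw new Error('Circular foreign key dependency at ' + table);
    }
    state.set(table, 'visiting');
    for (const parent of graph[table] || []) {
      // Self-references (e.g. a manager_id column) do not affect ordering
      if (parent !== table) visit(parent);
    }
    state.set(table, 'done');
    order.push(table);
  };

  for (const table of Object.keys(graph)) visit(table);
  return order;
}

const tableOrder = referencedFirstOrder({
  orders: ['users'],
  order_items: ['orders', 'products'],
  users: [],
  products: [],
});
console.log(tableOrder); // [ 'users', 'orders', 'products', 'order_items' ]
```

A cycle raises an error instead of producing an order, which is usually the signal that constraints need to be deferred or temporarily relaxed for that group of tables.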
3. Backfill Scripts Causing Replication Lag
- Mistake: Running backfill batches too aggressively, saturating I/O or CPU.
- Impact: Read replicas fall behind, causing stale reads for users or breaking applications that depend on read-your-writes consistency.
- Best Practice: Monitor replication lag in real-time. Implement dynamic throttling that pauses the backfill if lag exceeds a threshold (e.g., 5 seconds).
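Dynamic throttling can be sketched as a gate that the backfill calls before each batch. Everything here is an assumption for illustration: `getLagSeconds` stands in for however your database reports replica lag, and `waitForReplicaCatchUp` is a hypothetical name.

```typescript
interface ThrottleOptions {
  maxLagSeconds: number;                 // pause while lag is above this
  pollIntervalMs: number;                // how long to wait between lag checks
  getLagSeconds: () => Promise<number>;  // hypothetical lag probe
  sleep: (ms: number) => Promise<void>;
}

// Resolves once lag is at or below the threshold.
// Returns how many times the caller was paused, which is useful as a metric.
async function waitForReplicaCatchUp(opts: ThrottleOptions): Promise<number> {
  let pauses = 0;
  while ((await opts.getLagSeconds()) > opts.maxLagSeconds) {
    pauses++;
    await opts.sleep(opts.pollIntervalMs);
  }
  return pauses;
}
```

Calling `await waitForReplicaCatchUp(...)` at the top of each backfill iteration means a lagging replica stalls the backfill instead of the backfill stalling the replica.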
4. Hardcoding Migration Logic Without Rollback Path
- Mistake: Writing migration code that assumes the new schema is always available.
- Impact: If the migration fails or needs rollback, the application cannot function with the old schema.
- Best Practice: Always code for the "worst case." The application must work with the old schema even after the migration code is deployed. Use feature flags to gate new logic.
5. Testing Migrations on Staging with Insufficient Data
- Mistake: Validating migration scripts on staging databases that are a fraction of production size.
- Impact: Performance characteristics differ drastically. Index rebuilds that take seconds on staging may take hours on production, or cause OOM errors.
- Best Practice: Use production data dumps for migration testing, or simulate production load using tools like pgbench or sysbench during the test migration.
6. Forgetting to Update Indexes and Constraints
- Mistake: Adding new columns but neglecting to add necessary indexes or unique constraints.
- Impact: New queries perform full table scans, causing latency spikes and increased load once the new schema is active.
- Best Practice: Include index creation in the Expand phase. Verify index usage with EXPLAIN ANALYZE before cutover.
7. Lack of Data Consistency Validation
- Mistake: Assuming dual-write ensures data parity without verification.
- Impact: Silent data corruption where new schema has missing or incorrect data due to transformation bugs.
- Best Practice: Implement a reconciliation job that samples records from both schemas and compares fields. Run this continuously during the dual-write phase.
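A reconciliation pass might be sketched as follows. It assumes both records have already been fetched and normalized to the same field names (the `SampledPair` shape and the `reconcile` function are hypothetical), and it compares values with strict equality, so dates or nested objects would need to be serialized first.

```typescript
type FieldValues = { [field: string]: string | number | boolean | null };

interface SampledPair {
  id: number;
  legacy: FieldValues;
  migrated?: FieldValues; // undefined when the row is missing from the new schema
}

interface ReconciliationReport {
  checked: number;
  mismatchedIds: number[];
  parity: number; // fraction of sampled rows that match on every compared field
}

function rowMatches(legacy: FieldValues, migrated: FieldValues | undefined, fields: string[]): boolean {
  if (migrated === undefined) return false;
  for (const field of fields) {
    if (legacy[field] !== migrated[field]) return false;
  }
  return true;
}

function reconcile(sample: SampledPair[], fields: string[]): ReconciliationReport {
  const mismatchedIds = sample
    .filter((pair) => !rowMatches(pair.legacy, pair.migrated, fields))
    .map((pair) => pair.id);
  const parity =
    sample.length === 0 ? 1 : (sample.length - mismatchedIds.length) / sample.length;
  return { checked: sample.length, mismatchedIds, parity };
}
```

The returned `parity` can be emitted as a gauge and the `mismatchedIds` logged, so a transformation bug shows up as a falling ratio with concrete rows to inspect.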
Production Bundle
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Small Table (<10k rows) | Big Bang with Maintenance Window | Complexity of dual-write outweighs risk. Lock duration is negligible. | Low engineering cost; minimal downtime cost. |
| Large Table, Low Write Vol. | Expand/Contract with Standard DDL | Online tools add overhead. Standard DDL is safe if traffic is low. | Medium engineering cost; low downtime risk. |
| Large Table, High Write Vol. | Expand/Contract with Online Schema Tool | Locks will cause outages. Online tools handle sync via binlogs. | High engineering cost; high tooling complexity; zero downtime. |
| NoSQL to Relational | Dual-Write via CDC | Schema mismatch requires transformation. CDC captures all changes. | Very high engineering cost; requires stream processing. |
| Emergency Hotfix | Big Bang with Immediate Rollback Plan | Speed is critical. Mitigate risk with instant rollback capability. | Low engineering cost; high risk; requires rapid rollback. |
Configuration Template
Use this TypeScript configuration to standardize migration runners across your organization.
```typescript
// migration.config.ts
export interface MigrationConfig {
  // Feature flag keys for controlling migration phases
  flags: {
    writeEnabled: string;
    readEnabled: string;
    backfillActive: string;
  };
  // Backfill performance tuning
  backfill: {
    batchSize: number; // Rows per query
    maxConcurrency: number; // Parallel write workers
    rateLimitMs: number; // Delay between batches
    maxReplicationLagSec: number; // Pause if lag exceeds this
    retryAttempts: number;
    retryBackoffBaseMs: number;
  };
  // Observability
  metrics: {
    prefix: string;
    enabled: boolean;
    alertThresholds: {
      errorRate: number; // % of writes failing
      lagThreshold: number; // seconds
    };
  };
  // Safety
  safety: {
    maxRuntimeMinutes: number; // Abort if running too long
    dryRunMode: boolean;
    requireApproval: boolean; // Require manual token to start
  };
}

export const defaultConfig: MigrationConfig = {
  flags: {
    writeEnabled: 'migration.user_schema_write',
    readEnabled: 'migration.user_schema_read',
    backfillActive: 'migration.user_backfill',
  },
  backfill: {
    batchSize: 500,
    maxConcurrency: 10,
    rateLimitMs: 100,
    maxReplicationLagSec: 5,
    retryAttempts: 3,
    retryBackoffBaseMs: 1000,
  },
  metrics: {
    prefix: 'db.migration.user',
    enabled: true,
    alertThresholds: {
      errorRate: 0.5,
      lagThreshold: 10,
    },
  },
  safety: {
    maxRuntimeMinutes: 480,
    dryRunMode: false,
    requireApproval: true,
  },
};
```
Quick Start Guide
- Initialize Migration Runner: Create a new migration script using the MigrationConfig template. Define the Expand, Backfill, and Contract steps.

  ```bash
  npx codcompass-cli init-migration user-schema-v2 --config migration.config.ts
  ```

- Deploy Expand Phase: Run the migration runner in dryRun mode to validate SQL. Execute the Expand phase to add new columns/tables using online schema tools.

  ```bash
  npx codcompass-cli run user-schema-v2 --phase expand --dry-run
  npx codcompass-cli run user-schema-v2 --phase expand --execute
  ```

- Enable Dual-Write: Deploy the application update with dual-write logic. Enable the writeEnabled flag for 0% of traffic initially, then ramp up. Monitor metrics for error rates.

- Execute Backfill: Start the backfill process. Monitor replication lag and error rates. The runner will auto-throttle based on configuration.

  ```bash
  npx codcompass-cli run user-schema-v2 --phase backfill --start
  ```

- Cutover Reads: Once backfill completes and reconciliation shows >99.99% parity, enable the readEnabled flag. Ramp read traffic gradually. After validation, disable dual-write and proceed to the Contract phase to remove legacy schema.
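The gradual ramps in the steps above (0% first, then increasing) are commonly implemented by hashing a stable key into one of 100 buckets. The sketch below is one possible approach, not a description of any particular flag service; `rolloutBucket` and `isInRollout` are hypothetical names, and the hash is an FNV-1a-style mix over UTF-16 code units.

```typescript
// Maps a stable key (e.g. a user ID) to a bucket in the range 0..99
function rolloutBucket(key: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < key.length; i++) {
    hash ^= key.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  // Force an unsigned 32-bit value before taking the modulus
  return (hash >>> 0) % 100;
}

// A key is included when its bucket falls below the current percentage,
// so raising the percentage only ever adds keys and never removes them.
function isInRollout(key: string, percent: number): boolean {
  return rolloutBucket(key) < percent;
}
```

Because the bucket depends only on the key, a given user stays on the same side of the flag as the percentage rises, which keeps per-user behavior stable during the ramp.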
Database migration at scale is a discipline that demands rigorous engineering. By decoupling schema changes from deployments, enforcing backward compatibility, and utilizing automated, observable execution patterns, teams can eliminate downtime risks and maintain high availability even during significant infrastructure evolution. The investment in robust migration tooling pays immediate dividends in deployment velocity and system reliability.