ecksum validation to ensure block-level integrity.
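# pg_basebackup verifies checksums by default when data checksums are enabled;
# the only related flag is --no-verify-checksums, which disables the check.
pg_basebackup -h primary-db.internal -U backup_user -D /var/lib/pg-backup/base \
  --checkpoint=fast --wal-method=stream --progress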
Archive the output to immutable object storage with versioning enabled. Never overwrite historical baselines.
Step 2: Continuous WAL Archiving with Retention Policy
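Configure postgresql.conf to ship WAL segments to durable storage. Retain archived WAL at least as far back as the oldest base backup you intend to restore from; archive_timeout, not the retention window, bounds how much recent data an archive-only recovery can lose.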
wal_level = replica
archive_mode = on
archive_command = 'aws s3 cp %p s3://pg-wal-archive/%f --sse aws:kms'
archive_timeout = 60
Set a lifecycle policy to expire WALs only after confirming successful restore drills. Premature expiration is a common and avoidable cause of PITR failure: a single missing segment ends replay at that point.
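The expiry rule above can be sketched as a small filter that only releases segments older than the oldest retained base backup, and only once a restore drill has passed. This is a minimal sketch, not a lifecycle implementation: `expirableSegments`, its parameters, and the drill flag are hypothetical names, and it assumes standard 24-hex-character WAL segment file names compared within a single timeline.

```typescript
// Hypothetical helper: the names and the drill flag are illustrative,
// not part of any PostgreSQL or AWS API.
const WAL_SEGMENT_NAME = /^[0-9A-F]{24}$/;

function expirableSegments(
  archived: string[],
  oldestBackupStartSegment: string,
  restoreDrillPassed: boolean,
): string[] {
  // Without a confirmed restore drill, nothing is eligible for expiry.
  if (!restoreDrillPassed) return [];
  const timeline = oldestBackupStartSegment.slice(0, 8);
  return archived.filter(
    (segment) =>
      WAL_SEGMENT_NAME.test(segment) && // skips .history and .backup files
      segment.slice(0, 8) === timeline && // only compare within one timeline
      segment < oldestBackupStartSegment, // strictly older than the backup's first segment
  );
}
```

Timeline history files and segments from other timelines are deliberately never returned, so the sketch errs on the side of keeping data.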
Deploy a standby node in a separate availability zone. Use a lightweight orchestrator to monitor replication lag, detect primary failure, and promote safely.
import { Client } from 'pg';
import { RDS } from '@aws-sdk/client-rds';

export class PostgresFailoverOrchestrator {
  private rds: RDS;
  private standbyHost: string;
  private primaryHost: string;

  constructor(config: { standby: string; primary: string }) {
    this.rds = new RDS({ region: 'us-east-1' });
    this.standbyHost = config.standby;
    this.primaryHost = config.primary;
  }

  // Seconds since the standby last replayed a transaction.
  // Note: this value also grows while the primary is idle or down.
  async checkReplicationLag(): Promise<number> {
    const client = new Client({ host: this.standbyHost, database: 'postgres' });
    await client.connect();
    try {
      const res = await client.query(
        `SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag`
      );
      const lag = res.rows[0].lag;
      // NULL means nothing has been replayed (or this is not a standby); do not report it as zero lag.
      if (lag === null) throw new Error('Replication lag unavailable on standby');
      return Number(lag);
    } finally {
      // Always release the connection, even when the query fails.
      await client.end();
    }
  }

  async promoteStandby(): Promise<void> {
    const lag = await this.checkReplicationLag();
    if (lag > 30) throw new Error(`Replication lag too high: ${lag}s. Aborting promotion.`);
    // promoteReadReplica expects the RDS instance identifier, not a hostname;
    // pass the identifier separately if the two differ in your environment.
    await this.rds.promoteReadReplica({
      DBInstanceIdentifier: this.standbyHost,
    });
    // Update connection poolers, DNS, or service mesh routing
    await this.updateRoutingTable();
  }

  private async updateRoutingTable(): Promise<void> {
    // Implementation depends on infrastructure (Route53, HAProxy, K8s Service)
    console.log('Routing updated to promoted standby');
  }
}
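One subtlety with a time-based lag check is that `now() - pg_last_xact_replay_timestamp()` keeps growing once the primary stops sending WAL, so a lag value sampled after the outage begins will always look too high. The sketch below separates the two signals by using the last lag sampled while the primary was still reachable. `decideFailover` and its field names are hypothetical; the 30-second thresholds simply mirror the figures used elsewhere in this guide.

```typescript
// Hypothetical decision helper; thresholds mirror the 30-second figures in this guide.
type FailoverDecision = 'wait' | 'promote' | 'abort';

interface ClusterObservation {
  primaryUnreachableSeconds: number; // 0 while the primary answers health checks
  lastHealthyLagSeconds: number | null; // lag sampled while the primary was still up; null if never measured
}

function decideFailover(
  obs: ClusterObservation,
  maxOutageSeconds = 30,
  maxLagSeconds = 30,
): FailoverDecision {
  // A short blip is not a failure; keep waiting.
  if (obs.primaryUnreachableSeconds <= maxOutageSeconds) return 'wait';
  // Unknown or excessive lag means promotion could lose more data than intended.
  if (obs.lastHealthyLagSeconds === null || obs.lastHealthyLagSeconds > maxLagSeconds) {
    return 'abort';
  }
  return 'promote';
}
```

Returning `'abort'` rather than throwing leaves it to the caller to page an operator or retry with a different standby.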
Step 4: Recovery Validation Pipeline
Automate restore testing on isolated infrastructure. Validate data consistency, extension compatibility, and query performance before declaring a backup production-ready.
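import { execFileSync } from 'child_process';

// Assumes backupPath is a plain-format pg_basebackup directory (pg_restore only reads pg_dump archives)
// and that data checksums are enabled on the cluster.
export async function validateRecovery(backupPath: string): Promise<boolean> {
  const testDir = `/tmp/pg-restore-test-${Date.now()}`;
  // Work on a copy so the archived backup is never modified.
  execFileSync('cp', ['-a', backupPath, testDir], { stdio: 'inherit' });
  // Start on a spare port (5499 is an arbitrary choice) so the copy replays its WAL and reaches consistency.
  execFileSync('pg_ctl', ['start', '-w', '-D', testDir, '-o', '-p 5499'], { stdio: 'inherit' });
  // Run critical query smoke tests against port 5499 here.
  // pg_checksums refuses to run unless the cluster was shut down cleanly.
  execFileSync('pg_ctl', ['stop', '-w', '-m', 'fast', '-D', testDir], { stdio: 'inherit' });
  // pg_checksums exits non-zero on failures, which makes execFileSync throw.
  const result = execFileSync('pg_checksums', ['--check', '-D', testDir], { encoding: 'utf8' });
  return /Bad checksums:\s+0\b/.test(result);
}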
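Matching on a free-form phrase in tool output is brittle, so a pipeline may prefer to extract the reported count of bad checksums and compare it to zero. The sketch below assumes the `pg_checksums --check` summary contains a line of the form `Bad checksums:  N`; `badChecksumCount` and `checksumsClean` are hypothetical helper names, and the exit code of the tool should still be treated as the primary signal.

```typescript
// Hypothetical parser for the pg_checksums summary; the line format is an assumption.
function badChecksumCount(output: string): number | null {
  const match = /^Bad checksums:\s+(\d+)\s*$/m.exec(output);
  // null means the summary line was not found, which callers should treat as a failure.
  return match ? Number(match[1]) : null;
}

function checksumsClean(output: string): boolean {
  return badChecksumCount(output) === 0;
}
```

Treating a missing summary line as a failure avoids reporting success when the tool printed an error instead of a report.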
Architecture Decisions and Rationale
- Immutable WAL storage: Prevents accidental deletion or overwriting during ransomware or operator error.
- Separate replication network: Isolates WAL traffic from application traffic, preventing backup storms from degrading production latency.
- Idempotent promotion logic: Ensures failover scripts can be retried safely without corrupting replication slots or connection pools.
- Automated validation: Recovery is only as reliable as the last successful test. Pipeline integration catches extension mismatches and schema drift before they hit production.
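Idempotent promotion can be approached by deriving the remaining work from observed state rather than from a record of what the script already attempted. The following is a minimal sketch under that assumption: `remainingPromotionSteps`, `ObservedState`, and the step names are hypothetical, and the `standbyInRecovery` flag is assumed to come from a query such as `SELECT pg_is_in_recovery()` on the standby.

```typescript
// Hypothetical planner: a retried run only performs the steps that are still outstanding.
type PromotionStep = 'promote-standby' | 'update-routing';

interface ObservedState {
  standbyInRecovery: boolean; // e.g. the result of SELECT pg_is_in_recovery() on the standby
  routingTarget: string; // host that clients are currently directed to
}

function remainingPromotionSteps(observed: ObservedState, standbyHost: string): PromotionStep[] {
  const steps: PromotionStep[] = [];
  // A standby that has already left recovery must not be promoted a second time.
  if (observed.standbyInRecovery) steps.push('promote-standby');
  // Routing is only touched if it does not already point at the new primary.
  if (observed.routingTarget !== standbyHost) steps.push('update-routing');
  return steps;
}
```

If a run fails after promotion but before the routing change, the retry sees `standbyInRecovery: false` and performs only the routing update.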
Pitfall Guide
- Assuming backup size equals recovery capability: Large backups do not guarantee fast recovery. I/O bottlenecks during restore, missing WAL segments, or incompatible extensions will stall promotion regardless of storage capacity. Always measure restore time, not backup size.
- Ignoring RPO/RTO alignment with business metrics: Engineering teams often fall back on technical defaults (e.g., 24-hour WAL retention) without mapping them to actual data loss tolerance. A 5-minute RPO requires continuous archiving and replication; a 24-hour RPO does not. Mismatched objectives create compliance failures and false confidence.
- Skipping cross-region latency validation: Synchronous replication across regions adds at least one inter-region round trip to every commit, which is often unacceptable for write-heavy workloads. Asynchronous replication introduces data loss windows. Teams frequently deploy cross-region standby nodes without modeling network jitter, packet loss, or DNS TTL propagation delays. Validate failover under degraded network conditions.
- Mixing logical and physical replication without boundaries: Logical replication supports cross-version upgrades and selective table sync but does not carry schema changes (DDL), so it cannot rebuild system catalogs or extensions. Physical replication captures full cluster state but requires the same major version on both sides, which rules it out for major-version upgrades. Using both simultaneously without clear separation causes slot conflicts, lag spikes, and inconsistent promotion behavior.
- Neglecting credential rotation during recovery: Replication connections (primary_conninfo), connection poolers, and monitoring agents often hardcode credentials. During failover, rotated secrets invalidate those connections, causing promotion to hang or fail. Implement dynamic credential injection (e.g., HashiCorp Vault, AWS Secrets Manager) with short TTLs and automatic renewal.
- Over-relying on managed service abstractions: Cloud providers handle infrastructure redundancy, but application-level recovery remains your responsibility. Managed databases do not automatically recover from logical corruption, schema drift, or extension incompatibility. Understand the underlying replication mechanics; do not treat the console toggle as a recovery strategy.
- Failing to test DNS/TTL propagation: Promotion completes quickly; routing updates do not. High TTL values delay client redirection, causing connection storms to the old primary. Use low TTLs (30 to 60 seconds) for database endpoints, implement connection retry logic with exponential backoff, and validate routing updates during drills.
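The retry logic mentioned in the last pitfall can be sketched as a delay schedule with exponential growth, a cap, and full jitter so that many clients do not reconnect in lockstep. `backoffDelaysMs` and its default values (200 ms base, 30 s cap) are illustrative assumptions, not recommendations from a specific driver or pooler.

```typescript
// Hypothetical helper: returns the wait before each reconnect attempt, in milliseconds.
function backoffDelaysMs(
  attempts: number,
  baseMs = 200,
  capMs = 30000,
  random: () => number = Math.random, // injectable so the schedule can be tested deterministically
): number[] {
  const delays: number[] = [];
  for (let i = 0; i < attempts; i++) {
    // The ceiling doubles on every attempt until it reaches the cap.
    const ceiling = Math.min(capMs, baseMs * 2 ** i);
    // Full jitter: pick a delay between 0 and the current ceiling.
    delays.push(Math.floor(random() * ceiling));
  }
  return delays;
}
```

A client would sleep for `delays[i]` before attempt `i` and give up, or alert, once the list is exhausted.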
Best Practice: Run recovery drills quarterly on isolated infrastructure. Measure actual RTO/RPO, document failure modes, and update runbooks. Recovery is a muscle, not a configuration.
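Measuring a drill comes down to three timestamps: when the primary was taken offline, the commit time of the newest transaction that survived recovery, and when clients could write again. The sketch below turns those into measured RPO and RTO and compares them with targets. `evaluateDrill` and its field names are hypothetical, and the RPO figure is an upper bound because it measures from the failure time rather than from the last commit the primary actually accepted.

```typescript
// Hypothetical drill scoring; all timestamps are epoch milliseconds.
interface DrillTimeline {
  failureAtMs: number; // when the primary was taken offline
  lastRecoveredCommitMs: number; // commit time of the newest transaction present after recovery
  serviceRestoredMs: number; // when clients could write again
}

interface RecoveryTargets {
  rpoSeconds: number;
  rtoSeconds: number;
}

function evaluateDrill(timeline: DrillTimeline, targets: RecoveryTargets) {
  // Upper bound on lost data: the window between the last surviving commit and the failure.
  const rpoSeconds = Math.max(0, (timeline.failureAtMs - timeline.lastRecoveredCommitMs) / 1000);
  // Downtime as experienced by writers.
  const rtoSeconds = (timeline.serviceRestoredMs - timeline.failureAtMs) / 1000;
  return {
    rpoSeconds,
    rtoSeconds,
    rpoMet: rpoSeconds <= targets.rpoSeconds,
    rtoMet: rtoSeconds <= targets.rtoSeconds,
  };
}
```

Recording the returned object for every drill gives the runbook a history of measured values rather than assumed ones.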
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Startup / MVP | Daily snapshots + 24h WAL retention | Low operational overhead, acceptable data loss window for non-critical workloads | Low storage, minimal compute |
| Mid-scale SaaS | WAL archiving + PITR + async standby | Deterministic recovery, balances cost and RPO/RTO for customer-facing apps | Medium storage, moderate compute |
| Financial / Compliance | Streaming replication + auto-failover + sync WAL | Zero or near-zero data loss, strict audit trails, regulatory alignment | High storage, premium compute, network costs |
| Global Multi-Region | Multi-master active-active with conflict resolution | Eliminates single-region dependency, supports geo-distributed workloads | Very high compute, complex networking, licensing |
Configuration Template
postgresql.conf (Primary)
wal_level = replica
max_wal_senders = 10
max_replication_slots = 10
archive_mode = on
archive_command = 'aws s3 cp %p s3://pg-wal-archive/%f --sse aws:kms'
archive_timeout = 60
hot_standby = on
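postgresql.conf (Standby, alongside an empty standby.signal file in the data directory)
restore_command = 'aws s3 cp s3://pg-wal-archive/%f %p'
recovery_target_timeline = 'latest'
promote_trigger_file = '/tmp/pg_promote'  # PostgreSQL 12 to 15 only; removed in 16, use pg_ctl promote or pg_promote()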
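A template like this is easy to break with a stray edit, so a pre-deployment check can confirm that the archiving settings are present before the primary is restarted. The following is a minimal sketch: `parsePgConf` and `missingArchiveSettings` are hypothetical helpers, the parser only understands simple `key = value` lines (no include directives), and it requires the settings to be stated explicitly rather than relying on server defaults.

```typescript
// Hypothetical parser for simple "key = value" lines; include directives are not handled.
function parsePgConf(text: string): { [key: string]: string } {
  const settings: { [key: string]: string } = {};
  for (const rawLine of text.split('\n')) {
    const line = rawLine.trim();
    if (line === '' || line.charAt(0) === '#') continue;
    const m = /^([A-Za-z_][A-Za-z0-9_.]*)\s*=\s*(?:'((?:[^']|'')*)'|([^\s#']+))/.exec(line);
    if (!m) continue;
    const quoted = m[2];
    const bare = m[3];
    // Inside quoted values, a doubled single quote stands for one literal quote.
    const value = quoted !== undefined ? quoted.replace(/''/g, "'") : bare !== undefined ? bare : '';
    settings[m[1].toLowerCase()] = value;
  }
  return settings;
}

// Returns a list of problems; an empty list means the archiving settings look complete.
function missingArchiveSettings(confText: string): string[] {
  const conf = parsePgConf(confText);
  const problems: string[] = [];
  if (conf['wal_level'] !== 'replica' && conf['wal_level'] !== 'logical') {
    problems.push('wal_level must be replica or logical');
  }
  if (conf['archive_mode'] !== 'on' && conf['archive_mode'] !== 'always') {
    problems.push('archive_mode must be on or always');
  }
  if (!conf['archive_command']) problems.push('archive_command must be set');
  return problems;
}
```

Running the check in CI against the rendered configuration catches a disabled `archive_mode` before it silently stops WAL shipping.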
TypeScript Recovery Runner
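import { execFileSync } from 'child_process';
import { appendFileSync, writeFileSync } from 'fs';

const PGDATA = '/var/lib/postgresql/data';

// baseBackupDir must hold a base backup taken before targetTime; PGDATA must be empty and the server stopped.
export async function executePITR(targetTime: string, bucket: string, baseBackupDir = '/var/lib/pg-backup/base') {
  // 1. Restore base backup (pg_basebackup would take a new backup of a live server,
  //    which cannot be rolled back to an earlier point in time)
  execFileSync('cp', ['-a', `${baseBackupDir}/.`, PGDATA], { stdio: 'inherit' });
  // 2. Configure PITR (PostgreSQL 12+: settings go in the server configuration plus an
  //    empty recovery.signal file; a recovery.conf file is no longer accepted)
  const recoveryConf = `
restore_command = 'aws s3 cp s3://${bucket}/%f %p'
recovery_target_time = '${targetTime}'
recovery_target_action = 'promote'
`;
  appendFileSync(`${PGDATA}/postgresql.auto.conf`, recoveryConf);
  writeFileSync(`${PGDATA}/recovery.signal`, '');
  // 3. Start recovery and wait for startup to complete
  execFileSync('pg_ctl', ['start', '-w', '-D', PGDATA], { stdio: 'inherit' });
  // 4. Validate (pg_isready exits non-zero, which makes execFileSync throw, if the server is not ready)
  const check = execFileSync('pg_isready', ['-t', '30'], { encoding: 'utf8' });
  if (!check.includes('accepting connections')) {
    throw new Error('PITR validation failed');
  }
}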
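Because the target time and bucket name are interpolated into quoted configuration values, it can be worth validating both before writing anything to disk. The sketch below builds the three recovery settings only from inputs that match a conservative pattern. `buildPitrSettings` is a hypothetical helper, and the accepted timestamp shape (ISO-8601-like, optional fractional seconds and offset) is an assumption that is narrower than what PostgreSQL itself accepts.

```typescript
// Hypothetical builder: rejects inputs that could break out of the quoted settings.
function buildPitrSettings(targetTime: string, bucket: string): string {
  const timestamp = /^\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}(:?\d{2})?)?$/;
  if (!timestamp.test(targetTime)) {
    throw new Error('targetTime must look like 2024-05-01 12:30:00+00');
  }
  // S3 bucket names: 3 to 63 characters, lowercase letters, digits, dots, and hyphens.
  if (!/^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$/.test(bucket)) {
    throw new Error('bucket is not a valid S3 bucket name');
  }
  return [
    `restore_command = 'aws s3 cp s3://${bucket}/%f %p'`,
    `recovery_target_time = '${targetTime}'`,
    `recovery_target_action = 'promote'`,
    '',
  ].join('\n');
}
```

Writing the returned string with a file API, rather than through a shell `echo`, also sidesteps quoting problems caused by the single quotes inside the settings.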
Quick Start Guide
- Provision baseline infrastructure: Deploy primary and standby PostgreSQL instances in separate availability zones. Configure security groups to allow replication traffic on port 5432.
- Enable WAL archiving: Update postgresql.conf with archive_mode=on and point archive_command to an S3 bucket with versioning and object lock enabled.
- Initialize standby: Run pg_basebackup from the standby node, create an empty standby.signal file in its data directory, and configure restore_command. Start the standby and verify replication lag stays under 5 seconds.
- Deploy failover orchestrator: Install the TypeScript failover script, configure AWS credentials, and set up a cron job or Kubernetes CronJob to monitor lag and trigger promotion if the primary becomes unreachable for >30 seconds.
- Validate: Run a controlled failover drill. Measure promotion time, verify data consistency, update DNS/TTL, and document the actual RTO/RPO against targets.