achDB, and MongoDB.
Step 1: Architecture Design
Use Patroni for leader election and cluster management, backed by etcd or Consul for distributed consensus. Patroni monitors PostgreSQL health, manages streaming replication, and handles automatic failover without DNS dependency. etcd provides linearizable reads and Raft-based consensus, ensuring a single authoritative leader during network partitions.
Architecture decision rationale:
- Patroni over DNS-based failover: DNS TTL propagation introduces 30β60s delays and cannot detect split-brain. Patroni queries etcd directly, achieving sub-10s failover.
- etcd over Consul: etcd's Raft implementation guarantees linearizable reads, critical for leader election consistency. Consul's eventual consistency model can cause temporary dual-primary states under partition.
- Streaming replication over logical: Streaming operates at the WAL level, providing lower latency and stronger consistency. Logical replication introduces schema dependency and conflict resolution overhead.
Step 2: Synchronous vs Asynchronous Replication
Synchronous replication guarantees RPO=0 but introduces write latency proportional to network round-trip time to the synchronous standby. Asynchronous replication minimizes latency but risks data loss during failover.
Production recommendation: Use synchronous_standby_names with quorum mode for 3-node clusters. This allows one standby to lag without blocking writes, while maintaining zero data loss if the primary fails.
-- postgresql.conf
synchronous_commit = on
synchronous_standby_names = 'ANY 1 (pg_standby_1, pg_standby_2)'
Step 3: Application Resilience Layer (TypeScript)
Database failovers trigger connection resets. Applications must handle ECONNRESET, ETIMEDOUT, and CONNECTION_CLOSED gracefully. Implement exponential backoff with jitter, circuit breaking, and connection draining.
import { Pool, PoolClient } from 'pg';
class DatabaseResilienceLayer {
private pool: Pool;
private circuitBreaker: { failures: number; threshold: number; resetTime: number; lastFailure: number };
private readonly MAX_RETRIES = 3;
private readonly BASE_DELAY = 500;
constructor(config: any) {
this.pool = new Pool(config);
this.circuitBreaker = { failures: 0, threshold: 5, resetTime: 30000, lastFailure: 0 };
}
private async executeWithRetry<T>(query: () => Promise<T>, attempt = 0): Promise<T> {
if (this.isCircuitOpen()) throw new Error('Circuit breaker open: database unreachable');
try {
return await query();
} catch (err: any) {
const isRetryable = err.code === 'ECONNRESET' || err.code === 'ETIMEDOUT' || err.message.includes('connection closed');
if (isRetryable && attempt < this.MAX_RETRIES) {
this.circuitBreaker.failures++;
this.circuitBreaker.lastFailure = Date.now();
const delay = this.BASE_DELAY * Math.pow(2, attempt) + Math.random() * 500;
await new Promise(res => setTimeout(res, delay));
return this.executeWithRetry(query, attempt + 1);
}
throw err;
}
}
private isCircuitOpen(): boolean {
const now = Date.now();
if (this.circuitBreaker.failures >= this.circuitBreaker.threshold) {
if (now - this.circuitBreaker.lastFailure > this.circuitBreaker.resetTime) {
this.circuitBreaker.failures = 0; // Half-open state
return false;
}
return true;
}
return false;
}
async query(text: string, params?: any[]) {
return this.executeWithRetry(() => this.pool.query(text, params));
}
async acquire(): Promise<PoolClient> {
return this.executeWithRetry(() => this.pool.connect());
}
}
Application-level resilience prevents connection storms during failover. The circuit breaker stops requests when the database is unreachable, allowing the cluster to stabilize. Exponential backoff with jitter distributes retry pressure across the cluster.
Step 4: Connection Pooling & Read/Write Splitting
Deploy PgBouncer or Odyssey in front of the database cluster. Configure server_reset_query and server_check_query to detect failed connections before they reach the application. Route reads to replicas, writes to the Patroni-managed leader endpoint.
# pgbouncer.ini
[databases]
* = host=leader-endpoint port=5432
[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 50
server_reset_query = DEALLOCATE ALL
server_check_query = SELECT 1
During failover, PgBouncer detects leader changes via health checks, closes stale connections, and establishes new ones to the promoted standby. This eliminates application-side connection timeouts.
Pitfall Guide
-
Assuming Multi-AZ Equals Zero Downtime
Multi-AZ deployments distribute nodes across failure domains but do not eliminate failover latency. DNS propagation, connection draining, and leader election timeouts still occur. Mitigation: Implement health-check-based routing (e.g., HAProxy, Envoy) instead of DNS TTL-dependent failover.
-
Misconfiguring Synchronous Replication Under Network Jitter
Forcing synchronous replication on unstable networks causes write stalls. The primary blocks until the standby acknowledges WAL flush. Mitigation: Use quorum sync mode, monitor pg_stat_replication.sync_state, and set explicit synchronous_commit thresholds based on network SLAs.
-
Ignoring Connection Pool Exhaustion During Failover
When the leader steps down, all active connections drop. If the application retries instantly without backoff, the new leader receives a connection storm, triggering OOM or max_connections limits. Mitigation: Enforce exponential backoff, limit retry concurrency, and configure PgBouncer max_db_connections with headroom.
-
Skipping Controlled Failover Drills
Theoretical HA configurations rarely match production behavior. Unplanned failovers expose configuration drift, missing permissions, and monitoring gaps. Mitigation: Schedule monthly chaos tests using patronictl switchover. Verify RTO/RPO metrics post-failover.
-
Overloading Read Replicas for Analytical Queries
Heavy analytical workloads on replicas increase replication lag. If lag exceeds the application's tolerance, read-your-writes consistency breaks, and failover promotion may restore stale data. Mitigation: Isolate analytical workloads to dedicated read nodes or materialized views. Monitor pg_replication_lag and alert at >5s.
-
Treating Backups as High Availability
Backups enable disaster recovery, not high availability. Restore operations take minutes to hours and cannot satisfy RTO < 30s. Mitigation: Maintain separate DR pipelines with cross-region replication. HA handles node failures; DR handles region/catastrophic failures.
-
Application-Level Silent Failures
Catching database errors and returning empty responses or cached data masks availability degradation. Users experience stale data without alerts. Mitigation: Fail fast. Return explicit 503 Service Unavailable when the circuit breaker trips. Route to fallback endpoints only after explicit SLA violation.
Best practice from production: Implement a dedicated HA health endpoint that reports replication lag, leader status, and connection pool saturation. Integrate this into your alerting pipeline. HA is observable or it is broken.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|
| Startup MVP / Low Traffic | Managed Cloud HA (RDS/Cloud SQL) | Abstracts replication, reduces ops overhead, sufficient for <1k RPS | Low infrastructure cost, higher vendor lock-in |
| Financial Transactions / Strict RPO=0 | Active-Passive Streaming + Synchronous Quorum | Guarantees zero data loss, predictable failover, explicit consistency | Medium compute cost, requires 3+ nodes |
| Global Read-Heavy / Low Write Latency | Active-Active Logical + Conflict Resolution | Enables multi-region writes, reduces read latency | High operational complexity, requires conflict handling |
| Regulatory Compliance / Audit Requirements | Active-Passive + Cross-Region DR Pipeline | Separates HA from DR, maintains immutable audit trails | High storage/network cost, justified by compliance |
Configuration Template
# docker-compose.yml for Patroni + etcd + PostgreSQL HA
version: '3.8'
services:
etcd1:
image: bitnami/etcd:3.5
environment:
- ETCD_NAME=etcd1
- ETCD_INITIAL_CLUSTER=etcd1=http://etcd1:2380,etcd2=http://etcd2:2380,etcd3=http://etcd3:2380
- ETCD_INITIAL_CLUSTER_STATE=new
ports: ["2379:2379", "2380:2380"]
etcd2:
image: bitnami/etcd:3.5
environment:
- ETCD_NAME=etcd2
- ETCD_INITIAL_CLUSTER=etcd1=http://etcd1:2380,etcd2=http://etcd2:2380,etcd3=http://etcd3:2380
- ETCD_INITIAL_CLUSTER_STATE=new
ports: ["2381:2379", "2382:2380"]
etcd3:
image: bitnami/etcd:3.5
environment:
- ETCD_NAME=etcd3
- ETCD_INITIAL_CLUSTER=etcd1=http://etcd1:2380,etcd2=http://etcd2:2380,etcd3=http://etcd3:2380
- ETCD_INITIAL_CLUSTER_STATE=new
ports: ["2383:2379", "2384:2380"]
patroni1:
image: zalando/patroni:latest
environment:
- PATRONI_NAME=pg1
- PATRONI_POSTGRESQL_DATA_DIR=/home/postgres/pgdata
- PATRONI_POSTGRESQL_LISTEN=0.0.0.0:5432
- PATRONI_ETCD3_HOSTS=etcd1:2379,etcd2:2379,etcd3:2379
- PATRONI_SCOPE=ha_cluster
- PATRONI_RESTAPI_CONNECT_ADDRESS=patroni1:8008
ports: ["5432:5432", "8008:8008"]
depends_on: [etcd1, etcd2, etcd3]
volumes:
- pg1_data:/home/postgres/pgdata
patroni2:
image: zalando/patroni:latest
environment:
- PATRONI_NAME=pg2
- PATRONI_POSTGRESQL_DATA_DIR=/home/postgres/pgdata
- PATRONI_POSTGRESQL_LISTEN=0.0.0.0:5432
- PATRONI_ETCD3_HOSTS=etcd1:2379,etcd2:2379,etcd3:2379
- PATRONI_SCOPE=ha_cluster
- PATRONI_RESTAPI_CONNECT_ADDRESS=patroni2:8008
ports: ["5433:5432", "8009:8008"]
depends_on: [etcd1, etcd2, etcd3]
volumes:
- pg2_data:/home/postgres/pgdata
volumes:
pg1_data:
pg2_data:
-- pg_hba.conf snippet for replication
host replication replicator 10.0.0.0/8 scram-sha-256
host all all 10.0.0.0/8 scram-sha-256
Quick Start Guide
- Deploy etcd cluster: Run the etcd services from the template. Verify quorum with
etcdctl endpoint health --cluster.
- Initialize Patroni nodes: Start
patroni1 and patroni2. Patroni will automatically initialize PostgreSQL, configure streaming replication, and register with etcd.
- Validate leader election: Run
patronictl -c patroni.yml list ha_cluster. Confirm one node shows Leader and the other Replica.
- Test failover: Execute
patronictl switchover ha_cluster --candidate pg2. Verify leadership transfers within 5β10 seconds and replication resumes.
- Connect application: Point your TypeScript resilience layer to the Patroni REST API endpoint or load balancer. Monitor
pg_stat_replication and application retry metrics.