Database Replication


📖 Concept

Database replication is the practice of maintaining copies of a database on multiple servers. It serves two purposes: high availability (if one server dies, others take over) and read scalability (distribute reads across multiple servers).

Replication Topologies

1. Single Leader (Primary-Replica)

One server (primary) handles all writes. Changes are replicated to one or more replicas (read-only).

Client Writes → [Primary] → replicates → [Replica 1] ← Client Reads
                                        → [Replica 2] ← Client Reads
                                        → [Replica 3] ← Client Reads

Pros: Simple, consistent writes, easy to understand
Cons: Write bottleneck (single primary), failover complexity

2. Multi-Leader

Multiple servers can accept writes. Changes are replicated between all leaders.

Pros: Write scalability, multi-region support
Cons: Conflict resolution is HARD (two leaders update the same row differently)

3. Leaderless (Dynamo-style)

No leader — any node can accept reads and writes. Uses quorum for consistency.

Pros: High availability, no single point of failure
Cons: Complex conflict resolution, eventual consistency
Used by: Cassandra, DynamoDB, Riak
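
The quorum condition that makes leaderless reads safe is w + r > n: every read set must overlap every write set in at least one node. A minimal sketch of the arithmetic for a hypothetical 5-node cluster:

```javascript
// Majority quorum: with n nodes, floor(n/2) + 1 acknowledgments
// guarantees any two quorums share at least one node.
function majorityQuorum(n) {
  return Math.floor(n / 2) + 1;
}

const n = 5;
const w = majorityQuorum(n); // 3 — a write must be acknowledged by 3 nodes
const r = majorityQuorum(n); // 3 — a read must contact 3 nodes
console.log(w + r > n);      // true → some node in any read set saw the latest write
```

Smaller r (faster reads) forces a larger w, and vice versa; w + r ≤ n gives you eventual consistency only.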

Replication Methods

| Method | How It Works | Lag | Consistency |
|---|---|---|---|
| Synchronous | Primary waits for ALL replicas to confirm | None | Strong |
| Asynchronous | Primary doesn't wait for replicas | Milliseconds to seconds | Eventual |
| Semi-synchronous | Primary waits for ONE replica, rest are async | Minimal | Near-strong |
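
The three acknowledgment strategies can be sketched in a few lines. This is a simplified model, assuming hypothetical replica objects whose async `apply(change)` resolves once the change is durable on that replica:

```javascript
// Synchronous: wait for ALL replicas → zero lag, but the slowest
// replica gates every write.
async function syncWrite(replicas, change) {
  await Promise.all(replicas.map(r => r.apply(change)));
}

// Asynchronous: don't wait at all → fast writes, but replicas lag
// behind and recent writes can be lost if the primary crashes.
function asyncWrite(replicas, change) {
  replicas.forEach(r => r.apply(change).catch(() => {})); // fire and forget
}

// Semi-synchronous: wait for the FIRST replica to confirm; the rest
// catch up asynchronously. (Promise.any resolves on first success.)
async function semiSyncWrite(replicas, change) {
  await Promise.any(replicas.map(r => r.apply(change)));
}
```

Semi-sync guarantees at least one replica has the write before the client is acknowledged, which bounds data loss during failover.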

Replication Lag Problem

With asynchronous replication, there is a delay between a write committing on the primary and that write becoming visible on the replicas. This causes:

  • Read-your-own-write inconsistency: User updates profile → reads from replica → sees old data
  • Causal inconsistency: User B sees User A's comment but not the post it refers to

Solutions:

  1. Read-after-write consistency: Route reads to primary for recently-written data
  2. Monotonic reads: Route a user to the same replica consistently (sticky routing)
  3. Causal consistency: Track dependencies and ensure replicas apply changes in order

Interview tip: When you mention read replicas, always acknowledge replication lag and how you'd handle it for your specific use case.
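
Monotonic reads (solution 2 above) can be sketched with simple sticky routing: hash the user ID to a fixed replica index so a given user never hops between replicas with different lag and sees time "go backwards". The hash function here is a hypothetical illustration, not a production choice:

```javascript
// Sticky routing: the same user always maps to the same replica,
// so their successive reads never regress to a more-lagged replica.
function pickReplica(userId, replicas) {
  let hash = 0;
  for (const ch of String(userId)) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // simple 32-bit string hash
  }
  return replicas[hash % replicas.length];
}
```

Note the trade-off: if that replica dies or the replica set changes, the user may still observe a consistency jump; consistent hashing reduces how many users are remapped.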

💻 Code Example

```javascript
// ============================================
// Database Replication — Patterns & Pitfalls
// ============================================

// ---------- Read-Write Splitting ----------

class ReplicatedDatabase {
  constructor(primary, replicas) {
    this.primary = primary;   // Write + strong-consistency reads
    this.replicas = replicas; // Read-only, eventually consistent
    this.currentReplicaIndex = 0;
  }

  // ✅ Route writes to primary
  async write(query, params) {
    console.log(`[WRITE] → Primary: ${this.primary.host}`);
    return await this.primary.pool.query(query, params);
  }

  // ✅ Route reads to replicas (round-robin)
  async read(query, params) {
    const replica = this.getNextReplica();
    console.log(`[READ] → Replica: ${replica.host}`);
    return await replica.pool.query(query, params);
  }

  // ✅ Strong read — when you MUST read your own write
  async readStrong(query, params) {
    console.log(`[STRONG READ] → Primary: ${this.primary.host}`);
    return await this.primary.pool.query(query, params);
  }

  getNextReplica() {
    const replica = this.replicas[this.currentReplicaIndex];
    this.currentReplicaIndex = (this.currentReplicaIndex + 1) % this.replicas.length;
    return replica;
  }
}

// ---------- Read-After-Write Consistency ----------

class ConsistentReadRouter {
  constructor(db) {
    this.db = db;
    // Track recent writes per user (use Redis with a TTL in production)
    this.recentWrites = new Map(); // userId → timestamp
  }

  async handleWrite(userId, query, params) {
    const result = await this.db.write(query, params);
    // Record that this user just wrote data
    this.recentWrites.set(userId, Date.now());
    return result;
  }

  async handleRead(userId, query, params) {
    const lastWrite = this.recentWrites.get(userId);
    const replicationLagMs = 2000; // Assume 2 seconds max replication lag

    if (lastWrite && Date.now() - lastWrite < replicationLagMs) {
      // User wrote recently — read from primary to see their own write
      console.log(`[READ] User ${userId} wrote recently → routing to PRIMARY`);
      return await this.db.readStrong(query, params);
    }

    // No recent write — safe to read from a replica
    console.log(`[READ] User ${userId} no recent writes → routing to REPLICA`);
    return await this.db.read(query, params);
  }
}

// ---------- Failover Manager ----------

class FailoverManager {
  constructor(primary, replicas) {
    this.primary = primary;
    this.replicas = replicas;
    this.healthCheckInterval = 5000; // Check every 5 seconds
    this.failoverThreshold = 3;      // 3 missed heartbeats → failover
    this.missedHeartbeats = 0;
  }

  startHealthChecks() {
    setInterval(async () => {
      try {
        await this.primary.pool.query('SELECT 1'); // Heartbeat
        this.missedHeartbeats = 0;
      } catch (error) {
        this.missedHeartbeats++;
        console.warn(
          `⚠️ Primary heartbeat missed (${this.missedHeartbeats}/${this.failoverThreshold})`
        );

        if (this.missedHeartbeats >= this.failoverThreshold) {
          await this.performFailover();
        }
      }
    }, this.healthCheckInterval);
  }

  async performFailover() {
    console.log('🚨 PRIMARY DOWN — Initiating failover...');

    // 1. Select the most up-to-date replica
    let bestReplica = null;
    let minLag = Infinity;

    for (const replica of this.replicas) {
      try {
        const lag = await this.getReplicationLag(replica);
        if (lag < minLag) {
          minLag = lag;
          bestReplica = replica;
        }
      } catch (e) {
        console.log(`Replica ${replica.host} is also down`);
      }
    }

    if (!bestReplica) {
      console.error('🚨 ALL REPLICAS DOWN — FULL OUTAGE');
      return;
    }

    // 2. Promote the best replica to primary (PostgreSQL 12+)
    console.log(`✅ Promoting ${bestReplica.host} to PRIMARY (lag: ${minLag}ms)`);
    await bestReplica.pool.query('SELECT pg_promote()');

    // 3. Update routing. In production, also fence the old primary so it
    //    can't come back and accept writes ("split brain").
    this.primary = bestReplica;
    this.replicas = this.replicas.filter(r => r !== bestReplica);

    // 4. Notify other replicas to follow the new primary
    for (const replica of this.replicas) {
      console.log(`Reconfiguring ${replica.host} to follow new primary`);
      // In production: update the replica's replication settings and restart streaming
    }

    console.log('✅ Failover complete');
  }

  async getReplicationLag(replica) {
    // PostgreSQL: ask the replica how far its replay is behind.
    // (pg_stat_replication only exists on the primary, which is down here.)
    const result = await replica.pool.query(
      `SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) * 1000 AS lag_ms`
    );
    return result.rows[0]?.lag_ms || 0;
  }
}

// ---------- Quorum Reads/Writes (Leaderless) ----------

class QuorumDatabase {
  constructor(nodes) {
    this.nodes = nodes;                            // All database nodes
    this.n = nodes.length;                         // Total nodes
    this.writeQuorum = Math.floor(this.n / 2) + 1; // Write majority
    this.readQuorum = Math.floor(this.n / 2) + 1;  // Read majority
    // w + r > n ensures overlap → at least one node has the latest data
  }

  async quorumWrite(key, value) {
    const writePromises = this.nodes.map(node =>
      node.write(key, value).catch(err => ({ error: err }))
    );

    const results = await Promise.all(writePromises);
    // Optional chaining: a successful write may resolve to undefined
    const successes = results.filter(r => !r?.error);

    if (successes.length >= this.writeQuorum) {
      console.log(`✅ Write succeeded (${successes.length}/${this.n} nodes)`);
      return true;
    } else {
      console.log(`❌ Write failed (${successes.length} acks, ${this.writeQuorum} required)`);
      return false;
    }
  }

  async quorumRead(key) {
    const readPromises = this.nodes.map(node =>
      node.read(key).catch(err => ({ error: err }))
    );

    const results = await Promise.all(readPromises);
    const successes = results.filter(r => !r?.error);

    if (successes.length >= this.readQuorum) {
      // Return the value with the highest version/timestamp
      const latest = successes.reduce((a, b) =>
        a.version > b.version ? a : b
      );
      return latest.value;
    }
    throw new Error('Read quorum not met');
  }
}

console.log("Replication patterns demonstrated.");
```

🏋️ Practice Exercise

  1. Read Replica Design: Your PostgreSQL primary handles 5K writes/sec and 50K reads/sec. Design a read replica architecture that distributes the read load. How many replicas do you need? How do you handle replication lag?

  2. Failover Scenario: Your primary database goes down. Walk through the exact steps of a manual failover, then design an automatic failover system. What data could be lost during failover?

  3. Multi-Region Replication: Design a replication strategy for a global application with users in US, EU, and Asia. How do you handle: (a) read latency, (b) write conflicts, (c) data residency (GDPR)?

  4. Quorum Calculator: For a 5-node leaderless system, calculate the minimum W (write) and R (read) quorum sizes for: (a) strong consistency, (b) eventual consistency, (c) if one node is permanently down.

  5. Replication Lag Measurement: Design a monitoring system that detects when replication lag exceeds 5 seconds and automatically routes reads to the primary until lag recovers.

⚠️ Common Mistakes

  • Assuming replicas are always consistent with the primary — asynchronous replication has lag. Reading from a replica immediately after writing to the primary may return stale data.

  • Not planning for failover — when the primary dies, promoting a replica isn't automatic. You need automated failover with proper health checks, or you risk extended downtime.

  • Using replicas without monitoring replication lag — if lag grows to minutes or hours (due to long-running queries or heavy writes), replicas return severely outdated data. Always monitor and alert on lag.

  • Forgetting that writes during failover may be lost — asynchronous replicas may not have received the last few writes before the primary crashed. This data is gone. For zero data loss, use synchronous replication (at the cost of write latency).
