Resilience Patterns


📖 Concept

In distributed systems, failures are inevitable. Network issues, server crashes, and service overloads are normal. Resilience patterns help your system survive these failures gracefully.

Key Patterns

1. Circuit Breaker

Like an electrical circuit breaker — stops calling a failing service to prevent cascading failures.

States: CLOSED (normal) → OPEN (failing, stop calling) → HALF-OPEN (allow a trial request; success closes the circuit, failure reopens it)

2. Retry with Exponential Backoff

Retry failed requests with exponentially increasing delays (e.g. 1s → 2s → 4s → 8s), usually with random jitter so that many clients don't retry in lockstep.

3. Timeout

Set maximum time to wait for a response. Without timeouts, slow services can exhaust all threads/connections.

4. Bulkhead

Isolate failures by limiting resources per component. If the search service is slow, it shouldn't consume all connections needed by the checkout service.

5. Rate Limiting

Limit the number of requests a client can make. Protects against abuse and cascading overload.
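A minimal sketch of one common approach, a token-bucket limiter (the class and parameter names here are illustrative, not from a library):

```javascript
// Token bucket: allows short bursts up to `capacity`, then throttles to
// a sustained rate of `refillPerSecond` requests per second.
class TokenBucket {
  constructor(capacity, refillPerSecond) {
    this.capacity = capacity;
    this.tokens = capacity;
    this.refillPerSecond = refillPerSecond;
    this.lastRefill = Date.now();
  }

  allow() {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000;
    // Refill tokens based on elapsed time, capped at capacity
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillPerSecond);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;   // request admitted
    }
    return false;    // request rejected (caller should return 429 or queue)
  }
}

const limiter = new TokenBucket(5, 1); // burst of 5, then 1 req/sec sustained
```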

6. Fallback

When a service fails, return a degraded but acceptable response (cached data, default values, simplified functionality).
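A minimal fallback wrapper sketch: try the primary call and, on any failure, serve a degraded-but-usable response instead. The function names and the simulated outage below are illustrative:

```javascript
// Wrap a primary call with a fallback that runs on any failure.
async function withFallback(primary, fallback) {
  try {
    return await primary();
  } catch (error) {
    console.log(`Primary failed (${error.message}); serving fallback`);
    return fallback();
  }
}

// Usage: serve cached recommendations when the live service is down.
async function getRecommendations() {
  return withFallback(
    () => Promise.reject(new Error('Service down')), // simulated outage
    () => ['cached-item-1', 'cached-item-2']         // stale but acceptable
  );
}
```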

7. Health Checks

Proactively detect unhealthy services and remove them from load balancing.
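A sketch of an active health-check loop, assuming the caller supplies a per-instance check function (in practice this would hit each instance's `/health` endpoint); class names and intervals are illustrative:

```javascript
// Periodically probe each instance and keep only healthy ones in the
// routing pool; a load balancer would pick from healthyInstances().
class HealthChecker {
  constructor(instances, checkFn, intervalMs = 10000) {
    this.instances = instances;        // e.g. ['http://svc-1', 'http://svc-2']
    this.checkFn = checkFn;            // async (instance) => boolean
    this.intervalMs = intervalMs;
    this.healthy = new Set(instances); // optimistically healthy at start
  }

  async runOnce() {
    for (const instance of this.instances) {
      const ok = await this.checkFn(instance).catch(() => false);
      if (ok) this.healthy.add(instance);
      else this.healthy.delete(instance);   // removed from load balancing
    }
  }

  start() {
    this.timer = setInterval(() => this.runOnce(), this.intervalMs);
  }

  stop() {
    clearInterval(this.timer);
  }

  healthyInstances() {
    return [...this.healthy];
  }
}
```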

Interview tip: When designing any distributed system, mention circuit breakers and timeouts. They show you understand real-world failure scenarios.

💻 Code Example

// ============================================
// Resilience Patterns — Implementation
// ============================================

// ---------- Circuit Breaker ----------
class CircuitBreaker {
  constructor(options = {}) {
    this.failureThreshold = options.failureThreshold || 5;
    this.resetTimeout = options.resetTimeout || 30000;
    this.state = 'CLOSED';
    this.failureCount = 0;
    this.lastFailureTime = null;
  }

  async call(fn) {
    if (this.state === 'OPEN') {
      if (Date.now() - this.lastFailureTime > this.resetTimeout) {
        this.state = 'HALF_OPEN';
        console.log('Circuit HALF-OPEN: testing...');
      } else {
        throw new Error('Circuit OPEN: request blocked');
      }
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  onSuccess() {
    this.failureCount = 0;
    if (this.state === 'HALF_OPEN') {
      this.state = 'CLOSED';
      console.log('Circuit CLOSED: service recovered');
    }
  }

  onFailure() {
    this.failureCount++;
    this.lastFailureTime = Date.now();
    if (this.failureCount >= this.failureThreshold) {
      this.state = 'OPEN';
      console.log(`Circuit OPEN after ${this.failureCount} failures`);
    }
  }
}

// ---------- Retry with Exponential Backoff ----------
async function retryWithBackoff(fn, maxRetries = 3, baseDelay = 1000) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt === maxRetries) throw error;
      // Exponential delay plus random jitter to avoid synchronized retries
      const delay = baseDelay * Math.pow(2, attempt) + Math.random() * 1000;
      console.log(`Retry ${attempt + 1}/${maxRetries} in ${Math.round(delay)}ms`);
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
}

// ---------- Bulkhead (Concurrency Isolation) ----------
class Bulkhead {
  constructor(maxConcurrent) {
    this.maxConcurrent = maxConcurrent;
    this.running = 0;
    this.queue = [];
  }

  async execute(fn) {
    if (this.running >= this.maxConcurrent) {
      return new Promise((resolve, reject) => {
        this.queue.push({ fn, resolve, reject });
      });
    }
    return this.runTask(fn);
  }

  async runTask(fn) {
    this.running++;
    try {
      return await fn();
    } finally {
      this.running--;
      if (this.queue.length > 0) {
        const next = this.queue.shift();
        this.runTask(next.fn).then(next.resolve).catch(next.reject);
      }
    }
  }
}

// ---------- Resilient Service Client ----------
class ResilientClient {
  constructor(serviceName) {
    this.serviceName = serviceName;
    this.circuitBreaker = new CircuitBreaker({ failureThreshold: 3, resetTimeout: 10000 });
    this.bulkhead = new Bulkhead(10);
    this.timeout = 5000;
  }

  async call(url, options) {
    return this.bulkhead.execute(() =>
      this.circuitBreaker.call(() =>
        this.fetchWithTimeout(url, options)
      )
    );
  }

  async fetchWithTimeout(url, options) {
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), this.timeout);
    try {
      const response = await fetch(url, { ...options, signal: controller.signal });
      // Treat HTTP error statuses as failures so the circuit breaker sees them
      if (!response.ok) throw new Error(`HTTP ${response.status}`);
      return response.json();
    } finally {
      clearTimeout(timer);
    }
  }
}

// Demo
const cb = new CircuitBreaker({ failureThreshold: 3 });
async function demo() {
  for (let i = 0; i < 5; i++) {
    try {
      await cb.call(async () => { throw new Error('Service down'); });
    } catch (e) {
      console.log(`Request ${i + 1}: ${e.message}`);
    }
  }
}
demo();

🏋️ Practice Exercise

  1. Circuit Breaker Design: Implement a circuit breaker for a payment gateway that: opens after 3 failures in 60 seconds, checks health every 30 seconds in half-open state, and logs all state transitions.

  2. Retry Strategy: Design retry policies for: (a) Payment processing (idempotent), (b) Email sending (non-critical), (c) Database writes (transient failures), (d) Third-party API calls (rate limited).

  3. Bulkhead Architecture: Design bulkhead isolation for a service that calls 4 downstream services. How many connections per bulkhead? What happens when one bulkhead is full?

  4. Graceful Degradation: Design fallback strategies for when the recommendation service is down: What does the user see? How does the system behave? When does it recover?

⚠️ Common Mistakes

  • Not setting timeouts — without timeouts, a hanging downstream service consumes threads/connections indefinitely, eventually crashing the caller. Always set explicit timeouts.

  • Retrying non-idempotent operations — retrying a payment charge without idempotency keys could charge the customer twice. Ensure operations are idempotent before adding retries.

  • Circuit breaker threshold too sensitive — if the breaker opens after 1 failure, a single network blip causes minutes of disruption. Set thresholds based on error rate, not absolute count.

  • No fallback when circuit is open — returning a 503 error is better than cascading failure, but returning cached/default data is better than an error when possible.
