Resilience Patterns
📖 Concept
In distributed systems, failures are inevitable: networks partition, servers crash, and services get overloaded. Resilience patterns help your system degrade gracefully instead of failing completely.
Key Patterns
1. Circuit Breaker
Like an electrical circuit breaker — stops calling a failing service to prevent cascading failures.
States: CLOSED (normal) → OPEN (failing, stop calling) → HALF-OPEN (test with one request; success closes the circuit again, failure re-opens it)
2. Retry with Exponential Backoff
Retry failed requests with increasing delays (1s → 2s → 4s → 8s), usually with random jitter added so many clients don't retry in lockstep.
3. Timeout
Set maximum time to wait for a response. Without timeouts, slow services can exhaust all threads/connections.
4. Bulkhead
Isolate failures by limiting resources per component. If the search service is slow, it shouldn't consume all connections needed by the checkout service.
5. Rate Limiting
Limit the number of requests a client can make. Protects against abuse and cascading overload.
6. Fallback
When a service fails, return a degraded but acceptable response (cached data, default values, simplified functionality).
7. Health Checks
Proactively detect unhealthy services and remove them from load balancing.
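Of the patterns above, rate limiting (pattern 5) is commonly implemented as a token bucket. A minimal sketch, assuming an in-process limiter; the class name and refill math are illustrative, not from any particular library:

```javascript
// Token bucket: allow bursts up to `capacity`, refill at `refillRate` tokens/sec
class TokenBucket {
  constructor(capacity, refillRate) {
    this.capacity = capacity;
    this.refillRate = refillRate;
    this.tokens = capacity;
    this.lastRefill = Date.now();
  }

  tryAcquire() {
    const now = Date.now();
    // Top up tokens based on elapsed time, capped at capacity
    this.tokens = Math.min(
      this.capacity,
      this.tokens + ((now - this.lastRefill) / 1000) * this.refillRate
    );
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;  // request allowed
    }
    return false;   // request rejected (e.g. respond 429 Too Many Requests)
  }
}

const bucket = new TokenBucket(3, 1); // burst of 3, 1 request/sec sustained
const results = [1, 2, 3, 4].map(() => bucket.tryAcquire());
console.log(results); // first 3 allowed, 4th rejected
```

Distributed deployments usually move the bucket state into a shared store (e.g. Redis) so all instances enforce one limit.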
Interview tip: When designing any distributed system, mention circuit breakers and timeouts. They show you understand real-world failure scenarios.
💻 Code Example
```javascript
// ============================================
// Resilience Patterns — Implementation
// ============================================

// ---------- Circuit Breaker ----------
class CircuitBreaker {
  constructor(options = {}) {
    this.failureThreshold = options.failureThreshold || 5;
    this.resetTimeout = options.resetTimeout || 30000;
    this.state = 'CLOSED';
    this.failureCount = 0;
    this.lastFailureTime = null;
  }

  async call(fn) {
    if (this.state === 'OPEN') {
      if (Date.now() - this.lastFailureTime > this.resetTimeout) {
        this.state = 'HALF_OPEN';
        console.log('Circuit HALF-OPEN: testing...');
      } else {
        throw new Error('Circuit OPEN: request blocked');
      }
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  onSuccess() {
    this.failureCount = 0;
    if (this.state === 'HALF_OPEN') {
      this.state = 'CLOSED';
      console.log('Circuit CLOSED: service recovered');
    }
  }

  onFailure() {
    this.failureCount++;
    this.lastFailureTime = Date.now();
    // A failed probe in HALF_OPEN re-opens the circuit immediately
    if (this.state === 'HALF_OPEN' || this.failureCount >= this.failureThreshold) {
      this.state = 'OPEN';
      console.log(`Circuit OPEN after ${this.failureCount} failures`);
    }
  }
}

// ---------- Retry with Exponential Backoff ----------
async function retryWithBackoff(fn, maxRetries = 3, baseDelay = 1000) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt === maxRetries) throw error;
      // Exponential delay plus random jitter to avoid retry stampedes
      const delay = baseDelay * Math.pow(2, attempt) + Math.random() * 1000;
      console.log(`Retry ${attempt + 1}/${maxRetries} in ${Math.round(delay)}ms`);
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
}

// ---------- Bulkhead (Concurrency Isolation) ----------
class Bulkhead {
  constructor(maxConcurrent) {
    this.maxConcurrent = maxConcurrent;
    this.running = 0;
    this.queue = [];
  }

  async execute(fn) {
    if (this.running >= this.maxConcurrent) {
      // At capacity: queue the task until a slot frees up
      return new Promise((resolve, reject) => {
        this.queue.push({ fn, resolve, reject });
      });
    }
    return this.runTask(fn);
  }

  async runTask(fn) {
    this.running++;
    try {
      return await fn();
    } finally {
      this.running--;
      if (this.queue.length > 0) {
        const next = this.queue.shift();
        this.runTask(next.fn).then(next.resolve).catch(next.reject);
      }
    }
  }
}

// ---------- Resilient Service Client ----------
class ResilientClient {
  constructor(serviceName) {
    this.serviceName = serviceName;
    this.circuitBreaker = new CircuitBreaker({ failureThreshold: 3, resetTimeout: 10000 });
    this.bulkhead = new Bulkhead(10);
    this.timeout = 5000;
  }

  async call(url, options) {
    // Layer the patterns: bulkhead → circuit breaker → timeout
    return this.bulkhead.execute(() =>
      this.circuitBreaker.call(() =>
        this.fetchWithTimeout(url, options)
      )
    );
  }

  async fetchWithTimeout(url, options) {
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), this.timeout);
    try {
      const response = await fetch(url, { ...options, signal: controller.signal });
      if (!response.ok) throw new Error(`HTTP ${response.status}`); // count HTTP errors as failures
      return response.json();
    } finally {
      clearTimeout(timer);
    }
  }
}

// Demo: three failures open the circuit; later requests are blocked
const cb = new CircuitBreaker({ failureThreshold: 3 });
async function demo() {
  for (let i = 0; i < 5; i++) {
    try {
      await cb.call(async () => { throw new Error('Service down'); });
    } catch (e) {
      console.log(`Request ${i + 1}: ${e.message}`);
    }
  }
}
demo();
```
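The client above stops at the circuit breaker; fallback (pattern 6) can be layered on top. A minimal sketch, with a simulated outage and an in-memory cache standing in for a real cache tier; the function and key names are illustrative:

```javascript
// Fallback: try the primary call, degrade to cached/default data on failure
async function withFallback(primary, fallback) {
  try {
    return await primary();
  } catch (error) {
    console.log(`Primary failed (${error.message}), using fallback`);
    return fallback(error);
  }
}

// Usage: serve stale cached recommendations when the service is down
const cache = new Map([['recs:user42', ['top-seller-1', 'top-seller-2']]]);

async function getRecommendations(userId) {
  return withFallback(
    async () => { throw new Error('recommendation service down'); }, // simulated outage
    () => cache.get(`recs:${userId}`) ?? []  // degraded but acceptable response
  );
}

getRecommendations('user42').then(recs => console.log(recs));
// → ['top-seller-1', 'top-seller-2']
```

The key design decision is what "degraded but acceptable" means per endpoint: stale cache for recommendations, a default for feature flags, an explicit error only for operations that must not guess (like payments).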
🏋️ Practice Exercise
Circuit Breaker Design: Implement a circuit breaker for a payment gateway that: opens after 3 failures in 60 seconds, checks health every 30 seconds in half-open state, and logs all state transitions.
Retry Strategy: Design retry policies for: (a) Payment processing (idempotent), (b) Email sending (non-critical), (c) Database writes (transient failures), (d) Third-party API calls (rate limited).
Bulkhead Architecture: Design bulkhead isolation for a service that calls 4 downstream services. How many connections per bulkhead? What happens when one bulkhead is full?
Graceful Degradation: Design fallback strategies for when the recommendation service is down: What does the user see? How does the system behave? When does it recover?
⚠️ Common Mistakes
Not setting timeouts — without timeouts, a hanging downstream service consumes threads/connections indefinitely, eventually crashing the caller. Always set explicit timeouts.
Retrying non-idempotent operations — retrying a payment charge without idempotency keys could charge the customer twice. Ensure operations are idempotent before adding retries.
Circuit breaker threshold too sensitive — if the breaker opens after 1 failure, a single network blip causes minutes of disruption. Set thresholds based on error rate, not absolute count.
No fallback when circuit is open — returning a 503 error is better than cascading failure, but returning cached/default data is better than an error when possible.