Reliability Engineering & Disaster Recovery


📖 Concept

Site Reliability Engineering (SRE) focuses on building and maintaining reliable systems at scale. Key concepts include redundancy, disaster recovery, chaos engineering, and low-risk deployment strategies.

Redundancy Levels

Level	Description	Protects Against
Server	Multiple app servers behind LB	Single server failure
Zone	Replicate across availability zones	Datacenter failure
Region	Active in multiple geographic regions	Regional disaster
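
To see why each added level of redundancy helps, a rough availability calculation can be sketched as below. It assumes replicas fail independently, which servers in the same zone often do not, and the 99% per-replica figure is purely illustrative.

```javascript
// Availability of N redundant replicas, assuming independent failures.
// The system is down only when every replica is down at the same time.
function combinedAvailability(singleAvailability, replicas) {
  const pAllDown = Math.pow(1 - singleAvailability, replicas);
  return 1 - pAllDown;
}

// Expected downtime per year (in hours) for a given availability.
function downtimeHoursPerYear(availability) {
  const HOURS_PER_YEAR = 365 * 24;
  return (1 - availability) * HOURS_PER_YEAR;
}

for (const n of [1, 2, 3]) {
  const a = combinedAvailability(0.99, n); // 0.99 is a hypothetical per-replica figure
  console.log(
    `${n} replica(s): ${(a * 100).toFixed(4)}% available, ` +
    `~${downtimeHoursPerYear(a).toFixed(2)} h/year of downtime`
  );
}
```

Under these assumptions, two 99% replicas give roughly 99.99% availability. Correlated failures (shared power, shared network, a bad deploy) reduce the benefit, which is why zone and region redundancy exist on top of server redundancy.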

Disaster Recovery Strategies

Strategy	RPO	RTO	Cost
Backup & Restore	Hours	Hours	$
Pilot Light	Minutes	Tens of minutes	$$
Warm Standby	Seconds	Minutes	$$$
Active-Active	Near zero	Near zero	$$$$

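RPO (Recovery Point Objective): How much data can you afford to lose? It is measured as time, for example "at most the last 5 minutes of writes".
RTO (Recovery Time Objective): How long can you be down before service must be restored?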
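
One way to use these two numbers is to pick the cheapest strategy that satisfies both targets. The sketch below does that. The numeric bounds attached to each strategy are illustrative assumptions (the table above is only qualitative), and `cheapestStrategy` and `worstCaseRpoSec` are hypothetical helper names.

```javascript
// Illustrative worst-case bounds in seconds; real values depend on your stack.
const STRATEGIES = [
  { name: 'Backup & Restore', rpo: 24 * 3600, rto: 24 * 3600, cost: 1 },
  { name: 'Pilot Light',      rpo: 15 * 60,   rto: 60 * 60,   cost: 2 },
  { name: 'Warm Standby',     rpo: 60,        rto: 10 * 60,   cost: 3 },
  { name: 'Active-Active',    rpo: 0,         rto: 0,         cost: 4 }, // treated as zero here
];

// Return the cheapest strategy whose worst-case RPO and RTO fit the targets,
// or null if none does.
function cheapestStrategy(targetRpoSec, targetRtoSec) {
  const fits = STRATEGIES
    .filter((s) => s.rpo <= targetRpoSec && s.rto <= targetRtoSec)
    .sort((a, b) => a.cost - b.cost);
  return fits.length > 0 ? fits[0].name : null;
}

// With periodic backups, the worst-case data loss is one full backup interval.
function worstCaseRpoSec(backupIntervalSec) {
  return backupIntervalSec;
}

console.log(cheapestStrategy(4 * 3600, 2 * 3600)); // tolerate 4 h of data loss, 2 h of downtime
console.log(cheapestStrategy(5 * 60, 15 * 60));    // tolerate 5 min of data loss, 15 min of downtime
console.log(worstCaseRpoSec(6 * 3600) / 3600);     // backups every 6 h
```

With these assumed bounds, the first call selects Pilot Light and the second selects Warm Standby. Different components of the same system usually get different targets, so the choice is typically made per component.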

Chaos Engineering

Intentionally inject failures to test system resilience:

Kill random servers (Netflix Chaos Monkey)
Simulate network failures between services
Inject latency into critical paths
Fill disk space, exhaust memory

Philosophy: If you're going to have failures in production (and you will), it's better to fail on your terms during business hours than at 3 AM.
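
The failure and latency injection described above can be sketched as a small wrapper around a dependency call. This is a minimal in-process illustration, not how Chaos Monkey itself works. `withChaos`, `withRetry`, `getPrice`, and the `errorRate`/`latencyMs` options are all hypothetical names for this example.

```javascript
// Wrap an async function so that calls are delayed and randomly fail.
// The `random` option exists so experiments can be made deterministic.
function withChaos(fn, { errorRate = 0, latencyMs = 0, random = Math.random } = {}) {
  return async function chaotic(...args) {
    if (latencyMs > 0) {
      await new Promise((resolve) => setTimeout(resolve, latencyMs));
    }
    if (random() < errorRate) {
      throw new Error('chaos: injected failure');
    }
    return fn(...args);
  };
}

// Resilience mechanism under test: retry a few times before giving up.
async function withRetry(fn, attempts = 3) {
  let lastError;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}

// Hypothetical dependency standing in for a real service call.
const getPrice = async (sku) => ({ sku, price: 42 });

// Experiment: 30% injected failures plus 50 ms of latency.
// Hypothesis: three attempts absorb most of the injected failures.
const flakyGetPrice = withChaos(getPrice, { errorRate: 0.3, latencyMs: 50 });
withRetry(() => flakyGetPrice('sku-1'))
  .then((result) => console.log('survived:', result))
  .catch((err) => console.log('experiment found a gap:', err.message));
```

A real experiment would also state the expected steady-state behavior up front and limit the blast radius, for example by enabling the wrapper for only a small share of traffic.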

Deployment Strategies

Strategy	Description	Risk
Rolling	Replace instances gradually	Moderate (rollback possible)
Blue-Green	Switch traffic from old to new instantly	Low (instant rollback)
Canary	Route small % of traffic to new version	Very low (test with real traffic)
Feature Flags	Enable features per user/group	Lowest (toggle instantly)
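
Blue-green switching from the table above can be sketched as a pointer flip between two environments. The class below is a toy in-memory model. `BlueGreenRouter` and its method names are hypothetical, and a real setup would flip a load balancer target or DNS record instead of a field.

```javascript
// Minimal blue-green router: two environments, one pointer to the live one.
class BlueGreenRouter {
  constructor(blue, green) {
    this.envs = { blue, green };
    this.live = 'blue';
  }

  // Name of the environment that is not serving traffic.
  get idle() {
    return this.live === 'blue' ? 'green' : 'blue';
  }

  // Deploy to the idle environment; live traffic is untouched.
  deployToIdle(version) {
    this.envs[this.idle].version = version;
  }

  // Cut over only if the idle environment passes its health check.
  switchTraffic(healthCheck) {
    const candidate = this.idle;
    if (!healthCheck(this.envs[candidate])) return false;
    this.live = candidate;
    return true;
  }

  // Rollback is the same pointer flip, back to the previous environment.
  rollback() {
    this.live = this.idle;
  }

  // Report which environment and version would serve this request.
  handle(req) {
    return `${this.live}@${this.envs[this.live].version}`;
  }
}

const router = new BlueGreenRouter({ version: 'v1' }, { version: 'v1' });
router.deployToIdle('v2');
router.switchTraffic((env) => env.version === 'v2'); // stand-in health check
console.log(router.handle({})); // green@v2
router.rollback();
console.log(router.handle({})); // blue@v1
```

The instant rollback only holds while the old environment is still running and both versions can work against the same database schema.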

Interview tip: Mentioning chaos engineering and deployment strategies shows you think about operational reliability, not just building features.

💻 Code Example


// ============================================
// Reliability Engineering: Patterns
// ============================================

// ---------- Canary Deployment ----------
class CanaryDeployment {
  constructor(loadBalancer) {
    this.lb = loadBalancer; // unused in this sketch; a real rollout would update LB weights
    this.canaryPercentage = 5; // Start with 5%
    this.minCanarySamples = 1000; // canary requests required before each decision
    this.metrics = { canary: { errors: 0, requests: 0 }, stable: { errors: 0, requests: 0 } };
  }

  // Randomly send canaryPercentage% of requests to the canary version
  routeRequest(req) {
    const isCanary = Math.random() * 100 < this.canaryPercentage;
    return isCanary ? 'canary' : 'stable';
  }

  recordResult(version, success) {
    this.metrics[version].requests++;
    if (!success) this.metrics[version].errors++;

    // Nothing left to decide once the canary serves all traffic
    if (this.canaryPercentage >= 100) return;

    // Auto-promote or roll back once the canary has enough samples
    const { canary, stable } = this.metrics;
    if (canary.requests < this.minCanarySamples) return;

    const canaryErrorRate = canary.errors / canary.requests;
    // Guard against division by zero if stable has seen no traffic yet
    const stableErrorRate = stable.requests > 0 ? stable.errors / stable.requests : 0;

    if (canaryErrorRate > stableErrorRate * 2) {
      console.log('🚨 Canary error rate too high: ROLLING BACK');
      this.canaryPercentage = 0;
    } else {
      this.canaryPercentage = Math.min(100, this.canaryPercentage * 2);
      console.log(`✅ Canary healthy: increasing to ${this.canaryPercentage}%`);
    }

    // Start a fresh canary window so the check does not fire on every
    // later request. The stable baseline stays cumulative to reduce noise.
    this.metrics.canary = { errors: 0, requests: 0 };
  }
}

// ---------- Feature Flags ----------
class FeatureFlags {
  constructor() {
    this.flags = new Map();
  }

  setFlag(name, config) {
    this.flags.set(name, {
      enabled: config.enabled ?? false,
      percentage: config.percentage ?? 0,
      allowlist: config.allowlist ?? [],
      rules: config.rules ?? [], // stored but not evaluated in this sketch
    });
  }

  isEnabled(flagName, userId = null, context = {}) {
    const flag = this.flags.get(flagName);
    if (!flag) return false;
    if (!flag.enabled) return false;

    // Check allowlist (specific users)
    if (userId && flag.allowlist.includes(userId)) return true;

    // Check percentage rollout: the same user always lands in the same bucket
    if (flag.percentage > 0 && userId) {
      const hash = this.stableHash(`${userId}:${flagName}`);
      return (hash % 100) < flag.percentage;
    }

    // No percentage configured (or no user ID given): fall back to the on/off switch
    return flag.enabled;
  }

  // Simple deterministic 32-bit string hash (not cryptographic)
  stableHash(str) {
    let hash = 0;
    for (let i = 0; i < str.length; i++) {
      hash = ((hash << 5) - hash + str.charCodeAt(i)) | 0;
    }
    return Math.abs(hash);
  }
}

// Demo: feature flags
const ff = new FeatureFlags();
ff.setFlag('new_checkout', {
  enabled: true,
  percentage: 10, // 10% of users
  allowlist: ['user_internal1', 'user_beta'],
});

console.log('Internal user:', ff.isEnabled('new_checkout', 'user_internal1')); // true (allowlist)
console.log('Random user:', ff.isEnabled('new_checkout', 'user_random'));      // fixed per user; true for ~10% of user IDs

// Canary demo: both versions fail ~2% of requests, so the canary should
// usually be promoted step by step. Enough requests are simulated for
// several 1000-sample canary windows.
const canary = new CanaryDeployment(null);
for (let i = 0; i < 50000; i++) {
  const version = canary.routeRequest({});
  canary.recordResult(version, Math.random() > 0.02);
}
console.log('Final canary percentage:', canary.canaryPercentage);
console.log('Canary metrics:', canary.metrics);

🏋️ Practice Exercise

DR Plan: Design a disaster recovery plan for an e-commerce platform. Define RPO and RTO for each component: user data, orders, product catalog, search index, payment records.
Chaos Engineering: Design 5 chaos experiments for a microservices application. For each: what you inject, what you expect to happen, and how the system should recover.
Blue-Green Deployment: Design a blue-green deployment pipeline for a web application with a database migration. How do you handle backward-incompatible schema changes?
Post-Mortem: Write a post-mortem template for an outage. Include: timeline, root cause, impact, detection, resolution, and action items.

⚠️ Common Mistakes

No disaster recovery plan — hoping failures won't happen is not a strategy. Define RPO/RTO, practice failovers, and test backups regularly.
Deploying everything at once — big-bang deployments are the highest risk. Use canary or rolling deployments to catch issues before they affect all users.
Not testing backups — a backup that can't be restored is worthless. Regularly test restore procedures and measure actual RTO.
Feature flags without cleanup — accumulated stale feature flags add complexity and bugs. Remove flags after rollout is complete.
