Reliability Engineering & Disaster Recovery


📖 Concept

Site Reliability Engineering (SRE) focuses on building and maintaining reliable systems at scale. The key concepts covered here are redundancy, disaster recovery, chaos engineering, and safe deployment strategies.

Redundancy Levels

Level | Description | Protects Against
Server | Multiple app servers behind a load balancer | Single server failure
Zone | Replicas across availability zones | Datacenter failure
Region | Active in multiple geographic regions | Regional disaster
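
To make server-level redundancy concrete, here is a minimal sketch of a pool that hands out servers round-robin and skips any marked unhealthy. The class name `RedundantPool`, its methods, and the server names are hypothetical, and a real load balancer would run its own health checks rather than being told the status.

```javascript
// Hypothetical sketch: round-robin selection that skips unhealthy servers.
class RedundantPool {
  constructor(servers) {
    if (servers.length === 0) throw new Error('pool needs at least one server');
    this.servers = servers.map((name) => ({ name, healthy: true }));
    this.next = 0; // index of the server to try first on the next pick
  }

  // In a real system this would be driven by periodic health checks.
  markHealth(name, healthy) {
    const server = this.servers.find((s) => s.name === name);
    if (server) server.healthy = healthy;
  }

  // Returns the next healthy server name, or null if every server is down.
  pick() {
    for (let i = 0; i < this.servers.length; i++) {
      const index = (this.next + i) % this.servers.length;
      if (this.servers[index].healthy) {
        this.next = (index + 1) % this.servers.length;
        return this.servers[index].name;
      }
    }
    return null;
  }
}

const pool = new RedundantPool(['app-1', 'app-2', 'app-3']);
pool.markHealth('app-2', false); // simulate a single server failure
console.log([pool.pick(), pool.pick(), pool.pick()]); // [ 'app-1', 'app-3', 'app-1' ]
```

Traffic keeps flowing while one server is down. The same idea applies at the zone and region levels, only the unit that fails is larger.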

Disaster Recovery Strategies

Strategy | RPO | RTO | Cost
Backup & Restore | Hours | Hours | $
Pilot Light | Minutes | Minutes | $$
Warm Standby | Seconds | Seconds | $$$
Active-Active | Near zero | Near zero | $$$$

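RPO (Recovery Point Objective): How much data can you afford to lose? It is measured as a window of time before the failure, so an RPO of one hour means losing up to the last hour of writes is acceptable.
RTO (Recovery Time Objective): How long can you be down? It is the maximum acceptable time between the failure and service being restored.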
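
As a worked example of the two objectives, the sketch below compares one incident against its targets. The function `evaluateRecovery` and its field names are illustrative assumptions, and all times are plain minute offsets to keep the arithmetic visible.

```javascript
// Hypothetical helper: compare a measured incident against RPO/RTO targets.
// All values are in minutes; the field names are assumptions for illustration.
function evaluateRecovery({ lastBackupAt, failureAt, restoredAt }, { rpoMinutes, rtoMinutes }) {
  const dataLossMinutes = failureAt - lastBackupAt; // writes after the last backup are lost
  const downtimeMinutes = restoredAt - failureAt; // time until service is back
  return {
    dataLossMinutes,
    downtimeMinutes,
    meetsRpo: dataLossMinutes <= rpoMinutes,
    meetsRto: downtimeMinutes <= rtoMinutes,
  };
}

// Last backup at minute 0, failure at minute 45, service restored at minute 165.
const recovery = evaluateRecovery(
  { lastBackupAt: 0, failureAt: 45, restoredAt: 165 },
  { rpoMinutes: 60, rtoMinutes: 90 }
);
console.log(recovery);
// dataLossMinutes is 45 (within the 60-minute RPO);
// downtimeMinutes is 120 (exceeds the 90-minute RTO)
```

In this example the backup schedule is adequate but the restore procedure is too slow, which points toward a faster strategy such as pilot light rather than more frequent backups.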

Chaos Engineering

Intentionally inject failures to test system resilience:

  • Kill random servers (Netflix Chaos Monkey)
  • Simulate network failures between services
  • Inject latency into critical paths
  • Fill disk space, exhaust memory

Philosophy: If you're going to have failures in production (and you will), it's better to fail on your terms during business hours than at 3 AM.
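
The latency and failure experiments listed above can be sketched as a wrapper around any async call. This is a minimal illustration, not a chaos tool: `withChaos`, `getPriceWithFallback`, and the pricing service are hypothetical names, and `random` is injectable only so the experiment can be made deterministic.

```javascript
// Hypothetical sketch: wrap an async call with injected latency and failures.
function withChaos(fn, { failureRate = 0, latencyMs = 0, random = Math.random } = {}) {
  return async (...args) => {
    if (latencyMs > 0) {
      await new Promise((resolve) => setTimeout(resolve, latencyMs)); // injected latency
    }
    if (random() < failureRate) {
      throw new Error('chaos: injected failure');
    }
    return fn(...args);
  };
}

// The caller under test: it should degrade gracefully instead of crashing.
async function getPriceWithFallback(fetchPrice, sku) {
  try {
    return await fetchPrice(sku);
  } catch (err) {
    return { sku, price: null, degraded: true };
  }
}

// Stand-in for a real downstream service.
const fetchPrice = async (sku) => ({ sku, price: 42, degraded: false });

// Experiment: the pricing service always fails and is 10 ms slower.
const flakyFetchPrice = withChaos(fetchPrice, { failureRate: 1, latencyMs: 10 });
getPriceWithFallback(flakyFetchPrice, 'sku-1').then((result) => console.log(result));
// { sku: 'sku-1', price: null, degraded: true }
```

The useful part of the experiment is the hypothesis: here, that a pricing outage produces a degraded response rather than an error page.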

Deployment Strategies

Strategy | Description | Risk
Rolling | Replace instances gradually | Moderate (rollback possible)
Blue-Green | Switch traffic from old to new instantly | Low (instant rollback)
Canary | Route small % of traffic to new version | Very low (test with real traffic)
Feature Flags | Enable features per user/group | Lowest (toggle instantly)
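
A blue-green switch comes down to flipping a pointer between two environments, which is why rollback is instant. The sketch below assumes a hypothetical `BlueGreenRouter` class and a caller-supplied health check. In practice the pointer would be a load balancer target or DNS record.

```javascript
// Hypothetical sketch: a router that points at one of two environments.
class BlueGreenRouter {
  constructor() {
    this.environments = { blue: 'v1', green: null }; // environment -> deployed version
    this.live = 'blue';
  }

  get idle() {
    return this.live === 'blue' ? 'green' : 'blue';
  }

  // Deploy to the idle environment; live traffic is untouched.
  deploy(version) {
    this.environments[this.idle] = version;
  }

  // Switch traffic only if the idle environment passes its health check.
  switchTraffic(healthCheck) {
    const target = this.idle;
    if (!healthCheck(this.environments[target])) return false;
    this.live = target;
    return true;
  }

  // Rollback is the same pointer flip in reverse.
  rollback() {
    this.live = this.idle;
  }

  // Version currently serving traffic.
  route() {
    return this.environments[this.live];
  }
}

const router = new BlueGreenRouter();
router.deploy('v2');
router.switchTraffic((version) => version !== null);
console.log(router.route()); // v2
router.rollback();
console.log(router.route()); // v1
```

The old environment stays deployed after the switch, so rollback needs no redeploy. The cost is running two full environments during the release.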

Interview tip: Mentioning chaos engineering and deployment strategies shows you think about operational reliability, not just building features.

💻 Code Example

// ============================================
// Reliability Engineering: Patterns
// ============================================

// ---------- Canary Deployment ----------
class CanaryDeployment {
  constructor(loadBalancer) {
    this.lb = loadBalancer; // not used in this demo
    this.canaryPercentage = 5; // Start with 5%
    this.evaluationWindow = 1000; // canary requests needed before each decision
    this.resetMetrics();
  }

  resetMetrics() {
    this.metrics = { canary: { errors: 0, requests: 0 }, stable: { errors: 0, requests: 0 } };
  }

  routeRequest(req) {
    const isCanary = Math.random() * 100 < this.canaryPercentage;
    return isCanary ? 'canary' : 'stable';
  }

  recordResult(version, success) {
    this.metrics[version].requests++;
    if (!success) this.metrics[version].errors++;

    // Nothing left to decide once the canary is rolled back or fully promoted
    if (this.canaryPercentage === 0 || this.canaryPercentage === 100) return;
    // Auto-promote or rollback once per evaluation window
    if (this.metrics.canary.requests < this.evaluationWindow) return;

    const { canary, stable } = this.metrics;
    const canaryErrorRate = canary.errors / canary.requests;
    // Guard against dividing by zero if stable has seen no traffic
    const stableErrorRate = stable.requests > 0 ? stable.errors / stable.requests : 0;

    if (canaryErrorRate > stableErrorRate * 2) {
      console.log('🚨 Canary error rate too high: ROLLING BACK');
      this.canaryPercentage = 0;
    } else {
      this.canaryPercentage = Math.min(100, this.canaryPercentage * 2);
      console.log(`✅ Canary healthy: increasing to ${this.canaryPercentage}%`);
    }
    // Start a fresh window so each step is judged on new data
    this.resetMetrics();
  }
}

// ---------- Feature Flags ----------
class FeatureFlags {
  constructor() {
    this.flags = new Map();
  }

  setFlag(name, config) {
    this.flags.set(name, {
      enabled: config.enabled || false,
      percentage: config.percentage || 0, // 0 means no percentage rollout is configured
      allowlist: config.allowlist || [],
      rules: config.rules || [], // stored for context-based rules; not evaluated here
    });
  }

  isEnabled(flagName, userId = null, context = {}) {
    const flag = this.flags.get(flagName);
    if (!flag || !flag.enabled) return false;

    // Check allowlist (specific users)
    if (userId && flag.allowlist.includes(userId)) return true;

    // Check percentage rollout
    if (flag.percentage > 0) {
      // Anonymous users cannot be bucketed, so keep them out of a partial rollout
      if (!userId) return false;
      const hash = this.stableHash(`${userId}:${flagName}`);
      return (hash % 100) < flag.percentage;
    }

    // Enabled with no percentage limit: on for everyone
    return true;
  }

  // Simple 32-bit string hash: the same input always lands in the same bucket
  stableHash(str) {
    let hash = 0;
    for (let i = 0; i < str.length; i++) {
      hash = ((hash << 5) - hash + str.charCodeAt(i)) | 0;
    }
    return Math.abs(hash);
  }
}

// Demo: feature flags
const ff = new FeatureFlags();
ff.setFlag('new_checkout', {
  enabled: true,
  percentage: 10, // 10% of users
  allowlist: ['user_internal1', 'user_beta'],
});

console.log('Internal user:', ff.isEnabled('new_checkout', 'user_internal1')); // true (allowlist)
// Deterministic per user: about 10% of user IDs hash into the rollout
console.log('Random user:', ff.isEnabled('new_checkout', 'user_random'));

// Canary demo: at 5%, about 20,000 requests are needed to collect 1,000 canary samples
const canary = new CanaryDeployment(null);
for (let i = 0; i < 50000; i++) {
  const version = canary.routeRequest({});
  canary.recordResult(version, Math.random() > 0.02); // both versions fail about 2% of the time
}
console.log('Canary percentage:', canary.canaryPercentage);
console.log('Canary metrics (current window):', canary.metrics);

🏋️ Practice Exercise

  1. DR Plan: Design a disaster recovery plan for an e-commerce platform. Define RPO and RTO for each component: user data, orders, product catalog, search index, payment records.

  2. Chaos Engineering: Design 5 chaos experiments for a microservices application. For each: what you inject, what you expect to happen, and how the system should recover.

  3. Blue-Green Deployment: Design a blue-green deployment pipeline for a web application with a database migration. How do you handle backward-incompatible schema changes?

  4. Post-Mortem: Write a post-mortem template for an outage. Include: timeline, root cause, impact, detection, resolution, and action items.

⚠️ Common Mistakes

  • No disaster recovery plan — hoping failures won't happen is not a strategy. Define RPO/RTO, practice failovers, and test backups regularly.

  • Deploying everything at once — big-bang deployments are the highest risk. Use canary or rolling deployments to catch issues before they affect all users.

  • Not testing backups — a backup that can't be restored is worthless. Regularly test restore procedures and measure actual RTO.

  • Feature flags without cleanup — accumulated stale feature flags add complexity and bugs. Remove flags after rollout is complete.
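
One way to keep flag cleanup from slipping is to report flags that have sat fully on or fully off for a while. The sketch below is only an illustration: the flag shape (`percentage` as the share of users who see the feature, `lastChangedAt` as a timestamp) and the 30-day threshold are assumptions, not part of any particular flag service.

```javascript
// Hypothetical sketch: list flags that look safe to remove.
const DAY_MS = 24 * 60 * 60 * 1000;

function findStaleFlags(flags, now, maxAgeDays = 30) {
  return flags
    // Fully rolled out or fully off: the conditional in code no longer does anything
    .filter((flag) => flag.percentage === 0 || flag.percentage === 100)
    // Unchanged for longer than the threshold
    .filter((flag) => now - flag.lastChangedAt > maxAgeDays * DAY_MS)
    .map((flag) => flag.name);
}

const now = Date.UTC(2024, 5, 1); // 1 June 2024
const flags = [
  { name: 'new_checkout', percentage: 100, lastChangedAt: Date.UTC(2024, 0, 15) },
  { name: 'dark_mode', percentage: 10, lastChangedAt: Date.UTC(2024, 0, 15) },
  { name: 'search_v2', percentage: 100, lastChangedAt: Date.UTC(2024, 4, 25) },
];
console.log(findStaleFlags(flags, now)); // [ 'new_checkout' ]
```

Here `dark_mode` is still mid-rollout and `search_v2` changed a week ago, so only `new_checkout` is reported.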
