Reliability Engineering & Disaster Recovery
📖 Concept
Site Reliability Engineering (SRE) focuses on building and maintaining reliable systems at scale. Key concepts include redundancy, disaster recovery, chaos engineering, and safe deployment strategies.
Redundancy Levels
| Level | Description | Protects Against |
|---|---|---|
| Server | Multiple app servers behind LB | Single server failure |
| Zone | Replicate across availability zones | Datacenter failure |
| Region | Active in multiple geographic regions | Regional disaster |
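To make the server and zone rows concrete, here is a minimal sketch of a load balancer pool that skips unhealthy servers. The `RedundantPool` class, the server names, and the zone labels are hypothetical and exist only for illustration; a real load balancer would also run its own health checks.

```javascript
// Hypothetical sketch: round-robin selection that skips unhealthy servers.
class RedundantPool {
  constructor(servers) {
    // servers: array of { name, zone, healthy }
    this.servers = servers;
    this.next = 0; // index where the next round-robin scan starts
  }

  markDown(name) {
    this.setHealth(name, false);
  }

  markUp(name) {
    this.setHealth(name, true);
  }

  setHealth(name, healthy) {
    const server = this.servers.find((s) => s.name === name);
    if (server) server.healthy = healthy;
  }

  // Returns the next healthy server, or null if every server is down.
  pick() {
    for (let i = 0; i < this.servers.length; i++) {
      const index = (this.next + i) % this.servers.length;
      const candidate = this.servers[index];
      if (candidate.healthy) {
        this.next = (index + 1) % this.servers.length;
        return candidate;
      }
    }
    return null;
  }
}

const pool = new RedundantPool([
  { name: 'app-1', zone: 'zone-a', healthy: true },
  { name: 'app-2', zone: 'zone-a', healthy: true },
  { name: 'app-3', zone: 'zone-b', healthy: true },
]);

// Server-level failure: app-1 is skipped, traffic continues.
pool.markDown('app-1');
console.log(pool.pick().name); // app-2
console.log(pool.pick().name); // app-3

// Zone-level failure: all of zone-a is gone, zone-b still serves.
pool.markDown('app-2');
console.log(pool.pick().name); // app-3
```

The same idea applies one level up: a region-level setup replaces the server list with a list of regions and the health flag with a regional health signal.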
Disaster Recovery Strategies
| Strategy | RPO | RTO | Cost |
|---|---|---|---|
| Backup & Restore | Hours | Hours | $ |
| Pilot Light | Tens of minutes | Tens of minutes | $$ |
| Warm Standby | Minutes | Minutes | $$$ |
| Active-Active | Near zero | Near zero | $$$$ |
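RPO (Recovery Point Objective): How much data can you afford to lose, measured as the time window of writes that may be missing after recovery? RTO (Recovery Time Objective): How long can you be down before service is restored?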
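A short worked example may help with the two definitions. The sketch below checks a backup schedule against RPO and RTO targets; the function name `evaluateRecoveryPlan`, its parameters, and the numbers are assumptions chosen for illustration, and it models only simple periodic backups.

```javascript
// Hypothetical helper: all durations are in minutes.
function evaluateRecoveryPlan({ backupIntervalMin, restoreDurationMin, rpoMin, rtoMin }) {
  // Worst case: the failure happens just before the next backup runs,
  // so every write since the previous backup is lost.
  const worstCaseDataLossMin = backupIntervalMin;

  return {
    worstCaseDataLossMin,
    meetsRpo: worstCaseDataLossMin <= rpoMin,
    meetsRto: restoreDurationMin <= rtoMin,
  };
}

// Nightly backups, a 90-minute restore, targets of RPO = 1 hour and RTO = 2 hours.
const nightly = evaluateRecoveryPlan({
  backupIntervalMin: 24 * 60,
  restoreDurationMin: 90,
  rpoMin: 60,
  rtoMin: 120,
});

console.log(nightly);
// { worstCaseDataLossMin: 1440, meetsRpo: false, meetsRto: true }
```

Here the restore is fast enough for the RTO, but nightly backups can lose up to 24 hours of data, so the RPO of one hour is missed. Meeting it would need more frequent backups or continuous replication.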
Chaos Engineering
Intentionally inject failures to test system resilience:
- Kill random servers (Netflix Chaos Monkey)
- Simulate network failures between services
- Inject latency into critical paths
- Fill disk space, exhaust memory
Philosophy: If you're going to have failures in production (and you will), it's better to fail on your terms during business hours than at 3 AM.
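The injection ideas above can be tried at small scale inside a single process. The following is a minimal sketch, not a substitute for a tool like Chaos Monkey: `withChaos`, `callWithRetry`, and `getPrice` are hypothetical names, and the `random` option is an assumption added so that experiments can be made repeatable.

```javascript
// Hypothetical sketch: wrap an async function so that calls are delayed
// and a fraction of them fail with an injected error.
function withChaos(fn, { failureRate = 0, latencyMs = 0, random = Math.random } = {}) {
  return async function chaotic(...args) {
    if (latencyMs > 0) {
      // Inject latency into the call path
      await new Promise((resolve) => setTimeout(resolve, latencyMs));
    }
    if (random() < failureRate) {
      // Inject a failure instead of calling the real function
      throw new Error('chaos: injected failure');
    }
    return fn(...args);
  };
}

// The behavior under test: retry a few times before giving up.
async function callWithRetry(fn, attempts = 3) {
  let lastError;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}

// Experiment: does the caller survive a dependency that fails 30% of the time?
async function getPrice() {
  return 42;
}

const flakyGetPrice = withChaos(getPrice, { failureRate: 0.3, latencyMs: 10 });

callWithRetry(flakyGetPrice)
  .then((price) => console.log('price:', price))
  .catch((err) => console.log('gave up:', err.message));
```

Each experiment should state the expected outcome up front (here: the retry absorbs most injected failures) so the result either confirms the hypothesis or exposes a weakness.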
Deployment Strategies
| Strategy | Description | Risk |
|---|---|---|
| Rolling | Replace instances gradually | Moderate (rollback possible) |
| Blue-Green | Switch traffic from old to new instantly | Low (instant rollback) |
| Canary | Route small % of traffic to new version | Very low (test with real traffic) |
| Feature Flags | Enable features per user/group | Lowest (toggle instantly) |
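The blue-green row can be sketched as a pointer flip between two environments. This is an illustrative model only: `BlueGreenRouter` and its methods are hypothetical names, and a real setup would flip a load balancer target or DNS record rather than a field in memory.

```javascript
// Hypothetical sketch: two environments, one of which receives live traffic.
class BlueGreenRouter {
  constructor(initialVersion) {
    this.environments = { blue: initialVersion, green: null };
    this.live = 'blue';
  }

  // The environment that is NOT serving traffic right now.
  get idle() {
    return this.live === 'blue' ? 'green' : 'blue';
  }

  // Deploy the new version next to the live one without touching traffic.
  deployToIdle(version) {
    this.environments[this.idle] = version;
    return this.idle;
  }

  // Flip traffic only if the idle environment passes its health check.
  switchTraffic(healthCheck) {
    const target = this.idle;
    if (!healthCheck(target, this.environments[target])) return false;
    this.live = target;
    return true;
  }

  // Rollback is the same flip in reverse: the old environment is still running.
  rollback() {
    this.live = this.idle;
  }

  liveVersion() {
    return this.environments[this.live];
  }
}

const router = new BlueGreenRouter('v1');
router.deployToIdle('v2');
router.switchTraffic((env, version) => version !== null);
console.log(router.live, router.liveVersion()); // green v2

router.rollback();
console.log(router.live, router.liveVersion()); // blue v1
```

The rollback is instant because the previous version is never torn down during the switch, which is also why blue-green roughly doubles infrastructure cost while both environments exist.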
Interview tip: Mentioning chaos engineering and deployment strategies shows you think about operational reliability, not just building features.
💻 Code Example
```javascript
// ============================================
// Reliability Engineering: Patterns
// ============================================

// ---------- Canary Deployment ----------
class CanaryDeployment {
  constructor(loadBalancer) {
    this.lb = loadBalancer;
    this.canaryPercentage = 5; // Start with 5%
    this.resetMetrics();
  }

  // Start a fresh measurement window
  resetMetrics() {
    this.metrics = {
      canary: { errors: 0, requests: 0 },
      stable: { errors: 0, requests: 0 },
    };
  }

  routeRequest(req) {
    const isCanary = Math.random() * 100 < this.canaryPercentage;
    return isCanary ? 'canary' : 'stable';
  }

  recordResult(version, success) {
    this.metrics[version].requests++;
    if (!success) this.metrics[version].errors++;

    // Decide only once the canary has enough traffic and there is a baseline
    const { canary, stable } = this.metrics;
    if (canary.requests < 1000 || stable.requests === 0) return;

    const canaryErrorRate = canary.errors / canary.requests;
    const stableErrorRate = stable.errors / stable.requests;

    // Auto-promote or roll back
    if (canaryErrorRate > stableErrorRate * 2) {
      console.log('🚨 Canary error rate too high: ROLLING BACK');
      this.canaryPercentage = 0;
    } else {
      this.canaryPercentage = Math.min(100, this.canaryPercentage * 2);
      console.log(`✅ Canary healthy: increasing to ${this.canaryPercentage}%`);
    }

    // Reset the window so the next decision is not re-triggered on every request
    this.resetMetrics();
  }
}

// ---------- Feature Flags ----------
class FeatureFlags {
  constructor() {
    this.flags = new Map();
  }

  setFlag(name, config) {
    this.flags.set(name, {
      enabled: config.enabled || false,
      percentage: config.percentage || 0,
      allowlist: config.allowlist || [],
      rules: config.rules || [],
    });
  }

  isEnabled(flagName, userId = null, context = {}) {
    const flag = this.flags.get(flagName);
    if (!flag) return false;
    if (!flag.enabled) return false;

    // Check allowlist (specific users)
    if (userId && flag.allowlist.includes(userId)) return true;

    // Check percentage rollout
    if (flag.percentage > 0) {
      // Without a user ID there is nothing to bucket on, so stay off
      if (!userId) return false;
      const hash = this.stableHash(userId + flagName);
      return (hash % 100) < flag.percentage;
    }

    // No percentage configured: an enabled flag is on for everyone
    return true;
  }

  // Deterministic string hash, so a given user always lands in the same bucket
  stableHash(str) {
    let hash = 0;
    for (let i = 0; i < str.length; i++) {
      hash = ((hash << 5) - hash + str.charCodeAt(i)) | 0;
    }
    return Math.abs(hash);
  }
}

// Demo: feature flags
const ff = new FeatureFlags();
ff.setFlag('new_checkout', {
  enabled: true,
  percentage: 10, // 10% of users
  allowlist: ['user_internal1', 'user_beta'],
});

console.log('Internal user:', ff.isEnabled('new_checkout', 'user_internal1')); // true (allowlist)
// Deterministic per user: about 10% of user IDs hash into the rollout
console.log('Random user:', ff.isEnabled('new_checkout', 'user_random'));

// Canary demo: at 5% canary traffic, roughly 20,000 requests are needed
// before the canary reaches the 1,000-request evaluation threshold
const canary = new CanaryDeployment(null);
for (let i = 0; i < 25000; i++) {
  const version = canary.routeRequest({});
  canary.recordResult(version, Math.random() > 0.02);
}
console.log('Canary percentage:', canary.canaryPercentage);
console.log('Canary metrics (current window):', canary.metrics);
```
🏋️ Practice Exercise
- DR Plan: Design a disaster recovery plan for an e-commerce platform. Define RPO and RTO for each component: user data, orders, product catalog, search index, payment records.
- Chaos Engineering: Design 5 chaos experiments for a microservices application. For each: what you inject, what you expect to happen, and how the system should recover.
- Blue-Green Deployment: Design a blue-green deployment pipeline for a web application with a database migration. How do you handle backward-incompatible schema changes?
- Post-Mortem: Write a post-mortem template for an outage. Include: timeline, root cause, impact, detection, resolution, and action items.
⚠️ Common Mistakes
- No disaster recovery plan: hoping failures won't happen is not a strategy. Define RPO/RTO, practice failovers, and test backups regularly.
- Deploying everything at once: big-bang deployments are the highest risk. Use canary or rolling deployments to catch issues before they affect all users.
- Not testing backups: a backup that can't be restored is worthless. Regularly test restore procedures and measure actual RTO.
- Feature flags without cleanup: accumulated stale feature flags add complexity and bugs. Remove flags after rollout is complete.
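Flag cleanup is easier when stale flags are found automatically. The sketch below is one possible approach, not a feature of any particular flag service: the `findStaleFlags` function, the `rolloutPercent` and `lastChangedAt` fields, and the 30-day threshold are all assumptions for illustration.

```javascript
// Hypothetical sketch: a flag is "stale" when it has been fully rolled out
// and nobody has changed it for longer than maxAgeDays.
const DAY_MS = 24 * 60 * 60 * 1000;

function findStaleFlags(flagRecords, nowMs, maxAgeDays = 30) {
  const maxAgeMs = maxAgeDays * DAY_MS;
  return flagRecords
    .filter((flag) => flag.rolloutPercent === 100)
    .filter((flag) => nowMs - flag.lastChangedAt > maxAgeMs)
    .map((flag) => flag.name);
}

const auditTime = Date.UTC(2024, 5, 1);
const flagRecords = [
  { name: 'new_checkout', rolloutPercent: 100, lastChangedAt: auditTime - 90 * DAY_MS },
  { name: 'dark_mode', rolloutPercent: 100, lastChangedAt: auditTime - 5 * DAY_MS },
  { name: 'search_v2', rolloutPercent: 25, lastChangedAt: auditTime - 120 * DAY_MS },
];

console.log(findStaleFlags(flagRecords, auditTime)); // [ 'new_checkout' ]
```

Only `new_checkout` is reported: `dark_mode` finished rolling out too recently, and `search_v2` is still a partial rollout, which may need a decision rather than a cleanup.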