Auto-Scaling & Capacity Planning
📖 Concept
Auto-scaling automatically adjusts the number of server instances to match real-time demand. It's essential for handling traffic that varies by hour, by day of week, or during special events.
Scaling Triggers
| Metric | Scale Up When | Scale Down When |
|---|---|---|
| CPU | > 70% average across instances | < 30% average |
| Memory | > 80% utilization | < 40% utilization |
| Request count | > 1000 req/sec per instance | < 200 req/sec |
| Queue depth | > 1000 pending messages | < 100 pending |
| Custom metric | p99 latency > 500ms | p99 latency < 100ms |
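The trigger table above can be expressed as a declarative config that an auto-scaler evaluates on each tick. This is a minimal sketch: the metric names (`avgCPU`, `queueDepth`, etc.) are illustrative, not a specific cloud provider's API. Note the asymmetry: scale up if any trigger fires, but scale down only if all metrics are low.

```javascript
// Illustrative trigger config mirroring the table above.
// Metric names are assumptions, not a real cloud API.
const triggers = [
  { metric: 'avgCPU',               scaleUpAbove: 70,   scaleDownBelow: 30 },
  { metric: 'memoryPct',            scaleUpAbove: 80,   scaleDownBelow: 40 },
  { metric: 'reqPerSecPerInstance', scaleUpAbove: 1000, scaleDownBelow: 200 },
  { metric: 'queueDepth',           scaleUpAbove: 1000, scaleDownBelow: 100 },
  { metric: 'p99LatencyMs',         scaleUpAbove: 500,  scaleDownBelow: 100 },
];

// Scale up if ANY trigger fires; scale down only if ALL metrics are low.
function decide(metrics) {
  const up = triggers.some(t => metrics[t.metric] > t.scaleUpAbove);
  const down = triggers.every(t => metrics[t.metric] < t.scaleDownBelow);
  return up ? 'scale-up' : down ? 'scale-down' : 'hold';
}

console.log(decide({ avgCPU: 85, memoryPct: 50, reqPerSecPerInstance: 600,
                     queueDepth: 200, p99LatencyMs: 300 })); // scale-up
```

The any/all asymmetry is deliberate: one hot metric is enough evidence of overload, but removing capacity is only safe when every signal agrees the system is idle.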
Scaling Policies
Reactive Scaling
Scale based on current metrics. Simple, but always slightly behind demand: new capacity arrives only after a threshold has already been breached.
Predictive Scaling
Use historical patterns to scale BEFORE traffic arrives. E.g., scale up at 8 AM every weekday because traffic increases predictably.
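A simple form of predictive scaling averages traffic from the same hour-of-week across past weeks and pre-provisions accordingly. The sketch below assumes a history keyed by hour-of-week (0-167) and a per-instance capacity of 5K QPS, matching the capacity-planning example later in this section; the helper names are illustrative.

```javascript
// Sketch of predictive scaling: average the same hour-of-week over
// past weeks, add a safety margin, and pre-scale before the peak.
const QPS_PER_INSTANCE = 5000;

function predictInstances(history, hourOfWeek) {
  // history: Map of hourOfWeek (0-167, Sunday 00:00 = 0) -> observed QPS per week
  const samples = history.get(hourOfWeek) || [];
  if (samples.length === 0) return null; // no data: fall back to reactive scaling
  const avgQPS = samples.reduce((a, b) => a + b, 0) / samples.length;
  return Math.ceil((avgQPS * 1.3) / QPS_PER_INSTANCE); // 30% safety margin
}

// Four Mondays of 8 AM traffic (hour-of-week 32) averaging 100K QPS
const history = new Map([[32, [90000, 110000, 100000, 100000]]]);
console.log(predictInstances(history, 32)); // 26
```

In practice the prediction would run 10-15 minutes ahead of the target hour, so instances finish warming up before the traffic actually arrives.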
Scheduled Scaling
Pre-configured scaling for known events: Black Friday, product launches, marketing campaigns.
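Scheduled scaling is typically a list of time windows, each raising the instance floor. A minimal sketch (the event dates and instance counts are made up for illustration):

```javascript
// Sketch of scheduled scaling: pre-configured windows for known events.
// Dates and instance counts are illustrative.
const schedules = [
  { name: 'Black Friday',   start: '2024-11-29T00:00:00Z', end: '2024-11-30T00:00:00Z', minInstances: 50 },
  { name: 'Product launch', start: '2024-12-10T09:00:00Z', end: '2024-12-10T18:00:00Z', minInstances: 30 },
];

// Returns the instance floor in effect at a given time.
function scheduledMin(now, defaultMin = 2) {
  const active = schedules.filter(s => now >= new Date(s.start) && now < new Date(s.end));
  return Math.max(defaultMin, ...active.map(s => s.minInstances));
}

console.log(scheduledMin(new Date('2024-11-29T12:00:00Z'))); // 50
console.log(scheduledMin(new Date('2024-07-01T12:00:00Z'))); // 2
```

Raising the floor (rather than pinning an exact count) lets reactive scaling still add capacity on top if the event exceeds expectations.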
Capacity Planning Formula
Required instances = Peak QPS / QPS per instance × (1 + safety margin)
Example: Peak 100K QPS, each server handles 5K → 100K/5K × 1.3 = 26 instances
Key Concepts
- Cooldown period: After scaling, wait N minutes before scaling again (prevent thrashing)
- Min/Max instances: Set floor (minimum for availability) and ceiling (cost control)
- Warm-up time: New instances take time to become ready (JVM startup, cache warming, connection establishment)
- Graceful shutdown: Drain connections before terminating instances
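The graceful-shutdown step above can be sketched as: stop accepting new work, then wait for in-flight requests to drain before exiting. This assumes the server tracks its active request count; the 30-second grace period is a typical choice, not a standard.

```javascript
// Sketch of graceful shutdown: stop accepting new work, then drain
// in-flight requests (up to a grace period) before terminating.
// getActiveRequests is assumed to be maintained by the request handlers.
async function gracefulShutdown(server, getActiveRequests, graceMs = 30000) {
  server.accepting = false; // load balancer health checks now fail, so no new traffic
  const deadline = Date.now() + graceMs;
  while (getActiveRequests() > 0 && Date.now() < deadline) {
    await new Promise(r => setTimeout(r, 100)); // poll until drained
  }
  return getActiveRequests() === 0; // true if fully drained within the grace period
}
```

The same sequence applies whether the instance is being removed by a scale-down or replaced during a deployment.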
Interview tip: Mention auto-scaling in any design where traffic is variable. It shows you understand operational efficiency and cost optimization.
💻 Code Example
```javascript
// ============================================
// Auto-Scaling — Policy Implementation
// ============================================

class AutoScaler {
  constructor(config) {
    this.minInstances = config.min || 2;
    this.maxInstances = config.max || 20;
    this.currentInstances = this.minInstances;
    this.cooldownMs = config.cooldownMs || 300000; // 5 minutes
    this.lastScaleTime = 0;
    this.metrics = [];
  }

  evaluate(currentMetrics) {
    const now = Date.now();
    if (now - this.lastScaleTime < this.cooldownMs) {
      console.log('⏳ In cooldown period, skipping evaluation');
      return;
    }

    const avgCPU = currentMetrics.avgCPU;
    const avgLatency = currentMetrics.p99Latency;
    const qps = currentMetrics.requestsPerSecond;

    // Scale UP conditions
    if (avgCPU > 70 || avgLatency > 500 || qps > 5000 * this.currentInstances) {
      const newCount = Math.min(
        this.maxInstances,
        Math.ceil(this.currentInstances * 1.5) // Scale up 50% at a time
      );
      if (newCount > this.currentInstances) {
        this.scaleUp(newCount);
      }
    }

    // Scale DOWN conditions
    if (avgCPU < 30 && avgLatency < 100 && qps < 2000 * this.currentInstances) {
      const newCount = Math.max(
        this.minInstances,
        Math.floor(this.currentInstances * 0.75) // Scale down 25% at a time
      );
      if (newCount < this.currentInstances) {
        this.scaleDown(newCount);
      }
    }
  }

  scaleUp(targetCount) {
    const toAdd = targetCount - this.currentInstances;
    console.log(`📈 Scaling UP: ${this.currentInstances} → ${targetCount} (+${toAdd})`);
    this.currentInstances = targetCount;
    this.lastScaleTime = Date.now();
  }

  scaleDown(targetCount) {
    const toRemove = this.currentInstances - targetCount;
    console.log(`📉 Scaling DOWN: ${this.currentInstances} → ${targetCount} (-${toRemove})`);
    // Drain connections before removing instances
    console.log(`   Draining ${toRemove} instances (30s grace period)`);
    this.currentInstances = targetCount;
    this.lastScaleTime = Date.now();
  }
}

// ---------- Capacity Planning Calculator ----------
function calculateCapacity(requirements) {
  const { peakQPS, qpsPerInstance, safetyMargin = 0.3 } = requirements;
  const baseInstances = Math.ceil(peakQPS / qpsPerInstance);
  const withSafety = Math.ceil(baseInstances * (1 + safetyMargin));

  return {
    baseInstances,
    withSafetyMargin: withSafety,
    totalCost: withSafety * requirements.instanceCostPerHour,
    note: `${peakQPS} QPS / ${qpsPerInstance} per instance × ${1 + safetyMargin} safety = ${withSafety}`,
  };
}

// Demo
const scaler = new AutoScaler({ min: 2, max: 20 });
scaler.evaluate({ avgCPU: 80, p99Latency: 600, requestsPerSecond: 15000 });
scaler.lastScaleTime = 0; // Reset cooldown for demo
scaler.evaluate({ avgCPU: 20, p99Latency: 50, requestsPerSecond: 3000 });

console.log('Capacity plan:', calculateCapacity({
  peakQPS: 100000,
  qpsPerInstance: 5000,
  safetyMargin: 0.3,
  instanceCostPerHour: 0.10,
}));
```
🏋️ Practice Exercise
Auto-Scaling Policy: Design auto-scaling policies for: (a) an API server (CPU-based), (b) a Kafka consumer (queue-depth-based), (c) a WebSocket server (connection-count-based).
Capacity Planning: Your service handles 10K req/sec normally and 50K during peak (3 hours/day). Each instance handles 2K req/sec and costs $0.10/hour. Calculate: minimum instances, peak instances, and daily cost with auto-scaling vs fixed capacity.
Predictive Scaling: Design a predictive scaling system that learns from the last 4 weeks of traffic patterns and pre-scales before predicted peaks.
Thundering Herd: After a deployment, all instances restart simultaneously with cold caches. Design a rolling deployment strategy that prevents this.
⚠️ Common Mistakes
Not setting a cooldown period — without cooldown, the auto-scaler can thrash between scaling up and down every minute, wasting resources on instance startup/shutdown.
Scaling based on a single metric — CPU might be low while latency is high (e.g., due to I/O wait). Use multiple metrics and scale up when any one of them crosses its threshold.
Not accounting for startup time — new instances need 30-120 seconds to start, warm caches, and become ready. Scale proactively, not reactively.
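One concrete way to account for startup time is to count instances that are still warming up as capacity already on the way, so the scaler doesn't fire a second redundant scale-up while waiting. A minimal sketch (the 5K QPS per instance and 70% utilization target are illustrative assumptions):

```javascript
// Sketch: count pending (still-warming) instances as capacity on the way,
// so a warm-up window doesn't trigger a redundant second scale-up.
function effectiveCapacity(running, pending, qpsPerInstance = 5000) {
  return (running + pending) * qpsPerInstance;
}

function needsScaleUp(currentQPS, running, pending) {
  // Compare demand against running + pending capacity, not running alone.
  return currentQPS > 0.7 * effectiveCapacity(running, pending);
}

console.log(needsScaleUp(40000, 10, 0)); // true:  40000 > 0.7 × 50000
console.log(needsScaleUp(40000, 10, 3)); // false: 40000 < 0.7 × 65000
```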
Setting max instances too low — during a viral event, traffic might be 10x normal. If your max is 5x, the system crashes. Set generous maximums with cost alerts.
💼 Interview Questions
🎤 Mock Interview
Practice a live interview for Auto-Scaling & Capacity Planning