Auto-Scaling & Capacity Planning
📖 Concept
Auto-scaling automatically adjusts the number of server instances to match real-time demand. It's essential for handling traffic that varies by hour, by day of week, or during special events.
Scaling Triggers
| Metric | Scale Up When | Scale Down When |
|---|---|---|
| CPU | > 70% average across instances | < 30% average |
| Memory | > 80% utilization | < 40% utilization |
| Request count | > 1000 req/sec per instance | < 200 req/sec |
| Queue depth | > 1000 pending messages | < 100 pending |
| Custom metric | p99 latency > 500ms | p99 latency < 100ms |
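The trigger table above can be expressed as a declarative config that an auto-scaler evaluates on each tick. This is a minimal sketch: the metric names (`avgCPU`, `queueDepth`, etc.) are illustrative, not a specific cloud provider's API. Note the asymmetry: scale up if any trigger fires, but scale down only if all metrics are low.

```javascript
// Illustrative trigger config mirroring the table above.
// Metric names are assumptions, not a real cloud API.
const triggers = [
  { metric: 'avgCPU',               scaleUpAbove: 70,   scaleDownBelow: 30 },
  { metric: 'memoryPct',            scaleUpAbove: 80,   scaleDownBelow: 40 },
  { metric: 'reqPerSecPerInstance', scaleUpAbove: 1000, scaleDownBelow: 200 },
  { metric: 'queueDepth',           scaleUpAbove: 1000, scaleDownBelow: 100 },
  { metric: 'p99LatencyMs',         scaleUpAbove: 500,  scaleDownBelow: 100 },
];

// Scale up if ANY trigger fires; scale down only if ALL metrics are low.
function decide(metrics) {
  const up = triggers.some(t => metrics[t.metric] > t.scaleUpAbove);
  const down = triggers.every(t => metrics[t.metric] < t.scaleDownBelow);
  return up ? 'scale-up' : down ? 'scale-down' : 'hold';
}

console.log(decide({ avgCPU: 85, memoryPct: 50, reqPerSecPerInstance: 600,
                     queueDepth: 200, p99LatencyMs: 300 })); // scale-up
```

The any/all asymmetry is deliberate: one hot metric is enough evidence of overload, but removing capacity is only safe when every signal agrees the system is idle.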
Scaling Policies
Reactive Scaling
Scale based on current metrics. Simple, but always slightly behind demand: new capacity arrives only after a threshold has already been breached.
Predictive Scaling
Use historical patterns to scale BEFORE traffic arrives. E.g., scale up at 8 AM every weekday because traffic increases predictably.
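A simple form of predictive scaling averages traffic from the same hour-of-week across past weeks and pre-provisions accordingly. The sketch below assumes a history keyed by hour-of-week (0-167) and a per-instance capacity of 5K QPS, matching the capacity-planning example later in this section; the helper names are illustrative.

```javascript
// Sketch of predictive scaling: average the same hour-of-week over
// past weeks, add a safety margin, and pre-scale before the peak.
const QPS_PER_INSTANCE = 5000;

function predictInstances(history, hourOfWeek) {
  // history: Map of hourOfWeek (0-167, Sunday 00:00 = 0) -> observed QPS per week
  const samples = history.get(hourOfWeek) || [];
  if (samples.length === 0) return null; // no data: fall back to reactive scaling
  const avgQPS = samples.reduce((a, b) => a + b, 0) / samples.length;
  return Math.ceil((avgQPS * 1.3) / QPS_PER_INSTANCE); // 30% safety margin
}

// Four Mondays of 8 AM traffic (hour-of-week 32) averaging 100K QPS
const history = new Map([[32, [90000, 110000, 100000, 100000]]]);
console.log(predictInstances(history, 32)); // 26
```

In practice the prediction would run 10-15 minutes ahead of the target hour, so instances finish warming up before the traffic actually arrives.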
Scheduled Scaling
Pre-configured scaling for known events: Black Friday, product launches, marketing campaigns.
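Scheduled scaling is typically a list of time windows, each raising the instance floor. A minimal sketch (the event dates and instance counts are made up for illustration):

```javascript
// Sketch of scheduled scaling: pre-configured windows for known events.
// Dates and instance counts are illustrative.
const schedules = [
  { name: 'Black Friday',   start: '2024-11-29T00:00:00Z', end: '2024-11-30T00:00:00Z', minInstances: 50 },
  { name: 'Product launch', start: '2024-12-10T09:00:00Z', end: '2024-12-10T18:00:00Z', minInstances: 30 },
];

// Returns the instance floor in effect at a given time.
function scheduledMin(now, defaultMin = 2) {
  const active = schedules.filter(s => now >= new Date(s.start) && now < new Date(s.end));
  return Math.max(defaultMin, ...active.map(s => s.minInstances));
}

console.log(scheduledMin(new Date('2024-11-29T12:00:00Z'))); // 50
console.log(scheduledMin(new Date('2024-07-01T12:00:00Z'))); // 2
```

Raising the floor (rather than pinning an exact count) lets reactive scaling still add capacity on top if the event exceeds expectations.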
Capacity Planning Formula
Required instances = Peak QPS / QPS per instance × (1 + safety margin)
Example: Peak 100K QPS, each server handles 5K → 100K/5K × 1.3 = 26 instances
Key Concepts
- Cooldown period: After scaling, wait N minutes before scaling again (prevent thrashing)
- Min/Max instances: Set floor (minimum for availability) and ceiling (cost control)
- Warm-up time: New instances take time to become ready (JVM startup, cache warming, connection establishment)
- Graceful shutdown: Drain connections before terminating instances
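The graceful-shutdown step above can be sketched as: stop accepting new work, then wait for in-flight requests to drain before exiting. This assumes the server tracks its active request count; the 30-second grace period is a typical choice, not a standard.

```javascript
// Sketch of graceful shutdown: stop accepting new work, then drain
// in-flight requests (up to a grace period) before terminating.
// getActiveRequests is assumed to be maintained by the request handlers.
async function gracefulShutdown(server, getActiveRequests, graceMs = 30000) {
  server.accepting = false; // load balancer health checks now fail, so no new traffic
  const deadline = Date.now() + graceMs;
  while (getActiveRequests() > 0 && Date.now() < deadline) {
    await new Promise(r => setTimeout(r, 100)); // poll until drained
  }
  return getActiveRequests() === 0; // true if fully drained within the grace period
}
```

The same sequence applies whether the instance is being removed by a scale-down or replaced during a deployment.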
Interview tip: Mention auto-scaling in any design where traffic is variable. It shows you understand operational efficiency and cost optimization.
💻 Code Example
```javascript
// ============================================
// Auto-Scaling — Policy Implementation
// ============================================

class AutoScaler {
  constructor(config) {
    this.minInstances = config.min || 2;
    this.maxInstances = config.max || 20;
    this.currentInstances = this.minInstances;
    this.cooldownMs = config.cooldownMs || 300000; // 5 minutes
    this.lastScaleTime = 0;
    this.metrics = [];
  }

  evaluate(currentMetrics) {
    const now = Date.now();
    if (now - this.lastScaleTime < this.cooldownMs) {
      console.log('⏳ In cooldown period, skipping evaluation');
      return;
    }

    const avgCPU = currentMetrics.avgCPU;
    const avgLatency = currentMetrics.p99Latency;
    const qps = currentMetrics.requestsPerSecond;

    // Scale UP conditions
    if (avgCPU > 70 || avgLatency > 500 || qps > 5000 * this.currentInstances) {
      const newCount = Math.min(
        this.maxInstances,
        Math.ceil(this.currentInstances * 1.5) // Scale up 50% at a time
      );
      if (newCount > this.currentInstances) {
        this.scaleUp(newCount);
      }
    }

    // Scale DOWN conditions
    if (avgCPU < 30 && avgLatency < 100 && qps < 2000 * this.currentInstances) {
      const newCount = Math.max(
        this.minInstances,
        Math.floor(this.currentInstances * 0.75) // Scale down 25% at a time
      );
      if (newCount < this.currentInstances) {
        this.scaleDown(newCount);
      }
    }
  }

  scaleUp(targetCount) {
    const toAdd = targetCount - this.currentInstances;
    console.log(`📈 Scaling UP: ${this.currentInstances} → ${targetCount} (+${toAdd})`);
    this.currentInstances = targetCount;
    this.lastScaleTime = Date.now();
  }

  scaleDown(targetCount) {
    const toRemove = this.currentInstances - targetCount;
    console.log(`📉 Scaling DOWN: ${this.currentInstances} → ${targetCount} (-${toRemove})`);
    // Drain connections before removing instances
    console.log(`   Draining ${toRemove} instances (30s grace period)`);
    this.currentInstances = targetCount;
    this.lastScaleTime = Date.now();
  }
}

// ---------- Capacity Planning Calculator ----------
function calculateCapacity(requirements) {
  const { peakQPS, qpsPerInstance, safetyMargin = 0.3 } = requirements;
  const baseInstances = Math.ceil(peakQPS / qpsPerInstance);
  const withSafety = Math.ceil(baseInstances * (1 + safetyMargin));

  return {
    baseInstances,
    withSafetyMargin: withSafety,
    totalCost: withSafety * requirements.instanceCostPerHour,
    note: `${peakQPS} QPS / ${qpsPerInstance} per instance × ${1 + safetyMargin} safety = ${withSafety}`,
  };
}

// Demo
const scaler = new AutoScaler({ min: 2, max: 20 });
scaler.evaluate({ avgCPU: 80, p99Latency: 600, requestsPerSecond: 15000 });
scaler.lastScaleTime = 0; // Reset cooldown for demo
scaler.evaluate({ avgCPU: 20, p99Latency: 50, requestsPerSecond: 3000 });

console.log('Capacity plan:', calculateCapacity({
  peakQPS: 100000,
  qpsPerInstance: 5000,
  safetyMargin: 0.3,
  instanceCostPerHour: 0.10,
}));
```
🏋️ Practice Exercise
Auto-Scaling Policy: Design auto-scaling policies for: (a) an API server (CPU-based), (b) a Kafka consumer (queue-depth-based), (c) a WebSocket server (connection-count-based).
Capacity Planning: Your service handles 10K req/sec normally and 50K during peak (3 hours/day). Each instance handles 2K req/sec and costs $0.10/hour. Calculate: minimum instances, peak instances, and daily cost with auto-scaling vs fixed capacity.
Predictive Scaling: Design a predictive scaling system that learns from the last 4 weeks of traffic patterns and pre-scales before predicted peaks.
Thundering Herd: After a deployment, all instances restart simultaneously with cold caches. Design a rolling deployment strategy that prevents this.
⚠️ Common Mistakes
Not setting a cooldown period — without cooldown, the auto-scaler can thrash between scaling up and down every minute, wasting resources on instance startup/shutdown.
Scaling based on a single metric — CPU might be low while latency is high (e.g., due to I/O wait). Use multiple metrics and scale up when any one of them crosses its threshold.
Not accounting for startup time — new instances need 30-120 seconds to start, warm caches, and become ready. Scale proactively, not reactively.
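One concrete way to account for startup time is to count instances that are still warming up as capacity already on the way, so the scaler doesn't fire a second redundant scale-up while waiting. A minimal sketch (the 5K QPS per instance and 70% utilization target are illustrative assumptions):

```javascript
// Sketch: count pending (still-warming) instances as capacity on the way,
// so a warm-up window doesn't trigger a redundant second scale-up.
function effectiveCapacity(running, pending, qpsPerInstance = 5000) {
  return (running + pending) * qpsPerInstance;
}

function needsScaleUp(currentQPS, running, pending) {
  // Compare demand against running + pending capacity, not running alone.
  return currentQPS > 0.7 * effectiveCapacity(running, pending);
}

console.log(needsScaleUp(40000, 10, 0)); // true:  40000 > 0.7 × 50000
console.log(needsScaleUp(40000, 10, 3)); // false: 40000 < 0.7 × 65000
```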
Setting max instances too low — during a viral event, traffic might be 10x normal. If your max is 5x, the system crashes. Set generous maximums with cost alerts.
💼 Interview Questions
🎤 Mock Interview
Practice a live interview for Auto-Scaling & Capacity Planning