Three Pillars of Observability


📖 Concept

Observability is the ability to understand a system's internal state by examining its outputs. The three pillars are: Logs, Metrics, and Traces.

Logs

  • Discrete events with timestamps and context
  • Prefer structured logging (JSON) over unstructured text — structured entries are machine-parseable and queryable
  • Use log levels: DEBUG, INFO, WARN, ERROR, FATAL
  • Tools: ELK Stack (Elasticsearch, Logstash, Kibana), Datadog, Splunk

Metrics

  • Numeric measurements over time (counters, gauges, histograms)
  • Key metrics: RED method (Rate, Errors, Duration) for services
  • USE method (Utilization, Saturation, Errors) for resources
  • Tools: Prometheus + Grafana, Datadog, CloudWatch

Traces (Distributed Tracing)

  • Track a request as it flows through multiple services
  • Each service adds a span with timing and context
  • Helps identify which service is slow and where errors originate
  • Tools: Jaeger, Zipkin, AWS X-Ray, Datadog APM
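
The span relationships above can be sketched in a few lines. This is a minimal, hypothetical model — real tracers (OpenTelemetry SDKs, Jaeger clients) manage span creation and propagation for you, and the field names here are illustrative, not a real API:

```javascript
// A span is one unit of work within a trace. All spans in one request
// share a traceId; parentSpanId links them into a tree.
class Span {
  constructor(traceId, name, parentSpanId = null) {
    this.traceId = traceId;           // shared across every service in the request
    this.spanId = Math.random().toString(36).slice(2, 10);
    this.parentSpanId = parentSpanId; // null for the root span
    this.name = name;
    this.start = Date.now();
  }
  end() {
    this.durationMs = Date.now() - this.start;
    return this;
  }
}

// One request flowing through two services:
const root = new Span('trace_abc', 'GET /checkout');
const child = new Span(root.traceId, 'payment-service.charge', root.spanId);
child.end();
root.end();
```

Comparing each span's `durationMs` against its parent's is what lets a trace viewer show exactly which hop made the request slow.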

Key Metrics to Monitor

| Category   | Metrics                                     |
|------------|---------------------------------------------|
| Latency    | p50, p95, p99 response times                |
| Traffic    | Requests per second, concurrent connections |
| Errors     | Error rate (4xx, 5xx), exception count      |
| Saturation | CPU usage, memory usage, queue depth        |
| Business   | Active users, revenue, conversion rate      |

SLIs, SLOs, and SLAs

  • SLI (Service Level Indicator): What you measure (e.g., % of requests < 200ms)
  • SLO (Service Level Objective): Your target (e.g., 99.9% of requests < 200ms)
  • SLA (Service Level Agreement): Your contractual promise to customers, with consequences for breaking it (e.g., 99.9% uptime, or customers receive a service credit)
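
The relationship between an SLI and its error budget can be made concrete. A minimal sketch, with made-up request data — the function names are illustrative, not from any monitoring library:

```javascript
// SLI: fraction of requests that were "good" (here: faster than a threshold).
function computeSli(durationsMs, thresholdMs) {
  const good = durationsMs.filter(d => d < thresholdMs).length;
  return good / durationsMs.length;
}

// Error budget: how many bad requests the SLO permits, minus how many
// actually occurred. Negative means the SLO has been violated.
function errorBudgetRemaining(sli, slo, totalRequests) {
  const allowedBad = (1 - slo) * totalRequests;
  const actualBad = (1 - sli) * totalRequests;
  return allowedBad - actualBad;
}

const durations = [120, 180, 95, 450, 160, 210, 140, 130, 90, 175];
const sli = computeSli(durations, 200);                          // 8/10 = 0.8
const remaining = errorBudgetRemaining(sli, 0.999, durations.length);
// remaining < 0: this window alone blows a 99.9% budget
```

Tracking the budget rather than the raw SLI is what makes the "slow down deployments vs. ship features" decision mechanical instead of a debate.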

The Nines of Availability

| Availability         | Downtime/Year | Downtime/Month |
|----------------------|---------------|----------------|
| 99% (two nines)      | 3.65 days     | 7.3 hours      |
| 99.9% (three nines)  | 8.77 hours    | 43.8 minutes   |
| 99.99% (four nines)  | 52.6 minutes  | 4.38 minutes   |
| 99.999% (five nines) | 5.26 minutes  | 26.3 seconds   |
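
The table follows from simple arithmetic. A quick sketch (assuming a 365.25-day year, which is why the figures are approximate):

```javascript
// Convert an availability percentage into allowed downtime.
function downtime(availabilityPct) {
  const fractionDown = 1 - availabilityPct / 100;
  const minutesPerYear = 365.25 * 24 * 60; // 525,960 minutes
  return {
    perYearMinutes: fractionDown * minutesPerYear,
    perMonthMinutes: (fractionDown * minutesPerYear) / 12,
  };
}

console.log(downtime(99.9)); // ~525.96 min/year ≈ 8.77 hours, ~43.8 min/month
```

Each extra nine cuts the budget by 10×, which is why five nines leaves barely five minutes per year for all upgrades, incidents, and failovers combined.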

Interview tip: When you mention SLOs in a system design, it shows you think about operational excellence, not just features.

💻 Code Example

```javascript
// ============================================
// Observability — Practical Implementation
// ============================================

// ---------- Structured Logging ----------
class Logger {
  constructor(serviceName) {
    this.serviceName = serviceName;
  }

  log(level, message, context = {}) {
    const entry = {
      timestamp: new Date().toISOString(),
      level,
      service: this.serviceName,
      message,
      ...context,
      traceId: context.traceId || 'no-trace',
    };
    console.log(JSON.stringify(entry));
  }

  info(msg, ctx) { this.log('INFO', msg, ctx); }
  warn(msg, ctx) { this.log('WARN', msg, ctx); }
  error(msg, ctx) { this.log('ERROR', msg, ctx); }
}

// Usage
const logger = new Logger('order-service');
logger.info('Order created', { orderId: 'ord_123', userId: 'user_456', total: 99.99, traceId: 'trace_abc' });
logger.error('Payment failed', { orderId: 'ord_123', error: 'Card declined', traceId: 'trace_abc' });

// ---------- Metrics Collection ----------
class MetricsCollector {
  constructor() {
    this.counters = new Map();
    this.histograms = new Map();
  }

  // Counter: things that only go up
  increment(name, labels = {}) {
    const key = this.makeKey(name, labels);
    this.counters.set(key, (this.counters.get(key) || 0) + 1);
  }

  // Histogram: distribution of values
  observe(name, value, labels = {}) {
    const key = this.makeKey(name, labels);
    if (!this.histograms.has(key)) this.histograms.set(key, []);
    this.histograms.get(key).push(value);
  }

  // Nearest-rank percentile over recorded observations
  // (copy before sorting so the stored data is not mutated)
  percentile(name, p, labels = {}) {
    const key = this.makeKey(name, labels);
    const values = (this.histograms.get(key) || []).slice().sort((a, b) => a - b);
    if (values.length === 0) return 0;
    const index = Math.ceil(values.length * (p / 100)) - 1;
    return values[index];
  }

  makeKey(name, labels) {
    return name + JSON.stringify(labels);
  }
}

// ---------- Request Monitoring Middleware ----------
function monitoringMiddleware(metrics, logger) {
  return (req, res, next) => {
    const start = Date.now();
    const traceId = req.headers['x-trace-id'] || generateTraceId();

    // Attach trace ID to the request and echo it back to the client
    req.traceId = traceId;
    res.setHeader('x-trace-id', traceId);

    // On response finish, record metrics
    res.on('finish', () => {
      const duration = Date.now() - start;
      const labels = { method: req.method, path: req.route?.path || req.path, status: res.statusCode };

      metrics.increment('http_requests_total', labels);
      metrics.observe('http_request_duration_ms', duration, labels);

      if (res.statusCode >= 500) {
        metrics.increment('http_errors_total', labels);
        logger.error('Request failed', { ...labels, duration, traceId });
      } else if (duration > 1000) {
        logger.warn('Slow request', { ...labels, duration, traceId });
      }
    });

    next();
  };
}

// ---------- Health Check Endpoint ----------
class HealthChecker {
  constructor() { this.checks = []; }

  addCheck(name, checkFn) {
    this.checks.push({ name, check: checkFn });
  }

  async getHealth() {
    const results = await Promise.all(
      this.checks.map(async ({ name, check }) => {
        try {
          await check();
          return { name, status: 'healthy' };
        } catch (error) {
          return { name, status: 'unhealthy', error: error.message };
        }
      })
    );

    const healthy = results.every(r => r.status === 'healthy');
    return { status: healthy ? 'healthy' : 'unhealthy', checks: results, timestamp: new Date().toISOString() };
  }
}

function generateTraceId() { return 'trace_' + Math.random().toString(36).slice(2); }

// Demo
const metrics = new MetricsCollector();
[50, 80, 120, 200, 500, 45, 90, 150].forEach(d => metrics.observe('latency', d));
console.log('p50:', metrics.percentile('latency', 50)); // 90
console.log('p99:', metrics.percentile('latency', 99)); // 500
```

🏋️ Practice Exercise

  1. Monitoring Dashboard: Design a monitoring dashboard for a microservices architecture. Include: service health, latency percentiles, error rates, throughput, and business metrics.

  2. SLO Budget: Your API has a 99.9% availability SLO. You've used 50% of your error budget in the first week of the month. Design the response plan: slow down deployments, freeze features, or investigate?

  3. Distributed Tracing: A user reports slow checkout (10s response time). Design how you would trace the request across: API Gateway → Order Service → Payment Service → Inventory Service to find the bottleneck.

  4. Alerting Strategy: Design the alerting strategy: What metrics trigger pages? What's the escalation path? How do you avoid alert fatigue?

⚠️ Common Mistakes

  • Logging too much or too little — logging every request at DEBUG pollutes logs; logging only errors misses context. Use structured logging with appropriate levels.

  • Not monitoring business metrics — technical metrics (CPU, latency) are necessary but insufficient. Track business KPIs (revenue, conversions, user signups) alongside.

  • Alert fatigue — alerting on every anomaly leads to ignored alerts. Use tiered alerting: pages for SLO violations, notifications for warnings, dashboards for everything else.

  • Not having SLOs — without measurable targets, 'reliability' is subjective. Define SLOs (99.9% availability, p99 < 500ms) and track error budgets.
