Three Pillars of Observability
📖 Concept
Observability is the ability to understand a system's internal state by examining its outputs. The three pillars are: Logs, Metrics, and Traces.
Logs
- Discrete events with timestamps and context
- Prefer structured logging (JSON) over unstructured text — machine-parseable, filterable, queryable
- Use log levels: DEBUG, INFO, WARN, ERROR, FATAL
- Tools: ELK Stack (Elasticsearch, Logstash, Kibana), Datadog, Splunk
Metrics
- Numeric measurements over time (counters, gauges, histograms)
- Key metrics: RED method (Rate, Errors, Duration) for services
- USE method (Utilization, Saturation, Errors) for resources
- Tools: Prometheus + Grafana, Datadog, CloudWatch
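The counter/gauge/histogram distinction above can be sketched in a few lines — a simplified model of what Prometheus-style client libraries provide; the class names and bucket bounds here are illustrative:

```javascript
// Counter: monotonically increasing (requests served, errors seen)
class Counter {
  constructor() { this.value = 0; }
  inc(n = 1) { this.value += n; }
}

// Gauge: a value that can go up or down (queue depth, memory in use)
class Gauge {
  constructor() { this.value = 0; }
  set(v) { this.value = v; }
}

// Histogram: counts observations into buckets (latency distributions)
class Histogram {
  constructor(buckets) {                  // bucket upper bounds, e.g. ms
    this.buckets = buckets;
    this.counts = new Array(buckets.length + 1).fill(0); // last = overflow (+Inf)
    this.sum = 0;
  }
  observe(v) {
    this.sum += v;
    const i = this.buckets.findIndex(b => v <= b);
    this.counts[i === -1 ? this.buckets.length : i] += 1;
  }
}

const requests = new Counter();
requests.inc();                           // e.g. once per HTTP request
const queueDepth = new Gauge();
queueDepth.set(42);                       // e.g. sampled on each scrape
const latency = new Histogram([100, 500, 1000]);
[50, 120, 800, 2000].forEach(v => latency.observe(v));
console.log(requests.value, queueDepth.value, latency.counts);
```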
Traces (Distributed Tracing)
- Track a request as it flows through multiple services
- Each service adds a span with timing and context
- Helps identify: which service is slow, where errors originate
- Tools: Jaeger, Zipkin, AWS X-Ray, Datadog APM
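A span, in miniature — a simplified sketch of the idea; production systems typically use OpenTelemetry SDKs and the W3C `traceparent` header rather than hand-rolled IDs like these:

```javascript
// Every span shares the request's traceId; parentSpanId links spans
// into a tree so the tracing backend can reassemble the call graph.
function startSpan(name, traceId, parentSpanId = null) {
  return {
    traceId,
    spanId: Math.random().toString(36).slice(2, 10),
    parentSpanId,
    name,
    start: Date.now(),
    end: null,
  };
}

function finishSpan(span) {
  span.end = Date.now();
  return span;
}

// One request flowing through two services:
const traceId = 'trace_abc';
const gateway = startSpan('api-gateway', traceId);
const orders = startSpan('order-service', traceId, gateway.spanId);
finishSpan(orders);
finishSpan(gateway);

// Each service exports its spans; the backend groups them by traceId
// and uses parentSpanId + timings to show where the time went.
console.log(orders.parentSpanId === gateway.spanId); // true
```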
Key Metrics to Monitor
| Category | Metrics |
|---|---|
| Latency | p50, p95, p99 response times |
| Traffic | Requests per second, concurrent connections |
| Errors | Error rate (4xx, 5xx), exception count |
| Saturation | CPU usage, memory usage, queue depth |
| Business | Active users, revenue, conversion rate |
SLIs, SLOs, and SLAs
- SLI (Service Level Indicator): What you measure (e.g., % of requests < 200ms)
- SLO (Service Level Objective): Your target (e.g., 99.9% of requests < 200ms)
- SLA (Service Level Agreement): Your promise to customers (e.g., 99.9% uptime or credit)
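An SLO implies an error budget: the fraction of requests allowed to fail before the objective is breached. A quick sketch of the arithmetic (the traffic numbers are illustrative):

```javascript
// Error budget = total events × (1 − SLO target)
function errorBudget(slo, totalRequests) {
  return Math.floor(totalRequests * (1 - slo)); // allowed "bad" events
}

// Fraction of the budget still unspent
function budgetRemaining(slo, totalRequests, failedRequests) {
  const budget = errorBudget(slo, totalRequests);
  return (budget - failedRequests) / budget;
}

// 10M requests this month at 99.9%: 10,000 failures allowed.
console.log(errorBudget(0.999, 10_000_000));           // 10000
console.log(budgetRemaining(0.999, 10_000_000, 5000)); // 0.5 — half the budget used
```

The remaining-budget fraction is what drives decisions like the SLO-budget exercise below: spend it on velocity (deploys, experiments) or conserve it.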
The Nines of Availability
| Availability | Downtime/Year | Downtime/Month |
|---|---|---|
| 99% (two nines) | 3.65 days | 7.3 hours |
| 99.9% (three nines) | 8.77 hours | 43.8 minutes |
| 99.99% (four nines) | 52.6 minutes | 4.38 minutes |
| 99.999% (five nines) | 5.26 minutes | 26.3 seconds |
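These figures fall out of a one-line formula, downtime = period × (1 − availability); a sketch assuming a 365.25-day year and a 30.44-day average month to match the table's rounding:

```javascript
// Convert an availability target into allowed downtime minutes.
function downtimeMinutes(availability, periodDays) {
  return periodDays * 24 * 60 * (1 - availability);
}

// Reproduce the table: per year and per average month.
for (const a of [0.99, 0.999, 0.9999, 0.99999]) {
  console.log(
    `${(a * 100).toFixed(3)}%:`,
    downtimeMinutes(a, 365.25).toFixed(1), 'min/year,',
    downtimeMinutes(a, 30.44).toFixed(1), 'min/month'
  );
}
```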
Interview tip: When you mention SLOs in a system design, it shows you think about operational excellence, not just features.
💻 Code Example
```javascript
// ============================================
// Observability — Practical Implementation
// ============================================

// ---------- Structured Logging ----------
class Logger {
  constructor(serviceName) {
    this.serviceName = serviceName;
  }

  log(level, message, context = {}) {
    const entry = {
      timestamp: new Date().toISOString(),
      level,
      service: this.serviceName,
      message,
      ...context,
      traceId: context.traceId || 'no-trace',
    };
    console.log(JSON.stringify(entry));
  }

  info(msg, ctx) { this.log('INFO', msg, ctx); }
  warn(msg, ctx) { this.log('WARN', msg, ctx); }
  error(msg, ctx) { this.log('ERROR', msg, ctx); }
}

// Usage
const logger = new Logger('order-service');
logger.info('Order created', { orderId: 'ord_123', userId: 'user_456', total: 99.99, traceId: 'trace_abc' });
logger.error('Payment failed', { orderId: 'ord_123', error: 'Card declined', traceId: 'trace_abc' });

// ---------- Metrics Collection ----------
class MetricsCollector {
  constructor() {
    this.counters = new Map();
    this.histograms = new Map();
  }

  // Counter: things that only go up
  increment(name, labels = {}) {
    const key = this.makeKey(name, labels);
    this.counters.set(key, (this.counters.get(key) || 0) + 1);
  }

  // Histogram: distribution of values
  observe(name, value, labels = {}) {
    const key = this.makeKey(name, labels);
    if (!this.histograms.has(key)) this.histograms.set(key, []);
    this.histograms.get(key).push(value);
  }

  // Calculate percentiles (nearest-rank method)
  percentile(name, p, labels = {}) {
    const key = this.makeKey(name, labels);
    const values = (this.histograms.get(key) || []).sort((a, b) => a - b);
    if (values.length === 0) return 0;
    const index = Math.ceil(values.length * (p / 100)) - 1;
    return values[index];
  }

  makeKey(name, labels) {
    return name + JSON.stringify(labels);
  }
}

// ---------- Request Monitoring Middleware ----------
function monitoringMiddleware(metrics, logger) {
  return (req, res, next) => {
    const start = Date.now();
    const traceId = req.headers['x-trace-id'] || generateTraceId();

    // Attach trace ID to request and response
    req.traceId = traceId;
    res.setHeader('x-trace-id', traceId);

    // On response finish, record metrics
    res.on('finish', () => {
      const duration = Date.now() - start;
      const labels = { method: req.method, path: req.route?.path || req.path, status: res.statusCode };

      metrics.increment('http_requests_total', labels);
      metrics.observe('http_request_duration_ms', duration, labels);

      if (res.statusCode >= 500) {
        metrics.increment('http_errors_total', labels);
        logger.error('Request failed', { ...labels, duration, traceId });
      } else if (duration > 1000) {
        logger.warn('Slow request', { ...labels, duration, traceId });
      }
    });

    next();
  };
}

// ---------- Health Check Endpoint ----------
class HealthChecker {
  constructor() { this.checks = []; }

  addCheck(name, checkFn) {
    this.checks.push({ name, check: checkFn });
  }

  async getHealth() {
    const results = await Promise.all(
      this.checks.map(async ({ name, check }) => {
        try {
          await check();
          return { name, status: 'healthy' };
        } catch (error) {
          return { name, status: 'unhealthy', error: error.message };
        }
      })
    );

    const healthy = results.every(r => r.status === 'healthy');
    return { status: healthy ? 'healthy' : 'unhealthy', checks: results, timestamp: new Date().toISOString() };
  }
}

function generateTraceId() { return 'trace_' + Math.random().toString(36).slice(2); }

// Demo
const metrics = new MetricsCollector();
[50, 80, 120, 200, 500, 45, 90, 150].forEach(d => metrics.observe('latency', d));
console.log('p50:', metrics.percentile('latency', 50));
console.log('p99:', metrics.percentile('latency', 99));
```
🏋️ Practice Exercise
Monitoring Dashboard: Design a monitoring dashboard for a microservices architecture. Include: service health, latency percentiles, error rates, throughput, and business metrics.
SLO Budget: Your API has a 99.9% availability SLO. You've used 50% of your error budget in the first week of the month. Design the response plan: slow down deployments, freeze features, or investigate?
Distributed Tracing: A user reports slow checkout (10s response time). Design how you would trace the request across: API Gateway → Order Service → Payment Service → Inventory Service to find the bottleneck.
Alerting Strategy: Design the alerting strategy: What metrics trigger pages? What's the escalation path? How do you avoid alert fatigue?
⚠️ Common Mistakes
Logging too much or too little — logging every request at DEBUG pollutes logs; logging only errors misses context. Use structured logging with appropriate levels.
Not monitoring business metrics — technical metrics (CPU, latency) are necessary but insufficient. Track business KPIs (revenue, conversions, user signups) alongside.
Alert fatigue — alerting on every anomaly leads to ignored alerts. Use tiered alerting: pages for SLO violations, notifications for warnings, dashboards for everything else.
Not having SLOs — without measurable targets, 'reliability' is subjective. Define SLOs (99.9% availability, p99 < 500ms) and track error budgets.
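The last two points — tiered alerting and error budgets — are often combined as burn-rate alerts: page only when the budget is being consumed fast enough to threaten the SLO. A minimal sketch; the thresholds loosely follow common multi-window practice and are illustrative:

```javascript
// Burn rate = observed error rate / error rate the SLO allows.
// A burn rate of 1 spends the budget exactly over the SLO window;
// 14.4 exhausts a 30-day budget in ~2 days.
function burnRate(errorRate, slo) {
  return errorRate / (1 - slo);
}

function alertLevel(errorRate, slo) {
  const rate = burnRate(errorRate, slo);
  if (rate >= 14.4) return 'page';    // budget gone in ~2 days → wake someone
  if (rate >= 6)    return 'ticket';  // budget gone in ~5 days → next business day
  return 'none';                      // dashboards only
}

console.log(alertLevel(0.02, 0.999));   // 2% errors vs 0.1% allowed → 'page'
console.log(alertLevel(0.0005, 0.999)); // within budget → 'none'
```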