Three Pillars of Observability
📖 Concept
Observability is the ability to understand a system's internal state by examining its outputs. The three pillars are: Logs, Metrics, and Traces.
Logs
- Discrete events with timestamps and context
- Prefer structured logging (JSON) over unstructured text — machine-parseable, filterable, queryable
- Use log levels: DEBUG, INFO, WARN, ERROR, FATAL
- Tools: ELK Stack (Elasticsearch, Logstash, Kibana), Datadog, Splunk
Metrics
- Numeric measurements over time (counters, gauges, histograms)
- Key metrics: RED method (Rate, Errors, Duration) for services
- USE method (Utilization, Saturation, Errors) for resources
- Tools: Prometheus + Grafana, Datadog, CloudWatch
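The counter/gauge/histogram distinction above can be sketched in a few lines — a simplified model of what Prometheus-style client libraries provide; the class names and bucket bounds here are illustrative:

```javascript
// Counter: monotonically increasing (requests served, errors seen)
class Counter {
  constructor() { this.value = 0; }
  inc(n = 1) { this.value += n; }
}

// Gauge: a value that can go up or down (queue depth, memory in use)
class Gauge {
  constructor() { this.value = 0; }
  set(v) { this.value = v; }
}

// Histogram: counts observations into buckets (latency distributions)
class Histogram {
  constructor(buckets) {                  // bucket upper bounds, e.g. ms
    this.buckets = buckets;
    this.counts = new Array(buckets.length + 1).fill(0); // last = overflow (+Inf)
    this.sum = 0;
  }
  observe(v) {
    this.sum += v;
    const i = this.buckets.findIndex(b => v <= b);
    this.counts[i === -1 ? this.buckets.length : i] += 1;
  }
}

const requests = new Counter();
requests.inc();                           // e.g. once per HTTP request
const queueDepth = new Gauge();
queueDepth.set(42);                       // e.g. sampled on each scrape
const latency = new Histogram([100, 500, 1000]);
[50, 120, 800, 2000].forEach(v => latency.observe(v));
console.log(requests.value, queueDepth.value, latency.counts);
```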
Traces (Distributed Tracing)
- Track a request as it flows through multiple services
- Each service adds a span with timing and context
- Helps identify: which service is slow, where errors originate
- Tools: Jaeger, Zipkin, AWS X-Ray, Datadog APM
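A span, in miniature — a simplified sketch of the idea; production systems typically use OpenTelemetry SDKs and the W3C `traceparent` header rather than hand-rolled IDs like these:

```javascript
// Every span shares the request's traceId; parentSpanId links spans
// into a tree so the tracing backend can reassemble the call graph.
function startSpan(name, traceId, parentSpanId = null) {
  return {
    traceId,
    spanId: Math.random().toString(36).slice(2, 10),
    parentSpanId,
    name,
    start: Date.now(),
    end: null,
  };
}

function finishSpan(span) {
  span.end = Date.now();
  return span;
}

// One request flowing through two services:
const traceId = 'trace_abc';
const gateway = startSpan('api-gateway', traceId);
const orders = startSpan('order-service', traceId, gateway.spanId);
finishSpan(orders);
finishSpan(gateway);

// Each service exports its spans; the backend groups them by traceId
// and uses parentSpanId + timings to show where the time went.
console.log(orders.parentSpanId === gateway.spanId); // true
```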
Key Metrics to Monitor
| Category | Metrics |
|---|---|
| Latency | p50, p95, p99 response times |
| Traffic | Requests per second, concurrent connections |
| Errors | Error rate (4xx, 5xx), exception count |
| Saturation | CPU usage, memory usage, queue depth |
| Business | Active users, revenue, conversion rate |
SLIs, SLOs, and SLAs
- SLI (Service Level Indicator): What you measure (e.g., % of requests < 200ms)
- SLO (Service Level Objective): Your target (e.g., 99.9% of requests < 200ms)
- SLA (Service Level Agreement): Your promise to customers (e.g., 99.9% uptime or credit)
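An SLO implies an error budget: the fraction of requests allowed to fail before the objective is breached. A quick sketch of the arithmetic (the traffic numbers are illustrative):

```javascript
// Error budget = total events × (1 − SLO target)
function errorBudget(slo, totalRequests) {
  return Math.floor(totalRequests * (1 - slo)); // allowed "bad" events
}

// Fraction of the budget still unspent
function budgetRemaining(slo, totalRequests, failedRequests) {
  const budget = errorBudget(slo, totalRequests);
  return (budget - failedRequests) / budget;
}

// 10M requests this month at 99.9%: 10,000 failures allowed.
console.log(errorBudget(0.999, 10_000_000));           // 10000
console.log(budgetRemaining(0.999, 10_000_000, 5000)); // 0.5 — half the budget used
```

The remaining-budget fraction is what drives decisions like the SLO-budget exercise below: spend it on velocity (deploys, experiments) or conserve it.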
The Nines of Availability
| Availability | Downtime/Year | Downtime/Month |
|---|---|---|
| 99% (two nines) | 3.65 days | 7.3 hours |
| 99.9% (three nines) | 8.77 hours | 43.8 minutes |
| 99.99% (four nines) | 52.6 minutes | 4.38 minutes |
| 99.999% (five nines) | 5.26 minutes | 26.3 seconds |
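These figures fall out of a one-line formula, downtime = period × (1 − availability); a sketch assuming a 365.25-day year and a 30.44-day average month to match the table's rounding:

```javascript
// Convert an availability target into allowed downtime minutes.
function downtimeMinutes(availability, periodDays) {
  return periodDays * 24 * 60 * (1 - availability);
}

// Reproduce the table: per year and per average month.
for (const a of [0.99, 0.999, 0.9999, 0.99999]) {
  console.log(
    `${(a * 100).toFixed(3)}%:`,
    downtimeMinutes(a, 365.25).toFixed(1), 'min/year,',
    downtimeMinutes(a, 30.44).toFixed(1), 'min/month'
  );
}
```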
Interview tip: When you mention SLOs in a system design, it shows you think about operational excellence, not just features.
💻 Code Example
```javascript
// ============================================
// Observability — Practical Implementation
// ============================================

// ---------- Structured Logging ----------
class Logger {
  constructor(serviceName) {
    this.serviceName = serviceName;
  }

  log(level, message, context = {}) {
    const entry = {
      timestamp: new Date().toISOString(),
      level,
      service: this.serviceName,
      message,
      ...context,
      traceId: context.traceId || 'no-trace',
    };
    console.log(JSON.stringify(entry));
  }

  info(msg, ctx) { this.log('INFO', msg, ctx); }
  warn(msg, ctx) { this.log('WARN', msg, ctx); }
  error(msg, ctx) { this.log('ERROR', msg, ctx); }
}

// Usage
const logger = new Logger('order-service');
logger.info('Order created', { orderId: 'ord_123', userId: 'user_456', total: 99.99, traceId: 'trace_abc' });
logger.error('Payment failed', { orderId: 'ord_123', error: 'Card declined', traceId: 'trace_abc' });

// ---------- Metrics Collection ----------
class MetricsCollector {
  constructor() {
    this.counters = new Map();
    this.histograms = new Map();
  }

  // Counter: things that only go up
  increment(name, labels = {}) {
    const key = this.makeKey(name, labels);
    this.counters.set(key, (this.counters.get(key) || 0) + 1);
  }

  // Histogram: distribution of values
  observe(name, value, labels = {}) {
    const key = this.makeKey(name, labels);
    if (!this.histograms.has(key)) this.histograms.set(key, []);
    this.histograms.get(key).push(value);
  }

  // Calculate percentiles (nearest-rank method)
  percentile(name, p, labels = {}) {
    const key = this.makeKey(name, labels);
    const values = (this.histograms.get(key) || []).sort((a, b) => a - b);
    if (values.length === 0) return 0;
    const index = Math.ceil(values.length * (p / 100)) - 1;
    return values[index];
  }

  makeKey(name, labels) {
    return name + JSON.stringify(labels);
  }
}

// ---------- Request Monitoring Middleware ----------
function monitoringMiddleware(metrics, logger) {
  return (req, res, next) => {
    const start = Date.now();
    const traceId = req.headers['x-trace-id'] || generateTraceId();

    // Attach trace ID to request and response
    req.traceId = traceId;
    res.setHeader('x-trace-id', traceId);

    // On response finish, record metrics
    res.on('finish', () => {
      const duration = Date.now() - start;
      const labels = { method: req.method, path: req.route?.path || req.path, status: res.statusCode };

      metrics.increment('http_requests_total', labels);
      metrics.observe('http_request_duration_ms', duration, labels);

      if (res.statusCode >= 500) {
        metrics.increment('http_errors_total', labels);
        logger.error('Request failed', { ...labels, duration, traceId });
      } else if (duration > 1000) {
        logger.warn('Slow request', { ...labels, duration, traceId });
      }
    });

    next();
  };
}

// ---------- Health Check Endpoint ----------
class HealthChecker {
  constructor() { this.checks = []; }

  addCheck(name, checkFn) {
    this.checks.push({ name, check: checkFn });
  }

  async getHealth() {
    const results = await Promise.all(
      this.checks.map(async ({ name, check }) => {
        try {
          await check();
          return { name, status: 'healthy' };
        } catch (error) {
          return { name, status: 'unhealthy', error: error.message };
        }
      })
    );

    const healthy = results.every(r => r.status === 'healthy');
    return { status: healthy ? 'healthy' : 'unhealthy', checks: results, timestamp: new Date().toISOString() };
  }
}

function generateTraceId() { return 'trace_' + Math.random().toString(36).slice(2); }

// Demo
const metrics = new MetricsCollector();
[50, 80, 120, 200, 500, 45, 90, 150].forEach(d => metrics.observe('latency', d));
console.log('p50:', metrics.percentile('latency', 50));
console.log('p99:', metrics.percentile('latency', 99));
```
🏋️ Practice Exercise
Monitoring Dashboard: Design a monitoring dashboard for a microservices architecture. Include: service health, latency percentiles, error rates, throughput, and business metrics.
SLO Budget: Your API has a 99.9% availability SLO. You've used 50% of your error budget in the first week of the month. Design the response plan: slow down deployments, freeze features, or investigate?
Distributed Tracing: A user reports slow checkout (10s response time). Design how you would trace the request across: API Gateway → Order Service → Payment Service → Inventory Service to find the bottleneck.
Alerting Strategy: Design the alerting strategy: What metrics trigger pages? What's the escalation path? How do you avoid alert fatigue?
⚠️ Common Mistakes
Logging too much or too little — logging every request at DEBUG pollutes logs; logging only errors misses context. Use structured logging with appropriate levels.
Not monitoring business metrics — technical metrics (CPU, latency) are necessary but insufficient. Track business KPIs (revenue, conversions, user signups) alongside.
Alert fatigue — alerting on every anomaly leads to ignored alerts. Use tiered alerting: pages for SLO violations, notifications for warnings, dashboards for everything else.
Not having SLOs — without measurable targets, 'reliability' is subjective. Define SLOs (99.9% availability, p99 < 500ms) and track error budgets.
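The last two points — tiered alerting and error budgets — are often combined as burn-rate alerts: page only when the budget is being consumed fast enough to threaten the SLO. A minimal sketch; the thresholds loosely follow common multi-window practice and are illustrative:

```javascript
// Burn rate = observed error rate / error rate the SLO allows.
// A burn rate of 1 spends the budget exactly over the SLO window;
// 14.4 exhausts a 30-day budget in ~2 days.
function burnRate(errorRate, slo) {
  return errorRate / (1 - slo);
}

function alertLevel(errorRate, slo) {
  const rate = burnRate(errorRate, slo);
  if (rate >= 14.4) return 'page';    // budget gone in ~2 days → wake someone
  if (rate >= 6)    return 'ticket';  // budget gone in ~5 days → next business day
  return 'none';                      // dashboards only
}

console.log(alertLevel(0.02, 0.999));   // 2% errors vs 0.1% allowed → 'page'
console.log(alertLevel(0.0005, 0.999)); // within budget → 'none'
```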