Monitoring & Observability

📖 Concept

Monitoring tells you WHEN something is wrong. Observability tells you WHY. Together, they ensure your production Node.js application is reliable and performant.

Three pillars of observability:

  1. Logs — Timestamped records of events (Winston, Pino)
  2. Metrics — Numerical measurements over time (response times, error rates)
  3. Traces — Request flow across services (OpenTelemetry)
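Logs are most useful when they are structured. Here is a minimal sketch of what structured loggers like Pino produce under the hood; the `service` name and event fields below are made up for illustration:

```javascript
// Structured logging sketch: each log line is one JSON object, so log
// aggregators (ELK, Loki) can index fields instead of parsing free text.
// Real loggers like Pino do this far faster, with levels and transports.
function makeLogger(base = {}) {
  return (level, msg, fields = {}) => {
    const line = JSON.stringify({
      level,
      time: new Date().toISOString(),
      msg,
      ...base,   // fields attached to every line (service name, version, ...)
      ...fields, // per-event fields
    });
    console.log(line);
    return line;
  };
}

const log = makeLogger({ service: "checkout" }); // hypothetical service name
log("info", "order created", { orderId: 42, durationMs: 87 });
```

Because every line is machine-parseable JSON, you can later filter by `orderId` or aggregate `durationMs` instead of grepping free-form strings.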

Key metrics to monitor:

  • Application: response time (p50, p95, p99), error rate, request rate
  • System: CPU usage, memory (RSS, heap), event loop lag
  • Database: query time, connection pool utilization, slow queries
  • Business: signups, orders, payments (depends on your domain)
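To see why p95 and p99 matter alongside averages, here is a simplified nearest-rank percentile computation over raw latency samples. The sample numbers are fabricated; real systems (including Prometheus histograms) approximate percentiles with buckets rather than storing every sample:

```javascript
// Nearest-rank percentile over raw samples (simplified; production systems
// use histogram buckets or sketches to avoid keeping all samples in memory).
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, idx)];
}

const latenciesMs = [12, 15, 18, 22, 30, 45, 80, 120, 400, 950]; // fabricated

console.log("p50:", percentile(latenciesMs, 50)); // 30, the typical request
console.log("p95:", percentile(latenciesMs, 95)); // 950, the slow tail a mean would hide
```

The mean of these samples is about 169 ms, which describes no actual request; p50 and p95 describe the typical case and the tail your slowest users actually experience.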

Monitoring stack options:

  • Metrics collection: Prometheus, StatsD, Datadog Agent
  • Metrics visualization: Grafana, Datadog, New Relic
  • Log aggregation: ELK Stack, Loki + Grafana, Datadog Logs
  • Error tracking: Sentry, Bugsnag, Rollbar
  • APM (traces): Datadog APM, New Relic, Elastic APM
  • Uptime monitoring: Pingdom, UptimeRobot, Better Uptime

Alerting best practices:

  • Alert on symptoms (high error rate), not causes (high CPU)
  • Use severity levels — critical (pager), warning (Slack), info (dashboard)
  • Prevent alert fatigue — too many alerts = they all get ignored
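The severity tiers above can be sketched as a simple evaluation rule. The 5% and 1% thresholds here are purely illustrative, not recommendations; tune them to your own traffic and error budget:

```javascript
// Symptom-based alerting sketch: derive severity from the user-visible
// error rate over a window, not from raw CPU. Thresholds are illustrative.
function evaluateAlert({ requests, errors }) {
  const errorRate = requests === 0 ? 0 : errors / requests;
  if (errorRate > 0.05) return { severity: "critical", channel: "pager" };
  if (errorRate > 0.01) return { severity: "warning", channel: "slack" };
  return { severity: "info", channel: "dashboard" };
}

console.log(evaluateAlert({ requests: 1000, errors: 80 })); // critical, pages someone
```

Routing only critical alerts to a pager and warnings to chat is one concrete way to keep alert volume low enough that pages still mean something.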

🏠 Real-world analogy: Monitoring is like a car dashboard — it shows speed (throughput), fuel level (memory), engine temperature (CPU). Observability is like a mechanic's diagnostic tool — it tells you exactly which sensor, cable, or component is failing and why.

💻 Code Example

// Monitoring & Observability Setup

const express = require("express");
const client = require("prom-client"); // Prometheus client

const app = express();

// 1. Prometheus metrics setup
const register = new client.Registry();
client.collectDefaultMetrics({ register }); // CPU, memory, event loop

// Custom metrics
const httpRequestDuration = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "Duration of HTTP requests in seconds",
  labelNames: ["method", "route", "status_code"],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5],
});
register.registerMetric(httpRequestDuration);

const httpRequestTotal = new client.Counter({
  name: "http_requests_total",
  help: "Total number of HTTP requests",
  labelNames: ["method", "route", "status_code"],
});
register.registerMetric(httpRequestTotal);

const activeConnections = new client.Gauge({
  name: "active_connections",
  help: "Number of active connections",
});
register.registerMetric(activeConnections);

// 2. Metrics middleware
app.use((req, res, next) => {
  activeConnections.inc();
  const end = httpRequestDuration.startTimer();

  res.on("finish", () => {
    const route = req.route?.path || req.path;
    const labels = { method: req.method, route, status_code: res.statusCode };
    end(labels);
    httpRequestTotal.inc(labels);
    activeConnections.dec();
  });

  next();
});

// 3. Metrics endpoint (Prometheus scrapes this)
app.get("/metrics", async (req, res) => {
  res.set("Content-Type", register.contentType);
  res.end(await register.metrics());
});

// 4. Health check endpoint
app.get("/health", async (req, res) => {
  const checks = {
    uptime: process.uptime(),
    memory: process.memoryUsage(),
    timestamp: Date.now(),
  };

  // Check database
  try {
    // await db.query("SELECT 1");
    checks.database = "healthy";
  } catch (err) {
    checks.database = "unhealthy";
  }

  // Check Redis
  try {
    // await redis.ping();
    checks.cache = "healthy";
  } catch (err) {
    checks.cache = "unhealthy";
  }

  const isHealthy = checks.database === "healthy";
  res.status(isHealthy ? 200 : 503).json(checks);
});

// 5. Readiness vs. liveness probes (Kubernetes)
app.get("/ready", (req, res) => {
  // Ready to accept traffic?
  // Check: DB connected, cache connected, migrations applied
  res.status(200).json({ ready: true });
});

app.get("/live", (req, res) => {
  // Is the process alive?
  // Simple: if this responds, the process is alive
  res.status(200).json({ alive: true });
});

// 6. Error tracking (Sentry example)
// const Sentry = require("@sentry/node");
// Sentry.init({ dsn: process.env.SENTRY_DSN, environment: process.env.NODE_ENV });
// app.use(Sentry.Handlers.requestHandler()); // before routes
// app.use(Sentry.Handlers.errorHandler());   // after routes

app.listen(3000, () => console.log("Monitored server on port 3000"));
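For reference, a minimal Prometheus scrape configuration that would pull the /metrics endpoint above. The job name and target are placeholders for your deployment; `metrics_path` is omitted because it defaults to `/metrics`:

```yaml
scrape_configs:
  - job_name: "node-app"            # arbitrary label for this service
    scrape_interval: 15s            # how often Prometheus pulls /metrics
    static_configs:
      - targets: ["localhost:3000"] # host:port where the Express app listens
```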

🏋️ Practice Exercise

Exercises:

  1. Set up Prometheus metrics collection with custom histograms and counters for HTTP requests
  2. Create Grafana dashboards for request rate, error rate, response time (p95), and memory usage
  3. Implement health check, readiness, and liveness probe endpoints
  4. Integrate Sentry for automatic error tracking with source maps
  5. Set up alerting rules: alert when error rate > 5% or p95 response time > 2s
  6. Implement distributed tracing with OpenTelemetry across two microservices

⚠️ Common Mistakes

  • Not monitoring at all — you find out about outages from customers, not your monitoring system

  • Monitoring only server metrics (CPU, memory) without application metrics — you need request rates, error rates, and response times

  • Not distinguishing health/ready/live endpoints — Kubernetes uses them differently: a failing liveness probe restarts the container, while a failing readiness probe only stops routing traffic to it

  • Creating too many alerts — alert fatigue means real alerts get ignored; alert only on actionable, customer-impacting issues

  • Not correlating logs, metrics, and traces — without correlation IDs linking all three, debugging distributed issues is nearly impossible
