Monitoring & Observability

📖 Concept

Monitoring tells you WHEN something is wrong. Observability tells you WHY. Together, they ensure your production Node.js application is reliable and performant.

Three pillars of observability:

  1. Logs — Timestamped records of events (Winston, Pino)
  2. Metrics — Numerical measurements over time (response times, error rates)
  3. Traces — Request flow across services (OpenTelemetry)
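Logs are most useful when they are structured. Here is a minimal sketch of what structured loggers like Pino produce under the hood; the `service` name and event fields below are made up for illustration:

```javascript
// Structured logging sketch: each log line is one JSON object, so log
// aggregators (ELK, Loki) can index fields instead of parsing free text.
// Real loggers like Pino do this far faster, with levels and transports.
function makeLogger(base = {}) {
  return (level, msg, fields = {}) => {
    const line = JSON.stringify({
      level,
      time: new Date().toISOString(),
      msg,
      ...base,   // fields attached to every line (service name, version, ...)
      ...fields, // per-event fields
    });
    console.log(line);
    return line;
  };
}

const log = makeLogger({ service: "checkout" }); // hypothetical service name
log("info", "order created", { orderId: 42, durationMs: 87 });
```

Because every line is machine-parseable JSON, you can later filter by `orderId` or aggregate `durationMs` instead of grepping free-form strings.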

Key metrics to monitor:

  • Application: response time (p50, p95, p99), error rate, request rate
  • System: CPU usage, memory (RSS, heap), event loop lag
  • Database: query time, connection pool utilization, slow queries
  • Business: signups, orders, payments (depends on your domain)
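To see why p95 and p99 matter alongside averages, here is a simplified nearest-rank percentile computation over raw latency samples. The sample numbers are fabricated; real systems (including Prometheus histograms) approximate percentiles with buckets rather than storing every sample:

```javascript
// Nearest-rank percentile over raw samples (simplified; production systems
// use histogram buckets or sketches to avoid keeping all samples in memory).
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, idx)];
}

const latenciesMs = [12, 15, 18, 22, 30, 45, 80, 120, 400, 950]; // fabricated

console.log("p50:", percentile(latenciesMs, 50)); // 30, the typical request
console.log("p95:", percentile(latenciesMs, 95)); // 950, the slow tail a mean would hide
```

The mean of these samples is about 169 ms, which describes no actual request; p50 and p95 describe the typical case and the tail your slowest users actually experience.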

Monitoring stack options:

  • Metrics collection: Prometheus, StatsD, Datadog Agent
  • Metrics visualization: Grafana, Datadog, New Relic
  • Log aggregation: ELK Stack, Loki + Grafana, Datadog Logs
  • Error tracking: Sentry, Bugsnag, Rollbar
  • APM (traces): Datadog APM, New Relic, Elastic APM
  • Uptime monitoring: Pingdom, UptimeRobot, Better Uptime

Alerting best practices:

  • Alert on symptoms (high error rate), not causes (high CPU)
  • Use severity levels — critical (pager), warning (Slack), info (dashboard)
  • Prevent alert fatigue — too many alerts = they all get ignored
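The severity tiers above can be sketched as a simple evaluation rule. The 5% and 1% thresholds here are purely illustrative, not recommendations; tune them to your own traffic and error budget:

```javascript
// Symptom-based alerting sketch: derive severity from the user-visible
// error rate over a window, not from raw CPU. Thresholds are illustrative.
function evaluateAlert({ requests, errors }) {
  const errorRate = requests === 0 ? 0 : errors / requests;
  if (errorRate > 0.05) return { severity: "critical", channel: "pager" };
  if (errorRate > 0.01) return { severity: "warning", channel: "slack" };
  return { severity: "info", channel: "dashboard" };
}

console.log(evaluateAlert({ requests: 1000, errors: 80 })); // critical, pages someone
```

Routing only critical alerts to a pager and warnings to chat is one concrete way to keep alert volume low enough that pages still mean something.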

🏠 Real-world analogy: Monitoring is like a car dashboard — it shows speed (throughput), fuel level (memory), engine temperature (CPU). Observability is like a mechanic's diagnostic tool — it tells you exactly which sensor, cable, or component is failing and why.

💻 Code Example

// Monitoring & Observability Setup

const express = require("express");
const client = require("prom-client"); // Prometheus client

const app = express();

// 1. Prometheus metrics setup
const register = new client.Registry();
client.collectDefaultMetrics({ register }); // CPU, memory, event loop

// Custom metrics
const httpRequestDuration = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "Duration of HTTP requests in seconds",
  labelNames: ["method", "route", "status_code"],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5],
});
register.registerMetric(httpRequestDuration);

const httpRequestTotal = new client.Counter({
  name: "http_requests_total",
  help: "Total number of HTTP requests",
  labelNames: ["method", "route", "status_code"],
});
register.registerMetric(httpRequestTotal);

const activeConnections = new client.Gauge({
  name: "active_connections",
  help: "Number of active connections",
});
register.registerMetric(activeConnections);

// 2. Metrics middleware
app.use((req, res, next) => {
  activeConnections.inc();
  const end = httpRequestDuration.startTimer();

  res.on("finish", () => {
    const route = req.route?.path || req.path;
    const labels = { method: req.method, route, status_code: res.statusCode };
    end(labels);
    httpRequestTotal.inc(labels);
    activeConnections.dec();
  });

  next();
});

// 3. Metrics endpoint (Prometheus scrapes this)
app.get("/metrics", async (req, res) => {
  res.set("Content-Type", register.contentType);
  res.end(await register.metrics());
});

// 4. Health check endpoint
app.get("/health", async (req, res) => {
  const checks = {
    uptime: process.uptime(),
    memory: process.memoryUsage(),
    timestamp: Date.now(),
  };

  // Check database
  try {
    // await db.query("SELECT 1");
    checks.database = "healthy";
  } catch (err) {
    checks.database = "unhealthy";
  }

  // Check Redis
  try {
    // await redis.ping();
    checks.cache = "healthy";
  } catch (err) {
    checks.cache = "unhealthy";
  }

  const isHealthy = checks.database === "healthy";
  res.status(isHealthy ? 200 : 503).json(checks);
});

// 5. Readiness vs. liveness probes (Kubernetes)
app.get("/ready", (req, res) => {
  // Ready to accept traffic?
  // Check: DB connected, cache connected, migrations applied
  res.status(200).json({ ready: true });
});

app.get("/live", (req, res) => {
  // Is the process alive?
  // Simple: if this responds, the process is alive
  res.status(200).json({ alive: true });
});

// 6. Error tracking (Sentry example)
// const Sentry = require("@sentry/node");
// Sentry.init({ dsn: process.env.SENTRY_DSN, environment: process.env.NODE_ENV });
// app.use(Sentry.Handlers.requestHandler()); // before routes
// app.use(Sentry.Handlers.errorHandler());   // after routes

app.listen(3000, () => console.log("Monitored server on port 3000"));
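For reference, a minimal Prometheus scrape configuration that would pull the /metrics endpoint above. The job name and target are placeholders for your deployment; `metrics_path` is omitted because it defaults to `/metrics`:

```yaml
scrape_configs:
  - job_name: "node-app"            # arbitrary label for this service
    scrape_interval: 15s            # how often Prometheus pulls /metrics
    static_configs:
      - targets: ["localhost:3000"] # host:port where the Express app listens
```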

🏋️ Practice Exercise

Exercises:

  1. Set up Prometheus metrics collection with custom histograms and counters for HTTP requests
  2. Create Grafana dashboards for request rate, error rate, response time (p95), and memory usage
  3. Implement health check, readiness, and liveness probe endpoints
  4. Integrate Sentry for automatic error tracking with source maps
  5. Set up alerting rules: alert when error rate > 5% or p95 response time > 2s
  6. Implement distributed tracing with OpenTelemetry across two microservices

⚠️ Common Mistakes

  • Not monitoring at all — you find out about outages from customers, not your monitoring system

  • Monitoring only server metrics (CPU, memory) without application metrics — you need request rates, error rates, and response times

  • Not distinguishing health/ready/live endpoints — Kubernetes uses them differently: a failing liveness probe restarts the container, while a failing readiness probe only stops routing traffic to it

  • Creating too many alerts — alert fatigue means real alerts get ignored; alert only on actionable, customer-impacting issues

  • Not correlating logs, metrics, and traces — without correlation IDs linking all three, debugging distributed issues is nearly impossible
