Monitoring & Observability
📖 Concept
Monitoring tells you WHEN something is wrong. Observability tells you WHY. Together, they ensure your production Node.js application is reliable and performant.
Three pillars of observability:
- Logs — Timestamped records of events (Winston, Pino)
- Metrics — Numerical measurements over time (response times, error rates)
- Traces — Request flow across services (OpenTelemetry)
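What makes logs usable as a pillar is structure: each entry is one JSON line that an aggregator can index by field. Here is a minimal, dependency-free sketch of the idea behind loggers like Pino (the `child()` pattern mirrors Pino's, but the implementation is illustrative, not Pino's API):

```javascript
// Minimal structured logger sketch: each entry is one JSON line,
// so a log aggregator (Loki, ELK) can index fields like "level" and "requestId".
function createLogger(baseFields = {}) {
  const emit = (level, fields, msg) => {
    const entry = { level, time: Date.now(), ...baseFields, ...fields, msg };
    console.log(JSON.stringify(entry));
    return entry; // returned here only to make the sketch easy to test
  };
  return {
    info: (fields, msg) => emit("info", fields, msg),
    error: (fields, msg) => emit("error", fields, msg),
    // Child loggers inherit fields — the same idea as Pino's logger.child()
    child: (fields) => createLogger({ ...baseFields, ...fields }),
  };
}

const log = createLogger({ service: "api" });
const reqLog = log.child({ requestId: "req-123" });
reqLog.info({ userId: 42 }, "user logged in");
// prints one JSON line, e.g.
// {"level":"info","time":1700000000000,"service":"api","requestId":"req-123","userId":42,"msg":"user logged in"}
```

The payoff comes later: because `service`, `requestId`, and `userId` are fields rather than words inside a string, you can filter and aggregate on them without regex parsing.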
Key metrics to monitor:
| Category | Metrics |
|---|---|
| Application | Response time (p50, p95, p99), error rate, request rate |
| System | CPU usage, memory (RSS, heap), event loop lag |
| Database | Query time, connection pool utilization, slow queries |
| Business | Signups, orders, payments — depends on your domain |
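The p50/p95/p99 figures in the table are percentiles over a window of response-time samples. A quick sketch of the nearest-rank computation (real monitoring systems usually approximate this with histogram buckets rather than sorting raw samples, which is why the Prometheus histogram below declares explicit buckets):

```javascript
// Nearest-rank percentile over a window of response times (ms).
function percentile(samples, p) {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  // p = 95 → the value at or below which 95% of samples fall
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

const latencies = [12, 15, 20, 22, 25, 30, 45, 80, 120, 900];
console.log(percentile(latencies, 50)); // 25
console.log(percentile(latencies, 95)); // 900 — one slow outlier dominates the tail
```

This is also why p95/p99 matter more than averages: the mean of the sample above is about 127 ms, which hides the fact that some users waited nearly a second.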
Monitoring stack options:
| Component | Options |
|---|---|
| Metrics collection | Prometheus, StatsD, Datadog Agent |
| Metrics visualization | Grafana, Datadog, New Relic |
| Log aggregation | ELK Stack, Loki + Grafana, Datadog Logs |
| Error tracking | Sentry, Bugsnag, Rollbar |
| APM (traces) | Datadog APM, New Relic, Elastic APM |
| Uptime monitoring | Pingdom, UptimeRobot, Better Uptime |
Alerting best practices:
- Alert on symptoms (high error rate), not causes (high CPU)
- Use severity levels — critical (pager), warning (Slack), info (dashboard)
- Prevent alert fatigue — too many alerts = they all get ignored
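As a concrete sketch, symptom-based rules like these can be expressed in Prometheus alerting-rule syntax (thresholds, group names, and the 5m windows are illustrative; the metric names match the counters and histogram defined in the code example below):

```yaml
groups:
  - name: api-alerts
    rules:
      # Symptom: more than 5% of requests failing over the last 5 minutes
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
      # Symptom: p95 latency above 2 seconds
      - alert: SlowResponses
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 2
        for: 10m
        labels:
          severity: warning
```

Note that both rules fire on what users experience (errors, slowness), not on CPU or memory, and the `for:` duration keeps brief blips from paging anyone.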
🏠 Real-world analogy: Monitoring is like a car dashboard — it shows speed (throughput), fuel level (memory), engine temperature (CPU). Observability is like a mechanic's diagnostic tool — it tells you exactly which sensor, cable, or component is failing and why.
💻 Code Example
```javascript
// Monitoring & Observability Setup

const express = require("express");
const client = require("prom-client"); // Prometheus client

const app = express();

// 1. Prometheus metrics setup
const register = new client.Registry();
client.collectDefaultMetrics({ register }); // CPU, memory, event loop

// Custom metrics
const httpRequestDuration = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "Duration of HTTP requests in seconds",
  labelNames: ["method", "route", "status_code"],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5],
});
register.registerMetric(httpRequestDuration);

const httpRequestTotal = new client.Counter({
  name: "http_requests_total",
  help: "Total number of HTTP requests",
  labelNames: ["method", "route", "status_code"],
});
register.registerMetric(httpRequestTotal);

const activeConnections = new client.Gauge({
  name: "active_connections",
  help: "Number of active connections",
});
register.registerMetric(activeConnections);

// 2. Metrics middleware
app.use((req, res, next) => {
  activeConnections.inc();
  const end = httpRequestDuration.startTimer();

  res.on("finish", () => {
    const route = req.route?.path || req.path;
    const labels = { method: req.method, route, status_code: res.statusCode };
    end(labels);
    httpRequestTotal.inc(labels);
    activeConnections.dec();
  });

  next();
});

// 3. Metrics endpoint (Prometheus scrapes this)
app.get("/metrics", async (req, res) => {
  res.set("Content-Type", register.contentType);
  res.end(await register.metrics());
});

// 4. Health check endpoint
app.get("/health", async (req, res) => {
  const checks = {
    uptime: process.uptime(),
    memory: process.memoryUsage(),
    timestamp: Date.now(),
  };

  // Check database
  try {
    // await db.query("SELECT 1");
    checks.database = "healthy";
  } catch (err) {
    checks.database = "unhealthy";
  }

  // Check Redis
  try {
    // await redis.ping();
    checks.cache = "healthy";
  } catch (err) {
    checks.cache = "unhealthy";
  }

  const isHealthy = checks.database === "healthy";
  res.status(isHealthy ? 200 : 503).json(checks);
});

// 5. Readiness vs Liveness probes (Kubernetes)
app.get("/ready", (req, res) => {
  // Ready to accept traffic?
  // Check: DB connected, cache connected, migrations applied
  res.status(200).json({ ready: true });
});

app.get("/live", (req, res) => {
  // Is the process alive?
  // Simple: if this responds, the process is alive
  res.status(200).json({ alive: true });
});

// 6. Error tracking (Sentry example)
// const Sentry = require("@sentry/node");
// Sentry.init({ dsn: process.env.SENTRY_DSN, environment: process.env.NODE_ENV });
// app.use(Sentry.Handlers.requestHandler());
// app.use(Sentry.Handlers.errorHandler());

app.listen(3000, () => console.log("Monitored server on port 3000"));
```
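The `/metrics` endpoint above is pull-based: Prometheus scrapes it on a schedule rather than the app pushing data out. A minimal scrape config might look like this (job name, target host, and interval are illustrative):

```yaml
# prometheus.yml — minimal scrape config for the app above
scrape_configs:
  - job_name: "node-api"
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets: ["localhost:3000"]
```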
🏋️ Practice Exercise
Exercises:
- Set up Prometheus metrics collection with custom histograms and counters for HTTP requests
- Create Grafana dashboards for request rate, error rate, response time (p95), and memory usage
- Implement health check, readiness, and liveness probe endpoints
- Integrate Sentry for automatic error tracking with source maps
- Set up alerting rules: alert when error rate > 5% or p95 response time > 2s
- Implement distributed tracing with OpenTelemetry across two microservices
⚠️ Common Mistakes
- Not monitoring at all — you find out about outages from customers, not your monitoring system
- Monitoring only server metrics (CPU, memory) without application metrics — you need request rates, error rates, and response times
- Not distinguishing health/ready/live endpoints — Kubernetes uses them differently: liveness = should restart?, readiness = should receive traffic?
- Creating too many alerts — alert fatigue means real alerts get ignored; alert only on actionable, customer-impacting issues
- Not correlating logs, metrics, and traces — without correlation IDs linking all three, debugging distributed issues is nearly impossible