MervCodes

Tech Reviews From A Programmer

Monitoring Node.js in Production: Metrics, Logs, Alerts

A Node.js service that passes every test in CI can still fall over in production. Memory creeps up overnight, the event loop stalls under load, a downstream API starts timing out, and the first you hear of it is an angry customer. Monitoring is how you find out before the customer does. The three pillars — metrics, logs, and alerts — each answer a different question: Is something wrong? What exactly happened? Who needs to wake up? This guide walks through setting up all three for a real Node.js service.

Why Node.js Needs Special Attention

Node's single-threaded, event-loop-driven model has a few failure modes that don't show up in multi-threaded runtimes. A single synchronous CPU-bound operation can block every concurrent request. An unhandled promise rejection crashes the process by default on Node 15 and later, and on older versions it fails a request path with only a warning. Memory leaks accumulate because long-lived processes never restart on their own. Generic infrastructure monitoring (CPU, RAM, disk) won't catch these; you need runtime-aware instrumentation.

The good news: the Node ecosystem has mature, low-overhead tooling for all of this. You don't need to build it yourself.

Metrics: The Numbers That Tell You Something Is Wrong

Metrics are cheap, aggregatable time-series numbers. They're what you graph on dashboards and what you alert on. The de facto standard is Prometheus for collection and Grafana for visualization, with the prom-client library exposing metrics from your app.

const client = require('prom-client');
const express = require('express');

const app = express();
const register = new client.Registry();

// Collect default Node.js metrics: event loop lag, heap, GC, handles
client.collectDefaultMetrics({ register });

// A custom histogram for HTTP request duration
const httpDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.3, 0.5, 1, 2, 5],
});
register.registerMetric(httpDuration);

app.use((req, res, next) => {
  const end = httpDuration.startTimer();
  res.on('finish', () => {
    end({ method: req.method, route: req.route?.path ?? 'unknown', status_code: res.statusCode });
  });
  next();
});

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

collectDefaultMetrics gives you the Node-specific signals that matter most:

  • Event loop lag (nodejs_eventloop_lag_seconds): the single most important Node health metric. If lag climbs above a few hundred milliseconds, your process is blocked and requests are queuing.
  • Heap usage (nodejs_heap_size_used_bytes): a baseline that keeps climbing between restarts points to a memory leak.
  • Active handles and requests: leaking sockets or file descriptors show up here.
  • Garbage collection duration: long GC pauses correlate with latency spikes.
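
To get a feel for what the event loop lag signal measures, here is a minimal sketch using Node's built-in perf_hooks module rather than prom-client. The helper names (startLagMonitor, lagSummary) and the 20 ms sampling resolution are assumptions made for illustration.

```javascript
const { monitorEventLoopDelay } = require('node:perf_hooks');

// The histogram records delays in nanoseconds; convert for readability.
const nsToMs = (ns) => ns / 1e6;

// Start sampling event loop delay. resolutionMs is the sampling interval.
function startLagMonitor(resolutionMs = 20) {
  const histogram = monitorEventLoopDelay({ resolution: resolutionMs });
  histogram.enable();
  return histogram;
}

// Summarize the delays recorded so far, in milliseconds.
function lagSummary(histogram) {
  return {
    p50Ms: nsToMs(histogram.percentile(50)),
    p99Ms: nsToMs(histogram.percentile(99)),
    maxMs: nsToMs(histogram.max),
  };
}

// Usage sketch (the 10-second reporting interval is an arbitrary choice):
// const lag = startLagMonitor();
// setInterval(() => {
//   console.log(lagSummary(lag));
//   lag.reset();
// }, 10_000).unref();
```

Resetting the histogram after each report keeps every summary scoped to its own interval, so one old stall doesn't dominate the maximum forever.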

On top of the defaults, instrument the four golden signals for each endpoint: latency, traffic, errors, and saturation. The histogram above covers latency and, via the status_code label, errors. Add a counter for total requests (or use the histogram's own _count series) and you have traffic. The default metrics, event loop lag and heap usage in particular, cover saturation. Track these per route, not just globally: a p99 latency computed across all routes hides the one slow endpoint that's hurting users.

A practical tip: prefer histograms over gauges for latency so you can compute real percentiles (p50, p95, p99). Averages lie. A 50ms average can hide a p99 of 3 seconds.
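
A quick way to convince yourself that averages lie is to compute both on the same data. The sketch below uses made-up latencies (98 fast requests and 2 very slow ones) and a simple nearest-rank percentile. It works on raw samples, so it illustrates the idea; it is not how Prometheus estimates quantiles from histogram buckets.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

function mean(samples) {
  return samples.reduce((sum, v) => sum + v, 0) / samples.length;
}

// Hypothetical latencies in milliseconds: 98 fast requests, 2 very slow ones.
const latenciesMs = [...Array(98).fill(20), 3000, 3000];

const summary = {
  mean: mean(latenciesMs),
  p50: percentile(latenciesMs, 50),
  p99: percentile(latenciesMs, 99),
};

console.log(summary); // { mean: 79.6, p50: 20, p99: 3000 }
```

The mean looks healthy at under 80 ms, while the p99 shows that one request in fifty takes three seconds.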

Logs: The Detail When You Need to Investigate

Metrics tell you that error rates spiked at 2:14 AM. Logs tell you why. The cardinal rule in production: log structured JSON, not strings. Structured logs are queryable; string logs are grep-and-pray.

Use Pino. It's one of the fastest Node loggers and outputs JSON by default with minimal overhead.

const pino = require('pino');

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  redact: ['req.headers.authorization', 'password', '*.creditCard'],
  formatters: {
    level: (label) => ({ level: label }),
  },
});

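logger.info({ userId: 42, route: '/checkout', durationMs: 128 }, 'request completed');

// In real code, err comes from a catch block or a rejected promise
const err = new Error('upstream timed out after 5000ms');
logger.error({ err, orderId: 'ord_123' }, 'payment provider timeout');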

Principles that pay off later:

  • Use log levels deliberately. error for things that need human attention, warn for recoverable degradation, info for business events, debug for development. Run production at info and turn on debug selectively.
  • Attach a correlation/request ID to every log line in a request's lifecycle. With async context tracking (AsyncLocalStorage), you can thread a single request ID through every function without passing it as an argument. When debugging, you filter on one ID and see the entire request story.
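  • Never log secrets. Use Pino's redact option to strip tokens, passwords, and PII at the logger level so a careless logger.info({ req }) doesn't leak credentials. Redaction only applies to the paths you list, so logging the headers object on its own would bypass a path like req.headers.authorization.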
  • Don't log in hot loops. Logging is I/O. A debug log inside a per-item loop processing 10,000 items will dominate your latency.
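
The request ID principle above can be sketched with Node's built-in AsyncLocalStorage. Everything named here is an assumption for illustration: the x-request-id header, the helper names, and the hand-rolled formatLog function, which stands in for a real logger such as Pino.

```javascript
const { AsyncLocalStorage } = require('node:async_hooks');
const { randomUUID } = require('node:crypto');

const requestContext = new AsyncLocalStorage();

// Run fn with a request ID that any code it calls (sync or async) can read.
function runWithRequestId(requestId, fn) {
  return requestContext.run({ requestId }, fn);
}

// Build one structured JSON log line, adding the current request ID if any.
function formatLog(level, msg, fields = {}) {
  const store = requestContext.getStore();
  return JSON.stringify({
    level,
    time: new Date().toISOString(),
    requestId: store?.requestId,
    ...fields,
    msg,
  });
}

// Express-style middleware: reuse an incoming ID or generate a new one.
function requestIdMiddleware(req, res, next) {
  const requestId = req.headers['x-request-id'] ?? randomUUID();
  runWithRequestId(requestId, next);
}

// A function deep in the call stack. It never receives the ID as an argument,
// yet the log line it builds after the await still carries it.
async function chargeCard(orderId) {
  await new Promise((resolve) => setImmediate(resolve));
  return formatLog('info', 'card charged', { orderId });
}
```

With a real logger, the same store lookup would typically live in one place (for example a mixin or child logger) so individual call sites don't have to think about it.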

Ship logs off the box. A logging pipeline — Loki, the ELK/OpenSearch stack, or a hosted service like Datadog — lets you search across instances and retain history after a container is recycled. Pino pairs naturally with transports that forward to these without blocking your event loop.

Alerts: Getting Woken Up Only When It Matters

Dashboards are useless at 3 AM because nobody is looking at them. Alerts close the loop by pushing a notification when a metric crosses a threshold. The hard part isn't wiring up Alertmanager or PagerDuty — it's deciding what to alert on without drowning in noise.

Alert on symptoms, not causes. Your users don't care that CPU is high; they care that checkout is failing. Alert on the user-facing symptoms — elevated error rate, latency above SLO, the health check failing — and use metrics and logs to diagnose the cause once you're paged. A good starting set:

  • Error rate: rate(http_requests_total{status_code=~"5.."}[5m]) exceeds a meaningful fraction of traffic for a sustained window.
  • Latency SLO breach: p99 latency above your target (e.g. 1s) for a sustained window.
  • Event loop lag: sustained lag above 200ms means the process is unhealthy.
  • Heap approaching limit: heap usage nearing the --max-old-space-size ceiling signals serious memory pressure and, if the trend continues, an imminent out-of-memory crash.
  • Crash looping / restart rate: the process restarting repeatedly.
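
For the heap alert, one way to get the "how close to the ceiling" number is Node's built-in v8 module, which reports both current usage and the configured limit. This is a minimal sketch; the function names and the 0.85 threshold are illustrative assumptions, and you would export the ratio through whatever metrics library you already use.

```javascript
const v8 = require('node:v8');

// Fraction of the V8 heap limit currently in use (between 0 and 1).
function heapUsageRatio(stats = v8.getHeapStatistics()) {
  return stats.used_heap_size / stats.heap_size_limit;
}

// The 0.85 threshold is an illustrative assumption, not a recommendation.
function isHeapNearLimit(stats = v8.getHeapStatistics(), threshold = 0.85) {
  return heapUsageRatio(stats) >= threshold;
}

console.log(`heap in use: ${(heapUsageRatio() * 100).toFixed(1)}% of limit`);
```

A ratio is easier to alert on than raw bytes because the same threshold keeps working when someone changes --max-old-space-size.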

Two rules keep alerts trustworthy:

  1. Require a duration. Alert on "above threshold for 5 minutes," not on a single scrape. Transient spikes are normal; sustained ones are incidents. This single change eliminates most false pages.
  2. Every alert must be actionable. If a page fires and the on-call's only move is to acknowledge and go back to sleep, delete that alert. Alert fatigue is real — once people start ignoring pages, the monitoring system has failed regardless of how comprehensive it is.
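
To make the "require a duration" rule concrete, here is a small sketch of the logic: fire only when the metric has stayed above the threshold for the whole window. It is a simplified illustration, not Prometheus's or Alertmanager's implementation, and the sample shape and function names are assumptions.

```javascript
// Samples are { t: epochMillis, value: number }, ordered oldest first.
// Returns the timestamp at which the current unbroken run above the
// threshold began, or null if the latest sample is not above it.
function breachStart(samples, threshold) {
  let start = null;
  for (let i = samples.length - 1; i >= 0; i--) {
    if (samples[i].value <= threshold) break;
    start = samples[i].t;
  }
  return start;
}

// Fire only if the breach has lasted at least forMs.
function shouldFire(samples, { threshold, forMs }) {
  const start = breachStart(samples, threshold);
  if (start === null) return false;
  const latest = samples[samples.length - 1].t;
  return latest - start >= forMs;
}

const minute = 60_000;
const rule = { threshold: 0.05, forMs: 5 * minute };

// Hypothetical error-rate samples, one per minute.
const spike = [0.01, 0.01, 0.2].map((value, i) => ({ t: i * minute, value }));
const sustained = [0.01, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1].map((value, i) => ({ t: i * minute, value }));

console.log(shouldFire(spike, rule));     // false
console.log(shouldFire(sustained, rule)); // true
```

The single spike never pages anyone, while five continuous minutes above the threshold does.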

Tier your severities: a page (wakes someone up) for user-facing outages, a ticket or Slack message for slow-burning issues like a gradual memory leak you can fix during business hours.

Health Checks and Graceful Shutdown

Two cheap additions make the rest of your monitoring work. Expose a /health (liveness) and /ready (readiness) endpoint. Liveness answers "is the process alive?"; readiness answers "can it serve traffic?" — readiness should check downstream dependencies (database, cache) so a load balancer stops sending traffic to an instance that can't fulfill it.
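
A minimal sketch of the two endpoints using only Node's http module. It assumes dependency health is tracked in a small state object that background checks keep up to date (those checks are not shown); the state fields and function names are hypothetical.

```javascript
const http = require('node:http');

// Cached dependency state. In a real service, periodic background checks
// against the database and cache would update these flags.
const state = { databaseOk: true, cacheOk: true, shuttingDown: false };

// Decide the response for a probe path. Returns null for non-probe paths.
function probe(pathname, s = state) {
  if (pathname === '/health') {
    // Liveness: the process is up and able to answer at all.
    return { status: 200, body: { status: 'alive' } };
  }
  if (pathname === '/ready') {
    // Readiness: only accept traffic when dependencies are reachable.
    const failing = [];
    if (!s.databaseOk) failing.push('database');
    if (!s.cacheOk) failing.push('cache');
    const ready = failing.length === 0 && !s.shuttingDown;
    return { status: ready ? 200 : 503, body: { ready, failing } };
  }
  return null;
}

function createProbeServer() {
  return http.createServer((req, res) => {
    const { pathname } = new URL(req.url, 'http://localhost');
    const result = probe(pathname);
    if (!result) {
      res.writeHead(404).end();
      return;
    }
    res.writeHead(result.status, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify(result.body));
  });
}
```

Reading cached state keeps the probe itself fast and cheap, so a slow dependency can't make the readiness endpoint time out.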

Handle SIGTERM for graceful shutdown so in-flight requests complete before the process exits during a deploy. Without it, every rolling deploy drops a handful of requests and pollutes your error metrics with noise that masks real problems.
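
A sketch of what that SIGTERM handler might look like. The factory name, the 10-second timeout, and the injectable exit function are assumptions for illustration; pick a timeout shorter than whatever grace period your orchestrator allows.

```javascript
// Build a shutdown handler for an http.Server-like object.
// exit is injectable so the handler can be exercised without ending the process.
function createGracefulShutdown(server, { timeoutMs = 10_000, exit = process.exit } = {}) {
  let shuttingDown = false;

  return function shutdown() {
    if (shuttingDown) return; // ignore repeated signals
    shuttingDown = true;

    // Safety net: force a non-zero exit if connections never drain.
    const forceTimer = setTimeout(() => exit(1), timeoutMs);
    forceTimer.unref();

    // Stop accepting new connections; the callback runs once existing ones end.
    server.close((err) => {
      clearTimeout(forceTimer);
      exit(err ? 1 : 0);
    });
  };
}

// Wiring sketch (the server and port are assumptions):
// const server = app.listen(3000);
// process.on('SIGTERM', createGracefulShutdown(server));
```

The forced-exit timer matters because a single stuck connection would otherwise keep the old process alive until the orchestrator kills it.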

Putting It Together

A solid baseline stack: prom-client exposing metrics scraped by Prometheus, visualized in Grafana; Pino emitting structured JSON shipped to Loki or OpenSearch; and Alertmanager (or your APM vendor) firing symptom-based alerts to PagerDuty or Slack. If you'd rather not run the infrastructure, a hosted APM like Datadog, New Relic, or Grafana Cloud bundles all three with OpenTelemetry auto-instrumentation — a reasonable choice for small teams.

Start small. Add the four golden signals and event-loop-lag metric, structured logging with request IDs, and three or four symptom-based alerts. That covers the vast majority of real incidents. Expand from there as you learn which questions you keep wishing you could answer.

FAQ

How much overhead does monitoring add? Metrics collection with prom-client is negligible — sub-millisecond per request. Pino is among the fastest loggers and adds little at info level. The real cost is logging too verbosely (debug logs in hot paths) or computing high-cardinality metrics. Keep label values bounded — never use user IDs or raw URLs as label values, or you'll create millions of unique time series and overwhelm Prometheus.
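
When a matched route template is available (as with req.route.path in the metrics example), it is already a bounded label. Where only a raw URL is available, one option is to collapse the variable parts before using it as a label. A minimal sketch, with the function name and placeholder strings as assumptions:

```javascript
const UUID = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
const NUMERIC = /^\d+$/;

// Turn a raw URL into a bounded label by dropping the query string and
// replacing ID-like path segments with placeholders.
function routeLabel(rawUrl) {
  const pathname = rawUrl.split('?')[0];
  const normalized = pathname
    .split('/')
    .map((segment) => {
      if (NUMERIC.test(segment)) return ':id';
      if (UUID.test(segment)) return ':uuid';
      return segment;
    })
    .join('/');
  return normalized || '/';
}

console.log(routeLabel('/users/42/orders/7?page=2')); // /users/:id/orders/:id
```

This only catches numeric and UUID segments; slugs or other free-form values would need their own rules, or a fallback label such as 'unknown'.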

Metrics or logs — which should I add first? Metrics. They tell you something is wrong cheaply and are what you alert on. Logs are for the investigation after a metric or alert points you at a problem. Most teams get the best early return from the four golden signals plus event-loop lag.

What's the difference between liveness and readiness probes? Liveness checks whether the process should be restarted (it's hung or dead). Readiness checks whether it should receive traffic right now (dependencies are reachable). Conflating them causes restart loops: if the liveness check also tests the database, a brief database blip makes the orchestrator kill an otherwise-healthy process.

Do I need distributed tracing too? Tracing is the natural next step once metrics and logs are in place, especially in microservices where one request hops across many services. OpenTelemetry provides Node auto-instrumentation that captures traces, metrics, and logs together. For a single service, structured logs with correlation IDs get you most of the way; add tracing when you're debugging across service boundaries.

How do I catch memory leaks? Graph nodejs_heap_size_used_bytes over days, not minutes. A leak shows as a sawtooth that trends upward: GC reclaims some memory, but the baseline after each collection keeps climbing until the process restarts. When you spot one, capture a heap snapshot in production and diff snapshots taken an hour apart to find what's retained.
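
Capturing a snapshot can be done with Node's built-in v8.writeHeapSnapshot. The sketch below is one possible setup; the file naming scheme, the /tmp directory, and the SIGUSR2 trigger are assumptions. Writing a snapshot is synchronous and blocks the event loop while it runs, and it needs extra memory, so it's safer to take the instance out of rotation first.

```javascript
const v8 = require('node:v8');
const path = require('node:path');

// Build a timestamped file name so successive snapshots are easy to diff.
function snapshotPath(dir, now = new Date(), pid = process.pid) {
  const stamp = now.toISOString().replace(/[:.]/g, '-');
  return path.join(dir, `heap-${pid}-${stamp}.heapsnapshot`);
}

// Write a snapshot and return the file name. Blocks until the write finishes.
function captureHeapSnapshot(dir) {
  return v8.writeHeapSnapshot(snapshotPath(dir));
}

// Wiring sketch (signal choice and directory are assumptions):
// process.on('SIGUSR2', () => {
//   console.log('heap snapshot written to', captureHeapSnapshot('/tmp'));
// });
```

The resulting .heapsnapshot files can be loaded into Chrome DevTools' Memory panel, which can compare two snapshots.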

Should I alert on CPU and memory? Alert on them as secondary signals, not primary ones. Page on user-facing symptoms (errors, latency); use resource saturation alerts as lower-severity warnings that help you catch problems before they become outages. Heap nearing the V8 limit is worth an alert because it predicts a crash; raw CPU usually isn't.

