sync non-blocking evaluation, and standardized HTTP semantics.
Step 1: Define Probe Semantics
- Liveness: Is the process alive? If false, restart the container. Checks for deadlocks, uncaught exceptions, or memory exhaustion.
- Readiness: Can the instance serve traffic? If false, remove from load balancer. Checks database connectivity, cache warm-up, and downstream API availability.
- Startup: Has the application finished initialization? If false, delay readiness evaluation. Prevents premature traffic routing during boot.
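The three probe semantics above can be summarized as a lookup from probe kind to the orchestrator's reaction on failure. This is only an illustrative sketch: `ProbeKind`, `OrchestratorAction`, `actionOnFailure`, and `reactToProbe` are hypothetical names, not part of any orchestrator API.

```typescript
// Hypothetical names used for illustration only.
type ProbeKind = 'liveness' | 'readiness' | 'startup';
type OrchestratorAction =
  | 'restart-container'
  | 'remove-from-load-balancer'
  | 'delay-readiness-evaluation';

// What the orchestrator does when each probe reports false.
const actionOnFailure: Record<ProbeKind, OrchestratorAction> = {
  liveness: 'restart-container',
  readiness: 'remove-from-load-balancer',
  startup: 'delay-readiness-evaluation',
};

// A passing probe triggers nothing; a failing one triggers the mapped action.
function reactToProbe(kind: ProbeKind, passed: boolean): OrchestratorAction | 'none' {
  return passed ? 'none' : actionOnFailure[kind];
}
```

Keeping the mapping explicit makes it harder to wire a dependency check into the liveness probe by accident.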
Step 2: Build Async Health Evaluators with Timeout Control
Blocking the main thread during health checks causes cascading latency spikes. All dependency checks must run asynchronously with strict timeout boundaries.
// types/health.ts
export type HealthStatus = 'healthy' | 'degraded' | 'unhealthy';
export type HealthCheckResult = {
  status: HealthStatus;
  latencyMs: number;
  timestamp: string;
  details: Record<string, { status: HealthStatus; latencyMs: number; error?: string }>;
};
export type HealthCheckFn = () => Promise<{ status: HealthStatus; latencyMs: number; error?: string }>;
// core/health-evaluator.ts
import { HealthCheckFn, HealthCheckResult, HealthStatus } from '../types/health';

export class HealthEvaluator {
  private checks: Map<string, HealthCheckFn> = new Map();
  private defaultTimeoutMs = 2000;

  register(name: string, fn: HealthCheckFn) {
    this.checks.set(name, fn);
  }

  async evaluate(): Promise<HealthCheckResult> {
    const startTime = performance.now();
    const details: HealthCheckResult['details'] = {};

    // Run every registered check concurrently, each raced against its own timeout.
    const checkPromises = Array.from(this.checks.entries()).map(async ([name, fn]) => {
      const checkStart = performance.now();
      let timer: ReturnType<typeof setTimeout> | undefined;
      try {
        const result = await Promise.race([
          fn(),
          new Promise<never>((_, reject) => {
            timer = setTimeout(() => reject(new Error(`Timeout: ${name}`)), this.defaultTimeoutMs);
          })
        ]);
        details[name] = {
          status: result.status,
          latencyMs: Math.round(performance.now() - checkStart),
          error: result.error
        };
      } catch (error) {
        // A thrown error and a timeout are both reported as unhealthy.
        details[name] = {
          status: 'unhealthy',
          latencyMs: Math.round(performance.now() - checkStart),
          error: error instanceof Error ? error.message : 'Unknown error'
        };
      } finally {
        // Clear the timer so a finished check does not leave a pending timeout behind.
        clearTimeout(timer);
      }
    });
    await Promise.allSettled(checkPromises);

    // The worst individual status determines the overall status.
    const hasUnhealthy = Object.values(details).some(d => d.status === 'unhealthy');
    const hasDegraded = Object.values(details).some(d => d.status === 'degraded');
    const overallStatus: HealthStatus = hasUnhealthy
      ? 'unhealthy'
      : hasDegraded
        ? 'degraded'
        : 'healthy';

    return {
      status: overallStatus,
      latencyMs: Math.round(performance.now() - startTime),
      timestamp: new Date().toISOString(),
      details
    };
  }
}
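The timeout race used in `evaluate()` can also be factored into a small reusable helper. The sketch below is one possible shape, assuming a hypothetical `withTimeout` function that is not part of the `HealthEvaluator` class; it clears its timer once the race settles so no stray timeouts remain pending.

```typescript
// withTimeout is a hypothetical helper, shown only to illustrate the pattern.
function withTimeout<T>(work: Promise<T>, timeoutMs: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;

  // Rejects after timeoutMs unless the timer is cleared first.
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timeout: ${label}`)), timeoutMs);
  });

  // Whichever promise settles first wins; the timer is cleared either way.
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}
```

Note that the losing promise is not cancelled: a check that exceeds its deadline keeps running in the background unless the underlying client supports cancellation.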
Step 3: Implement Dependency Checks with Circuit Breaker Integration
Health checks should not trigger retries or heavy operations. They must reflect current state, not attempt to recover it.
// checks/database-check.ts
import { HealthCheckFn } from '../types/health';
import { dbPool } from '../infrastructure/db';
import { circuitBreaker } from '../infrastructure/circuit-breaker';

export const databaseHealthCheck: HealthCheckFn = async () => {
  // While the circuit is open, report degraded without touching the database.
  if (circuitBreaker.isTripped('database')) {
    return { status: 'degraded', latencyMs: 0, error: 'Circuit breaker open' };
  }

  const start = performance.now();
  try {
    // A single cheap query, with no retries.
    await dbPool.query('SELECT 1');
    circuitBreaker.recordSuccess('database');
    return { status: 'healthy', latencyMs: Math.round(performance.now() - start) };
  } catch (error) {
    circuitBreaker.recordFailure('database');
    return {
      status: 'unhealthy',
      latencyMs: Math.round(performance.now() - start),
      error: error instanceof Error ? error.message : 'DB query failed'
    };
  }
};
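The `../infrastructure/circuit-breaker` module is not shown in this guide. As a rough sketch of what could sit behind `isTripped`, `recordSuccess`, and `recordFailure`, the class below assumes a simple failure-count breaker with a cool-down. The class name `SimpleCircuitBreaker`, the thresholds, and the injectable clock are all assumptions made for illustration.

```typescript
// Hypothetical sketch; only the three method names come from the check above.
class SimpleCircuitBreaker {
  private failures = new Map<string, number>();
  private openedAt = new Map<string, number>();
  private failureThreshold: number;
  private resetAfterMs: number;
  private now: () => number;

  constructor(failureThreshold = 3, resetAfterMs = 30000, now: () => number = () => Date.now()) {
    this.failureThreshold = failureThreshold;
    this.resetAfterMs = resetAfterMs;
    this.now = now;
  }

  isTripped(name: string): boolean {
    const opened = this.openedAt.get(name);
    if (opened === undefined) return false;
    // After the cool-down, let the next call through (half-open).
    if (this.now() - opened >= this.resetAfterMs) {
      this.openedAt.delete(name);
      return false;
    }
    return true;
  }

  recordSuccess(name: string): void {
    // Any success fully closes the circuit.
    this.failures.delete(name);
    this.openedAt.delete(name);
  }

  recordFailure(name: string): void {
    const count = (this.failures.get(name) ?? 0) + 1;
    this.failures.set(name, count);
    // The failure count is kept while half-open, so one more failure re-opens the circuit.
    if (count >= this.failureThreshold) this.openedAt.set(name, this.now());
  }
}
```

In this sketch the health check itself acts as the half-open trial call once the cool-down has elapsed, which keeps probing cheap while the dependency is known to be down.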
Step 4: Expose Standardized Endpoints
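Separate endpoints prevent orchestrator confusion. Use proper HTTP semantics: 200 for healthy and 503 for unhealthy. The readiness handler below also returns 503 for degraded results, because load balancers and orchestrators generally act on the status code alone; the per-dependency detail stays available in the response body.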
// routes/health.ts
import { Router } from 'express';
import { HealthEvaluator } from '../core/health-evaluator';
import { databaseHealthCheck } from '../checks/database-check';
import { cacheHealthCheck } from '../checks/cache-check';
import { externalApiHealthCheck } from '../checks/external-api-check';

const router = Router();
const evaluator = new HealthEvaluator();
evaluator.register('database', databaseHealthCheck);
evaluator.register('cache', cacheHealthCheck);
evaluator.register('external-api', externalApiHealthCheck);

// Liveness: process state only
router.get('/health/live', (_req, res) => {
  res.status(200).json({ status: 'alive', timestamp: new Date().toISOString() });
});

// Readiness: functional state
router.get('/health/ready', async (_req, res) => {
  try {
    const result = await evaluator.evaluate();
    // Anything other than 'healthy', including 'degraded', is reported as 503.
    const statusCode = result.status === 'healthy' ? 200 : 503;
    res.status(statusCode).json(result);
  } catch {
    res.status(503).json({ status: 'unhealthy', error: 'Health evaluation failed' });
  }
});

// Startup: initialization complete
let startupComplete = false;
router.get('/health/startup', (_req, res) => {
  res.status(startupComplete ? 200 : 503).json({
    status: startupComplete ? 'initialized' : 'initializing',
    timestamp: new Date().toISOString()
  });
});

// Mark startup complete after boot sequence
export const markStartupComplete = () => { startupComplete = true; };

// Export the router so the application can mount it
export default router;
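The readiness handler maps every non-healthy result to 503. If a team decides that degraded instances should stay in rotation, that choice can be made explicit in one place instead of being scattered across handlers. The sketch below assumes a hypothetical `statusCodeFor` function and a hypothetical `treatDegradedAsReady` flag; neither exists in the code above.

```typescript
type HealthStatus = 'healthy' | 'degraded' | 'unhealthy';

// statusCodeFor and treatDegradedAsReady are hypothetical names for illustration.
function statusCodeFor(status: HealthStatus, treatDegradedAsReady = false): 200 | 503 {
  if (status === 'healthy') return 200;
  // Degraded only counts as ready when the caller opts in explicitly.
  if (status === 'degraded' && treatDegradedAsReady) return 200;
  return 503;
}
```

With the default argument this reproduces the handler's behavior; passing `true` keeps degraded instances serving traffic while the payload still reports the degradation.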
Architecture Decisions and Rationale
- Registry Pattern: Decouples health check registration from routing logic. Enables dynamic addition/removal of checks without modifying core evaluation logic.
- Async Non-Blocking Evaluation: Prevents event loop starvation. Health checks run concurrently with Promise.allSettled, ensuring one failing dependency doesn't block others.
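- Strict Timeouts: Promise.race enforces hard boundaries. External dependencies must not dictate health check latency. The 2000ms default must stay below the prober's own timeout so the endpoint can answer before the probe gives up; Kubernetes defaults to timeoutSeconds: 1, so either raise the probe timeout or lower this value.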
- Separate Endpoints: Isolates control-plane concerns. Orchestration systems can target specific probes without parsing response payloads.
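- No Retries in Health Checks: Health checks are diagnostic, not remedial. Retries mask true dependency state and increase load on failing systems. The check reads circuit breaker state before touching the dependency and skips the query while the circuit is open; the sample database check records the outcome of its single probe, but it never retries or forces the breaker to reset.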
Pitfall Guide
- Synchronous Blocking Checks: Running database or network calls synchronously blocks the main thread. In Node.js, this halts all request processing until the check completes. Always use async I/O with explicit timeouts.
- Conflating Liveness and Readiness: Liveness indicates process survival. Readiness indicates traffic capability. A database outage should trigger readiness failure, not liveness failure. Killing pods during dependency degradation causes restart storms and data loss.
- Missing Timeout Boundaries: Unbounded health checks hang indefinitely when dependencies stall. A hung probe holds a connection open and is only marked failed once the prober's own timeout fires; a prober configured without one may keep routing traffic to a dead instance. Enforce hard timeouts at the evaluation layer.
- Returning 200 OK for Degraded States: Partial failures must surface as 503 or structured degraded payloads. Returning 200 with a warning field breaks standard load balancer behavior, which only reads HTTP status codes for routing decisions.
- Over-Frequent Probing: Health checks running at sub-second intervals create thundering herd effects against databases and caches. Align probe frequency with orchestrator defaults (Kubernetes: periodSeconds 10, timeoutSeconds 1) and set the probe timeout above the evaluator's own timeout. Use caching for expensive checks if necessary.
- Ignoring Cache Warm-Up States: Applications often report healthy before caches are populated, causing immediate traffic rejection. Implement startup probes that block readiness until critical caches reach minimum threshold.
- Exposing Internal Metrics in Public Endpoints: Health endpoints are often internet-facing. Returning stack traces, connection strings, or internal topology leaks attack surface. Strip sensitive data in production builds using environment-aware sanitization.
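For the over-frequent probing pitfall, one way to cache an expensive check is to wrap it so that results are reused for a short time and concurrent probes share a single in-flight call. This is a sketch under assumptions: `cachedCheck`, `ttlMs`, and the injectable `now` clock are hypothetical, and the local types simply mirror the shape of `HealthCheckFn`.

```typescript
type CheckResult = { status: 'healthy' | 'degraded' | 'unhealthy'; latencyMs: number; error?: string };
type CheckFn = () => Promise<CheckResult>;

// cachedCheck is a hypothetical wrapper; ttlMs is an assumed tuning knob.
function cachedCheck(fn: CheckFn, ttlMs: number, now: () => number = () => Date.now()): CheckFn {
  let cachedAt = -Infinity;
  let last: CheckResult | undefined;
  let inFlight: Promise<CheckResult> | undefined;

  return async () => {
    // Serve the previous result while it is still fresh.
    if (last !== undefined && now() - cachedAt < ttlMs) return last;

    // Share one in-flight call so concurrent probes do not stampede the dependency.
    if (inFlight === undefined) {
      inFlight = fn()
        .then((result) => {
          last = result;
          cachedAt = now();
          return result;
        })
        .finally(() => {
          inFlight = undefined;
        });
    }
    return inFlight;
  };
}
```

The trade-off is staleness: a cached result can hide a failure for up to `ttlMs`, so the TTL should stay well below the probe interval multiplied by the failure threshold.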
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Monolithic on-prem | Basic ping + DB check | Low orchestration complexity, single failure domain | Minimal infra cost, moderate MTTR |
| Kubernetes microservices | Composite/weighted readiness | Auto-scaling and traffic shifting require precise signals | +5-8% CPU, -77% MTTR |
| Event-driven/queue workers | Readiness + queue depth check | Workers must pause consumption when downstream is degraded | +3% overhead, prevents message loss |
| Serverless/lambda | Stateless ping + cold start guard | No persistent connections, health checked per invocation | Near-zero overhead, depends on provider |
Configuration Template
// config/health.config.ts
import { HealthEvaluator } from '../core/health-evaluator';
import { databaseHealthCheck } from '../checks/database-check';
import { cacheHealthCheck } from '../checks/cache-check';
import { redisHealthCheck } from '../checks/redis-check';

export const createHealthEvaluator = () => {
  const evaluator = new HealthEvaluator();
  // Register checks; the evaluator applies its default timeout to each one
  evaluator.register('postgresql', databaseHealthCheck);
  evaluator.register('redis', redisHealthCheck);
  evaluator.register('cache-layer', cacheHealthCheck);
  return evaluator;
};

// Environment overrides (values are parsed as base-10 integers)
export const HEALTH_CHECK_CONFIG = {
  intervalMs: process.env.HEALTH_INTERVAL_MS ? Number.parseInt(process.env.HEALTH_INTERVAL_MS, 10) : 10000,
  timeoutMs: process.env.HEALTH_TIMEOUT_MS ? Number.parseInt(process.env.HEALTH_TIMEOUT_MS, 10) : 2000,
  startupGracePeriodMs: process.env.STARTUP_GRACE_MS ? Number.parseInt(process.env.STARTUP_GRACE_MS, 10) : 30000,
  exposeDetails: process.env.NODE_ENV === 'development',
  stripSensitiveKeys: ['password', 'secret', 'token', 'connection_string']
};
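The template defines `exposeDetails` and `stripSensitiveKeys` but does not show how they are applied. One possible interpretation, sketched below, is to redact per-check error messages before the payload leaves the process. `sanitizeDetails` is a hypothetical helper, and treating the key list as substrings to match inside error messages is an assumption about the intended behavior.

```typescript
type Detail = { status: string; latencyMs: number; error?: string };

// sanitizeDetails is a hypothetical helper; its matching rule is an assumption.
function sanitizeDetails(
  details: Record<string, Detail>,
  exposeDetails: boolean,
  sensitiveKeys: string[],
): Record<string, Detail> {
  const out: Record<string, Detail> = {};
  for (const [name, detail] of Object.entries(details)) {
    const { error, ...rest } = detail;
    if (error === undefined) {
      out[name] = rest;
      continue;
    }
    // Redact when details are hidden, or when the message mentions a sensitive key.
    const lowered = error.toLowerCase();
    const leaks = sensitiveKeys.some((key) => lowered.includes(key.toLowerCase()));
    out[name] = { ...rest, error: exposeDetails && !leaks ? error : 'redacted' };
  }
  return out;
}
```

Status and latency are always kept, so orchestrators and dashboards still see which dependency failed even when the message itself is withheld.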
Quick Start Guide
- Install dependencies: npm install express pino @opentelemetry/api
- Create probe endpoints in your router: /health/live, /health/ready, /health/startup
- Register async dependency checks with 2000ms timeout boundaries using the HealthEvaluator class
- Configure your orchestrator: Kubernetes livenessProbe targets /health/live, readinessProbe targets /health/ready
- Validate with curl -v http://localhost:3000/health/ready and verify HTTP status codes match dependency state