n, minimizing false positives while maintaining rapid detection.
Step-by-Step Implementation
1. Instrumentation with OpenTelemetry (TypeScript)
Instrument your application to emit RED metrics (Rate, Errors, Duration) and USE metrics (Utilization, Saturation, Errors) where applicable.
import { metrics } from '@opentelemetry/api';
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { PrometheusExporter } from '@opentelemetry/exporter-prometheus';
import { Resource } from '@opentelemetry/resources';
import { SEMRESATTRS_SERVICE_NAME } from '@opentelemetry/semantic-conventions';

// Initialize the Prometheus exporter; it serves an HTTP endpoint that Prometheus scrapes
const prometheusExporter = new PrometheusExporter(
  { port: 9464, endpoint: '/metrics' },
  () => {
    console.log('Prometheus scrape endpoint ready at :9464/metrics');
  }
);

// Configure SDK
const sdk = new NodeSDK({
  resource: new Resource({
    [SEMRESATTRS_SERVICE_NAME]: 'payment-service',
    environment: 'production',
  }),
  metricReader: prometheusExporter,
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();

// Custom business metrics: get a meter from the global MeterProvider registered by sdk.start()
const meter = metrics.getMeter('payment-meter');
const paymentDurationHistogram = meter.createHistogram('payment.duration', {
  description: 'Duration of payment processing in seconds',
  unit: 's',
});
const paymentErrorsCounter = meter.createCounter('payment.errors', {
  description: 'Count of payment processing errors',
});

// Usage in request handler.
// `Request` and `processPayment` come from your web framework and application code.
export async function handlePayment(req: Request) {
  const start = Date.now();
  try {
    // Business logic
    await processPayment(req.body);
    paymentDurationHistogram.record((Date.now() - start) / 1000, { status: 'success' });
  } catch (err) {
    // `err` is `unknown` in strict TypeScript, so narrow it before reading `name`
    const type = err instanceof Error ? err.name : 'UnknownError';
    paymentErrorsCounter.add(1, { type });
    paymentDurationHistogram.record((Date.now() - start) / 1000, { status: 'error' });
    throw err;
  }
}
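The try/catch pattern in the handler tends to repeat across endpoints. As a minimal sketch, it could be factored into a reusable wrapper. The `DurationRecorder` and `ErrorCounter` interfaces and the `withMetrics` name are hypothetical; they are written to be structurally compatible with the `record` and `add` calls used above, so no OpenTelemetry import is needed for the sketch itself.

```typescript
// Hypothetical minimal interfaces mirroring the histogram/counter calls used above.
interface DurationRecorder {
  record(value: number, attributes?: Record<string, string>): void;
}
interface ErrorCounter {
  add(value: number, attributes?: Record<string, string>): void;
}

// Runs `operation`, always records its duration in seconds with a status
// attribute, and counts the error (by error name) if it throws.
async function withMetrics<T>(
  duration: DurationRecorder,
  errors: ErrorCounter,
  operation: () => Promise<T>
): Promise<T> {
  const start = Date.now();
  let status = 'success';
  try {
    return await operation();
  } catch (err) {
    status = 'error';
    errors.add(1, { type: err instanceof Error ? err.name : 'UnknownError' });
    throw err; // rethrow so callers still see the failure
  } finally {
    duration.record((Date.now() - start) / 1000, { status });
  }
}
```

With this helper, a handler body would reduce to something like `withMetrics(paymentDurationHistogram, paymentErrorsCounter, () => processPayment(req.body))`.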
2. SLO Definition and Burn Rate Calculation
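Define the SLO and calculate the error budget. For a service with a 99.9% availability SLO, the error budget is 0.1% of requests over the SLO period (30 days here).
Burn Rate Math:
Burn rate is the observed error ratio divided by the error budget. A burn rate of 1 means the budget is consumed exactly over the 30-day period. A burn rate of 14.4 consumes 2% of the budget in 1 hour and, if sustained, exhausts the entire budget in about 50 hours (30 days / 14.4).
- Fast Window (1 hour): Detects acute issues. Burn rate threshold: 14.4 (2% of the budget per hour).
- Slow Window (5 hours): Detects chronic issues. Burn rate threshold: 6 (about 4.2% of the budget over 5 hours).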
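The arithmetic behind these thresholds can be checked with a few lines of code. This is a minimal sketch assuming a 30-day SLO period; the function names are illustrative and not part of any library.

```typescript
// Fraction of requests allowed to fail, e.g. 0.001 for a 99.9% SLO.
function errorBudget(sloTarget: number): number {
  return 1 - sloTarget;
}

// Burn rate = observed error ratio divided by the error budget.
function burnRate(errorRatio: number, sloTarget: number): number {
  return errorRatio / errorBudget(sloTarget);
}

// Fraction of the whole budget consumed if `rate` is sustained for `windowHours`.
function budgetConsumed(rate: number, windowHours: number, periodDays = 30): number {
  return (rate * windowHours) / (periodDays * 24);
}

// Hours until the budget is fully spent at a constant burn rate.
function hoursToExhaustion(rate: number, periodDays = 30): number {
  return (periodDays * 24) / rate;
}

console.log(burnRate(0.0144, 0.999).toFixed(1)); // 14.4
console.log(budgetConsumed(14.4, 1).toFixed(3)); // 0.020
console.log(budgetConsumed(6, 5).toFixed(4));    // 0.0417
console.log(hoursToExhaustion(14.4).toFixed(1)); // 50.0
```

In other words, a 1.44% error ratio against a 99.9% SLO is a burn rate of 14.4, which spends 2% of a 30-day budget every hour.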
PromQL Query for Burn Rate:
# Error budget burn rate over a 5m window
# Burn rate = observed error ratio / error budget fraction
(
  sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))
)
/
(1 - 0.999)  # Error budget fraction for a 99.9% SLO
3. Alerting Rules Configuration
Configure Prometheus rules to trigger alerts based on MWMR logic.
groups:
  - name: payment-service-slo
    rules:
      # Page Alert: Fast burn, high severity
      - alert: CriticalSLOBreach
        expr: |
          (
            sum(rate(http_requests_total{service="payment-service", status=~"5.."}[5m]))
            / sum(rate(http_requests_total{service="payment-service"}[5m]))
          ) > (14.4 * (1 - 0.999))
          and
          (
            sum(rate(http_requests_total{service="payment-service", status=~"5.."}[1h]))
            / sum(rate(http_requests_total{service="payment-service"}[1h]))
          ) > (14.4 * (1 - 0.999))
        for: 2m
        labels:
          severity: page
          slo: payment-availability
          # sum() drops the service label, so set it explicitly for grouping and routing
          service: payment-service
        annotations:
          summary: "Payment service SLO breach: High error rate detected."
          description: "Error budget is burning more than 14.4x faster than allowed. User impact is likely."
      # Ticket Alert: Slow burn, lower severity
      - alert: WarningSLOBurn
        expr: |
          (
            sum(rate(http_requests_total{service="payment-service", status=~"5.."}[30m]))
            / sum(rate(http_requests_total{service="payment-service"}[30m]))
          ) > (6 * (1 - 0.999))
          and
          (
            sum(rate(http_requests_total{service="payment-service", status=~"5.."}[5h]))
            / sum(rate(http_requests_total{service="payment-service"}[5h]))
          ) > (6 * (1 - 0.999))
        for: 5m
        labels:
          severity: ticket
          slo: payment-availability
          service: payment-service
        annotations:
          summary: "Payment service error budget is burning more than 6x faster than allowed."
          description: "Sustained error rate detected. Create a ticket to investigate."
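The two rules differ only in their windows and burn-rate factor, so the expressions can be generated rather than hand-edited, which helps avoid copy-paste drift between the short and long windows. The following is a sketch only: `BurnWindow`, `errorRatioExpr`, and `burnRateAlertExpr` are hypothetical helper names, and the metric and label names are taken from the rules above.

```typescript
// One multi-window pair: both windows must exceed the same burn-rate threshold.
interface BurnWindow {
  long: string;   // e.g. '1h'
  short: string;  // e.g. '5m'
  factor: number; // burn-rate threshold, e.g. 14.4
}

// Ratio of 5xx requests to all requests for one service over one window.
function errorRatioExpr(service: string, window: string): string {
  return (
    `sum(rate(http_requests_total{service="${service}", status=~"5.."}[${window}]))` +
    ` / sum(rate(http_requests_total{service="${service}"}[${window}]))`
  );
}

// Full alert expression: short-window AND long-window ratio above factor * budget.
function burnRateAlertExpr(service: string, slo: number, w: BurnWindow): string {
  const threshold = `(${w.factor} * (1 - ${slo}))`;
  return [
    `(${errorRatioExpr(service, w.short)}) > ${threshold}`,
    'and',
    `(${errorRatioExpr(service, w.long)}) > ${threshold}`,
  ].join('\n');
}

// Reproduces the page and ticket expressions from the rules above.
console.log(burnRateAlertExpr('payment-service', 0.999, { long: '1h', short: '5m', factor: 14.4 }));
console.log(burnRateAlertExpr('payment-service', 0.999, { long: '5h', short: '30m', factor: 6 }));
```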
4. Alertmanager Routing
Route alerts based on severity and service labels to appropriate channels.
route:
  group_by: ['alertname', 'service', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver'
  routes:
    - match:
        severity: page
      receiver: 'pagerduty-critical'
      continue: false
    - match:
        severity: ticket
      receiver: 'slack-engineering'
receivers:
  # Fallback for alerts that match no child route; every referenced receiver must be defined
  - name: 'default-receiver'
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '<PAGERDUTY_SERVICE_KEY>'
        severity: 'critical'
  - name: 'slack-engineering'
    slack_configs:
      - api_url: '<SLACK_WEBHOOK>'
        channel: '#ops-alerts'
        send_resolved: true
Pitfall Guide
1. Alerting on Internal Causes Instead of User Impact
Mistake: Alerting on high CPU usage or memory consumption without correlating to user-facing metrics. High CPU may occur during a batch job that does not affect latency or availability.
Best Practice: Prioritize Golden Signals (Latency, Traffic, Errors, Saturation) that directly correlate to user experience. Only alert on infrastructure metrics if they predict imminent user impact.
2. Ignoring Multi-Window Multi-Burn Rate Logic
Mistake: Using single-window alerts. A single window either misses gradual degradation (short window) or reacts too slowly to spikes (long window).
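Best Practice: Always implement MWMR. The fast-burn alert catches sudden failures and the slow-burn alert catches sustained degradation. Requiring both a short and a long window to exceed the threshold filters out transient blips and lets the alert clear soon after recovery, which substantially reduces false positives.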
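The multi-window condition itself is small enough to express as a pure function, which makes the behavior easy to reason about. This is a simplified model for illustration, not how Prometheus evaluates rules; `WindowSample` and `isBurning` are hypothetical names.

```typescript
// Error ratios measured over the short and long windows of one alert pair.
interface WindowSample {
  shortRatio: number;
  longRatio: number;
}

// Fires only when BOTH windows exceed factor * error budget.
function isBurning(sample: WindowSample, factor: number, sloTarget: number): boolean {
  const threshold = factor * (1 - sloTarget);
  return sample.shortRatio > threshold && sample.longRatio > threshold;
}

const slo = 0.999;
// Transient blip: short window spikes, long window barely moves. No page.
console.log(isBurning({ shortRatio: 0.05, longRatio: 0.002 }, 14.4, slo));  // false
// Incident already recovered: long window still elevated, short window clean. No page.
console.log(isBurning({ shortRatio: 0.0001, longRatio: 0.02 }, 14.4, slo)); // false
// Ongoing incident: both windows above the 1.44% threshold. Page.
console.log(isBurning({ shortRatio: 0.03, longRatio: 0.03 }, 14.4, slo));   // true
```

The second case is why the short window matters: without it, the alert would keep firing for most of the long window after the problem is fixed.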
3. Hardcoded Thresholds in Dynamic Environments
Mistake: Setting a static CPU threshold of 80% in a Kubernetes cluster where pods auto-scale. The threshold may be irrelevant if the load balancer distributes traffic unevenly or if the pod is being terminated.
Best Practice: Use relative thresholds and SLOs. If infrastructure metrics are necessary, use percentiles relative to historical baselines or dynamic thresholds based on auto-scaling events.
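One way to make a threshold relative is to compare the current value against a percentile of recent history instead of a fixed number. The sketch below uses the nearest-rank percentile method; the `tolerance` multiplier and the function names are assumptions chosen for illustration.

```typescript
// Nearest-rank percentile of a non-empty sample set (p in 0..100).
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error('no samples');
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  const index = Math.min(sorted.length, Math.max(1, rank)) - 1;
  return sorted[index];
}

// True when `current` is more than `tolerance` times the historical percentile.
function exceedsBaseline(
  current: number,
  history: number[],
  p = 95,
  tolerance = 1.2
): boolean {
  return current > percentile(history, p) * tolerance;
}

// Hypothetical CPU utilization samples (percent) from a comparable past period.
const history = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100];
console.log(percentile(history, 95));       // 100
console.log(exceedsBaseline(110, history)); // false (threshold is 120)
console.log(exceedsBaseline(130, history)); // true
```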
4. Lack of Actionable Runbooks
Mistake: Alerts fire with generic messages like "Service Down" without context or remediation steps. Engineers waste time diagnosing the issue during an incident.
Best Practice: Every alert must link to a runbook. Runbooks should include common causes, diagnostic commands, and one-click remediation scripts. Annotations in alert rules should provide immediate context.
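The "every alert links to a runbook" rule can be enforced mechanically. The sketch below assumes rules have already been parsed into plain objects and that runbooks are linked through a `runbook_url` annotation, which is a common convention rather than something Prometheus requires; `AlertRule` and `findRulesMissingRunbook` are hypothetical names.

```typescript
// Minimal shape of an alerting rule after parsing the rules file.
interface AlertRule {
  alert: string;
  labels?: Record<string, string>;
  annotations?: Record<string, string>;
}

// Returns the names of alerts that have no non-empty runbook_url annotation.
function findRulesMissingRunbook(rules: AlertRule[]): string[] {
  return rules
    .filter((rule) => !rule.annotations?.runbook_url?.trim())
    .map((rule) => rule.alert);
}

const rules: AlertRule[] = [
  {
    alert: 'CriticalSLOBreach',
    // The URL is a placeholder for illustration.
    annotations: { runbook_url: 'https://runbooks.example.com/payment-slo' },
  },
  { alert: 'WarningSLOBurn', annotations: { summary: 'Budget burning' } },
];

console.log(findRulesMissingRunbook(rules)); // [ 'WarningSLOBurn' ]
```

A check like this could run in CI and fail the build when the returned list is non-empty.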
5. Alert Storms and Missing Grouping
Mistake: Firing hundreds of alerts for a single root cause (e.g., a database outage triggering alerts for 50 downstream services).
Best Practice: Configure Alertmanager grouping and inhibition rules. Group alerts by service and root cause. Use inhibition to silence dependent service alerts when a core dependency is down.
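To see why grouping and inhibition cut down notification volume, it can help to model them on plain label sets. The code below is a simplified illustration of the two ideas, not Alertmanager's implementation; all type and function names are hypothetical.

```typescript
type Labels = Record<string, string>;

// Grouping: alerts sharing the same values for the groupBy labels form one notification.
function groupAlerts(alerts: Labels[], groupBy: string[]): Map<string, Labels[]> {
  const groups = new Map<string, Labels[]>();
  for (const labels of alerts) {
    const key = groupBy.map((name) => `${name}=${labels[name] ?? ''}`).join(',');
    const bucket = groups.get(key) ?? [];
    bucket.push(labels);
    groups.set(key, bucket);
  }
  return groups;
}

interface InhibitRule {
  source: Labels;  // labels identifying the root-cause alert
  target: Labels;  // labels identifying alerts to suppress
  equal: string[]; // labels that must match between source and target
}

function matches(labels: Labels, matcher: Labels): boolean {
  return Object.entries(matcher).every(([name, value]) => labels[name] === value);
}

// Inhibition: drop target alerts while a matching source alert is firing.
function applyInhibition(alerts: Labels[], rule: InhibitRule): Labels[] {
  const sources = alerts.filter((a) => matches(a, rule.source));
  return alerts.filter((a) => {
    if (!matches(a, rule.target)) return true;
    return !sources.some((s) => s !== a && rule.equal.every((l) => s[l] === a[l]));
  });
}
```

For example, with one `DatabaseDown` alert and fifty `HighErrorRate` alerts in the same cluster, an inhibit rule with `equal: ['cluster']` would leave only the database alert to notify on.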
6. High Cardinality Metrics Explosion
Mistake: Adding unbounded labels to metrics, such as user IDs or request URLs, causing the metrics database to run out of memory and query performance to degrade.
Best Practice: Limit label cardinality. Use metrics for aggregated data and traces/logs for high-cardinality details. Sanitize labels in instrumentation code to cap unique series counts.
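Label sanitizing can be done in the instrumentation layer before a value is ever attached to a metric. The following sketch shows two common techniques under stated assumptions: collapsing ID-like path segments into a placeholder, and capping the number of distinct values a label may take. `normalizeRoute` and `BoundedLabel` are hypothetical helpers, and the ID pattern (digits or UUIDs) is an assumption about the URL scheme.

```typescript
// Path segments that look like numeric IDs or UUIDs.
const ID_SEGMENT =
  /^(\d+|[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})$/i;

// Strips the query string and replaces ID-like segments with ':id'.
function normalizeRoute(path: string): string {
  const pathname = path.split('?')[0];
  return pathname
    .split('/')
    .map((segment) => (ID_SEGMENT.test(segment) ? ':id' : segment))
    .join('/');
}

// Allows at most `maxValues` distinct label values; later new values collapse
// into a single overflow bucket so the series count stays bounded.
class BoundedLabel {
  private readonly seen = new Set<string>();
  private readonly maxValues: number;
  private readonly overflow: string;

  constructor(maxValues: number, overflow = 'other') {
    this.maxValues = maxValues;
    this.overflow = overflow;
  }

  value(raw: string): string {
    if (this.seen.has(raw)) return raw;
    if (this.seen.size < this.maxValues) {
      this.seen.add(raw);
      return raw;
    }
    return this.overflow;
  }
}

console.log(normalizeRoute('/users/12345/orders?page=2')); // /users/:id/orders
```

The cap in `BoundedLabel` is first-come, first-served, so it is best suited to labels where the important values appear early, such as a known set of routes.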
7. No Alert Testing or Drills
Mistake: Assuming alerts work because the configuration is valid. Rules may have syntax errors, label mismatches, or routing misconfigurations that only surface during a real incident.
Best Practice: Implement alert testing in CI/CD pipelines. Use promtool check rules to validate rule syntax and promtool test rules to unit-test alert behavior against synthetic series. Conduct regular game days to verify alert delivery, routing, and runbook effectiveness.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Startup / MVP | Managed SaaS (Datadog/New Relic) | Low operational overhead, fast setup, built-in dashboards. | Higher per-host cost; scales with usage. |
| Enterprise / Compliance | Self-hosted Prometheus + VictoriaMetrics | Full data control, air-gapped capability, no vendor lock-in. | High engineering overhead for maintenance. |
| Kubernetes Native | Prometheus Operator + OTel | Native CRDs, auto-discovery, seamless integration with K8s lifecycle. | Moderate resource usage for control plane. |
| Serverless / Event-Driven | Push-based Metrics (CloudWatch/X-Ray) | Pull models struggle with ephemeral functions; push fits lifecycle. | Pay-per-metric cost; can spike with high volume. |
Configuration Template
OpenTelemetry Collector Config (otel-collector-config.yaml)
receivers:
  otlp:
    protocols:
      grpc:
      http:
processors:
  batch:
    timeout: 10s
    send_batch_size: 1000
  memory_limiter:
    check_interval: 1s
    limit_mib: 1500
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: "otel"
  # Add to a pipeline's exporters list when troubleshooting
  debug:
    verbosity: detailed
service:
  pipelines:
    metrics:
      receivers: [otlp]
      # memory_limiter goes first so it can refuse data before batching buffers it
      processors: [memory_limiter, batch]
      exporters: [prometheus]
Alertmanager Silence Template
# silence.yaml
matchers:
  - name: alertname
    value: CriticalSLOBreach
    isRegex: false
  - name: slo
    value: payment-availability
    isRegex: false
startsAt: "2023-10-27T10:00:00Z"
endsAt: "2023-10-27T12:00:00Z"
createdBy: "oncall-engineer"
comment: "Silencing for planned maintenance window."
Quick Start Guide
- Deploy the Stack: Use Helm to deploy Prometheus and Alertmanager.
  helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
  helm install monitoring prometheus-community/kube-prometheus-stack -n monitoring --create-namespace
- Add OTel Instrumentation: Install @opentelemetry/sdk-node and @opentelemetry/exporter-prometheus in your TypeScript service. The exporter serves a scrape endpoint (:9464/metrics in the instrumentation example); configure Prometheus to scrape it.
- Apply SLO Rules: Create a PrometheusRule resource with your burn rate alerting rules; this is the CRD the Prometheus Operator in kube-prometheus-stack uses to load rules, rather than a hand-edited ConfigMap. Ensure the rules match your service labels.
- Verify and Test: Access the Prometheus UI, query your metrics, and force a test alert by temporarily lowering the burn rate threshold. Confirm the alert appears in Alertmanager and routes to your notification channel.