- SLO: 99.9% availability over a rolling 30-day window.
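To make the SLO concrete, the availability target can be converted into an error budget. The sketch below is illustrative only; `errorBudgetMinutes` is a hypothetical helper name and not part of any library.

```typescript
// Hypothetical helper: converts an availability SLO into the amount of
// "allowed unavailability" (the error budget) for a rolling window.
function errorBudgetMinutes(slo: number, windowDays: number): number {
  if (slo <= 0 || slo >= 1) {
    throw new RangeError('slo must be strictly between 0 and 1');
  }
  const windowMinutes = windowDays * 24 * 60;
  return (1 - slo) * windowMinutes;
}

// 99.9% over 30 days leaves roughly 43.2 minutes of budget.
const budgetMinutes = errorBudgetMinutes(0.999, 30);
console.log(`${budgetMinutes.toFixed(1)} minutes of error budget per 30 days`);
```

Every alerting threshold later in this guide is ultimately a statement about how quickly this budget is being spent.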
2. Instrumentation Strategy
Use OpenTelemetry for standardized instrumentation across languages. Because instrumentation is decoupled from the backend, you can migrate between monitoring stacks by swapping exporters rather than rewriting instrumentation code.
TypeScript Instrumentation Example:
Use @opentelemetry/api and @opentelemetry/sdk-metrics to expose custom business metrics.
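import { metrics, ValueType } from '@opentelemetry/api';
import { MeterProvider } from '@opentelemetry/sdk-metrics';
// A MeterProvider only exports data once a metric reader/exporter is registered;
// that wiring depends on your backend and is omitted here.
const meterProvider = new MeterProvider();
// Register the provider globally so other modules can obtain meters via the API.
metrics.setGlobalMeterProvider(meterProvider);
const meter = meterProvider.getMeter('payment-service');
// Counter for transaction outcomes
const transactionCounter = meter.createCounter('transactions_total', {
  description: 'Total number of transactions',
  valueType: ValueType.INT,
});
// Histogram for processing duration (DOUBLE, since measured durations are often fractional)
const durationHistogram = meter.createHistogram('transaction_duration_ms', {
  description: 'Transaction processing duration',
  unit: 'ms',
  valueType: ValueType.DOUBLE,
});
export function recordTransaction(status: string, duration: number): void {
  // Fall back to a fixed value so the attribute set stays consistent when REGION is unset.
  const labels = { status, region: process.env.REGION ?? 'unknown' };
  transactionCounter.add(1, labels);
  durationHistogram.record(duration, labels);
}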
3. Recording Rules for Performance
Raw queries on high-cardinality metrics consume excessive CPU and memory. Pre-compute expensive aggregations using recording rules in Prometheus.
groups:
  - name: payment_slo_recording_rules
    interval: 30s
    rules:
      - record: job:http_request_duration_seconds:p99:5m
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="payment-service"}[5m])) by (le))
      - record: job:http_requests:failure_rate:5m
        expr: sum(rate(http_requests_total{job="payment-service", status=~"5.."}[5m])) / sum(rate(http_requests_total{job="payment-service"}[5m]))
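To show why the p99 rule is worth pre-computing, here is a simplified sketch of the kind of bucket interpolation that `histogram_quantile` performs over cumulative `le` buckets. It is an approximation for illustration, not Prometheus's actual implementation: the `Bucket` type, the function name, and the sample counts are assumptions, and edge cases such as non-positive bucket bounds are ignored.

```typescript
// One cumulative bucket as exposed by a classic Prometheus histogram:
// `count` observations were <= `le`. The last bucket has le = Infinity.
interface Bucket {
  le: number;
  count: number;
}

// Simplified quantile estimate using linear interpolation inside the bucket
// that contains the target rank (assumes 0 < q <= 1 and at least two buckets).
function histogramQuantile(q: number, buckets: Bucket[]): number {
  if (q <= 0 || q > 1 || buckets.length < 2) return NaN;
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  const total = sorted[sorted.length - 1].count;
  if (total === 0) return NaN;

  const rank = q * total;
  const idx = sorted.findIndex((b) => b.count >= rank);

  // If the rank falls into the +Inf bucket, report the highest finite bound.
  if (sorted[idx].le === Infinity) return sorted[idx - 1].le;

  const lowerBound = idx === 0 ? 0 : sorted[idx - 1].le;
  const lowerCount = idx === 0 ? 0 : sorted[idx - 1].count;
  const fraction = (rank - lowerCount) / (sorted[idx].count - lowerCount);
  return lowerBound + (sorted[idx].le - lowerBound) * fraction;
}

// Illustrative (made-up) bucket counts for request durations in seconds.
const sampleBuckets: Bucket[] = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 99 },
  { le: Infinity, count: 100 },
];
const p95 = histogramQuantile(0.95, sampleBuckets);
console.log(`estimated p95: ${p95.toFixed(3)}s`);
```

The real query does this per evaluation across every matching series, which is why recording the result once per interval is cheaper than recomputing it on each dashboard load.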
4. Burn Rate Alerting
Implement multi-window, multi-burn-rate alerting to detect both fast and slow burns of the error budget.
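- alert: SLOHighBurnRate
  # Fast-burn rule: both the long (1h) and short (5m) windows must exceed 14.4x the budget rate.
  expr: |
    (
      sum(rate(http_requests_total{job="payment-service", status=~"5.."}[1h]))
      /
      sum(rate(http_requests_total{job="payment-service"}[1h]))
    ) > (14.4 * (1 - 0.999))
    and
    (
      sum(rate(http_requests_total{job="payment-service", status=~"5.."}[5m]))
      /
      sum(rate(http_requests_total{job="payment-service"}[5m]))
    ) > (14.4 * (1 - 0.999))
  for: 2m
  labels:
    severity: critical
    page: "true"
  annotations:
    summary: "High error budget burn rate detected"
    description: "Error rate exceeds 14.4x the budget rate; at this pace the 30-day error budget is exhausted in about 2 days."
    runbook_url: "https://runbooks.internal/payment-service/error-budget-burn"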
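The arithmetic behind the 14.4 factor can be checked with a short sketch. The function names below are hypothetical, and the one-hour long window is an assumption taken from common multi-window practice rather than something this guide mandates.

```typescript
// Error ratio above which the budget burns `burnRate` times faster than sustainable.
function burnRateThreshold(burnRate: number, slo: number): number {
  return burnRate * (1 - slo);
}

// Hours until the whole budget is gone if the burn rate stays constant.
function hoursToExhaustion(burnRate: number, windowHours: number): number {
  return windowHours / burnRate;
}

// Fraction of the total budget consumed during one alert window.
function budgetConsumed(burnRate: number, alertWindowHours: number, windowHours: number): number {
  return (burnRate * alertWindowHours) / windowHours;
}

// Multi-window check: page only if both the long and the short window exceed
// the threshold, so the alert stops firing soon after the burn stops.
function shouldPage(longRatio: number, shortRatio: number, burnRate: number, slo: number): boolean {
  const threshold = burnRateThreshold(burnRate, slo);
  return longRatio > threshold && shortRatio > threshold;
}

const sloWindowHours = 30 * 24;
console.log(`threshold: ${burnRateThreshold(14.4, 0.999).toFixed(4)}`);
console.log(`hours to exhaustion: ${hoursToExhaustion(14.4, sloWindowHours).toFixed(1)}`);
console.log(`budget used in 1h: ${(budgetConsumed(14.4, 1, sloWindowHours) * 100).toFixed(1)}%`);
```

At a 14.4x burn rate the threshold works out to an error ratio of about 1.44%, one hour of burning consumes about 2% of the 30-day budget, and the full budget lasts roughly 50 hours.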
5. Alertmanager Routing and Inhibition
Configure Alertmanager to group related alerts, inhibit duplicates, and route to appropriate channels.
route:
  receiver: 'default-pagerduty'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
        page: "true"
      receiver: 'pagerduty-critical'
      continue: false
    - match:
        severity: warning
      receiver: 'slack-warnings'
      group_wait: 10s
      repeat_interval: 1h
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']
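The routing behavior of the tree above can be approximated with a small model: child routes are tried in order, the first match wins unless `continue` is set, and unmatched alerts fall back to the parent receiver. This is a simplified single-level sketch with hypothetical type and function names; it ignores nested routes, regex matchers, and grouping.

```typescript
type AlertLabels = Record<string, string>;

interface ChildRoute {
  match: AlertLabels;
  receiver: string;
  continue?: boolean;
}

// Returns the receivers an alert would be delivered to under first-match-wins routing.
function resolveReceivers(alert: AlertLabels, defaultReceiver: string, routes: ChildRoute[]): string[] {
  const matched: string[] = [];
  for (const route of routes) {
    const isMatch = Object.entries(route.match).every(([key, value]) => alert[key] === value);
    if (!isMatch) continue;
    matched.push(route.receiver);
    if (!route.continue) break; // stop at the first match unless continue: true
  }
  return matched.length > 0 ? matched : [defaultReceiver];
}

// Mirrors the two child routes from the configuration above.
const childRoutes: ChildRoute[] = [
  { match: { severity: 'critical', page: 'true' }, receiver: 'pagerduty-critical', continue: false },
  { match: { severity: 'warning' }, receiver: 'slack-warnings' },
];

console.log(resolveReceivers({ severity: 'critical', page: 'true' }, 'default-pagerduty', childRoutes));
console.log(resolveReceivers({ severity: 'critical' }, 'default-pagerduty', childRoutes));
```

Note the second call: a critical alert without the `page: "true"` label matches neither child route and lands on the default receiver, which is worth checking against your intent.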
Pitfall Guide
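- Alerting on Causes, Not Symptoms:
  - Mistake: Paging on an internal signal such as high memory usage, which may or may not affect users and does not by itself identify the underlying cause (for example, a memory leak in a specific library).
  - Fix: Alert on user-facing symptoms, i.e. the business impact (e.g., increased error rate), and use traces/logs to diagnose the root cause. Causes are numerous and change over time; symptoms are what users actually experience.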
- Static Thresholds in Autoscaling Environments:
  - Mistake: Setting CPU > 80% as a critical alert in a Kubernetes cluster with Horizontal Pod Autoscaler (HPA).
  - Fix: Use rate-based metrics and SLOs. If HPA scales up, CPU usage should naturally drop. Alerting on absolute values ignores elasticity.
- Cardinality Explosion:
  - Mistake: Adding high-cardinality labels (e.g., user_id, request_id) to metrics.
  - Fix: Reserve high-cardinality data for logs and traces. Metrics should have low-cardinality labels (e.g., service, method, status). High cardinality causes storage bloat and query timeouts.
- Missing Runbooks:
  - Mistake: Alerts fire without linked remediation steps.
  - Fix: Every alert must include a runbook_url annotation. Runbooks should contain diagnostic commands, rollback procedures, and escalation paths.
- Alert Storms Due to Lack of Grouping:
  - Mistake: One underlying issue triggers hundreds of alerts for different instances.
  - Fix: Configure group_by in Alertmanager to aggregate alerts by service and cluster. Use inhibition rules to suppress warnings when critical alerts are active for the same service.
- Ignoring Alert Lifecycle:
  - Mistake: Creating alerts and never reviewing them.
  - Fix: Implement a monthly alert review process. Archive alerts that fire less than once per quarter or have a high snooze rate. Alerts must earn their keep.
- No Synthetic Monitoring:
  - Mistake: Relying solely on internal metrics, missing external user-facing issues.
  - Fix: Deploy synthetic checks (blackbox exporter) from multiple geographic locations to verify availability and latency from the user's perspective.
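The cardinality pitfall in the list above is easy to underestimate, so a rough upper bound helps: the worst-case number of time series for one metric is the product of the distinct values of each label. The sketch below uses made-up label counts purely for illustration; `estimateSeries` is a hypothetical helper.

```typescript
// Worst-case series count for a single metric: the product of the number of
// distinct values each label can take (real counts are often lower, since
// not every combination occurs).
function estimateSeries(labelCardinalities: Record<string, number>): number {
  return Object.values(labelCardinalities).reduce((product, n) => product * n, 1);
}

// Illustrative numbers only.
const lowCardinality = estimateSeries({ service: 20, method: 5, status: 6 });
const withUserId = estimateSeries({ service: 20, method: 5, status: 6, user_id: 100_000 });

console.log(`without user_id: ${lowCardinality} series`);
console.log(`with user_id:    ${withUserId} series`);
```

Adding a single per-user label multiplies the series count by the number of users, which is why identifiers of that kind belong in logs and traces.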
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Early Stage Startup | Managed SaaS (Datadog/NewRelic) | Zero ops overhead, rapid integration, unified UI | High per-host/per-metric cost |
| High-Scale Microservices | Prometheus + Thanos/Cortex | Horizontal scalability, cost-efficient storage, GitOps friendly | High engineering effort, low infra cost |
| Compliance/Audit Heavy | OpenTelemetry + Centralized Log Aggregator | Standardized tracing, immutable audit trails, data residency control | Medium storage cost, medium compliance cost |
| Kubernetes Native | Kube-Prometheus-Stack | Deep K8s integration, pre-built dashboards, declarative config | Medium cluster resource usage |
Configuration Template
alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.internal:587'
  smtp_from: 'alerts@company.com'
route:
  receiver: 'default-slack'
  group_by: ['alertname', 'namespace', 'service']
  group_wait: 10s
  group_interval: 2m
  repeat_interval: 3h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      group_wait: 5s
      repeat_interval: 1h
    - match:
        severity: warning
      receiver: 'slack-warnings'
      repeat_interval: 6h
    # Reached only by alerts that matched neither severity route above.
    - match:
        team: 'platform'
      receiver: 'slack-platform'
receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '<PAGERDUTY_SERVICE_KEY>'
        severity: 'critical'
        description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
  - name: 'default-slack'
    slack_configs:
      - api_url: '<SLACK_WEBHOOK_URL>'
        channel: '#monitoring-alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}*{{ .Annotations.summary }}*{{ end }}'
  - name: 'slack-warnings'
    slack_configs:
      - api_url: '<SLACK_WEBHOOK_URL>'
        channel: '#monitoring-warnings'
  # Every receiver referenced by a route must be defined; the channel below is a placeholder.
  - name: 'slack-platform'
    slack_configs:
      - api_url: '<SLACK_WEBHOOK_URL>'
        channel: '<PLATFORM_SLACK_CHANNEL>'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'namespace', 'service']
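The effect of the `equal` list in the inhibit rule can be modeled in a few lines. This is a simplified sketch with hypothetical names: a target alert is muted when some firing alert matches `source_match` and carries identical values for every label in `equal` (a missing label is treated like an empty value here).

```typescript
type LabelSet = Record<string, string>;

interface InhibitRule {
  sourceMatch: LabelSet;
  targetMatch: LabelSet;
  equal: string[];
}

function matchesAll(labels: LabelSet, matcher: LabelSet): boolean {
  return Object.entries(matcher).every(([key, value]) => labels[key] === value);
}

// True if `target` would be muted by at least one alert in `firing`.
function isInhibited(target: LabelSet, firing: LabelSet[], rule: InhibitRule): boolean {
  if (!matchesAll(target, rule.targetMatch)) return false;
  return firing.some(
    (source) =>
      matchesAll(source, rule.sourceMatch) &&
      rule.equal.every((label) => (source[label] ?? '') === (target[label] ?? '')),
  );
}

// Mirrors the inhibit rule from the template above.
const inhibitRule: InhibitRule = {
  sourceMatch: { severity: 'critical' },
  targetMatch: { severity: 'warning' },
  equal: ['alertname', 'namespace', 'service'],
};

const criticalAlert: LabelSet = { alertname: 'HighErrorRate', namespace: 'payments', service: 'api', severity: 'critical' };
const sameNameWarning: LabelSet = { alertname: 'HighErrorRate', namespace: 'payments', service: 'api', severity: 'warning' };
const otherNameWarning: LabelSet = { alertname: 'HighLatency', namespace: 'payments', service: 'api', severity: 'warning' };

console.log(isInhibited(sameNameWarning, [criticalAlert], inhibitRule));
console.log(isInhibited(otherNameWarning, [criticalAlert], inhibitRule));
```

Because `alertname` is part of `equal`, only the warning that shares the critical alert's name is suppressed. If the goal is to mute all warnings for a service while any critical alert is firing, dropping `alertname` from `equal` is one option to evaluate.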
Quick Start Guide
- Deploy Stack:
  helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
  helm install monitoring prometheus-community/kube-prometheus-stack -n monitoring --create-namespace
- Add Service Monitor:
  Create a ServiceMonitor resource for your application to enable automatic metric scraping.
  apiVersion: monitoring.coreos.com/v1
  kind: ServiceMonitor
  metadata:
    name: my-app
    labels:
      # With the chart's default selectors, Prometheus only discovers resources labeled with the Helm release name.
      release: monitoring
  spec:
    selector:
      matchLabels:
        app: my-app
    endpoints:
      - port: metrics
        path: /metrics
        interval: 15s
- Define First Alert:
  Apply an alert rule via a PrometheusRule resource.
  apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    name: my-app-alerts
    labels:
      # With the chart's default selectors, rules must carry the Helm release label to be loaded.
      release: monitoring
  spec:
    groups:
      - name: my-app
        rules:
          - alert: HighErrorRate
            expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
            for: 2m
            labels:
              severity: warning
            annotations:
              summary: "High 5xx error rate"
- Verify:
  Open the Grafana dashboard, confirm metrics ingestion, and simulate an error condition to validate alert routing to Slack/PagerDuty.