linearly.
Core Solution
Choosing a database is only the first step. The steps below cover what has to be in place for the chosen database to perform well under production load.
Step 1: Workload Characterization
Before deployment, quantify the following:
- Cardinality: Total unique series = Metric Count × Label Combinations.
- Ingestion Rate: Points per second (PPS).
- Query Pattern: Percentage of range queries vs. instant lookups vs. heavy aggregations.
- Retention: Hot (high-res), Warm (downsampled), Cold (archive).
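The quantities above can be combined into a rough capacity estimate. The following is a minimal sketch, not a sizing tool: the helper name estimateWorkload, its field names, and the bytes-per-sample input are hypothetical, and the bytes-per-sample figure in particular is an assumption that should be measured on real data.

```typescript
// Hypothetical helper: rough sizing from the Step 1 workload numbers.
interface WorkloadInput {
  metricCount: number;           // distinct metric names
  labelCombinations: number;     // label value combinations per metric
  sampleIntervalSeconds: number; // how often each series emits a point
  retentionDays: number;
  bytesPerSample: number;        // ASSUMPTION: measure this on your own data
}

interface WorkloadEstimate {
  activeSeries: number;
  pointsPerSecond: number;
  storageBytes: number;
}

function estimateWorkload(w: WorkloadInput): WorkloadEstimate {
  // Cardinality: Metric Count x Label Combinations.
  const activeSeries = w.metricCount * w.labelCombinations;
  // Ingestion rate: every series emits one point per interval.
  const pointsPerSecond = activeSeries / w.sampleIntervalSeconds;
  // Storage for the retention window at the assumed sample size.
  const storageBytes = pointsPerSecond * 86_400 * w.retentionDays * w.bytesPerSample;
  return { activeSeries, pointsPerSecond, storageBytes };
}

const estimate = estimateWorkload({
  metricCount: 200,
  labelCombinations: 500,
  sampleIntervalSeconds: 10,
  retentionDays: 7,
  bytesPerSample: 1,
});
```

Running the same calculation per retention tier gives a first-order view of how much downsampling saves.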
Step 2: Schema and Label Design
Time-series databases rely on labels/tags for indexing. Every unique combination of label values creates a separate series, so poor label design causes series explosion.
- Rule: Never use high-cardinality values (e.g., UUIDs, IP addresses, user IDs) as labels unless strictly necessary for querying.
- Rule: Use labels for filtering and grouping (e.g., region, service, host).
- Rule: Store attributes that are never filtered as fields/metrics, not labels.
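One way to enforce these rules before data reaches the database is to count distinct values per label key in the client or in a test suite. The sketch below assumes a simple in-memory counter; the class name LabelCardinalityTracker and the threshold are hypothetical, and a long-running process would need to bound the memory this uses.

```typescript
// Hypothetical guard: tracks distinct values per label key and reports
// keys whose value count exceeds a configured threshold.
class LabelCardinalityTracker {
  private readonly seen = new Map<string, Set<string>>();
  private readonly maxDistinctValues: number;

  constructor(maxDistinctValues: number) {
    this.maxDistinctValues = maxDistinctValues;
  }

  // Record the label values of one point.
  observe(labels: Record<string, string>): void {
    for (const [key, value] of Object.entries(labels)) {
      let values = this.seen.get(key);
      if (values === undefined) {
        values = new Set<string>();
        this.seen.set(key, values);
      }
      values.add(value);
    }
  }

  // Label keys that have exceeded the threshold so far.
  offenders(): string[] {
    const result: string[] = [];
    this.seen.forEach((values, key) => {
      if (values.size > this.maxDistinctValues) result.push(key);
    });
    return result;
  }
}

const tracker = new LabelCardinalityTracker(3);
for (let i = 0; i < 5; i++) {
  tracker.observe({ region: "us-east", service: "checkout", request_id: `req-${i}` });
}
const offenders = tracker.offenders(); // only request_id exceeds 3 distinct values
```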
Step 3: Implementation Architecture
For high-throughput systems, implement a write buffer and downsampling pipeline.
TypeScript Implementation: High-Performance Write Client
This example demonstrates a batched write client with basic backpressure handling and tag sanitization. It emits InfluxDB line protocol, which InfluxDB accepts natively and VictoriaMetrics accepts on its /write endpoint.
interface MetricPoint {
  metric: string;
  tags: Record<string, string>;
  value: number;
  timestamp?: number; // Unix time in milliseconds; defaults to Date.now()
}

class TimeSeriesWriter {
  private batch: MetricPoint[] = [];
  private flushInterval: number;
  private batchSize: number;
  private endpoint: string;
  private isFlushing: boolean = false;
  private timer: ReturnType<typeof setInterval>;

  constructor(config: { endpoint: string; batchSize?: number; flushInterval?: number }) {
    this.endpoint = config.endpoint;
    this.batchSize = config.batchSize ?? 1000;
    this.flushInterval = config.flushInterval ?? 5000;
    this.timer = setInterval(() => void this.flush(), this.flushInterval);
  }

  /**
   * Adds a point to the batch.
   * Implements basic backpressure: a full batch triggers a flush, and the
   * point is dropped only if an earlier flush is still in flight.
   */
  write(point: MetricPoint): boolean {
    if (this.batch.length >= this.batchSize) {
      // flush() swaps the batch out synchronously before its first await,
      // so the batch is empty again here unless another flush is running.
      void this.flush();
      if (this.batch.length >= this.batchSize) {
        console.warn('Backpressure: Dropping point due to flush latency');
        return false;
      }
    }

    // Drop tags with empty or oversized values. This guards against malformed
    // input; it does not limit cardinality. Special characters are escaped
    // in toLineProtocol() rather than dropped, so series identity is preserved.
    const sanitizedTags: Record<string, string> = {};
    for (const [key, value] of Object.entries(point.tags)) {
      if (value.length > 0 && value.length < 256) {
        sanitizedTags[key] = value;
      }
    }

    this.batch.push({
      ...point,
      tags: sanitizedTags,
      timestamp: point.timestamp ?? Date.now()
    });
    return true;
  }

  /** Stops the periodic flush and attempts one final flush. */
  async close(): Promise<void> {
    clearInterval(this.timer);
    await this.flush();
  }

  private async flush(): Promise<void> {
    if (this.batch.length === 0 || this.isFlushing) return;
    this.isFlushing = true;
    const batchToFlush = this.batch;
    this.batch = [];
    try {
      const payload = batchToFlush
        .map(p => this.toLineProtocol(p))
        .join('\n');
      const response = await fetch(this.endpoint, {
        method: 'POST',
        headers: { 'Content-Type': 'text/plain' },
        body: payload
      });
      if (!response.ok) {
        // Retry logic or dead-letter queue implementation required here.
        // As written, a failed batch is logged and lost.
        console.error(`Write failed: ${response.status}`);
      }
    } catch (error) {
      console.error('Flush network error:', error);
    } finally {
      this.isFlushing = false;
    }
  }

  private toLineProtocol(point: MetricPoint): string {
    // Line protocol requires commas, equals signs, and spaces in tag keys
    // and values to be backslash-escaped.
    const escape = (s: string) => s.replace(/[,= ]/g, '\\$&');
    const tagStr = Object.entries(point.tags)
      .map(([k, v]) => `${escape(k)}=${escape(v)}`)
      .join(',');
    const tagPrefix = tagStr ? `,${tagStr}` : '';
    // Milliseconds to nanoseconds. BigInt is required because the product
    // exceeds Number.MAX_SAFE_INTEGER and would lose precision as a number.
    const ts = BigInt(Math.trunc(point.timestamp!)) * 1_000_000n;
    return `${point.metric}${tagPrefix} value=${point.value} ${ts}`;
  }
}

// Usage
// VictoriaMetrics accepts InfluxDB line protocol on /write and stores this
// point as the series cpu_usage_value{host="server-01",region="us-east"}.
const writer = new TimeSeriesWriter({
  endpoint: 'http://victoria-metrics:8428/write'
});
writer.write({
  metric: 'cpu_usage',
  tags: { host: 'server-01', region: 'us-east' },
  value: 0.75
});
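The write client leaves retry handling as a comment in its flush method. Exponential backoff is one common way to fill that gap; the sketch below is an assumption about how it could look, not part of the original client. The names backoffDelayMs and withRetry, the attempt count, and the delay values are all hypothetical.

```typescript
// Delay before retry number `attempt` (0-based): doubles each time, up to a cap.
function backoffDelayMs(attempt: number, baseMs: number, maxMs: number): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Runs `operation` until it resolves or `maxAttempts` is reached,
// sleeping between attempts. Rethrows the last error on exhaustion.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts: number = 5,
  baseMs: number = 200,
  maxMs: number = 10_000,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts - 1) {
        const delay = backoffDelayMs(attempt, baseMs, maxMs);
        await new Promise<void>(resolve => setTimeout(resolve, delay));
      }
    }
  }
  throw lastError;
}

// Delay schedule for the defaults: 200, 400, 800, 1600 ms between five attempts.
const schedule = [0, 1, 2, 3].map(a => backoffDelayMs(a, 200, 10_000));
```

Inside flush, the fetch call could be wrapped in withRetry with a callback that throws on a non-OK response, so that HTTP errors and network errors follow the same retry path. Batches that still fail after the last attempt need a dead-letter queue or an explicit drop counter.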
Step 4: Downsampling Strategy
Implement continuous queries or materialized views to reduce storage costs and accelerate long-range queries.
- Hot Tier: Raw data, 15s resolution, 7-day retention.
- Warm Tier: 1m resolution averages, 90-day retention.
- Cold Tier: 1h resolution averages/max, 365-day retention.
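To make the tiering concrete, the sketch below shows the bucketing arithmetic behind a warm or cold tier: raw samples are grouped into fixed windows and reduced to an average and a maximum. It is illustrative only. In production this work belongs in the database's own downsampling mechanism, and the names Sample, Bucket, and downsample are hypothetical.

```typescript
interface Sample { timestampMs: number; value: number; }
interface Bucket { startMs: number; avg: number; max: number; count: number; }

// Groups samples into fixed-width time buckets and computes avg/max per bucket.
function downsample(samples: Sample[], bucketMs: number): Bucket[] {
  const acc = new Map<number, { sum: number; max: number; count: number }>();
  for (const s of samples) {
    // Align each sample to the start of its bucket.
    const startMs = Math.floor(s.timestampMs / bucketMs) * bucketMs;
    const b = acc.get(startMs);
    if (b === undefined) {
      acc.set(startMs, { sum: s.value, max: s.value, count: 1 });
    } else {
      b.sum += s.value;
      b.max = Math.max(b.max, s.value);
      b.count += 1;
    }
  }
  const buckets: Bucket[] = [];
  acc.forEach((b, startMs) => {
    buckets.push({ startMs, avg: b.sum / b.count, max: b.max, count: b.count });
  });
  return buckets.sort((a, b) => a.startMs - b.startMs);
}

// Five raw points at 15s resolution reduced to 1m resolution.
const rawSamples: Sample[] = [
  { timestampMs: 0, value: 1 },
  { timestampMs: 15_000, value: 3 },
  { timestampMs: 30_000, value: 2 },
  { timestampMs: 45_000, value: 6 },
  { timestampMs: 60_000, value: 10 },
];
const oneMinute = downsample(rawSamples, 60_000);
```

Keeping the maximum alongside the average matters for the cold tier: averaging alone flattens the spikes that long-range queries are often looking for.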
Pitfall Guide
- High Cardinality Labels:
  - Mistake: Adding user_id or request_id as a tag.
  - Impact: Series count explodes, causing index bloat, memory exhaustion, and query timeouts.
  - Fix: Use labels only for low-cardinality dimensions. Store high-cardinality data in a separate log store or relational table.
- Out-of-Order Writes:
  - Mistake: Sending data with timestamps older than the current write head.
  - Impact: Many TSDBs rewrite data or create separate chunks for out-of-order points, severely degrading write performance and compression.
  - Fix: Ensure clients sync clocks via NTP. Buffer and sort writes client-side if network latency causes jitter.
- Ignoring Retention Policies:
  - Mistake: Storing raw metrics indefinitely.
  - Impact: Storage costs grow linearly; query performance degrades as data volume increases.
  - Fix: Configure automatic retention policies and downsampling at the database level.
- Treating TSDB as SQL:
  - Mistake: Writing complex joins across millions of series in a TSDB.
  - Impact: Query engine struggles with set operations on time-series; latency spikes.
  - Fix: Use TSDBs for time-based aggregations. Push complex relational logic to a data warehouse or application layer.
- Pull vs. Push Model Mismatch:
  - Mistake: Using a pull-based model (Prometheus) for ephemeral or edge devices.
  - Impact: Short-lived instances can exit before they are scraped, and devices with intermittent connectivity cannot be reached on schedule; data gaps occur.
  - Fix: Use push-based ingestion (VictoriaMetrics, InfluxDB) for dynamic environments; reserve pull-based scraping for stable, long-lived targets.
- Compression Misconfiguration:
  - Mistake: Disabling compression to "improve" query speed.
  - Impact: IO bandwidth becomes the bottleneck; storage costs skyrocket.
  - Fix: Modern TSDBs use compression that often speeds up queries by reducing IO. Keep compression enabled unless profiling proves otherwise.
- Cluster Quorum Latency:
  - Mistake: Deploying a clustered TSDB across regions without understanding replication lag.
  - Impact: Write latency increases; potential data inconsistency during network partitions.
  - Fix: Deploy clusters within a single region. Use cross-region replication for disaster recovery, not active-active writes.
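The fix for out-of-order writes mentions buffering and sorting on the client. A minimal sketch of that idea follows, assuming a fixed reordering window: points are held until they are older than the window, then released in timestamp order. The class name ReorderBuffer and the window size are hypothetical, and points that arrive later than the window still reach the database out of order.

```typescript
interface TimedPoint { timestampMs: number; value: number; }

// Holds recent points so that late arrivals within `windowMs` can be
// emitted in timestamp order.
class ReorderBuffer {
  private pending: TimedPoint[] = [];
  private readonly windowMs: number;

  constructor(windowMs: number) {
    this.windowMs = windowMs;
  }

  add(point: TimedPoint): void {
    this.pending.push(point);
  }

  // Returns points at or before (nowMs - windowMs), sorted by timestamp.
  // Newer points stay buffered for a later drain.
  drain(nowMs: number): TimedPoint[] {
    const cutoff = nowMs - this.windowMs;
    const ready = this.pending.filter(p => p.timestampMs <= cutoff);
    this.pending = this.pending.filter(p => p.timestampMs > cutoff);
    return ready.sort((a, b) => a.timestampMs - b.timestampMs);
  }
}

const reorder = new ReorderBuffer(5_000);
reorder.add({ timestampMs: 3_000, value: 0.3 });
reorder.add({ timestampMs: 1_000, value: 0.1 });
reorder.add({ timestampMs: 2_000, value: 0.2 });
reorder.add({ timestampMs: 9_000, value: 0.9 });
const drained = reorder.drain(10_000); // cutoff is 5000, so the 9000 point stays buffered
const later = reorder.drain(20_000);
```

The window trades ingestion latency for ordering: a larger window absorbs more jitter but delays every point by at least that long.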
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Kubernetes Monitoring | VictoriaMetrics or Prometheus | Native K8s integration; efficient scraping; scalable storage options. | Low to Medium |
| IoT Telemetry (High Volume) | InfluxDB or TimescaleDB | Flexible schema handles varying device payloads; high write throughput. | Medium |
| Financial Analytics | TimescaleDB | SQL compatibility allows complex joins with transactional data; robust tooling. | High |
| Long-Term Cold Storage | VictoriaMetrics | Superior compression ratios minimize storage costs for archival data. | Very Low |
| Edge/Offline Devices | InfluxDB (Push) | Push model handles intermittent connectivity better than pull-based scraping. | Medium |
| Multi-Tenant SaaS | TimescaleDB or M3 | Strong isolation capabilities; SQL-based access control; enterprise features. | High |
Configuration Template
VictoriaMetrics Single-Node Deployment (Docker Compose)
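A single-node starting point that sets retention, a per-series label limit, a query timeout, and a persistent storage path. Replace the default Grafana admin password before exposing the stack.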
version: '3.8'
services:
  victoriametrics:
    image: victoriametrics/victoria-metrics:v1.93.0
    ports:
      - "8428:8428"
    volumes:
      - vm-data:/data
    command:
      - '--storageDataPath=/data'
      - '--retentionPeriod=12'
      - '--maxLabelsPerTimeseries=30'
      - '--search.maxQueryDuration=30s'
      - '--envflag.enable=true'
    environment:
      - VM_ENABLE_ENVIRONMENT_VARIABLES=true
    deploy:
      resources:
        limits:
          memory: 4G
    restart: unless-stopped
  grafana:
    image: grafana/grafana:10.2.0
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    depends_on:
      - victoriametrics
volumes:
  vm-data:
    driver: local
Key Config Flags:
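- --retentionPeriod: Automatically drops data older than the given period. A bare number is interpreted as months, so 12 keeps one year.
- --maxLabelsPerTimeseries: Caps the number of labels accepted per series. It limits labels per series, not the number of series, so it does not prevent cardinality explosion on its own.
- --search.maxQueryDuration: Limits how long a single query may run, protecting the instance from runaway queries.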
Quick Start Guide
- Deploy: Run the Docker Compose template above with docker compose up -d. Access VictoriaMetrics at http://localhost:8428.
- Ingest Data: Use curl or the TypeScript client to write a sample metric.
echo 'cpu_usage{host="test"} 0.85' | curl -X POST --data-binary @- http://localhost:8428/api/v1/import/prometheus
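- Verify Query: Query the metric via the API. If the result is empty, wait a few seconds and retry; recently ingested samples are not always immediately queryable.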
curl 'http://localhost:8428/api/v1/query?query=cpu_usage'
- Visualize: Open Grafana (http://localhost:3000), add VictoriaMetrics as a data source, and create a dashboard panel using the query cpu_usage.
- Monitor: Check the /metrics endpoint on VictoriaMetrics to verify that vm_cache_entries_count and vm_rows_added_to_storage_total are increasing as expected.
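The monitoring step can be automated by reading a counter out of the /metrics response, which uses the Prometheus text exposition format. The sketch below parses that text and sums every sample of one metric. The function name sumMetric is hypothetical, the sample input is made up for illustration, and the parser ignores edge cases such as optional trailing timestamps on labeled lines with unusual spacing.

```typescript
// Sums all samples of `metricName` in Prometheus text exposition format.
// Returns undefined when the metric does not appear at all.
function sumMetric(exposition: string, metricName: string): number | undefined {
  let total: number | undefined;
  for (const line of exposition.split("\n")) {
    const trimmed = line.trim();
    // Skip blank lines and # HELP / # TYPE comments.
    if (trimmed === "" || trimmed.startsWith("#")) continue;
    // The metric name ends at '{' (labels follow) or at whitespace.
    const nameEnd = trimmed.search(/[{\s]/);
    if (nameEnd === -1 || trimmed.slice(0, nameEnd) !== metricName) continue;
    const closing = trimmed.lastIndexOf("}");
    const rest = closing >= 0 ? trimmed.slice(closing + 1) : trimmed.slice(nameEnd);
    // The first token after the name and labels is the sample value.
    const value = Number(rest.trim().split(/\s+/)[0]);
    if (!Number.isNaN(value)) total = (total ?? 0) + value;
  }
  return total;
}

// Made-up sample input; example_labeled_total is not a real VictoriaMetrics metric.
const sampleExposition = [
  "# TYPE vm_rows_added_to_storage_total counter",
  "vm_rows_added_to_storage_total 42",
  'example_labeled_total{kind="a"} 7',
  'example_labeled_total{kind="b"} 3',
].join("\n");

const rowsAdded = sumMetric(sampleExposition, "vm_rows_added_to_storage_total");
const labeledTotal = sumMetric(sampleExposition, "example_labeled_total");
const missing = sumMetric(sampleExposition, "not_present");
```

Comparing two readings taken a scrape interval apart shows whether ingestion is progressing; a counter that stays flat while clients are writing points to a problem on the write path.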