ional HTTP/Uptime Monitor | 15–30 mins | ~12% | 0% | Low |
| Log Aggregation + Regex Alerting | 5–10 mins | ~25% | 60% | High |
| Heartbeat Monitoring (Tickstem) | <1 min (configurable) | ~2% | 95% | Low-Medium |
Key Findings:
- Detection Latency: Heartbeat systems alert within the configured
grace window (typically 1–5 minutes past the expected interval), drastically reducing mean time to detect (MTTD) compared to log parsing or manual checks.
- False Positive Reduction: By decoupling alerting from network jitter and process restarts, heartbeat monitoring achieves a ~2% false positive rate when grace windows are properly calibrated.
- Sweet Spot: The optimal configuration aligns
interval with the job's SLA frequency and sets grace window to job_max_runtime + network_variance + 10% buffer. This captures silent failures without triggering alert fatigue during normal execution variance.
Core Solution
Heartbeat monitoring implements a dead-man's switch pattern. Instead of external probes, the scheduled job authenticates and pings the monitoring service upon verified completion. The service tracks the last successful ping timestamp and triggers alerts only when consecutive pings are missing beyond the configured tolerance.
Configuration Parameters:
interval: Expected frequency between successful runs (e.g., 86400 seconds for daily jobs)
grace window: Buffer past the deadline before alerting (e.g., 3600 seconds)
- Alert triggers if no ping arrives for two consecutive intervals
Design Principles:
- Success-only pings ensure alerts only fire when business logic actually completes
- Non-fatal network calls prevent transient connectivity issues from aborting valid workloads
- Token-based authentication isolates credentials and enables granular revocation
Go:
import "github.com/tickstem/heartbeat"
client := heartbeat.New(os.Getenv("TICKSTEM_API_KEY"))
hb, err := client.Create(ctx, heartbeat.CreateParams{
Name: "nightly-sync",
IntervalSecs: 86400,
GraceSecs: 3600,
})
// at the end of every successful run — token is the credential, no API key needed
if err := client.Ping(ctx, hb.Token); err != nil {
log.Println("heartbeat ping failed:", err) // non-fatal
}
Enter fullscreen mode Exit fullscreen mode
Node.js:
import { HeartbeatClient } from "@tickstem/heartbeat"
const hb = new HeartbeatClient(process.env.TICKSTEM_API_KEY)
const heartbeat = await hb.create({ name: "nightly-sync", interval_secs: 86400 })
// at the end of every successful run
await hb.ping(heartbeat.token).catch(err => console.error("ping failed:", err))
Enter fullscreen mode Exit fullscreen mode
Or just curl — no SDK needed:
curl -s -X POST https://api.tickstem.dev/v1/heartbeats/$HEARTBEAT_TOKEN/ping
Enter fullscreen mode Exit fullscreen mode
The token goes in the URL. No auth header. If curl fails, the script still exits cleanly.
The thing worth noting
The ping only happens on success. Silence means something went wrong — either the job crashed, was never scheduled, or completed without doing its actual work. That's the point.
Make the ping non-fatal though. A transient network blip shouldn't abort a successful sync.
When to use it
Any job where "it ran" and "it did something useful" are different things:
- Database backups
- Data sync / ETL pipelines
- Report generation
- Invoice or payment processing
- Cache warming
Uptime monitoring and heartbeat monitoring are complementary. Uptime = server is alive. Heartbeat = job actually did its job.
Pitfall Guide
- Pinging on Start or Failure: Triggering the heartbeat at job initialization or in
catch/finally blocks defeats the dead-man's switch. Only ping after all business logic, data validation, and side effects have completed successfully.
- Making Ping Calls Fatal: Wrapping the ping in a fatal error handler (
panic, process.exit(1), or uncaught exceptions) turns a monitoring tool into a single point of failure. Always isolate the ping call, catch network errors, and log them without interrupting the primary workflow.
- Misaligning Grace Windows: Setting the grace window too tight causes alert fatigue during normal execution variance. Setting it too loose delays incident response. Calculate as:
grace = max_expected_runtime + p95_network_latency + 10%_safety_margin.
- Using Master API Keys for Pings: Embedding root credentials in ping calls violates least-privilege principles. Always use scoped heartbeat tokens generated during
Create(). Tokens are revocable, auditable, and limit blast radius if leaked.
- Confusing Interval with Execution Duration: The
interval parameter defines how often the job should run, not how long it takes. A 4-hour ETL job running daily requires interval_secs: 86400, not 14400. Misconfiguration here causes immediate false alerts.
- Ignoring Downstream Dependency Health: A heartbeat confirms the job ran, not that upstream data sources or downstream consumers are healthy. Pair heartbeat monitoring with data quality checks (row counts, checksums, schema validation) for end-to-end reliability.
Deliverables
📐 Architecture Blueprint
- System flow diagram: Job execution → Success validation → Heartbeat ping → Tickstem state tracking → Alert routing
- Configuration matrix for common workload types (DB backups, ETL pipelines, report generators, payment processors)
- Token lifecycle management: creation, rotation, revocation, and audit logging strategies
✅ Pre-Deployment Checklist