consistently achieve higher availability with lower infrastructure cost.
Core Solution
Implementing a production-grade retry strategy requires four architectural decisions: explicit error classification, bounded backoff with jitter, circuit-state awareness, and observability integration. The following implementation demonstrates these principles in TypeScript.
Step 1: Define Retryable Error Taxonomy
Not all failures warrant retries. Classify errors into three categories:
- Retryable: 5xx server errors, network timeouts, connection resets, 429 rate limits with a Retry-After header
- Non-retryable: 4xx client errors (except 429), authentication failures, malformed requests
- Conditional: Idempotency-dependent operations, partial payloads, degraded but responsive services
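The three categories above can be expressed as a small classifier. This is a minimal sketch: the names `classifyFailure`, `FailureInfo`, and `RetryClass` are hypothetical, and treating a 429 without Retry-After as conditional is an assumption made here, not a rule from the taxonomy.

```typescript
// Hypothetical classifier for the taxonomy above; all names are illustrative.
type RetryClass = 'retryable' | 'non-retryable' | 'conditional';

interface FailureInfo {
  status?: number;         // HTTP status; undefined for network-level failures
  hasRetryAfter?: boolean; // true when the response carried a Retry-After header
  idempotent?: boolean;    // false when repeating the request is not known to be safe
}

function classifyFailure(f: FailureInfo): RetryClass {
  const status = f.status;
  // 4xx client errors (except 429) will fail the same way again.
  if (status !== undefined && status >= 400 && status <= 499 && status !== 429) {
    return 'non-retryable';
  }
  // Operations that are not known to be idempotent need a caller decision.
  if (f.idempotent === false) return 'conditional';
  // No status: the request timed out or the connection was reset.
  if (status === undefined) return 'retryable';
  // Assumption: a 429 without Retry-After gives no recovery hint, so defer to the caller.
  if (status === 429) return f.hasRetryAfter ? 'retryable' : 'conditional';
  if (status >= 500 && status <= 599) return 'retryable';
  // Anything else (2xx, 3xx) is not a transient failure.
  return 'non-retryable';
}

console.log(classifyFailure({ status: 503 }));                      // retryable
console.log(classifyFailure({ status: 404 }));                      // non-retryable
console.log(classifyFailure({ status: 500, idempotent: false }));   // conditional
```

Keeping classification in one function makes the policy reviewable in isolation and easy to unit test.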
Step 2: Implement Bounded Exponential Backoff with Jitter
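Jitter prevents synchronized retry bursts. The variant used here adds a bounded random component on top of the exponential term: with jitterFactor below 1, each delay stays at or above the previous one until maxDelayMs is reached, while the exact timing differs across clients. Note that this is an additive scheme rather than the formula usually called decorrelated jitter, which draws each delay at random from a range derived from the previous delay and does not guarantee monotonic growth.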
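type RetryableError = Error & { status?: number; headers?: Headers };

interface RetryConfig {
  maxAttempts: number;         // total attempts, including the first call
  baseDelayMs: number;         // delay before the first retry
  maxDelayMs: number;          // upper bound on any single delay
  jitterFactor: number;        // fraction of the delay added as random jitter
  retryableStatuses: number[]; // HTTP statuses treated as transient
  timeoutMs: number;           // per-attempt timeout
}

const DEFAULT_CONFIG: RetryConfig = {
  maxAttempts: 3,
  baseDelayMs: 100,
  maxDelayMs: 5000,
  jitterFactor: 0.5,
  retryableStatuses: [429, 500, 502, 503, 504],
  timeoutMs: 30000,
};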
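To see what these defaults mean in practice, the sketch below computes the range a delay can fall in for each attempt. It assumes the delay is the capped exponential term plus a random fraction of it (at most jitterFactor), capped again at maxDelayMs; `delayBounds` and `BackoffParams` are hypothetical names used only for illustration.

```typescript
interface BackoffParams {
  baseDelayMs: number;
  maxDelayMs: number;
  jitterFactor: number;
}

// Returns [lowest, highest] possible delay in ms for a zero-based attempt.
// The upper bound is approached but never reached, because Math.random() < 1.
function delayBounds(attempt: number, p: BackoffParams): [number, number] {
  const exponential = Math.min(p.baseDelayMs * 2 ** attempt, p.maxDelayMs);
  const upper = Math.min(exponential * (1 + p.jitterFactor), p.maxDelayMs);
  return [exponential, upper];
}

const params: BackoffParams = { baseDelayMs: 100, maxDelayMs: 5000, jitterFactor: 0.5 };
for (let attempt = 0; attempt < 3; attempt++) {
  const [lo, hi] = delayBounds(attempt, params);
  console.log(`attempt ${attempt}: ${lo}-${hi} ms`);
}
// attempt 0: 100-150 ms
// attempt 1: 200-300 ms
// attempt 2: 400-600 ms
```

One consequence worth noting: once the exponential term reaches maxDelayMs, both bounds collapse to the cap and the jitter disappears, so a cap that is reached often will resynchronize clients.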
Step 3: Build the Retry Wrapper
The wrapper enforces bounds, respects Retry-After, applies jitter, and integrates with a circuit breaker state.
export class ApiRetryExecutor {
  private static readonly CIRCUIT_COOLDOWN_MS = 30000;
  private circuitOpen = false;
  private lastFailureTime = 0;
  private readonly config: RetryConfig;

  constructor(config: Partial<RetryConfig> = {}) {
    this.config = { ...DEFAULT_CONFIG, ...config };
  }

  private isRetryable(status: number | undefined): boolean {
    // No status means no HTTP response arrived (timeout, connection reset): treat as transient.
    if (status === undefined) return true;
    return this.config.retryableStatuses.includes(status);
  }

  private calculateDelay(attempt: number, retryAfterSec?: number): number {
    // A valid Retry-After value in seconds overrides the computed backoff, capped at maxDelayMs.
    if (retryAfterSec !== undefined && Number.isFinite(retryAfterSec) && retryAfterSec >= 0) {
      return Math.min(retryAfterSec * 1000, this.config.maxDelayMs);
    }
    const exponential = Math.min(
      this.config.baseDelayMs * Math.pow(2, attempt),
      this.config.maxDelayMs
    );
    // Additive jitter: a random fraction of the exponential term spreads clients out.
    const jitter = exponential * this.config.jitterFactor * Math.random();
    return Math.min(exponential + jitter, this.config.maxDelayMs);
  }

  private shouldRetry(attempt: number, error: RetryableError): boolean {
    if (this.circuitOpen) return false;
    if (!this.isRetryable(error.status)) return false;
    this.lastFailureTime = Date.now();
    // Final attempt failed: open the circuit and stop without sleeping again.
    if (attempt >= this.config.maxAttempts - 1) {
      this.circuitOpen = true;
      return false;
    }
    return true;
  }

  // requestFn receives an AbortSignal; it must pass the signal to its HTTP client
  // for the per-attempt timeout to take effect.
  async execute<T>(requestFn: (signal: AbortSignal) => Promise<T>): Promise<T> {
    if (this.circuitOpen) {
      // Fail fast while the cooldown is running; allow traffic again once it has elapsed.
      if (Date.now() - this.lastFailureTime < ApiRetryExecutor.CIRCUIT_COOLDOWN_MS) {
        throw new Error('Circuit open: request rejected without calling downstream');
      }
      this.circuitOpen = false;
    }
    let lastError: RetryableError | null = null;
    for (let attempt = 0; attempt < this.config.maxAttempts; attempt++) {
      const controller = new AbortController();
      const timeoutId = setTimeout(() => controller.abort(), this.config.timeoutMs);
      try {
        const response = await requestFn(controller.signal);
        clearTimeout(timeoutId);
        return response;
      } catch (error) {
        clearTimeout(timeoutId);
        lastError = error as RetryableError;
        if (!this.shouldRetry(attempt, lastError)) break;
        // Only the delay-seconds form is handled here; an HTTP-date parses to NaN
        // and falls back to the computed backoff.
        const retryAfter = lastError.headers?.get('Retry-After');
        const delay = this.calculateDelay(attempt, retryAfter ? parseInt(retryAfter, 10) : undefined);
        await new Promise(res => setTimeout(res, delay));
      }
    }
    throw lastError ?? new Error('Retry execution failed without capturing error');
  }
}
Step 4: Architecture Decisions & Rationale
- Bounded Execution: maxAttempts and timeoutMs prevent resource exhaustion. Unbounded retries are a common cause of memory growth and connection or thread starvation in high-throughput services.
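- Jitter Strategy: Jitter that scales the whole delay by a random factor can produce a shorter delay than the previous attempt. The implementation above instead adds a random fraction on top of the exponential term, so delays do not shrink between attempts as long as jitterFactor is below 1, while phase alignment across clients is still randomized. This is an additive scheme; the formula usually called decorrelated jitter draws each delay from a range derived from the previous delay and does not guarantee increasing delays either.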
- Circuit Integration: The lightweight circuit breaker prevents retry storms during prolonged outages. Production systems should replace this with a dedicated circuit breaker library (e.g., Opossum, resilience4j) that tracks failure rates, half-open states, and fallback execution.
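- Idempotency Enforcement: Retry wrappers must never be applied to non-idempotent operations without explicit idempotency keys. The execution layer should inject an Idempotency-Key header for POST and PATCH requests, and the server must deduplicate on that key, for retries to be safe. PUT is idempotent by definition in HTTP, though a key can still be sent if the downstream service expects one.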
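The idempotency point can be sketched as a small header helper. This is illustrative only: `withIdempotencyKey` and `DEFAULT_KEYED_METHODS` are hypothetical names, which methods receive a key is a policy choice, and the approach only works if the downstream service actually deduplicates on the Idempotency-Key header.

```typescript
import { randomUUID } from 'node:crypto';

// Assumption: POST and PATCH are the methods that need a key by default.
const DEFAULT_KEYED_METHODS: ReadonlySet<string> = new Set(['POST', 'PATCH']);

function withIdempotencyKey(
  method: string,
  headers: Record<string, string>,
  key: string,
  keyedMethods: ReadonlySet<string> = DEFAULT_KEYED_METHODS
): Record<string, string> {
  if (!keyedMethods.has(method.toUpperCase())) return headers;
  // Keep a caller-supplied key (exact-case match only, for simplicity).
  if ('Idempotency-Key' in headers) return headers;
  return { ...headers, 'Idempotency-Key': key };
}

// Generate the key once per logical operation, outside the retry loop,
// so that every attempt carries the same value.
const operationKey = randomUUID();
const firstAttempt = withIdempotencyKey('POST', { 'Content-Type': 'application/json' }, operationKey);
const secondAttempt = withIdempotencyKey('POST', { 'Content-Type': 'application/json' }, operationKey);
console.log(firstAttempt['Idempotency-Key'] === secondAttempt['Idempotency-Key']); // true
```

Generating a fresh key inside the retry loop would defeat the purpose, because the server would see each attempt as a new operation.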
Pitfall Guide
1. Retrying Non-Idempotent or 4xx Errors
Retrying 400, 401, 403, or 404 responses wastes bandwidth and masks client-side bugs. Non-idempotent POST requests without idempotency keys create duplicate side effects. Always classify errors explicitly and enforce idempotency keys for state-mutating operations.
2. Ignoring Jitter or Using Simple Randomization
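Fixed delays cause synchronized retry bursts. Jitter that scales the whole delay, such as Math.random() * delay, desynchronizes clients but can produce a shorter delay than the previous attempt, so it gives up any guaranteed minimum wait. If a floor on the delay matters, add the random component on top of the exponential term instead of multiplying the whole delay by it.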
3. Unbounded Retry Loops
Missing maxAttempts or timeoutMs allows retry logic to consume memory and connections indefinitely. In Kubernetes environments, this can trigger OOMKills and pod restart cycles that amplify the original failure.
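One way to keep a retry loop bounded in both dimensions is to check an attempt cap and an overall time budget before every sleep. The sketch below is an assumption-laden illustration: `totalBudgetMs` is a hypothetical setting that does not exist in the RetryConfig shown earlier, and `nextDelayOrStop` is an illustrative name.

```typescript
interface RetryBudget {
  maxAttempts: number;   // hard cap on attempts, including the first call
  totalBudgetMs: number; // hypothetical overall deadline across all attempts
}

// Returns the delay to wait before the next attempt, or null when retrying must stop.
function nextDelayOrStop(
  attemptsMade: number,
  elapsedMs: number,
  proposedDelayMs: number,
  budget: RetryBudget
): number | null {
  if (attemptsMade >= budget.maxAttempts) return null;
  const remainingMs = budget.totalBudgetMs - elapsedMs;
  // Stop when the wait alone would use up the remaining budget.
  if (remainingMs <= proposedDelayMs) return null;
  return proposedDelayMs;
}

const budget: RetryBudget = { maxAttempts: 3, totalBudgetMs: 1000 };
console.log(nextDelayOrStop(1, 100, 200, budget)); // 200
console.log(nextDelayOrStop(3, 100, 200, budget)); // null
console.log(nextDelayOrStop(1, 900, 200, budget)); // null
```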
4. Ignoring Retry-After Headers
Rate limiters and API gateways communicate recovery windows via Retry-After. Ignoring this header causes premature retries that can extend rate limit windows and trigger stricter throttling tiers. The value is either a number of seconds or an HTTP date; always parse and honor the header when present.
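Retry-After can carry either a number of seconds or an HTTP date, so a parser needs to handle both forms. The following is a minimal sketch; `parseRetryAfterMs` is a hypothetical helper, and the caller is still expected to cap the result at its own maximum delay.

```typescript
// Converts a Retry-After header value to milliseconds.
// Returns undefined when the header is missing or cannot be parsed.
function parseRetryAfterMs(
  value: string | null | undefined,
  nowMs: number = Date.now()
): number | undefined {
  if (value == null) return undefined;
  const trimmed = value.trim();
  if (trimmed === '') return undefined;
  // Form 1: delay-seconds, a non-negative integer.
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  // Form 2: HTTP-date, converted to a delay relative to nowMs.
  const dateMs = Date.parse(trimmed);
  if (Number.isNaN(dateMs)) return undefined;
  // A date in the past means the client may retry immediately.
  return Math.max(0, dateMs - nowMs);
}

console.log(parseRetryAfterMs('120')); // 120000
const oneMinuteBefore = Date.UTC(2015, 9, 21, 7, 27, 0);
console.log(parseRetryAfterMs('Wed, 21 Oct 2015 07:28:00 GMT', oneMinuteBefore)); // 60000
```

Taking the current time as a parameter keeps the function deterministic in tests.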
5. Missing Circuit Breaker Fallback
Retrying into a completely degraded service increases mean time to recovery (MTTR). Without a circuit breaker or fallback path, retries consume resources that could serve degraded-mode responses or cached data.
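A breaker with an explicit half-open state and a fallback path might look like the sketch below. It is deliberately simplified: it counts consecutive failures rather than a failure rate, it allows concurrent probes while half-open, and `SimpleCircuitBreaker` and `callWithFallback` are hypothetical names, not the API of any library.

```typescript
type CircuitState = 'closed' | 'open' | 'half-open';

class SimpleCircuitBreaker {
  private state: CircuitState = 'closed';
  private consecutiveFailures = 0;
  private openedAtMs = 0;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(failureThreshold: number, cooldownMs: number, now: () => number = () => Date.now()) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  currentState(): CircuitState {
    // After the cooldown, let probe requests through.
    if (this.state === 'open' && this.now() - this.openedAtMs >= this.cooldownMs) {
      this.state = 'half-open';
    }
    return this.state;
  }

  allowRequest(): boolean {
    return this.currentState() !== 'open';
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.state = 'closed';
  }

  recordFailure(): void {
    this.consecutiveFailures++;
    // A failed probe, or too many failures in a row, opens the circuit.
    if (this.currentState() === 'half-open' || this.consecutiveFailures >= this.failureThreshold) {
      this.state = 'open';
      this.openedAtMs = this.now();
    }
  }
}

// Serves the fallback (for example cached data) when the circuit is open or the call fails.
async function callWithFallback<T>(
  breaker: SimpleCircuitBreaker,
  fn: () => Promise<T>,
  fallback: () => T
): Promise<T> {
  if (!breaker.allowRequest()) return fallback();
  try {
    const result = await fn();
    breaker.recordSuccess();
    return result;
  } catch {
    breaker.recordFailure();
    return fallback();
  }
}
```

The injectable clock (`now`) exists so that the cooldown can be tested without real waiting.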
6. Inadequate Retry Observability
Without distinguishing retries from initial requests in metrics and traces, teams cannot measure retry-induced load or correlate latency spikes with backoff misconfigurations. Instrument http.retry.count, http.retry.delay_ms, and http.retry.success at the middleware layer.
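As an illustration of where these metrics could be emitted, the sketch below records them against a small sink interface. `MetricsSink`, `InMemoryMetrics`, `recordRetry`, and `recordRetryOutcome` are hypothetical; a real service would back the sink with whatever metrics client it already uses.

```typescript
// Hypothetical sink interface, not the API of any metrics library.
interface MetricsSink {
  increment(name: string): void;
  observe(name: string, value: number): void;
}

// In-memory implementation, useful for tests and local experiments.
class InMemoryMetrics implements MetricsSink {
  readonly counters = new Map<string, number>();
  readonly samples = new Map<string, number[]>();

  increment(name: string): void {
    this.counters.set(name, (this.counters.get(name) ?? 0) + 1);
  }

  observe(name: string, value: number): void {
    const list = this.samples.get(name) ?? [];
    list.push(value);
    this.samples.set(name, list);
  }
}

// Call once per retry, just before sleeping for delayMs.
function recordRetry(sink: MetricsSink, delayMs: number): void {
  sink.increment('http.retry.count');
  sink.observe('http.retry.delay_ms', delayMs);
}

// Call when a request that needed at least one retry finishes.
function recordRetryOutcome(sink: MetricsSink, succeeded: boolean): void {
  if (succeeded) sink.increment('http.retry.success');
}

const metrics = new InMemoryMetrics();
recordRetry(metrics, 120);
recordRetry(metrics, 260);
recordRetryOutcome(metrics, true);
console.log(metrics.counters.get('http.retry.count'));   // 2
console.log(metrics.samples.get('http.retry.delay_ms')); // [ 120, 260 ]
```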
7. Hardcoded Delays Instead of Dynamic Adjustment
Static backoff parameters are tuned for one load profile and degrade when conditions change. Adaptive strategies that adjust to downstream response times, error rates, and queue depth can back off harder during incidents and recover faster afterwards. Use telemetry-driven backoff tuning in production.
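A very simple form of dynamic adjustment scales the base delay with the recently observed error rate. The linear formula and the `maxMultiplier` default below are assumptions chosen for illustration, not a recommendation from this guide; `adaptiveBaseDelay` is a hypothetical name.

```typescript
// Scales the base delay linearly from 1x (no errors) to maxMultiplier (all requests failing).
// errorRate is the fraction of recent requests that failed, clamped to [0, 1].
function adaptiveBaseDelay(baseDelayMs: number, errorRate: number, maxMultiplier = 4): number {
  const clamped = Math.min(Math.max(errorRate, 0), 1);
  return baseDelayMs * (1 + clamped * (maxMultiplier - 1));
}

console.log(adaptiveBaseDelay(100, 0));   // 100
console.log(adaptiveBaseDelay(100, 0.5)); // 250
console.log(adaptiveBaseDelay(100, 1));   // 400
```

The result would replace baseDelayMs when computing the next delay; how the error rate is measured (sliding window, decay) is left open here.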
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Public third-party API with strict rate limits | Exponential + Jitter + Retry-After parsing | Prevents quota exhaustion and respects provider backoff signals | Low infrastructure cost, higher latency tolerance |
| Internal microservice mesh with known degradation patterns | Adaptive circuit-breaker + dynamic backoff | Reduces downstream load during partial outages, enables graceful degradation | Moderate complexity, lowers compute/network waste |
| High-frequency idempotent writes (event ingestion) | Fixed low delay + idempotency keys + batch retry | Optimizes throughput while deduplicating retried writes (effectively-once processing) | Higher retry budget, lower deduplication storage cost |
| Real-time user-facing requests (<200ms SLO) | Single retry + aggressive timeout + fallback cache | Minimizes P99 latency impact, prevents retry-induced timeout cascades | Slightly lower success rate, higher cache hit ratio |
Configuration Template
// Keys beyond the RetryConfig fields (respectRetryAfter, circuitBreakerThreshold,
// halfOpenProbeInterval, idempotencyKeyHeader, batchRetryEnabled) are not read by
// ApiRetryExecutor as written; they assume an extended executor.
export const retryProfiles = {
  strict_rate_limited: {
    maxAttempts: 4,
    baseDelayMs: 200,
    maxDelayMs: 8000,
    jitterFactor: 0.6,
    retryableStatuses: [429, 503],
    timeoutMs: 15000,
    respectRetryAfter: true,
  },
  internal_mesh: {
    maxAttempts: 3,
    baseDelayMs: 50,
    maxDelayMs: 2000,
    jitterFactor: 0.4,
    retryableStatuses: [500, 502, 503, 504],
    timeoutMs: 5000,
    circuitBreakerThreshold: 0.5,
    halfOpenProbeInterval: 10000,
  },
  idempotent_writes: {
    maxAttempts: 5,
    baseDelayMs: 100,
    maxDelayMs: 3000,
    jitterFactor: 0.5,
    retryableStatuses: [429, 500, 502, 503, 504],
    timeoutMs: 10000,
    idempotencyKeyHeader: 'Idempotency-Key',
    batchRetryEnabled: true,
  },
};
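Because the profiles carry keys that the executor shown earlier does not consume, it can help to extract only the core fields before constructing it. A minimal sketch, assuming the RetryConfig shape from Step 2; `toRetryConfig` is a hypothetical helper.

```typescript
interface RetryConfig {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
  jitterFactor: number;
  retryableStatuses: number[];
  timeoutMs: number;
}

// Copies only the fields the executor understands and drops profile-specific extras.
function toRetryConfig<T extends RetryConfig>(profile: T): RetryConfig {
  return {
    maxAttempts: profile.maxAttempts,
    baseDelayMs: profile.baseDelayMs,
    maxDelayMs: profile.maxDelayMs,
    jitterFactor: profile.jitterFactor,
    retryableStatuses: profile.retryableStatuses.slice(), // copy to avoid shared mutation
    timeoutMs: profile.timeoutMs,
  };
}

const strictRateLimited = {
  maxAttempts: 4,
  baseDelayMs: 200,
  maxDelayMs: 8000,
  jitterFactor: 0.6,
  retryableStatuses: [429, 503],
  timeoutMs: 15000,
  respectRetryAfter: true, // extra key, dropped by toRetryConfig
};

const executorConfig = toRetryConfig(strictRateLimited);
console.log(Object.keys(executorConfig).length); // 6
```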
Quick Start Guide
- Install dependencies: run npm install --save-dev @types/node if the Node type definitions are not already present, and ensure your environment supports AbortController (Node 16+ or modern browsers).
- Define your error taxonomy: update retryableStatuses in the config to match your downstream service's failure patterns. Remove non-transient codes like 400, 401, 403.
- Wrap your HTTP client: replace direct fetch or axios calls with new ApiRetryExecutor(config).execute(() => client.request(options)). The request function must throw an error carrying status (and, where available, headers) for failed responses; fetch does not reject on non-2xx statuses by default, so add that check yourself.
- Add observability: emit http.retry.count, http.retry.delay_ms, and http.retry.success metrics at the wrapper boundary. Correlate with distributed trace IDs.
- Validate under load: Use a load testing tool to simulate 429/503 responses. Verify P99 latency remains within SLO and downstream request volume does not spike beyond 1.5x baseline.
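Before running a load test, the 1.5x target can be sanity-checked with a little arithmetic. The sketch below estimates how many requests reach the downstream service per logical call, under the simplifying assumption that every attempt fails independently with the same probability; `expectedAttempts` is a hypothetical helper.

```typescript
// Expected number of requests sent per logical call.
// Attempt k is only made if all k earlier attempts failed, which has probability p^k.
function expectedAttempts(failureProbability: number, maxAttempts: number): number {
  let total = 0;
  let reachProbability = 1; // probability that the current attempt is made at all
  for (let k = 0; k < maxAttempts; k++) {
    total += reachProbability;
    reachProbability *= failureProbability;
  }
  return total;
}

console.log(expectedAttempts(0.3, 3).toFixed(2)); // 1.39
console.log(expectedAttempts(0.5, 3).toFixed(2)); // 1.75
```

With maxAttempts of 3, a 30% failure rate stays under the 1.5x target while a 50% failure rate exceeds it, which is the regime where a circuit breaker or a smaller retry budget is needed. Real failures are usually correlated, so treat the figure as a rough estimate rather than a bound.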