nce with provider quotas.
3. Distributed vs. Local: For single-instance deployments, an in-memory limiter suffices. For distributed systems (e.g., Kubernetes pods), the rate limiter state must be shared via Redis or a similar store to prevent cross-instance limit violations.
4. Priority Queuing: Not all requests are equal. Critical user-facing requests should bypass background batch jobs during limit contention.
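The priority rule in item 4 can be sketched as a small queue that always hands out the most urgent pending request first and keeps first-in, first-out order within a priority level. This is a minimal sketch, not part of the implementation below: the names `Priority`, `PendingRequest`, and `PriorityRequestQueue` are hypothetical, and a linear scan is assumed to be acceptable for small queues.

```typescript
// Hypothetical types for illustration; none of these names come from a library.
type Priority = 'high' | 'normal' | 'low';

interface PendingRequest<T> {
  id: string;
  priority: Priority;
  payload: T;
  seq: number; // insertion order, used as the FIFO tie-breaker
}

// Lower rank means more urgent.
const PRIORITY_RANK: Record<Priority, number> = { high: 0, normal: 1, low: 2 };

class PriorityRequestQueue<T> {
  private items: PendingRequest<T>[] = [];
  private nextSeq = 0;

  enqueue(id: string, priority: Priority, payload: T): void {
    this.items.push({ id, priority, payload, seq: this.nextSeq++ });
  }

  // Remove and return the most urgent request, or undefined when empty.
  dequeue(): PendingRequest<T> | undefined {
    if (this.items.length === 0) return undefined;
    let best = 0;
    for (let i = 1; i < this.items.length; i++) {
      const candidate = this.items[i];
      const current = this.items[best];
      const diff = PRIORITY_RANK[candidate.priority] - PRIORITY_RANK[current.priority];
      if (diff < 0 || (diff === 0 && candidate.seq < current.seq)) best = i;
    }
    return this.items.splice(best, 1)[0];
  }

  get size(): number {
    return this.items.length;
  }
}
```

A dispatcher would call `dequeue()` each time the limiter reports free capacity, so background work only runs when no user-facing request is waiting.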
TypeScript Implementation
The following implementation provides a production-grade rate limiter. It supports dual RPM/TPM constraints, jitter-based backoff, Retry-After header parsing, and token estimation.
export interface RateLimitConfig {
  rpm: number;
  tpm: number;
  model: string;
  // Backoff configuration
  backoff: {
    baseMs: number;
    maxMs: number;
    jitter: boolean;
  };
  // Optional: Provider-specific overrides
  maxRetries?: number;
}

export interface RateLimitMetrics {
  requestsAllowed: number;
  requestsQueued: number;
  requestsDropped: number;
  tokensConsumed: number;
}
class SlidingWindowLimiter {
  // Each entry records when a request was admitted and how many tokens it reserved.
  // Token counts must carry a timestamp so they can age out of the window.
  private windows: Map<string, { timestamp: number; tokens: number }[]> = new Map();
  private config: RateLimitConfig;

  constructor(config: RateLimitConfig) {
    this.config = config;
  }

  async acquire(tokens: number): Promise<boolean> {
    const now = Date.now();
    const windowMs = 60_000; // 1 minute window

    // Drop entries that have aged out of the window
    this.cleanup(now, windowMs);

    const entries = this.windows.get(this.config.model) ?? [];
    const currentRpm = entries.length;
    const currentTpm = entries.reduce((sum, e) => sum + e.tokens, 0);

    if (currentRpm >= this.config.rpm) return false;
    if (currentTpm + tokens > this.config.tpm) return false;

    // Reserve capacity
    entries.push({ timestamp: now, tokens });
    this.windows.set(this.config.model, entries);
    return true;
  }

  private cleanup(now: number, windowMs: number): void {
    const cutoff = now - windowMs;
    for (const [key, entries] of this.windows) {
      const filtered = entries.filter(e => e.timestamp > cutoff);
      if (filtered.length === 0) {
        this.windows.delete(key);
      } else {
        this.windows.set(key, filtered);
      }
    }
  }
}
export class LLMApiClient {
  private limiter: SlidingWindowLimiter;
  private metrics: RateLimitMetrics = {
    requestsAllowed: 0,
    requestsQueued: 0,
    requestsDropped: 0,
    tokensConsumed: 0
  };

  constructor(private config: RateLimitConfig) {
    this.limiter = new SlidingWindowLimiter(config);
  }

  async call(
    prompt: string,
    maxTokens: number,
    // Not yet used for scheduling in this implementation
    priority: 'high' | 'normal' | 'low' = 'normal'
  ): Promise<string> {
    const estimatedTokens = this.estimateTokens(prompt, maxTokens);
    let retries = 0;
    // Use ?? so that an explicit maxRetries of 0 is respected
    const maxRetries = this.config.maxRetries ?? 5;

    while (retries <= maxRetries) {
      const canProceed = await this.limiter.acquire(estimatedTokens);
      if (canProceed) {
        try {
          const response = await this.executeRequest(prompt, maxTokens);
          this.metrics.requestsAllowed++;
          this.metrics.tokensConsumed += estimatedTokens;
          return response;
        } catch (error: any) {
          const waitTime = this.handleRateLimitError(error, retries);
          if (waitTime === null) throw error; // Non-retryable
          await this.sleep(waitTime);
          retries++;
        }
      } else {
        // Capacity exhausted, wait and retry
        await this.sleep(500);
        retries++;
        this.metrics.requestsQueued++;
      }
    }

    this.metrics.requestsDropped++;
    throw new Error(`Rate limit exceeded after ${maxRetries} retries`);
  }

  private estimateTokens(prompt: string, maxTokens: number): number {
    // In production, use tiktoken or provider-specific tokenizer
    // Approximation: 1 token ≈ 4 chars for English text
    const promptTokens = Math.ceil(prompt.length / 4);
    // Conservative estimate for output
    return promptTokens + maxTokens;
  }

  private handleRateLimitError(error: any, retries: number): number | null {
    if (error.status !== 429) return null;

    // Parse Retry-After header if present
    const retryAfter = error.headers?.['retry-after'];
    if (retryAfter) {
      const seconds = parseInt(retryAfter, 10);
      // Retry-After may also be an HTTP-date; fall through to backoff if it is not numeric
      if (!Number.isNaN(seconds)) return seconds * 1000;
    }

    // Exponential backoff with jitter
    const base = this.config.backoff.baseMs;
    const max = this.config.backoff.maxMs;
    const exponential = Math.min(base * Math.pow(2, retries), max);
    if (this.config.backoff.jitter) {
      const jitter = Math.random() * exponential * 0.5;
      return exponential + jitter;
    }
    return exponential;
  }

  private async executeRequest(prompt: string, maxTokens: number): Promise<string> {
    // Placeholder for actual API call (e.g., fetch, axios, provider SDK)
    // Ensure error objects include status and headers for 429 handling
    throw new Error('Implementation required');
  }

  private sleep(ms: number): Promise<void> {
    return new Promise(resolve => setTimeout(resolve, ms));
  }

  getMetrics(): RateLimitMetrics {
    return { ...this.metrics };
  }
}
Rationale
- Sliding Window: Provides accurate enforcement of RPM/TPM limits over rolling intervals, avoiding the "boundary burst" issue of fixed windows.
- Token Estimation: The estimateTokens method reserves capacity before the request is sent. This reduces the chance that the limiter admits a request that would then fail at the provider due to TPM constraints.
- Retry-After Priority: Providers often send a Retry-After header during 429 responses. Parsing this ensures compliance with the provider's specific cooldown period, which may exceed calculated backoff.
- Jitter: Adding random jitter to backoff intervals prevents the "thundering herd" problem where multiple clients wake up simultaneously and hammer the API again.
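The Retry-After and jitter points can be isolated into two small pure functions, which makes them easy to unit test. This is a sketch under stated assumptions: `parseRetryAfterMs` and `backoffMs` are hypothetical helper names, the header is assumed to arrive as a plain string, and the jitter formula mirrors the "up to 50% extra" rule used in the client, so the result can exceed `maxMs` by at most half.

```typescript
// Convert a Retry-After header value to a wait time in milliseconds.
// HTTP allows two forms: a number of seconds, or an HTTP-date.
// Returns null when the header is absent or unparseable, so the caller
// can fall back to exponential backoff.
function parseRetryAfterMs(
  value: string | undefined,
  nowMs: number = Date.now()
): number | null {
  if (!value) return null;
  const trimmed = value.trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  const dateMs = Date.parse(trimmed);
  if (Number.isNaN(dateMs)) return null;
  // A date in the past means "retry now"
  return Math.max(0, dateMs - nowMs);
}

// Exponential backoff capped at maxMs, plus up to 50% random jitter.
// `random` is injectable so tests can make the result deterministic.
function backoffMs(
  attempt: number,
  baseMs: number,
  maxMs: number,
  random: () => number = Math.random
): number {
  const exponential = Math.min(baseMs * 2 ** attempt, maxMs);
  return exponential + random() * exponential * 0.5;
}

// Prefer the provider's instruction; otherwise use the computed backoff.
function nextWaitMs(
  retryAfter: string | undefined,
  attempt: number,
  baseMs: number,
  maxMs: number
): number {
  return parseRetryAfterMs(retryAfter) ?? backoffMs(attempt, baseMs, maxMs);
}
```

With `baseMs = 1000` and `maxMs = 30000`, attempts 0 through 5 give base delays of 1 s, 2 s, 4 s, 8 s, 16 s, and 30 s before jitter is added.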
Pitfall Guide
- Ignoring TPM Constraints: Focusing solely on RPM limits is the most common error. A single request with a 100k-token context consumes a large share of the TPM quota (for example, 20% of a 500,000 TPM limit), even if your request count is low. Always enforce both constraints.
- Blind Retries Without Jitter: Retrying immediately or with fixed intervals causes synchronized retry storms across distributed instances. Always implement exponential backoff with random jitter to desynchronize retries.
- Underestimating Output Tokens: Token estimation must account for both input and expected output. Underestimating output tokens leads to TPM violations that can only be detected after the request is submitted, wasting quota and incurring costs.
- Hardcoding Limits: Rate limits vary by model, subscription tier, and region. Hardcoding limits in configuration files leads to brittle systems. Implement dynamic limit discovery via provider API responses or configuration management that supports tier-based overrides.
- Missing Retry-After Parsing: Providers may impose specific cooldown periods during rate limit events. Ignoring the Retry-After header and using generic backoff can result in repeated 429 errors and potential account throttling.
- No Priority Differentiation: Treating all requests equally causes critical user interactions to be delayed by background batch jobs. Implement priority queuing to ensure high-priority requests preempt low-priority ones during limit contention.
- Stateless Rate Limiting in Distributed Systems: In-memory rate limiters do not work across multiple application instances. Without a shared state store (e.g., Redis), each instance will independently enforce limits, causing aggregate traffic to exceed provider quotas.
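The last pitfall can be illustrated by moving the counters behind a store that every instance shares. The sketch below is deliberately simplified and rests on several assumptions: `SharedCounterStore`, `InMemoryStore`, and `SharedLimiter` are hypothetical names, the window is a fixed one-minute bucket rather than the sliding window used earlier, and the store is synchronous and in-memory so the example runs on its own. A Redis-backed store would be asynchronous, would need to perform the check-and-increment atomically, and would expire old keys.

```typescript
// Hypothetical interface for a store visible to every application instance.
interface SharedCounterStore {
  // Add `amount` to the counter at `key` only if the result stays within
  // `limit`. Returns true when the reservation succeeded.
  tryReserve(key: string, amount: number, limit: number): boolean;
}

// In-memory stand-in so the sketch runs without external services.
// It never expires old keys; a real store would set a TTL on each key.
class InMemoryStore implements SharedCounterStore {
  private counters = new Map<string, number>();

  tryReserve(key: string, amount: number, limit: number): boolean {
    const next = (this.counters.get(key) ?? 0) + amount;
    if (next > limit) return false;
    this.counters.set(key, next);
    return true;
  }
}

class SharedLimiter {
  private store: SharedCounterStore;
  private model: string;
  private rpm: number;
  private tpm: number;

  constructor(store: SharedCounterStore, model: string, rpm: number, tpm: number) {
    this.store = store;
    this.model = model;
    this.rpm = rpm;
    this.tpm = tpm;
  }

  acquire(tokens: number, nowMs: number = Date.now()): boolean {
    // Fixed window: all instances agree on the key for the current minute.
    const minute = Math.floor(nowMs / 60_000);
    const rpmKey = `rpm:${this.model}:${minute}`;
    const tpmKey = `tpm:${this.model}:${minute}`;

    if (!this.store.tryReserve(rpmKey, 1, this.rpm)) return false;
    if (!this.store.tryReserve(tpmKey, tokens, this.tpm)) {
      // Roll back the request slot so a rejected call does not consume RPM.
      this.store.tryReserve(rpmKey, -1, this.rpm);
      return false;
    }
    return true;
  }
}
```

Two `SharedLimiter` instances constructed with the same store draw from one budget, which is the property an in-memory limiter per pod cannot provide.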
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Real-time Chat Application | Client-side Token-Aware Limiter + Priority Queue | Minimizes latency; ensures user requests are not delayed by background tasks. | Low overhead; prevents retry costs. |
| High-Volume Batch Processing | Distributed Redis Rate Limiter + Aggressive Batching | Shared state ensures aggregate compliance; batching reduces RPM overhead. | Reduces RPM consumption by 60-80%; optimizes TPM usage. |
| Multi-Tenant SaaS | Tenant-Isolated Quotas + Token Bucket | Prevents noisy neighbor issues; allows tiered rate limits per customer. | Enables monetization; protects infrastructure. |
| Cost-Sensitive MVP | In-Memory Limiter + Conservative Estimates | Simplest implementation; low operational complexity. | Minimizes waste; higher risk of underutilization. |
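The "Tenant-Isolated Quotas + Token Bucket" row can be sketched as one bucket per tenant, sized by the tenant's tier. This is an illustrative sketch only: `TokenBucket`, `TenantLimiter`, and the tier names used in the usage note are hypothetical, and time is passed in explicitly so the behaviour is easy to test.

```typescript
interface TierLimits {
  capacity: number;        // maximum burst size
  refillPerSecond: number; // sustained rate
}

class TokenBucket {
  private capacity: number;
  private refillPerSecond: number;
  private tokens: number;
  private lastRefillMs: number;

  constructor(limits: TierLimits, nowMs: number) {
    this.capacity = limits.capacity;
    this.refillPerSecond = limits.refillPerSecond;
    this.tokens = limits.capacity; // start full
    this.lastRefillMs = nowMs;
  }

  tryRemove(cost: number, nowMs: number): boolean {
    // Refill in proportion to elapsed time, never above capacity.
    const elapsedSec = Math.max(0, nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSecond);
    this.lastRefillMs = nowMs;
    if (cost > this.tokens) return false;
    this.tokens -= cost;
    return true;
  }
}

class TenantLimiter {
  private tiers: Record<string, TierLimits>;
  private buckets = new Map<string, TokenBucket>();

  constructor(tiers: Record<string, TierLimits>) {
    this.tiers = tiers;
  }

  // Each tenant gets its own bucket, so one tenant cannot drain another's quota.
  allow(tenantId: string, tier: string, cost: number, nowMs: number = Date.now()): boolean {
    const limits = this.tiers[tier];
    if (!limits) throw new Error(`Unknown tier: ${tier}`);
    let bucket = this.buckets.get(tenantId);
    if (!bucket) {
      bucket = new TokenBucket(limits, nowMs);
      this.buckets.set(tenantId, bucket);
    }
    return bucket.tryRemove(cost, nowMs);
  }
}
```

For example, `new TenantLimiter({ free: { capacity: 2, refillPerSecond: 1 }, pro: { capacity: 10, refillPerSecond: 5 } })` lets a free-tier tenant burst two requests and then sustain one per second, independently of every other tenant.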
Configuration Template
// rate-limit-config.ts
import { LLMApiClient, RateLimitConfig } from './LLMApiClient';

export const configs: Record<string, RateLimitConfig> = {
  'gpt-4o': {
    rpm: 100,
    tpm: 3_000_000,
    model: 'gpt-4o',
    maxRetries: 5,
    backoff: {
      baseMs: 1000,
      maxMs: 30_000,
      jitter: true
    }
  },
  'claude-3-sonnet': {
    rpm: 200,
    tpm: 1_000_000,
    model: 'claude-3-sonnet',
    maxRetries: 3,
    backoff: {
      baseMs: 500,
      maxMs: 10_000,
      jitter: true
    }
  }
};

// Usage
const client = new LLMApiClient(configs['gpt-4o']);
Quick Start Guide
- Install Dependencies: Add tiktoken for token estimation and your preferred HTTP client.
npm install tiktoken
- Copy Implementation: Copy the LLMApiClient and SlidingWindowLimiter classes into your project.
- Configure Limits: Create a configuration object matching your provider's RPM and TPM quotas.
- Wrap API Calls: Replace direct API calls with client.call(prompt, maxTokens).
const response = await client.call("Explain quantum computing", 500);
- Monitor: Log client.getMetrics() periodically to track limiter performance and adjust thresholds based on actual usage patterns.
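For the monitoring step, the raw counters are easier to act on once reduced to a few ratios. The helper below is a hypothetical sketch (`summarizeMetrics` is not part of the client); it assumes the `RateLimitMetrics` shape defined earlier and relies on the fact that `requestsQueued` is incremented once per wait cycle, not once per request.

```typescript
// Same shape as the RateLimitMetrics interface defined earlier.
interface RateLimitMetrics {
  requestsAllowed: number;
  requestsQueued: number;
  requestsDropped: number;
  tokensConsumed: number;
}

interface MetricsSummary {
  dropRate: number;             // dropped / (allowed + dropped)
  avgTokensPerRequest: number;  // estimated tokens per allowed request
  waitCyclesPerRequest: number; // limiter wait cycles per allowed request
}

function summarizeMetrics(m: RateLimitMetrics): MetricsSummary {
  const finished = m.requestsAllowed + m.requestsDropped;
  return {
    dropRate: finished === 0 ? 0 : m.requestsDropped / finished,
    avgTokensPerRequest: m.requestsAllowed === 0 ? 0 : m.tokensConsumed / m.requestsAllowed,
    waitCyclesPerRequest: m.requestsAllowed === 0 ? 0 : m.requestsQueued / m.requestsAllowed
  };
}

// Example: log a summary from a snapshot such as client.getMetrics().
const snapshot: RateLimitMetrics = {
  requestsAllowed: 90,
  requestsQueued: 30,
  requestsDropped: 10,
  tokensConsumed: 45_000
};
console.log(JSON.stringify(summarizeMetrics(snapshot)));
```

A rising drop rate suggests the configured limits or retry budget are too tight for the workload, while a high wait-cycle ratio with no drops suggests the limiter is absorbing bursts as intended.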