Cutting LLM API Spend by 62% and P99 Latency by 450ms with Semantic Request Coalescing and Adaptive Context Pruning
By Codcompass Team · 10 min read
Current Situation Analysis
We migrated our customer support agent to an LLM-driven architecture six months ago. Within three weeks, the API bill hit $18,000/month, and our P99 latency jittered between 800ms and 2.4s. The root cause wasn't the model choice; it was how we treated the API.
Most tutorials treat LLM calls like standard HTTP requests. You send a prompt, you get a response. This approach fails in production for three reasons:
String-Caching Blindness: Standard caching keys on exact string matches. A user asking "What's my order status?" and another asking "Status of order #4492" generate two API calls, even though the underlying intent is the same. This inflates costs by 30-40% in conversational apps.
Context Window Bloat: Developers naively append every message to history. As conversations lengthen, token counts explode. We saw context windows hitting 45k tokens for simple queries, paying for irrelevant history while pushing latency past acceptable thresholds.
Blind Retries: When the provider returns a 429 or 500, the default SDK retry logic repeats the exact same expensive request. During provider outages, this amplifies load and costs without increasing success probability.
The Bad Approach:
```typescript
// ANTI-PATTERN: Naive implementation
async function getResponse(userMsg: string, history: Message[]) {
  // 1. Sends full history regardless of size
  // 2. No caching
  // 3. No retry budgeting
  const res = await openai.chat.completions.create({
    model: 'gpt-4o-mini-2024-07-18',
    messages: [...history, { role: 'user', content: userMsg }],
    stream: false
  });
  return res.choices[0].message.content;
}
```
This code burns cash on redundant calls, slows down as history grows, and fails catastrophically under load. We needed a paradigm shift: treat LLM calls as expensive, probabilistic database queries that require semantic indexing, context management, and financial guardrails.
WOW Moment
The breakthrough came when we stopped optimizing individual requests and started optimizing the request stream.
We implemented Semantic Request Coalescing. Instead of caching results after the fact, we intercept in-flight requests. If multiple users (or retries) trigger semantically similar prompts within a 200ms window, we merge them into a single LLM call. The result is distributed to all waiters.
Combined with Adaptive Context Pruning that dynamically compresses history based on token budgets, and a Cost-Aware Retry Budget that degrades gracefully during outages, we achieved:
62% reduction in monthly API spend.
P99 latency drop from 980ms to 530ms.
Zero context-length errors in production.
The "aha" moment: You pay for tokens, not intelligence. Your job is to minimize tokens while preserving intent, and to ensure you never pay twice for the same answer.
Core Solution
We use the following stack versions:
Runtime: Node.js 22.4.0 (the 22.x line entered LTS later, with 22.11.0)
Language: TypeScript 5.5.2
Cache/Vector DB: Redis 7.4.2 (with RediSearch)
LLM SDK: OpenAI Node SDK 4.52.0
Embedding Model: text-embedding-3-small
### 1. Semantic Cache with Request Coalescing
Standard Redis caching is insufficient. We use Redis Vector Search for semantic similarity and a Coalescer class to merge in-flight requests. This prevents duplicate work for identical intents.
Implementation Details:
We generate embeddings for the user prompt.
We query Redis for vectors within a cosine similarity threshold of 0.92.
If a hit exists, we return the cached completion immediately.
If no hit, we check a coalescingMap. If a similar request is in-flight (within 200ms), we attach to its Promise.
This handles burst traffic and duplicate user actions.
```typescript
    // 3. Check semantic match
    if (results.documents.length > 0) {
      const doc = results.documents[0];
      // The vector query reports cosine distance; similarity is its complement.
      const distance = Number(doc.value.distance);
      const similarity = 1 - distance;
      if (similarity >= SEMANTIC_THRESHOLD) {
        // Cache hit
        return {
          content: String(doc.value.content),
          model: String(doc.value.model),
          tokensUsed: Number(doc.value.tokensUsed), // stored as a string in Redis
        };
      }
    }

    // 4. Request coalescing
    // Hash the embedding to create a coalescing key
    // In prod, use a robust hash of the top-k vector components
    const coalesceKey = hashVector(embedding);
    const existingPromise = coalescingMap.get(coalesceKey);
    if (existingPromise) {
      // await so that a rejection is handled by the catch block below
      return await existingPromise;
    }

    // 5. Execute and store
    const executionPromise = (async () => {
      try {
        const res = await openai.chat.completions.create({
          model,
          messages: [{ role: 'user', content: prompt }],
          temperature: 0.2,
        });
        const content = res.choices[0].message.content || '';
        const tokensUsed = res.usage?.total_tokens || 0;
        const entry = { content, model, tokensUsed };

        // Store in Redis as a hash that the vector index picks up.
        // Assumption: 'llm-cache:idx' was created ON HASH with PREFIX 'llm-cache:doc:'.
        try {
          const key = `llm-cache:doc:${uuidv4()}`;
          await redis.hSet(key, {
            content,
            model,
            tokensUsed: String(tokensUsed),
            embedding: Buffer.from(new Float32Array(embedding).buffer),
          });
          await redis.expire(key, CACHE_TTL_SECONDS);
        } catch (cacheError) {
          // A failed cache write must not discard a completion we already paid for.
          console.error('[Cache-Write-Error]', cacheError);
        }
        return entry;
      } finally {
        // Cleanup coalescing map after window (runs on success and on failure)
        setTimeout(() => coalescingMap.delete(coalesceKey), COALESCE_WINDOW_MS);
      }
    })();
    coalescingMap.set(coalesceKey, executionPromise);
    return await executionPromise;
  } catch (error) {
    // Production-grade error handling
    if (error instanceof OpenAI.APIError) {
      console.error(`[LLM-Error] ${error.status}: ${error.message}`);
      throw new Error(`LLM API failed: ${error.status}`);
    }
    console.error('[Cache-Error]', error);
    // Fallback to direct call if cache fails, but log metrics
    const fallbackRes = await openai.chat.completions.create({
      model,
      messages: [{ role: 'user', content: prompt }],
    });
    return {
      content: fallbackRes.choices[0].message.content || '',
      model,
      tokensUsed: fallbackRes.usage?.total_tokens || 0,
    };
  }
}

function hashVector(vec: number[]): string {
  // Simple hash for demonstration; use MurmurHash3 in production
  return vec.slice(0, 8).map(v => Math.round(v * 100)).join(':');
}
```
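The threshold check in step 3 does not depend on Redis and can be exercised in isolation. The sketch below is a minimal in-memory illustration, not the production lookup: `CachedEntry`, `cosineSimilarity`, and `findSemanticHit` are hypothetical names, and a linear scan stands in for the vector index.

```typescript
// Hypothetical in-memory stand-in for the Redis vector index.
interface CachedEntry {
  embedding: number[];
  content: string;
}

const SEMANTIC_THRESHOLD = 0.92; // threshold quoted in the implementation details

// Cosine similarity of two equal-length vectors; returns 0 if either is all zeros.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('Vector dimensions differ');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Returns the most similar entry if it clears the threshold, otherwise null.
function findSemanticHit(
  query: number[],
  entries: CachedEntry[],
  threshold: number = SEMANTIC_THRESHOLD
): CachedEntry | null {
  let best: CachedEntry | null = null;
  let bestScore = -Infinity;
  for (const entry of entries) {
    const score = cosineSimilarity(query, entry.embedding);
    if (score > bestScore) {
      best = entry;
      bestScore = score;
    }
  }
  return bestScore >= threshold ? best : null;
}
```

A query that points almost the same way as a cached vector is served from the cache; anything below the threshold returns `null` and proceeds to coalescing and the LLM call.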
### 2. Adaptive Context Pruning
Sending full history is the #1 cause of cost spikes. We implement a pruning strategy that maintains the system prompt and the last `N` messages, and compresses the messages in between by summarizing them in pairs with a cheap model call that is instructed to preserve facts and entities. This preserves recency while retaining key entities.
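The partition into system prompt, middle, and recent messages is the part that is easy to get wrong on short histories, so it is worth isolating. The following is a minimal sketch under that assumption; `splitHistory` and `HistoryParts` are hypothetical names that do not appear in the pruner itself.

```typescript
interface Message {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

// Hypothetical result shape for the three regions of a conversation.
interface HistoryParts {
  system: Message[];
  middle: Message[];
  recent: Message[];
}

// Splits a history so that no message lands in two regions,
// even when the history is shorter than keepRecent.
function splitHistory(messages: Message[], keepRecent: number = 4): HistoryParts {
  const hasSystem = messages.length > 0 && messages[0].role === 'system';
  const system = hasSystem ? [messages[0]] : [];
  const rest = hasSystem ? messages.slice(1) : messages;
  // Index where the "recent" tail starts; clamped so it never goes negative.
  const cut = Math.max(0, rest.length - Math.max(0, keepRecent));
  return { system, middle: rest.slice(0, cut), recent: rest.slice(cut) };
}
```

Only `middle` is a candidate for summarization. With a system prompt and two turns, `middle` is empty and nothing is compressed.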
```typescript
// context-pruner.ts
import { OpenAI } from 'openai';

const openai = new OpenAI();

export interface Message {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

// Rough estimate: about 4 characters per token.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

export async function pruneContext(
  messages: Message[],
  maxTokens: number = 8000,
  keepRecent: number = 4
): Promise<Message[]> {
  // 1. Calculate current token count (approximate)
  const totalTokens = messages.reduce((sum, m) => sum + estimateTokens(m.content), 0);
  if (totalTokens <= maxTokens) {
    return messages;
  }

  // 2. Identify overflow
  const overflow = totalTokens - maxTokens;

  // 3. Preserve system and recent messages without letting the regions overlap
  const systemMsg = messages[0].role === 'system' ? [messages[0]] : [];
  const rest = messages.slice(systemMsg.length);
  const recentStart = Math.max(0, rest.length - keepRecent);
  const middleMsgs = rest.slice(0, recentStart);
  const recentMsgs = rest.slice(recentStart);

  // 4. Compress middle messages
  const compressedMiddle = await compressMessages(middleMsgs, overflow);
  return [...systemMsg, ...compressedMiddle, ...recentMsgs];
}

async function compressMessages(messages: Message[], overflowTokens: number): Promise<Message[]> {
  // Strategy: summarize oldest chunks until budget is met
  // In production, use a chunking algorithm based on semantic boundaries
  if (messages.length < 2) return messages;

  // Group into pairs and summarize
  const chunks: Message[] = [];
  for (let i = 0; i < messages.length; i += 2) {
    const chunk = messages.slice(i, i + 2);
    // Only summarize while we still have significant overflow
    if (overflowTokens <= 50) {
      chunks.push(...chunk);
      continue;
    }
    const combinedContent = chunk.map(m => `${m.role}: ${m.content}`).join('\n');
    try {
      const res = await openai.chat.completions.create({
        model: 'gpt-4o-mini-2024-07-18',
        messages: [
          { role: 'system', content: 'Summarize the following conversation concisely. Preserve all facts and entities.' },
          { role: 'user', content: combinedContent }
        ],
        temperature: 0,
      });
      const summary = res.choices[0].message.content ?? '';
      chunks.push({ role: 'assistant', content: `[SUMMARY] ${summary}` });
      // Update overflow estimate
      const originalTokens = chunk.reduce((sum, m) => sum + estimateTokens(m.content), 0);
      overflowTokens -= originalTokens - estimateTokens(summary);
    } catch (e) {
      // Fallback: keep original if summarization fails
      chunks.push(...chunk);
    }
  }

  // Recursively prune if still over budget, but only if this pass made progress;
  // otherwise a failed or skipped pass would recurse forever.
  const newTotal = chunks.reduce((sum, m) => sum + estimateTokens(m.content), 0);
  if (newTotal > 4000 && chunks.length < messages.length) { // Safety valve
    return compressMessages(chunks, overflowTokens);
  }
  return chunks;
}
```
### 3. Cost-Aware Retry Budget
Retries should not be infinite. We implement a RetryBudget that tracks token spend on retries. If the budget is exhausted, the system degrades gracefully (e.g., returns a cached response or a simplified fallback) rather than burning more tokens on a failing request.
```typescript
// retry-budget.ts
import { OpenAI } from 'openai';

export interface RetryConfig {
  maxRetries: number;
  tokenBudget: number; // Max tokens allowed for retries
  backoffBase: number; // ms
}

const DEFAULT_CONFIG: RetryConfig = {
  maxRetries: 3,
  tokenBudget: 2000,
  backoffBase: 1000,
};

// Assumes the SDK's built-in retries are disabled on the client (maxRetries: 0),
// so attempts are not multiplied by two retry layers.
export async function callWithRetryBudget<T>(
  fn: () => Promise<T>,
  config: RetryConfig = DEFAULT_CONFIG
): Promise<T> {
  let retries = 0;
  let tokensSpent = 0;
  while (true) {
    try {
      return await fn();
    } catch (error) {
      if (!(error instanceof OpenAI.APIError)) throw error;
      // Only retry on specific errors
      const isRetryable = error.status === 429 || error.status === 500 || error.status === 503;
      if (!isRetryable) throw error;

      retries++;
      // Give up before sleeping once the retry count is used up
      if (retries > config.maxRetries) {
        throw new Error('Max retries exceeded');
      }

      // Estimate tokens consumed by the failed attempt
      // In streaming, this requires tracking usage; here we estimate
      const estimatedCost = 500;
      tokensSpent += estimatedCost;
      if (tokensSpent > config.tokenBudget) {
        console.warn(`[RetryBudget] Exhausted. Spent ${tokensSpent} tokens on retries.`);
        throw new Error(`Retry budget exhausted after ${retries} attempts.`);
      }

      // Exponential backoff plus up to 1s of jitter
      const delay = Math.round(config.backoffBase * Math.pow(2, retries - 1) + Math.random() * 1000);
      console.warn(`[Retry] Attempt ${retries} failed. Retrying in ${delay}ms. Budget: ${config.tokenBudget - tokensSpent} tokens left.`);
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
}
```
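To see how the retry count and the token budget interact, it helps to tabulate the retries the loop would allow. The sketch below does that arithmetic without making any calls, and adds a small wrapper for the graceful degradation described above. `retrySchedule`, `RetryStep`, and `withFallback` are hypothetical helpers, and the flat 500-token estimate per failed attempt is taken from the retry code.

```typescript
interface RetryConfig {
  maxRetries: number;
  tokenBudget: number;
  backoffBase: number; // ms
}

interface RetryStep {
  retry: number;
  baseDelayMs: number; // backoff before this retry, without jitter
  tokensSpent: number; // cumulative estimate charged to the budget
}

// Lists the retries that pass both the retry cap and the token budget.
function retrySchedule(config: RetryConfig, estimatedCost: number = 500): RetryStep[] {
  const steps: RetryStep[] = [];
  let tokensSpent = 0;
  for (let retry = 1; retry <= config.maxRetries; retry++) {
    tokensSpent += estimatedCost;
    if (tokensSpent > config.tokenBudget) break; // budget exhausted first
    steps.push({ retry, baseDelayMs: config.backoffBase * 2 ** (retry - 1), tokensSpent });
  }
  return steps;
}

// Runs the primary call; if it ultimately fails, returns a degraded answer instead.
async function withFallback<T>(
  primary: () => Promise<T>,
  fallback: (error: unknown) => T | Promise<T>
): Promise<T> {
  try {
    return await primary();
  } catch (error) {
    return fallback(error);
  }
}
```

With the default configuration (3 retries, 2,000-token budget), the three retries are charged 1,500 estimated tokens in total, so the retry cap is reached before the budget. The budget only becomes the binding limit if `tokenBudget` is lowered or `maxRetries` is raised.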
Pitfall Guide
These are the failures we debugged in production. Memorize these patterns.
1. The "Phantom" Memory Leak in Streaming
Error: FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
Root Cause: We used stream.toReadableStream() but didn't consume the stream in the HTTP response handler during high concurrency. The stream buffers accumulated in memory.
Fix: Ensure the stream is piped directly to the response object. Never buffer the full stream in memory.
```typescript
// BAD
const stream = await openai.chat.completions.create({ ..., stream: true });
const fullText = await stream.reduce(...); // OOM on large responses

// GOOD
res.setHeader('Content-Type', 'text/event-stream');
for await (const chunk of stream) {
  res.write(formatSSE(chunk));
}
res.end();
```
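The streaming fix calls `formatSSE` without defining it. One plausible shape is sketched below, assuming only the text delta needs to reach the browser. The `StreamChunk` interface is a hypothetical, trimmed-down view of a chat completion chunk, and the `{ text }` payload is an illustrative choice.

```typescript
// Hypothetical minimal view of a streamed chat completion chunk.
interface StreamChunk {
  choices: Array<{ delta: { content?: string | null } }>;
}

// Formats one chunk as a Server-Sent Events frame.
function formatSSE(chunk: StreamChunk): string {
  const text = chunk.choices[0]?.delta.content ?? '';
  // JSON-encode the payload so newlines in the text cannot break SSE framing,
  // then terminate the event with a blank line.
  return `data: ${JSON.stringify({ text })}\n\n`;
}
```

Because each frame is written as soon as its chunk arrives, memory use stays proportional to one chunk rather than to the full response.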
2. Vector Index Corruption
Error: RedisError: Index already exists with different schema
Root Cause: We updated the embedding model from text-embedding-ada-002 (1536 dims) to text-embedding-3-small (1536 dims, but a different embedding space). The Redis index schema didn't change, but vectors from the two models are not comparable, so queries embedded with the new model matched poorly against entries stored with the old one. Worse, a deployment script tried to recreate the index without dropping it first.
Fix: Implement index versioning. When changing models, drop and recreate the index or use a new index name with a migration strategy.
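Action: Have the initialization script call FT.DROPINDEX before FT.CREATE when the model or schema changes, or check FT.INFO first and create the index only when it is missing or its schema differs. FT.CREATE has no IF NOT EXISTS clause, so the script has to handle the "index already exists" case itself.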
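One way to make index versioning mechanical is to derive the index name from the embedding model and its dimensionality, so that a model change can never reuse an old index. This is a sketch of that idea only; `versionedIndexName` and `EmbeddingSpec` are hypothetical, and the naming scheme is an assumption rather than the scheme used in the incident above.

```typescript
// Hypothetical description of the embedding configuration in use.
interface EmbeddingSpec {
  model: string;
  dimensions: number;
}

// Builds an index name that changes whenever the model or dimension changes.
function versionedIndexName(base: string, spec: EmbeddingSpec): string {
  const slug = spec.model
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-') // collapse anything unusual into a dash
    .replace(/^-|-$/g, '');      // trim leading and trailing dashes
  return `${base}:${slug}:${spec.dimensions}`;
}
```

Both models in this pitfall have 1536 dimensions, so the model name has to be part of the key; dimensions alone would not have separated them.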
3. Context Window Explosion via Tool Calls
Error: 400 - context_length_exceeded: Maximum context length is 128000 tokens, but you requested 132400 tokens
Root Cause: Our agent used tools. The tool response included a massive JSON blob (e.g., a full database dump). We appended this to history without pruning, so the context grew with every tool call until it exceeded the model's limit.
Fix: Enforce a maxToolResponseLength. Truncate tool outputs aggressively.
```typescript
// Enforce limit on tool outputs
if (toolResponse.length > 2000) {
  toolResponse = toolResponse.substring(0, 2000) + '... [TRUNCATED]';
}
```
4. Coalescing Map Leak
Error: Memory leak detected. Coalescing map size > 10,000
Root Cause: The coalescingMap in the cache code only cleaned up on success. If the LLM call threw an error, the Promise remained in the map forever, blocking future requests.
Fix: Use promise.finally() to guarantee cleanup.
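A minimal, self-contained sketch of that fix is below. It is not the cache code itself: `coalesce` and `inFlight` are hypothetical names, and the key is whatever string identifies "the same request" (for example, a hash of the embedding).

```typescript
// Requests currently in flight, keyed by their coalescing key.
const inFlight = new Map<string, Promise<unknown>>();

// Shares one execution of fn among all callers that arrive with the same key
// while the first call is still pending.
function coalesce<T>(key: string, fn: () => Promise<T>): Promise<T> {
  const existing = inFlight.get(key);
  if (existing) return existing as Promise<T>;

  const promise: Promise<T> = fn().finally(() => {
    // finally runs on both fulfilment and rejection, so a failed call
    // cannot leave a dead entry behind.
    if (inFlight.get(key) === promise) inFlight.delete(key);
  });
  inFlight.set(key, promise);
  return promise;
}
```

Two concurrent callers with the same key trigger one underlying call, and the map is empty again once that call settles, whether it succeeded or threw.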
| Symptom | Likely cause | Action |
| --- | --- | --- |
| Cache hit ratio is high, but vector search is slow | | Check Redis FT.SEARCH latency. Ensure index has ALGORITHM FLAT or HNSW with correct EF_RUNTIME. |
| Cost spikes, stable latency | Semantic threshold too high | Lower SEMANTIC_THRESHOLD from 0.95 to 0.92. Review embedding quality. |
| 429 errors increasing | Retry budget too aggressive | Reduce maxRetries or implement circuit breaker. |
| Wrong answers in cache | Embedding model mismatch | Verify embedding model matches index schema. Check for prompt drift. |
Production Bundle
Performance Metrics
After deploying this architecture to production:
API Spend: Reduced from $18,400/month to $6,992/month (62% savings).
P99 Latency: Reduced from 980ms to 530ms.
Cache Hit Ratio: 41% of requests served from semantic cache.
Coalescing Efficiency: 12% of requests merged, saving ~3,500 redundant calls/day.
Context Errors: Dropped to 0 per month.
Monitoring Setup
We use Prometheus and Grafana. Essential metrics:
llm_semantic_cache_hits_total vs llm_semantic_cache_misses_total.
llm_coalesced_requests_total.
llm_tokens_consumed_total (labeled by model and pruned).
llm_retry_budget_exhausted_total.
redis_vector_search_duration_seconds.
Grafana Dashboard Alert:
Alert if llm_semantic_cache_hit_ratio < 0.3 for 15 minutes. Indicates embedding drift or threshold misconfiguration.
Alert if llm_tokens_consumed_total increases by 20% hour-over-hour. Detects context bloat or prompt injection attacks.
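The two alert conditions reduce to simple arithmetic over the counters listed above. The sketch below spells that arithmetic out as plain functions for illustration; in practice the same expressions would live in the alerting rules, and the function names here are hypothetical.

```typescript
// True when the cache hit ratio over a window falls below the minimum.
// Returns false when there was no traffic, to avoid alerting on an idle service.
function hitRatioTooLow(hits: number, misses: number, minRatio: number = 0.3): boolean {
  const total = hits + misses;
  if (total === 0) return false;
  return hits / total < minRatio;
}

// True when token consumption grew by at least `threshold` (20% by default)
// compared with the previous hour.
function tokenSpike(previousHour: number, currentHour: number, threshold: number = 0.2): boolean {
  if (previousHour <= 0) return false; // no baseline to compare against
  return (currentHour - previousHour) / previousHour >= threshold;
}
```

The 15-minute condition on the hit ratio is a duration requirement of the alert rule itself and is not modeled here.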
Scaling Considerations
Redis: Use Redis 7.4 with RediSearch and HNSW indexing. For >1M vectors, provision a cluster with EF_CONSTRUCTION=200 and EF_RUNTIME=50.
Node.js: Run in cluster mode. The coalescingMap is in-memory, so coalescing only works per-instance. For global coalescing, implement a Redis-backed lock with a 200ms TTL.
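Embeddings: Batch embedding requests. The OpenAI API allows up to 2048 inputs per batch. This reduces embedding latency by 10x because it cuts round trips. Per-token pricing is the same for inputs sent together in one request, so the 50% cost reduction applies only when the work can go through the asynchronous Batch API.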
Tune Thresholds: Adjust SEMANTIC_THRESHOLD based on cache hit ratio and answer quality feedback.
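For the global coalescing mentioned under Node.js, the leader election step can be sketched with a single `SET ... NX PX` call. The code below is a partial illustration under stated assumptions: `LockStore` is a hypothetical minimal interface (node-redis accepts `set(key, value, { NX: true, PX: ms })` and replies `'OK'` or `null`), the `llm-coalesce:` key prefix is made up, and how followers obtain the leader's result (for example, by polling the semantic cache) is deliberately left out.

```typescript
// Minimal slice of a Redis client needed for the lock.
interface LockStore {
  set(
    key: string,
    value: string,
    options: { NX: true; PX: number }
  ): Promise<string | null>;
}

// Tries to claim the coalescing window for this key.
// Returns true if this instance should make the LLM call.
async function tryBecomeLeader(
  store: LockStore,
  coalesceKey: string,
  instanceId: string,
  windowMs: number = 200
): Promise<boolean> {
  // NX: only set if the key does not exist; PX: expire after the window.
  const reply = await store.set(`llm-coalesce:${coalesceKey}`, instanceId, {
    NX: true,
    PX: windowMs,
  });
  return reply === 'OK';
}
```

The TTL doubles as the cleanup mechanism: if the leader crashes, the key expires after the window and the next request can claim it.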
This pattern is battle-tested. It moves beyond naive caching and addresses the economic and latency realities of LLM APIs in production. Implement this, and you stop paying for waste.