mbedding model (e.g., text-embedding-3-small, nomic-embed-text, or @xenova/transformers) to generate fixed-length representations. Store vectors alongside the original prompt and response payload.
Semantic matching requires a cosine similarity threshold. As a starting point, treat scores below 0.85 as distinct intents, then calibrate the value for your embedding model, since score distributions differ between models. TTL must be dynamic: cache entries should expire faster when underlying data changes or when token budgets are exhausted. Use a hybrid TTL strategy that combines absolute expiration with usage-based decay.
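The threshold check described above can be sketched as follows. This is a minimal illustration, not the article's own utility: `cosineSimilarity` and `isSemanticHit` are hypothetical helpers, and the 0.85 default simply mirrors the starting value mentioned in the text.

```typescript
// Hypothetical helper: cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error('Embedding dimensions must match');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction, so treat it as matching nothing
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical helper: a cached entry counts as a hit only at or above the threshold.
function isSemanticHit(query: number[], stored: number[], threshold = 0.85): boolean {
  return cosineSimilarity(query, stored) >= threshold;
}
```

Real embeddings have hundreds or thousands of dimensions, but the arithmetic is the same. If the embedding provider already returns unit-length vectors, the dot product alone gives the same ranking.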
Step 3: Implement Response Optimization
Optimization occurs at three points:
- Prompt compression: Remove redundant system instructions, truncate non-essential context, and apply template normalization before embedding.
- Token budgeting: Cap output tokens and truncate responses when they exceed cost thresholds.
- Streaming passthrough: When cache misses occur, stream the LLM response while simultaneously writing to cache. Subsequent identical requests receive the cached stream buffer.
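The prompt compression step might look like the sketch below. `normalizePrompt` is a hypothetical function (the article's `compressPrompt` is not shown), and the choices here are assumptions: duplicate lines are detected case-insensitively, and truncation keeps the tail of the prompt on the theory that the most recent context matters most.

```typescript
// Hypothetical normalizer: collapses whitespace and drops repeated lines
// so that equivalent prompts produce the same text before embedding.
function normalizePrompt(prompt: string, maxChars = 2000): string {
  const seen = new Set<string>();
  const lines: string[] = [];
  for (const rawLine of prompt.split('\n')) {
    const line = rawLine.replace(/\s+/g, ' ').trim();
    if (line === '') continue;
    const fingerprint = line.toLowerCase();
    if (seen.has(fingerprint)) continue; // skip redundant instructions
    seen.add(fingerprint);
    lines.push(line);
  }
  const joined = lines.join('\n');
  // Assumption: when over budget, keep the end of the prompt rather than the start
  return joined.length > maxChars ? joined.slice(joined.length - maxChars) : joined;
}
```

A character cap is only a rough stand-in for a token budget; a production version would measure length with the tokenizer of the target model.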
Step 4: Prevent Cache Stampede
Concurrent identical requests during a cache miss cause thundering herd behavior. Implement request coalescing using a distributed lock or promise deduplication. Only one request triggers the LLM; others await the resolved promise.
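Promise deduplication within a single process can be sketched as below. `RequestCoalescer` is a hypothetical class introduced for illustration; it only coalesces requests inside one Node.js process, so a multi-instance deployment would still need a distributed lock on top of it.

```typescript
// In-process request coalescing: concurrent callers that pass the same key
// share a single in-flight promise instead of each calling the producer.
class RequestCoalescer<T> {
  private inFlight = new Map<string, Promise<T>>();

  run(key: string, producer: () => Promise<T>): Promise<T> {
    const existing = this.inFlight.get(key);
    if (existing) return existing;

    const pending = producer().finally(() => {
      // Clear the slot on success or failure so later requests can retry
      this.inFlight.delete(key);
    });
    this.inFlight.set(key, pending);
    return pending;
  }
}
```

A caller would wrap the LLM call, for example `coalescer.run(cacheKey, () => callModel(prompt))`, where `callModel` stands in for whatever provider SDK is in use. If the producer rejects, every waiting caller sees the same rejection.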
TypeScript Implementation
import { createHash } from 'node:crypto';
import { createClient } from 'redis';
import { cosineSimilarity } from './utils/similarity';
import { generateEmbedding } from './services/embedding';
import { compressPrompt, tokenize } from './services/tokenizer';

// Derive the client type from createClient to avoid generic mismatches with RedisClientType
type RedisClient = ReturnType<typeof createClient>;

interface CacheEntry {
  key: string; // Redis key the entry is stored under
  prompt: string;
  response: string;
  embedding: number[];
  tokens: number;
  createdAt: number;
  ttl: number;
  hitCount: number;
}

export class AICacheOptimizer {
  private redis: RedisClient;
  private similarityThreshold: number;
  private maxTokens: number;
  private baseTTL: number;

  constructor(config: {
    redisUrl: string;
    similarityThreshold?: number;
    maxTokens?: number;
    baseTTL?: number;
  }) {
    this.redis = createClient({ url: config.redisUrl });
    this.similarityThreshold = config.similarityThreshold ?? 0.85;
    this.maxTokens = config.maxTokens ?? 4000;
    this.baseTTL = config.baseTTL ?? 3600;
  }

  async init() {
    await this.redis.connect();
  }

  async getOrGenerate(prompt: string): Promise<string> {
    const normalized = compressPrompt(prompt);
    const embedding = await generateEmbedding(normalized);
    const digest = this.hash(normalized);
    const cacheKey = `ai:cache:${digest}`;
    const lockKey = `ai:lock:${digest}`;

    // Serve hits before taking the lock so cached reads never serialize
    const cached = await this.findSemanticMatch(embedding);
    if (cached) {
      // Count the hit on the matched entry, which may live under a different key
      await this.redis.hIncrBy(cached.key, 'hitCount', 1);
      return cached.response;
    }

    // Request coalescing to prevent stampede: only the lock holder calls the LLM
    const lockAcquired = await this.redis.set(lockKey, '1', { NX: true, EX: 5 });
    if (!lockAcquired) {
      // Wait for the concurrent request to write its result
      return this.waitForCache(cacheKey, 3000);
    }

    try {
      // Fallback to LLM generation
      let response = await this.generateResponse(normalized);
      let tokens = tokenize(response).length;
      if (tokens > this.maxTokens) {
        // Truncate, then cache the truncated text so waiting requests still resolve
        response = response.slice(0, this.maxTokens * 4); // rough char approximation
        tokens = tokenize(response).length;
      }

      const ttl = this.calculateDynamicTTL(tokens);
      await this.redis.hSet(cacheKey, {
        prompt: normalized,
        response,
        embedding: JSON.stringify(embedding),
        tokens: String(tokens),
        createdAt: String(Date.now()),
        ttl: String(ttl),
        hitCount: '0'
      });
      await this.redis.expire(cacheKey, ttl);
      return response;
    } finally {
      await this.redis.del(lockKey);
    }
  }

  private async findSemanticMatch(queryEmbedding: number[]): Promise<CacheEntry | null> {
    // KEYS is O(N) and blocks Redis; acceptable for a demo, but use SCAN or a vector index in production
    const keys = await this.redis.keys('ai:cache:*');
    let bestMatch: CacheEntry | null = null;
    let bestScore = -1;

    for (const key of keys) {
      const raw = await this.redis.hGetAll(key);
      if (!raw.embedding) continue; // skip incomplete entries
      const storedEmbedding: number[] = JSON.parse(raw.embedding);
      const score = cosineSimilarity(queryEmbedding, storedEmbedding);
      if (score >= this.similarityThreshold && score > bestScore) {
        bestScore = score;
        bestMatch = {
          key,
          prompt: raw.prompt,
          response: raw.response,
          embedding: storedEmbedding,
          tokens: Number(raw.tokens),
          createdAt: Number(raw.createdAt),
          ttl: Number(raw.ttl),
          hitCount: Number(raw.hitCount)
        };
      }
    }
    return bestMatch;
  }

  private calculateDynamicTTL(tokens: number): number {
    // Shorter TTL for high-token responses to reduce stale cache risk
    const decay = Math.max(0.5, 1 - (tokens / this.maxTokens) * 0.3);
    return Math.floor(this.baseTTL * decay);
  }

  private async generateResponse(prompt: string): Promise<string> {
    // Replace with your LLM provider SDK
    // Supports streaming internally, but returns aggregated string for cache storage
    throw new Error('Implement LLM generation');
  }

  private async waitForCache(key: string, timeout: number): Promise<string> {
    const start = Date.now();
    while (Date.now() - start < timeout) {
      // Poll for the response field itself so a partially written hash is never returned
      const response = await this.redis.hGet(key, 'response');
      if (response != null) {
        return response;
      }
      await new Promise(r => setTimeout(r, 100));
    }
    throw new Error('Cache wait timeout');
  }

  private hash(input: string): string {
    // Hash the full prompt; truncating base64 would collide for prompts that share a prefix
    return createHash('sha256').update(input).digest('hex').slice(0, 16);
  }
}
Architecture Decisions & Rationale
- Semantic matching over exact keys: LLM prompts vary syntactically but converge semantically. Cosine similarity on embeddings captures intent equivalence without manual regex or prompt normalization.
- Dynamic TTL tied to token count: High-token responses consume more cache memory and age faster. Reducing TTL proportionally limits stale data exposure.
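- Request coalescing via distributed locks: Prevents redundant LLM calls during cache misses. The lock is keyed on the hash of the normalized prompt, so only one inference triggers per normalized prompt; paraphrases that normalize differently can still trigger separate calls.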
- Cache-aside with streaming passthrough: The cache stores aggregated responses, but production gateways should stream directly to clients while buffering for cache writes. This decouples latency from cache write latency.
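The streaming passthrough decision can be sketched with an async generator. `streamWithCacheWrite` and its `writeToCache` callback are hypothetical names, and the source stream is assumed to be any `AsyncIterable<string>` of text chunks, which is how many provider SDKs expose streamed output.

```typescript
// Hypothetical passthrough: forwards chunks as they arrive and hands the
// aggregated text to a cache writer once the stream completes.
async function* streamWithCacheWrite(
  source: AsyncIterable<string>,
  writeToCache: (fullResponse: string) => Promise<void>,
): AsyncGenerator<string> {
  const chunks: string[] = [];
  for await (const chunk of source) {
    chunks.push(chunk);
    yield chunk; // the client sees each chunk immediately
  }
  // Not awaited, so cache write latency never delays the end of the stream
  void writeToCache(chunks.join('')).catch((err) => {
    console.error('cache write failed', err);
  });
}
```

Because the cache write sits after the loop, a client that disconnects mid-stream leaves nothing in the cache, which avoids storing partial responses.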
Pitfall Guide
- Relying exclusively on exact-match caching: LLM prompts are naturally paraphrased. Exact-match caches achieve <25% hit rates in production. Semantic vectors or fuzzy hashing must replace string equality.
- Ignoring context drift and temporal data: Cached responses become stale when underlying facts change (pricing, policies, system states). Implement versioned cache keys or attach data freshness metadata to cache entries.
- Caching system-state-dependent prompts: Prompts containing user IDs, session tokens, or real-time metrics should never be cached. Filter dynamic segments before embedding generation.
- Cache stampede during peak loads: Without request coalescing, 100 concurrent identical requests trigger 100 LLM calls. Distributed locks or in-memory promise deduplication are mandatory.
- Neglecting streaming optimization: Caching aggregated responses breaks streaming UX. Implement a dual-path architecture: stream directly to the client while writing to cache asynchronously. Subsequent requests receive the cached buffer.
- Static TTL without usage analytics: Fixed expiration ignores traffic patterns. Cache entries with high hit counts should receive TTL extensions; low-traffic entries should expire faster to reclaim memory.
- Arbitrary similarity thresholds: A threshold of 0.75 may match unrelated prompts; 0.95 may miss valid duplicates. Calibrate thresholds using a validation set of known duplicate prompts and measure precision/recall tradeoffs.
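Threshold calibration against a labeled validation set can be sketched as below. The `LabeledPair` shape and both function names are hypothetical; the sketch assumes you have already computed the cosine similarity for each prompt pair and labeled whether the pair shares intent.

```typescript
interface LabeledPair {
  similarity: number;   // cosine similarity between the two prompts
  isDuplicate: boolean; // human label: do the prompts share intent?
}

// Precision and recall of "similarity >= threshold" as a duplicate detector.
function evaluateThreshold(pairs: LabeledPair[], threshold: number) {
  let truePositives = 0;
  let falsePositives = 0;
  let falseNegatives = 0;
  for (const pair of pairs) {
    const predicted = pair.similarity >= threshold;
    if (predicted && pair.isDuplicate) truePositives++;
    else if (predicted && !pair.isDuplicate) falsePositives++;
    else if (!predicted && pair.isDuplicate) falseNegatives++;
  }
  const predictedTotal = truePositives + falsePositives;
  const actualTotal = truePositives + falseNegatives;
  return {
    threshold,
    // Convention: report 1 when there is nothing to be wrong about
    precision: predictedTotal === 0 ? 1 : truePositives / predictedTotal,
    recall: actualTotal === 0 ? 1 : truePositives / actualTotal,
  };
}

// Evaluate several candidate thresholds so the tradeoff can be compared side by side.
function sweepThresholds(pairs: LabeledPair[], candidates = [0.75, 0.8, 0.85, 0.9, 0.95]) {
  return candidates.map((t) => evaluateThreshold(pairs, t));
}
```

For a cache, a false positive serves the wrong answer while a false negative only costs an extra LLM call, so it is usually reasonable to pick the lowest threshold that still keeps precision at the level you need.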
Best Practices from Production
- Run a cache warming job during off-peak hours to pre-populate high-frequency semantic clusters.
- Monitor cache hit rate, P95 latency, and token expenditure per 1k requests. Alert when hit rate drops below 60%.
- Use approximate nearest neighbor (ANN) indexes like HNSW or FAISS for vector search when cache size exceeds 10k entries.
- Implement cache invalidation webhooks for data source changes. Trigger semantic re-validation instead of blanket flushes.
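The hit-rate monitoring practice can be sketched with a small in-memory counter. `CacheMetrics` is a hypothetical class; the 0.6 default mirrors the 60% alert level above, and `minSamples` is an added assumption to avoid alerting on a handful of requests. A real deployment would export these values to its metrics system rather than hold them in process memory.

```typescript
// Hypothetical in-memory tracker for hit rate and token spend.
class CacheMetrics {
  private hits = 0;
  private misses = 0;
  private tokensSpent = 0;

  recordHit(): void {
    this.hits++;
  }

  // Only misses reach the LLM, so only misses spend tokens
  recordMiss(tokensUsed: number): void {
    this.misses++;
    this.tokensSpent += tokensUsed;
  }

  hitRate(): number {
    const total = this.hits + this.misses;
    return total === 0 ? 0 : this.hits / total;
  }

  tokensPer1kRequests(): number {
    const total = this.hits + this.misses;
    return total === 0 ? 0 : (this.tokensSpent / total) * 1000;
  }

  // Alert only after enough traffic to make the rate meaningful
  shouldAlert(minHitRate = 0.6, minSamples = 100): boolean {
    const total = this.hits + this.misses;
    return total >= minSamples && this.hitRate() < minHitRate;
  }
}
```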
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High-volume FAQ/chatbot | Semantic Vector Cache + ANN Index | Prompts are highly repetitive; vector search captures intent equivalence at scale. | Reduces API spend by 55-70% |
| Real-time data queries | Token-Aware Hybrid + Versioned Keys | Data freshness matters; semantic caching with TTL decay prevents stale outputs. | Moderate cost increase for validation, but avoids incorrect responses |
| Low-traffic internal tools | Exact-Match + Short TTL | Overhead of embedding generation outweighs benefits; simple caching suffices. | Minimal infrastructure cost, ~15-20% savings |
| Multi-turn conversational UI | Session-Scoped Cache + Prompt Compression | Context window grows with turns; compress history and cache turn-level responses. | Reduces context token waste by 40-50% |
Configuration Template
{
  "aiCache": {
    "redis": {
      "url": "redis://localhost:6379",
      "keyPrefix": "ai:cache:",
      "maxEntries": 50000,
      "evictionPolicy": "volatile-lru"
    },
    "semantic": {
      "embeddingModel": "text-embedding-3-small",
      "similarityThreshold": 0.85,
      "indexType": "hnsw",
      "m": 16,
      "efConstruction": 200
    },
    "optimization": {
      "maxOutputTokens": 4000,
      "promptCompression": true,
      "dynamicTTL": {
        "baseSeconds": 3600,
        "tokenDecayFactor": 0.3,
        "minTTL": 300
      },
      "streamingPassthrough": true,
      "coalesceTimeoutMs": 3000
    },
    "monitoring": {
      "metricsEndpoint": "/metrics/ai-cache",
      "alertThresholds": {
        "hitRateMin": 0.60,
        "p95LatencyMaxMs": 450,
        "tokenSpendPer1kReqs": 1200000
      }
    }
  }
}
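One way the `dynamicTTL` block of the template might be consumed is sketched below. `ttlFromConfig` is a hypothetical function, and treating `minTTL` as a hard floor is an assumption about its intended meaning; the field names match the template.

```typescript
interface DynamicTTLConfig {
  baseSeconds: number;
  tokenDecayFactor: number;
  minTTL: number;
}

// Hypothetical mapping from the config block to a per-entry TTL in seconds.
function ttlFromConfig(tokens: number, maxOutputTokens: number, cfg: DynamicTTLConfig): number {
  // Fraction of the output budget used, clamped to [0, 1]
  const usage = Math.min(1, Math.max(0, tokens / maxOutputTokens));
  const ttl = Math.floor(cfg.baseSeconds * (1 - usage * cfg.tokenDecayFactor));
  // Assumption: minTTL is a floor that decay can never go below
  return Math.max(cfg.minTTL, ttl);
}
```

With the template values, a response that uses the full output budget would keep roughly 70% of the base TTL, while a very short response keeps nearly all of it.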
Quick Start Guide
- Initialize Redis & Embedding Service: Deploy a Redis instance and configure an embedding provider. Set environment variables for REDIS_URL and EMBEDDING_API_KEY.
- Instantiate the Optimizer: Import AICacheOptimizer, pass configuration, and call init(). Route all LLM calls through getOrGenerate(prompt).
- Add Telemetry: Attach Prometheus/Grafana metrics to cache hit rate, latency percentiles, and token expenditure. Configure alerts for hit rate drops below 60%.
- Validate Thresholds: Run a batch of 500 historical prompts through the optimizer. Adjust similarityThreshold until false positives stay below 5%. Deploy to production.