layer reduces waste while preserving semantic integrity.
Step 1: Tokenizer Alignment
Model-specific tokenizers must be used during development to match production behavior. Mismatched tokenizers cause silent budget overruns and context limit violations.
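import { encoding_for_model, type Tiktoken, type TiktokenModel } from "tiktoken";

export class TokenizerManager {
  private encoder: Tiktoken;
  private textDecoder = new TextDecoder();

  constructor(model: string) {
    // Throws if tiktoken does not recognize the model name.
    this.encoder = encoding_for_model(model as TiktokenModel);
  }

  count(text: string): number {
    return this.encoder.encode(text).length;
  }

  truncate(text: string, maxTokens: number, suffix: string = "..."): string {
    const tokens = this.encoder.encode(text);
    if (tokens.length <= maxTokens) return text;
    // Reserve room for the suffix so the result stays near the maxTokens budget.
    const keep = Math.max(0, maxTokens - this.count(` ${suffix}`));
    // decode() returns UTF-8 bytes, so convert them back to a string.
    const truncated = this.textDecoder.decode(this.encoder.decode(tokens.slice(0, keep)));
    return `${truncated} ${suffix}`;
  }

  // Release the WASM-backed encoder when it is no longer needed.
  dispose(): void {
    this.encoder.free();
  }
}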
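To illustrate why the choice of tokenizer matters for budgeting, here is a minimal sketch in which two counters disagree about the same text. The `TokenCounter` interface and both counters are hypothetical stand-ins invented for this example; they are not real model tokenizers, and a wrapper such as `TokenizerManager` would take their place in practice.

```typescript
// Minimal interface any tokenizer wrapper could satisfy (hypothetical name).
interface TokenCounter {
  count(text: string): number;
}

// Toy stand-ins for illustration only; real tokenizers are model-specific.
const wordCounter: TokenCounter = {
  count: (text) => text.split(/\s+/).filter(Boolean).length,
};
const charCounter: TokenCounter = {
  // Rough "4 characters per token" heuristic.
  count: (text) => Math.ceil(text.length / 4),
};

// True when the text fits the budget according to the given counter.
function fitsBudget(text: string, budget: number, counter: TokenCounter): boolean {
  return counter.count(text) <= budget;
}

const sample = "Internationalization considerations for multilingual tokenization";
console.log(wordCounter.count(sample), charCounter.count(sample));
console.log(fitsBudget(sample, 10, wordCounter), fitsBudget(sample, 10, charCounter));
```

The same text passes one budget check and fails the other, which is the kind of silent disagreement that surfaces in production when development used a different counter.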
Step 2: Prompt Structuring & Compression
Raw text contains structural redundancy. Template-driven prompts with explicit delimiters, minimal examples, and compressed JSON schemas can reduce token overhead by 30–50%, depending on how verbose the original prompt is.
export class PromptCompressor {
  static compressContext(context: Record<string, unknown>): string {
    // Drop null/undefined fields and collapse whitespace in string values,
    // then serialize without indentation.
    const cleaned = Object.fromEntries(
      Object.entries(context)
        .filter(([, v]) => v !== null && v !== undefined)
        .map(([k, v]) => [k, typeof v === "string" ? v.replace(/\s+/g, " ").trim() : v])
    );
    return JSON.stringify(cleaned);
  }

  static buildSystemPrompt(instructions: string[], examples: string[]): string {
    // Prefix every rule with "- " so the first item is bulleted like the rest.
    const rules = instructions.map((rule) => `- ${rule}`).join("\n");
    const base = `You are a precision assistant. Follow these rules strictly:\n${rules}`;
    const formattedExamples = examples.length > 0
      ? `\n\nExamples:\n${examples.map((e, i) => `Example ${i + 1}:\n${e}`).join("\n---\n")}`
      : "";
    return `${base}${formattedExamples}`;
  }
}
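One way to check whether compression is paying off is to compare the compact form against a pretty-printed baseline. The sketch below uses character counts as a rough proxy for tokens, which is an assumption; a real measurement would use the target model's tokenizer. The helper names `compact` and `savings` are hypothetical.

```typescript
type Json = Record<string, unknown>;

// Drop null/undefined fields, then serialize without indentation.
function compact(context: Json): string {
  const cleaned = Object.fromEntries(
    Object.entries(context).filter(([, v]) => v !== null && v !== undefined)
  );
  return JSON.stringify(cleaned);
}

// Fraction of characters saved relative to a pretty-printed baseline.
function savings(context: Json): number {
  const pretty = JSON.stringify(context, null, 2);
  return 1 - compact(context).length / pretty.length;
}

const ctx: Json = { user: "ada", plan: "pro", notes: null, tags: ["a", "b"] };
console.log(compact(ctx)); // {"user":"ada","plan":"pro","tags":["a","b"]}
console.log(savings(ctx).toFixed(2));
```

The ratio depends heavily on how deeply nested and how sparsely populated the context object is, so it is worth measuring on real payloads rather than assuming a fixed figure.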
Step 3: Dynamic Context Window Management
Fixed context windows fail under variable workloads. A sliding window preserves recent conversation state while pruning older turns; semantic chunking can refine this by pruning low-relevance history first.
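import { TokenizerManager } from "./tokenizer";

interface Message { role: "system" | "user" | "assistant"; content: string; tokens: number; }

export class ContextWindowManager {
  private messages: Message[] = [];
  private maxTokens: number;
  private tokenizer: TokenizerManager;

  constructor(maxTokens: number, model: string) {
    this.maxTokens = maxTokens;
    this.tokenizer = new TokenizerManager(model);
  }

  add(role: Message["role"], content: string): void {
    const tokens = this.tokenizer.count(content);
    this.messages.push({ role, content, tokens });
    this.prune();
  }

  // Evict the oldest non-system message until the window fits the budget.
  // System prompts and the newest message are never evicted.
  private prune(): void {
    let total = this.messages.reduce((sum, m) => sum + m.tokens, 0);
    while (total > this.maxTokens) {
      const index = this.messages.findIndex(
        (m, i) => m.role !== "system" && i < this.messages.length - 1
      );
      if (index === -1) break; // only system prompts and the newest message remain
      total -= this.messages[index].tokens;
      this.messages.splice(index, 1);
    }
  }

  getPayload(): Message[] {
    // Return a copy so callers cannot mutate the internal window.
    return [...this.messages];
  }
}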
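Pruning low-relevance history, as opposed to pruning purely by age, could be sketched as follows. The `relevance` field is a hypothetical score (for example, derived from embedding similarity to the current query); how it is computed is outside the scope of this sketch, and `pruneByRelevance` is an illustrative name rather than part of the classes above.

```typescript
interface ScoredMessage {
  role: "system" | "user" | "assistant";
  content: string;
  tokens: number;
  relevance: number; // hypothetical score in [0, 1]
}

// Keep system messages and the last `keepRecent` turns; evict the
// lowest-relevance remaining messages until the total fits `maxTokens`.
function pruneByRelevance(
  messages: ScoredMessage[],
  maxTokens: number,
  keepRecent = 2
): ScoredMessage[] {
  const kept = [...messages];
  let total = kept.reduce((sum, m) => sum + m.tokens, 0);
  while (total > maxTokens) {
    const protectedFrom = kept.length - keepRecent;
    let victim = -1;
    for (let i = 0; i < protectedFrom; i++) {
      if (kept[i].role === "system") continue;
      if (victim === -1 || kept[i].relevance < kept[victim].relevance) victim = i;
    }
    if (victim === -1) break; // nothing evictable is left
    total -= kept[victim].tokens;
    kept.splice(victim, 1);
  }
  return kept;
}

const history: ScoredMessage[] = [
  { role: "system", content: "rules", tokens: 50, relevance: 1 },
  { role: "user", content: "old tangent", tokens: 40, relevance: 0.1 },
  { role: "assistant", content: "old answer", tokens: 40, relevance: 0.6 },
  { role: "user", content: "current question", tokens: 30, relevance: 0.9 },
];
console.log(pruneByRelevance(history, 130, 1).map((m) => m.content));
```

Here the low-scoring tangent is dropped while the older but more relevant answer survives, which a purely age-based window would not do.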
Step 4: Caching Architecture
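Prompt caching and response caching eliminate redundant computation. Hashing a normalized prompt yields cache hits only for inputs that are identical after normalization; matching paraphrased inputs would require a similarity lookup (for example, over embeddings), which the cache shown here does not implement.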
import { createHash } from "crypto";

type CachedResult = { response: string; tokens: number };

export class TokenCache {
  private store: Map<string, CachedResult & { expiresAt: number }> = new Map();

  constructor(private ttlMs: number = 3_600_000) {} // default: 1 hour

  private hash(prompt: string): string {
    // Truncating to 16 hex characters keeps 64 bits of the SHA-256 digest.
    return createHash("sha256").update(prompt.normalize("NFKC")).digest("hex").slice(0, 16);
  }

  async getOrCompute(prompt: string, compute: () => Promise<CachedResult>): Promise<CachedResult> {
    const key = this.hash(prompt);
    const cached = this.store.get(key);
    if (cached && cached.expiresAt > Date.now()) {
      return { response: cached.response, tokens: cached.tokens };
    }
    this.store.delete(key); // drop the expired entry, if any
    const result = await compute();
    this.store.set(key, { ...result, expiresAt: Date.now() + this.ttlMs });
    return result;
  }
}
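How much a hash-keyed cache can match depends on the normalization applied before hashing. The sketch below shows a somewhat more aggressive normalizer than NFKC alone; `normalizePrompt` is a hypothetical helper, and the lowercasing step in particular is an assumption that should be dropped if case carries meaning in your prompts.

```typescript
// Normalize before hashing so trivial variations map to one cache key.
// Lowercasing is an assumption; skip it if case carries meaning in your prompts.
function normalizePrompt(prompt: string): string {
  return prompt.normalize("NFKC").trim().replace(/\s+/g, " ").toLowerCase();
}

const spaced = normalizePrompt("  What is   the refund policy? ");
const plain = normalizePrompt("what is the refund policy?");
const paraphrase = normalizePrompt("How do refunds work?");

console.log(spaced === plain);      // true: formatting differences collapse to one key
console.log(spaced === paraphrase); // false: a paraphrase still misses
```

Normalization widens the set of exact matches, but rewording the question still produces a different key, so paraphrase-level hits need a different mechanism.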
Architecture Decisions & Rationale:
- Tokenizer alignment first: Prevents silent context overflow and ensures accurate budgeting.
- Compression over truncation: Preserves semantic boundaries; truncation breaks syntax and loses constraints.
- Sliding window with role weighting: System prompts should be preserved; user/assistant turns are evicted oldest-first until the window fits the token budget.
- Normalized hashing for cache: NFKC normalization plus a SHA-256 digest truncated to 64 bits trades some collision resistance for shorter keys. This gives exact-match lookups, not semantic matching.
- Separation of concerns: Tokenizer, compressor, window manager, and cache operate independently, enabling unit testing and middleware composition.
Pitfall Guide
- Ignoring tokenizer variance: GPT-4, Claude, and Llama tokenize differently. Using a generic tokenizer during development causes production context limit violations. Always initialize tokenizers per target model.
- Over-compression losing constraints: Stripping too much structure removes critical instructions, boundaries, or format requirements. Compression must preserve delimiters, role tags, and output schemas.
- Caching without invalidation: Stale cached responses degrade accuracy when context or external data changes. Implement TTLs, versioned keys, and cache warming strategies.
- Hardcoding input limits without output reservation: LLMs require token space for generation. Allocating 100% of the window to input causes truncation mid-generation. Reserve 20–30% for output tokens.
- Treating all tokens equally: System prompts, retrieved context, and user messages carry different weights. Prioritize system instructions and recent turns; evict older, low-signal context first.
- Skipping token distribution monitoring: Without telemetry, optimization is guesswork. Log token counts per role, track compression ratios, and alert on variance spikes.
- Assuming larger context windows eliminate optimization: 128K+ windows encourage bloat, increase prefill latency, and raise costs. Optimization remains mandatory regardless of window size.
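The output-reservation pitfall comes down to simple arithmetic, sketched below. `splitBudget` is a hypothetical helper, and the 128,000-token window is only an example figure; substitute the documented limit of your target model.

```typescript
// Split a context window into input and output budgets.
// `reservationRatio` is the share held back for generation
// (0.2 to 0.3 following the guidance in the list above).
function splitBudget(contextWindow: number, reservationRatio: number) {
  if (reservationRatio < 0 || reservationRatio >= 1) {
    throw new RangeError("reservationRatio must be in [0, 1)");
  }
  const output = Math.floor(contextWindow * reservationRatio);
  return { input: contextWindow - output, output };
}

console.log(splitBudget(128000, 0.25)); // { input: 96000, output: 32000 }
```

The input budget returned here is the figure to pass to the window manager, rather than the raw context window size.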
Best Practices from Production:
- Implement graceful degradation: fall back to compressed summaries when context exceeds thresholds.
- Use streaming to decouple perceived latency from output length; users see the first tokens sooner even though total generation time is unchanged.
- Version prompt templates and cache keys together to prevent silent accuracy drift.
- Run load tests with realistic token distributions, not synthetic averages.
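Versioning templates and cache keys together can be as simple as folding the version into the key. In this sketch, `templateVersion` is a hypothetical tag that you would bump whenever the prompt template changes, and `versionedCacheKey` is an illustrative name, not part of the `TokenCache` class shown earlier.

```typescript
import { createHash } from "node:crypto";

interface KeyParts {
  model: string;
  templateVersion: string; // hypothetical tag, bumped whenever the template changes
  prompt: string;
}

// Fold the model and template version into the key so that a template
// change causes lookups to miss every entry written under the old version.
function versionedCacheKey({ model, templateVersion, prompt }: KeyParts): string {
  return createHash("sha256")
    .update(JSON.stringify([model, templateVersion, prompt.normalize("NFKC")]))
    .digest("hex");
}

const keyV1 = versionedCacheKey({ model: "gpt-4o", templateVersion: "v1", prompt: "Summarize this ticket" });
const keyV2 = versionedCacheKey({ model: "gpt-4o", templateVersion: "v2", prompt: "Summarize this ticket" });
console.log(keyV1 === keyV2); // false: bumping the version forces a miss
```

Serializing the parts as a JSON array avoids ambiguity between, say, a model name that ends in characters the version starts with. Old entries are not deleted by this scheme; they simply stop being hit and age out under the TTL.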
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Estimated Impact |
|---|---|---|---|
| High-volume customer support | Prompt caching + static templates | Identical queries repeat frequently; cache hits eliminate redundant inference | -70% API cost |
| RAG with large documents | Semantic chunking + sliding window | Preserves relevance while pruning low-signal sections; reduces prefill time | -45% latency |
| Multi-agent orchestration | Dynamic context + output reservation | Agents require state continuity; reserving output tokens prevents mid-generation truncation | -30% error rate |
| Real-time chat interfaces | Streaming + tokenizer alignment | Users perceive lower latency; alignment prevents context limit crashes | -25% infra scaling |
| Batch data processing | Semantic compression + batch token budgeting | Compresses redundant schemas; batch allocation maximizes throughput per request | -60% compute cost |
Configuration Template
// llm-config.ts
export interface LLMConfig {
  model: string;
  maxInputTokens: number;
  outputReservation: number;
  cacheTTL: number; // milliseconds
  compressionLevel: "none" | "light" | "aggressive";
  telemetry: boolean;
}

export const defaultConfig: LLMConfig = {
  model: "gpt-4o",
  maxInputTokens: 8000,
  outputReservation: 2000,
  cacheTTL: 3600000, // 1 hour
  compressionLevel: "light",
  telemetry: true,
};

// Usage in pipeline (separate module)
import type { LLMConfig } from "./llm-config";
import { TokenizerManager } from "./tokenizer";
import { PromptCompressor } from "./compressor";
import { ContextWindowManager } from "./context";
import { TokenCache } from "./cache";

// Note: this class reads only `model` and `maxInputTokens`. The other config
// fields (outputReservation, cacheTTL, compressionLevel, telemetry) are
// declared for callers and are not consumed here.
export class LLMTokenOptimizer {
  private tokenizer: TokenizerManager;
  private window: ContextWindowManager;
  private cache: TokenCache;

  constructor(config: LLMConfig) {
    this.tokenizer = new TokenizerManager(config.model);
    this.window = new ContextWindowManager(config.maxInputTokens, config.model);
    this.cache = new TokenCache();
  }

  async optimize(prompt: string, compute: () => Promise<string>): Promise<string> {
    const compressed = PromptCompressor.compressContext({ prompt });
    this.window.add("user", compressed);
    const payload = this.window.getPayload();
    // The cache key covers the whole window, so a hit requires identical history.
    const result = await this.cache.getOrCompute(
      JSON.stringify(payload),
      async () => {
        const response = await compute();
        return { response, tokens: this.tokenizer.count(response) };
      }
    );
    return result.response;
  }
}
Quick Start Guide
- Install dependencies: `npm install tiktoken @anthropic-ai/sdk` (or the equivalent SDK for your model provider). `crypto` is built into Node.js and needs no separate install.
- Initialize the optimizer: Import `LLMTokenOptimizer` and pass your target model and token budget.
- Wrap your LLM call: Replace direct API invocations with `optimizer.optimize(prompt, () => yourModel.generate(prompt))`.
- Enable telemetry: Log `inputTokens`, `outputTokens`, and `cacheHit` on each request to validate optimization impact.
- Deploy and monitor: Run a 24-hour shadow test comparing baseline vs. optimized token usage; adjust `compressionLevel` and `outputReservation` based on observed variance.
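For the telemetry and shadow-test steps, a small aggregator over per-request logs is often enough to compare runs. The sketch below assumes each request is logged with the three fields named in the steps above; the `RequestLog` shape and the `summarize` helper are hypothetical, and the sample figures are made up for illustration.

```typescript
interface RequestLog {
  inputTokens: number;
  outputTokens: number;
  cacheHit: boolean;
}

// Summarize a batch of per-request logs into the figures worth comparing
// between the baseline and optimized runs.
function summarize(logs: RequestLog[]) {
  const n = logs.length;
  if (n === 0) return { requests: 0, cacheHitRate: 0, meanInputTokens: 0, meanOutputTokens: 0 };
  const hits = logs.filter((l) => l.cacheHit).length;
  const sum = (pick: (l: RequestLog) => number) => logs.reduce((s, l) => s + pick(l), 0);
  return {
    requests: n,
    cacheHitRate: hits / n,
    meanInputTokens: sum((l) => l.inputTokens) / n,
    meanOutputTokens: sum((l) => l.outputTokens) / n,
  };
}

// Illustrative sample data, not measurements.
const sampleLogs: RequestLog[] = [
  { inputTokens: 1200, outputTokens: 300, cacheHit: false },
  { inputTokens: 800, outputTokens: 200, cacheHit: true },
  { inputTokens: 1000, outputTokens: 400, cacheHit: false },
  { inputTokens: 1000, outputTokens: 300, cacheHit: true },
];
console.log(summarize(sampleLogs));
```

Running the same summary over the baseline and optimized log sets gives directly comparable hit rates and mean token counts for the shadow test.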