osition is cheaper than upgrading to larger models. It prevents KV cache overflow in batch serving, maintains consistent SLOs, and forces architectural discipline around token budgeting. The window is not a bucket; it is a computational constraint that must be engineered.
Core Solution
Effective context window management requires a deterministic pipeline that budgets, filters, and composes tokens before model invocation. The architecture decouples token accounting from inference, enforces hard limits, and prioritizes high-signal content.
Step 1: Token Budget Calculation
Reserve tokens for system instructions, user input, tool outputs, and expected response length. Never allocate 100% of the window. Leave a 15β20% safety margin for tokenizer variance and output generation.
interface TokenBudget {
maxWindow: number;
systemTokens: number;
reservedOutput: number;
safetyMargin: number;
get available(): number;
}
const createBudget = (maxWindow: number): TokenBudget => ({
maxWindow,
systemTokens: 450,
reservedOutput: 1024,
safetyMargin: 0.15,
get available() {
return Math.floor(
this.maxWindow -
this.systemTokens -
this.reservedOutput -
(this.maxWindow * this.safetyMargin)
);
}
});
Step 2: Semantic Chunking
Split documents at logical boundaries (headings, code blocks, paragraph clusters). Maintain overlap to preserve cross-chunk context. Chunk size should align with embedding model limits and task granularity.
interface Chunk {
id: string;
content: string;
tokens: number;
metadata: Record<string, any>;
}
const splitIntoChunks = (
text: string,
maxChunkTokens: number = 512,
overlap: number = 50
): Chunk[] => {
// Pseudo-implementation: split by markdown headings or code fences
// Calculate tokens per chunk using tiktoken or model-specific tokenizer
// Return array of Chunk objects with metadata
return [];
};
Step 3: Relevance Scoring & Priority Queue
Score chunks against the current query or conversation state. Use embedding similarity, recency weighting, or task-specific heuristics. Inject scored chunks into a priority queue that evicts lowest-relevance tokens when the budget is exceeded.
class ContextComposer {
private queue: Array<{ chunk: Chunk; score: number }> = [];
add(chunk: Chunk, score: number) {
this.queue.push({ chunk, score });
this.queue.sort((a, b) => b.score - a.score);
}
compose(budget: number): string {
let used = 0;
const selected: string[] = [];
for (const { chunk } of this.queue) {
if (used + chunk.tokens > budget) break;
selected.push(chunk.content);
used += chunk.tokens;
}
return selected.join('\n\n');
}
}
Step 4: Dynamic Window Allocation & Fallback Truncation
When hard limits are reached, apply hierarchical truncation: preserve system prompt, retain highest-priority chunks, then truncate remaining content from the middle outward. Never truncate the final user query or tool responses.
Architecture Decisions & Rationale
- Stateless Composition: Context builders run outside the inference loop. This prevents KV cache bloat and allows horizontal scaling of the composition layer independently of GPU instances.
- Tokenizer Abstraction: Model-specific tokenizers (cl100k, o200k, llama3, etc.) must be abstracted. Token counts vary by 12β18% across models. Direct character-to-token conversion introduces budget drift.
- Priority Eviction over Fixed Windows: Fixed windows discard historically relevant context. Priority queues preserve semantic density regardless of chronological position, aligning with attention mechanisms that weight relevance over recency.
- Output Reservation: Failing to reserve output tokens causes mid-generation truncation or forced stopping. Reserving 10β15% of the window guarantees complete responses.
Pitfall Guide
-
Ignoring System Prompt Tokens
System instructions consume 300β800 tokens depending on complexity. Teams that calculate budgets against user-only tokens exceed window limits during runtime. Always include system, developer, and tool definitions in budget calculations.
-
Naive Character Truncation
Splitting strings at arbitrary character counts breaks tokens, corrupts JSON, and severs semantic boundaries. Always tokenize before truncation. Use tokenizer-aware slicing that preserves whole tokens and logical delimiters.
-
Over-Compression Losing Semantics
Aggressive summarization or aggressive chunk merging strips conditional logic, edge cases, and parameter definitions. Compression should target redundancy, not information density. Preserve code blocks, configuration snippets, and explicit constraints verbatim.
-
Not Accounting for Output Tokens
Input-only budgeting causes silent truncation. The model stops generating mid-sentence when the combined input + output exceeds the hard limit. Reserve output tokens based on historical response length distributions, not best-case estimates.
-
Assuming Fixed Token-to-Word Ratios
The 1 token β 4 characters or 0.75 words rule is a rough heuristic. Multilingual text, code, JSON, and markdown break this ratio by 20β40%. Use model-specific tokenizers for production budgeting.
-
Caching Invalid Contexts
Context caches that store pre-composed prompts without versioning or tokenizer alignment serve stale or mismatched token counts. Cache keys must include model ID, tokenizer version, and prompt template hash.
-
Ignoring KV Cache Scaling
Context length directly impacts GPU memory during decoding. A 128K context consumes significantly more VRAM than a 32K context, even if only 10% is relevant. Optimize context composition to reduce cache pressure, not just token billing.
Production Best Practices:
- Run pre-flight token validation before every inference call.
- Implement token budget alerts in observability pipelines.
- Use semantic chunking aligned with document structure, not fixed sizes.
- Maintain a tokenizer registry that routes counting to the correct model variant.
- Log actual vs. predicted token usage to calibrate budgeting heuristics.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|
| Real-time chat with short history | Fixed sliding window + recency weighting | Maintains conversational flow with minimal overhead | Low |
| Document Q&A / RAG pipelines | Semantic chunking + vector retrieval + priority queue | Injects only relevant fragments, reduces noise | Medium-High savings |
| Code review / long log analysis | Hierarchical chunking + tool-aware composition | Preserves code blocks and stack traces while trimming boilerplate | Medium savings |
| Multi-step agentic workflows | Dynamic budgeting + output reservation + fallback truncation | Prevents mid-generation failure across tool calls | High savings + SLO stability |
| High-throughput batch inference | Context compression + template caching | Reduces per-request compute and KV cache pressure | Highest cost reduction |
Configuration Template
// context-window.config.ts
import { createBudget } from './token-budget';
export interface ContextConfig {
model: string;
maxWindow: number;
chunking: {
maxTokens: number;
overlap: number;
strategy: 'semantic' | 'fixed' | 'markdown';
};
scoring: {
method: 'embedding' | 'recency' | 'hybrid';
weights: { relevance: number; recency: number };
};
composition: {
evictionPolicy: 'priority' | 'fifo' | 'lru';
preserveOutput: boolean;
outputReserveRatio: number;
};
fallback: {
truncateFrom: 'middle' | 'end';
preserveSystem: boolean;
};
}
export const defaultConfig: ContextConfig = {
model: 'anthropic/claude-sonnet-4-20250514',
maxWindow: 128000,
chunking: {
maxTokens: 512,
overlap: 64,
strategy: 'markdown'
},
scoring: {
method: 'hybrid',
weights: { relevance: 0.7, recency: 0.3 }
},
composition: {
evictionPolicy: 'priority',
preserveOutput: true,
outputReserveRatio: 0.12
},
fallback: {
truncateFrom: 'middle',
preserveSystem: true
}
};
// Usage
const budget = createBudget(defaultConfig.maxWindow);
const reservedOutput = Math.floor(defaultConfig.maxWindow * defaultConfig.composition.outputReserveRatio);
budget.reservedOutput = reservedOutput;
Quick Start Guide
- Install tokenizer package:
npm install tiktoken (OpenAI) or npm install @anthropic-ai/sdk (Anthropic) for model-specific token counting.
- Define budget: Create a
TokenBudget instance with your model's max window, system prompt length, and 12β15% output reservation.
- Chunk and score: Split input documents using semantic boundaries, calculate token counts per chunk, and score against the current query or conversation state.
- Compose and validate: Inject scored chunks into a priority queue, compose until the budget is reached, run a final token validation check, then send to the model.