Separate raw messages from structured memory. Raw messages preserve turn-by-turn fidelity. Structured memory extracts constraints, user preferences, and active goals.
Enforce Token Budgeting
Calculate tokens per turn using a tokenizer aligned with the target model. Allocate a fixed budget for history, system instructions, and streaming output. Reject or compress when thresholds are breached.
Implement Context Pruning & Compression
Apply a sliding window for recent turns. Archive older turns into semantic summaries. Use embedding-based similarity to retain context relevant to the current query.
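The embedding-based retention step could look roughly like the sketch below. It assumes archived turns already carry precomputed embeddings from whatever embedding model the application uses; the `ArchivedTurn` type, the `selectRelevantTurns` name, and the `topK` and `minScore` defaults are illustrative, not taken from the implementation in this article.

```typescript
// Hypothetical shape: a summarized older turn plus its precomputed embedding.
interface ArchivedTurn {
  turnId: string;
  summary: string;
  embedding: number[];
}

// Cosine similarity of two equal-length vectors; returns 0 if either is a zero vector.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Keep at most topK archived turns whose similarity to the query is at least minScore,
// ordered from most to least similar.
function selectRelevantTurns(
  queryEmbedding: number[],
  archive: ArchivedTurn[],
  topK = 3,
  minScore = 0.5
): ArchivedTurn[] {
  return archive
    .map(turn => ({ turn, score: cosineSimilarity(queryEmbedding, turn.embedding) }))
    .filter(({ score }) => score >= minScore)
    .sort((x, y) => y.score - x.score)
    .slice(0, topK)
    .map(({ turn }) => turn);
}
```

The threshold and `topK` are tuning knobs; a stricter `minScore` trades recall of older context for a smaller injected memory block.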
Reconcile Streaming State
Stream responses chunk-by-chunk while maintaining a pending state. On completion, persist the full assistant turn. On failure, roll back to the last stable state to prevent corruption.
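A minimal sketch of that pending/commit/rollback cycle follows. The `StreamReconciler` class, the `Turn` type, and the `consumeStream` helper are hypothetical names introduced for illustration, and the stream is assumed to be any `AsyncIterable<string>` of text chunks rather than a specific SDK's stream type.

```typescript
// Hypothetical shape for a persisted turn; not taken from any specific library.
interface Turn {
  role: 'user' | 'assistant';
  content: string;
  turnId: string;
}

class StreamReconciler {
  private committed: Turn[] = [];
  private pending: { turnId: string; chunks: string[] } | null = null;

  // Open a pending turn; nothing is persisted yet.
  begin(turnId: string): void {
    this.pending = { turnId, chunks: [] };
  }

  append(chunk: string): void {
    if (!this.pending) throw new Error('No stream in progress');
    this.pending.chunks.push(chunk);
  }

  // Persist the full assistant turn only once the stream has completed.
  commit(): Turn {
    if (!this.pending) throw new Error('No stream in progress');
    const turn: Turn = {
      role: 'assistant',
      content: this.pending.chunks.join(''),
      turnId: this.pending.turnId,
    };
    this.committed.push(turn);
    this.pending = null;
    return turn;
  }

  // Drop the partial turn; committed history is left untouched.
  rollback(): void {
    this.pending = null;
  }

  history(): readonly Turn[] {
    return this.committed;
  }
}

// Drive a stream through the reconciler: commit on success, roll back on any error.
async function consumeStream(
  reconciler: StreamReconciler,
  turnId: string,
  stream: AsyncIterable<string>
): Promise<Turn | null> {
  reconciler.begin(turnId);
  try {
    for await (const chunk of stream) reconciler.append(chunk);
    return reconciler.commit();
  } catch (err) {
    reconciler.rollback();
    return null;
  }
}
```

The UI can render chunks as they arrive, but only `commit()` changes the history that later turns are built from, so a dropped stream leaves the last stable state intact.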
Add Observability & Fallbacks
Track token consumption, compression ratios, and context retention scores. Implement deterministic fallbacks when state desync occurs.
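One way to make these metrics concrete is sketched below. The definitions are assumptions for illustration: compression ratio is taken as tokens after divided by tokens before, and the retention score as the fraction of required constraint strings still present verbatim in the assembled context. The fallback name mirrors the `rollback_last_committed` value from the configuration template later in this article.

```typescript
interface CompressionMetrics {
  tokensBefore: number;
  tokensAfter: number;
  compressionRatio: number; // tokensAfter / tokensBefore; lower means more aggressive compression
  retentionScore: number;   // fraction of required constraints still present, 0..1
}

// Assumed definitions, not a standard: see the lead-in for what each number means.
function measureCompression(
  tokensBefore: number,
  tokensAfter: number,
  requiredConstraints: string[],
  contextText: string
): CompressionMetrics {
  const kept = requiredConstraints.filter(c => contextText.includes(c)).length;
  return {
    tokensBefore,
    tokensAfter,
    compressionRatio: tokensBefore === 0 ? 1 : tokensAfter / tokensBefore,
    retentionScore: requiredConstraints.length === 0 ? 1 : kept / requiredConstraints.length,
  };
}

// Deterministic fallback: the same metrics always map to the same action.
function chooseFallback(
  metrics: CompressionMetrics,
  minRetention = 1
): 'proceed' | 'rollback_last_committed' {
  return metrics.retentionScore >= minRetention ? 'proceed' : 'rollback_last_committed';
}
```

Substring matching is a crude retention check; it only catches constraints that were dropped outright, not ones that were paraphrased into something weaker.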
TypeScript Implementation
import { Tiktoken, encodingForModel, type TiktokenModel } from 'js-tiktoken';

interface Message {
  role: 'system' | 'user' | 'assistant';
  content: string;
  timestamp: number;
  turnId: string;
}

interface ConversationState {
  sessionId: string;
  messages: Message[];
  memory: Record<string, string>;
  tokenBudget: number;
  usedTokens: number;
  version: number;
}

export class MultiTurnManager {
  private encoder: Tiktoken;
  private readonly MAX_TURNS = 12;
  private readonly COMPRESSION_THRESHOLD = 0.75;

  constructor(private model: string) {
    // encodingForModel() is the js-tiktoken lookup; the cast assumes `model` is a name it recognizes
    this.encoder = encodingForModel(model as TiktokenModel);
  }

  // Token count for one message body, using the model-aligned encoder
  async countTokens(content: string): Promise<number> {
    return this.encoder.encode(content).length;
  }

  async createContext(state: ConversationState, newMessage: Message): Promise<Message[]> {
    const systemPrompt = this.extractSystemPrompt(state);
    const limit = state.tokenBudget * 0.85; // Reserve buffer for assistant response
    let totalTokens = await this.countTokens(systemPrompt);
    const contextWindow: Message[] = [{ role: 'system', content: systemPrompt, timestamp: Date.now(), turnId: 'sys' }];

    // Walk newest-to-oldest so that overflow drops the oldest turns, never the new message
    const candidates = [...state.messages.slice(-this.MAX_TURNS), newMessage].reverse();
    const kept: Message[] = [];
    for (const msg of candidates) {
      const tokens = await this.countTokens(msg.content);
      if (totalTokens + tokens > limit) {
        break;
      }
      kept.unshift(msg); // Restore chronological order
      totalTokens += tokens;
    }
    contextWindow.push(...kept);

    // Inject compressed memory if it still fits the budget
    const memoryContext = this.formatMemory(state.memory);
    if (memoryContext) {
      const memTokens = await this.countTokens(memoryContext);
      if (totalTokens + memTokens < limit) {
        contextWindow.splice(1, 0, { role: 'system', content: memoryContext, timestamp: Date.now(), turnId: 'mem' });
      }
    }
    return contextWindow;
  }

  async updateState(state: ConversationState, assistantResponse: string, turnId: string): Promise<ConversationState> {
    const newState = { ...state, messages: [...state.messages], memory: { ...state.memory }, version: state.version + 1 };
    newState.messages.push({ role: 'assistant', content: assistantResponse, timestamp: Date.now(), turnId });

    // Trigger compression once the state crosses the threshold
    let totalTokens = await this.estimateStateTokens(newState);
    if (totalTokens > state.tokenBudget * this.COMPRESSION_THRESHOLD) {
      // Merge rather than overwrite so existing 'constraint:' entries survive compression
      newState.memory = { ...newState.memory, ...(await this.compressContext(newState.messages)) };
      newState.messages = newState.messages.slice(-6); // Keep recent turns
      totalTokens = await this.estimateStateTokens(newState);
    }
    newState.usedTokens = totalTokens;
    return newState;
  }

  private extractSystemPrompt(state: ConversationState): string {
    const constraints = Object.entries(state.memory)
      .filter(([k]) => k.startsWith('constraint:'))
      .map(([, v]) => v)
      .join('\n');
    return `You are a persistent assistant. Maintain all active constraints:\n${constraints || 'None specified.'}`;
  }

  private formatMemory(memory: Record<string, string>): string | null {
    const entries = Object.entries(memory)
      .filter(([k]) => !k.startsWith('constraint:'))
      .map(([k, v]) => `${k}: ${v}`)
      .join('; ');
    return entries ? `Session Memory: ${entries}` : null;
  }

  private async compressContext(messages: Message[]): Promise<Record<string, string>> {
    const memory: Record<string, string> = {};
    // In production, replace with LLM-based summarization or embedding similarity routing
    const recent = messages.slice(-4);
    memory['summary'] = recent.map(m => `[${m.role}] ${m.content.slice(0, 100)}`).join(' | ');
    memory['active_goals'] = 'Resolve user query while preserving prior constraints.';
    return memory;
  }

  private async estimateStateTokens(state: ConversationState): Promise<number> {
    let total = 0;
    for (const msg of state.messages) total += await this.countTokens(msg.content);
    for (const v of Object.values(state.memory)) total += await this.countTokens(v);
    return total;
  }
}
Architecture Decisions & Rationale
- Decoupled State Store: Conversation state lives outside the LLM API call. This enables rollback, auditability, and multi-model routing without re-architecting the chat flow.
- Token Budgeting Over Hard Limits: Models support large contexts, but performance degrades near the ceiling. Budgeting at 85% reserves headroom for assistant generation and prevents silent truncation.
- Semantic Compression Over Naive Truncation: Archiving older turns into summaries preserves constraints and user preferences while reducing token load. Embedding similarity ensures only relevant memory is injected.
- Streaming State Reconciliation: Pending states prevent UI/backend desync. If a stream fails, the system reverts to the last committed turn, avoiding partial context corruption.
- Versioned State: Incremental versioning enables conflict resolution in distributed deployments and supports optimistic updates with deterministic rollbacks.
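The versioned-state point can be illustrated with a compare-and-set write. This is a minimal in-memory sketch under stated assumptions: `InMemoryStateStore` and `VersionedState` are hypothetical names, and a production store (for example Redis) would need its own atomic primitive to get the same guarantee.

```typescript
interface VersionedState {
  sessionId: string;
  version: number;
  messages: string[];
}

class InMemoryStateStore {
  private states = new Map<string, VersionedState>();

  get(sessionId: string): VersionedState | undefined {
    return this.states.get(sessionId);
  }

  // Write only if the caller based its update on the latest stored version.
  // A session that has never been written is treated as version 0.
  compareAndSet(expectedVersion: number, next: VersionedState): boolean {
    const current = this.states.get(next.sessionId);
    const currentVersion = current ? current.version : 0;
    if (currentVersion !== expectedVersion) return false; // stale writer loses
    this.states.set(next.sessionId, { ...next, version: expectedVersion + 1 });
    return true;
  }
}
```

A writer that gets `false` back can reload the state, reapply its change, and retry, which is what makes optimistic updates safe when two workers handle the same session.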
Pitfall Guide
1. Blind History Accumulation
Appending every turn without pruning causes context dilution. Models attend less to earlier tokens, and instruction drift becomes inevitable. Production systems must enforce sliding windows and active compression.
2. Ignoring Token Budget Allocation
Treating the context window as a free pool leads to API errors or silent truncation. Reserve 15–20% for assistant generation, system prompts, and memory injection. Calculate tokens per turn, not per session.
3. State Leakage Across Sessions
Reusing memory objects or failing to isolate session IDs causes cross-contamination. Users receive responses tailored to other conversations. Implement strict session boundaries and cryptographic session tokens.
4. Over-Compression Losing Constraints
Aggressive summarization discards negative constraints ("do not use Python", "avoid financial advice"). Always separate constraints from factual summaries. Inject constraints into the system prompt regardless of compression.
5. Streaming State Desync
Displaying partial tokens while the backend tracks full turns creates state mismatches. If the stream drops, the UI shows incomplete context while the backend expects a full turn. Commit assistant state only after stream completion or explicit user acknowledgment.
6. Assuming Positional Neutrality
LLMs are highly sensitive to token position. Critical instructions placed in compressed memory or buried in long history lose weight. Place active constraints at the top of the context window. Repeat non-negotiable rules in the system prompt.
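Pitfalls 4 and 6 can be addressed together, as in the sketch below: constraints are carried through compression untouched and then placed at the top of the system prompt. It reuses the `constraint:` key prefix from the implementation above; the function names and the prompt wording are illustrative assumptions.

```typescript
type Memory = Record<string, string>;

// Carry every 'constraint:' entry forward unchanged; everything else is
// replaced by the new summary.
function compressPreservingConstraints(memory: Memory, newSummary: string): Memory {
  const next: Memory = {};
  for (const key of Object.keys(memory)) {
    if (key.startsWith('constraint:')) next[key] = memory[key];
  }
  next['summary'] = newSummary;
  return next;
}

// Constraints go first in the system prompt; the summary follows them.
function buildSystemPrompt(memory: Memory): string {
  const constraints = Object.keys(memory)
    .filter(key => key.startsWith('constraint:'))
    .map(key => `- ${memory[key]}`);
  const header = constraints.length > 0
    ? `Active constraints:\n${constraints.join('\n')}`
    : 'Active constraints: none';
  const summary = memory['summary'] ? `\n\nConversation summary: ${memory['summary']}` : '';
  return header + summary;
}
```

Because the constraint entries never pass through the summarizer, a negative rule such as "do not use Python" cannot be paraphrased away, however aggressive the compression is.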
Production Best Practices
- Token-count every message before injection
- Maintain separate tracks for raw logs, active context, and compressed memory
- Use idempotent turn IDs for retry safety
- Log compression ratios and context retention scores for tuning
- Implement circuit breakers when token spend exceeds thresholds per session
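Two of these practices, idempotent turn IDs and a per-session circuit breaker, fit in one small ledger. The sketch below is hypothetical: `TurnLedger`, its return values, and the idea of a single token ceiling per session are assumptions chosen for illustration.

```typescript
type RecordResult = 'applied' | 'duplicate' | 'circuit_open';

class TurnLedger {
  private applied = new Set<string>();
  private spentTokens = 0;
  private maxSessionTokens: number;

  constructor(maxSessionTokens: number) {
    this.maxSessionTokens = maxSessionTokens;
  }

  // Record a turn exactly once. Retries with the same turnId are reported as
  // duplicates and do not consume budget; turns that would push the session
  // past its ceiling are refused.
  record(turnId: string, tokens: number): RecordResult {
    if (this.applied.has(turnId)) return 'duplicate';
    if (this.spentTokens + tokens > this.maxSessionTokens) return 'circuit_open';
    this.applied.add(turnId);
    this.spentTokens += tokens;
    return 'applied';
  }

  get spent(): number {
    return this.spentTokens;
  }
}
```

Checking for duplicates before checking the budget means a retried request is never mistaken for new spend, even when the session is already close to its ceiling.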
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Customer Support Chat | Sliding Window + Constraint Injection | High constraint sensitivity, short resolution cycles | -32% vs naive |
| Code Generation Assistant | Structured Memory + Semantic Compression | Requires persistent context, tool state, and file references | -48% vs naive |
| Creative Writing / Roleplay | Naive Appending (Limited Turns) | Narrative flow benefits from full history, low constraint density | +15% vs baseline |
| Enterprise Knowledge Retrieval | Structured Memory + RAG Overlay | Factual accuracy requires external grounding, not raw history | -41% vs naive |
| Multi-Agent Orchestration | Event-Sourced State + Versioned Context | Deterministic replay, audit trails, and agent handoff require strict state | -28% vs naive |
Configuration Template
// conversation.config.ts
export const ConversationConfig = {
  model: 'gpt-4o',
  tokenBudget: 12000,
  maxRecentTurns: 8,
  compressionThreshold: 0.75,
  streamingBufferSize: 128,
  stateBackend: {
    type: 'redis',
    ttl: 3600, // 1 hour session expiry
    keyPrefix: 'conv:',
    serialization: 'json'
  },
  observability: {
    trackTokenSpend: true,
    trackCompressionRatio: true,
    alertOnBudgetExceed: true,
    logContextRetention: true
  },
  fallback: {
    onStateDesync: 'rollback_last_committed',
    onTokenOverflow: 'compress_and_retry',
    onStreamFailure: 'discard_pending_and_notify'
  }
};
Quick Start Guide
- Install dependencies:
npm install js-tiktoken ioredis uuid
- Initialize the manager:
import { v4 as uuidv4 } from 'uuid';
const manager = new MultiTurnManager('gpt-4o');
- Create initial state:
const state = { sessionId: uuidv4(), messages: [], memory: {}, tokenBudget: 12000, usedTokens: 0, version: 0 };
- Process first turn:
const userMessage = { role: 'user' as const, content: 'Hello', timestamp: Date.now(), turnId: uuidv4() };
const context = await manager.createContext(state, userMessage);
- Stream response and update:
const response = await callLLM(context); // callLLM is a placeholder for your own model client
// updateState only appends the assistant turn, so add the user message to the history first
const newState = await manager.updateState({ ...state, messages: [...state.messages, userMessage] }, response, uuidv4());
Run this flow in a loop. Monitor token spend and compression ratios. Adjust tokenBudget and maxRecentTurns based on workload constraints. The system will maintain context fidelity while controlling latency and cost.