requires a deterministic pipeline, not a probabilistic prompt. The following architecture implements a MapReduce summarization engine with semantic chunking, parallel processing, schema validation, and fallback routing.
Step 1: Semantic Chunking
Fixed-character splitting destroys technical context. Code, markdown, and structured logs require boundary-aware segmentation. Use a recursive splitter that respects markdown headers, code fences, and paragraph breaks.
import { randomUUID } from 'node:crypto';

interface Chunk {
  id: string;
  content: string;
  metadata: { heading: string; type: 'code' | 'text' | 'log' };
}

// estimateTokens() and detectType() are helpers assumed to be defined elsewhere.
function semanticChunk(text: string, maxTokens: number = 2000): Chunk[] {
  const chunks: Chunk[] = [];
  // Split before markdown headings, code fences, and blank lines.
  const segments = text.split(/(?=^#{1,6}\s|```|\n\n)/m);
  let current = '';
  let currentHeading = 'Root'; // most recent heading seen
  let chunkHeading = 'Root'; // heading in effect where `current` begins

  const flush = () => {
    if (!current.trim()) return;
    chunks.push({
      id: randomUUID(),
      content: current.trim(),
      metadata: { heading: chunkHeading, type: detectType(current) }
    });
    current = '';
  };

  for (const segment of segments) {
    if (/^#{1,6}\s/.test(segment)) {
      // Keep only the heading line, not the text that follows it.
      currentHeading = segment.split('\n')[0].trim();
    }
    // Close the current chunk before it would exceed the token budget.
    if (current.trim() && estimateTokens(current + segment) > maxTokens) {
      flush();
    }
    // Label each chunk with the heading in effect at its first segment,
    // not with a heading that only appears in the following chunk.
    if (!current.trim()) chunkHeading = currentHeading;
    current += segment;
  }
  flush();
  return chunks;
}
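The chunker calls estimateTokens and detectType, which the text does not define. A minimal sketch of both follows, assuming a rough four-characters-per-token heuristic and simple pattern checks. Both helpers are illustrative assumptions rather than the author's implementations, and a real tokenizer would give exact counts.

```typescript
type ChunkType = 'code' | 'text' | 'log';

// Built at runtime so the fence marker never appears literally in this snippet.
const FENCE_MARKER = '`'.repeat(3);

// Assumption: roughly 4 characters per token for English prose and code.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Assumption: a chunk is "code" if it contains a fence marker, "log" if at
// least half of its non-empty lines start with a timestamp, otherwise "text".
function detectType(text: string): ChunkType {
  if (text.includes(FENCE_MARKER)) return 'code';
  const lines = text.split('\n').filter(line => line.trim().length > 0);
  if (lines.length === 0) return 'text';
  const timestamp = /^\[?\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}/;
  const logLines = lines.filter(line => timestamp.test(line.trim())).length;
  return logLines / lines.length >= 0.5 ? 'log' : 'text';
}
```

The character-based estimate tends to be imprecise for dense code or non-English text, so the maxTokens limit should leave some headroom.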
Step 2: Parallel Map Phase
Process chunks concurrently. Inject cross-chunk context hints to preserve global coherence without bloating individual prompts.
import { OpenAI } from 'openai';
import { z } from 'zod';

const SummarySchema = z.object({
  key_points: z.array(z.string()).min(1).max(5),
  technical_decisions: z.array(z.string()).optional(),
  risks_or_caveats: z.array(z.string()).optional(),
  action_items: z.array(z.string()).optional()
});

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function summarizeChunk(chunk: Chunk, globalContext: string): Promise<z.infer<typeof SummarySchema>> {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      {
        role: 'system',
        content: `You are a technical summarization engine. Extract structured insights. Never invent APIs, parameters, or logic not present in the source. Respond with a JSON object containing key_points (1 to 5 strings) and, where relevant, technical_decisions, risks_or_caveats, and action_items (arrays of strings).`
      },
      {
        role: 'user',
        content: `Global Context: ${globalContext}\n\nChunk Heading: ${chunk.metadata.heading}\nContent:\n${chunk.content}`
      }
    ],
    // A response_format of type 'json_schema' expects a json_schema object holding
    // a name and a plain JSON Schema, not a Zod object. This version uses JSON mode
    // instead and enforces the shape with Zod after parsing.
    response_format: { type: 'json_object' },
    temperature: 0.1,
    max_tokens: 500
  });

  // JSON.parse throws if the output was truncated at max_tokens; the caller handles the rejection.
  const parsed = SummarySchema.safeParse(JSON.parse(response.choices[0].message.content || '{}'));
  if (!parsed.success) {
    throw new Error(`Schema validation failed: ${parsed.error.message}`);
  }
  return parsed.data;
}
Step 3: Reduce Phase
Merge partial summaries. The reduce function must de-duplicate, resolve contradictions, and enforce a final output schema.
async function reduceSummaries(partialSummaries: z.infer<typeof SummarySchema>[]): Promise<z.infer<typeof SummarySchema>> {
  // First pass: exact-match de-duplication across all partial summaries.
  const merged = {
    key_points: [...new Set(partialSummaries.flatMap(s => s.key_points))],
    technical_decisions: [...new Set(partialSummaries.flatMap(s => s.technical_decisions || []))],
    risks_or_caveats: [...new Set(partialSummaries.flatMap(s => s.risks_or_caveats || []))],
    action_items: [...new Set(partialSummaries.flatMap(s => s.action_items || []))]
  };

  // Second-pass refinement for coherence and contradiction resolution
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      { role: 'system', content: 'Consolidate technical summaries. Remove duplicates. Resolve contradictions. Preserve factual accuracy. Respond with a JSON object that uses the same keys as the input, with at most 5 key_points.' },
      { role: 'user', content: JSON.stringify(merged, null, 2) }
    ],
    // A response_format of type 'json_schema' expects a json_schema object holding
    // a name and a plain JSON Schema, not a Zod object. This version uses JSON mode;
    // the Zod check below enforces the final shape.
    response_format: { type: 'json_object' },
    temperature: 0.0,
    max_tokens: 600
  });

  const final = SummarySchema.safeParse(JSON.parse(response.choices[0].message.content || '{}'));
  if (!final.success) throw new Error(`Reduce validation failed: ${final.error.message}`);
  return final.data;
}
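The Set-based merge in the reduce function only removes exact string duplicates, so two key points that differ in case, spacing, or trailing punctuation both reach the second pass. A minimal sketch of a normalizing de-duplication step follows. dedupeNormalized and normalizeKey are hypothetical helpers, and the normalization rules are an assumption about what should count as "the same" item.

```typescript
// Assumption: items that differ only in case, whitespace, or trailing
// punctuation are treated as duplicates.
function normalizeKey(item: string): string {
  return item
    .trim()
    .toLowerCase()
    .replace(/\s+/g, ' ')
    .replace(/[.;:,!?]+$/, '');
}

// Keeps the first occurrence of each normalized item, preserving input order.
function dedupeNormalized(items: string[]): string[] {
  const seen = new Set<string>();
  const unique: string[] = [];
  for (const item of items) {
    const key = normalizeKey(item);
    if (key === '' || seen.has(key)) continue;
    seen.add(key);
    unique.push(item.trim());
  }
  return unique;
}
```

Running this before the LLM refinement shortens the reduce prompt; paraphrased duplicates still need the second pass.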
Step 4: Orchestration & Fallback
Combine phases with concurrency limits, caching, and fallback routing for latency-critical paths.
export async function summarizeDocument(text: string, globalContext: string = '') {
  const chunks = semanticChunk(text, 2000);
  // Parallel processing with concurrency control: at most `concurrency` calls per batch.
  const concurrency = 5;
  const partialSummaries: z.infer<typeof SummarySchema>[] = [];
  for (let i = 0; i < chunks.length; i += concurrency) {
    const batch = chunks.slice(i, i + concurrency);
    const results = await Promise.allSettled(
      batch.map(chunk => summarizeChunk(chunk, globalContext))
    );
    results.forEach((r, j) => {
      if (r.status === 'fulfilled') {
        partialSummaries.push(r.value);
      } else {
        // Surface failed chunks instead of dropping them silently.
        console.warn(`Chunk ${batch[j].id} failed:`, r.reason);
      }
    });
  }
  if (partialSummaries.length === 0) {
    throw new Error('No valid summaries generated');
  }
  return reduceSummaries(partialSummaries);
}
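The batch loop in summarizeDocument waits for the slowest call in each batch before starting the next batch. A worker-pool variant keeps up to a fixed number of calls in flight at all times. The sketch below assumes results should come back in input order and that failures should be recorded rather than thrown; mapWithConcurrency and the Settled type are hypothetical names.

```typescript
type Settled<R> =
  | { status: 'fulfilled'; value: R }
  | { status: 'rejected'; reason: unknown };

async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>
): Promise<Settled<R>[]> {
  const results: Settled<R>[] = new Array(items.length);
  let next = 0;

  // Each worker pulls the next unclaimed index until the list is exhausted.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++; // claimed synchronously, so no two workers share an index
      try {
        results[i] = { status: 'fulfilled', value: await fn(items[i], i) };
      } catch (reason) {
        results[i] = { status: 'rejected', reason };
      }
    }
  }

  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

With this helper, the loop body could become a single call such as mapWithConcurrency(chunks, 5, chunk => summarizeChunk(chunk, globalContext)), followed by the same fulfilled/rejected handling.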
Architecture Decisions & Rationale
- MapReduce over sliding window: Sliding windows create redundant token consumption and positional drift. MapReduce isolates context boundaries, enabling parallelism and deterministic scaling.
- Schema-enforced output: Validating every response against a schema rejects structurally invalid output instead of passing it downstream. It forces the model's output to conform to engineering expectations (arrays, enums, bounded lengths).
- Low temperature + zero creativity: Summarization is extraction and compression, not generation. Temperature ≤ 0.1 minimizes variance. Top-p is left at its default so that temperature is the only sampling parameter being tuned.
- Fallback routing: If latency SLA is <500ms, route to extractive summarization (TF-IDF + sentence scoring) or cached semantic hashes. LLM fallback only triggers when accuracy thresholds drop below BERTScore 0.82.
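The fallback rule above can be sketched as a small routing function. This is one reading of the rule, under the assumptions that a cache hit always wins, that requests with a budget under 500 ms go to the extractive path, and that the LLM is used for latency-critical requests only when the extractive path's measured score falls below 0.82. chooseRoute and its input fields are hypothetical names.

```typescript
type Route = 'cache' | 'extractive' | 'llm';

interface RouteInput {
  latencyBudgetMs: number;   // latency SLA for this request
  cacheHit: boolean;         // whether a cached summary is available
  extractiveScore?: number;  // last measured quality of the extractive path (0..1), if known
}

const LATENCY_SLA_MS = 500;  // threshold taken from the text
const QUALITY_FLOOR = 0.82;  // BERTScore threshold taken from the text

function chooseRoute(input: RouteInput): Route {
  if (input.cacheHit) return 'cache';
  const latencyCritical = input.latencyBudgetMs < LATENCY_SLA_MS;
  if (!latencyCritical) return 'llm';
  // Latency-critical path: stay extractive unless its quality is known to be too low.
  const qualityOk =
    input.extractiveScore === undefined || input.extractiveScore >= QUALITY_FLOOR;
  return qualityOk ? 'extractive' : 'llm';
}
```

Keeping the decision in one pure function makes the thresholds easy to test and to adjust per use case.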
Pitfall Guide
1. Fixed-Character Chunking on Technical Content
Splitting at arbitrary token or character boundaries severs code blocks, markdown tables, and log entries. The model receives syntactically invalid fragments, triggering hallucination or silent omission.
Fix: Use recursive semantic splitters that respect markdown headers, code fences, and paragraph boundaries. Validate chunk integrity before processing.
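One simple integrity check is to confirm that no chunk contains an unclosed code fence. The sketch below assumes fences are lines that begin with three backticks; hasBalancedFences and findBrokenChunks are hypothetical helper names, and a fuller check would also cover tables and multi-line log entries.

```typescript
// Built at runtime so the fence marker never appears literally in this snippet.
const FENCE = '`'.repeat(3);

// Counts lines that open or close a fenced code block.
function countFenceLines(content: string): number {
  return content.split('\n').filter(line => line.trim().startsWith(FENCE)).length;
}

// A chunk with an odd number of fence lines starts or ends inside a code block.
function hasBalancedFences(content: string): boolean {
  return countFenceLines(content) % 2 === 0;
}

// Returns the ids of chunks that should be re-split or merged with a neighbor.
function findBrokenChunks(chunks: { id: string; content: string }[]): string[] {
  return chunks.filter(chunk => !hasBalancedFences(chunk.content)).map(chunk => chunk.id);
}
```

Chunks flagged here can be merged with the next chunk before the map phase, at the cost of occasionally exceeding the token target.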
2. Ignoring Cross-Chunk Context
Processing chunks in isolation loses global architecture, naming conventions, and system boundaries. The reduce phase cannot reconstruct what was never preserved.
Fix: Inject a lightweight global context header (project name, tech stack, core modules) into every chunk prompt. Use document embeddings to retrieve relevant cross-references when chunking long RFCs.
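A global context header of this kind can be built once per document and reused for every chunk. The sketch below assumes the three fields named in the fix and a character cap to keep the header small; ProjectContext and buildGlobalContext are hypothetical names.

```typescript
interface ProjectContext {
  projectName: string;
  techStack: string[];
  coreModules: string[];
}

// Builds a short, fixed-format header and truncates it to maxChars.
function buildGlobalContext(ctx: ProjectContext, maxChars: number = 400): string {
  const header = [
    `Project: ${ctx.projectName}`,
    `Stack: ${ctx.techStack.join(', ')}`,
    `Core modules: ${ctx.coreModules.join(', ')}`
  ].join('\n');
  return header.length > maxChars ? header.slice(0, maxChars - 3) + '...' : header;
}
```

The returned string can be passed as the globalContext argument of summarizeDocument.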
3. No Evaluation Pipeline
Assuming "reads well" equals "is accurate" guarantees production failures. LLMs optimize for fluency, not fidelity.
Fix: Implement automated BERTScore/ROUGE-L monitoring on a golden dataset. Set circuit breakers: if F1 drops below 0.82, route to human review or fallback summarizer. Log hallucination patterns for prompt refinement.
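The ROUGE-L half of this check is small enough to sketch directly. The version below computes an equal-weight F1 from the longest common subsequence of lowercase word tokens, which is a simplification of the full metric, and trips a breaker when the mean score over a batch falls below the threshold. BERTScore needs an embedding model and is not shown; all function names here are hypothetical.

```typescript
function tokenizeWords(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(token => token.length > 0);
}

// Length of the longest common subsequence, using two rolling rows.
function lcsLength(a: string[], b: string[]): number {
  let prev: number[] = new Array(b.length + 1).fill(0);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = new Array(b.length + 1).fill(0);
    for (let j = 1; j <= b.length; j++) {
      curr[j] = a[i - 1] === b[j - 1] ? prev[j - 1] + 1 : Math.max(prev[j], curr[j - 1]);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Equal-weight F1 of LCS-based precision and recall.
function rougeL(candidate: string, reference: string): number {
  const cand = tokenizeWords(candidate);
  const ref = tokenizeWords(reference);
  const lcs = lcsLength(cand, ref);
  if (lcs === 0) return 0;
  const precision = lcs / cand.length;
  const recall = lcs / ref.length;
  return (2 * precision * recall) / (precision + recall);
}

// Circuit breaker: true when the mean score over the batch is below the threshold.
function shouldTripBreaker(scores: number[], threshold: number = 0.82): boolean {
  if (scores.length === 0) return false;
  const mean = scores.reduce((sum, score) => sum + score, 0) / scores.length;
  return mean < threshold;
}
```

In practice the scores would come from comparing pipeline output against the golden dataset on a schedule, not on every request.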
4. Unbounded Token Consumption
Cost scales linearly with document length at best. Retry logic, verbose prompts, and missing max_tokens limits multiply that cost on every call.
Fix: Enforce strict max_tokens per phase. Use streaming for UI, but batch for backend. Cache semantic hashes of identical documents to skip redundant LLM calls.
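A worst-case bound on completion tokens follows directly from the per-phase limits. The sketch below assumes one map call per chunk, a single reduce call, and that every call may be retried up to a fixed number of times; it ignores prompt tokens. BudgetInput and worstCaseCompletionTokens are hypothetical names.

```typescript
interface BudgetInput {
  chunkCount: number;       // number of chunks in the map phase
  mapMaxTokens: number;     // max_tokens per map call
  reduceMaxTokens: number;  // max_tokens for the reduce call
  maxRetries: number;       // retries allowed per call
}

// Upper bound on completion tokens if every call uses its full limit and all retries.
function worstCaseCompletionTokens(budget: BudgetInput): number {
  const attemptsPerCall = 1 + budget.maxRetries;
  const singlePass = budget.chunkCount * budget.mapMaxTokens + budget.reduceMaxTokens;
  return singlePass * attemptsPerCall;
}

// Example: a 50,000-token document split into 2,000-token chunks gives 25 chunks.
const exampleBudget = worstCaseCompletionTokens({
  chunkCount: 25,
  mapMaxTokens: 500,
  reduceMaxTokens: 600,
  maxRetries: 2
});
```

For that example the bound is (25 × 500 + 600) × 3 = 39,300 completion tokens, against 13,100 with no retries, which shows how quickly retries dominate the budget.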
5. Prompt Injection via Source Content
Technical documents often contain markdown, code, or log data that mimics prompt syntax. Unsanitized input can override system instructions.
Fix: Wrap source content in explicit delimiters. Strip or escape XML-like tags, markdown directives, and control characters before injection. Validate output against schema before trusting downstream systems.
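The delimiter-and-escape step might look like the sketch below. It assumes an XML-style wrapper tag (the tag name is arbitrary) and escapes every angle bracket so the source cannot close the wrapper early; this also alters legitimate code such as generics, which is usually acceptable for summarization but is a trade-off. sanitizeSource and wrapSource are hypothetical names.

```typescript
const SOURCE_OPEN = '<source_document>';
const SOURCE_CLOSE = '</source_document>';

function sanitizeSource(raw: string): string {
  return raw
    // Remove control characters, keeping tab, newline, and carriage return.
    .replace(/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/g, '')
    // Escape angle brackets so the content cannot open or close the wrapper tags.
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Wraps sanitized content in explicit delimiters for use in the user message.
function wrapSource(raw: string): string {
  return `${SOURCE_OPEN}\n${sanitizeSource(raw)}\n${SOURCE_CLOSE}`;
}
```

The system prompt can then state that everything inside the wrapper is data to summarize, never instructions to follow.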
6. Over-Optimizing for Accuracy at the Expense of Latency
Hierarchical agentic workflows achieve +3% accuracy but add 2-3x latency. For real-time PR reviews or chat integrations, this breaks UX.
Fix: Implement tiered routing. Use MapReduce for async documentation. Use extractive or cached summaries for interactive flows. Define latency budgets per use case.
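For the interactive tier, an extractive summarizer can run without any model call. The sketch below scores sentences by a TF-IDF-style weight, treating each sentence as a document, and returns the top sentences in their original order. The tokenization, the length normalization, and the function names are all assumptions chosen for brevity.

```typescript
function splitSentences(text: string): string[] {
  const matches = text.match(/[^.!?]+[.!?]*/g) ?? [];
  return matches.map(s => s.trim()).filter(s => s.length > 0);
}

// Lowercase word tokens; very short words are dropped as a crude stop-word filter.
function contentWords(sentence: string): string[] {
  return sentence.toLowerCase().split(/[^a-z0-9_]+/).filter(w => w.length > 2);
}

function extractiveSummary(text: string, maxSentences: number = 3): string[] {
  const sentences = splitSentences(text);
  const tokenized = sentences.map(contentWords);

  // Document frequency: how many sentences contain each term.
  const df = new Map<string, number>();
  tokenized.forEach(tokens => {
    Array.from(new Set(tokens)).forEach(term => df.set(term, (df.get(term) ?? 0) + 1));
  });

  const scored = tokenized.map((tokens, index) => {
    const tf = new Map<string, number>();
    tokens.forEach(term => tf.set(term, (tf.get(term) ?? 0) + 1));
    let score = 0;
    tf.forEach((count, term) => {
      score += count * Math.log(1 + sentences.length / (df.get(term) ?? 1));
    });
    // Divide by sqrt(length) so long sentences are not favored for length alone.
    return { index, score: tokens.length > 0 ? score / Math.sqrt(tokens.length) : 0 };
  });

  return scored
    .sort((a, b) => b.score - a.score || a.index - b.index)
    .slice(0, maxSentences)
    .sort((a, b) => a.index - b.index) // restore document order
    .map(entry => sentences[entry.index]);
}
```

Because it only selects existing sentences, this path cannot invent content, which is the property that makes it a reasonable fallback when latency rules out an LLM call.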
7. Caching Identical Prompts, Not Semantics
Hashing the raw prompt string misses near-duplicates. Slightly rephrased RFCs or updated logs trigger redundant LLM calls.
Fix: Cache at the semantic level. Generate embeddings for input chunks, use cosine similarity thresholds (≥0.92), and return cached summaries with freshness timestamps.
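The lookup logic for such a cache can be sketched in memory. The version below assumes embeddings are computed elsewhere and passed in as number arrays, uses a linear scan (a vector index would replace it at scale), and applies the 0.92 threshold and a 168-hour TTL. SemanticCache and its methods are hypothetical names.

```typescript
interface CacheEntry {
  embedding: number[];
  summary: string;
  createdAt: number; // epoch milliseconds, returned to the caller as a freshness timestamp
}

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('Embedding dimensions differ');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

class SemanticCache {
  private entries: CacheEntry[] = [];
  private threshold: number;
  private ttlMs: number;

  constructor(threshold: number = 0.92, ttlMs: number = 168 * 60 * 60 * 1000) {
    this.threshold = threshold;
    this.ttlMs = ttlMs;
  }

  set(embedding: number[], summary: string, now: number = Date.now()): void {
    this.entries.push({ embedding, summary, createdAt: now });
  }

  // Returns the most similar unexpired entry at or above the threshold, if any.
  get(embedding: number[], now: number = Date.now()): CacheEntry | undefined {
    let best: CacheEntry | undefined;
    let bestScore = this.threshold;
    for (const entry of this.entries) {
      if (now - entry.createdAt > this.ttlMs) continue;
      const score = cosineSimilarity(embedding, entry.embedding);
      if (score >= bestScore) {
        best = entry;
        bestScore = score;
      }
    }
    return best;
  }
}
```

A threshold this high trades hit rate for safety: an updated log with materially different content should miss the cache, while a lightly reworded RFC should hit it.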
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Real-time PR review (<500ms SLA) | Extractive + Semantic Cache | LLM latency violates UX constraints; extractive preserves key diffs at near-zero cost | -$0.012 per request |
| Weekly RFC digestion (async) | MapReduce LLM Pipeline | Balances accuracy (0.89 F1) with parallel throughput; handles 50k+ tokens reliably | +$0.032 per 10k tokens |
| Legacy codebase migration docs | Hierarchical Agentic | Requires multi-pass reasoning to resolve deprecated patterns and cross-module dependencies | +$0.061 per 10k tokens |
| Compliance/Legal audit logs | Rule-Based + LLM Verification | Zero hallucination tolerance; LLM only validates extracted entities against schema | -$0.008 per request |
Configuration Template
// summarizer.config.ts
export const summarizerConfig = {
  chunking: {
    maxTokens: 2000,
    overlapTokens: 150,
    respectBoundaries: ['markdown', 'code', 'log'],
    minChunkSize: 100
  },
  llm: {
    model: 'gpt-4o-mini',
    temperature: 0.1,
    maxTokens: 500,
    responseFormat: 'json_schema',
    concurrency: 5,
    timeoutMs: 8000
  },
  evaluation: {
    bertscoreThreshold: 0.82,
    fallbackTrigger: 'accuracy_drop',
    goldenDatasetPath: './data/golden_summaries.json'
  },
  caching: {
    strategy: 'semantic',
    embeddingModel: 'text-embedding-3-small',
    similarityThreshold: 0.92,
    ttlHours: 168,
    provider: 'redis'
  },
  fallback: {
    latencySLA: 500,
    strategy: 'extractive_tfidf',
    enabled: true
  }
};
Quick Start Guide
- Install dependencies: npm install openai zod @anthropic-ai/sdk redis ioredis
- Configure environment variables: OPENAI_API_KEY, REDIS_URL, GOLDEN_DATASET_PATH
- Initialize the pipeline: Import summarizeDocument from the core solution, then pass raw text and optional global context.
- Wire caching: Connect Redis semantic cache with cosine similarity threshold. Set TTL to 168 hours for documentation.
- Deploy with monitoring: Attach BERTScore evaluation to every LLM response. Route to fallback if F1 < 0.82. Validate latency p95 stays within SLA.