sand documents remains competitive because structured output LLMs require fewer tokens than free-text generation, and modern providers optimize JSON schema enforcement efficiently.
Core Solution
Implementing AI-powered data extraction requires a disciplined architecture that treats LLMs as constrained generators, not open-ended chatbots. The pipeline must enforce schema validation, handle OCR noise, manage token limits, and provide graceful degradation.
Step 1: Schema Definition
Use Zod to declare the expected output structure. Zod provides runtime validation, type inference, and conversion to JSON Schema (via zod-to-json-schema) for LLM constraint enforcement.
import { z } from 'zod';

export const InvoiceSchema = z.object({
  invoiceNumber: z.string().regex(/^\d{3,10}$/),
  date: z.string().date(),
  vendorName: z.string().min(2),
  lineItems: z.array(z.object({
    description: z.string(),
    quantity: z.number().int().positive(),
    unitPrice: z.number().nonnegative(),
    total: z.number().nonnegative()
  })),
  subtotal: z.number().nonnegative(),
  tax: z.number().nonnegative(),
  total: z.number().nonnegative()
});

export type Invoice = z.infer<typeof InvoiceSchema>;
Step 2: Document Preprocessing
Raw OCR output contains noise: page numbers, headers, footers, and layout artifacts. Clean the text before passing it to the LLM. Remove repeated patterns, normalize whitespace, and preserve tabular structure where possible.
import { createWorker } from 'tesseract.js';

// Note: tesseract.js recognizes images, not PDF files directly, so the path passed
// here is assumed to point at a page that has already been rasterized to an image.
async function extractTextFromPdf(pagePath: string): Promise<string> {
  const worker = await createWorker('eng');
  try {
    const { data: { text } } = await worker.recognize(pagePath);
    // Normalize noise
    return text
      .replace(/\r\n/g, '\n')
      // Drop boilerplate lines first, then collapse the blank lines they leave behind
      .replace(/^(Page \d+ of \d+|Confidential|Internal Use)$/gim, '')
      .replace(/\n{3,}/g, '\n\n')
      .trim();
  } finally {
    // Release the worker even if recognition throws
    await worker.terminate();
  }
}
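The "remove repeated patterns" step can also be done without hard-coding phrases. The following is a minimal sketch, assuming OCR has been run page by page so the text arrives as one string per page. The function name `stripRepeatedLines` and the `minShare` parameter are hypothetical, not part of any library.

```typescript
// Hypothetical helper: removes lines that recur on most pages (likely headers/footers).
// Assumes one OCR string per page; matching is exact after trimming whitespace.
function stripRepeatedLines(pages: string[], minShare = 0.6): string[] {
  // Count on how many pages each distinct non-empty line appears.
  const pageCounts = new Map<string, number>();
  pages.forEach(page => {
    const seen = new Set<string>();
    page.split('\n').forEach(raw => {
      const line = raw.trim();
      if (line.length > 0) seen.add(line);
    });
    seen.forEach(line => pageCounts.set(line, (pageCounts.get(line) ?? 0) + 1));
  });

  // A line counts as boilerplate if it appears on at least `minShare` of the pages,
  // and never on fewer than two pages.
  const threshold = Math.max(2, Math.ceil(pages.length * minShare));

  return pages.map(page =>
    page
      .split('\n')
      .filter(raw => (pageCounts.get(raw.trim()) ?? 0) < threshold)
      .join('\n')
      .trim()
  );
}
```

Because matching is exact, per-page markers such as "Page 2 of 5" are not caught here and still need a pattern like the one in the function above. A genuine line item that repeats verbatim on most pages would be dropped, so the threshold is worth tuning against real documents.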
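Step 3: LLM Extraction with Structured Output
Modern LLM providers support native JSON schema enforcement. This sharply reduces post-generation parsing errors and constrains the output to the declared shape, though runtime validation is still needed as a second line of defense.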
import OpenAI from 'openai';
import { z } from 'zod';
import { zodToJsonSchema } from 'zod-to-json-schema';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Generic over the schema's output type so callers get back exactly what the schema validates
export async function extractInvoiceData<T>(
  rawText: string,
  schema: z.ZodType<T>
): Promise<T> {
  const jsonSchema = zodToJsonSchema(schema, 'InvoiceSchema');
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      {
        role: 'system',
        content: 'Extract structured data from the provided document text. Return only valid JSON matching the specified schema. Do not include explanations or markdown formatting.'
      },
      {
        role: 'user',
        content: rawText
      }
    ],
    response_format: {
      type: 'json_schema',
      json_schema: {
        name: 'InvoiceSchema',
        schema: jsonSchema,
        strict: true
      }
    },
    temperature: 0.1,
    max_tokens: 2048
  });
  const extracted = JSON.parse(response.choices[0].message.content || '{}');
  return schema.parse(extracted); // Runtime validation
}
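The `content || '{}'` fallback above turns an empty or truncated response into a schema validation error, which hides the real cause. A minimal sketch of a guard that inspects the completion first is shown here. `ChoiceLike`, `ParseOutcome`, and `parseChoice` are hypothetical names, and the interface models only the fields the guard reads (`finish_reason`, `message.content`, `message.refusal`).

```typescript
// Minimal shape of the fields read from a chat completion choice (assumed subset).
interface ChoiceLike {
  finish_reason: string | null;
  message: { content: string | null; refusal?: string | null };
}

type ParseOutcome =
  | { ok: true; value: unknown }
  | { ok: false; reason: 'truncated' | 'refused' | 'empty' | 'invalid_json' };

// Hypothetical guard: classifies why a completion cannot be used before any schema check runs.
function parseChoice(choice: ChoiceLike): ParseOutcome {
  // 'length' means generation stopped at the token limit, so the JSON is likely cut off.
  if (choice.finish_reason === 'length') return { ok: false, reason: 'truncated' };
  // A refusal message means no structured payload was produced.
  if (choice.message.refusal) return { ok: false, reason: 'refused' };
  const content = choice.message.content;
  if (!content) return { ok: false, reason: 'empty' };
  try {
    return { ok: true, value: JSON.parse(content) };
  } catch {
    return { ok: false, reason: 'invalid_json' };
  }
}
```

Separating these cases lets the caller react differently: a truncated result may warrant a higher token limit or chunking, while a refusal is unlikely to succeed on retry.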
Step 4: Validation & Fallback Strategy
LLMs can hallucinate or return malformed data under noise. Implement a validation layer with confidence scoring and fallback routing.
import { z } from 'zod';

export async function safeExtract<T>(
  rawText: string,
  schema: z.ZodType<T>,
  maxRetries = 2
): Promise<{ data: T; confidence: number; fallbackTriggered: boolean }> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      const data = await extractInvoiceData(rawText, schema);
      // Confidence is binary here: 1.0 means the output passed schema validation
      return { data, confidence: 1.0, fallbackTriggered: false };
    } catch (error) {
      if (attempt === maxRetries) {
        // Fallback: route to human review or rule-based parser.
        // `data` is an empty placeholder; callers must check fallbackTriggered before using it.
        return {
          data: {} as T,
          confidence: 0.0,
          fallbackTriggered: true
        };
      }
      // Exponential backoff: 1s, 2s, 4s, ...
      await new Promise(res => setTimeout(res, Math.pow(2, attempt) * 1000));
    }
  }
  // Unreachable in practice; satisfies the compiler's return-path analysis
  throw new Error('Extraction failed after max retries');
}
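Schema validation alone yields a pass/fail signal. For invoices, a graded confidence score can be derived from arithmetic consistency among the extracted fields. The sketch below assumes the field names of the invoice schema defined earlier. `consistencyScore`, `near`, and the one-cent tolerance are illustrative choices, not part of the original pipeline.

```typescript
interface LineItem { quantity: number; unitPrice: number; total: number }
interface InvoiceTotals { lineItems: LineItem[]; subtotal: number; tax: number; total: number }

// Compare monetary values with a small tolerance to absorb rounding differences.
const near = (a: number, b: number, tol = 0.01): boolean => Math.abs(a - b) <= tol;

// Hypothetical scorer: returns the fraction of arithmetic checks that hold (0..1).
function consistencyScore(inv: InvoiceTotals): number {
  const checks: boolean[] = [];
  // Each line: quantity * unitPrice should equal the line total.
  for (const item of inv.lineItems) {
    checks.push(near(item.quantity * item.unitPrice, item.total));
  }
  // Line totals should add up to the subtotal.
  const sum = inv.lineItems.reduce((acc, item) => acc + item.total, 0);
  checks.push(near(sum, inv.subtotal));
  // Subtotal plus tax should equal the grand total.
  checks.push(near(inv.subtotal + inv.tax, inv.total));
  return checks.filter(Boolean).length / checks.length;
}
```

A score below 1.0 does not prove the extraction is wrong, since real invoices may carry discounts or shipping that this sketch ignores, but it is a cheap signal for deciding which results deserve a second look.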
Step 5: Async Queue Architecture
Synchronous extraction blocks request threads and degrades throughput. Use a message queue (BullMQ, SQS, RabbitMQ) to decouple ingestion from processing.
import Queue from 'bull';

const extractionQueue = new Queue('document-extraction', {
  redis: { host: process.env.REDIS_HOST, port: 6379 }
});

extractionQueue.process('extract-invoice', 10, async (job) => {
  // Job data is serialized as JSON, so a Zod schema cannot travel in the payload;
  // the processor references InvoiceSchema directly instead.
  const { pdfPath } = job.data;
  const rawText = await extractTextFromPdf(pdfPath);
  const result = await safeExtract(rawText, InvoiceSchema);
  if (result.fallbackTriggered) {
    // queueForHumanReview and persistExtractedData are application-specific helpers
    await queueForHumanReview(pdfPath, result);
  } else {
    await persistExtractedData(result.data);
  }
  return result;
});

// Enqueue
await extractionQueue.add('extract-invoice', { pdfPath: '/tmp/invoice.pdf' });
Architecture Decisions & Rationale:
- Zod + JSON Schema: Guarantees output shape compliance at both LLM generation and runtime validation layers. Eliminates custom parsing logic.
- Strict Mode in response_format: Prevents LLMs from deviating from schema. Reduces token waste and parsing failures.
- Async Queue: Decouples I/O-bound OCR and LLM calls from application threads. Enables horizontal scaling and retry management.
- Fallback Routing: Preserves pipeline integrity. Low-confidence extractions route to human review or deterministic parsers, maintaining SLA compliance.
Pitfall Guide
1. Skipping Runtime Schema Validation
LLMs with structured output still occasionally produce schema violations due to token truncation or edge-case hallucinations. Relying solely on provider-level JSON schema enforcement leaves gaps. Always validate with a runtime type system like Zod or io-ts before persisting data.
2. Feeding Raw OCR Output Without Noise Reduction
OCR introduces artifacts: repeated headers, page markers, misaligned columns, and encoding errors. Passing uncleaned text degrades extraction accuracy by 12-18%. Implement a lightweight normalization step that removes predictable noise patterns while preserving semantic structure.
3. Ignoring Token Limits and Context Window Management
Long documents (contracts, multi-page invoices) exceed token limits when processed as single payloads. Truncation silently drops critical data. Chunk documents by logical sections (pages, tables, signatures), extract independently, then merge results. Use overlapping context windows to preserve cross-section references.
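Chunking with overlap can be sketched as follows, assuming the document is already split into per-page strings and that character count is an acceptable stand-in for token count. `chunkPages`, `maxChars`, and `overlapPages` are hypothetical names. A production version would measure size with the model's tokenizer.

```typescript
// Hypothetical helper: groups pages into chunks of at most ~maxChars characters,
// repeating the last `overlapPages` pages of each chunk at the start of the next.
function chunkPages(pages: string[], maxChars: number, overlapPages = 1): string[][] {
  const chunks: string[][] = [];
  let current: string[] = [];
  let size = 0;

  for (const page of pages) {
    // Close the current chunk when adding this page would exceed the budget.
    if (current.length > 0 && size + page.length > maxChars) {
      chunks.push(current);
      // Carry trailing pages forward so cross-page references stay visible.
      current = overlapPages > 0 ? current.slice(-overlapPages) : [];
      size = current.reduce((n, p) => n + p.length, 0);
    }
    current.push(page);
    size += page.length;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```

The budget is soft: a single oversized page, or an overlap page plus a large next page, can still exceed `maxChars`. Results extracted from overlapping chunks also need de-duplication when merged, for example by keying line items on description and amount.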
4. Synchronous Processing in the Request Path
Blocking HTTP requests with LLM calls creates timeout cascades and poor user experience. Extraction must be async-first. Use job queues with concurrency limits, dead-letter queues for failures, and idempotent processing keys to prevent duplicate extractions.
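One way to build an idempotent processing key is to hash the document content together with a schema version, so that re-uploading the same file maps to the same job. This is a sketch under that assumption. `idempotencyKey` and the `extract:` prefix are hypothetical.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical helper: identical bytes and schema version always yield the same key.
function idempotencyKey(documentBytes: Uint8Array | string, schemaVersion: string): string {
  const digest = createHash('sha256').update(documentBytes).digest('hex');
  // Including the schema version forces re-extraction when the target shape changes.
  return `extract:${schemaVersion}:${digest}`;
}
```

With Bull, such a key could be passed as the `jobId` job option when enqueuing, since Bull does not add a job whose ID already exists in the queue. Whether completed jobs still block duplicates depends on how long finished job records are retained.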
5. Over-Prompting vs. Under-Constraining
Verbose system prompts increase token cost and introduce ambiguity. Conversely, minimal prompts reduce guidance. The optimal pattern: a 1-2 sentence system prompt defining the extraction task, strict JSON schema enforcement, and temperature ≤ 0.2. Remove all conversational filler.
6. No Confidence Scoring or Threshold Routing
AI extraction is probabilistic. Blindly accepting all outputs introduces silent data corruption. Implement confidence estimation (via LLM self-assessment, validation pass count, or embedding similarity to known patterns) and route low-confidence results to manual review queues.
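Threshold routing itself is small. The sketch below assumes a confidence value in the range 0 to 1 has already been computed by one of the methods above, and uses 0.85 as the default cut-off to match the configuration template later in this article. `routeByConfidence`, `Scored`, and `Route` are hypothetical names.

```typescript
type Route = 'persist' | 'human_review';

interface Scored<T> { data: T; confidence: number }

// Hypothetical router: anything below the threshold, or with an unusable score, goes to review.
function routeByConfidence<T>(result: Scored<T>, threshold = 0.85): Route {
  const c = result.confidence;
  // NaN, infinite, or out-of-range scores are treated as untrusted.
  const usable = isFinite(c) && c >= 0 && c <= 1;
  return usable && c >= threshold ? 'persist' : 'human_review';
}
```

Failing closed on malformed scores keeps a bug in the scoring step from silently promoting bad data to storage.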
7. Neglecting Data Privacy and Residency Compliance
Sending raw documents to third-party LLM providers may violate GDPR, HIPAA, or internal data governance policies. Implement PII redaction before extraction, use on-prem or VPC-hosted models for regulated data, and audit provider data retention policies. Never store raw documents longer than necessary.
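A redaction pass before the LLM call might look like the sketch below. The two patterns (email addresses and US-style SSNs) are illustrative assumptions only, and `redactPii` and `PII_PATTERNS` are hypothetical names. Regulated workloads typically need locale-aware, audited detection rather than a pair of regular expressions.

```typescript
// Illustrative patterns only; not a complete or compliant PII detector.
const PII_PATTERNS: Array<{ label: string; pattern: RegExp }> = [
  { label: 'EMAIL', pattern: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g },
  { label: 'SSN', pattern: /\b\d{3}-\d{2}-\d{4}\b/g },
];

// Hypothetical helper: replaces each match with a bracketed label such as [EMAIL].
function redactPii(text: string): string {
  return PII_PATTERNS.reduce(
    (acc, { label, pattern }) => acc.replace(pattern, `[${label}]`),
    text
  );
}
```

Replacing matches with typed placeholders rather than deleting them keeps the surrounding layout intact, which helps the model locate the remaining fields.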
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High-volume transactional forms (receipts, invoices) | LLM + Structured Output + Async Queue | Predictable schema, high accuracy, low maintenance | $0.008-0.015/doc |
| Complex legal/medical contracts | LLM + Chunking + Human Review Fallback | Multi-page context, high liability, requires audit trail | $0.025-0.040/doc + review labor |
| Real-time user input (mobile scans) | LLM + On-device OCR + Streaming | Latency sensitivity, offline capability, UX critical | Higher compute, lower cloud spend |
| Legacy template-heavy documents | Rule-based + OCR + LLM Validation | Stable layouts, deterministic parsing, cost optimization | $0.003-0.006/doc |
Configuration Template
// config/extraction.ts
import { z } from 'zod';
import { zodToJsonSchema } from 'zod-to-json-schema';
import OpenAI from 'openai';
import Queue from 'bull';

export const ExtractionConfig = {
  openai: new OpenAI({ apiKey: process.env.OPENAI_API_KEY }),
  queue: new Queue('ai-extraction', {
    redis: { host: process.env.REDIS_HOST || '127.0.0.1', port: Number(process.env.REDIS_PORT) || 6379 },
    defaultJobOptions: {
      attempts: 3,
      backoff: { type: 'exponential', delay: 2000 },
      removeOnComplete: 100,
      removeOnFail: 50
    }
  }),
  thresholds: {
    confidence: 0.85,
    maxTokens: 4096,
    temperature: 0.1
  },
  validation: {
    strict: true,
    fallback: 'human_review'
  }
};

export function createExtractionSchema<T extends z.ZodRawShape>(shape: T) {
  const schema = z.object(shape);
  return {
    schema,
    jsonSchema: zodToJsonSchema(schema, 'ExtractionSchema'),
    validate: (input: unknown) => schema.parse(input)
  };
}
Quick Start Guide
- Install dependencies: npm install zod zod-to-json-schema openai bull tesseract.js
- Define schema: Create a Zod schema matching your target document structure with strict type constraints.
- Initialize queue & client: Import ExtractionConfig, configure Redis and OpenAI credentials, and register the extraction processor.
- Run extraction: Push documents to the queue with queue.add('extract-invoice', { pdfPath }), using the same job name the processor was registered under. Pass a file path or document ID rather than the Zod schema itself, because job payloads are serialized as JSON. Monitor logs for success/failure rates and adjust confidence thresholds based on validation pass rates.