inistic routing pipeline with explicit confidence boundaries, structured output contracts, and continuous evaluation. The implementation below demonstrates a TypeScript architecture that balances speed, cost, and auditability.
Step 1: Define Classification Schema & Thresholds
Classification must be schema-constrained. Define label hierarchies, multi-label rules, and confidence thresholds before writing inference logic.
export interface ClassificationSchema {
  labels: string[];
  multiLabel: boolean;
  confidenceThreshold: number;
  fallbackThreshold: number;
}

export const DEFAULT_SCHEMA: ClassificationSchema = {
  labels: ["technical", "marketing", "compliance", "support", "spam", "unknown"],
  multiLabel: false,
  confidenceThreshold: 0.85,
  fallbackThreshold: 0.60,
};
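Because the two thresholds split traffic into three bands (accept, fall back to the LLM, reject as unknown), it can help to sanity-check a schema before it reaches inference code. The sketch below is illustrative only: `validateSchema` and its messages are hypothetical names, not part of the implementation above.

```typescript
// Mirrors the ClassificationSchema interface defined above.
interface ClassificationSchema {
  labels: string[];
  multiLabel: boolean;
  confidenceThreshold: number;
  fallbackThreshold: number;
}

// Hypothetical helper: returns a list of problems, empty when the schema is usable.
function validateSchema(schema: ClassificationSchema): string[] {
  const problems: string[] = [];
  if (schema.labels.length === 0) {
    problems.push("labels must not be empty");
  }
  if (new Set(schema.labels).size !== schema.labels.length) {
    problems.push("labels must be unique");
  }
  const inUnitRange = (x: number): boolean => x >= 0 && x <= 1;
  if (!inUnitRange(schema.confidenceThreshold) || !inUnitRange(schema.fallbackThreshold)) {
    problems.push("thresholds must lie in [0, 1]");
  }
  // The fallback band only exists when it sits strictly below the accept band.
  if (schema.fallbackThreshold >= schema.confidenceThreshold) {
    problems.push("fallbackThreshold must be below confidenceThreshold");
  }
  return problems;
}
```

Running a check like this at startup would surface a misconfigured threshold pair before any content is routed.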
Step 2: Embedding Generation & Caching
Semantic embeddings replace token-level matching. Use a consistent model version and cache embeddings for repeated content.
import { createHash } from "crypto";
import { OpenAIEmbeddings } from "@langchain/openai";

const embeddings = new OpenAIEmbeddings({
  modelName: "text-embedding-3-small",
  batchSize: 64,
});

const embeddingCache = new Map<string, number[]>();

async function getEmbedding(text: string): Promise<number[]> {
  const hash = createHash("sha256").update(text).digest("hex");
  if (embeddingCache.has(hash)) return embeddingCache.get(hash)!;
  const [vector] = await embeddings.embedDocuments([text]);
  embeddingCache.set(hash, vector);
  return vector;
}
Step 3: Lightweight Classifier Routing
A calibrated classifier handles high-confidence routing. This example uses cosine similarity against label centroids, but production systems should replace it with a trained logistic regression or ONNX model.
function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, val, i) => sum + val * b[i], 0);
  const magA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0));
  const magB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0));
  // A zero vector (such as the placeholder centroids below) has no direction; return 0 instead of NaN.
  if (magA === 0 || magB === 0) return 0;
  return dot / (magA * magB);
}
interface LabelCentroid {
  label: string;
  vector: number[];
}

// Precomputed centroids from training data
const centroids: LabelCentroid[] = [
  { label: "technical", vector: Array(1536).fill(0) }, // placeholder
  { label: "marketing", vector: Array(1536).fill(0) },
  { label: "compliance", vector: Array(1536).fill(0) },
  { label: "support", vector: Array(1536).fill(0) },
  { label: "spam", vector: Array(1536).fill(0) },
];

function routeViaClassifier(embedding: number[]): { label: string; confidence: number } {
  const scores = centroids.map(c => ({
    label: c.label,
    confidence: cosineSimilarity(embedding, c.vector),
  }));
  scores.sort((a, b) => b.confidence - a.confidence);
  return scores[0];
}
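The centroids above are zero-filled placeholders. One common way to obtain real ones, sketched here under the assumption that labeled embeddings are already available, is to average the vectors of each label. `computeCentroids` and `LabeledEmbedding` are hypothetical names introduced for illustration.

```typescript
interface LabeledEmbedding {
  label: string;
  vector: number[];
}

interface LabelCentroid {
  label: string;
  vector: number[];
}

// Hypothetical helper: averages the embeddings of each label into one centroid.
// Assumes every vector has the same dimensionality.
function computeCentroids(samples: LabeledEmbedding[]): LabelCentroid[] {
  const sums = new Map<string, { total: number[]; count: number }>();
  for (const { label, vector } of samples) {
    const entry = sums.get(label) ?? {
      total: new Array<number>(vector.length).fill(0),
      count: 0,
    };
    vector.forEach((v, i) => {
      entry.total[i] += v;
    });
    entry.count += 1;
    sums.set(label, entry);
  }
  const result: LabelCentroid[] = [];
  sums.forEach((entry, label) => {
    result.push({ label, vector: entry.total.map(t => t / entry.count) });
  });
  return result;
}
```

Mean centroids are only a baseline; as noted above, a trained classifier is the better choice for production traffic.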
Step 4: LLM Fallback with Structured Output
Low-confidence or ambiguous content triggers a constrained LLM call. Use JSON schema enforcement to guarantee deterministic parsing.
import { ChatOpenAI } from "@langchain/openai";
import { z } from "zod";
import { zodToJsonSchema } from "zod-to-json-schema";

const classificationResponseSchema = z.object({
  // z.enum expects a non-empty tuple type, so the string[] from the schema is cast explicitly.
  label: z.enum(DEFAULT_SCHEMA.labels as [string, ...string[]]),
  confidence: z.number().min(0).max(1),
  reasoning: z.string(),
});

const llm = new ChatOpenAI({
  modelName: "gpt-4o-mini",
  temperature: 0.1,
  maxTokens: 150,
}).withStructuredOutput(classificationResponseSchema);

async function fallbackClassify(text: string, embedding: number[]): Promise<z.infer<typeof classificationResponseSchema>> {
  // The template literal lines stay flush left so no extra whitespace enters the prompt.
  const prompt = `Classify the following content. Return only valid JSON matching the schema.
Content: "${text}"
Available labels: ${DEFAULT_SCHEMA.labels.join(", ")}
Provide a single label and a confidence score between 0 and 1.`;
  return await llm.invoke(prompt);
}
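Where raw model text is handled directly (for example in unit tests, or with a provider that lacks structured output), the same contract can be checked by hand. The following dependency-free sketch stands in for the Zod schema above; `parseClassification` is a hypothetical helper, not a library API.

```typescript
interface ClassificationResult {
  label: string;
  confidence: number;
  reasoning: string;
}

// Hypothetical helper: returns the parsed result, or null when the payload
// is not valid JSON or violates the label/confidence contract.
function parseClassification(raw: string, allowedLabels: string[]): ClassificationResult | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  const { label, confidence, reasoning } = data as Record<string, unknown>;
  if (typeof label !== "string" || allowedLabels.indexOf(label) === -1) return null;
  if (typeof confidence !== "number" || confidence < 0 || confidence > 1) return null;
  if (typeof reasoning !== "string") return null;
  return { label, confidence, reasoning };
}
```

Returning null rather than throwing lets the caller decide whether to retry the LLM call or route the item to "unknown".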
Step 5: Unified Classification Pipeline
Route deterministically. Cache aggressively. Log decisions for audit.
export async function classifyContent(text: string): Promise<{
  label: string;
  confidence: number;
  method: "classifier" | "llm_fallback";
  latencyMs: number;
}> {
  const start = performance.now();
  const embedding = await getEmbedding(text);
  const primary = routeViaClassifier(embedding);

  if (primary.confidence >= DEFAULT_SCHEMA.confidenceThreshold) {
    return {
      label: primary.label,
      confidence: primary.confidence,
      method: "classifier",
      latencyMs: Math.round(performance.now() - start),
    };
  }

  if (primary.confidence >= DEFAULT_SCHEMA.fallbackThreshold) {
    const fallback = await fallbackClassify(text, embedding);
    return {
      label: fallback.label,
      confidence: fallback.confidence,
      method: "llm_fallback",
      latencyMs: Math.round(performance.now() - start),
    };
  }

  return {
    label: "unknown",
    confidence: primary.confidence,
    method: "classifier",
    latencyMs: Math.round(performance.now() - start),
  };
}
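The three-way branching in classifyContent can also be isolated in a pure function, which makes the threshold logic unit-testable without any network calls. This is a sketch of that idea; `decideRoute`, `Route`, and `Thresholds` are hypothetical names.

```typescript
type Route = "classifier" | "llm_fallback" | "unknown";

interface Thresholds {
  confidenceThreshold: number;
  fallbackThreshold: number;
}

// Hypothetical helper: maps a classifier confidence to one of three routing bands.
function decideRoute(confidence: number, t: Thresholds): Route {
  // A NaN score carries no information, so treat it as unroutable.
  if (Number.isNaN(confidence)) return "unknown";
  if (confidence >= t.confidenceThreshold) return "classifier";
  if (confidence >= t.fallbackThreshold) return "llm_fallback";
  return "unknown";
}
```

With the default thresholds of 0.85 and 0.60, a score of 0.70 lands in the fallback band and a score of 0.30 is rejected as unknown.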
Architecture Decisions & Rationale
- Hybrid routing over monolithic LLMs: LLMs excel at reasoning, not deterministic routing. Offloading 80–90% of traffic to a lightweight classifier reduces cost and latency while preserving accuracy.
- Embedding versioning: Semantic drift occurs when embedding models change. Pin versions and maintain parallel pipelines during migrations.
- Structured output contracts: Zod schemas prevent JSON parsing failures and enable automated validation in downstream systems.
- Confidence thresholds as routing controls: Thresholds are not arbitrary. They should be calibrated using precision-recall curves on a held-out validation set.
- Explicit method tracking: Logging method: "classifier" | "llm_fallback" enables cost attribution, drift detection, and compliance auditing.
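As one possible way to calibrate a threshold from a held-out validation set, the sketch below walks predictions from highest to lowest confidence and returns the lowest cutoff whose accepted predictions still meet a target precision. `pickThreshold` and `ScoredPrediction` are hypothetical names, and the target precision is an assumption the team would choose.

```typescript
interface ScoredPrediction {
  confidence: number;
  correct: boolean; // whether the predicted label matched the gold label
}

// Hypothetical helper: lowest threshold whose accepted set reaches targetPrecision,
// or null when no threshold does.
function pickThreshold(preds: ScoredPrediction[], targetPrecision: number): number | null {
  const sorted = preds.slice().sort((a, b) => b.confidence - a.confidence);
  let correct = 0;
  let best: number | null = null;
  for (let i = 0; i < sorted.length; i++) {
    if (sorted[i].correct) correct += 1;
    // Evaluate only at the end of a run of tied confidences, since a
    // threshold cannot separate predictions with identical scores.
    const isLastOfTie =
      i === sorted.length - 1 || sorted[i + 1].confidence !== sorted[i].confidence;
    if (isLastOfTie && correct / (i + 1) >= targetPrecision) {
      best = sorted[i].confidence;
    }
  }
  return best;
}
```

Precision is not monotonic in the threshold, so the function keeps scanning after a failing cutoff instead of stopping at the first one.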
Pitfall Guide
1. Treating Zero-Shot as Production-Ready
Zero-shot prompts rarely generalize to domain-specific terminology or edge cases. Production pipelines require labeled validation sets and threshold calibration. Without them, accuracy degrades silently as content distribution shifts.
2. Ignoring Class Imbalance
Real-world content is heavily skewed. Spam, compliance, and technical labels often dominate. Training classifiers on raw distributions biases predictions toward majority classes. Apply stratified sampling, class weighting, or synthetic minority oversampling during centroid/model training.
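Class weighting can be sketched with the common "balanced" heuristic, where each class receives weight N / (K * n_c) for N samples, K classes, and n_c samples in class c. `balancedClassWeights` is a hypothetical helper; how the weights feed into training depends on the classifier in use.

```typescript
// Hypothetical helper: inverse-frequency weights, N / (K * n_c) per class.
function balancedClassWeights(labels: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const label of labels) {
    counts.set(label, (counts.get(label) ?? 0) + 1);
  }
  const weights = new Map<string, number>();
  counts.forEach((n, label) => {
    weights.set(label, labels.length / (counts.size * n));
  });
  return weights;
}
```

For three "spam" samples and one "support" sample, this yields a weight of 2/3 for spam and 2 for support, so the minority class counts three times as much per example.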
3. Missing Confidence Calibration
Raw similarity scores or LLM logits are not probabilities. Without Platt scaling or isotonic regression, confidence thresholds become arbitrary. Calibrate scores on a validation set and monitor calibration error (ECE) monthly.
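Expected calibration error can be estimated by binning predictions by confidence and comparing each bin's average confidence with its observed accuracy. The sketch below assumes equal-width bins and confidences in [0, 1]; `expectedCalibrationError` is a hypothetical helper name.

```typescript
interface ScoredPrediction {
  confidence: number;
  correct: boolean; // whether the predicted label matched the gold label
}

// Hypothetical helper: ECE = sum over bins of (bin size / total) * |accuracy - mean confidence|.
function expectedCalibrationError(preds: ScoredPrediction[], numBins = 10): number {
  if (preds.length === 0) return 0;
  const confSum = new Array<number>(numBins).fill(0);
  const correctCount = new Array<number>(numBins).fill(0);
  const binCount = new Array<number>(numBins).fill(0);
  for (const p of preds) {
    // Clamp so that a confidence of exactly 1.0 falls into the last bin.
    const idx = Math.min(numBins - 1, Math.max(0, Math.floor(p.confidence * numBins)));
    confSum[idx] += p.confidence;
    correctCount[idx] += p.correct ? 1 : 0;
    binCount[idx] += 1;
  }
  let ece = 0;
  for (let b = 0; b < numBins; b++) {
    if (binCount[b] === 0) continue;
    const accuracy = correctCount[b] / binCount[b];
    const meanConfidence = confSum[b] / binCount[b];
    ece += (binCount[b] / preds.length) * Math.abs(accuracy - meanConfidence);
  }
  return ece;
}
```

A well-calibrated scorer drives this value toward zero: predictions made at 0.75 confidence should be right about three times in four.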
4. Prompt Drift from Model Updates
LLM providers silently update model weights and tokenizers. Prompts that worked in v1 may degrade in v2. Version all prompt templates, lock model versions, and implement automated regression tests against a gold-standard dataset.
5. No Multi-Label vs Single-Label Contract
Classification schemas must explicitly declare whether multiple labels apply. Forcing single-label constraints on multi-label content causes misrouting and compliance gaps. Use one-vs-rest classifiers or multi-output LLM schemas when necessary.
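A minimal sketch of honoring that contract at selection time: in multi-label mode every label above the threshold is returned, while single-label mode keeps only the top one. `selectLabels` and `LabelScore` are hypothetical names, and a single shared threshold is an assumption; per-label thresholds are equally plausible.

```typescript
interface LabelScore {
  label: string;
  confidence: number;
}

// Hypothetical helper: applies the multi-label vs single-label contract to scored labels.
function selectLabels(scores: LabelScore[], multiLabel: boolean, threshold: number): string[] {
  const passing = scores
    .filter(s => s.confidence >= threshold)
    .sort((a, b) => b.confidence - a.confidence);
  if (passing.length === 0) return [];
  return multiLabel ? passing.map(s => s.label) : [passing[0].label];
}
```

An empty result signals that nothing cleared the threshold, which the caller can map to the fallback path or to "unknown".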
6. Embedding Cache Bloat & Staleness
Caching improves latency but introduces staleness. Implement TTL-based eviction, hash-based deduplication, and periodic cache invalidation when embedding models or label sets change.
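The unbounded Map used earlier can be swapped for a cache with a TTL and a size cap. The sketch below is one simple way to do that; `TtlCache` is a hypothetical class, eviction is oldest-inserted-first rather than true LRU, and the injectable clock exists only to make expiry testable.

```typescript
// Hypothetical cache: entries expire after ttlMs, and the oldest entry is
// evicted once maxSize is reached.
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly maxSize: number;
  private readonly now: () => number;

  constructor(ttlMs: number, maxSize: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.maxSize = maxSize;
    this.now = now;
  }

  get(key: string): V | undefined {
    const hit = this.entries.get(key);
    if (hit === undefined) return undefined;
    if (hit.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: drop it and report a miss
      return undefined;
    }
    return hit.value;
  }

  set(key: string, value: V): void {
    // Delete first so a rewritten key moves to the end of insertion order.
    this.entries.delete(key);
    if (this.entries.size >= this.maxSize) {
      const oldest = this.entries.keys().next();
      if (!oldest.done) this.entries.delete(oldest.value);
    }
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  clear(): void {
    this.entries.clear();
  }

  get size(): number {
    return this.entries.size;
  }
}
```

Prefixing each key with the embedding model name, or calling clear() on a model or label-set change, would cover the invalidation cases described above.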
7. Skipping Continuous Evaluation
Classification is not a deploy-and-forget task. Vocabulary drift, policy changes, and adversarial content degrade performance. Implement automated evaluation pipelines that run against a refreshed test set weekly. Track F1, latency, cost, and calibration error.
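Of the tracked metrics, macro-averaged F1 is the one most sensitive to minority-class regressions, because every label counts equally. The sketch below computes it over the labels that appear in the gold data; `macroF1` and `EvalExample` are hypothetical names.

```typescript
interface EvalExample {
  gold: string;
  predicted: string;
}

// Hypothetical helper: unweighted mean of per-label F1 over labels present in the gold data.
function macroF1(examples: EvalExample[]): number {
  const labels = Array.from(new Set(examples.map(e => e.gold)));
  if (labels.length === 0) return 0;
  let total = 0;
  for (const label of labels) {
    let tp = 0;
    let fp = 0;
    let fn = 0;
    for (const e of examples) {
      if (e.predicted === label && e.gold === label) tp += 1;
      else if (e.predicted === label) fp += 1;
      else if (e.gold === label) fn += 1;
    }
    const denom = 2 * tp + fp + fn;
    // F1 = 2TP / (2TP + FP + FN); defined as 0 when the label was never hit.
    total += denom === 0 ? 0 : (2 * tp) / denom;
  }
  return total / labels.length;
}
```

Comparing this number against a stored baseline on each scheduled run gives a concrete signal for drift alerts.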
Production Best Practices
- Pin embedding and LLM model versions. Document migration paths.
- Use structured output schemas with strict validation.
- Route by confidence, not by arbitrary cutoffs. Calibrate thresholds quarterly.
- Log routing decisions, confidence scores, and fallback triggers for auditability.
- Separate training data pipelines from inference to prevent data leakage.
- Implement circuit breakers for LLM fallback to prevent cost explosions during traffic spikes.
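The circuit breaker mentioned in the last item can be sketched as a small state holder placed in front of the fallback call. This is a simplified version with no separate half-open state; `CircuitBreaker` and its method names are hypothetical, and the injectable clock is there for testing.

```typescript
// Hypothetical breaker: opens after N consecutive failures and lets traffic
// through again once the reset timeout has elapsed.
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly failureThreshold: number;
  private readonly resetTimeoutMs: number;
  private readonly now: () => number;

  constructor(failureThreshold: number, resetTimeoutMs: number, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now;
  }

  canRequest(): boolean {
    if (this.openedAt === null) return true;
    if (this.now() - this.openedAt >= this.resetTimeoutMs) {
      // Timeout elapsed: close the breaker and start counting afresh.
      this.openedAt = null;
      this.failures = 0;
      return true;
    }
    return false;
  }

  recordSuccess(): void {
    this.failures = 0;
  }

  recordFailure(): void {
    this.failures += 1;
    if (this.failures >= this.failureThreshold) {
      this.openedAt = this.now();
    }
  }
}
```

While canRequest() returns false, the pipeline could skip the LLM and return "unknown", which keeps spend bounded during an outage or a traffic spike.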
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High-volume static content (docs, logs) | Embeddings + lightweight classifier | Deterministic, sub-20ms latency, minimal compute | ~$0.05 per 10k items |
| Dynamic user-generated content with evolving slang | Hybrid routing with LLM fallback | Handles ambiguity, adapts to vocabulary shifts via fallback | ~$0.60 per 10k items |
| Compliance/audit-critical routing | Structured LLM with human-in-loop review | Enforces schema, provides reasoning traces, meets regulatory standards | ~$1.80 per 10k items |
| Real-time chat/message filtering | Rule-based + fast classifier | Sub-5ms latency required, acceptable accuracy tradeoff | ~$0.02 per 10k items |
| Multi-domain enterprise content | Multi-label classifier + domain router | Prevents cross-domain label leakage, enables granular routing | ~$0.75 per 10k items |
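To reason about cost figures like those in the table for a specific workload, a blended estimate can be computed from the fallback rate. The sketch below is simple arithmetic under stated assumptions: `blendedCostPer10k` is a hypothetical helper, and the per-item prices are inputs to be measured, not values taken from any provider or from the table.

```typescript
interface CostInputs {
  embeddingCostPerItem: number; // assumed cost of one embedding call, in dollars
  llmCostPerItem: number; // assumed cost of one fallback LLM call, in dollars
  fallbackRate: number; // fraction of items routed to the LLM, in [0, 1]
}

// Hypothetical helper: every item pays for an embedding; only the fallback
// fraction additionally pays for an LLM call.
function blendedCostPer10k(c: CostInputs): number {
  return 10000 * (c.embeddingCostPerItem + c.fallbackRate * c.llmCostPerItem);
}
```

Because the LLM term scales linearly with fallbackRate, halving the share of traffic that reaches the fallback roughly halves the dominant part of the bill.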
Configuration Template
classification:
  schema:
    labels: ["technical", "marketing", "compliance", "support", "spam", "unknown"]
    multi_label: false
    confidence_threshold: 0.85
    fallback_threshold: 0.60
  embedding:
    model: "text-embedding-3-small"
    version: "2024-09"
    cache_ttl_seconds: 86400
    max_cache_size: 500000
  classifier:
    type: "cosine_centroid"
    retrain_interval_days: 30
    calibration_method: "isotonic"
  fallback:
    model: "gpt-4o-mini"
    version: "2024-08"
    temperature: 0.1
    max_tokens: 150
    circuit_breaker:
      failure_threshold: 50
      reset_timeout_seconds: 60
      max_concurrent: 20
  monitoring:
    metrics: ["f1_score", "latency_p95", "cost_per_10k", "calibration_error"]
    evaluation_interval_hours: 168
    alert_on_drift: true
    drift_threshold: 0.04
Quick Start Guide
- Initialize project: npm init -y && npm install @langchain/openai zod zod-to-json-schema
- Set environment variables: OPENAI_API_KEY, EMBEDDING_MODEL, LLM_MODEL
- Run embedding calibration: Embed 500 labeled samples to compute label centroids, then run classifyContent against held-out labeled samples to validate thresholds
- Deploy routing service: Wrap classifyContent in an Express/Fastify endpoint with rate limiting and health checks
- Enable monitoring: Attach metrics exporter (Prometheus/Datadog) to track F1, latency, cost, and calibration error. Schedule weekly evaluation jobs.
Classification pipelines succeed when treated as deterministic routing systems with probabilistic fallbacks. Pin versions, calibrate thresholds, enforce schemas, and monitor drift. The architecture scales, the cost stays bounded, and the audit trail remains intact.