iteria
Before coding, establish constraints:
- Task: Semantic similarity, retrieval (BM25 hybrid), classification, or clustering?
- Domain: General text, code, multilingual, or specific jargon?
- Constraints: Max latency, budget, dimensionality limits imposed by the vector database.
- Metrics: Target Recall@K, MRR, or NDCG based on a gold-standard dataset.
Step 2: Benchmark Candidates
Use MTEB subsets and domain-specific evaluation. Do not rely solely on public leaderboards.
- Prepare Data: Create a query-document pair dataset representative of production traffic.
- Run Inference: Generate embeddings for all candidates.
- Calculate Metrics: Compute retrieval accuracy. Use RAGAS or custom LLM-as-a-judge scripts if labeled data is scarce.
- Profile: Measure throughput and latency under load.
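The metric step above can be sketched in a few lines. This is a minimal illustration, not a full evaluation harness: the `EvalCase` shape, the document IDs, and the two sample queries are hypothetical, and ranked IDs are assumed to be unique within each result list.

```typescript
// Hypothetical shape for one evaluated query: the ranked document IDs the
// retriever returned, and the IDs labeled as relevant in the gold dataset.
interface EvalCase {
  rankedIds: string[];
  relevantIds: Set<string>;
}

// Recall@K: fraction of relevant documents that appear in the top K results.
function recallAtK(c: EvalCase, k: number): number {
  if (c.relevantIds.size === 0) return 0;
  const hits = c.rankedIds.slice(0, k).filter((id) => c.relevantIds.has(id)).length;
  return hits / c.relevantIds.size;
}

// Reciprocal rank: 1 / position of the first relevant result (0 if none).
function reciprocalRank(c: EvalCase): number {
  const idx = c.rankedIds.findIndex((id) => c.relevantIds.has(id));
  return idx === -1 ? 0 : 1 / (idx + 1);
}

// Average a per-query metric over the whole evaluation set.
function mean(values: number[]): number {
  return values.length === 0 ? 0 : values.reduce((a, b) => a + b, 0) / values.length;
}

// Made-up sample data for illustration only.
const evalCases: EvalCase[] = [
  { rankedIds: ["d3", "d1", "d7"], relevantIds: new Set(["d1", "d9"]) },
  { rankedIds: ["d2", "d5", "d4"], relevantIds: new Set(["d2"]) },
];

const recallAt3 = mean(evalCases.map((c) => recallAtK(c, 3)));
const mrr = mean(evalCases.map((c) => reciprocalRank(c)));
console.log({ recallAt3, mrr }); // { recallAt3: 0.75, mrr: 0.75 }
```

Running the same functions over each candidate model's results gives directly comparable numbers, provided every model is evaluated on the same query set and the same corpus.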
Step 3: Architecture Implementation
Decouple the embedding logic using the Strategy Pattern. This allows runtime switching and A/B testing.
TypeScript Implementation:
// interfaces/EmbeddingProvider.ts
export interface EmbeddingProvider {
  name: string;
  dimensions: number;
  // When true, vectors returned by embed()/embedBatch() are L2-normalized.
  normalize: boolean;
  embed(text: string): Promise<number[]>;
  embedBatch(texts: string[]): Promise<number[][]>;
}
// providers/OpenAIProvider.ts
import { OpenAI } from 'openai';
import type { EmbeddingProvider } from '../interfaces/EmbeddingProvider';

export class OpenAIProvider implements EmbeddingProvider {
  name = 'openai';
  dimensions: number;
  normalize = true; // OpenAI returns normalized vectors
  private client: OpenAI;
  private model: string;

  // `dimensions` must be valid for the chosen model: 3072 is the maximum for
  // text-embedding-3-large, while text-embedding-3-small tops out at 1536.
  constructor(apiKey: string, model: string = 'text-embedding-3-large', dimensions: number = 3072) {
    this.client = new OpenAI({ apiKey });
    this.model = model;
    this.dimensions = dimensions;
  }

  async embed(text: string): Promise<number[]> {
    const response = await this.client.embeddings.create({
      model: this.model,
      input: text,
      dimensions: this.dimensions,
    });
    return response.data[0].embedding;
  }

  async embedBatch(texts: string[]): Promise<number[][]> {
    const response = await this.client.embeddings.create({
      model: this.model,
      input: texts,
      dimensions: this.dimensions,
    });
    return response.data.map(d => d.embedding);
  }
}
// providers/OllamaProvider.ts
import axios from 'axios';
import type { EmbeddingProvider } from '../interfaces/EmbeddingProvider';

// Shape of the /api/embed response body used below.
interface OllamaEmbedResponse {
  embeddings: number[][];
}

export class OllamaProvider implements EmbeddingProvider {
  name = 'ollama';
  dimensions: number;
  normalize: boolean; // Local models may require manual normalization
  private baseUrl: string;
  private model: string;

  // Pass normalize = true to L2-normalize vectors before returning them.
  constructor(baseUrl: string, model: string, dimensions: number, normalize: boolean = false) {
    this.baseUrl = baseUrl;
    this.model = model;
    this.dimensions = dimensions;
    this.normalize = normalize;
  }

  async embed(text: string): Promise<number[]> {
    const response = await axios.post<OllamaEmbedResponse>(`${this.baseUrl}/api/embed`, {
      model: this.model,
      input: text,
    });
    const embedding = response.data.embeddings[0];
    return this.normalize ? this.l2Normalize(embedding) : embedding;
  }

  async embedBatch(texts: string[]): Promise<number[][]> {
    const response = await axios.post<OllamaEmbedResponse>(`${this.baseUrl}/api/embed`, {
      model: this.model,
      input: texts,
    });
    const embeddings = response.data.embeddings;
    return this.normalize ? embeddings.map(e => this.l2Normalize(e)) : embeddings;
  }

  private l2Normalize(vector: number[]): number[] {
    const magnitude = Math.sqrt(vector.reduce((sum, val) => sum + val * val, 0));
    // A zero vector has no direction; return it unchanged instead of dividing by zero.
    if (magnitude === 0) return vector;
    return vector.map(val => val / magnitude);
  }
}
// services/EmbeddingService.ts
import type { EmbeddingProvider } from '../interfaces/EmbeddingProvider';

export class EmbeddingService {
  private provider: EmbeddingProvider;

  constructor(provider: EmbeddingProvider) {
    this.provider = provider;
  }

  // Swap the active provider at runtime (e.g., for A/B tests).
  switchProvider(provider: EmbeddingProvider): void {
    this.provider = provider;
  }

  async generateEmbedding(text: string): Promise<number[]> {
    return this.provider.embed(text);
  }
}
Step 4: Optimization Strategies
- Quantization: If using self-hosted models, apply quantization (e.g., Q4_K_M) to reduce VRAM usage and increase throughput with minimal accuracy loss.
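- Dimensionality Reduction: If the model was trained for it (Matryoshka Representation Learning, as in OpenAI's text-embedding-3 family), truncate embeddings to their leading dimensions and re-normalize the result. For such models, the first 256-512 dimensions capture most of the semantic signal on many tasks; truncating a model that was not trained this way can degrade quality sharply. Truncation reduces storage and index build time.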
- Hybrid Search: Combine embeddings with sparse retrieval (BM25). Embeddings handle semantic intent; BM25 handles exact keyword matching. This mitigates embedding model weaknesses.
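Two of the strategies above are easy to get subtly wrong, so here is a hedged sketch of each. `truncateEmbedding` keeps the leading dimensions and re-normalizes, which only makes sense for models trained to support truncation. `reciprocalRankFusion` is one common way to merge a dense ranking with a BM25 ranking; the text does not prescribe a fusion method, so treat RRF and its conventional constant `k = 60` as assumptions. Function names and sample data are illustrative.

```typescript
// L2-normalize a vector; a zero vector is returned as a copy to avoid NaN.
function l2Normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return norm === 0 ? v.slice() : v.map((x) => x / norm);
}

// Keep the first `dims` components, then re-normalize so that cosine and
// dot-product scores stay consistent after truncation.
function truncateEmbedding(v: number[], dims: number): number[] {
  if (!Number.isInteger(dims) || dims <= 0 || dims > v.length) {
    throw new RangeError(`dims must be an integer in 1..${v.length}, got ${dims}`);
  }
  return l2Normalize(v.slice(0, dims));
}

// Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank),
// where rank is 1-based. Documents missing from a ranking contribute 0.
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, i) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + i + 1));
    });
  }
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

const truncated = truncateEmbedding([3, 4, 12, 0], 2);
console.log(truncated); // [ 0.6, 0.8 ]

const fused = reciprocalRankFusion([
  ["a", "b", "c"], // hypothetical dense (embedding) ranking
  ["b", "d", "a"], // hypothetical BM25 ranking
]);
console.log(fused); // [ 'b', 'a', 'd', 'c' ]
```

RRF works on ranks rather than raw scores, which sidesteps the problem that BM25 scores and cosine similarities live on different scales.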
Step 5: Monitoring
Implement embedding drift detection. Monitor the distribution of embedding vectors over time. Significant shifts may indicate data drift or model degradation. Track retrieval latency and error rates in your observability stack.
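One simple drift signal, sketched below, is the cosine distance between the centroid of a baseline batch of embeddings and the centroid of a recent batch. This is an assumption about how drift might be measured, not the only option; it catches shifts in the mean direction but not changes in spread. The function names, the toy 2-d vectors, and the `0.05` threshold are illustrative.

```typescript
// Component-wise mean of a batch of equal-length vectors.
function centroid(vectors: number[][]): number[] {
  if (vectors.length === 0) throw new Error("cannot compute centroid of an empty batch");
  const dims = vectors[0].length;
  const sum = new Array<number>(dims).fill(0);
  for (const v of vectors) {
    if (v.length !== dims) throw new Error("inconsistent vector dimensions");
    for (let i = 0; i < dims; i++) sum[i] += v[i];
  }
  return sum.map((x) => x / vectors.length);
}

// Cosine similarity; returns NaN if either vector is all zeros.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Drift score in [0, 2]: 0 means the batch centroids point the same way.
function centroidDrift(baseline: number[][], current: number[][]): number {
  return 1 - cosineSimilarity(centroid(baseline), centroid(current));
}

// Toy data for illustration only.
const baselineBatch = [[1, 0], [0.9, 0.1]];
const similarBatch = [[0.95, 0.05], [1, 0]];
const shiftedBatch = [[0, 1], [0.1, 0.9]];
const DRIFT_THRESHOLD = 0.05; // hypothetical alert threshold

console.log(centroidDrift(baselineBatch, similarBatch) > DRIFT_THRESHOLD); // false
console.log(centroidDrift(baselineBatch, shiftedBatch) > DRIFT_THRESHOLD); // true
```

In practice the baseline would be a fixed sample captured at deployment time, and the score would be emitted as a metric alongside latency and error rates.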
Pitfall Guide
- Ignoring Normalization Requirements
  - Mistake: Using dot product or Euclidean distance on unnormalized embeddings as if it were equivalent to cosine similarity.
  - Impact: Vector magnitude skews the scores, so rankings no longer reflect semantic similarity.
  - Fix: Verify whether the model outputs normalized vectors. If not, apply L2 normalization explicitly before storage. Ensure the vector database distance metric matches the normalization state.
- Dimensionality Mismatch
  - Mistake: Switching models without updating vector index dimensions.
  - Impact: Index build failures or silent data corruption.
  - Fix: Validate the dimensions property in the provider interface against the vector DB schema during initialization. Implement migration scripts for dimension changes.
- Domain Shift Without Fine-Tuning
  - Mistake: Using a general model for highly technical domains (e.g., Rust or assembly code, legal contracts).
  - Impact: Low recall for domain-specific queries.
  - Fix: Evaluate domain-specific models. If performance is insufficient, consider fine-tuning on a domain corpus using contrastive learning.
- Context Window Truncation
  - Mistake: Feeding long documents to models with short context windows without chunking.
  - Impact: Loss of critical information at the end of documents.
  - Fix: Implement semantic chunking. Ensure chunk size aligns with the model's context limit. Use models with extended context windows for long-document retrieval.
- Cost Blindness on High Dimensions
  - Mistake: Selecting a 3072-d model when a 768-d model achieves 98% of the accuracy.
  - Impact: 4x storage costs and slower queries.
  - Fix: Perform ablation studies. Test retrieval quality with truncated embeddings. Choose the lowest dimensionality that meets accuracy thresholds.
- Latency Spikes During Peak Load
  - Mistake: Relying on a single API endpoint without rate limiting or fallbacks.
  - Impact: Request timeouts and cascading failures.
  - Fix: Implement circuit breakers. Use a provider router with fallback capabilities (e.g., switch to a local model if API latency exceeds a threshold). Vectors from different models are not comparable, so the fallback must either serve the same model or query an index built with the fallback model.
- Black Box Evaluation
  - Mistake: Trusting model claims without testing on production data.
  - Impact: Deployment of models that fail on edge cases.
  - Fix: Always run a golden dataset evaluation before migration. Include edge cases and adversarial queries.
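The first two pitfalls can be caught at startup with a small probe. The sketch below is one possible check, not a prescribed one: `validateProvider`, `ValidationOptions`, and the stub provider are hypothetical names, and the interface is a reduced copy of the provider contract so the snippet stands alone.

```typescript
// Minimal slice of the provider contract needed for the check.
interface ProbeableProvider {
  name: string;
  dimensions: number;
  embed(text: string): Promise<number[]>;
}

interface ValidationOptions {
  expectedDimensions: number; // what the vector DB collection is configured for
  requireUnitNorm: boolean;   // true if the index uses dot product on unit vectors
  tolerance?: number;         // allowed deviation of the L2 norm from 1
}

// Throws if the declared or actual vector size differs from the index schema,
// or if unit-length vectors are required but the probe vector is not unit length.
async function validateProvider(provider: ProbeableProvider, opts: ValidationOptions): Promise<void> {
  if (provider.dimensions !== opts.expectedDimensions) {
    throw new Error(`${provider.name}: declares ${provider.dimensions} dims, index expects ${opts.expectedDimensions}`);
  }
  const probe = await provider.embed("dimension probe");
  if (probe.length !== opts.expectedDimensions) {
    throw new Error(`${provider.name}: returned ${probe.length} dims, index expects ${opts.expectedDimensions}`);
  }
  if (opts.requireUnitNorm) {
    const norm = Math.sqrt(probe.reduce((s, x) => s + x * x, 0));
    if (Math.abs(norm - 1) > (opts.tolerance ?? 1e-3)) {
      throw new Error(`${provider.name}: vector norm is ${norm}, expected 1`);
    }
  }
}

// Stub standing in for a real provider so the sketch runs without a network call.
const stubProvider: ProbeableProvider = {
  name: "stub",
  dimensions: 3,
  embed: async () => [0.6, 0.8, 0],
};

validateProvider(stubProvider, { expectedDimensions: 3, requireUnitNorm: true })
  .then(() => console.log("provider matches index configuration"));
```

Calling this once during service initialization, and again whenever the provider is switched, turns a silent mismatch into an immediate, descriptive failure.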
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High Volume, Low Budget | Self-hosted nomic-embed-text | High throughput, low dimensions, zero API cost. Requires GPU/CPU infra. | High CapEx, Low OpEx. ~90% cost reduction vs API. |
| Critical Accuracy, Mixed Domain | Proprietary text-embedding-3-large | Best-in-class general performance, robust API, handles diverse inputs well. | High OpEx. ~$0.13/1M tokens + storage overhead. |
| Real-Time Edge Application | On-device all-MiniLM-L6-v2 | Extremely low latency, runs on CPU, minimal memory footprint. | Zero inference cost. Storage/Compute on device. |
| Multilingual Enterprise | BGE-M3 or jina-embeddings-v3 | Native support for 100+ languages, unified embedding space. | Moderate. Self-hosted recommended for scale. |
| Code-Specific Retrieval | Starcoder embeddings or fine-tuned BGE | General models lack code syntax understanding. Specialized models improve recall. | Moderate. Fine-tuning requires dataset and compute. |
Configuration Template
Docker Compose for Ollama Serving (Self-Hosted):
version: '3.8'
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_NUM_GPU=999 # Maximize GPU usage
      - OLLAMA_KEEP_ALIVE=24h
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    command: serve
  embedding-worker:
    build: ./embedding-worker
    depends_on:
      - ollama
    environment:
      - OLLAMA_URL=http://ollama:11434
      - MODEL_NAME=nomic-embed-text
      - BATCH_SIZE=64
    deploy:
      replicas: 2
volumes:
  ollama_data:
TypeScript Config for Dynamic Routing:
// config/embedding-config.ts
export const EMBEDDING_CONFIG = {
  primary: {
    provider: 'ollama',
    model: 'nomic-embed-text',
    // Vectors from a different model are not comparable with the primary index;
    // the fallback needs its own index built with the fallback model.
    fallback: {
      provider: 'openai',
      model: 'text-embedding-3-small',
      triggerLatencyMs: 200, // Fallback if primary takes > 200ms
    },
  },
  vectorDb: {
    dimensions: 768,
    distanceMetric: 'cosine', // Matches L2 normalized vectors
  },
  monitoring: {
    enabled: true,
    driftThreshold: 0.05, // Alert if distribution shift > 5%
  },
};
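How the `triggerLatencyMs` setting might be enforced is sketched below, using `Promise.race` against a timer. This is an illustrative assumption about the router's behavior, not a definitive implementation: `Embedder`, `withTimeout`, and `embedWithFallback` are hypothetical names, and the stub providers simulate latency with `setTimeout`. The caller is responsible for querying the index that matches whichever model produced the vector.

```typescript
// Reduced provider contract so the sketch stands alone.
interface Embedder {
  embed(text: string): Promise<number[]>;
}

// Resolve with `work`, or reject if it takes longer than `ms` milliseconds.
async function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  try {
    return await Promise.race([work, deadline]);
  } finally {
    clearTimeout(timer); // avoid keeping the process alive once the race is settled
  }
}

// Try the primary provider; on timeout or error, use the fallback instead.
// `usedFallback` tells the caller which index the vector belongs to.
async function embedWithFallback(
  primary: Embedder,
  fallback: Embedder,
  text: string,
  triggerLatencyMs: number,
): Promise<{ vector: number[]; usedFallback: boolean }> {
  try {
    const vector = await withTimeout(primary.embed(text), triggerLatencyMs);
    return { vector, usedFallback: false };
  } catch {
    const vector = await fallback.embed(text);
    return { vector, usedFallback: true };
  }
}

// Stubs that simulate a slow primary and a fast fallback.
const slowPrimary: Embedder = {
  embed: () => new Promise<number[]>((resolve) => setTimeout(() => resolve([1, 0]), 50)),
};
const fastFallback: Embedder = { embed: async () => [0, 1] };

embedWithFallback(slowPrimary, fastFallback, "hello", 10).then((result) => {
  console.log(result.usedFallback); // true, because the primary exceeds 10ms
});
```

A production router would usually add a circuit breaker on top, so that a persistently slow primary is skipped outright instead of being retried on every request.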
Quick Start Guide
- Install Dependencies:
npm install openai axios
- Initialize Provider:
import { OllamaProvider } from './providers/OllamaProvider';
import { EmbeddingService } from './services/EmbeddingService';
const provider = new OllamaProvider('http://localhost:11434', 'nomic-embed-text', 768);
const service = new EmbeddingService(provider);
- Generate Embedding:
const text = "Embedding model selection impacts retrieval quality.";
const embedding = await service.generateEmbedding(text);
console.log(`Dimensions: ${embedding.length}`);
- Verify Index:
Ensure your vector database collection is configured with dimensions: 768 and distance: cosine.
- Benchmark:
Run a sample query against your dataset. Measure Recall@5. If below target, swap provider in config and re-evaluate.