levance. A lightweight cross-encoder (e.g., ms-marco-MiniLM-L-6-v2 or bge-reranker-v2-m3) scores the top 20–50 candidates from hybrid retrieval, applying attention across both query and document tokens.
3. Metadata-First Filtering: Vector databases are not relational engines. Filtering on timestamps, categories, or access control must occur before or alongside vector search to prevent scanning irrelevant partitions.
4. Asynchronous Ingestion Pipeline: Embedding generation and index updates must be decoupled from user requests. A message queue (Redis Streams or SQS) handles chunking, embedding, and upserts with dead-letter retry logic.
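The retry and dead-letter behavior described in the ingestion item can be sketched with a small in-memory consumer loop. This is an illustration only: `IngestJob`, `drainQueue`, and the `handler` callback are hypothetical names, the queue is a plain array rather than Redis Streams or SQS, and the handler is synchronous for brevity where a real embedding call would be awaited.

```typescript
// Hypothetical job shape; a real queue message would carry more fields.
interface IngestJob {
  docId: string;
  text: string;
  attempts: number;
}

interface DrainResult {
  processed: string[];     // docIds that were handled successfully
  deadLetter: IngestJob[]; // jobs that exhausted their retries
}

// Processes jobs in FIFO order. A failed job is re-queued at the back until
// it has been attempted maxAttempts times, then moved to the dead-letter list.
function drainQueue(
  jobs: IngestJob[],
  handler: (job: IngestJob) => void,
  maxAttempts: number = 3
): DrainResult {
  const queue = jobs.map(j => ({ ...j })); // copy so the caller's jobs are not mutated
  const result: DrainResult = { processed: [], deadLetter: [] };
  while (queue.length > 0) {
    const job = queue.shift()!;
    job.attempts += 1;
    try {
      handler(job); // stands in for chunking, embedding, and upserting
      result.processed.push(job.docId);
    } catch {
      if (job.attempts >= maxAttempts) result.deadLetter.push(job);
      else queue.push(job);
    }
  }
  return result;
}
```

Because a retried job may already have been partially written, the handler should be idempotent, for example by upserting under a content-derived ID.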
Step-by-Step Implementation
Step 1: Document Chunking and Embedding
Fixed-size chunking with overlap carries context across chunk boundaries. Semantic-aware chunking (splitting on headings, code blocks, or paragraph breaks) improves retrieval granularity.
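import { createHash } from 'crypto';
interface Chunk {
  id: string;
  content: string;
  metadata: Record<string, string | number | boolean>;
}
// Whitespace word count, used here as a rough stand-in for model tokens
const countWords = (s: string): number => s.split(/\s+/).filter(Boolean).length;
export function chunkDocument(text: string, maxTokens: number = 300): Chunk[] {
  // Split after sentence-ending punctuation and drop empty fragments
  const sentences = text.split(/(?<=[.!?])\s+/).map(s => s.trim()).filter(Boolean);
  const chunks: Chunk[] = [];
  let current: string[] = [];
  let tokenCount = 0;
  for (const sentence of sentences) {
    const tokens = countWords(sentence);
    if (tokenCount + tokens > maxTokens && current.length > 0) {
      chunks.push(buildChunk(current.join(' ')));
      // Carry the last two sentences forward as overlap, trimmed so that
      // long sentences cannot push the next chunk over the budget
      current = current.slice(-2);
      tokenCount = countWords(current.join(' '));
      while (current.length > 0 && tokenCount + tokens > maxTokens) {
        current.shift();
        tokenCount = countWords(current.join(' '));
      }
    }
    current.push(sentence);
    tokenCount += tokens;
  }
  if (current.length > 0) chunks.push(buildChunk(current.join(' ')));
  return chunks;
}
function buildChunk(content: string): Chunk {
  return {
    // Content-derived ID: re-ingesting identical text yields the same ID
    id: createHash('sha256').update(content).digest('hex').slice(0, 12),
    content,
    metadata: { source: 'docs', version: '1.0' }
  };
}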
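The semantic-aware alternative mentioned above can be sketched as a splitter that cuts only at markdown headings and never inside a code fence. This is a minimal sketch, not a full markdown parser: `splitMarkdownSections` is a hypothetical helper, it recognizes only `#`-style headings and backtick fences, and the resulting sections would still need a size cap before embedding.

```typescript
// The fence marker is built at runtime so this example can itself sit inside a fenced block.
const FENCE = '`'.repeat(3);

// Splits markdown into sections that each start at a heading line.
// Lines inside a code fence are never treated as headings, so code blocks
// stay intact within their section.
function splitMarkdownSections(markdown: string): string[] {
  const sections: string[] = [];
  let current: string[] = [];
  let inFence = false;
  for (const line of markdown.split('\n')) {
    // A fence line opens a code block if none is open, otherwise closes it
    if (line.trim().startsWith(FENCE)) inFence = !inFence;
    const isHeading = !inFence && /^#{1,6}\s/.test(line);
    if (isHeading && current.length > 0) {
      sections.push(current.join('\n').trim());
      current = [];
    }
    current.push(line);
  }
  if (current.length > 0) sections.push(current.join('\n').trim());
  return sections.filter(s => s.length > 0);
}
```

Each returned section can then be passed through a sentence-level chunker when it exceeds the token budget, which keeps headings attached to the text they introduce.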
Step 2: Hybrid Query Execution
RRF merges lexical and vector results without requiring score normalization.
import { OpenAI } from 'openai';
import { Client } from '@elastic/elasticsearch'; // BM25 (v8 client: the response is the body itself)
import weaviate from 'weaviate-ts-client'; // Vector
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const esClient = new Client({ node: process.env.ELASTIC_URL });
// weaviate-ts-client exposes a factory function rather than a constructor
const wvClient = weaviate.client({ scheme: 'https', host: process.env.WEAVIATE_HOST ?? '' });
const RRF_K = 60; // RRF smoothing constant
export async function hybridSearch(query: string, k: number = 20): Promise<string[]> {
  // 1. Generate query embedding
  const embeddingRes = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: query,
    dimensions: 1536
  });
  const queryVector = embeddingRes.data[0].embedding;
  // 2. Parallel BM25 + Vector search
  const [bm25Res, vectorRes] = await Promise.all([
    esClient.search({
      index: 'documents',
      query: { match: { content: query } },
      size: k
    }),
    wvClient.graphql.get()
      .withClassName('Document')
      .withFields('_additional { id }') // the object ID must be requested explicitly
      .withNearVector({ vector: queryVector })
      .withLimit(k)
      .do()
  ]);
  // 3. Reciprocal Rank Fusion: each list adds 1 / (RRF_K + rank), with 1-based ranks.
  // Fusion only works if both stores identify a chunk by the same ID.
  const rrf = new Map<string, number>();
  const addScore = (docId: string, rank: number) => {
    rrf.set(docId, (rrf.get(docId) || 0) + 1 / (RRF_K + rank));
  };
  bm25Res.hits.hits.forEach((hit, i) => { if (hit._id) addScore(hit._id, i + 1); });
  vectorRes.data.Get.Document.forEach((doc: any, i: number) => addScore(doc._additional.id, i + 1));
  // 4. Sort by RRF score and return top-k
  return Array.from(rrf.entries())
    .sort((a, b) => b[1] - a[1])
    .slice(0, k)
    .map(([id]) => id);
}
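To see why RRF needs no score normalization, it may help to isolate the fusion step from the two search clients. The sketch below uses only rank positions, with 1-based ranks and the commonly used constant k = 60; `reciprocalRankFusion` and the document IDs are hypothetical names chosen for illustration.

```typescript
// Fuses several ranked ID lists with Reciprocal Rank Fusion.
// Each list contributes 1 / (k + rank) per document, with rank starting at 1.
function reciprocalRankFusion(rankings: string[][], k: number = 60): [string, number][] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  return Array.from(scores.entries()).sort((a, b) => b[1] - a[1]);
}

// Hypothetical result lists from a lexical and a vector retriever.
const lexical = ['doc-a', 'doc-b', 'doc-c'];
const vector = ['doc-b', 'doc-d', 'doc-a'];
const fused = reciprocalRankFusion([lexical, vector]);
console.log(fused.map(([id]) => id));
// prints [ 'doc-b', 'doc-a', 'doc-d', 'doc-c' ]
```

Here doc-b wins because it ranks near the top of both lists (1/62 + 1/61), ahead of doc-a (1/61 + 1/63), while documents found by only one retriever trail behind. The raw BM25 and cosine scores never enter the calculation.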
Step 3: Cross-Encoder Reranking
Bi-encoder retrieval returns candidates; reranking scores them jointly.
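import { AutoTokenizer, AutoModelForSequenceClassification } from '@xenova/transformers';
// A cross-encoder emits one relevance logit per (query, document) pair. The tokenizer and
// model are loaded directly because the generic text-classification pipeline normalizes a
// single-logit output, which would discard the ranking signal.
const MODEL_ID = 'Xenova/ms-marco-MiniLM-L-6-v2';
const tokenizer = await AutoTokenizer.from_pretrained(MODEL_ID);
const model = await AutoModelForSequenceClassification.from_pretrained(MODEL_ID);
export async function rerankCandidates(query: string, candidateIds: string[], documents: Map<string, string>, topK: number = 10) {
  if (candidateIds.length === 0) return [];
  const texts = candidateIds.map(id => documents.get(id) || '');
  // Encode the query against every candidate as a sentence pair
  const inputs = tokenizer(new Array(texts.length).fill(query), { text_pair: texts, padding: true, truncation: true });
  const { logits } = await model(inputs);
  const scores = Array.from(logits.data as Float32Array);
  // Higher logit means more relevant; keep only the best topK candidates
  return candidateIds
    .map((id, i) => ({ id, score: scores[i], content: texts[i] }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}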
Step 4: Synthesis with Controlled Context
LLM synthesis must enforce citation, temperature constraints, and context window limits.
// Reuses the `openai` client created in Step 2
export async function generateAnswer(query: string, rankedDocs: { id: string; content: string }[]) {
  // Cap the context at the five highest-ranked chunks, each tagged with its ID for citation
  const context = rankedDocs.slice(0, 5).map(d => `[${d.id}] ${d.content}`).join('\n\n');
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      { role: 'system', content: 'Answer using ONLY the provided context. Cite sources with [id]. If unknown, state so.' },
      { role: 'user', content: `Context:\n${context}\n\nQuestion: ${query}` }
    ],
    temperature: 0,
    max_tokens: 500
  });
  return response.choices[0].message.content;
}
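Prompting for citations does not guarantee that the model cites only what it was given, so a post-check on the answer is a reasonable complement. The sketch below assumes citations appear as IDs in square brackets, matching the `[id]` format requested in the system prompt; `findUnknownCitations` is a hypothetical helper name.

```typescript
// Returns the cited IDs that do not correspond to any supplied source.
function findUnknownCitations(answer: string, sourceIds: string[]): string[] {
  const known = new Set(sourceIds);
  const cited = new Set<string>();
  // Assumption: every citation is written as [id] with no nested brackets
  const pattern = /\[([^\[\]]+)\]/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(answer)) !== null) {
    cited.add(match[1]);
  }
  return Array.from(cited).filter(id => !known.has(id));
}

// Example with one valid and one invented citation
const unknown = findUnknownCitations(
  'Hybrid search merges both lists [a1b2c3] before reranking [ffff00].',
  ['a1b2c3', 'd4e5f6']
);
console.log(unknown); // prints [ 'ffff00' ]
```

A non-empty result is a signal to reject or regenerate the answer. An empty result only shows that no unknown IDs were cited; it does not prove that the cited passages actually support the claims.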
Pitfall Guide
- Ignoring Chunk Boundaries: Splitting documents at arbitrary byte boundaries fractures semantic units. Code blocks, tables, and lists lose structural meaning. Use AST-aware or markdown-aware chunkers that respect headings, code fences, and paragraph breaks. Overlap of 10–15% preserves context continuity without excessive duplication.
- Skipping Reranking: Bi-encoders optimize for fast similarity, not relevance. Without a cross-encoder reranker, systems return documents that share vocabulary but miss intent. Production deployments show 12–22% MRR improvement after adding reranking, even with lightweight models.
- Misconfiguring HNSW Parameters: m (connections per node) and ef_search (candidate pool size) directly control recall vs latency. Default values often underperform. For 1M+ vectors, set m=16–32, ef_construction=200–400, and tune ef_search via load testing. Lower ef_search reduces latency but increases false negatives.
- Filtering After Vector Search: Applying metadata filters after ANN search spends the retrieved top-k on documents that are then discarded, so a selective filter can return far fewer than k results or force heavy over-fetching. Push filters to the vector database query layer. Use partitioned indices or metadata-aware hybrid search to restrict the candidate space before distance computation.
- Query Normalization Blind Spots: Users type abbreviations, typos, and domain-specific jargon. Raw embeddings amplify these variations. Implement a lightweight normalization layer: expand acronyms via a domain dictionary, apply fuzzy matching for critical terms, and strip stop words only where doing so preserves the query's intent.
- Cold Start Ingestion Failures: New documents without embeddings break retrieval pipelines. Implement async ingestion with idempotent upserts, dead-letter queues for failed embeddings, and a fallback BM25 index that activates until vectors are ready. Monitor ingestion lag via consumer group offsets.
- Unconstrained LLM Synthesis: Passing raw ranked documents to an LLM without structure invites hallucination and token waste. Enforce strict system prompts, limit context to top-5 reranked chunks, require citation formatting, and set temperature=0. Validate output against source IDs before returning to users.
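The filtering pitfall above is easy to reproduce on toy data. The sketch below uses an exact dot-product search as a stand-in for an ANN index and compares filtering after the search with filtering before it; the `Doc` shape, the `tenant` field, and all function names are hypothetical.

```typescript
interface Doc {
  id: string;
  tenant: string;
  vector: number[];
}

function dot(a: number[], b: number[]): number {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// Exact top-k by dot product; stands in for an ANN index in this sketch.
function topK(docs: Doc[], query: number[], k: number): Doc[] {
  return docs.slice().sort((a, b) => dot(b.vector, query) - dot(a.vector, query)).slice(0, k);
}

// Post-filtering: search everything, then drop documents that fail the filter.
function postFilter(docs: Doc[], query: number[], k: number, tenant: string): Doc[] {
  return topK(docs, query, k).filter(d => d.tenant === tenant);
}

// Pre-filtering: restrict the candidate set first, then search within it.
function preFilter(docs: Doc[], query: number[], k: number, tenant: string): Doc[] {
  return topK(docs.filter(d => d.tenant === tenant), query, k);
}

const sampleDocs: Doc[] = [
  { id: 'a1', tenant: 'A', vector: [0.9, 0.1] },
  { id: 'a2', tenant: 'A', vector: [0.8, 0.2] },
  { id: 'a3', tenant: 'A', vector: [0.7, 0.3] },
  { id: 'b1', tenant: 'B', vector: [0.5, 0.5] },
  { id: 'b2', tenant: 'B', vector: [0.4, 0.6] }
];
const q = [1, 0];
console.log(postFilter(sampleDocs, q, 2, 'B').map(d => d.id)); // prints []
console.log(preFilter(sampleDocs, q, 2, 'B').map(d => d.id));  // prints [ 'b1', 'b2' ]
```

With k = 2, the unfiltered search returns only tenant A documents, so post-filtering for tenant B leaves nothing, while pre-filtering returns both tenant B documents. Real vector databases implement the pre-filter inside the index traversal rather than by copying the candidate set.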
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Real-time chatbot (<100ms SLA) | BM25 + lightweight vector (128-dim) + shallow rerank | Latency constraint prioritizes speed; low-dim embeddings reduce compute | Low infrastructure, moderate API cost |
| Enterprise knowledge base (high accuracy) | Hybrid retrieval + cross-encoder reranker + strict synthesis | Precision critical for compliance; reranker recovers interaction signals | Higher compute, lower support ticket volume |
| Low-budget MVP (<10k docs) | Open-source embeddings + pgvector + RRF only | Eliminates paid reranker; pgvector scales well for small datasets | Minimal SaaS cost, manageable self-hosted ops |
| Multi-tenant SaaS | Hybrid search + metadata partitioning + per-tenant reranking | Isolation prevents cross-tenant leakage; partitioned indices improve filter performance | Moderate storage overhead, high security ROI |
Configuration Template
# docker-compose.yml
# Anonymous access and disabled security are for local development only
services:
  weaviate:
    image: semitechnologies/weaviate:1.25.0
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      CLUSTER_HOSTNAME: "node1"
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: "true"
      PERSISTENCE_DATA_PATH: "/var/lib/weaviate"  # must match the mounted volume
      DEFAULT_VECTORIZER_MODULE: "none"           # vectors are supplied by the application
    ports: ["8080:8080"]
    volumes: ["weaviate_data:/var/lib/weaviate"]
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.12.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
    ports: ["9200:9200"]
    volumes: ["es_data:/usr/share/elasticsearch/data"]
  redis:
    image: redis:7.2-alpine
    ports: ["6379:6379"]
    command: ["redis-server", "--save", "", "--appendonly", "no"]
volumes:
  weaviate_data:
  es_data:
// search.config.ts
export const searchConfig = {
  embedding: {
    model: 'text-embedding-3-small',
    dimensions: 1536,
    batchSize: 100,
    retryAttempts: 3
  },
  retrieval: {
    bm25Index: 'documents',
    vectorClass: 'Document',
    hybridAlpha: 0.5, // RRF weight balance
    topKBeforeRerank: 30,
    topKAfterRerank: 5
  },
  reranker: {
    model: 'Xenova/ms-marco-MiniLM-L-6-v2',
    maxSequenceLength: 512,
    device: 'cpu' // or 'cuda' for GPU
  },
  synthesis: {
    model: 'gpt-4o-mini',
    temperature: 0,
    maxTokens: 500,
    requireCitations: true
  },
  monitoring: {
    metricsEndpoint: '/metrics',
    latencyPercentiles: [50, 95, 99],
    fallbackThreshold: 200 // ms
  }
};
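The monitoring block lists latency percentiles and a millisecond threshold but does not say how they are computed or compared. One possible reading, sketched below, uses the nearest-rank percentile over recent request latencies and treats a p95 above the threshold as the fallback trigger. Both the `percentile` helper and that interpretation of `fallbackThreshold` are assumptions, not something the configuration itself defines.

```typescript
// Nearest-rank percentile over a list of latency samples in milliseconds.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Hypothetical latencies (ms) from ten recent search requests
const latencies = [120, 80, 95, 310, 150, 101, 99, 140, 88, 205];
const p50 = percentile(latencies, 50); // 101
const p95 = percentile(latencies, 95); // 310
const p99 = percentile(latencies, 99); // 310

// Assumption: fall back (for example, skip reranking) when p95 exceeds the threshold
const fallbackThresholdMs = 200;
const shouldFallBack = p95 > fallbackThresholdMs;
console.log({ p50, p95, p99, shouldFallBack });
```

With only ten samples the p95 and p99 both land on the slowest request, which is why percentile alerts are usually computed over a larger sliding window.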
Quick Start Guide
- Spin up infrastructure: Run docker compose up -d to initialize Weaviate, Elasticsearch, and Redis. Verify health endpoints return 200 OK.
- Ingest sample documents: Execute the chunking and embedding pipeline against a test dataset. Confirm vectors are upserted to Weaviate and BM25 documents are indexed in Elasticsearch.
- Execute hybrid query: Call the hybridSearch function with a test query. Validate that RRF merges results and returns ranked IDs.
- Apply reranking and synthesis: Pass candidate IDs to the reranker, then feed top results to the LLM synthesis function. Verify citations match source IDs and latency stays within SLA thresholds.