Implementation
The following TypeScript implementation demonstrates a production-ready RAG pipeline. It includes error handling, context window management, and an evaluation step.
import { createClient, VectorStore } from '@vectorstore/sdk';
import { LLMClient, EmbeddingModel } from '@ai-providers/sdk';
import { z } from 'zod';

// --- Types & Interfaces ---
interface ContextChunk {
  id: string;
  text: string;
  metadata: Record<string, any>;
  score: number;
}

interface AIResponse {
  content: string;
  sources: string[];
  confidence: number;
}

interface EvaluationResult {
  passed: boolean;
  score: number;
  reasons: string[];
}

interface RAGConfig {
  embeddingModel: string;
  generationModel: string;
  topK: number;
  minScore: number;
  maxContextTokens: number;
}

// --- Core Pipeline ---
// Exported so it can be imported from other modules (see the Quick Start Guide).
export class AIProductPipeline {
  private vectorStore: VectorStore;
  private llm: LLMClient;
  private embeddingModel: EmbeddingModel;
  private config: RAGConfig;

  constructor(config: RAGConfig) {
    this.config = config;
    this.vectorStore = createClient(process.env.VECTOR_DB_URL);
    this.llm = new LLMClient(process.env.LLM_API_KEY);
    this.embeddingModel = new EmbeddingModel(process.env.EMBEDDING_API_KEY);
  }

  async generateResponse(query: string): Promise<AIResponse> {
    try {
      // 1. Embed Query
      const queryVector = await this.embeddingModel.embed(query);

      // 2. Retrieve Context
      const chunks = await this.retrieveContext(queryVector, query);
      if (chunks.length === 0) {
        return this.handleFallback(query);
      }

      // 3. Augment & Generate
      const prompt = this.buildPrompt(query, chunks);
      const rawResponse = await this.llm.generate(prompt, {
        model: this.config.generationModel,
        temperature: 0.1, // Low temperature for factual grounding
      });

      // 4. Evaluate Response (the query is passed so relevance can be judged)
      const evalResult = await this.evaluate(query, rawResponse, chunks);
      if (!evalResult.passed) {
        console.warn(`Evaluation failed: ${evalResult.reasons.join(', ')}`);
        return this.handleFallback(query, evalResult);
      }

      return {
        content: rawResponse,
        sources: chunks.map(c => c.id),
        confidence: evalResult.score,
      };
    } catch (error) {
      // Log the underlying error, then surface a generic failure to the caller.
      // Replace console.error with your observability tooling in production.
      console.error('Pipeline error:', error);
      throw new Error('AI service unavailable');
    }
  }

  // `query` is not used yet; it is available for hybrid re-ranking or keyword filtering.
  private async retrieveContext(queryVector: number[], query: string): Promise<ContextChunk[]> {
    const results = await this.vectorStore.similaritySearch({
      vector: queryVector,
      topK: this.config.topK,
      filter: { active: true }, // Example metadata filter
    });

    // Hybrid re-ranking or keyword filtering could be applied here
    return results
      .filter(r => r.score >= this.config.minScore)
      .map(r => ({
        id: r.id,
        text: r.text,
        metadata: r.metadata,
        score: r.score,
      }));
  }

  private buildPrompt(query: string, chunks: ContextChunk[]): string {
    // Context window management: cap the number of chunks, assuming roughly
    // 100 tokens per chunk. This is a coarse estimate, not a real token count.
    const maxChunks = Math.floor(this.config.maxContextTokens / 100);
    const contextText = chunks
      .slice(0, maxChunks)
      .map(c => `<context>${c.text}</context>`)
      .join('\n');

    return `
You are a helpful assistant. Answer the user's question based *only* on the provided context.
If the context does not contain the answer, state that you cannot answer based on the available information.
Context:
${contextText}
Question: ${query}
Answer:
`;
  }

  private async evaluate(query: string, response: string, chunks: ContextChunk[]): Promise<EvaluationResult> {
    // LLM-as-a-Judge or deterministic metric evaluation
    const evalPrompt = `
Evaluate the following response based on the provided context.
Criteria:
1. Groundedness: Is the response supported by the context?
2. Relevance: Does the response answer the query?
Query: ${query}
Context: ${chunks.map(c => c.text).join(' ')}
Response: ${response}
Return JSON: { "passed": boolean, "score": number, "reasons": string[] }
`;
    const evalOutput = await this.llm.generate(evalPrompt, {
      model: 'evaluation-model-v1', // Smaller, faster model
      response_format: { type: 'json_object' },
    });

    // Validate the judge's output; malformed JSON or a schema mismatch throws
    // and is handled by the try/catch in generateResponse.
    const schema = z.object({
      passed: z.boolean(),
      score: z.number(),
      reasons: z.array(z.string()),
    });
    return schema.parse(JSON.parse(evalOutput));
  }

  private handleFallback(query: string, evalResult?: EvaluationResult): AIResponse {
    // Implement fallback logic: e.g., return "I don't know" or route to human agent
    return {
      content: "I'm unable to provide a definitive answer based on the current information.",
      sources: [],
      confidence: 0,
    };
  }
}
Rationale
- Low Temperature: Setting temperature: 0.1 reduces sampling randomness, which keeps the model's output closer to the provided context.
- Evaluation Step: The evaluate method checks each response for groundedness and relevance before it reaches the user, which reduces the chance of a hallucinated answer being returned. It uses a dedicated, smaller evaluation model to limit the added latency.
- Context Management: The buildPrompt method caps the number of chunks using a rough per-chunk token estimate, which helps avoid context window overflow errors.
- Fallback Strategy: The pipeline degrades gracefully. If retrieval returns no chunks above the score threshold, or evaluation rejects the response, a safe fallback is returned.
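The chunk cap in buildPrompt treats every chunk as roughly the same size. One alternative is to select chunks against an explicit token budget. The sketch below is illustrative only: ScoredChunk, estimateTokens, and selectChunksWithinBudget are hypothetical names, and the 4-characters-per-token ratio is an assumed heuristic, not a real tokenizer count.

```typescript
// Minimal shape of a retrieved chunk, mirroring ContextChunk above.
interface ScoredChunk {
  id: string;
  text: string;
  score: number;
}

// Assumption: roughly 4 characters per token. This is a coarse heuristic for
// English text; swap in your model's tokenizer for accurate counts.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Greedily keep the highest-scoring chunks that fit within the token budget.
function selectChunksWithinBudget(
  chunks: ScoredChunk[],
  maxContextTokens: number,
): ScoredChunk[] {
  const byScore = [...chunks].sort((a, b) => b.score - a.score);
  const selected: ScoredChunk[] = [];
  let used = 0;
  for (const chunk of byScore) {
    const cost = estimateTokens(chunk.text);
    if (used + cost > maxContextTokens) continue; // skip chunks that do not fit
    selected.push(chunk);
    used += cost;
  }
  return selected;
}
```

Skipping an oversized chunk instead of stopping at it lets smaller, lower-ranked chunks still use the remaining budget. Whether that is desirable depends on how much you trust the lower-scoring results.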
Pitfall Guide
Production experience reveals specific failure modes that can derail AI products. Avoid these pitfalls to ensure stability and scalability.
- Ignoring Evaluation Debt
  - Mistake: Shipping without automated evaluation. Relying on manual testing is insufficient for non-deterministic systems.
  - Best Practice: Implement a continuous evaluation harness. Run evals on every model update and periodically on production traffic. Use metrics like groundedness, faithfulness, and answer relevance.
- Hardcoding Prompts
  - Mistake: Embedding prompts directly in code. This makes updates difficult and prevents A/B testing.
  - Best Practice: Externalize prompts to a versioned store. Use prompt management tools to update prompts without code deployments. Version control allows rollback if a prompt change degrades performance.
- Context Window Overflow
  - Mistake: Retrieving too many chunks or using unbounded text, causing API errors or truncation.
  - Best Practice: Implement dynamic context window management. Chunk documents semantically, retrieve a fixed top-K, and truncate based on token counts. Prioritize chunks with higher relevance scores.
- Cost Blindness
  - Mistake: No monitoring of token usage or cost per query. Costs spiral due to prompt injection or inefficient pipelines.
  - Best Practice: Implement cost tracking and budget caps. Use model routing to send simple queries to cheaper models. Monitor average tokens per query and alert on anomalies.
- Data Leakage and PII
  - Mistake: Ingesting sensitive data into vector stores without redaction.
  - Best Practice: Implement a PII redaction pipeline before embedding. Use access controls on the vector store to ensure retrieval respects user permissions. Audit data flows regularly.
- Silent Failures
  - Mistake: The model returns a plausible but incorrect answer without indicating uncertainty.
  - Best Practice: Require confidence scores. If confidence is below a threshold, trigger a fallback. Use the evaluation layer to detect low-confidence responses and suppress them.
- Over-Reliance on Single Model
  - Mistake: Building the entire product around one LLM provider. Outages or API changes break the product.
  - Best Practice: Abstract the LLM interface. Support multiple providers and models. Implement automatic fallback to secondary providers in case of primary provider failure.
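To make the last pitfall concrete, here is a minimal sketch of a provider-agnostic interface with ordered fallback. TextGenerator and generateWithFallback are hypothetical names introduced for illustration; they are not part of any SDK referenced in this article, and a real implementation would likely add timeouts and retry limits.

```typescript
// Hypothetical provider-agnostic interface; each provider SDK would be
// wrapped in an adapter that implements it.
interface TextGenerator {
  name: string;
  generate(prompt: string): Promise<string>;
}

// Try each provider in order and return the first successful result.
async function generateWithFallback(
  providers: TextGenerator[],
  prompt: string,
): Promise<{ provider: string; text: string }> {
  const errors: string[] = [];
  for (const provider of providers) {
    try {
      const text = await provider.generate(prompt);
      return { provider: provider.name, text };
    } catch (err) {
      // Record the failure and move on to the next provider.
      const message = err instanceof Error ? err.message : String(err);
      errors.push(`${provider.name}: ${message}`);
    }
  }
  throw new Error(`All providers failed: ${errors.join('; ')}`);
}
```

Returning the provider name alongside the text makes it possible to log how often the secondary provider is used, which is a useful signal for both reliability and cost monitoring.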
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Domain-Specific Q&A | RAG with Hybrid Search | Grounded answers reduce hallucination; easy to update knowledge base. | Medium per-query; low upfront. |
| Structured Data Extraction | Fine-Tuned Model | High accuracy on specific formats; faster inference than RAG. | High upfront; low per-query. |
| Creative Content Generation | Prompting with Guardrails | Flexibility required; RAG may constrain creativity unnecessarily. | Low upfront; medium per-query. |
| Real-Time Chatbot | RAG + Streaming + Eval | Low latency streaming with evaluation ensures quality without blocking. | Medium per-query; requires infra. |
| Legacy System Integration | Model Router + API Wrapper | Abstracts AI complexity; allows gradual migration to full RAG. | Low upfront; scales with usage. |
Configuration Template
Copy this TypeScript configuration to bootstrap your AI product settings. Adjust values based on your specific requirements and model capabilities.
// ai.config.ts
export const AIConfig = {
  models: {
    embedding: {
      provider: 'openai',
      model: 'text-embedding-3-large',
      // text-embedding-3-large returns 3072 dimensions by default; 1536 must be
      // requested explicitly and must match the vector index dimension.
      dimensions: 1536,
    },
    generation: {
      primary: {
        provider: 'anthropic',
        model: 'claude-3-5-sonnet-20240620',
        maxTokens: 1024,
        temperature: 0.1,
      },
      fallback: {
        provider: 'openai',
        model: 'gpt-4o-mini',
        maxTokens: 512,
        temperature: 0.1,
      },
      evaluation: {
        provider: 'openai',
        model: 'gpt-4o-mini',
        maxTokens: 256,
      },
    },
  },
  retrieval: {
    vectorStore: 'pinecone',
    topK: 5,
    minScore: 0.75,
    maxContextTokens: 4000,
  },
  evaluation: {
    enabled: true,
    thresholds: {
      groundedness: 0.8,
      relevance: 0.85,
    },
  },
  observability: {
    tracing: true,
    logging: 'verbose',
    costTracking: true,
  },
};
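The template defines per-criterion thresholds under evaluation.thresholds, while the pipeline's evaluate method returns a single passed flag. One way to connect the two is sketched below, assuming the evaluation model can be prompted to return a score between 0 and 1 for each criterion. CriterionScores, ThresholdResult, and checkThresholds are hypothetical names, not part of the pipeline above.

```typescript
// Per-criterion scores in [0, 1]; assumed to come from the evaluation model.
type CriterionScores = Record<string, number>;

interface ThresholdResult {
  passed: boolean;
  reasons: string[];
}

// A response passes only if every configured criterion meets its threshold.
// A criterion with no reported score is treated as a failure.
function checkThresholds(
  scores: CriterionScores,
  thresholds: Record<string, number>,
): ThresholdResult {
  const reasons: string[] = [];
  for (const [criterion, minimum] of Object.entries(thresholds)) {
    const score = scores[criterion];
    if (score === undefined) {
      reasons.push(`${criterion}: no score reported`);
    } else if (score < minimum) {
      reasons.push(`${criterion}: ${score} is below ${minimum}`);
    }
  }
  return { passed: reasons.length === 0, reasons };
}
```

Treating a missing score as a failure is a deliberately conservative choice: if the judge omits a criterion, the response is held back instead of being passed by default.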
Quick Start Guide
Follow these steps to initialize your AI product pipeline in under 5 minutes.
- Install Dependencies:
  Run npm install @vectorstore/sdk @ai-providers/sdk zod to install the required libraries for vector search, LLM interaction, and validation.
- Configure Environment:
  Create a .env file with your API keys and vector store URL:
  LLM_API_KEY=sk-...
  EMBEDDING_API_KEY=sk-...
  VECTOR_DB_URL=https://your-vector-store...
- Initialize Pipeline:
  Import the AIProductPipeline class and instantiate it with a RAGConfig built from your configuration. AIConfig.retrieval alone does not satisfy RAGConfig, because it lacks the embedding and generation model names:
  import { AIProductPipeline } from './pipeline';
  import { AIConfig } from './ai.config';
  const pipeline = new AIProductPipeline({
    embeddingModel: AIConfig.models.embedding.model,
    generationModel: AIConfig.models.generation.primary.model,
    topK: AIConfig.retrieval.topK,
    minScore: AIConfig.retrieval.minScore,
    maxContextTokens: AIConfig.retrieval.maxContextTokens,
  });
- Run Test Query:
  Execute a test query to verify the pipeline:
  const response = await pipeline.generateResponse("How do I reset my password?");
  console.log(response.content);
- Verify Evaluation:
  Check the evaluation logs to ensure the evaluation layer is functioning. Confirm that the response passed the groundedness and relevance checks before deployment.
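The last two steps can be automated with a small smoke test. The sketch below is a hypothetical helper, not part of the pipeline: it runs a list of queries and reports which ones produced the fallback response. It relies only on the fact that handleFallback returns an empty sources array and a confidence of 0. The Answerer, SmokeResponse, SmokeReport, and runSmokeTest names are illustrative.

```typescript
// Structural subset of AIResponse from the pipeline above.
interface SmokeResponse {
  content: string;
  sources: string[];
  confidence: number;
}

// Anything with a compatible generateResponse method, e.g. AIProductPipeline.
interface Answerer {
  generateResponse(query: string): Promise<SmokeResponse>;
}

interface SmokeReport {
  total: number;
  answered: number; // queries that did not produce the fallback response
  fallbacks: string[]; // queries that produced the fallback response
}

async function runSmokeTest(pipeline: Answerer, queries: string[]): Promise<SmokeReport> {
  const fallbacks: string[] = [];
  for (const query of queries) {
    const response = await pipeline.generateResponse(query);
    // The pipeline's fallback returns no sources and a confidence of 0.
    if (response.sources.length === 0 && response.confidence === 0) {
      fallbacks.push(query);
    }
  }
  return {
    total: queries.length,
    answered: queries.length - fallbacks.length,
    fallbacks,
  };
}
```

Calling runSmokeTest(pipeline, ["How do I reset my password?"]) with the pipeline from the Initialize Pipeline step and inspecting the fallbacks array gives a quick signal that retrieval and evaluation are both working. It does not replace a full evaluation harness.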