n**, Context Grounding, and Verification.
Architecture Decision Rationale
- Separation of Concerns: Do not embed verification logic in the generation prompt. Use a distinct verification step so that grounding is enforced in code rather than left to the generator's compliance with instructions.
- Hybrid Retrieval: Vector search alone fails on exact keyword matches and structured data. Combine dense embeddings with BM25 sparse retrieval.
- Claim Extraction: Verify at the claim level, not the document level. Claim-level checks catch cases that a document-level check would wrongly pass, where a document contains relevant information but does not support the specific claim.
Step-by-Step Implementation
1. Retrieval Optimization
Implement hybrid search with metadata filtering and re-ranking. Use a cross-encoder re-ranker to boost relevant chunks before passing to the LLM.
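One common way to merge the dense and BM25 result lists before re-ranking is reciprocal rank fusion (RRF). The text does not say which fusion method it has in mind, so the sketch below is an assumption; the helper name fuseRankings and the constant k = 60 are illustrative choices, not part of the original pipeline.

```typescript
// Hypothetical helper: fuse ranked lists of chunk IDs with reciprocal rank fusion.
// Each input list is ordered best-first; k dampens the influence of top ranks.
function fuseRankings(rankings: string[][], k: number = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      // Ranks are 1-based, so the best hit in a list contributes 1 / (k + 1).
      const contribution = 1 / (k + index + 1);
      scores.set(id, (scores.get(id) ?? 0) + contribution);
    });
  }
  // Highest fused score first.
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// Example: "c2" ranks near the top of both lists, so it leads the fused ranking.
const denseHits = ['c1', 'c2', 'c3'];
const bm25Hits = ['c2', 'c4', 'c1'];
const fused = fuseRankings([denseHits, bm25Hits]);
```

The fused list would then go to the cross-encoder re-ranker, which scores each query and chunk pair directly.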
2. Context Grounding via Structured Output
Force the LLM to output structured JSON with explicit citations for every claim. This enables programmatic verification.
3. Verification Layer
Implement an NLI-based verifier. For each claim, check entailment against the cited source chunks.
4. Fallback Router
If verification fails, route to a fallback response or human-in-the-loop rather than returning unverified content.
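The routing decision can be sketched as a small pure function. The Route type, the highStakes flag, and the function name routeResponse are hypothetical names introduced for illustration; how a query gets classified as high-stakes is left open here.

```typescript
type Route = 'answer' | 'fallback' | 'human_review';

interface VerificationOutcome {
  isGrounded: boolean;
  ungroundedClaims: string[];
}

// Hypothetical router: grounded answers pass through; failed verification goes
// to a human reviewer for high-stakes queries and to a canned fallback otherwise.
function routeResponse(outcome: VerificationOutcome, highStakes: boolean): Route {
  if (outcome.isGrounded) {
    return 'answer';
  }
  return highStakes ? 'human_review' : 'fallback';
}
```

Keeping the router free of model calls makes it easy to unit test and to audit.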
Code Implementation (TypeScript)
The following implementation demonstrates a HallucinationGuard pipeline that enforces grounding and handles verification failures.
import { OpenAI } from 'openai';
import { zodResponseFormat } from 'openai/helpers/zod';
import { z } from 'zod';

// Schema for structured generation with citations
const GroundedResponseSchema = z.object({
  answer: z.string().describe('The direct answer to the query.'),
  claims: z.array(z.object({
    text: z.string(),
    source_chunk_id: z.string(),
    confidence: z.number().min(0).max(1)
  })).describe('Individual claims with citations.')
});

type GroundedResponse = z.infer<typeof GroundedResponseSchema>;

interface VerificationResult {
  isGrounded: boolean;
  ungroundedClaims: string[];
  confidence: number; // Fraction of claims the verifier judged as entailed
}

class HallucinationGuard {
  private openai: OpenAI;
  private verifierModel: string;

  constructor(apiKey: string, verifierModel: string = 'gpt-4o-mini') {
    this.openai = new OpenAI({ apiKey });
    this.verifierModel = verifierModel;
  }

  async generateAndVerify(
    query: string,
    contextChunks: { id: string; content: string }[]
  ): Promise<{ response: string; verified: boolean; fallback: boolean }> {
    const fallbackResult = {
      response: 'I cannot verify the answer with the provided information.',
      verified: false,
      fallback: true
    };

    // Step 1: Generate structured response
    const response = await this.openai.chat.completions.create({
      model: 'gpt-4o',
      messages: [
        { role: 'system', content: 'You are a factual assistant. Answer using ONLY the provided context. Output JSON.' },
        { role: 'user', content: `Context: ${JSON.stringify(contextChunks)}\nQuery: ${query}` }
      ],
      // zodResponseFormat converts the Zod schema into the json_schema payload the API expects
      response_format: zodResponseFormat(GroundedResponseSchema, 'grounded_response'),
      temperature: 0.1, // Low temperature reduces variance
      top_p: 0.9
    });

    // Treat empty, malformed, or schema-violating output as a verification failure
    let parsed: GroundedResponse;
    try {
      parsed = GroundedResponseSchema.parse(JSON.parse(response.choices[0].message.content ?? ''));
    } catch {
      return fallbackResult;
    }

    // Step 2: Verify each claim
    const verification = await this.verifyClaims(parsed.claims, contextChunks);

    // Step 3: Route based on verification
    if (verification.isGrounded) {
      return { response: parsed.answer, verified: true, fallback: false };
    }
    return fallbackResult;
  }

  private async verifyClaims(
    claims: { text: string; source_chunk_id: string }[],
    chunks: { id: string; content: string }[]
  ): Promise<VerificationResult> {
    const ungroundedClaims: string[] = [];
    let entailedCount = 0;

    for (const claim of claims) {
      // A citation that points at no retrieved chunk is ungrounded by definition
      const source = chunks.find(c => c.id === claim.source_chunk_id);
      if (!source) {
        ungroundedClaims.push(claim.text);
        continue;
      }

      // NLI Verification: Check if source entails claim
      const verificationPrompt = `
Premise: ${source.content}
Hypothesis: ${claim.text}
Does the premise entail the hypothesis? Return only "YES" or "NO".
`;
      const result = await this.openai.chat.completions.create({
        model: this.verifierModel,
        messages: [{ role: 'user', content: verificationPrompt }],
        temperature: 0,
        max_tokens: 5
      });

      // Accept "YES" with trailing punctuation; anything else counts as not entailed
      const verdict = result.choices[0].message.content?.trim().toUpperCase() ?? '';
      if (verdict.startsWith('YES')) {
        entailedCount += 1;
      } else {
        ungroundedClaims.push(claim.text);
      }
    }

    // An empty claim list yields confidence 0, so answers without citations are rejected
    const avgConfidence = claims.length > 0 ? entailedCount / claims.length : 0;
    return {
      isGrounded: ungroundedClaims.length === 0 && avgConfidence > 0.8,
      ungroundedClaims,
      confidence: avgConfidence
    };
  }
}

export { HallucinationGuard, GroundedResponseSchema };
Architecture Notes:
- Temperature Control: temperature is set to 0.1 for generation. Higher values increase the probability of sampling tokens that diverge from the context.
- JSON Schema Enforcement: Using the json_schema response format ensures the output structure is parseable and prevents the model from outputting free text that breaks verification.
- Verifier Model: A smaller model (gpt-4o-mini) suffices for NLI tasks, reducing cost. For extreme cost sensitivity, a local DeBERTa model can replace the API call.
- Citation Mapping: The source_chunk_id links claims to specific retrieval chunks, enabling precise debugging of retrieval failures.
Pitfall Guide
1. Negative Constraint Reliance
Mistake: Using prompts like "Do not make up information" or "Never hallucinate."
Explanation: LLMs follow negative constraints unreliably. Naming the unwanted behavior ("make up information") puts it in the context and can prime the model toward it instead of suppressing it.
Best Practice: Use positive constraints: "Base your answer strictly on the provided text. If the text does not contain the answer, state that you cannot answer."
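As a minimal sketch of this practice, a system prompt can be assembled from positive instructions plus an explicit abstention path. The function name buildGroundedSystemPrompt, the bracketed chunk-ID format, and the citation instruction are assumptions made for illustration.

```typescript
interface ContextChunk {
  id: string;
  content: string;
}

// Hypothetical prompt builder: states what the model should do, including how
// to abstain, and avoids naming the unwanted behavior.
function buildGroundedSystemPrompt(chunks: ContextChunk[]): string {
  const context = chunks.map(c => `[${c.id}] ${c.content}`).join('\n');
  return [
    'Base your answer strictly on the provided text.',
    'If the text does not contain the answer, state that you cannot answer.',
    'Cite the chunk ID in brackets for each claim.',
    '',
    'Provided text:',
    context
  ].join('\n');
}
```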
2. Retrieval Noise Saturation
Mistake: Retrieving too many chunks or chunks with low relevance scores.
Explanation: Excessive context dilutes the signal. The model may hallucinate by blending information from irrelevant chunks or prioritizing noise over signal.
Best Practice: Limit context to top-5 chunks after re-ranking. Implement a relevance threshold; discard chunks below a similarity score.
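The threshold and top-k cap can be applied in one pass over the re-ranked chunks. This is a sketch: the ScoredChunk shape and the selectContext name are assumptions, and the defaults (0.65 and 5) simply mirror the values used elsewhere in this guide.

```typescript
interface ScoredChunk {
  id: string;
  content: string;
  score: number; // Relevance score from the re-ranker, higher is better
}

// Hypothetical helper: drop chunks below the relevance threshold, then keep
// at most topK of the highest-scoring survivors.
function selectContext(
  chunks: ScoredChunk[],
  minScore: number = 0.65,
  topK: number = 5
): ScoredChunk[] {
  return chunks
    .filter(chunk => chunk.score >= minScore)
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```

If nothing clears the threshold, the function returns an empty array, which is a natural trigger for the fallback response.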
3. Citation Hallucination
Mistake: Assuming the model will cite the correct source if asked.
Explanation: Models often hallucinate citations, referencing a document that exists but does not support the claim, or fabricating a document ID.
Best Practice: Implement programmatic citation verification. Extract the citation ID from the output and verify the claim against that specific chunk in the verification step.
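A cheap first check, before any entailment call, is to confirm that each cited ID was actually retrieved. The sketch below covers only fabricated IDs; a claim that cites a real chunk still needs the entailment check. The names Claim and partitionByCitation are illustrative.

```typescript
interface Claim {
  text: string;
  source_chunk_id: string;
}

// Hypothetical helper: split claims into those citing a retrieved chunk and
// those citing an ID that was never part of the context.
function partitionByCitation(
  claims: Claim[],
  retrievedIds: string[]
): { valid: Claim[]; fabricated: Claim[] } {
  const known = new Set(retrievedIds);
  const valid: Claim[] = [];
  const fabricated: Claim[] = [];
  for (const claim of claims) {
    if (known.has(claim.source_chunk_id)) {
      valid.push(claim);
    } else {
      fabricated.push(claim);
    }
  }
  return { valid, fabricated };
}
```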
4. Temperature Drift in Production
Mistake: Leaving temperature at the API or framework default (commonly 0.7 to 1.0) for factual tasks.
Explanation: Temperature controls randomness. High temperature increases the likelihood of the model selecting tokens that are plausible but unsupported by context.
Best Practice: Set temperature <= 0.2 for all factual, RAG, and classification tasks. Use higher temperatures only for creative generation where hallucination is acceptable.
5. Verification Model Bias
Mistake: Using the same model instance for generation and verification.
Explanation: The model may reinforce its own errors due to shared weights and context bias.
Best Practice: Use a distinct model for verification. A smaller, specialized NLI model or a different architecture (e.g., DeBERTa) provides independent validation.
6. Ignoring "I Don't Know" Fallbacks
Mistake: Forcing the model to answer even when context is insufficient.
Explanation: Pressure to answer drives hallucination.
Best Practice: Configure the system to output a safe fallback message when verification confidence drops below a threshold. Train the model to recognize knowledge boundaries.
7. Context Window Truncation
Mistake: Truncating context arbitrarily without preserving semantic boundaries.
Explanation: Cutting off a chunk mid-sentence can remove critical negation or qualification, leading to misinterpretation and hallucination.
Best Practice: Truncate at sentence or paragraph boundaries. Ensure the most relevant information is retained at the beginning of the context window.
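A minimal sketch of boundary-aware truncation, assuming a character budget and a naive definition of a sentence end (a period, exclamation mark, or question mark followed by whitespace). The name truncateAtSentence is hypothetical, and real text with abbreviations would need a proper sentence splitter.

```typescript
// Hypothetical helper: cut text at the last sentence boundary that fits within
// maxChars, so a negation or qualifier is kept whole or dropped whole.
function truncateAtSentence(text: string, maxChars: number): string {
  if (text.length <= maxChars) {
    return text;
  }
  // Look one character past the budget so a terminator at the very end of the
  // budget can still be recognized by the whitespace that follows it.
  const window = text.slice(0, maxChars + 1);
  let cut = -1;
  for (let i = 0; i < window.length - 1; i++) {
    const isTerminator = '.!?'.indexOf(window[i]) !== -1;
    const followedBySpace = /\s/.test(window[i + 1]);
    if (isTerminator && followedBySpace) {
      cut = i + 1;
    }
  }
  // If no full sentence fits, return nothing instead of a fragment.
  return cut === -1 ? '' : text.slice(0, cut);
}
```

With the input "The drug is safe. It is not approved for children. See label." and a budget of 30 characters, the second sentence is dropped entirely instead of being cut after "It is not".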
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Customer Support | RAG + Citation Verification | High accuracy required; citations enable agent review. | Medium |
| Medical/Legal Advice | RAG + Dedicated Verifier + Human-in-Loop | Zero tolerance for error; verification + human review ensures compliance. | High |
| Internal Knowledge Base | RAG + Self-Correction | Cost-sensitive; acceptable error margin for internal use. | Low |
| Creative Content | Zero-Shot + Constraints | Hallucination is a feature for creativity; verification adds unnecessary latency. | Low |
| Real-time Chatbot | RAG + Lightweight Verifier | Latency constraints require efficient verification; fallback ensures responsiveness. | Medium |
Configuration Template
# hallucination-mitigation-config.yaml
pipeline:
  retrieval:
    strategy: hybrid
    top_k: 5
    reranker:
      model: "cross-encoder-ms-marco"
      threshold: 0.65
    chunking:
      size: 512
      overlap: 50
  generation:
    model: "gpt-4o"
    temperature: 0.1
    top_p: 0.9
    structured_output: true
    schema: "grounded_response_schema.json"
  verification:
    enabled: true
    model: "gpt-4o-mini" # Or local "deberta-v3-nli"
    method: "claim_level_nli"
    confidence_threshold: 0.8
    max_ungrounded_claims: 0
  fallback:
    enabled: true
    message: "I cannot verify this information with the available data. Please consult a specialist."
    trigger: "verification_failed"
  monitoring:
    metrics:
      - "hallucination_rate"
      - "citation_accuracy"
      - "verification_latency_ms"
    alerting:
      threshold: 0.05 # Alert if hallucination rate > 5%
Quick Start Guide
- Initialize Pipeline: Copy the HallucinationGuard class and configuration template into your project. Install dependencies (openai, zod).
- Configure Retrieval: Connect your vector database and implement the hybrid search function. Ensure chunks include unique IDs.
- Set Verification Model: Choose a verifier. For production, start with gpt-4o-mini. For cost optimization, deploy a local NLI model.
- Run Verification Test: Execute the generateAndVerify method with a test query and mock context. Validate that ungrounded claims trigger the fallback.
- Deploy and Monitor: Integrate into your API. Enable logging for verification results and set up alerts for hallucination rate spikes.
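For the monitoring step, the logged verification results can be rolled up into a failure rate and compared against an alert threshold. This sketch treats the share of responses that failed verification as a proxy for hallucination rate, which is an assumption; the VerificationLog shape and the summarizeVerification name are also illustrative.

```typescript
interface VerificationLog {
  verified: boolean; // The verified flag returned for one response
  latencyMs: number; // Time spent in the verification step
}

interface MonitoringSummary {
  failureRate: number;
  avgLatencyMs: number;
  shouldAlert: boolean;
}

// Hypothetical aggregator: computes the verification failure rate and average
// latency over a batch of logs, and flags the batch when the rate exceeds the threshold.
function summarizeVerification(
  logs: VerificationLog[],
  alertThreshold: number = 0.05
): MonitoringSummary {
  if (logs.length === 0) {
    return { failureRate: 0, avgLatencyMs: 0, shouldAlert: false };
  }
  const failures = logs.filter(log => !log.verified).length;
  const totalLatency = logs.reduce((sum, log) => sum + log.latencyMs, 0);
  const failureRate = failures / logs.length;
  return {
    failureRate,
    avgLatencyMs: totalLatency / logs.length,
    shouldAlert: failureRate > alertThreshold
  };
}
```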
This architecture reduces LLM hallucination through grounding and claim-level verification while balancing accuracy, latency, and cost for enterprise applications.