dural guides, error code references, policy statements, and FAQ pairs. Flat chunking destroys semantic boundaries. Use semantic chunking aligned to headings, sections, and logical breakpoints, with 10–15% overlap to preserve cross-reference context.
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';

export async function chunkSupportDocs(rawDocs: string[]): Promise<string[]> {
  // Separators are tried in order, so splits land on headings first,
  // then list items and paragraphs, and fall back to single words last.
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 800,
    chunkOverlap: 120, // 15% of chunkSize
    separators: ['\n## ', '\n### ', '\n- ', '\n\n', '\n', ' '],
  });
  const chunks: string[] = [];
  for (const doc of rawDocs) {
    const split = await splitter.splitText(doc);
    chunks.push(...split);
  }
  return chunks;
}
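To make "aligned to headings, with overlap" concrete, here is a minimal dependency-free sketch of the same idea. It is not how the LangChain splitter works internally; `chunkByHeading` and the `Chunk` shape are hypothetical names, sizes are counted in characters, and only `## ` and `### ` headings are treated as section boundaries.

```typescript
// Hypothetical helper: split a markdown document at "## " / "### " headings,
// then window any oversized section with a fixed character overlap.
interface Chunk {
  heading: string; // nearest heading above the chunk ("" if none)
  text: string;
}

function chunkByHeading(doc: string, chunkSize = 800, overlap = 120): Chunk[] {
  if (overlap < 0 || overlap >= chunkSize) {
    throw new Error('overlap must be >= 0 and smaller than chunkSize');
  }
  const chunks: Chunk[] = [];
  let heading = '';
  let body: string[] = [];

  const flush = (): void => {
    const text = body.join('\n').trim();
    body = [];
    if (text.length === 0) return;
    // Each window starts (chunkSize - overlap) characters after the previous one,
    // so consecutive chunks share `overlap` characters.
    const step = chunkSize - overlap;
    for (let start = 0; start < text.length; start += step) {
      chunks.push({ heading, text: text.slice(start, start + chunkSize) });
      if (start + chunkSize >= text.length) break;
    }
  };

  for (const line of doc.split('\n')) {
    if (/^#{2,3} /.test(line)) {
      flush(); // close the previous section before starting a new one
      heading = line.replace(/^#{2,3} /, '').trim();
    } else {
      body.push(line);
    }
  }
  flush();
  return chunks;
}
```

Keeping the heading next to each chunk is a design choice: it can be stored as metadata and surfaced later as a citation label.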
Step 2: Embedding & Hybrid Retrieval
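Support queries mix semantic intent with exact identifiers (order IDs, error codes, SKUs). Dense embeddings alone miss exact matches. Implement hybrid search: dense vectors for intent and a lexical ranker such as BM25 for exact terms, then rerank the merged candidates with a cross-encoder. The query below uses Postgres full-text ranking in place of BM25 and leaves reranking as a separate step.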
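import { db } from './db'; // assumed: a node-postgres Pool for a database with the pgvector extension
import { embed } from './embedding-client'; // OpenAI / Cohere wrapper

export async function retrieveContext(query: string, topK: number = 5) {
  const queryEmbedding = await embed(query);
  // Dense pgvector search blended with Postgres full-text ranking (a stand-in for BM25).
  // `<=>` returns cosine distance, so `1 - distance` turns it into a similarity
  // where higher is better, matching the direction of the lexical score.
  const results = await db.query(`
    SELECT content, metadata,
           0.7 * dense_score + 0.3 * lexical_score AS score
    FROM (
      SELECT content, metadata,
             1 - (embedding <=> $1::vector) AS dense_score,
             ts_rank_cd(to_tsvector('english', content), plainto_tsquery('english', $2)) AS lexical_score
      FROM support_knowledge
    ) AS scored
    ORDER BY score DESC
    LIMIT $3
  `, [JSON.stringify(queryEmbedding), query, topK]);
  return results.rows.map((r: any) => ({
    content: r.content,
    // jsonb columns arrive already parsed; text columns still need JSON.parse.
    metadata: typeof r.metadata === 'string' ? JSON.parse(r.metadata) : r.metadata,
    score: Number(r.score),
  }));
}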
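Dense and lexical scores usually live on different scales, so a fixed 0.7/0.3 blend can be dominated by whichever score has the larger range. One common remedy, shown here as a sketch and not as the method the query above uses, is to min-max normalize each score list before taking the weighted sum. `Candidate`, `minMax`, and `fuseScores` are hypothetical names.

```typescript
interface Candidate {
  id: string;
  dense: number;   // e.g. cosine similarity, higher is better
  lexical: number; // e.g. full-text rank, higher is better
}

// Rescale values to [0, 1]; a list with no spread maps to all zeros.
function minMax(values: number[]): number[] {
  const lo = Math.min(...values);
  const hi = Math.max(...values);
  return values.map(v => (hi === lo ? 0 : (v - lo) / (hi - lo)));
}

// Weighted sum of the normalized scores, sorted best-first.
function fuseScores(cands: Candidate[], wDense = 0.7, wLexical = 0.3) {
  const dense = minMax(cands.map(c => c.dense));
  const lexical = minMax(cands.map(c => c.lexical));
  return cands
    .map((c, i) => ({ id: c.id, score: wDense * dense[i] + wLexical * lexical[i] }))
    .sort((a, b) => b.score - a.score);
}
```

Because normalization is relative to the candidate set, the fused scores are only comparable within a single query, which is sufficient for ranking but not for a global relevance threshold.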
Step 3: Intent Routing & State Management
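A general-purpose LLM should not make routing decisions on its own. A lightweight classifier (fastText, a small transformer, or a distilled model) determines whether the query requires RAG, tool execution (order lookup, refund initiation), or human handoff. The example below stands in for that classifier with a small completion model held to temperature 0 and a strict schema; anything it cannot classify confidently goes to a human. Maintain session state to preserve context across turns.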
import { z } from 'zod';
import { llm } from './llm-client'; // assumed export: client for a small, fast completion model

const IntentSchema = z.object({
  type: z.enum(['knowledge', 'tool', 'escalate']),
  confidence: z.number().min(0).max(1),
  tool_name: z.string().optional(),
  parameters: z.record(z.unknown()).optional(),
});

export async function routeIntent(query: string, sessionHistory: string[]) {
  const prompt = `Classify the support query. Respond only with JSON matching the schema.
History: ${sessionHistory.slice(-4).join('\n')}
Query: ${query}`;
  const raw = await llm.complete(prompt, { max_tokens: 150, temperature: 0 });

  // Malformed JSON or a schema mismatch is a routing failure, so hand off instead of throwing.
  let parsed: z.infer<typeof IntentSchema>;
  try {
    parsed = IntentSchema.parse(JSON.parse(raw));
  } catch {
    return { action: 'handoff', reason: 'Unparseable classifier output' };
  }
  if (parsed.confidence < 0.75 || parsed.type === 'escalate') {
    return { action: 'handoff', reason: 'Low confidence or explicit escalation' };
  }
  return parsed;
}
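The router takes a `sessionHistory` array but the text does not say where it comes from. A minimal in-memory sketch of session state follows, assuming one process and no persistence; `SessionWindow`, `Turn`, and the default of six turns are illustrative choices, and a production system would more likely back this with a shared store.

```typescript
interface Turn {
  role: 'user' | 'assistant';
  text: string;
}

// Keeps only the most recent turns per session so prompts stay bounded.
class SessionWindow {
  private sessions: Record<string, Turn[]> = {};
  private maxTurns: number;

  constructor(maxTurns = 6) {
    this.maxTurns = maxTurns;
  }

  append(sessionId: string, turn: Turn): void {
    const turns = this.sessions[sessionId] ?? [];
    turns.push(turn);
    // Drop the oldest turns once the window is full.
    this.sessions[sessionId] = turns.slice(-this.maxTurns);
  }

  // Formats the window as the string[] that a router like routeIntent expects.
  history(sessionId: string): string[] {
    return (this.sessions[sessionId] ?? []).map(t => `${t.role}: ${t.text}`);
  }
}
```

A caller would append the user turn, pass `history(sessionId)` to the router, and append the assistant turn once a response is delivered.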
Step 4: RAG Assembly & Guardrails
Ground the prompt with retrieved context. Enforce output structure with Zod. Inject citations. Block policy violations before the response is delivered.
import { z } from 'zod';
import { generate } from './llm-client';
import { validateResponse } from './guardrails';

// Shape the model output must satisfy; validateResponse is assumed to enforce it.
export const SupportPrompt = z.object({
  answer: z.string().min(10).max(500),
  citations: z.array(z.string().url()),
  confidence: z.number(),
  next_steps: z.array(z.string()).optional(),
});

export async function generateSupportResponse(query: string, context: { content: string }[]) {
  const system = `You are a support agent. Answer using ONLY the provided context.
If information is missing, state what is unknown and suggest escalation.
Output must match the required schema.`;
  const user = `Context:\n${context.map(c => `- ${c.content}`).join('\n')}\n\nQuery: ${query}`;
  // Generate the full response first so it can be validated before anything reaches the user.
  const raw = await generate(system, user, { stream: false });
  const validated = await validateResponse(raw); // Zod + policy filter
  if (!validated.success) {
    return { error: 'Guardrail triggered', fallback: 'human_handoff' };
  }
  return validated.data;
}
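The `validateResponse` helper is imported from `./guardrails` and never shown. As a rough idea of what such a check might involve, here is a dependency-free sketch that combines a shape check, a citation check against the retrieved sources, and a blocked-phrase scan. The name `checkResponse`, its parameters, and the reason codes are all assumptions, and it uses plain type checks where the article uses Zod.

```typescript
interface SupportAnswer {
  answer: string;
  citations: string[];
  confidence: number;
}

type GuardrailResult =
  | { success: true; data: SupportAnswer }
  | { success: false; reason: string };

function checkResponse(
  raw: string,
  retrievedSources: string[], // citations must come from this list
  blockedPhrases: string[],   // phrases the answer must not contain
): GuardrailResult {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { success: false, reason: 'invalid_json' };
  }
  if (typeof parsed !== 'object' || parsed === null) {
    return { success: false, reason: 'schema_mismatch' };
  }
  const p = parsed as Record<string, unknown>;
  const answer = p.answer;
  const citations = p.citations;
  const confidence = p.confidence;
  if (
    typeof answer !== 'string' ||
    typeof confidence !== 'number' ||
    !Array.isArray(citations) ||
    !citations.every(c => typeof c === 'string')
  ) {
    return { success: false, reason: 'schema_mismatch' };
  }
  const cited = citations as string[];
  if (cited.length === 0) {
    return { success: false, reason: 'missing_citation' };
  }
  // Every citation must point at a passage that was actually retrieved.
  if (cited.some(c => retrievedSources.indexOf(c) === -1)) {
    return { success: false, reason: 'unknown_citation' };
  }
  const lower = answer.toLowerCase();
  if (blockedPhrases.some(ph => lower.indexOf(ph.toLowerCase()) !== -1)) {
    return { success: false, reason: 'policy_violation' };
  }
  return { success: true, data: { answer, citations: cited, confidence } };
}
```

Returning a reason code alongside the failure makes it possible to log why a handoff happened, which is useful input for the evaluation loop.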
Architecture Decisions & Rationale
- RAG over fine-tuning: Support knowledge changes weekly. Fine-tuning creates stale policy risk and requires retraining cycles. RAG decouples knowledge from reasoning, enabling real-time updates without model redeployment.
- Hybrid retrieval: Support queries contain exact matches (error codes, SKUs) and semantic intent. Dense-only retrieval misses precision; lexical-only misses nuance. Hybrid with reranking optimizes both.
- Separate routing layer: LLM output is probabilistic, but routing needs to be predictable. A dedicated classifier reduces latency, cuts token cost, and keeps the generation model from attempting tool execution or policy decisions it isn't designed to make.
- Guardrails as pre-delivery filters: Validation must occur before the response is delivered. Zod schema enforcement, citation requirements, and policy keyword scanning keep hallucinations and compliance violations from reaching the user.
- Streaming + progressive disclosure: Users expect the first token in under 2s. Stream tokens, render citations progressively, and trigger tool calls asynchronously to keep perceived latency under 1.5s.
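The latency target in the last bullet implies some way to detect a slow first token. A minimal sketch, assuming the model client exposes its stream as an `AsyncIterator<string>` (an assumption; the article does not show the streaming API), races the first token against a timer. `firstTokenOrTimeout` is a hypothetical name.

```typescript
// Resolves with the first token of the stream, or with null if the stream
// ends or the timeout fires before any token arrives.
function firstTokenOrTimeout(
  tokens: AsyncIterator<string>,
  timeoutMs: number,
): Promise<string | null> {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => resolve(null), timeoutMs);
    tokens.next().then(
      result => {
        clearTimeout(timer);
        resolve(result.done ? null : result.value);
      },
      err => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}
```

A `null` result would be the cue to show a fallback message or hand off. Note that this sketch does not cancel the underlying request; a real client would also need an abort mechanism.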
Pitfall Guide
- Embedding the entire knowledge base into context: Token bloat degrades attention quality, spikes latency, and increases cost. Support docs contain redundant sections. Fix: Semantic chunking, top-k filtering, and cross-encoder reranking. Cap context to the 3–5 most relevant passages.
- Zero-shot LLM as the sole support agent: LLMs will invent policies, overpromise resolutions, or violate compliance boundaries. Fix: Strict system prompts, output schema enforcement, citation requirements, and hard fallback triggers when confidence drops below threshold.
- Ignoring latency budgets: Support SLAs require first token under 1.5s. Complex retrieval chains and heavy models break this. Fix: Cache frequent queries, use distilled routing models, stream generation, and pre-warm vector indexes. Implement timeout fallbacks at 2s.
- No evaluation loop: Without measurement, improvement is guesswork. Fix: Maintain a golden test set of 200–500 representative queries. Score outputs on accuracy, policy compliance, and tone. Run LLM-as-judge evaluations weekly. Track drift in retrieval relevance and hallucination rates.
- Hard escalation thresholds: Static confidence cutoffs cause premature handoffs or delayed human intervention. Fix: Dynamic routing using multi-signal scoring (intent confidence, retrieval relevance, sentiment, session length). Escalate when any signal crosses threshold, not just confidence.
- Ignoring multi-turn state: Support is conversational. Stateless prompts lose context, repeat questions, and frustrate users. Fix: Maintain session windows (last 4–6 turns), implement conversation summarization for long threads, and use a state machine to track resolution progress.
- Treating AI as a cost center only: Missed opportunity for operational intelligence. Fix: Log intent distribution, resolution paths, knowledge gaps, and escalation reasons. Feed gaps back into knowledge ingestion. Use analytics to prioritize documentation updates and tool development.
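The multi-signal escalation described in the list above can be sketched as a function that checks each signal against its own threshold and reports every one that trips. The signal names follow the list; the interfaces, the reason codes, and any threshold values other than the 0.75 confidence cutoff used earlier are assumptions for illustration.

```typescript
interface EscalationSignals {
  intentConfidence: number;   // 0..1, from the router
  retrievalRelevance: number; // 0..1, score of the best retrieved passage
  sentiment: number;          // -1 (negative) .. 1 (positive)
  turnCount: number;          // turns so far in the session
}

interface EscalationThresholds {
  minIntentConfidence: number;
  minRetrievalRelevance: number;
  minSentiment: number;
  maxTurns: number;
}

// Returns the reasons to escalate; an empty array means the bot keeps the conversation.
function escalationReasons(s: EscalationSignals, t: EscalationThresholds): string[] {
  const reasons: string[] = [];
  if (s.intentConfidence < t.minIntentConfidence) reasons.push('low_confidence');
  if (s.retrievalRelevance < t.minRetrievalRelevance) reasons.push('low_retrieval_relevance');
  if (s.sentiment < t.minSentiment) reasons.push('negative_sentiment');
  if (s.turnCount > t.maxTurns) reasons.push('long_session');
  return reasons;
}
```

Returning the full list instead of a boolean keeps the escalation reason available for logging and for routing the handoff to the right queue.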
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High-volume SaaS (10k+ tickets/mo) | RAG + hybrid search + streaming LLM | Maximizes deflection while maintaining sub-2s latency | -40% ticket cost, +15% infra |
| Regulated/FinTech | RAG + strict guardrails + deterministic routing | Compliance requires policy enforcement and audit trails | +20% validation cost, -60% risk exposure |
| Multilingual global support | Base LLM + language-specific retrieval + translation layer | Embeddings and policies vary by locale; translation preserves accuracy | +25% embedding cost, +30% resolution consistency |
| Low-volume internal support | Lightweight classifier + cached FAQ + manual escalation | Low volume doesn't justify complex RAG; simplicity reduces maintenance | -70% infra cost, +10% handle time |
Configuration Template
support_ai:
  retrieval:
    chunk_size: 800
    chunk_overlap: 120
    top_k: 5
    hybrid_weights:
      dense: 0.7
      lexical: 0.3
    reranker_model: cross-encoder/ms-marco-MiniLM-L-6-v2
  routing:
    intent_model: fasttext-support-intent
    confidence_threshold: 0.75
    escalation_signals: [low_confidence, policy_violation, sentiment_negative]
  generation:
    model: gpt-4o-mini
    max_tokens: 500
    temperature: 0.1
    stream: true
    first_token_timeout_ms: 1500
  guardrails:
    schema_enforcement: true
    citation_required: true
    policy_keywords: [refund_policy, sla_terms, compliance]
    fallback: human_handoff
  evaluation:
    golden_set_size: 300
    scoring_metrics: [accuracy, policy_compliance, tone, latency]
    review_frequency: weekly
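A template like this is easy to mistype, so it can help to sanity-check the retrieval section at startup. The sketch below assumes the YAML has already been parsed into an object; `RetrievalConfig` and `retrievalConfigProblems` are hypothetical names, and the rule that the hybrid weights sum to 1 is an assumption inferred from the 0.7/0.3 values, not something the template states.

```typescript
interface RetrievalConfig {
  chunk_size: number;
  chunk_overlap: number;
  top_k: number;
  hybrid_weights: { dense: number; lexical: number };
}

// Returns a list of human-readable problems; an empty list means the section looks sane.
function retrievalConfigProblems(c: RetrievalConfig): string[] {
  const problems: string[] = [];
  if (c.chunk_size <= 0) {
    problems.push('chunk_size must be positive');
  }
  if (c.chunk_overlap < 0 || c.chunk_overlap >= c.chunk_size) {
    problems.push('chunk_overlap must be >= 0 and smaller than chunk_size');
  }
  if (c.top_k < 1 || Math.floor(c.top_k) !== c.top_k) {
    problems.push('top_k must be a positive integer');
  }
  const { dense, lexical } = c.hybrid_weights;
  // Assumed convention: weights are non-negative and sum to 1.
  if (dense < 0 || lexical < 0 || Math.abs(dense + lexical - 1) > 1e-9) {
    problems.push('hybrid_weights must be non-negative and sum to 1');
  }
  return problems;
}
```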
Quick Start Guide
- Initialize environment: Set OPENAI_API_KEY, DB_URL, and RERANKER_ENDPOINT. Install dependencies: npm i @libsql/client zod langchain.
- Run ingestion: Execute chunkSupportDocs() on your KB, generate embeddings, and populate the support_knowledge table with vector and metadata columns.
- Deploy pipeline: Start the intent router, connect retrieval to generation, and enable Zod validation. Configure streaming endpoint with 2s timeout.
- Validate: Run 50 golden queries. Verify first-contact resolution (FCR) >70%, hallucination rate <3%, and first token <1.5s. Adjust hybrid weights and confidence thresholds based on results.
- Monitor: Enable logging for intent distribution, retrieval scores, and escalation triggers. Schedule weekly evaluation runs to track drift and knowledge gaps.
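The validation targets in the steps above can be checked mechanically once each golden query has been labeled. The following is a minimal sketch: `EvalRecord`, `summarizeRun`, and `meetsTargets` are hypothetical names, the labels are assumed to come from human review or an LLM judge, and using the 95th percentile for first-token latency is an assumption, since the guide does not say which statistic the 1.5s target applies to.

```typescript
interface EvalRecord {
  resolved: boolean;     // query answered without human help
  hallucinated: boolean; // answer contained unsupported claims
  firstTokenMs: number;  // time to first streamed token
}

interface EvalSummary {
  fcr: number;
  hallucinationRate: number;
  p95FirstTokenMs: number;
}

function summarizeRun(records: EvalRecord[]): EvalSummary {
  if (records.length === 0) throw new Error('golden set is empty');
  const n = records.length;
  const latencies = records.map(r => r.firstTokenMs).sort((a, b) => a - b);
  // Nearest-rank percentile: smallest value with at least 95% of samples at or below it.
  const p95Index = Math.ceil(0.95 * n) - 1;
  return {
    fcr: records.filter(r => r.resolved).length / n,
    hallucinationRate: records.filter(r => r.hallucinated).length / n,
    p95FirstTokenMs: latencies[p95Index],
  };
}

// Targets from the Validate step: FCR > 70%, hallucination < 3%, first token < 1.5s.
function meetsTargets(s: EvalSummary): boolean {
  return s.fcr > 0.7 && s.hallucinationRate < 0.03 && s.p95FirstTokenMs < 1500;
}
```

Running this after every weekly evaluation gives a single pass/fail signal, while the summary fields show which target regressed.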