on Rate (%) |
|----------|-------------------|------------------|--------------------|---------------------|
| Frontier-Only | $8.20 | 250 | 4.1 | N/A |
| Routing-Only | $2.90 | 160 | 3.8 | N/A |
| Cascading-Only | $4.10 | 320 | 4.0 | 12.0 |
| Hybrid (Routing + Cascading + Cache) | $2.44 | 180 | 4.2 | 2.8 |
Key Findings:
- The hybrid approach achieves a 70% cost reduction ($8.20 β $2.44) while slightly improving quality scores.
- P95 latency drops by 28% due to semantic cache hits (30β40% hit rate) and optimized classifier routing (~5ms).
- An escalation rate of 2.8% indicates optimal calibration; rates above 5% signal classifier drift or insufficient domain training data.
Core Solution
The production-ready architecture operates on three distinct layers, each handling a specific decision point before LLM invocation:
Request
|
[Semantic Cache] -- hit --> Response (zero cost)
| miss
[Intent Classifier] (0.5B model, ~5ms)
|
|-- Simple --> DeepSeek V4-Pro ($0.435/1M)
|-- Medium --> GPT-4o-mini ($1.50/1M)
|-- Critical --> GPT-5.5 / Opus ($15-26/1M)
^
[Confidence Gate]
confidence < 0.70: escalate
Layer 1: Semantic Cache
Checks query embeddings against historical responses before any classification. For B2C or repetitive B2B workloads, a 30β40% hit rate is realistic, reducing marginal cost to zero.
Layer 2: Intent Classifier
A lightweight model (0.5B parameters) trained on actual workload distributions, not generic benchmarks. Deployed locally via vLLM, it adds <5ms latency and ~$0.20/hour GPU cost.
Layer 3: Confidence Gate
Each response returns a confidence score. Below 0.70 triggers automatic escalation; above 0.85 is trusted. High-stakes domains (finance, legal) bypass the gate and route directly to frontier.
Routing vs. Cascading Implementation
Routing is an upfront decision for structured workloads:
query = "Extract the cost values from document X"
tier = classifier.predict(query) # returns "simple"
response = router.call(tier, query) # DeepSeek, $0.435/1M
Cascading handles unpredictable workloads with confidence-based fallback:
response = deepseek.call(query)
if response.confidence < 0.70:
response = sonnet.call(query)
# Total cost: $0.435 + $5.00 = $5.435 vs. $26 going straight to Opus
Tooling & Rollout
LiteLLM manages multi-tier routing and fallback:
pip install litellm
from litellm import Router
router = Router(model_list=[
{"model_name": "tier-simple", "litellm_params": {"model": "deepseek/deepseek-v4-pro"}},
{"model_name": "tier-medium", "litellm_params": {"model": "gpt-4o-mini"}},
{"model_name": "tier-frontier", "litellm_params": {"model": "claude-opus-4"}},
])
RouteLLM provides calibration matrices trained on query history, routing 85% of traffic to cheap tiers while preserving 95% of frontier quality.
vLLM enables sub-5ms local classification:
pip install vllm
vllm serve Qwen/Qwen2.5-0.5B-Instruct --dtype auto
Four-Week Rollout:
Week 1: LiteLLM with 3 tiers + structured logging
Week 2: Confidence gate + domain overrides (finance and legal to frontier)
Week 3: Empirical threshold calibration via A/B test
Week 4: Monitor cost per task, escalation rate, quality score
Target at week 4: cost per task down at least 40%. If not, the classifier needs more domain-specific training data.
Pitfall Guide
- No Observability on Routing Decisions: Failing to log classifier scores, selected tiers, and final confidence per query causes silent calibration drift. Without telemetry, degradation goes undetected until quality or cost metrics spike.
- Single Provider Dependency: Relying exclusively on one vendor for the cheap tier creates systemic risk. If the provider experiences downtime, your cost-optimized routing collapses. Always configure same-tier fallbacks across multiple providers.
- Tail Miscalibration: Overall accuracy metrics (e.g., 94%) mask failure in the 6% tail. These are typically rare, high-stakes queries with minimal training data. Oversample tail cases during validation to prevent catastrophic misrouting.
- Cascade Latency Stacking: Sequential model calls compound latency. Three calls at 100ms each equal 300ms, which can degrade conversion rates or violate SLAs. In latency-sensitive flows, direct frontier invocation may be cheaper than the UX penalty.
- Thresholds Set by Intuition: Confidence gates (e.g., 0.70) must be empirically calibrated. Run A/B tests comparing thresholds (0.65 vs. 0.75) over a full week, measuring escalation rate, average quality, and cost per task. Optimal thresholds are workload-specific.
Deliverables
- Blueprint: LLM Routing Architecture Blueprint detailing the 3-layer design (Semantic Cache β Intent Classifier β Confidence Gate), provider fallback matrices, and domain override rules for finance/legal workloads.
- Checklist: Implementation & Calibration Checklist covering the 4-week rollout phases, A/B testing parameters for threshold validation, observability logging requirements, and escalation rate monitoring targets (<5%).
- Configuration Templates: Ready-to-deploy LiteLLM router configurations, vLLM local classifier serve commands, and RouteLLM calibration matrix setup scripts for immediate production integration.