ing integration.
Step 1: Model Preparation and Quantization
Select model weights compatible with production runtimes. Convert to GGUF or use safetensors with quantization only when quality requirements permit. For reasoning-heavy workloads, keep FP16/BF16. For chat/summarization, INT8 or FP8 reduces weight memory by roughly 40–60% relative to FP16/BF16, usually with minimal quality degradation.
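# Example: convert to FP8 with a conversion script built on transformers
# (convert_to_fp8.py is a placeholder name for your own script, not a bundled tool)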
python convert_to_fp8.py --model meta-llama/Llama-3.1-8B --output-dir ./fp8_weights
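As a rough sanity check on the memory figures above, the weight footprint can be estimated as parameter count times bytes per parameter. The sketch below is illustrative only: `weightMemoryGiB` is a hypothetical helper, the 8-billion parameter count is an assumed round figure, and the estimate covers weights alone (no KV cache or activations).

```typescript
// Bytes used to store one parameter at each precision.
const BYTES_PER_PARAM = { fp16: 2, bf16: 2, fp8: 1, int8: 1 } as const;
type Precision = keyof typeof BYTES_PER_PARAM;

// Estimate weight memory in GiB (weights only; excludes KV cache and activations).
function weightMemoryGiB(paramCount: number, precision: Precision): number {
  return (paramCount * BYTES_PER_PARAM[precision]) / 1024 ** 3;
}

const params = 8e9; // assumed round figure for an 8B-parameter model
const fp16 = weightMemoryGiB(params, 'fp16');
const fp8 = weightMemoryGiB(params, 'fp8');
const savings = 1 - fp8 / fp16; // fraction of weight memory saved by FP8

console.log(
  `FP16: ${fp16.toFixed(1)} GiB, FP8: ${fp8.toFixed(1)} GiB, saved: ${(savings * 100).toFixed(0)}%`
);
```

Halving the bytes per parameter halves the weight footprint; the savings seen in practice depend on which layers stay in higher precision and on how much memory the KV cache takes.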
Step 2: Inference Server Configuration
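Deploy vLLM, which uses PagedAttention and continuous batching by default. PagedAttention stores the KV cache in fixed-size blocks, which largely removes fragmentation, and continuous batching lets new requests join a running batch as earlier ones finish.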
vllm serve meta-llama/Llama-3.1-8B \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.9 \
--max-num-seqs 256 \
--enable-chunked-prefill \
--disable-log-requests
Key parameters:
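--gpu-memory-utilization 0.9: Lets vLLM use up to 90% of GPU memory for weights, activations, and the preallocated KV cache, leaving 10% headroom for framework overhead.
--max-num-seqs 256: Caps the number of sequences processed concurrently. Requests beyond the cap are not dropped; they wait in the scheduler queue, which shows up as higher latency.
--enable-chunked-prefill: Splits the prefill of long prompts into chunks that are batched alongside decode steps, so one long prompt does not stall token generation for other requests.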
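To see how the memory fraction and the sequence cap interact, it can help to estimate how many tokens of KV cache fit once the weights are loaded. The following is a back-of-the-envelope sketch, not vLLM's actual allocator: the helper names are hypothetical, the model shape (32 layers, 8 KV heads, head dimension 128) is an assumed configuration for an 8B Llama-style model, and activation memory is ignored.

```typescript
interface ModelShape {
  layers: number;        // transformer layers
  kvHeads: number;       // key/value heads (fewer than attention heads under GQA)
  headDim: number;       // dimension per head
  bytesPerValue: number; // 2 for FP16/BF16 cache entries
}

// KV cache bytes for one token: one key and one value vector per layer per KV head.
function kvBytesPerToken(m: ModelShape): number {
  return 2 * m.layers * m.kvHeads * m.headDim * m.bytesPerValue;
}

// Tokens that fit in the cache after weights are loaded (ignores activation memory).
function kvTokenCapacity(
  gpuGiB: number,
  utilization: number,
  weightsGiB: number,
  m: ModelShape
): number {
  const budgetBytes = (gpuGiB * utilization - weightsGiB) * 1024 ** 3;
  return Math.max(0, Math.floor(budgetBytes / kvBytesPerToken(m)));
}

// Assumed shape for an 8B Llama-style model with grouped-query attention.
const shape: ModelShape = { layers: 32, kvHeads: 8, headDim: 128, bytesPerValue: 2 };

// Hypothetical 80 GiB GPU holding roughly 15 GiB of weights.
const capacity = kvTokenCapacity(80, 0.9, 15, shape);
console.log(`${kvBytesPerToken(shape) / 1024} KiB of KV cache per token`);
console.log(`about ${capacity} cached tokens in total`);
console.log(`about ${Math.floor(capacity / 256)} tokens per sequence at --max-num-seqs 256`);
```

If the per-sequence figure is smaller than the typical prompt plus completion length, the sequence cap is higher than the cache can sustain and requests will be queued or preempted under load.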
Step 3: Containerization and Orchestration
Package the inference server and deploy via Kubernetes. Use custom metrics for autoscaling instead of default CPU/memory targets.
# vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-inference
  template:
    metadata:
      labels:
        app: vllm-inference
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args: ["serve", "meta-llama/Llama-3.1-8B", "--gpu-memory-utilization", "0.9"]
          resources:
            limits:
              nvidia.com/gpu: 1
            requests:
              nvidia.com/gpu: 1
          ports:
            - containerPort: 8000
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-inference
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: gpu_utilization
        target:
          type: AverageValue
          averageValue: "75"
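The replica count an HPA settles on follows the formula in the Kubernetes documentation: desired = ceil(current replicas × current metric / target metric), with no change while the ratio stays within a tolerance band. The sketch below reproduces that arithmetic for the 75% target above. The function name is hypothetical, the 0.1 tolerance is the documented default, and stabilization windows are deliberately left out.

```typescript
// Replica count an HPA would request for a per-pod average metric,
// clamped to [minReplicas, maxReplicas]. Stabilization windows are not modeled.
function desiredReplicas(
  currentReplicas: number,
  currentMetric: number, // average per-pod value, e.g. GPU utilization in percent
  targetMetric: number,
  minReplicas: number,
  maxReplicas: number,
  tolerance: number = 0.1 // ratio band in which no scaling happens
): number {
  const ratio = currentMetric / targetMetric;
  if (Math.abs(ratio - 1) <= tolerance) return currentReplicas;
  const desired = Math.ceil(currentReplicas * ratio);
  return Math.min(maxReplicas, Math.max(minReplicas, desired));
}

// 95% average utilization against a 75% target scales out.
console.log(desiredReplicas(2, 95, 75, 2, 10));
// 30% average utilization scales in, but never below minReplicas.
console.log(desiredReplicas(4, 30, 75, 2, 10));
```

Because the result is proportional to the ratio, a target set close to saturation leaves little room to absorb load while new pods are still pulling images and loading weights.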
Step 4: TypeScript Streaming Client
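Implement a streaming client that reads from the response only as fast as the consumer iterates (pull-based backpressure) and caps generation with max_tokens. Retry logic belongs in a wrapper around this call and is not part of the function itself.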
// Assumes eventsource-parser v1.x, where createParser takes a callback.
// (v2 and later take an options object instead: createParser({ onEvent }).)
import { createParser } from 'eventsource-parser';

export async function streamLLMRequest(
  prompt: string,
  maxTokens: number = 512,
  baseUrl: string = 'http://vllm-inference:8000/v1'
): Promise<AsyncIterable<string>> {
  const response = await fetch(`${baseUrl}/chat/completions`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'meta-llama/Llama-3.1-8B',
      messages: [{ role: 'user', content: prompt }],
      max_tokens: maxTokens, // server-side cap on generated tokens
      stream: true,
      temperature: 0.7,
    }),
  });
  if (!response.ok) {
    throw new Error(`LLM request failed: ${response.status} ${response.statusText}`);
  }
  if (!response.body) {
    throw new Error('No response body for streaming');
  }

  // Parsed token fragments waiting to be yielded to the consumer.
  const queue: string[] = [];
  const parser = createParser((event) => {
    if (event.type !== 'event' || event.data === '[DONE]') return;
    try {
      const json = JSON.parse(event.data);
      const content = json.choices?.[0]?.delta?.content;
      if (content) queue.push(content);
    } catch { /* malformed chunk, skip */ }
  });

  const reader = response.body.getReader();
  const decoder = new TextDecoder();

  async function* generator(): AsyncGenerator<string> {
    try {
      while (true) {
        // Drain parsed tokens before reading more bytes, so the network is
        // only read as fast as the consumer iterates.
        while (queue.length > 0) yield queue.shift()!;
        const { done, value } = await reader.read();
        if (done) break;
        parser.feed(decoder.decode(value, { stream: true }));
      }
    } finally {
      // Runs on normal completion and when the consumer stops early,
      // so the HTTP stream is cancelled instead of left open.
      await reader.cancel().catch(() => {});
    }
  }
  return generator();
}

// Usage
(async () => {
  const stream = await streamLLMRequest('Explain PagedAttention in 3 sentences.');
  for await (const chunk of stream) {
    process.stdout.write(chunk);
  }
})();
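Retry behavior is easiest to keep separate from the streaming function. One possible shape is a generic wrapper with exponential backoff, sketched below. `withRetry` and its parameters are hypothetical names introduced for illustration, and the injectable `sleep` exists only so the delays can be controlled in tests.

```typescript
// Retry an async operation with exponential backoff (no jitter, for clarity).
async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts: number = 3,
  baseDelayMs: number = 200,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms))
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      // Wait baseDelayMs, then 2x, 4x, ... between attempts; no wait after the last one.
      if (attempt < maxAttempts) await sleep(baseDelayMs * 2 ** (attempt - 1));
    }
  }
  throw lastError;
}

// Intended use (assumes the streamLLMRequest function defined earlier):
//   const stream = await withRetry(() => streamLLMRequest(prompt));
```

Wrapping only the request setup means a failure before the first token is retried, while a failure mid-stream surfaces to the caller. Retrying mid-stream would replay tokens the consumer has already seen.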
Architecture Decisions and Rationale
- vLLM over TGI/Triton: vLLM's PagedAttention reduces KV cache fragmentation by 30–50% and provides native continuous batching. TGI requires manual batching configuration; Triton adds complexity without proportional gains for single-model serving.
- Kubernetes HPA with GPU metrics: CPU-based scaling fails for LLMs because GPU utilization dictates throughput. Custom metrics (via DCGM exporter) align scaling with actual inference capacity.
- Streaming-first client design: Token-by-token delivery reduces perceived latency and enables early cancellation. Backpressure handling prevents client memory exhaustion during long generations.
- Request routing layer: Place a lightweight proxy (Envoy or custom Go/TS service) in front of inference pods to handle fallback routing, rate limiting, and token budget enforcement without modifying the inference runtime.
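The routing decision in the last point can be sketched as a small pure function: reject requests that exceed the token budget, then pick the first healthy backend that can fit the request. All names here (`Backend`, `route`, the backend labels) are hypothetical. A real proxy would also need health probing and rate limiting, which are omitted.

```typescript
interface Backend {
  name: string;
  healthy: boolean;
  maxContextTokens: number; // largest prompt + completion the backend accepts
}

type RouteResult =
  | { ok: true; backend: string }
  | { ok: false; reason: string };

// Backends are listed in priority order; the first healthy one that fits wins.
function route(
  backends: Backend[],
  promptTokens: number,
  maxTokens: number,
  tokenBudget: number
): RouteResult {
  const total = promptTokens + maxTokens;
  if (total > tokenBudget) return { ok: false, reason: 'token budget exceeded' };
  const candidate = backends.find((b) => b.healthy && total <= b.maxContextTokens);
  return candidate
    ? { ok: true, backend: candidate.name }
    : { ok: false, reason: 'no healthy backend' };
}

// Example pool: the primary is marked unhealthy, so traffic falls back.
const pool: Backend[] = [
  { name: 'primary', healthy: false, maxContextTokens: 8192 },
  { name: 'secondary', healthy: true, maxContextTokens: 4096 },
];
console.log(route(pool, 1000, 512, 4096));
```

Keeping the decision free of I/O makes it straightforward to unit test, and the same logic can sit behind an Envoy filter or a custom service.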
Pitfall Guide
- Ignoring KV Cache Memory Limits: Every generated token expands the KV cache. Without a --gpu-memory-utilization cap or sequence limits, VRAM fills linearly and triggers OOM kills. Best practice: enforce max_num_seqs, monitor gpu_cache_usage_pct, and reject new requests when cache usage exceeds 85%.
- Naive Request Batching: Grouping requests by arrival time rather than sequence length causes head-of-line blocking, so long prompts delay short completions. Best practice: rely on continuous batching, enable chunked prefill (--enable-chunked-prefill) so long prompts do not stall decoding, and prioritize requests by estimated token budget.
- Over-Quantizing Reasoning Models: FP8/INT8 quantization compresses weights but can degrade multi-step reasoning and code generation accuracy. Best practice: benchmark quantized vs. full precision on domain-specific eval sets before production rollout. Reserve quantization for classification, summarization, and chat.
- Static GPU Allocation: Fixed replica counts waste budget during low traffic and bottleneck during spikes. Best practice: use HPA with GPU utilization and queue depth metrics. Set scale-up thresholds at 70–75% and scale-down at 30% with a 300s stabilization window.
- Missing Streaming Backpressure: Unbounded streaming clients accumulate tokens in memory, causing heap bloat and crashes. Best practice: implement chunk limits, pause/resume logic, and explicit cancellation on client disconnect.
- No Request Routing or Fallback: Single-model deployments fail silently when a provider endpoint degrades or a custom model drifts. Best practice: deploy a routing layer with health checks, token budget validation, and automatic fallback to secondary models or cached responses.
- Inadequate Observability: Tracking only request count and latency misses the root causes of inference failures. Best practice: export gpu_utilization, kv_cache_usage, queue_depth, time_to_first_token, and tokens_per_second. Alert on cache fragmentation >20% and TTFT >800 ms.
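The request-rejection advice from the first pitfall can be expressed as a simple admission check in front of the engine. This is a minimal sketch under stated assumptions: `EngineSnapshot` and `admit` are hypothetical names, the 85% cache limit comes from the text above, and the queue limit and status codes are illustrative defaults rather than recommended values.

```typescript
interface EngineSnapshot {
  kvCacheUsage: number; // fraction of KV cache in use, 0..1
  queueDepth: number;   // requests waiting to be scheduled
}

type Admission =
  | { admit: true }
  | { admit: false; status: 429 | 503; retryAfterSeconds: number };

// Decide whether to accept a new request given the latest engine metrics.
function admit(
  snapshot: EngineSnapshot,
  cacheLimit: number = 0.85, // reject above 85% cache usage
  queueLimit: number = 120   // illustrative queue cap
): Admission {
  if (snapshot.kvCacheUsage > cacheLimit) {
    return { admit: false, status: 503, retryAfterSeconds: 5 };
  }
  if (snapshot.queueDepth >= queueLimit) {
    return { admit: false, status: 429, retryAfterSeconds: 1 };
  }
  return { admit: true };
}

console.log(admit({ kvCacheUsage: 0.6, queueDepth: 10 }));
console.log(admit({ kvCacheUsage: 0.9, queueDepth: 10 }));
```

Rejecting early with a retry hint lets clients back off, instead of joining a queue that the cache cannot drain.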
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High throughput, cost-sensitive | Dedicated GPU Cluster (vLLM) | Continuous batching and PagedAttention maximize VRAM efficiency | 60–70% lower than serverless at scale |
| Low latency, variable traffic | Serverless Managed Inference | Auto-provisioning eliminates cold start management | 3–4x higher per-token cost above 50 req/s |
| Edge/IoT, offline compliance | Quantized GGUF + llama.cpp | Runs on CPU/low-tier GPU, no cloud dependency | Lowest infrastructure cost, capped throughput |
| Multi-model routing, fallback | vLLM + Envoy/Custom Proxy | Centralized routing handles degradation and token budgets | Adds 5–8% infra cost, prevents single-point failures |
| Rapid prototyping, internal tools | Managed API + SDK | Zero infra overhead, predictable pricing | Acceptable for <10k req/day, scales poorly beyond |
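The cost column depends on where per-token pricing and a flat GPU bill cross over. A rough way to locate that point for your own numbers is sketched below. `breakEvenRps` is a hypothetical helper and the inputs are placeholders, not vendor quotes. The calculation also assumes the GPU can actually sustain the resulting request rate.

```typescript
// Request rate above which a dedicated GPU costs less than per-token pricing.
function breakEvenRps(
  gpuCostPerHour: number,        // flat hourly cost of one dedicated GPU
  pricePerMillionTokens: number, // managed or serverless price
  tokensPerRequest: number       // average prompt + completion tokens
): number {
  const costPerRequest = (pricePerMillionTokens / 1e6) * tokensPerRequest;
  const gpuCostPerSecond = gpuCostPerHour / 3600;
  return gpuCostPerSecond / costPerRequest;
}

// Placeholder inputs for illustration only.
const rps = breakEvenRps(2.0, 0.5, 1000);
console.log(`break-even at about ${rps.toFixed(2)} requests per second`);
```

Below the break-even rate the managed option is cheaper; above it the dedicated cluster wins, provided utilization stays high enough to justify the fixed cost.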
Configuration Template
# vllm-production-values.yaml (Helm-style overrides)
replicaCount: 2
image:
  repository: vllm/vllm-openai
  tag: latest
  pullPolicy: IfNotPresent
args:
  - serve
  - meta-llama/Llama-3.1-8B
  - --gpu-memory-utilization
  - "0.9"
  - --max-num-seqs
  - "256"
  - --enable-chunked-prefill
  - --disable-log-requests
resources:
  limits:
    nvidia.com/gpu: 1
  requests:
    nvidia.com/gpu: 1
    cpu: "4"
    memory: "16Gi"
autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  targetGPUUtilization: 75
  targetQueueDepth: 120
monitoring:
  enabled: true
  exporter: dcgm
  metrics:
    - gpu_utilization
    - kv_cache_usage_pct
    - queue_depth
    - time_to_first_token
    - tokens_per_second
Quick Start Guide
- Pull and run vLLM locally:
  docker run --runtime=nvidia --gpus all -p 8000:8000 \
    vllm/vllm-openai:latest \
    serve meta-llama/Llama-3.1-8B --gpu-memory-utilization 0.9
- Verify inference endpoint:
  curl http://localhost:8000/v1/models
- Test streaming with TypeScript: Use the streamLLMRequest function from Core Solution. Run it on Node 18 or later, where fetch is built in, with a TypeScript runner such as tsx (npx tsx stream-client.ts).
- Deploy to Kubernetes: Apply the vllm-deployment.yaml and HPA manifest. Ensure nvidia-device-plugin is installed and DCGM exporter is running for custom metrics.
- Validate scaling: Generate load with k6 or wrk. Monitor gpu_utilization and queue_depth. Confirm HPA scales pods within 60–90 seconds of threshold breach.
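While validating, it can be useful to record time to first token and throughput from the client side as well. The helper below is a sketch that works on any `AsyncIterable<string>`, such as the stream returned by the client above. `measureStream` is a hypothetical name, chunks are used as a stand-in for tokens (an approximation), and the injectable clock is there only to make the measurement testable.

```typescript
interface StreamStats {
  ttftMs: number | null;   // time to first chunk; null if the stream was empty
  chunks: number;          // chunks received (roughly, tokens)
  characters: number;      // total characters received
  chunksPerSecond: number; // average rate over the whole stream
}

// Consume a stream and report latency and throughput figures.
async function measureStream(
  stream: AsyncIterable<string>,
  now: () => number = () => Date.now()
): Promise<StreamStats> {
  const start = now();
  let ttftMs: number | null = null;
  let chunks = 0;
  let characters = 0;
  for await (const chunk of stream) {
    if (ttftMs === null) ttftMs = now() - start; // first chunk arrived
    chunks++;
    characters += chunk.length;
  }
  const elapsedSeconds = (now() - start) / 1000;
  const chunksPerSecond = elapsedSeconds > 0 ? chunks / elapsedSeconds : 0;
  return { ttftMs, chunks, characters, chunksPerSecond };
}

// Intended use (assumes the streamLLMRequest function defined earlier):
//   const stats = await measureStream(await streamLLMRequest('Hello'));
```

Comparing these client-side numbers with the server-side time_to_first_token metric helps separate queueing inside the engine from network or proxy delay.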
Deployment strategy is not a static choice; it is a runtime contract between workload characteristics and infrastructure capacity. Match precision to task, batch continuously, scale on GPU metrics, and instrument everything. The models will scale themselves; your infrastructure must keep pace.