transfer encoding with NDJSON payloads, which provides maximum compatibility across load balancers, CDNs, and serverless runtimes.
Step 1: Protocol Selection
Use HTTP/1.1 chunked transfer encoding, or HTTP/2 streaming, which replaces chunked encoding with DATA frames but looks the same to application code. Avoid WebSockets for simple streaming: they require stateful connections, complicate proxy routing, and offer no meaningful latency advantage over a streamed HTTP response for one-way token delivery. SSE (Server-Sent Events) is viable for one-way server push but adds its own event framing to parse and lacks native bidirectional control. NDJSON over chunked HTTP strikes a practical balance: stateless, compatible with proxies and edge infrastructure, and natively supported by the Fetch API.
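To make the framing difference concrete, here is a minimal sketch of how the same payload looks on the wire in each format. The helper names frameNdjson and frameSse are illustrative assumptions, not part of any library.

```typescript
// Illustrative helpers (hypothetical names): serialize one record per framing style.

/** NDJSON: one JSON document per line, terminated by a single "\n". */
export function frameNdjson(record: Record<string, unknown>): string {
  // JSON.stringify escapes newlines inside strings, so "\n" is a safe delimiter.
  return JSON.stringify(record) + '\n';
}

/** SSE: each event is a "data: " field followed by a blank line. */
export function frameSse(record: Record<string, unknown>): string {
  return `data: ${JSON.stringify(record)}\n\n`;
}

const delta = { choices: [{ index: 0, delta: { content: 'Hi' }, finish_reason: null }] };
// JSON.stringify is used here only to make the line terminators visible when logged.
console.log(JSON.stringify(frameNdjson(delta)));
console.log(JSON.stringify(frameSse(delta)));
```

A consumer of the NDJSON form only needs to split on newlines; a consumer of the SSE form must also strip the field prefix and skip the blank separator lines.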
Step 2: Client-Side Stream Consumer
The browser's ReadableStream API handles incremental data. Combine it with AbortController for cancellation; the pull-based reader.read() loop provides backpressure.
// Shape of one streamed chunk (OpenAI-style chat completion delta).
interface StreamChunk {
  id: string;
  object: string;
  created: number;
  model: string;
  choices: Array<{
    index: number;
    delta: { role?: string; content?: string };
    finish_reason: string | null;
  }>;
}

export class LLMStreamClient {
  private abortController: AbortController | null = null;

  async generate(
    endpoint: string,
    payload: Record<string, unknown>,
    onChunk: (content: string) => void,
    onComplete: () => void,
    onError: (error: Error) => void
  ): Promise<void> {
    // Keep a local reference so a later generate() call cannot swap the signal mid-request.
    const controller = new AbortController();
    this.abortController = controller;

    // Parse one line and forward its content delta, if any.
    const handleLine = (line: string): void => {
      const trimmed = line.trim();
      if (!trimmed || trimmed === 'data: [DONE]') return;
      // Strip SSE prefix if present
      const jsonStr = trimmed.startsWith('data: ') ? trimmed.slice(6) : trimmed;
      let chunk: StreamChunk;
      try {
        chunk = JSON.parse(jsonStr);
      } catch {
        // Skip malformed chunks; do not break stream
        return;
      }
      const content = chunk.choices?.[0]?.delta?.content;
      if (content) onChunk(content);
    };

    try {
      const response = await fetch(endpoint, {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          'Accept': 'application/json',
        },
        body: JSON.stringify({ ...payload, stream: true }),
        signal: controller.signal,
      });
      if (!response.ok || !response.body) {
        throw new Error(`HTTP ${response.status}: ${response.statusText}`);
      }

      const reader = response.body.getReader();
      const decoder = new TextDecoder('utf-8');
      let buffer = '';

      while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        // stream: true keeps multi-byte characters intact across chunk boundaries
        buffer += decoder.decode(value, { stream: true });
        // Split on newline boundaries to handle partial JSON objects
        const lines = buffer.split('\n');
        buffer = lines.pop() ?? '';
        for (const line of lines) handleLine(line);
      }

      // Flush the decoder and process a final line that lacks a trailing newline.
      buffer += decoder.decode();
      if (buffer) handleLine(buffer);
      onComplete();
    } catch (err) {
      if (err instanceof Error && err.name === 'AbortError') {
        // Expected cancellation
        return;
      }
      onError(err instanceof Error ? err : new Error(String(err)));
    } finally {
      // Clear the controller only if it still belongs to this request.
      if (this.abortController === controller) this.abortController = null;
    }
  }

  cancel(): void {
    this.abortController?.abort();
  }
}
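The buffering logic in the consumer above can be isolated into a small helper that is easy to unit test without a network. The following is a minimal sketch under that assumption; LineFramer and normalizeLine are hypothetical names introduced here for illustration.

```typescript
// Hypothetical helper: accumulates decoded text and returns only complete lines.
export class LineFramer {
  private buffer = '';

  /** Feed a decoded text fragment; returns the payload of every completed line. */
  push(fragment: string): string[] {
    this.buffer += fragment;
    const lines = this.buffer.split('\n');
    // The last element is either '' or an incomplete line; keep it for the next push.
    this.buffer = lines.pop() ?? '';
    return collectPayloads(lines);
  }

  /** Call once at end of stream to recover a final line with no trailing newline. */
  flush(): string[] {
    const rest = this.buffer;
    this.buffer = '';
    return collectPayloads([rest]);
  }
}

/** Trims a line, strips an optional SSE "data: " prefix, drops blanks and the [DONE] sentinel. */
function normalizeLine(line: string): string | null {
  const trimmed = line.trim();
  if (!trimmed) return null;
  const payload = trimmed.startsWith('data: ') ? trimmed.slice(6) : trimmed;
  return payload === '[DONE]' ? null : payload;
}

function collectPayloads(lines: string[]): string[] {
  const payloads: string[] = [];
  for (const line of lines) {
    const payload = normalizeLine(line);
    if (payload !== null) payloads.push(payload);
  }
  return payloads;
}

// A JSON object split across two network chunks is reassembled before parsing.
const framer = new LineFramer();
const payloads = [
  ...framer.push('{"a":1}\n{"b"'),
  ...framer.push(':2}\ndata: {"c":3}\n\ndata: [DONE]\n{"d":4}'),
  ...framer.flush(),
];
console.log(payloads.map((p) => JSON.parse(p)));
```

Each returned payload is a complete JSON string, so the caller can pass it straight to JSON.parse and treat a parse failure as a genuinely malformed record rather than a chunk boundary.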
Step 3: Server-Side Relay (Optional but Recommended)
Direct client-to-LLM calls expose API keys and bypass rate limiting. A lightweight relay handles authentication, cost tracking, and stream sanitization.
// Node.js / Express example (requires Node 18+ for the global fetch)
import type { Request, Response } from 'express';

export function streamRelay(req: Request, res: Response) {
  const { model, messages, max_tokens, temperature } = req.body;

  // Abort the upstream request when the client disconnects so generation stops.
  const upstreamAbort = new AbortController();
  res.on('close', () => upstreamAbort.abort());

  const proxyStream = async () => {
    const llmRes = await fetch('https://api.openai.com/v1/chat/completions', {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${process.env.LLM_API_KEY}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({ model, messages, max_tokens, temperature, stream: true }),
      signal: upstreamAbort.signal,
    });
    if (!llmRes.ok || !llmRes.body) {
      res.status(502).json({ error: 'Upstream request failed', status: llmRes.status });
      return;
    }

    // Set streaming headers only after the upstream stream is confirmed.
    // Node applies chunked transfer encoding automatically when no Content-Length is set.
    res.setHeader('Content-Type', llmRes.headers.get('content-type') ?? 'application/json');
    res.setHeader('Cache-Control', 'no-cache');
    res.setHeader('Connection', 'keep-alive');

    const reader = llmRes.body.getReader();
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      // Forward raw bytes; decoding per chunk could split multi-byte characters.
      res.write(value);
    }
    res.end();
  };

  proxyStream().catch(err => {
    // An abort caused by client disconnect is expected, not an error.
    if (upstreamAbort.signal.aborted) return;
    console.error('Stream relay error:', err);
    if (!res.headersSent) res.status(500).end();
    else res.end();
  });
}
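The relay loop above calls res.write without checking its return value, so a slow client can cause chunks to pile up in server memory. One way to respect backpressure is to wait for the 'drain' event whenever write returns false. The sketch below assumes a hypothetical helper name, writeWithBackpressure, and a minimal structural type, DrainableSink, which Node.js writable streams (including Express responses) are expected to satisfy.

```typescript
// Minimal structural type (an assumption for this sketch) covering what the helper needs.
interface DrainableSink {
  write(chunk: Uint8Array): boolean;
  once(event: 'drain' | 'close', listener: () => void): unknown;
  off(event: 'drain' | 'close', listener: () => void): unknown;
}

/**
 * Writes one chunk. If the sink reports a full buffer, waits until it drains
 * or closes before resolving, so the caller's read loop pauses in the meantime.
 */
export function writeWithBackpressure(sink: DrainableSink, chunk: Uint8Array): Promise<void> {
  if (sink.write(chunk)) return Promise.resolve();
  return new Promise((resolve) => {
    const settle = () => {
      // Remove both listeners so repeated calls do not accumulate handlers.
      sink.off('drain', settle);
      sink.off('close', settle);
      resolve();
    };
    sink.once('drain', settle);
    // Also settle on 'close' so a disconnected client cannot stall the loop forever.
    sink.once('close', settle);
  });
}

// Usage inside the relay loop, replacing the bare res.write(...) call:
//   await writeWithBackpressure(res, value);
```

Because the helper is awaited inside the read loop, the relay stops pulling from the upstream reader while the client is slow, which in turn lets backpressure reach the provider connection.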
Step 4: Architecture Rationale
- Incremental Flushing: HTTP chunked encoding allows the server to send bytes as soon as tokens are generated. No buffering at the application layer.
- Stateless Scaling: Because streaming relies on standard HTTP, horizontal scaling works identically to blocking endpoints. Load balancers route new requests without sticky sessions.
- Backpressure Handling: The reader.read() loop naturally applies backpressure. If the UI cannot render fast enough, the stream pauses until the consumer catches up.
- Cancellation Safety: AbortController cancels the in-flight request immediately. This prevents wasted compute and token billing, provided the server or relay detects the disconnect and stops upstream generation.
Pitfall Guide
- Ignoring Backpressure: Feeding raw stream data directly to DOM updates causes layout thrashing and, on long responses, runaway memory use. Buffer chunks and throttle UI updates using requestAnimationFrame or a short timer.
- Treating Streaming as Cost Reduction: Streaming does not reduce token consumption or inference compute. Cost remains identical to synchronous calls. Only cancelling mid-stream saves tokens.
- Poor Cancellation Handling: Failing to abort connections leaves upstream models generating useless tokens. Always pair UI cancel buttons with AbortController.abort() and log cancellation events for billing reconciliation.
- Assuming Token-to-Character Mapping: LLMs emit subword tokens. Rendering each token as it arrives produces broken markdown, split emojis, and incomplete code blocks. Implement incremental markdown parsing, or re-render the accumulated text on each update with a library such as react-markdown.
- Skipping Incremental Safety Checks: Streaming bypasses batch validation. Prompt injections, toxic content, or PII can leak incrementally. Apply lightweight streaming filters or run post-chunk validation before rendering.
- Blocking the Main Thread: JSON parsing and string concatenation on the main thread cause jank. Offload stream decoding to a Web Worker and communicate via postMessage.
- Inconsistent Error Boundaries: Network drops mid-stream leave the UI in a half-rendered state. Implement retry logic with exponential backoff for transient failures, and always render a fallback state when finish_reason is missing or the stream terminates unexpectedly.
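The throttling advice in the first pitfall can be sketched as a small batcher that coalesces chunks and renders at most once per scheduled tick. ChunkBatcher is a hypothetical name, and the scheduler is injected so the same class works with requestAnimationFrame in a browser or a manual queue in tests.

```typescript
type Schedule = (flush: () => void) => void;

// Hypothetical helper: collects chunks and renders them in one update per tick.
export class ChunkBatcher {
  private pending: string[] = [];
  private scheduled = false;
  private readonly render: (text: string) => void;
  private readonly schedule: Schedule;

  // In a browser, `schedule` would typically be (cb) => requestAnimationFrame(cb).
  constructor(render: (text: string) => void, schedule: Schedule) {
    this.render = render;
    this.schedule = schedule;
  }

  /** Queue a chunk; at most one flush is scheduled at a time. */
  add(chunk: string): void {
    this.pending.push(chunk);
    if (this.scheduled) return;
    this.scheduled = true;
    this.schedule(() => this.flush());
  }

  /** Render everything queued so far as a single update. */
  flush(): void {
    this.scheduled = false;
    if (this.pending.length === 0) return;
    const text = this.pending.join('');
    this.pending = [];
    this.render(text);
  }
}

// Demo with a manual scheduler: three chunks produce one render call.
const ticks: Array<() => void> = [];
const batcher = new ChunkBatcher(
  (text) => console.log('render:', text),
  (cb) => { ticks.push(cb); },
);
batcher.add('Hel');
batcher.add('lo');
batcher.add('!');
ticks.forEach((run) => run());
```

Calling flush() directly from onComplete ensures the last partial batch is rendered even if no further tick fires.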
Production Best Practices:
- Monitor TTFT and TPOT (Time Per Output Token) separately. TTFT reflects queueing, model loading, and cache hit rate; TPOT indicates inference throughput.
- Use HTTP/2 or HTTP/3 to reduce connection overhead and improve multiplexing.
- Implement chunk deduplication if upstream providers resend partial tokens.
- Log stream termination reasons (stop, length, content_filter) for analytics and compliance.
- Never trust raw stream output for critical business logic; always validate the final assembled response.
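As a rough illustration of the first practice, both metrics can be derived from timestamps recorded on the client. This sketch assumes each content chunk carries roughly one token, which providers do not guarantee; computeTimings and StreamTimings are hypothetical names.

```typescript
export interface StreamTimings {
  ttftMs: number;        // request sent -> first content chunk
  tpotMs: number | null; // mean gap between subsequent chunks; null with a single chunk
}

/**
 * @param requestSentAt timestamp (ms) when the request was issued
 * @param chunkArrivals timestamps (ms) at which each content chunk arrived, ascending
 */
export function computeTimings(requestSentAt: number, chunkArrivals: number[]): StreamTimings | null {
  if (chunkArrivals.length === 0) return null; // nothing streamed, nothing to measure
  const first = chunkArrivals[0];
  const last = chunkArrivals[chunkArrivals.length - 1];
  const ttftMs = first - requestSentAt;
  // TPOT averages the intervals after the first chunk, so TTFT does not skew it.
  const tpotMs = chunkArrivals.length > 1 ? (last - first) / (chunkArrivals.length - 1) : null;
  return { ttftMs, tpotMs };
}

console.log(computeTimings(1000, [1400, 1450, 1500, 1550]));
```

In a browser, the timestamps would usually come from performance.now() captured just before fetch and inside onChunk.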
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Real-time chat UI | Chunked HTTP + NDJSON + Web Worker parsing | Lowest latency, native browser support, easy proxying | Neutral (same tokens) |
| High-throughput batch API | Synchronous with connection pooling | Predictable billing, simpler error handling, no stream overhead | Lower infra complexity |
| Mobile/low-bandwidth clients | SSE with gzip compression + progressive rendering | Better compression ratios, native mobile HTTP clients support it | Slightly higher CPU for compression |
| Enterprise compliance gate | Server-side relay with streaming sanitizer | Intercepts PII/toxicity before client render, maintains audit trail | +10–15% latency for validation |
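The "streaming sanitizer" in the last row has to cope with sensitive terms that straddle chunk boundaries. A minimal sketch of one approach is a holdback window: keep the last few characters unreleased until the next chunk shows whether they begin a blocked term. StreamingRedactor is a hypothetical name, and exact, case-sensitive term matching is a simplifying assumption; real PII or toxicity detection is considerably more involved.

```typescript
// Toy filter (assumption: exact, case-sensitive blocklist terms).
export class StreamingRedactor {
  private held = '';
  private readonly terms: string[];
  private readonly holdback: number;

  constructor(terms: string[]) {
    this.terms = terms.filter((t) => t.length > 0);
    // A term split across chunks can hide at most (length - 1) characters in the tail.
    this.holdback = Math.max(0, ...this.terms.map((t) => t.length - 1));
  }

  /** Accepts a chunk and returns only the text that is safe to release now. */
  push(chunk: string): string {
    this.held = this.redact(this.held + chunk);
    const releaseUpTo = Math.max(0, this.held.length - this.holdback);
    const safe = this.held.slice(0, releaseUpTo);
    this.held = this.held.slice(releaseUpTo);
    return safe;
  }

  /** Call at end of stream; the held tail has already been redacted. */
  flush(): string {
    const rest = this.held;
    this.held = '';
    return rest;
  }

  private redact(text: string): string {
    let out = text;
    for (const term of this.terms) {
      out = out.split(term).join('*'.repeat(term.length));
    }
    return out;
  }
}

// "secret" arrives split across two chunks and is still masked.
const redactor = new StreamingRedactor(['secret']);
const released = [redactor.push('my sec'), redactor.push('ret!'), redactor.flush()];
console.log(released.join(''));
```

The holdback delays output by a few characters, which is one source of the added validation latency listed in the table.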
Configuration Template
// stream.config.ts
export const STREAM_CONFIG = {
  // Endpoint routing
  endpoint: process.env.LLM_STREAM_ENDPOINT || '/api/v1/chat/stream',

  // Network behavior
  timeout: 30000, // ms. Abort if no chunk received
  retryAttempts: 2,
  retryDelay: 800, // ms

  // UI rendering
  chunkThrottle: 16, // ms. Matches 60fps render cycle
  maxBufferSize: 50, // chunks before forcing flush

  // Telemetry
  metrics: {
    ttft: true,
    tpot: true,
    cancellation: true,
    finishReason: true,
  },

  // Safety
  incrementalFilter: false, // Enable if using streaming sanitizer
  maxTokens: 2048,
  temperature: 0.7,
};

export type StreamConfig = typeof STREAM_CONFIG;
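The retryAttempts and retryDelay settings could drive an exponential backoff along the following lines. This is a sketch under assumptions: backoffDelayMs and withRetry are hypothetical helpers, the 10-second cap is an arbitrary illustrative default, and the sleep function is injectable so the logic can be tested without real timers.

```typescript
/** Delay before retry number `attempt` (1-based): base, 2x base, 4x base, ... up to a cap. */
export function backoffDelayMs(attempt: number, baseMs: number, capMs: number = 10000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt - 1));
}

/** Runs `task`, retrying up to `retryAttempts` additional times on failure. */
export async function withRetry<T>(
  task: () => Promise<T>,
  retryAttempts: number,
  retryDelayMs: number,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise<void>((r) => setTimeout(r, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await task();
    } catch (err) {
      // Out of retries: surface the last error to the caller.
      if (attempt >= retryAttempts) throw err;
      await sleep(backoffDelayMs(attempt + 1, retryDelayMs));
    }
  }
}

// With the template values (2 retries, 800 ms base) the waits are 800 ms, then 1600 ms.
console.log([1, 2].map((n) => backoffDelayMs(n, 800)));
```

Retrying is only safe before any content has been rendered; once chunks have reached the UI, a retry would replay the response from the start, so the caller should either reset the output or surface the error instead.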
Quick Start Guide
- Enable streaming in your payload: Add stream: true to your LLM request body. Verify the response arrives incrementally: over HTTP/1.1 the provider returns Transfer-Encoding: chunked (HTTP/2 has no such header), along with the Content-Type your client expects.
- Initialize the client: Instantiate LLMStreamClient, bind onChunk to your UI state updater, and attach onComplete/onError handlers.
- Wire cancellation: Tie your UI's stop/cancel button to client.cancel(). Ensure route changes or component unmounts trigger abort.
- Deploy relay (optional): If managing keys or compliance, route through the Express relay. Set LLM_API_KEY in the environment and verify chunk passthrough.
- Observe: Instrument TTFT and TPOT. Run a load test with 50 concurrent streams. Verify memory stays flat and cancellation terminates upstream generation within 200ms.