quest ingestion from inference execution, enforcing token-aware grouping, and implementing resilient result mapping. The architecture follows a three-phase pipeline: ingestion, batching, and execution.
Step-by-Step Implementation
- Ingestion & Normalization: Accept requests, extract payloads, estimate token counts, and assign idempotency keys.
- Token-Aware Batching: Group requests by token budget, respecting provider batch limits. Split or defer when limits are approached.
- Queue & Worker Execution: Push batches to a persistent queue. Workers pull batches from the queue, submit them to the provider, poll for completion, and map results back to original request IDs.
- Result Aggregation & Delivery: Return responses, handle partial failures, and trigger downstream callbacks or webhooks.
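The ingestion step in the list above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the processor shown later: `RawRequest`, `NormalizedRequest`, `normalize`, and the four-characters-per-token heuristic are hypothetical names and choices introduced here, and the idempotency key is derived from a content hash so that identical payloads map to the same key.

```typescript
import { createHash } from 'node:crypto';

interface RawRequest {
  messages: Array<{ role: string; content: string }>;
  model: string;
  maxTokens?: number;
}

interface NormalizedRequest {
  id: string; // idempotency key (assumption: derived from the payload)
  messages: Array<{ role: string; content: string }>;
  model: string;
  maxTokens: number;
  estimatedTokens: number;
}

// Rough heuristic (assumption): about 4 characters per token.
// A real tokenizer should replace this before enforcing hard limits.
function roughTokenEstimate(text: string): number {
  return Math.ceil(text.length / 4);
}

function normalize(raw: RawRequest, defaultMaxTokens = 256): NormalizedRequest {
  // Trim message content so whitespace differences do not change the key.
  const messages = raw.messages.map(m => ({ role: m.role, content: m.content.trim() }));
  const maxTokens = raw.maxTokens ?? defaultMaxTokens;

  // Hash the normalized payload to get a stable 32-character key.
  const id = createHash('sha256')
    .update(JSON.stringify({ model: raw.model, messages, maxTokens }))
    .digest('hex')
    .slice(0, 32);

  const promptTokens = messages.reduce((sum, m) => sum + roughTokenEstimate(m.content), 0);
  return { id, messages, model: raw.model, maxTokens, estimatedTokens: promptTokens + maxTokens };
}
```

A content-derived key deduplicates identical requests. If two callers must receive separate responses for the same payload, a caller-supplied or random ID is the safer choice.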
Architecture Decisions & Rationale
- Queue over In-Memory: Persistent queues (Redis-backed BullMQ, or a broker such as RabbitMQ) provide durability, backpressure, and horizontal scaling. In-memory arrays lose state on restart and cannot handle distributed deployments.
- Token-Aware Chunking: Providers reject batches that exceed context or token limits. Counting tokens upfront prevents silent truncation and 400-level failures.
- Async Polling Semantics: Provider batch APIs return immediately with a batch ID. Workers must poll status endpoints until completion, then fetch results. This matches provider design and avoids blocking.
- Idempotency & Result Mapping: Every request receives a unique ID. Batch results are keyed to these IDs, enabling exact reconstruction even when batches fail partially or retry.
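The polling semantics described above can be illustrated with a small helper. This is a sketch under assumptions: `pollUntilTerminal`, `PollOptions`, and the status names `completed` and `expired` are hypothetical and should be checked against your provider's documented statuses. The sleep function is injectable so the loop can be tested without real delays.

```typescript
type BatchStatus = 'validating' | 'in_progress' | 'completed' | 'failed' | 'expired';

interface PollOptions {
  initialDelayMs: number;
  maxDelayMs: number;
  maxAttempts: number;
  sleep?: (ms: number) => Promise<void>; // injectable for tests
}

// Statuses after which polling stops (assumed names).
const TERMINAL = new Set<BatchStatus>(['completed', 'failed', 'expired']);

async function pollUntilTerminal(
  getStatus: () => Promise<BatchStatus>,
  opts: PollOptions
): Promise<BatchStatus> {
  const sleep = opts.sleep ?? ((ms: number) => new Promise<void>(res => setTimeout(res, ms)));
  let delay = opts.initialDelayMs;

  for (let attempt = 0; attempt < opts.maxAttempts; attempt++) {
    const status = await getStatus();
    if (TERMINAL.has(status)) return status;
    // Wait between checks, doubling the delay up to the configured cap.
    await sleep(delay);
    delay = Math.min(delay * 2, opts.maxDelayMs);
  }
  throw new Error(`Batch did not reach a terminal status after ${opts.maxAttempts} checks`);
}
```

Bounding the number of checks keeps a stuck batch from occupying a worker slot indefinitely. The thrown error can then feed the queue's normal retry or dead-letter path.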
TypeScript Implementation
import { Queue, Worker, Job } from 'bullmq';
import IORedis from 'ioredis';
import { get_encoding, type Tiktoken, type TiktokenEncoding } from 'tiktoken';

interface LLMRequest {
  id: string;
  messages: Array<{ role: string; content: string }>;
  model: string;
  maxTokens: number;
  callbackUrl?: string;
}

interface BatchResult {
  requestId: string;
  content: string;
  status: 'success' | 'failed';
  error?: string;
}

const ENCODING: TiktokenEncoding = 'cl100k_base';
const BATCH_TOKEN_LIMIT = 100_000;
const PROVIDER_BATCH_ENDPOINT = 'https://api.provider.com/v1/batches';

class LLMBatchProcessor {
  private queue: Queue;
  private worker: Worker;
  private encoder: Tiktoken;

  constructor(redisUrl: string) {
    // BullMQ connects through ioredis (not the `redis` package), and its
    // workers require maxRetriesPerRequest to be null.
    const connection = new IORedis(redisUrl, { maxRetriesPerRequest: null });
    this.queue = new Queue('llm-batch', { connection });
    this.encoder = get_encoding(ENCODING);
    this.worker = new Worker(
      'llm-batch',
      async (job: Job) => this.processBatch(job),
      { connection, concurrency: 3 }
    );
  }

  async enqueue(requests: LLMRequest[]): Promise<void> {
    const tokenEstimates = requests.map(r => this.estimateTokens(r));
    const batches: LLMRequest[][] = [];
    let currentBatch: LLMRequest[] = [];
    let currentTokens = 0;

    for (let i = 0; i < requests.length; i++) {
      const req = requests[i];
      const tokens = tokenEstimates[i];
      // A single request over the budget can never fit in any batch; fail fast
      // instead of submitting a batch the provider would reject.
      if (tokens > BATCH_TOKEN_LIMIT) {
        throw new Error(`Request ${req.id} needs ~${tokens} tokens, above the batch limit of ${BATCH_TOKEN_LIMIT}`);
      }
      // Close the current batch when adding this request would exceed the budget.
      if (currentTokens + tokens > BATCH_TOKEN_LIMIT && currentBatch.length > 0) {
        batches.push(currentBatch);
        currentBatch = [];
        currentTokens = 0;
      }
      currentBatch.push(req);
      currentTokens += tokens;
    }
    if (currentBatch.length > 0) batches.push(currentBatch);

    await this.queue.addBulk(
      batches.map(batch => ({
        name: 'execute-batch',
        data: { batch, timestamp: Date.now() },
        opts: { attempts: 3, backoff: { type: 'exponential', delay: 2000 } }
      }))
    );
  }

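  // Prompt tokens plus the reserved completion budget. Per-message formatting
  // overhead is ignored, so treat the result as an estimate.
  private estimateTokens(request: LLMRequest): number {
    const text = request.messages.map(m => m.content).join(' ');
    return this.encoder.encode(text).length + request.maxTokens;
  }
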
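  private async processBatch(job: Job<{ batch: LLMRequest[] }>): Promise<BatchResult[]> {
    const { batch } = job.data;
    const authHeaders = { Authorization: `Bearer ${process.env.LLM_API_KEY}` };

    const batchPayload = {
      input_file_id: await this.uploadBatchFile(batch),
      endpoint: '/v1/chat/completions',
      completion_window: '24h',
      metadata: { batchId: job.id }
    };

    const batchResponse = await fetch(PROVIDER_BATCH_ENDPOINT, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json', ...authHeaders },
      body: JSON.stringify(batchPayload)
    });
    if (!batchResponse.ok) throw new Error(`Batch submission failed: ${batchResponse.statusText}`);
    const { id: batchId } = await batchResponse.json();

    // Poll until the batch leaves its pending states. `data` is declared outside
    // the loop so the final status payload is still in scope afterwards.
    let data: any;
    while (true) {
      const statusRes = await fetch(`${PROVIDER_BATCH_ENDPOINT}/${batchId}`, { headers: authHeaders });
      if (!statusRes.ok) throw new Error(`Status check failed: ${statusRes.statusText}`);
      data = await statusRes.json();
      if (data.status !== 'validating' && data.status !== 'in_progress') break;
      await new Promise(res => setTimeout(res, 5000));
    }

    if (data.status === 'failed' || !data.output_file_url) {
      throw new Error(`Batch ${batchId} ended with status ${data.status} and no usable output`);
    }

    // Assumes the provider serves results as a JSON array; NDJSON output
    // would need to be split and parsed line by line instead.
    const resultsRes = await fetch(data.output_file_url);
    const results: any[] = await resultsRes.json();

    return batch.map((req): BatchResult => {
      const match = results.find(r => r.custom_id === req.id);
      // A request fails if it has no result or its result carries an error.
      const failed = !match || Boolean(match.error);
      return {
        requestId: req.id,
        content: match?.content || '',
        status: failed ? 'failed' : 'success',
        error: match ? match.error || undefined : 'No result returned for this request'
      };
    });
  }
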
  private async uploadBatchFile(batch: LLMRequest[]): Promise<string> {
    // Serialize batch to NDJSON, upload to provider file API, return file_id
    // Implementation omitted for brevity; matches provider file upload spec
    return 'file-placeholder-id';
  }
}

export default LLMBatchProcessor;
The implementation enforces token budgeting, leverages BullMQ for durable queuing, implements exponential backoff, and maps results back to original request IDs. Production deployments should replace placeholder file upload logic with provider-specific NDJSON serialization and file API calls.
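As a starting point for the NDJSON serialization mentioned above, the following sketch writes one JSON object per line and parses such a file back. The line shape (`custom_id`, `method`, `url`, `body`) is an assumption modeled on OpenAI-style batch input files, and `toNdjson` and `parseNdjson` are hypothetical helper names. Check the exact fields against your provider's file specification.

```typescript
interface LLMRequest {
  id: string;
  messages: Array<{ role: string; content: string }>;
  model: string;
  maxTokens: number;
}

// Serialize each request as a single JSON line. JSON.stringify escapes
// embedded newlines, so every request stays on one line.
function toNdjson(batch: LLMRequest[], endpoint = '/v1/chat/completions'): string {
  return batch
    .map(req =>
      JSON.stringify({
        custom_id: req.id, // lets results be mapped back to the request
        method: 'POST',
        url: endpoint,
        body: { model: req.model, messages: req.messages, max_tokens: req.maxTokens }
      })
    )
    .join('\n');
}

// Parse an NDJSON document, skipping blank lines.
function parseNdjson<T>(text: string): T[] {
  return text
    .split('\n')
    .filter(line => line.trim().length > 0)
    .map(line => JSON.parse(line) as T);
}
```

The string returned by `toNdjson` would be the body of the file upload. The upload call itself is provider-specific and is left out here.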
Pitfall Guide
- Fixed-Size Batching Without Token Counting: Grouping requests by count (e.g., 50 per batch) ignores variable payload sizes. This causes context window overflow, silent truncation, or provider rejection. Always count tokens before batching.
- Ignoring Provider Batch Limits: Each provider enforces maximum tokens, requests, or file size per batch. Exceeding these limits rejects the entire batch. Validate against documented limits before submission.
- Missing Request-to-Result Mapping: Batch APIs return aggregated results. Without idempotency keys or custom IDs, you cannot reconstruct which response belongs to which request. This breaks stateful workflows and causes data corruption.
- Synchronous Blocking in Workers: Polling provider status endpoints synchronously blocks the event loop or thread pool. Use async intervals or webhooks. Blocking workers destroy throughput and trigger timeout cascades.
- No Partial Failure Handling: Batches frequently succeed partially. Treating a batch as all-or-nothing forces full retries, wasting tokens and latency. Track per-request status, retry only failed items, and surface granular errors.
- Overlooking Circuit Breakers: Provider outages or rate limit spikes can trap workers in retry loops. Implement circuit breakers with fallback queues, dead-letter handling, and alerting to prevent resource exhaustion.
- Untracked Cost Attribution: Batch processing obscures per-request spend. Without tagging requests with service, user, or environment metadata, cost reconciliation becomes impossible. Embed cost centers in batch metadata and log token consumption per request.
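To make the partial-failure pitfall concrete, here is a minimal sketch that splits a finished batch into completed results and requests that need another attempt. `partitionResults` is a hypothetical helper. It assumes results carry the `requestId` and `status` fields used by the `BatchResult` interface earlier in this article.

```typescript
interface LLMRequest {
  id: string;
  messages: Array<{ role: string; content: string }>;
  model: string;
  maxTokens: number;
}

interface BatchResult {
  requestId: string;
  content: string;
  status: 'success' | 'failed';
  error?: string;
}

function partitionResults(
  batch: LLMRequest[],
  results: BatchResult[]
): { succeeded: BatchResult[]; retry: LLMRequest[] } {
  // Index results by request ID for constant-time lookup.
  const byId = new Map(results.map(r => [r.requestId, r] as const));
  const succeeded: BatchResult[] = [];
  const retry: LLMRequest[] = [];

  for (const req of batch) {
    const result = byId.get(req.id);
    // Missing results are treated like failures and queued for retry.
    if (result && result.status === 'success') succeeded.push(result);
    else retry.push(req);
  }
  return { succeeded, retry };
}
```

Only the `retry` list would be re-enqueued, so tokens already spent on successful items are not paid for twice.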
Production Best Practices:
- Use adaptive batch sizing that respects both token limits and concurrency caps
- Implement OpenTelemetry tracing across ingestion, batching, and execution phases
- Separate interactive latency-sensitive flows from deferred batch workloads
- Store batch state in a durable database for auditability and replay
- Design webhooks or polling consumers to handle async completion gracefully
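One reading of the first practice above is a packer that closes a batch when either of two limits is reached. The sketch below uses a token budget and a per-batch request cap. `packBatches` and `Sized` are hypothetical names, and interpreting the cap as a request count per batch is an assumption.

```typescript
interface Sized {
  id: string;
  tokens: number; // pre-computed token estimate for the request
}

function packBatches(
  items: Sized[],
  maxTokens: number,
  maxRequests: number
): { batches: Sized[][]; rejected: Sized[] } {
  const batches: Sized[][] = [];
  const rejected: Sized[] = [];
  let current: Sized[] = [];
  let currentTokens = 0;

  for (const item of items) {
    // An item larger than the whole budget can never be batched.
    if (item.tokens > maxTokens) {
      rejected.push(item);
      continue;
    }
    const overTokens = currentTokens + item.tokens > maxTokens;
    const overCount = current.length >= maxRequests;
    // Close the open batch when either limit would be exceeded.
    if (current.length > 0 && (overTokens || overCount)) {
      batches.push(current);
      current = [];
      currentTokens = 0;
    }
    current.push(item);
    currentTokens += item.tokens;
  }
  if (current.length > 0) batches.push(current);
  return { batches, rejected };
}
```

Returning rejected items separately lets the caller decide whether to split, truncate, or fail them, rather than silently dropping them.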
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Interactive UI requiring <2s response | Async queue + short-lived batch window | Preserves UX while offloading inference to workers | Moderate increase due to queue overhead |
| ETL/data pipeline processing 100k+ records | Native batch endpoint | Maximizes throughput and leverages provider discounts | Lowest cost per token |
| Cost-constrained analytics with mixed latency | Queue-driven dynamic batching | Balances throughput, cost, and predictable latency | 10–15% savings vs sequential |
| Multi-tenant SaaS with unpredictable spikes | Queue + circuit breaker + dead-letter queue | Prevents cascading failures and isolates noisy tenants | Higher infra cost, lower risk exposure |
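The circuit breaker named in the last row can be sketched as a small state machine. This is an illustrative minimum, not a production breaker: the `CircuitBreaker` class, its thresholds, and the injectable clock are assumptions introduced here. Dead-letter routing is left to the queue.

```typescript
type BreakerState = 'closed' | 'open' | 'half_open';

class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  private state: BreakerState = 'closed';
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(failureThreshold: number, cooldownMs: number, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // Returns false while the breaker is open and the cooldown has not elapsed.
  canRequest(): boolean {
    if (this.state === 'open' && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = 'half_open'; // let trial calls through until one reports back
    }
    return this.state !== 'open';
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = 'closed';
  }

  recordFailure(): void {
    this.failures += 1;
    // A failed trial call, or too many failures, opens the breaker.
    if (this.state === 'half_open' || this.failures >= this.failureThreshold) {
      this.state = 'open';
      this.openedAt = this.now();
    }
  }

  getState(): BreakerState {
    return this.state;
  }
}
```

A worker would call `canRequest()` before submitting a batch and delay or reroute the job when it returns false, so retries stop hitting a provider that is already failing.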
Configuration Template
llm_batch_processor:
  redis:
    url: ${REDIS_URL}
    max_retries: 3
    backoff_delay_ms: 2000
  batching:
    max_tokens_per_batch: 100000
    max_requests_per_batch: 50
    token_estimator: tiktoken_cl100k
  provider:
    endpoint: https://api.provider.com/v1/batches
    api_key: ${LLM_API_KEY}
    completion_window: 24h
  execution:
    worker_concurrency: 3
    poll_interval_ms: 5000
    enable_webhooks: false
    webhook_url: ${BATCH_WEBHOOK_URL}
  observability:
    metrics_prefix: llm.batch
    trace_id_header: x-batch-trace-id
    log_level: info
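YAML parsers do not expand `${VAR}` placeholders on their own, so the template above needs a substitution pass after parsing. The sketch below assumes the file has already been parsed into a plain object by a YAML library of your choice. `interpolateEnv` and `resolveConfig` are hypothetical helpers, and the placeholder pattern is limited to upper-case names as used in the template.

```typescript
type Env = Record<string, string | undefined>;

// Replace every ${NAME} placeholder in a string, failing on unset variables.
function interpolateEnv(value: string, env: Env): string {
  return value.replace(/\$\{([A-Z0-9_]+)\}/g, (_match: string, name: string) => {
    const resolved = env[name];
    if (resolved === undefined) throw new Error(`Missing environment variable: ${name}`);
    return resolved;
  });
}

// Walk a parsed config tree and interpolate every string leaf.
function resolveConfig(node: unknown, env: Env): unknown {
  if (typeof node === 'string') return interpolateEnv(node, env);
  if (Array.isArray(node)) return node.map(item => resolveConfig(item, env));
  if (node !== null && typeof node === 'object') {
    const out: Record<string, unknown> = {};
    for (const [key, val] of Object.entries(node as Record<string, unknown>)) {
      out[key] = resolveConfig(val, env);
    }
    return out;
  }
  return node; // numbers, booleans, and null pass through unchanged
}
```

Failing on a missing variable at startup surfaces misconfiguration early, instead of sending a literal `${LLM_API_KEY}` to the provider.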
Quick Start Guide
- Provision dependencies: Run npm install bullmq ioredis tiktoken (BullMQ connects to Redis through ioredis) and start a Redis instance locally or via managed service.
- Configure environment: Set REDIS_URL and LLM_API_KEY in your .env file. Adjust max_tokens_per_batch to match your provider's limits.
- Initialize processor: Instantiate LLMBatchProcessor with your Redis URL and call enqueue() with an array of normalized LLMRequest objects.
- Monitor execution: Tail BullMQ job logs, track token throughput in your metrics dashboard, and verify result mapping against original request IDs.