try to your observability stack. The implementation must be non-blocking, streaming-aware, and decoupled from business logic.
Step 1: Design the Interception Layer
Place cost tracking in a middleware or HTTP client wrapper. This ensures every LLM invocation passes through a single enforcement point. Avoid scattering tracking logic across services, as this creates inconsistent metadata and makes aggregation impossible.
LLM SDKs return token counts in the response payload. For standard completions, this is straightforward. For streaming responses, you must accumulate tokens from the final chunk or use SDK-specific streaming utilities. Always capture input tokens, output tokens, model identifier, and request metadata.
Step 3: Calculate Costs Dynamically
Pricing changes frequently. Hardcoding rates creates technical debt. Instead, maintain a pricing registry that supports runtime updates, fallback tiers, and cache-aware pricing. Calculate costs using the standard per-million-token formula.
Step 4: Emit Structured Telemetry
Send cost data to your analytics pipeline asynchronously. Use an event queue or message bus to avoid blocking the request path. Include trace IDs, agent identifiers, and feature flags to enable downstream slicing.
Implementation (TypeScript)
import { EventEmitter } from 'events';
import { OpenAI } from 'openai';
import type { ChatCompletion, ChatCompletionChunk } from 'openai/resources';
interface PricingTier {
inputPerMillion: number;
outputPerMillion: number;
cacheHitDiscount?: number;
}
interface CostRecord {
agentId: string;
model: string;
inputTokens: number;
outputTokens: number;
cost: number;
timestamp: Date;
traceId: string;
metadata: Record<string, unknown>;
}
class PricingRegistry {
private rates: Record<string, PricingTier> = {
'gpt-4-turbo': { inputPerMillion: 10.0, outputPerMillion: 30.0 },
'gpt-4': { inputPerMillion: 30.0, outputPerMillion: 60.0 },
'gpt-3.5-turbo': { inputPerMillion: 0.5, outputPerMillion: 1.5 },
'gpt-4-mini': { inputPerMillion: 0.15, outputPerMillion: 0.60 },
};
getRate(model: string): PricingTier {
return this.rates[model] ?? { inputPerMillion: 0, outputPerMillion: 0 };
}
updateRate(model: string, tier: PricingTier): void {
this.rates[model] = tier;
}
}
class LLMCostInterceptor {
private pricing: PricingRegistry;
private telemetryQueue: EventEmitter;
constructor() {
this.pricing = new PricingRegistry();
this.telemetryQueue = new EventEmitter();
this.telemetryQueue.on('cost-record', this.persistCost.bind(this));
}
calculateCost(model: string, inputTokens: number, outputTokens: number): number {
const tier = this.pricing.getRate(model);
const inputCost = (inputTokens / 1_000_000) * tier.inputPerMillion;
const outputCost = (outputTokens / 1_000_000) * tier.outputPerMillion;
return inputCost + outputCost;
}
async interceptCompletion(
agentId: string,
traceId: string,
client: OpenAI,
params: Parameters<OpenAI['chat']['completions']['create']>[0]
): Promise<ChatCompletion> {
const startTime = new Date();
const response = await client.chat.completions.create(params);
const usage = response.usage;
if (!usage) {
throw new Error('LLM response missing usage metadata');
}
const cost = this.calculateCost(response.model, usage.prompt_tokens, usage.completion_tokens);
const record: CostRecord = {
agentId,
model: response.model,
inputTokens: usage.prompt_tokens,
outputTokens: usage.completion_tokens,
cost,
timestamp: startTime,
traceId,
metadata: {
temperature: params.temperature,
maxTokens: params.max_tokens,
stream: false,
},
};
this.telemetryQueue.emit('cost-record', record);
return response;
}
private persistCost(record: CostRecord): void {
// Route to OpenTelemetry, Datadog, CloudWatch, or time-series DB
console.log(JSON.stringify({
event: 'llm.cost.tracked',
...record,
hour: record.timestamp.toISOString().slice(0, 13)
}));
}
}
Architecture Rationale
- Middleware over scattered logging: Centralizing interception guarantees consistent metadata injection and eliminates drift between services.
- Async telemetry emission: Cost calculation and persistence must never block the critical path. Using an event queue decouples tracking from request latency.
- Dynamic pricing registry: LLM providers adjust rates monthly. A registry pattern allows hot-reloading without deployments and supports cache-aware pricing tiers.
- Trace ID propagation: Embedding distributed tracing identifiers enables correlation between cost records and application logs, making root-cause analysis deterministic.
Pitfall Guide
1. Static Pricing Hardcoding
Explanation: Embedding token rates directly in business logic creates stale data when providers update pricing. Teams often discover discrepancies weeks after a rate change.
Fix: Implement a pricing registry with external configuration (environment variables, feature flags, or a lightweight config service). Schedule periodic validation against provider documentation.
2. Ignoring Prompt Caching Costs
Explanation: Modern LLM APIs offer discounted rates for cached prompt prefixes. Tracking only raw token counts overstates costs by 15β40% for repetitive workflows.
Fix: Extract cache_creation_input_tokens and cache_read_input_tokens from the response payload. Apply discounted rates to cached tokens and log cache hit ratios for optimization.
3. Streaming Response Blind Spots
Explanation: Streaming completions do not return usage metadata until the final chunk. Naive implementations either skip tracking or double-count tokens.
Fix: Accumulate tokens from the done event or use SDK-specific streaming utilities that expose final usage. Ensure the interceptor waits for stream completion before emitting cost records.
4. Alert Threshold Rigidity
Explanation: Fixed thresholds (e.g., "alert at $50/hour") generate false positives during traffic spikes and miss gradual degradation.
Fix: Implement adaptive baselines using rolling windows. Alert when hourly spend exceeds the pro-rata daily budget, or when per-agent spend deviates >3Ο from a 14-day moving average.
5. Synchronous Cost Calculation
Explanation: Performing pricing lookups and database writes on the request path adds 10β50ms of latency per LLM call, which compounds across high-throughput agents.
Fix: Offload persistence to an async queue or message broker. Use in-memory aggregation for real-time budget checks, and batch-write to time-series storage every 5β10 seconds.
6. Missing Contextual Metadata
Explanation: Cost records without trace IDs, feature flags, or user segments become unqueryable aggregates. Engineering cannot slice data to answer "which customer tier drives premium model usage?"
Fix: Enforce metadata propagation at the API gateway or agent orchestrator. Require traceId, agentId, and tenantId as mandatory fields before emitting telemetry.
7. Timezone Fragmentation
Explanation: Storing timestamps in local timezones breaks hourly aggregation and makes cross-region comparison impossible.
Fix: Normalize all timestamps to UTC at ingestion. Store hour-level buckets as ISO 8601 prefixes (2024-03-15T14) to enable efficient time-series partitioning.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|
| Small team, single agent | Middleware interceptor + async queue | Low overhead, fast deployment, sufficient visibility | Minimal engineering cost, immediate anomaly detection |
| Multi-tenant SaaS, 10+ agents | Stateful budget proxy + OpenTelemetry export | Centralized enforcement, tenant isolation, standardized tracing | Higher infra cost, prevents runaway spend across tenants |
| High-throughput batch processing | SDK wrapper + batch telemetry writer | Optimized for throughput, reduces network calls, aligns with async pipelines | Reduces telemetry overhead by 60%, improves batch cost accuracy |
| Experimental/Research workloads | Lightweight decorator + local log aggregation | Fast iteration, no infra dependencies, easy teardown | Low setup cost, acceptable for non-production environments |
Configuration Template
// telemetry.config.ts
export const LLM_TELEMETRY_CONFIG = {
pricing: {
refreshIntervalMs: 3_600_000, // 1 hour
fallbackModel: 'gpt-3.5-turbo',
cacheDiscountMultiplier: 0.5,
},
telemetry: {
batchSize: 100,
flushIntervalMs: 5_000,
transport: 'opentelemetry', // or 'datadog', 'cloudwatch', 'postgres'
},
alerting: {
proRataThreshold: 0.3, // Alert if 30% of daily budget spent in first 8.3% of day
deviationSigma: 3,
modelsToMonitor: ['gpt-4', 'gpt-4-turbo', 'claude-3-opus'],
},
enforcement: {
dailyBudgetPerAgent: {
'customer-support-v2': 15.00,
'code-review-pipeline': 45.00,
'data-extraction-worker': 8.00,
},
actionOnExceed: 'block_and_alert', // or 'throttle', 'log_only'
},
};
Quick Start Guide
- Instrument your primary LLM client: Replace direct SDK calls with the
LLMCostInterceptor wrapper. Pass agentId and traceId from your request context.
- Configure the pricing registry: Load initial rates from environment variables or a config file. Enable automatic refresh to stay aligned with provider updates.
- Route telemetry to your stack: Point the async queue to OpenTelemetry, Datadog, or a Postgres table. Ensure timestamps are UTC and partitioned by day.
- Deploy budget enforcement: Add a middleware check that compares cumulative daily spend against
dailyBudgetPerAgent. Block requests and emit alerts when thresholds are breached.
- Validate with a test run: Trigger a known agent workflow. Verify that cost records appear in your analytics dashboard within 5 seconds, and confirm that hourly aggregation matches expected token counts.
Granular LLM cost tracking is not an accounting exercise. It is an engineering control surface that directly impacts architecture, performance, and financial predictability. By intercepting calls at the middleware layer, calculating costs against dynamic pricing, and enforcing budgets in real time, you transform opaque API invoices into actionable telemetry. The agents that previously burned budget in the dark become visible, optimizable, and accountable.