odel tier selection logic. The following TypeScript implementation demonstrates a production-ready pattern using the official @anthropic-ai/sdk.
Step 1: Token Accounting & Usage Interception
Every API response includes a usage object. Intercepting this data at the client level ensures accurate billing attribution and enables real-time budget tracking.
import Anthropic from '@anthropic-ai/sdk';
interface TokenMetrics {
inputTokens: number;
outputTokens: number;
cacheWriteTokens: number;
cacheReadTokens: number;
model: string;
timestamp: Date;
}
class TokenTracker {
private metrics: TokenMetrics[] = [];
record(response: Anthropic.Message): void {
this.metrics.push({
inputTokens: response.usage.input_tokens,
outputTokens: response.usage.output_tokens,
cacheWriteTokens: response.usage.cache_creation_input_tokens ?? 0,
cacheReadTokens: response.usage.cache_read_input_tokens ?? 0,
model: response.model,
timestamp: new Date()
});
}
getSummary(): TokenMetrics[] {
return this.metrics;
}
}
Step 2: Cache-Aware Prompt Builder
Caching only yields ROI when applied to repeated prefixes exceeding ~1,024 tokens. The builder automatically injects cache_control markers for system instructions and static context blocks.
function buildCachedPrompt(systemInstruction: string, userMessage: string): {
system: Anthropic.MessageParam['content'];
messages: Anthropic.MessageParam[];
} {
return {
system: [
{
type: 'text',
text: systemInstruction,
cache_control: { type: 'ephemeral' }
}
],
messages: [{ role: 'user', content: userMessage }]
};
}
Step 3: Cost-Optimized Client Router
This class routes requests based on latency requirements, context repetition, and task complexity. It enforces max_tokens, tracks usage, and falls back to batch when appropriate.
import Anthropic from '@anthropic-ai/sdk';
type ModelTier = 'haiku' | 'sonnet' | 'opus';
interface RouteConfig {
model: ModelTier;
useCache: boolean;
useBatch: boolean;
maxOutputTokens: number;
}
class CostOptimizedClient {
private client: Anthropic;
private tracker: TokenTracker;
constructor(apiKey: string) {
this.client = new Anthropic({ apiKey });
this.tracker = new TokenTracker();
}
async routeRequest(
systemPrompt: string,
userMessage: string,
config: RouteConfig
): Promise<Anthropic.Message> {
const modelMap: Record<ModelTier, string> = {
haiku: 'claude-haiku-4-5',
sonnet: 'claude-sonnet-4-6',
opus: 'claude-opus-4-7'
};
const prompt = config.useCache
? buildCachedPrompt(systemPrompt, userMessage)
: { system: systemPrompt, messages: [{ role: 'user', content: userMessage }] };
const params: Anthropic.MessageCreateParams = {
model: modelMap[config.model],
max_tokens: config.maxOutputTokens,
...prompt
};
// Batch routing would use client.messages.batch.create() in production
// Here we simulate sync routing with cache awareness
const response = await this.client.messages.create(params);
this.tracker.record(response);
return response;
}
getUsageReport() {
return this.tracker.getSummary();
}
}
Architecture Decisions & Rationale
- Centralized Token Tracking: Decoupling usage logging from business logic prevents metric drift and enables consistent cost attribution across microservices.
- Explicit Cache TTL Awareness: The 5-minute ephemeral window means cache hits only occur within tight request bursts. In-memory routing works for single-instance deployments; distributed systems should pair cache markers with a Redis-backed session router to guarantee prefix matching.
- Model Tier Routing: Haiku handles classification, extraction, and short Q&A at 73% lower input cost than Sonnet. Opus should be gated behind explicit reasoning flags (e.g.,
requires_deep_analysis: true) to prevent accidental premium routing.
- Output Token Capping: Unbounded generation is the fastest path to budget overruns.
max_tokens must be enforced at the client layer, not left to default behavior.
Pitfall Guide
1. Unbounded Output Generation
Explanation: Omitting max_tokens allows the model to generate until it naturally stops, which can exceed 4,000+ tokens for complex prompts. Since output tokens cost 3–5× more than input, this silently inflates costs.
Fix: Always set max_tokens based on expected response length. Use streaming responses to monitor early token accumulation and abort if thresholds are breached.
2. Caching Short Contexts
Explanation: Cache writes carry a 25% premium over standard input pricing. If the cached prefix is under ~1,024 tokens, you need more than 4 cache hits to break even, which rarely occurs in low-frequency endpoints.
Fix: Enforce a minimum cache threshold. Only apply cache_control to system instructions, large document embeddings, or few-shot example blocks that repeat across requests.
3. Ignoring Cache Write Premium
Explanation: The first request to a cached prompt pays extra. Teams sometimes disable caching after seeing a cost spike on the initial call, missing the 90% savings on subsequent hits.
Fix: Pre-warm caches during off-peak hours or accept the one-time write cost for high-frequency endpoints. Monitor cache_creation_input_tokens to verify write frequency.
4. Over-Provisioning Model Capability
Explanation: Routing simple extraction or classification tasks to Opus or Sonnet wastes budget. Haiku 4.5 matches or exceeds higher-tier models for structured tasks at a fraction of the cost.
Fix: Implement a capability matrix. Route by task type: Haiku for classification/extraction, Sonnet for balanced reasoning, Opus only for multi-step logical deduction or complex code generation.
5. Silent Context Bloat
Explanation: Appending full conversation history to every API call causes input tokens to grow linearly with session length. A 20-turn chat can easily exceed 8,000 input tokens, multiplying costs per request.
Fix: Implement sliding windows or summary compression. Replace older turns with AI-generated summaries or truncate to the last N exchanges. Track input_tokens growth per session to detect drift.
6. Batch API Misuse
Explanation: Expecting real-time responses from the Batch API leads to broken UX. Batch jobs are queued and processed within a 24-hour window, optimized for throughput, not latency.
Fix: Design async workflows with webhooks, polling endpoints, or message queues. Reserve batch for overnight ETL, bulk content generation, and offline analysis pipelines.
7. Missing Cache Hit Metrics
Explanation: Deploying cache markers without tracking hit rates leaves optimization invisible. If hit rates stay below 30%, the cache write premium outweighs savings.
Fix: Log cache_read_input_tokens vs input_tokens. Set alerts when cache hit ratio drops below threshold. Adjust prompt structure or TTL routing to improve prefix matching.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|
| Real-time user chat | Sonnet 4.6 + Prompt Caching | Low latency required; repeated system prompts benefit from 90% input discount | ~$0.30/MTok input after cache warmup |
| Bulk document classification | Haiku 4.5 + Batch API | High volume, no latency constraint; Haiku handles extraction efficiently | ~$0.40/MTok input (batch rate) |
| Complex multi-step reasoning | Opus 4.7 + Sync | Task requires advanced logical deduction; latency acceptable for accuracy | $15.00/$75.00 per MTok (use sparingly) |
| Overnight data ETL pipeline | Sonnet 4.6 + Batch API | Large context windows processed asynchronously; 50% discount applies | $1.50/$7.50 per MTok |
| High-frequency API gateway | Sonnet 4.6 + Aggressive Caching | Repeated prefixes across thousands of requests maximize cache ROI | Break-even after 2 hits; 90% savings thereafter |
Configuration Template
// anthropic.config.ts
import Anthropic from '@anthropic-ai/sdk';
export const CLAUDE_CONFIG = {
apiKey: process.env.ANTHROPIC_API_KEY,
defaultModel: 'claude-sonnet-4-6',
maxOutputTokens: 1024,
cacheTTLSeconds: 300,
batchSLAHours: 24,
routing: {
haiku: {
model: 'claude-haiku-4-5',
useCases: ['classification', 'extraction', 'short_qa'],
inputCostPerMTok: 0.80,
outputCostPerMTok: 4.00
},
sonnet: {
model: 'claude-sonnet-4-6',
useCases: ['balanced_reasoning', 'code_assistance', 'general_chat'],
inputCostPerMTok: 3.00,
outputCostPerMTok: 15.00
},
opus: {
model: 'claude-opus-4-7',
useCases: ['complex_reasoning', 'multi_step_planning', 'advanced_code_gen'],
inputCostPerMTok: 15.00,
outputCostPerMTok: 75.00
}
},
cache: {
enabled: true,
minPrefixTokens: 1024,
type: 'ephemeral'
},
monitoring: {
trackInputTokens: true,
trackOutputTokens: true,
trackCacheWrites: true,
trackCacheReads: true,
alertThresholdPercent: 80
}
};
export const client = new Anthropic({ apiKey: CLAUDE_CONFIG.apiKey });
Quick Start Guide
- Install the SDK: Run
npm install @anthropic-ai/sdk and set ANTHROPIC_API_KEY in your environment.
- Initialize the client: Import the configuration template and instantiate
Anthropic. Attach a token tracking middleware to intercept usage objects.
- Apply cache markers: Wrap system instructions and static context blocks with
cache_control: { type: 'ephemeral' }. Ensure prefixes exceed 1,024 tokens for ROI.
- Route by workload: Use Haiku for classification/extraction, Sonnet for balanced tasks, and Opus only for complex reasoning. Queue non-urgent jobs to the Batch API for 50% cost reduction.
- Monitor & alert: Log
input_tokens, output_tokens, cache_creation_input_tokens, and cache_read_input_tokens. Set budget alerts at 80% of monthly forecast to catch drift before overages occur.