ts explicitly.
import Anthropic from "@anthropic-ai/sdk";
import { APIConnectionError, RateLimitError, APIError } from "@anthropic-ai/sdk";
interface OrchestratorConfig {
  model: string;
  maxTokens: number;
  timeoutMs: number;
  maxRetries: number;
}

export class LLMOrchestrator {
  // Read-only rather than private: the streaming and tool-loop helpers need the client and settings.
  readonly client: Anthropic;
  readonly config: OrchestratorConfig;

  constructor(config: Partial<OrchestratorConfig> = {}) {
    this.config = {
      model: "claude-sonnet-4-6",
      maxTokens: 1024,
      timeoutMs: 30000,
      maxRetries: 3,
      ...config,
    };
    // The API key is read from the ANTHROPIC_API_KEY environment variable by default.
    this.client = new Anthropic({
      timeout: this.config.timeoutMs,
      maxRetries: this.config.maxRetries,
    });
  }
}
Rationale: Setting the timeout explicitly keeps a request from waiting out the SDK's much longer default during network degradation. The SDK's built-in retry logic handles transient failures such as connection errors, rate limits, and 5xx responses; capping the retry count bounds how long a persistently failing call can stall the caller.
2. Context-Aware Conversation Manager
Maintaining conversation history requires more than pushing to an array. Production systems must enforce token limits, serialize system prompts, and handle cache markers efficiently.
// Message types are exported from the resources subpath (also reachable as Anthropic.MessageParam and Anthropic.Tool).
import type { MessageParam, Tool } from "@anthropic-ai/sdk/resources/messages";

export class ContextManager {
  private history: MessageParam[] = [];
  private systemPrompt: string;

  constructor(systemInstruction: string) {
    this.systemPrompt = systemInstruction;
  }

  append(role: "user" | "assistant", content: string): void {
    this.history.push({ role, content });
  }

  // Builds the request fields owned by the context; the caller supplies the model.
  getPayload(maxTokens: number): {
    system: { type: "text"; text: string; cache_control: { type: "ephemeral" } }[];
    messages: MessageParam[];
    max_tokens: number;
  } {
    return {
      system: [
        {
          type: "text",
          text: this.systemPrompt,
          cache_control: { type: "ephemeral" },
        },
      ],
      messages: [...this.history], // copy, so callers cannot mutate the stored history
      max_tokens: maxTokens,
    };
  }

  clear(): void {
    this.history = [];
  }
}
Rationale: Encapsulating history in one object per conversation prevents accidental mutation across concurrent requests, and getPayload() hands out a copy of the array rather than the live history. The cache_control marker on the system block enables Anthropic's prompt caching for that prefix. Keeping the system prompt static across turns maximizes cache hit rates.
3. Streaming Execution Strategy
Streaming requires iterating over SSE events, buffering deltas, and extracting final usage metrics. The SDK's stream() method returns a MessageStream, an async-iterable object that emits structured events and exposes helpers such as finalMessage().
export async function streamResponse(
  orchestrator: LLMOrchestrator,
  context: ContextManager,
  onChunk: (text: string) => void
): Promise<Anthropic.Message> {
  const stream = orchestrator.client.messages.stream({
    ...context.getPayload(orchestrator.config.maxTokens),
    model: orchestrator.config.model, // required by the API; getPayload() does not set it
  });
  for await (const event of stream) {
    // Forward only text deltas; other event types carry metadata or tool input.
    if (
      event.type === "content_block_delta" &&
      event.delta.type === "text_delta"
    ) {
      onChunk(event.delta.text);
    }
  }
  // Resolves once the stream has ended; includes usage and stop_reason.
  return await stream.finalMessage();
}
Rationale: Decoupling chunk emission via a callback enables UI integration without blocking the event loop. The finalMessage() call resolves only after the stream completes, providing accurate token usage and stop reasons.
4. Tool-Use Loop
Tool execution requires a multi-turn loop: the model requests a tool, the application executes it, and the result is fed back. This loop must terminate cleanly and handle malformed tool calls.
interface ToolDefinition {
  name: string;
  description: string;
  schema: { type: "object"; properties: Record<string, any>; required: string[] };
}
export async function executeWithTools(
  orchestrator: LLMOrchestrator,
  context: ContextManager,
  tools: ToolDefinition[],
  toolHandlers: Record<string, (input: any) => Promise<string>>
): Promise<string> {
  const anthropicTools: Tool[] = tools.map((t) => ({
    name: t.name,
    description: t.description,
    input_schema: t.schema,
  }));
  const base = context.getPayload(orchestrator.config.maxTokens);
  // Local transcript: tool_use and tool_result must travel as structured content blocks, not strings.
  const messages: MessageParam[] = [...base.messages];
  const MAX_TURNS = 5;
  for (let turn = 0; turn < MAX_TURNS; turn++) {
    const response = await orchestrator.client.messages.create({
      ...base,
      model: orchestrator.config.model, // required by the API; getPayload() does not set it
      messages,
      tools: anthropicTools,
    });
    // Echo the assistant turn back unchanged so each tool_result can reference its tool_use id.
    messages.push({ role: "assistant", content: response.content });
    if (response.stop_reason !== "tool_use") {
      // end_turn, max_tokens, or stop_sequence: return whatever text was produced.
      const text = response.content
        .filter((c): c is Anthropic.TextBlock => c.type === "text")
        .map((c) => c.text)
        .join("");
      if (text) context.append("assistant", text);
      return text;
    }
    const toolResults: Anthropic.ToolResultBlockParam[] = [];
    for (const block of response.content) {
      if (block.type !== "tool_use") continue;
      const handler = toolHandlers[block.name];
      if (!handler) {
        throw new Error(`Unhandled tool: ${block.name}`);
      }
      toolResults.push({
        type: "tool_result",
        tool_use_id: block.id,
        content: await handler(block.input),
      });
    }
    // All results for this turn go back in a single user message.
    messages.push({ role: "user", content: toolResults });
  }
  throw new Error("Tool execution exceeded maximum turns");
}
Rationale: The loop caps the number of turns so a model that keeps requesting tools cannot run indefinitely, and any stop reason other than tool_use ends the loop instead of spinning. The assistant's tool_use blocks and the matching tool_result blocks are sent back as structured content blocks in a local transcript, which is the format the Messages API expects; only the final text is appended to the shared context. Handler validation fails fast, preventing silent degradation.
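The section's opening mentions malformed tool calls. One way to handle them, sketched here under assumptions, is to check the model's input against the schema's required list before invoking a handler and to report the problem back to the model instead of throwing. The helper names (missingRequiredFields, malformedCallResult) are hypothetical; is_error is the flag on a tool_result block that marks a failed call.

```typescript
// Local stand-ins for the schema and tool_result shapes used in this section.
type ObjectSchema = { type: "object"; properties: Record<string, unknown>; required: string[] };
type ToolErrorResult = { type: "tool_result"; tool_use_id: string; content: string; is_error: true };

// Returns the required keys that are absent from the model-supplied input.
function missingRequiredFields(schema: ObjectSchema, input: unknown): string[] {
  if (typeof input !== "object" || input === null) return [...schema.required];
  const record = input as Record<string, unknown>;
  return schema.required.filter((key) => record[key] === undefined);
}

// Builds an error result so the model can correct its call on the next turn.
function malformedCallResult(toolUseId: string, missing: string[]): ToolErrorResult {
  return {
    type: "tool_result",
    tool_use_id: toolUseId,
    content: `Missing required fields: ${missing.join(", ")}`,
    is_error: true,
  };
}
```

Returning an error result trades the fail-fast behavior above for recoverability; which one fits depends on whether a bad call should stop the pipeline or give the model another attempt.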
Pitfall Guide
1. Unbounded Context Accumulation
Explanation: Continuously appending messages without truncation eventually exceeds the model's context window, causing invalid_request_error or silent truncation.
Fix: Implement a sliding window or token-aware truncation strategy. Summarize older turns when approaching 75% of the context limit.
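A minimal sliding-window sketch of that fix follows. It assumes string-only message content and a rough four-characters-per-token estimate; both are simplifications, and truncateHistory and estimateTokens are hypothetical names. A production version would use real token counts and avoid splitting tool_use/tool_result pairs.

```typescript
// Simplified message shape (assumption): string content only.
type ChatMessage = { role: "user" | "assistant"; content: string };

// Rough heuristic (assumption): about 4 characters per token for English text.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keeps the newest messages whose estimated total fits within the budget.
function truncateHistory(history: ChatMessage[], budgetTokens: number): ChatMessage[] {
  const kept: ChatMessage[] = [];
  let used = 0;
  for (const msg of [...history].reverse()) {
    const cost = estimateTokens(msg.content);
    if (used + cost > budgetTokens) break;
    kept.unshift(msg);
    used += cost;
  }
  // Start the window on a user turn so the conversation stays well-formed.
  while (kept.length > 0 && kept[0]?.role !== "user") kept.shift();
  return kept;
}
```

Dropping whole messages from the oldest end keeps the logic simple; summarizing the dropped turns into a single message is a natural extension.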
2. Ignoring Cache Hit Diagnostics
Explanation: Developers enable cache_control but never verify if caches are actually hitting. Without monitoring, cost savings remain theoretical.
Fix: Log response.usage.cache_read_input_tokens on every call. Alert when cache hit rates drop below 60%, indicating prompt drift or dynamic system instructions.
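The hit rate can be derived from the usage object on each response. The sketch below assumes the usage fields input_tokens, cache_creation_input_tokens, and cache_read_input_tokens, where input_tokens counts only the uncached portion; cacheHitRate and shouldAlert are hypothetical helper names.

```typescript
// Subset of the response usage object; the cache fields may be null or absent.
interface UsageLike {
  input_tokens: number;
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
}

// Fraction of prompt tokens served from cache for one response (0 when there is no input).
function cacheHitRate(usage: UsageLike): number {
  const read = usage.cache_read_input_tokens ?? 0;
  const written = usage.cache_creation_input_tokens ?? 0;
  const total = usage.input_tokens + read + written;
  return total === 0 ? 0 : read / total;
}

// True when the hit rate falls below the threshold (default taken from the 60% figure above).
function shouldAlert(usage: UsageLike, threshold = 0.6): boolean {
  return cacheHitRate(usage) < threshold;
}
```

In practice the rate is more meaningful when aggregated over a window of requests, since the first call after a cache expiry always reads zero cached tokens.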
3. Sequential Tool Execution
Explanation: Awaiting tool handlers one at a time when they could run in parallel increases latency unnecessarily.
Fix: Use Promise.all() for independent tool calls within a single turn. Anthropic requires the tool results for a turn to be submitted together in the same user message, but the handlers themselves can execute concurrently.
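A sketch of the concurrent variant, using local stand-ins for the SDK's block types; runToolsConcurrently is a hypothetical name. Promise.all preserves input order, so each result still lines up with its tool_use id.

```typescript
// Local stand-ins for the SDK's tool_use and tool_result block shapes.
type ToolUse = { type: "tool_use"; id: string; name: string; input: unknown };
type ToolResult = { type: "tool_result"; tool_use_id: string; content: string };
type Handlers = Record<string, (input: unknown) => Promise<string>>;

// Runs every tool call from one assistant turn concurrently; results keep the input order.
async function runToolsConcurrently(calls: ToolUse[], handlers: Handlers): Promise<ToolResult[]> {
  return Promise.all(
    calls.map(async (call): Promise<ToolResult> => {
      const handler = handlers[call.name];
      if (!handler) {
        throw new Error(`Unhandled tool: ${call.name}`);
      }
      return { type: "tool_result", tool_use_id: call.id, content: await handler(call.input) };
    })
  );
}
```

The returned array can be sent as the content of a single user message. This only suits calls that are independent of each other; handlers with ordering requirements or shared mutable state should stay sequential.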
4. Race Conditions in Streaming Buffers
Explanation: Directly writing stream deltas to UI state without debouncing or batching causes excessive re-renders and layout thrashing.
Fix: Accumulate chunks in a string or array buffer rather than writing each delta to state. Flush to state only when a complete sentence or a batch of a set size has arrived, and once more when the stream ends.
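One possible shape for such a buffer is sketched below. ChunkBuffer and its 80-character default are hypothetical; the flush callback would be whatever updates UI state in the application.

```typescript
// Accumulates streamed text and releases it in batches instead of once per delta.
class ChunkBuffer {
  private pending = "";
  private readonly flush: (text: string) => void;
  private readonly maxChars: number;

  constructor(flush: (text: string) => void, maxChars = 80) {
    this.flush = flush;
    this.maxChars = maxChars;
  }

  // Adds a delta; flushes when a sentence ends or the batch is large enough.
  push(delta: string): void {
    this.pending += delta;
    if (/[.!?]\s*$/.test(this.pending) || this.pending.length >= this.maxChars) {
      this.end();
    }
  }

  // Flushes whatever remains; call once after the stream completes.
  end(): void {
    if (this.pending.length > 0) {
      this.flush(this.pending);
      this.pending = "";
    }
  }
}
```

An instance's push method can be passed as the onChunk callback, for example (text) => buffer.push(text), with buffer.end() called after the final message resolves.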
5. Hardcoded Stop Reason Assumptions
Explanation: Assuming stop_reason is always end_turn or tool_use ignores max_tokens truncation, which leaves responses incomplete.
Fix: Always check stop_reason === "max_tokens" and implement continuation logic or user-facing warnings for truncated outputs.
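A small dispatcher makes the handling explicit. This is a sketch: the NextStep labels and nextStepFor are hypothetical, and the default branch exists so that a null value or a stop reason this code does not know about is surfaced rather than treated as success.

```typescript
// Hypothetical labels for what the caller should do next.
type NextStep = "done" | "run_tools" | "continue" | "review";

// Maps a response's stop_reason to an explicit next action.
function nextStepFor(stopReason: string | null): NextStep {
  switch (stopReason) {
    case "end_turn":
    case "stop_sequence":
      return "done";
    case "tool_use":
      return "run_tools";
    case "max_tokens":
      return "continue"; // output was cut off; request a continuation or warn the user
    default:
      return "review"; // null or an unrecognized value
  }
}
```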
6. Dynamic System Prompts Breaking Caching
Explanation: Injecting timestamps, user IDs, or session tokens into the system prompt array invalidates the cache prefix, nullifying cost savings.
Fix: Keep system prompts static. Move dynamic context into the messages array or use structured metadata fields that don't affect caching.
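The separation might look like the following sketch, where the system block is byte-identical across requests and per-request details ride in the user message. buildRequest, the bracketed context prefix, and the session fields are hypothetical illustrations, not a prescribed format.

```typescript
// Local stand-ins for the request shapes used earlier in this article.
type SystemBlock = { type: "text"; text: string; cache_control?: { type: "ephemeral" } };
type Turn = { role: "user" | "assistant"; content: string };

// Never interpolate per-request values into this string.
const STATIC_SYSTEM_PROMPT = "You are a precision-focused technical assistant.";

// Builds a request whose cached prefix is the same for every user and timestamp.
function buildRequest(userText: string, session: { userId: string; now: Date }) {
  const system: SystemBlock[] = [
    { type: "text", text: STATIC_SYSTEM_PROMPT, cache_control: { type: "ephemeral" } },
  ];
  const messages: Turn[] = [
    {
      role: "user",
      // Dynamic context lives after the cached prefix, so it cannot invalidate it.
      content: `[user_id=${session.userId} time=${session.now.toISOString()}]\n${userText}`,
    },
  ];
  return { system, messages };
}
```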
7. Missing Rate Limit Backoff
Explanation: Retrying immediately on RateLimitError amplifies throttling and wastes quota.
Fix: Implement exponential backoff with jitter. Respect retry-after headers when present, and queue requests during sustained throttling.
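For application-level queues layered above the SDK's own retries, the delay calculation can be sketched as below. backoffDelayMs is a hypothetical helper using the "full jitter" variant; the base and cap values are illustrative assumptions, and the random source is injectable so the function can be tested deterministically.

```typescript
// Computes how long to wait before retry number `attempt` (0-based).
function backoffDelayMs(
  attempt: number,
  retryAfterSeconds?: number,
  random: () => number = Math.random,
  baseMs = 500,
  capMs = 30_000
): number {
  // A server-provided retry-after value takes precedence over the computed delay.
  if (retryAfterSeconds !== undefined && retryAfterSeconds >= 0) {
    return retryAfterSeconds * 1000;
  }
  // Exponential growth, capped, then a uniformly random delay below that ceiling.
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}
```

Jitter spreads retries from many clients across the window, which avoids synchronized bursts once a rate limit resets.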
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Internal CLI automation | Synchronous create() | Simple, predictable, no UI latency concerns | Baseline |
| Customer-facing chat UI | Streaming stream() | Reduces perceived latency, enables typing indicators | Baseline |
| RAG pipeline with fixed instructions | Cached Context + create() | Reuses system prompt prefix across thousands of queries | Up to ~90% lower cost on the cached prefix when the cache hits; cache writes cost slightly more than regular input |
| Multi-step data aggregation | Tool-Use Loop | Enables deterministic API chaining and validation | Baseline + tool execution time |
| High-concurrency webhook processing | Streaming + Queue + Retry | Prevents timeout failures under load spikes | Slight increase due to retries |
Configuration Template
// llm.config.ts
import Anthropic from "@anthropic-ai/sdk";

export const createAnthropicClient = () => {
  return new Anthropic({
    apiKey: process.env.ANTHROPIC_API_KEY,
    timeout: 25000,
    maxRetries: 2,
    defaultHeaders: {
      // Only for direct calls from a browser; remove for server-side or proxied use, since it exposes the API key to the client.
      "anthropic-dangerous-direct-browser-access": "true",
    },
  });
};

export const DEFAULT_SYSTEM_PROMPT = `You are a precision-focused technical assistant.
Always return structured data when requested.
Never hallucinate API responses or invent parameters.`;

export const TOOL_SCHEMA = {
  lookup_inventory: {
    name: "lookup_inventory",
    description: "Retrieve stock levels for a given SKU",
    schema: {
      type: "object" as const, // literal type, so the entry matches ToolDefinition
      properties: {
        sku: { type: "string", description: "Stock keeping unit identifier" },
        warehouse: { type: "string", description: "Facility code" },
      },
      required: ["sku"],
    },
  },
};
Quick Start Guide
- Initialize the project: Run npm init -y && npm install @anthropic-ai/sdk typescript ts-node @types/node. Create a tsconfig.json with module: "NodeNext" and strict: true.
- Set credentials: Export ANTHROPIC_API_KEY in your shell or .env file. Verify with echo $ANTHROPIC_API_KEY.
- Bootstrap the orchestrator: Copy the LLMOrchestrator and ContextManager classes into src/orchestrator.ts. Instantiate with new LLMOrchestrator({ model: "claude-sonnet-4-6" }).
- Test streaming: Implement the streamResponse function with a simple callback that logs chunks to console. Send a 500-word generation request.
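- Validate caching: Confirm cache_control is set on your system prompt, run the same request twice, and compare usage.cache_read_input_tokens in the second response. Expect a non-zero value on the repeat call, provided the cached prefix meets the model's minimum cacheable length; shorter prompts are processed without caching.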
Architecting LLM integrations requires treating the API as an event-driven orchestration layer, not a stateless function call. By enforcing context boundaries, monitoring cache efficiency, and structuring tool loops deterministically, you transform experimental prompts into reliable production systems.