l name, and pinned version tag.
- Prompt Artifact: The template string, variable bindings, and version hash.
- Hyperparameters: Temperature, top_p, max_tokens, seed.
- RAG Context: Embedding model, chunk size, similarity threshold, index version.
- Evaluation Baseline: Hash of the evaluation dataset used to validate this version.
Step 2: Compute Deterministic Hashes
Generate a content-addressable hash for the manifest. This hash serves as the unique identifier for the version. Use SHA-256 for integrity; avoid cryptographic overhead if performance is critical, but ensure collision resistance.
TypeScript Implementation:
import { createHash } from 'crypto';
export interface LLMVersionManifest {
provider: string;
model: string;
modelVersion: string; // Pinned tag, e.g., "gpt-4-0613"
promptTemplate: string;
promptVariables: Record<string, string | number | boolean>;
hyperparameters: {
temperature: number;
topP: number;
maxTokens: number;
seed?: number;
};
ragConfig?: {
embeddingModel: string;
chunkSize: number;
similarityThreshold: number;
indexVersion: string;
};
evalDatasetHash: string;
createdAt: number;
}
export class LLMVersionManager {
/**
* Computes a deterministic hash for the version manifest.
* Normalizes inputs to ensure consistent hashing across environments.
*/
static computeVersionHash(manifest: LLMVersionManifest): string {
const normalized = {
...manifest,
// Sort keys to ensure deterministic serialization
promptVariables: this.sortObject(manifest.promptVariables),
hyperparameters: this.sortObject(manifest.hyperparameters),
ragConfig: manifest.ragConfig ? this.sortObject(manifest.ragConfig) : undefined,
};
const payload = JSON.stringify(normalized);
return createHash('sha256').update(payload).digest('hex').substring(0, 12);
}
/**
* Validates that a manifest is production-ready.
* Checks for mutable references and missing critical fields.
*/
static validateManifest(manifest: LLMVersionManifest): ValidationResult {
const errors: string[] = [];
if (!manifest.modelVersion.includes('-')) {
errors.push('Model version must be pinned (e.g., gpt-4-0613). Rolling tags are prohibited.');
}
if (manifest.hyperparameters.temperature > 1.0) {
errors.push('Temperature > 1.0 increases variance; ensure eval coverage is sufficient.');
}
if (!manifest.evalDatasetHash) {
errors.push('Evaluation dataset hash is required for version registration.');
}
return {
valid: errors.length === 0,
errors,
versionId: this.computeVersionHash(manifest),
};
}
private static sortObject(obj: Record<string, any>): Record<string, any> {
return Object.keys(obj).sort().reduce((acc, key) => {
acc[key] = obj[key];
return acc;
}, {} as Record<string, any>);
}
}
interface ValidationResult {
valid: boolean;
errors: string[];
versionId: string;
}
Step 3: Architecture Decisions
Immutable Storage: Store version manifests in an append-only registry. Once a version is registered and evaluated, it must not be modified. Updates require a new version ID.
Snapshotting: For RAG systems, versioning must include snapshots of the vector index state. If the embedding model changes, the entire index must be rebuilt and versioned as a new artifact. Partial updates lead to hybrid retrieval failures.
Evaluation Binding: Every version must be bound to a specific evaluation run. The manifest should reference the evaluation results, including latency, cost, and accuracy metrics. This enables data-driven rollouts.
Code Example: Versioned Inference Wrapper
export class VersionedLLMClient {
private registry: Map<string, LLMVersionManifest>;
constructor() {
this.registry = new Map();
}
async registerVersion(manifest: LLMVersionManifest): Promise<string> {
const validation = LLMVersionManager.validateManifest(manifest);
if (!validation.valid) {
throw new Error(`Invalid manifest: ${validation.errors.join(', ')}`);
}
const versionId = validation.versionId;
this.registry.set(versionId, manifest);
return versionId;
}
async invoke(versionId: string, input: Record<string, any>): Promise<LLMResponse> {
const manifest = this.registry.get(versionId);
if (!manifest) {
throw new Error(`Version ${versionId} not found in registry.`);
}
// Render prompt with versioned template and variables
const prompt = this.renderPrompt(manifest.promptTemplate, {
...manifest.promptVariables,
...input,
});
// Call provider with pinned model and hyperparameters
const response = await this.callProvider({
model: `${manifest.provider}:${manifest.modelVersion}`,
prompt,
...manifest.hyperparameters,
});
// Attach version metadata to response for tracing
return {
...response,
metadata: {
versionId,
model: manifest.modelVersion,
promptHash: createHash('sha256').update(prompt).digest('hex').substring(0, 8),
},
};
}
private renderPrompt(template: string, variables: Record<string, any>): string {
return template.replace(/\{\{(\w+)\}\}/g, (_, key) => variables[key] ?? '');
}
private async callProvider(config: any): Promise<any> {
// Implementation for provider API call
return { content: 'Response', usage: { tokens: 100 } };
}
}
interface LLMResponse {
content: string;
usage: { tokens: number };
metadata: {
versionId: string;
model: string;
promptHash: string;
};
}
Pitfall Guide
1. Using Rolling Model Tags
- Mistake: Referencing
gpt-4 or claude-3-opus without a date suffix.
- Impact: Provider updates change model behavior silently. Your application may degrade or violate compliance requirements without any code change.
- Fix: Always pin to specific versions (e.g.,
gpt-4-0613). Implement a policy that rejects rolling tags in CI/CD.
2. Ignoring Hyperparameter Volatility
- Mistake: Versioning the model and prompt but treating temperature and top_p as runtime configuration.
- Impact: Changing temperature alters output distribution, breaking deterministic tests and user experience consistency.
- Fix: Include all inference parameters in the version manifest. Treat hyperparameters as code.
3. Mutable Prompt Storage
- Mistake: Storing prompts in a database with update capabilities and referencing them by ID.
- Impact: Updating a prompt changes the behavior of all active versions that reference it. You lose the ability to roll back to a previous prompt state.
- Fix: Store prompts as immutable artifacts. Prompts should be versioned alongside the manifest, not referenced dynamically.
4. Evaluation Data Drift
- Mistake: Reusing the same evaluation dataset across multiple versions without hashing or versioning the dataset.
- Impact: If the evaluation dataset changes, comparisons between versions become invalid. You cannot determine if performance changes are due to the model or the test data.
- Fix: Version evaluation datasets. Compute a hash of the dataset and include it in the manifest. Never mutate an evaluation set; create a new version instead.
5. RAG Index Inconsistency
- Mistake: Updating the embedding model or chunking strategy without rebuilding and versioning the vector index.
- Impact: Retrieval quality degrades due to semantic mismatch. The LLM receives irrelevant context, causing hallucinations.
- Fix: Version the RAG pipeline holistically. Any change to embedding, chunking, or indexing requires a new index version and a corresponding LLM version update.
6. Context Window Changes
- Mistake: Upgrading to a model with a different context window without adjusting prompt truncation logic.
- Impact: Prompts may be truncated differently, losing critical information. Or, costs may spike if the new model charges differently for context.
- Fix: Include context window limits in the manifest validation. Test truncation behavior for each version.
7. Missing Seed Determinism
- Mistake: Relying on a seed for reproducibility without versioning the seed or understanding provider support.
- Impact: Seeds may not be supported by all providers or models. Even with a seed, non-deterministic hardware operations can cause variance.
- Fix: Use seeds for debugging and testing, but do not rely on them for production determinism. Version the seed value, but design evaluations to account for stochastic variance.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|
| Prototype / MVP | Pinned model tags + Prompt versioning | Balances speed with safety. Avoids rolling updates while allowing rapid iteration. | Low |
| Production RAG App | Full Manifest + Index Versioning + Eval Binding | Ensures retrieval consistency and allows precise rollback of the entire pipeline. | Medium |
| Fine-Tuned Model | Model Checkpoint Versioning + Eval Dataset Hash | Fine-tuned models require tracking training data and hyperparameters for reproducibility. | High |
| Multi-Provider Setup | Abstraction Layer + Provider-Specific Manifests | Normalizes versioning across providers while capturing provider-specific constraints. | Medium |
| Regulated Environment | Immutable Registry + Audit Trail + Signed Manifests | Required for compliance. Ensures full traceability and tamper-proof version history. | High |
Configuration Template
Use this JSON schema as a template for version manifests. Store this in your registry or version control.
{
"versionId": "a1b2c3d4e5f6",
"provider": "openai",
"model": "gpt-4",
"modelVersion": "gpt-4-0613",
"prompt": {
"template": "Answer the question based on the context. Question: {{question}} Context: {{context}}",
"variables": {
"system_role": "You are a helpful assistant."
},
"hash": "e8d5f6a7b8c9d0e1"
},
"hyperparameters": {
"temperature": 0.2,
"topP": 0.9,
"maxTokens": 500,
"seed": 42
},
"rag": {
"embeddingModel": "text-embedding-3-large",
"chunkSize": 512,
"similarityThreshold": 0.75,
"indexVersion": "idx_v2_20240520",
"hash": "f9a8b7c6d5e4f3a2"
},
"evaluation": {
"datasetHash": "d4e5f6a7b8c9d0e1",
"datasetName": "qa_benchmark_v3",
"metrics": {
"accuracy": 0.92,
"latency_p99": 1200,
"cost_per_1k": 0.03
}
},
"metadata": {
"createdAt": "2024-05-20T10:00:00Z",
"createdBy": "ci-pipeline",
"status": "registered"
}
}
Quick Start Guide
- Install Dependencies: Add a hashing library and define the
LLMVersionManifest interface in your project.
npm install crypto
- Create a Version Manager: Implement the
LLMVersionManager class to compute hashes and validate manifests. Use the code provided in the Core Solution.
- Define Your First Manifest: Create a JSON file for your initial LLM configuration. Fill in all fields, including pinned model version, prompt template, hyperparameters, and evaluation dataset hash.
- Register and Validate: Run the
validateManifest function on your configuration. Fix any errors (e.g., rolling tags, missing hashes). Register the valid manifest in your version manager.
- Integrate with Inference: Wrap your LLM client with the
VersionedLLMClient. Pass the versionId to the invoke method. Ensure all responses include version metadata for tracing.
By implementing this versioning strategy, you eliminate stochastic blind spots, ensure reproducible deployments, and establish a foundation for continuous evaluation and improvement of your LLM applications.