- Each evaluation dimension (accuracy, safety, latency, cost, rubric compliance) implements a standardized Evaluator interface. This enables composition, independent testing, and hot-swapping of evaluation strategies without modifying the runner.
- Rubric-First Evaluation: LLM outputs are evaluated against structured rubrics with explicit criteria, weightings, and scoring bands. Rubrics are versioned and stored as JSON schemas, enabling auditability and regression tracking.
- Calibrated LLM-as-Judge Fallback: When rubric matching is ambiguous or requires semantic understanding, a judge model is invoked with chain-of-thought prompting, temperature=0.2, and structured JSON output. Judge calls are cached by input hash and rubric fingerprint to eliminate redundant API costs.
- Deterministic Execution: Evaluation runs are idempotent. Inputs are hashed, metrics are seeded, and results are aggregated using weighted scoring with configurable thresholds. CI/CD gates can reject deployments based on metric deltas.
- Observability Integration: All evaluation results emit structured events (OpenTelemetry compatible) with metadata: model version, prompt fingerprint, timestamp, and cost breakdown. This enables drift detection and automated rollback triggers.
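The rubric fingerprint and judge cache key mentioned above are not specified in detail. The following is a minimal sketch of one way they might be derived, assuming SHA-256 over a canonical JSON serialization. The helper names (`canonicalize`, `rubricFingerprint`, `judgeCacheKey`) and the `Criterion` shape are hypothetical and not part of the original design.

```typescript
import { createHash } from "node:crypto";

// Hypothetical rubric shape, mirroring the criteria described above.
interface Criterion {
  id: string;
  description: string;
  weight: number;
}

// Serialize with sorted object keys so that key order does not change the hash.
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(canonicalize).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const entries = Object.keys(obj)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${canonicalize(obj[key])}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

function sha256(text: string): string {
  return createHash("sha256").update(text).digest("hex");
}

// Fingerprint = hash of the canonical rubric; any edit to a criterion changes it.
function rubricFingerprint(rubric: Criterion[]): string {
  return sha256(canonicalize(rubric));
}

// Cache key combines the evaluated pair with the rubric fingerprint (assumed scheme).
function judgeCacheKey(prompt: string, output: string, fingerprint: string): string {
  return sha256(canonicalize({ prompt, output, fingerprint }));
}
```

Because the fingerprint changes whenever a criterion, weight, or description changes, cached judge results are naturally invalidated by rubric edits.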
Step-by-Step Implementation
1. Define Evaluator Interface
interface EvaluationInput {
  prompt: string;
  expected?: string;
  metadata?: Record<string, unknown>;
}
interface EvaluationResult {
  metric: string;
  score: number;
  confidence: number;
  details?: Record<string, unknown>;
  latencyMs: number;
}
interface Evaluator {
  name: string;
  evaluate(input: EvaluationInput, output: string): Promise<EvaluationResult>;
}
2. Implement Rubric-Based Evaluator
interface RubricCriterion {
  id: string;
  description: string;
  weight: number;
  scoringBands: { min: number; max: number; label: string }[];
}
class RubricEvaluator implements Evaluator {
  name = 'rubric-based';
  private rubric: RubricCriterion[];
  constructor(rubric: RubricCriterion[]) {
    this.rubric = rubric;
  }
  async evaluate(input: EvaluationInput, output: string): Promise<EvaluationResult> {
    const startTime = performance.now();
    const scores = this.rubric.map(criterion => ({
      id: criterion.id,
      score: this.matchBand(output, criterion),
      weight: criterion.weight
    }));
    const weightedScore = scores.reduce((acc, s) => acc + s.score * s.weight, 0);
    const latency = performance.now() - startTime;
    return {
      metric: this.name,
      score: weightedScore,
      confidence: 0.85,
      details: { breakdown: scores },
      latencyMs: latency
    };
  }
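  private matchBand(output: string, criterion: RubricCriterion): number {
    // Production: integrate semantic similarity or lightweight classifier
    // Simplified for demonstration: fraction of description words found in the output
    const haystack = output.toLowerCase();
    // Lowercase both sides so capitalized description words can still match
    const words = criterion.description.toLowerCase().split(/\s+/).filter(Boolean);
    if (words.length === 0) return 0; // avoid dividing by zero on an empty description
    const matched = words.filter(w => haystack.includes(w)).length;
    return Math.min(1, (matched / words.length) * 1.5);
  }
}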
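The `scoringBands` field is declared on each criterion but not used by the evaluator above. As a rough sketch of how a numeric score could be mapped to a band label, assuming bands are half-open ranges with only the topmost bound inclusive: `bandLabel` and the example bands are hypothetical, and the boundary convention is an assumption rather than something the rubric format defines.

```typescript
interface ScoringBand {
  min: number;
  max: number;
  label: string;
}

// Return the label of the band containing the score, or undefined if none matches.
// Assumed convention: each band covers [min, max), except that the highest max
// across all bands is inclusive, so a perfect score still lands in the top band.
function bandLabel(score: number, bands: ScoringBand[]): string | undefined {
  const top = Math.max(...bands.map((b) => b.max));
  const match = bands.find(
    (b) => score >= b.min && (score < b.max || (score === b.max && b.max === top)),
  );
  return match?.label;
}

// Hypothetical bands for a criterion scored on a 0-1 scale.
const exampleBands: ScoringBand[] = [
  { min: 0, max: 0.5, label: "fail" },
  { min: 0.5, max: 0.8, label: "partial" },
  { min: 0.8, max: 1, label: "pass" },
];
```

Half-open ranges avoid gaps between adjacent bands, which would otherwise leave scores such as 0.495 unlabeled.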
3. Implement Calibrated LLM-as-Judge
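import { openai } from '@ai-sdk/openai';
import { generateObject, type LanguageModel } from 'ai';
import { z } from 'zod';

// generateObject needs a runtime schema, so the judge output is described with Zod
// rather than a TypeScript interface (interfaces are erased at compile time).
const JudgeSchema = z.object({
  score: z.number().describe('Overall score between 0 and 1'),
  reasoning: z.string(),
  criteria_met: z.array(z.string()),
  criteria_failed: z.array(z.string())
});

class CalibratedJudgeEvaluator implements Evaluator {
  name = 'calibrated-judge';
  private judgeModel: LanguageModel;
  private rubricFingerprint: string;
  constructor(model: LanguageModel, rubricFingerprint: string) {
    this.judgeModel = model;
    this.rubricFingerprint = rubricFingerprint;
  }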
  async evaluate(input: EvaluationInput, output: string): Promise<EvaluationResult> {
    const startTime = performance.now();
    const prompt = `
Evaluate the following LLM output against the rubric.
Rubric: ${JSON.stringify(input.metadata?.rubric)}
Prompt: ${input.prompt}
Output: ${output}
Return a JSON object with score (0-1), reasoning, and criteria breakdown.
`;
    const result = await generateObject({
      model: this.judgeModel,
      schema: JudgeSchema,
      prompt,
      temperature: 0.2,
      maxTokens: 512
    });
    const latency = performance.now() - startTime;
    return {
      metric: this.name,
      score: result.object.score,
      confidence: 0.78,
      details: {
        reasoning: result.object.reasoning,
        rubric_fingerprint: this.rubricFingerprint,
        criteria_met: result.object.criteria_met,
        criteria_failed: result.object.criteria_failed
      },
      latencyMs: latency
    };
  }
}
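The evaluator above is named "calibrated", but the calibration step itself is not shown. One possible sketch compares judge scores against a small human-labeled subset and reports error, bias, and pass/fail agreement. The names `LabeledPair`, `CalibrationReport`, and `calibrate`, and the default pass threshold of 0.75, are assumptions for illustration.

```typescript
// One calibration example: the judge's score next to a human reviewer's score (both 0-1).
interface LabeledPair {
  judgeScore: number;
  humanScore: number;
}

interface CalibrationReport {
  n: number;
  meanAbsoluteError: number; // average |judge - human|
  bias: number; // average (judge - human); positive means the judge is more lenient
  agreementRate: number; // share of pairs on the same side of the pass threshold
}

function calibrate(pairs: LabeledPair[], passThreshold = 0.75): CalibrationReport {
  if (pairs.length === 0) {
    throw new Error("calibration needs at least one labeled pair");
  }
  let absError = 0;
  let signedError = 0;
  let agreements = 0;
  for (const { judgeScore, humanScore } of pairs) {
    const diff = judgeScore - humanScore;
    absError += Math.abs(diff);
    signedError += diff;
    const judgePass = judgeScore >= passThreshold;
    const humanPass = humanScore >= passThreshold;
    if (judgePass === humanPass) agreements += 1;
  }
  const n = pairs.length;
  return {
    n,
    meanAbsoluteError: absError / n,
    bias: signedError / n,
    agreementRate: agreements / n,
  };
}
```

Re-running this report on the same labeled subset after each judge model or prompt change gives a simple way to notice judge drift.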
4. Evaluation Runner with Composition and Caching
class EvaluationRunner {
  // Cache the full result set per (prompt, output, evaluator list), not only the first result
  private cache: Map<string, EvaluationResult[]> = new Map();
  private evaluators: Evaluator[];
  constructor(evaluators: Evaluator[]) {
    this.evaluators = evaluators;
  }
  private hash(input: EvaluationInput, output: string): string {
    // Base64 encoding rather than a cryptographic hash; it only serves as a deterministic cache key
    return Buffer.from(`${input.prompt}|${output}|${this.evaluators.map(e => e.name).join(',')}`).toString('base64');
  }
  async run(input: EvaluationInput, output: string): Promise<EvaluationResult[]> {
    const cacheKey = this.hash(input, output);
    const cached = this.cache.get(cacheKey);
    if (cached) return cached;
    // Evaluators are independent, so run them in parallel
    const results = await Promise.all(
      this.evaluators.map(evaluator => evaluator.evaluate(input, output))
    );
    this.cache.set(cacheKey, results);
    return results;
  }
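  aggregate(results: EvaluationResult[], weights: Record<string, number>): number {
    // Weighted mean: normalize by the total weight applied, not by the number of weight entries
    let weightedSum = 0;
    let totalWeight = 0;
    for (const r of results) {
      const weight = weights[r.metric] ?? 1; // metrics without an explicit weight default to 1
      weightedSum += r.score * weight;
      totalWeight += weight;
    }
    return totalWeight > 0 ? weightedSum / totalWeight : 0;
  }
}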
5. Usage Example
const rubric: RubricCriterion[] = [
  { id: 'accuracy', description: 'factual correctness and alignment with prompt', weight: 0.4, scoringBands: [] },
  { id: 'safety', description: 'absence of harmful or biased content', weight: 0.3, scoringBands: [] },
  { id: 'format', description: 'adherence to requested structure', weight: 0.3, scoringBands: [] }
];
const evaluators = [
  new RubricEvaluator(rubric),
  new CalibratedJudgeEvaluator(openai('gpt-4o-mini'), 'v1_rubric_fingerprint')
];
const runner = new EvaluationRunner(evaluators);
const input: EvaluationInput = {
  prompt: 'Summarize the quarterly revenue report focusing on Q3 growth drivers.',
  metadata: { rubric, domain: 'finance' }
};
const output = 'Q3 revenue grew 12% YoY, driven by enterprise subscriptions and API usage expansion.';
const results = await runner.run(input, output);
const aggregatedScore = runner.aggregate(results, { 'rubric-based': 0.6, 'calibrated-judge': 0.4 });
console.log(`Aggregated Score: ${aggregatedScore.toFixed(2)}`);
This architecture keeps evaluation contracts explicit and repeatable, reduces judge API costs through caching and rubric-first routing, and provides granular metadata for production monitoring. TypeScript's static typing keeps evaluator inputs and results consistent across the pipeline, which simplifies integration with CI/CD pipelines and model registries.
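The metadata for production monitoring could be carried in a structured event per evaluation result. The sketch below builds a plain JSON event with the fields the article lists (model version, prompt fingerprint, timestamp, cost). The event name, attribute keys, and the `buildEvalEvent` and `emit` helpers are hypothetical; they are not OpenTelemetry semantic conventions, and a real pipeline would presumably pass the event to its telemetry exporter instead of writing a line of JSON.

```typescript
interface EvalResult {
  metric: string;
  score: number;
  latencyMs: number;
}

interface RunMetadata {
  modelVersion: string;
  promptFingerprint: string;
  costUsd: number;
}

interface EvalEvent {
  name: string;
  timestamp: string; // ISO 8601
  attributes: Record<string, string | number>;
}

// Flatten one result plus run metadata into a single structured event.
function buildEvalEvent(result: EvalResult, meta: RunMetadata, now: Date = new Date()): EvalEvent {
  return {
    name: "llm.evaluation", // hypothetical event name
    timestamp: now.toISOString(),
    attributes: {
      "eval.metric": result.metric,
      "eval.score": result.score,
      "eval.latency_ms": result.latencyMs,
      "eval.cost_usd": meta.costUsd,
      "model.version": meta.modelVersion,
      "prompt.fingerprint": meta.promptFingerprint,
    },
  };
}

// Write the event as one JSON line; the writer is injectable so it can be swapped out.
function emit(event: EvalEvent, write: (line: string) => void = console.log): void {
  write(JSON.stringify(event));
}
```

Keeping attributes flat and scalar makes the events easy to filter and aggregate in most log and metrics backends.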
Pitfall Guide
- Treating LLM Output as Deterministic: LLMs exhibit stochastic behavior even at low temperature. Running evaluation once and treating the result as ground truth introduces measurement noise. Always execute multiple runs with seeded randomness or use deterministic routing (rubric-first, judge-fallback) to stabilize scores.
- Optimizing for a Single Metric: Accuracy alone ignores latency, cost, safety, and user experience. A model scoring 95% on factual correctness but generating toxic content or exceeding budget constraints will fail in production. Use weighted multi-metric evaluation with explicit business thresholds.
- Ignoring Prompt and Temperature Sensitivity: Evaluation results shift dramatically with minor prompt variations or temperature changes. Failing to lock prompt versions and temperature settings during evaluation creates false regression signals. Implement prompt fingerprinting and configuration immutability for eval runs.
- Data Leakage in Evaluation Sets: Using training data, benchmark datasets, or publicly available examples in evaluation sets inflates scores and masks production failures. Curate eval sets from held-out production traffic, apply deduplication, and rotate subsets quarterly to prevent overfitting to eval distributions.
- Overusing LLM-as-a-Judge Without Calibration: Raw judge prompts produce high variance and position bias. Without chain-of-thought structuring, temperature control, and rubric alignment, judge scores correlate poorly with human preference. Always calibrate judges against a small human-labeled subset and track judge drift over time.
- Skipping Version Control for Prompts and Models: Evaluating without versioning makes it impossible to attribute score changes to model updates, prompt edits, or infrastructure shifts. Store prompt templates, model versions, and evaluation configurations in a registry. Tag evaluation runs with commit hashes and deployment IDs.
- Neglecting Cost and Latency Tracking: Evaluation pipelines themselves consume compute. Unbounded judge API calls or synchronous metric execution can stall CI/CD pipelines. Implement parallel execution, response caching, and cost-aware routing. Set p95 latency budgets and reject pipelines that exceed them.
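To make the first pitfall concrete, repeated runs of the same evaluation can be summarized by their mean and spread instead of trusting a single score. This is a minimal sketch; `summarizeRuns`, `isStable`, and the 0.05 standard-deviation tolerance are hypothetical choices, not values from the article.

```typescript
interface RunSummary {
  n: number;
  mean: number;
  stdDev: number;
}

// Summarize the scores from repeated evaluation runs of the same input.
function summarizeRuns(scores: number[]): RunSummary {
  if (scores.length === 0) {
    throw new Error("at least one run is required");
  }
  const n = scores.length;
  const mean = scores.reduce((acc, s) => acc + s, 0) / n;
  // Sample standard deviation (n - 1 in the denominator); defined as 0 for a single run.
  const variance =
    n > 1 ? scores.reduce((acc, s) => acc + (s - mean) ** 2, 0) / (n - 1) : 0;
  return { n, mean, stdDev: Math.sqrt(variance) };
}

// A score is only treated as stable when several runs agree within the tolerance.
function isStable(summary: RunSummary, maxStdDev = 0.05): boolean {
  return summary.n > 1 && summary.stdDev <= maxStdDev;
}
```

Reporting the mean together with the standard deviation lets a CI gate distinguish a real regression from run-to-run noise.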
Best Practices from Production:
- Route to rubric evaluators first; invoke judges only when confidence falls below threshold.
- Cache judge responses using input hash + rubric fingerprint + model version.
- Emit structured evaluation events to observability platforms for drift detection.
- Run evaluation gates before model promotion to staging/production.
- Maintain a living eval dataset that mirrors production distribution shifts.
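The first practice in the list, rubric-first routing with a judge fallback, could be sketched as follows. The `Result` and `Evaluate` types are simplified stand-ins for the article's evaluator interface, and `evaluateWithFallback` and the 0.8 confidence threshold are assumptions for illustration.

```typescript
interface Result {
  metric: string;
  score: number;
  confidence: number;
}

// Simplified evaluator signature: takes the prompt and model output, resolves to a result.
type Evaluate = (prompt: string, output: string) => Promise<Result>;

// Run the cheap evaluator first; call the judge only when its confidence is below the threshold.
async function evaluateWithFallback(
  primary: Evaluate,
  judge: Evaluate,
  prompt: string,
  output: string,
  minConfidence = 0.8,
): Promise<Result> {
  const first = await primary(prompt, output);
  if (first.confidence >= minConfidence) {
    return first; // confident enough: skip the judge call and its cost
  }
  return judge(prompt, output);
}
```

With this shape, judge spend scales with the share of low-confidence cases rather than with total traffic.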
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High-volume customer support routing | Rubric-first + lightweight classifier | Deterministic, sub-50ms latency, scales to 10k+ RPS | Low ($0.01-0.05 per 1k) |
| Code generation and review | Calibrated LLM-as-judge with rubric fallback | Semantic understanding required; judge captures nuance | Medium ($0.45-0.80 per 1k) |
| Regulated financial/medical summarization | Multi-dimensional rubric + human-in-the-loop sampling | Compliance requires auditability; judge alone insufficient | High ($1.20-1.80 per 1k + human review) |
| Rapid prototype validation | Single-metric heuristic + cached judge | Fast iteration; acceptable for pre-production | Low-Medium ($0.15-0.30 per 1k) |
| Production regression monitoring | Versioned rubric + drift detection pipeline | Tracks score degradation across model/prompt updates | Medium ($0.30-0.50 per 1k) |
Configuration Template
evaluation:
  version: "1.0"
  run_id: "eval-2024-q3-production"
  model:
    name: "gpt-4o"
    version: "2024-07-18"
    temperature: 0.2
  rubric:
    fingerprint: "v2_finance_summarization"
    criteria:
      - id: "accuracy"
        weight: 0.4
        description: "factual alignment with source document"
      - id: "safety"
        weight: 0.3
        description: "no unverified claims or regulatory violations"
      - id: "structure"
        weight: 0.3
        description: "follows requested bullet-point format"
  judges:
    fallback_model: "gpt-4o-mini"
    cache_ttl_hours: 24
    temperature: 0.2
    max_tokens: 512
  thresholds:
    min_aggregated_score: 0.75
    max_p95_latency_ms: 300
    max_cost_per_1k: 0.70
  observability:
    otel_endpoint: "https://otel.internal:4318"
    tags:
      team: "ai-platform"
      environment: "staging"
      prompt_version: "p-782"
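A configuration like this is worth checking before a run starts. The sketch below validates an already-parsed config object (YAML parsing is left out) and returns a list of problems. The `EvalConfig` type covers only the `rubric` and `thresholds` sections, and the rule that criterion weights must sum to 1 is an assumption inferred from the example values, not a documented requirement.

```typescript
interface CriterionConfig {
  id: string;
  weight: number;
  description: string;
}

interface EvalConfig {
  rubric: { fingerprint: string; criteria: CriterionConfig[] };
  thresholds: {
    min_aggregated_score: number;
    max_p95_latency_ms: number;
    max_cost_per_1k: number;
  };
}

// Return human-readable problems; an empty list means the config passed these checks.
function validateConfig(config: EvalConfig): string[] {
  const problems: string[] = [];
  const seen = new Set<string>();
  let weightSum = 0;
  for (const criterion of config.rubric.criteria) {
    if (seen.has(criterion.id)) problems.push(`duplicate criterion id: ${criterion.id}`);
    seen.add(criterion.id);
    if (criterion.weight < 0) problems.push(`negative weight for ${criterion.id}`);
    weightSum += criterion.weight;
  }
  // Tolerance absorbs floating-point rounding when summing decimal weights.
  if (Math.abs(weightSum - 1) > 1e-9) {
    problems.push(`criterion weights sum to ${weightSum}, expected 1`);
  }
  const t = config.thresholds;
  if (t.min_aggregated_score < 0 || t.min_aggregated_score > 1) {
    problems.push("min_aggregated_score must be between 0 and 1");
  }
  if (t.max_p95_latency_ms <= 0) problems.push("max_p95_latency_ms must be positive");
  if (t.max_cost_per_1k <= 0) problems.push("max_cost_per_1k must be positive");
  return problems;
}
```

Failing fast on a malformed config avoids spending judge calls on a run whose scores would not be comparable to earlier ones.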
Quick Start Guide
- Install dependencies:
npm install ai @ai-sdk/openai zod
- Define rubric schema: Create a JSON/YAML file with criteria, weightings, and scoring bands matching your production task.
- Initialize evaluators: Instantiate RubricEvaluator and CalibratedJudgeEvaluator with your rubric and judge model configuration.
- Run evaluation pipeline: Pass input/output pairs through EvaluationRunner, aggregate scores using business-weighted thresholds, and emit results to your observability stack.
- Integrate CI/CD gate: Add a pre-deployment step that runs evaluation against a holdout set; reject deployments if aggregated score falls below threshold or p95 latency exceeds budget.
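The CI/CD gate in the last step could look roughly like the sketch below, which checks the mean aggregated score and the p95 latency of a holdout run against thresholds. The names `Sample`, `GateThresholds`, `percentile`, and `checkGate` are hypothetical, and the nearest-rank percentile method is one common choice rather than something the article specifies.

```typescript
interface Sample {
  aggregatedScore: number;
  latencyMs: number;
}

interface GateThresholds {
  minAggregatedScore: number;
  maxP95LatencyMs: number;
}

// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("percentile of an empty list");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Return the list of failed checks; an empty list means the gate passes.
function checkGate(samples: Sample[], thresholds: GateThresholds): string[] {
  if (samples.length === 0) return ["no samples were evaluated"];
  const failures: string[] = [];
  const meanScore = samples.reduce((acc, s) => acc + s.aggregatedScore, 0) / samples.length;
  const p95 = percentile(samples.map((s) => s.latencyMs), 95);
  if (meanScore < thresholds.minAggregatedScore) {
    failures.push(`mean score ${meanScore} is below ${thresholds.minAggregatedScore}`);
  }
  if (p95 > thresholds.maxP95LatencyMs) {
    failures.push(`p95 latency ${p95}ms exceeds ${thresholds.maxP95LatencyMs}ms`);
  }
  return failures;
}
```

In a pipeline step, a non-empty result would typically be printed and turned into a non-zero exit code so that the deployment is rejected.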