e a trace containing input prompts, model parameters, output tokens, and metadata. Traces must link user sessions to model invocations for debugging.
2. Async Evaluation Pipeline: Semantic evaluations (e.g., LLM-as-a-judge, embedding comparisons) are computationally expensive. They should run asynchronously in a sidecar or dedicated worker to avoid adding latency to the critical path.
3. PII Redaction at Ingestion: Prompts and outputs often contain sensitive data. Redaction must occur at the SDK level before data leaves the application environment.
4. RAG-Specific Metrics: For RAG systems, observability must capture retrieval metrics (chunk relevance, vector similarity scores) alongside generation metrics.
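To make principle 3 concrete, here is a minimal sketch of a regex-based redactor that could run inside the application before anything is attached to a trace. The `RedactionRule` shape, the `redact` function, and the placeholder format are assumptions for illustration; the two patterns are examples only and would not catch every form of PII.

```typescript
// Hypothetical rule shape; real deployments usually need many more patterns
// (phone numbers, addresses, account IDs) or a dedicated PII detection service.
interface RedactionRule {
  type: string;
  regex: RegExp;
}

const DEFAULT_RULES: RedactionRule[] = [
  { type: 'email', regex: /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g },
  { type: 'ssn', regex: /\b\d{3}-\d{2}-\d{4}\b/g },
];

// Replace every match of every rule with a typed placeholder,
// e.g. "[REDACTED:email]", so traces stay readable without exposing the value.
function redact(text: string, rules: RedactionRule[] = DEFAULT_RULES): string {
  return rules.reduce(
    (acc, rule) => acc.replace(rule.regex, `[REDACTED:${rule.type}]`),
    text,
  );
}
```

Typed placeholders keep the redacted trace useful for debugging, since a reader can still see that an email address was present without seeing the address itself.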
Step-by-Step Implementation
1. Instrumentation with OpenTelemetry
Use an SDK that wraps LLM clients and emits spans compliant with OpenTelemetry semantic conventions for GenAI.
2. Semantic Evaluation Integration
Implement evaluators that run against trace data. Common evaluators include:
- Faithfulness: Is every claim in the output supported by the retrieved context, with no contradictions?
- Answer Relevance: Does the output actually answer the user query?
- Context Precision: How much of the retrieved context was relevant to the query?
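The evaluators above can share one small interface so that new checks plug into the same pipeline. The sketch below is an assumption about how that might look: `TraceRecord`, `Evaluator`, and `runEvaluators` are hypothetical names, and `contextPrecision` uses a simplified, unranked definition (the share of retrieved chunks labeled relevant) with labels assumed to come from a judge model or human annotation.

```typescript
// Hypothetical shape of the trace data an evaluator receives.
interface TraceRecord {
  query: string;
  output: string;
  chunks: { text: string; relevant: boolean }[];
}

// An evaluator maps a trace to a score in [0, 1].
type Evaluator = (record: TraceRecord) => number;

// Simplified context precision: fraction of retrieved chunks labeled relevant.
const contextPrecision: Evaluator = (record) => {
  if (record.chunks.length === 0) return 0;
  const relevantCount = record.chunks.filter((c) => c.relevant).length;
  return relevantCount / record.chunks.length;
};

// Run every registered evaluator and collect the scores by name.
function runEvaluators(
  record: TraceRecord,
  evaluators: Record<string, Evaluator>,
): Record<string, number> {
  const scores: Record<string, number> = {};
  for (const [name, evaluate] of Object.entries(evaluators)) {
    scores[name] = evaluate(record);
  }
  return scores;
}
```

Faithfulness and answer relevance would typically be implemented as further `Evaluator` functions backed by a judge model or embedding comparison.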
3. Drift Detection
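Monitor the distribution of embeddings for user queries and model outputs. Univariate statistical tests such as Kolmogorov-Smirnov cannot be applied to high-dimensional vectors directly, so run them on a one-dimensional summary of each embedding (for example, its cosine distance to a baseline centroid). A significant shift in that distribution indicates prompt drift or a change in user behavior.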
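One way to sketch this check is to reduce each embedding to a scalar and compare a baseline window against the current window with a two-sample Kolmogorov-Smirnov statistic. The reduction to cosine distance, the function names, and the default significance level are assumptions for illustration; the critical value uses the standard large-sample approximation.

```typescript
// Cosine distance between two non-zero vectors of equal length (0 = same direction).
function cosineDistance(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Two-sample KS statistic: the largest gap between the two empirical CDFs.
function ksStatistic(a: number[], b: number[]): number {
  if (a.length === 0 || b.length === 0) {
    throw new Error('Both samples must be non-empty');
  }
  const x = [...a].sort((p, q) => p - q);
  const y = [...b].sort((p, q) => p - q);
  let i = 0;
  let j = 0;
  let maxGap = 0;
  while (i < x.length && j < y.length) {
    const v = Math.min(x[i], y[j]);
    // Advance both samples past v so ties are handled consistently.
    while (i < x.length && x[i] <= v) i++;
    while (j < y.length && y[j] <= v) j++;
    maxGap = Math.max(maxGap, Math.abs(i / x.length - j / y.length));
  }
  return maxGap;
}

// Flags drift when the statistic exceeds the asymptotic critical value
// c(alpha) * sqrt((n + m) / (n * m)), with c(alpha) = sqrt(-ln(alpha / 2) / 2).
function driftDetected(baseline: number[], current: number[], alpha = 0.05): boolean {
  const n = baseline.length;
  const m = current.length;
  const critical = Math.sqrt(-Math.log(alpha / 2) / 2) * Math.sqrt((n + m) / (n * m));
  return ksStatistic(baseline, current) > critical;
}
```

In use, `baseline` and `current` would each hold the `cosineDistance` of every embedding in the window to a fixed baseline centroid. The approximation is unreliable for very small windows, so a minimum sample size per window is advisable.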
Code Example: TypeScript Implementation
This example demonstrates a wrapper pattern for instrumenting an LLM call with AI observability, including PII redaction and async evaluation triggers.
import { trace, SpanStatusCode } from '@opentelemetry/api';
import { PIIRedactor } from './security/pii-redactor';
import { EvaluationEngine } from './evaluation/engine';
import { LLMClient } from './llm/client';
// Hypothetical helper module: calculateCost is called below but was never defined.
import { calculateCost } from './billing/cost';

// Attribute keys as string literals, named after the OpenTelemetry GenAI semantic
// conventions. The conventions are still evolving, so check these keys against the
// version your backend expects.
const GenAIAttr = {
  SYSTEM: 'gen_ai.system',
  REQUEST_MODEL: 'gen_ai.request.model',
  PROMPT: 'gen_ai.prompt',
  COMPLETION: 'gen_ai.completion',
  INPUT_TOKENS: 'gen_ai.usage.input_tokens',
  OUTPUT_TOKENS: 'gen_ai.usage.output_tokens',
} as const;

interface LLMResult {
  text: string;
  usage: { promptTokens: number; completionTokens: number };
}

// AI Observability Decorator (legacy TypeScript decorator syntax; requires
// "experimentalDecorators": true in tsconfig.json)
function observeAI(options: {
  model: string;
  trackCost: boolean;
  evaluateQuality: boolean;
}) {
  return function (
    target: any,
    propertyKey: string,
    descriptor: PropertyDescriptor
  ) {
    const originalMethod = descriptor.value;
    descriptor.value = async function (this: any, ...args: any[]) {
      const tracer = trace.getTracer('ai-observability');
      const span = tracer.startSpan(`ai.llm.${propertyKey}`);

      // 1. Redact Input PII before tracing
      const inputPrompt = args[0];
      const redactedPrompt = PIIRedactor.redact(inputPrompt);
      span.setAttribute(GenAIAttr.SYSTEM, 'custom');
      span.setAttribute(GenAIAttr.REQUEST_MODEL, options.model);
      span.setAttribute(GenAIAttr.PROMPT, redactedPrompt);

      // Declared outside the try block so the finally block can read it
      let redactedOutput: string | undefined;
      try {
        // 2. Execute LLM Call
        const result: LLMResult = await originalMethod.apply(this, args);

        // 3. Capture Output and Metadata
        redactedOutput = PIIRedactor.redact(result.text);
        span.setAttribute(GenAIAttr.COMPLETION, redactedOutput);
        span.setAttribute(GenAIAttr.INPUT_TOKENS, result.usage.promptTokens);
        span.setAttribute(GenAIAttr.OUTPUT_TOKENS, result.usage.completionTokens);
        if (options.trackCost) {
          const cost = calculateCost(result.usage, options.model);
          span.setAttribute('gen_ai.cost.total', cost); // custom attribute, not part of the conventions
        }
        span.setStatus({ code: SpanStatusCode.OK });
        return result;
      } catch (error) {
        span.recordException(error as Error);
        span.setStatus({
          code: SpanStatusCode.ERROR,
          message: error instanceof Error ? error.message : String(error)
        });
        throw error;
      } finally {
        span.end();
        // 4. Trigger Async Evaluation (fire-and-forget; skipped when the call failed).
        // Only redacted text is passed on, so raw PII never reaches the evaluation worker.
        if (options.evaluateQuality && redactedOutput !== undefined) {
          EvaluationEngine.evaluate({
            traceId: span.spanContext().traceId,
            input: redactedPrompt,
            output: redactedOutput,
            model: options.model
          }).catch((err: unknown) => console.error('Evaluation pipeline error:', err));
        }
      }
    };
  };
}

// Usage in Service Class
class ChatService {
  private llm = new LLMClient();

  @observeAI({
    model: 'gpt-4o',
    trackCost: true,
    evaluateQuality: true
  })
  async generateResponse(prompt: string): Promise<LLMResult> {
    // Actual LLM invocation
    return this.llm.chat(prompt);
  }
}
Rationale:
- Decorator Pattern: Keeps observability logic decoupled from business logic, allowing reuse across services.
- PII Redaction: Supports GDPR/CCPA compliance by keeping raw sensitive data out of observability backends. Redaction alone does not guarantee compliance.
- Async Evaluation: Prevents evaluation latency from impacting user-facing response times.
- Cost Attribution: Captures cost per span, enabling granular cost analysis per feature or user segment.
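The example above calls `calculateCost` without showing it. A minimal sketch of such a helper follows, assuming per-million-token pricing supplied by the caller. The `ModelPricing` shape and the extra `pricing` parameter are assumptions (the original call passes only usage and model), and no real provider prices are implied.

```typescript
interface TokenUsage {
  promptTokens: number;
  completionTokens: number;
}

// Hypothetical pricing shape: USD per one million tokens, split by direction.
interface ModelPricing {
  inputPerMillion: number;
  outputPerMillion: number;
}

// Computes the cost of one call from its token usage.
// Throws instead of silently returning 0 when a model has no configured price,
// so missing pricing shows up as an error rather than as a cheap request.
function calculateCost(
  usage: TokenUsage,
  model: string,
  pricing: Record<string, ModelPricing>,
): number {
  const rate = pricing[model];
  if (!rate) {
    throw new Error(`No pricing configured for model "${model}"`);
  }
  return (
    (usage.promptTokens * rate.inputPerMillion +
      usage.completionTokens * rate.outputPerMillion) / 1e6
  );
}
```

Keeping the price table outside the function lets it be loaded from configuration and updated when provider pricing changes, without touching the instrumentation code.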
Pitfall Guide
Common Mistakes in AI Observability
- Logging Raw PII in Traces:
  - Mistake: Storing user emails, phone numbers, or health data in observability backends.
  - Impact: Compliance violations, data breaches, and legal liability.
  - Fix: Implement mandatory redaction at the SDK level. Use tokenization for sensitive fields if analysis is required.
- Ignoring Token Cost Drift:
  - Mistake: Monitoring only request counts while ignoring token consumption.
  - Impact: Bill shock. A slight change in prompt engineering can double token usage without affecting latency.
  - Fix: Alert on tokens_per_request and cost_per_session anomalies. Implement token budgets per user tier.
- Treating All Errors Equally:
  - Mistake: Aggregating HTTP 500s and "Hallucination" errors into the same error rate metric.
  - Impact: Masking quality issues. Infrastructure errors are urgent; quality errors require model tuning.
  - Fix: Segment errors by type: INFRA_FAILURE, RATE_LIMIT, QUALITY_DEGRADATION, SAFETY_VIOLATION.
- No Baseline for "Good":
  - Mistake: Monitoring metrics without defining thresholds based on historical performance or gold-standard evaluations.
  - Impact: Alert fatigue or missed detections.
  - Fix: Establish baselines using evaluation datasets. Dynamic thresholds should adapt to weekly patterns.
- Neglecting RAG Retrieval Metrics:
  - Mistake: Monitoring only the generation step in RAG systems.
  - Impact: Blindness to retrieval failures. The model may generate plausible but incorrect answers based on poor context.
  - Fix: Instrument vector search latency, chunk similarity scores, and retrieval recall. Correlate retrieval quality with generation faithfulness.
- Over-Reliance on LLM-as-a-Judge:
  - Mistake: Using an LLM to evaluate itself without human validation or heuristic checks.
  - Impact: Evaluation bias and circular reasoning. The judge model may favor its own style over correctness.
  - Fix: Hybrid evaluation: Combine LLM judges with deterministic checks (regex, keyword presence, citation verification) and periodic human review.
- Prompt Versioning Gaps:
  - Mistake: Updating prompts without versioning and linking them to traces.
  - Impact: Inability to roll back or attribute quality changes to specific prompt updates.
  - Fix: Version all prompts. Include prompt_version in trace metadata. Enable A/B testing with trace segmentation.
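The error segmentation suggested under "Treating All Errors Equally" could be sketched as follows. The `CallOutcome` shape, the precedence order of the checks, and the default quality threshold of 0.8 are assumptions for illustration; only the four category names come from the list above.

```typescript
type AIErrorType =
  | 'INFRA_FAILURE'
  | 'RATE_LIMIT'
  | 'QUALITY_DEGRADATION'
  | 'SAFETY_VIOLATION';

// Hypothetical summary of one LLM call, assembled from its trace.
interface CallOutcome {
  httpStatus?: number;     // set when the provider call failed
  safetyFlagged?: boolean; // set by a safety filter
  qualityScore?: number;   // evaluator score in [0, 1]
}

// Returns the error category for a call, or null when the call was healthy.
function classifyOutcome(outcome: CallOutcome, qualityThreshold = 0.8): AIErrorType | null {
  if (outcome.httpStatus === 429) return 'RATE_LIMIT';
  if (outcome.httpStatus !== undefined && outcome.httpStatus >= 500) return 'INFRA_FAILURE';
  if (outcome.safetyFlagged) return 'SAFETY_VIOLATION';
  if (outcome.qualityScore !== undefined && outcome.qualityScore < qualityThreshold) {
    return 'QUALITY_DEGRADATION';
  }
  return null;
}

// Computes a separate error rate per category instead of one blended rate.
function errorRatesByType(outcomes: CallOutcome[]): Record<AIErrorType, number> {
  const rates: Record<AIErrorType, number> = {
    INFRA_FAILURE: 0,
    RATE_LIMIT: 0,
    QUALITY_DEGRADATION: 0,
    SAFETY_VIOLATION: 0,
  };
  for (const outcome of outcomes) {
    const type = classifyOutcome(outcome);
    if (type !== null) rates[type] += 1;
  }
  const total = outcomes.length || 1; // avoid division by zero for an empty window
  for (const key of Object.keys(rates) as AIErrorType[]) {
    rates[key] /= total;
  }
  return rates;
}
```

Separate rates let infrastructure alerts page on-call engineers while quality and safety rates feed slower review workflows.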
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Startup / MVP | SDK-based tracing + Manual Review | Low overhead; focuses on core functionality and quick feedback. | Low infrastructure cost; high manual effort. |
| Enterprise RAG | Full Observability Suite + Async LLM Judges | Requires deep visibility into retrieval/generation; compliance needs PII redaction and audit trails. | Moderate infrastructure cost; evaluation costs scale with traffic. |
| High-Volume Chatbot | Stream-based Metrics + Cost Budgets | Latency sensitivity requires async evaluation; cost control is critical at scale. | High evaluation cost; mitigated by cost budgeting and caching. |
| Safety-Critical App | Real-time Safety Filters + Human-in-the-Loop | Zero tolerance for violations; requires immediate blocking and human review queues. | High latency overhead for safety checks; high operational cost for review. |
Configuration Template
This YAML configuration demonstrates how to define observability rules, thresholds, and redaction policies for an AI monitoring system.
ai_observability:
  tracing:
    enabled: true
    sample_rate: 1.0
    redaction:
      enabled: true
      patterns:
        - type: email
          regex: '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
        - type: ssn
          regex: '\b\d{3}-\d{2}-\d{4}\b'
      mask_char: '*'
  metrics:
    custom_metrics:
      - name: hallucination_rate
        type: gauge
        threshold:
          warning: 0.05
          critical: 0.10
      - name: cost_per_request
        type: histogram
        buckets: [0.001, 0.005, 0.01, 0.05, 0.10]
      - name: embedding_drift_score
        type: gauge
        threshold:
          warning: 0.15
          critical: 0.25
  evaluation:
    pipeline: async
    judges:
      - name: faithfulness_check
        model: judge-model-v2
        prompt_template: "Evaluate if the output is faithful to the context."
        score_threshold: 0.8
      - name: safety_scan
        model: safety-model-v1
        action: block_if_violation
  alerts:
    - name: CostAnomaly
      condition: cost_per_request > p95 * 1.5
      duration: 5m
      notify: [finance-team, engineering-lead]
    - name: QualityDegradation
      condition: hallucination_rate > 0.10
      duration: 10m
      notify: [ai-team, product-manager]
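How a monitoring system might interpret the threshold and duration fields can be sketched as below. The semantics are an assumption: a gauge is classified against its warning and critical levels, and an alert fires only when its condition has held continuously for the configured duration. The type and function names are hypothetical.

```typescript
type Severity = 'ok' | 'warning' | 'critical';

interface Threshold {
  warning: number;
  critical: number;
}

// Classify a gauge value against its configured thresholds.
function classifySeverity(value: number, threshold: Threshold): Severity {
  if (value >= threshold.critical) return 'critical';
  if (value >= threshold.warning) return 'warning';
  return 'ok';
}

interface Sample {
  timestampMs: number;
  value: number;
}

// True when the most recent samples have all exceeded `limit` for at least
// `durationMs`. Samples must be sorted by ascending timestamp.
function breachedFor(samples: Sample[], limit: number, durationMs: number): boolean {
  if (samples.length === 0) return false;
  const latest = samples[samples.length - 1];
  let breachStart = latest.timestampMs;
  // Walk backwards until the first sample that does not breach the limit.
  for (let i = samples.length - 1; i >= 0; i--) {
    if (samples[i].value <= limit) break;
    breachStart = samples[i].timestampMs;
  }
  return latest.value > limit && latest.timestampMs - breachStart >= durationMs;
}
```

Under these assumptions, the QualityDegradation alert corresponds to `breachedFor(samples, 0.10, 10 * 60 * 1000)` over recent hallucination_rate samples. Requiring a sustained breach keeps a single noisy data point from paging anyone.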
Quick Start Guide
- Install SDK: Add the observability SDK to your project dependencies.
npm install @codcompass/ai-observability
- Initialize Client: Configure the SDK with your API key and redaction settings.
import { initObservability } from '@codcompass/ai-observability';
initObservability({
  apiKey: process.env.OBSERVABILITY_API_KEY,
  redaction: { enabled: true, patterns: ['email', 'phone'] }
});
- Wrap LLM Calls: Apply the decorator or wrapper to your LLM invocation methods.
@observeAI({ model: 'gpt-4o', trackCost: true, evaluateQuality: true })
async askAI(prompt: string) { ... }
- Define Metrics: Configure quality thresholds and cost alerts in the dashboard or configuration file.
- Deploy & Validate: Deploy the changes. Verify traces appear in the observability backend and that PII is redacted. Run a synthetic evaluation job to populate baseline metrics.