Core Solution
The pipeline executes a five-step evaluation for every prompt in the suite:
- Load: A YAML prompt suite is parsed into a PromptSuite Pydantic model. Each entry contains an ID, text, category, tags, and expected behavior description.
- Run: Prompts are dispatched to Model A and Model B via LLMRunner. Supported providers: Ollama (/api/generate), OpenRouter (chat completions), and a deterministic stub provider for offline CI validation.
- Score: Response pairs are evaluated using EmbeddingDiffer (cosine similarity on all-MiniLM-L6-v2) or SimpleDiffer (Jaccard over words). An optional LLM-as-judge score is fused with the embedding score for ambiguous cases.
- Classify: Results are bucketed against the --threshold. Severity classification isolates critical regressions from minor stylistic variations.
- Report: An HTML report is generated for artifact storage, alongside a rich terminal summary table.
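The two metrics named in the Score step can be sketched in a few lines. This is a minimal illustration, not the project's EmbeddingDiffer or SimpleDiffer code: the function names jaccard_similarity and cosine_similarity are hypothetical, and the tokenization rule (lowercased \w+ words) is an assumption. In the real pipeline the vectors would come from all-MiniLM-L6-v2 rather than being passed in by hand.

```python
import math
import re


def jaccard_similarity(a: str, b: str) -> float:
    """Word-level Jaccard: |A & B| / |A | B| over lowercased word sets."""
    words_a = set(re.findall(r"\w+", a.lower()))
    words_b = set(re.findall(r"\w+", b.lower()))
    if not words_a and not words_b:
        return 1.0  # assumption: two empty responses count as identical
    return len(words_a & words_b) / len(words_a | words_b)


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector is treated as dissimilar
    return dot / (norm_u * norm_v)


# Two paraphrases share only half of their combined vocabulary.
print(jaccard_similarity("Yes, that is correct.", "Correct, yes."))
```

Jaccard only sees shared words, so two answers with the same meaning but different wording score low, while an embedding-based cosine score is designed to stay high for paraphrases.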
Installation & Setup
pip install -e .
Requires Python 3.11+. Embedding similarity uses sentence-transformers/all-MiniLM-L6-v2, downloaded on first use. The LLM-judge path requires OPENROUTER_API_KEY; without it, scoring falls back to embeddings-only.
Offline CI Validation (Stub Provider)
llm-diff run \
--model-a stub-a --provider-a stub \
--model-b stub-b --provider-b stub \
--prompts prompts/default.yaml \
--output output/report.html \
--no-use-embeddings
Real output from this run (stub + Jaccard, threshold 0.5):
╭───────────────────────────────────────────────────╮
│ LLM Behavior Diff                                 │
│ Detecting behavioral shifts between model updates │
╰───────────────────────────────────────────────────╯
Processing: safety-001 ━━━━━━━━━━━━━━━━━━━━ 100%
Comparison Summary
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃ Metric           ┃ Value ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
│ Total Prompts    │ 5     │
│ Changes Detected │ 3     │
│ Change Rate      │ 60.0% │
│ Avg Similarity   │ 40.0% │
└──────────────────┴───────┘
Report saved to: output/stub_jaccard.html
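The summary figures follow from per-prompt similarity scores. The sketch below assumes a prompt counts as changed when its similarity falls below the threshold, which is an assumption about the tool's rule rather than a confirmed detail. The five scores are invented to reproduce the arithmetic (3 of 5 changed, 40% average) and are not the actual per-prompt values from this run. The function name summarize is hypothetical.

```python
from statistics import mean


def summarize(similarities: list[float], threshold: float) -> dict[str, float]:
    """Aggregate per-prompt similarities into the summary-table metrics."""
    total = len(similarities)
    # Assumption: a score strictly below the threshold is a detected change.
    changes = sum(1 for s in similarities if s < threshold)
    return {
        "total_prompts": total,
        "changes_detected": changes,
        "change_rate": changes / total if total else 0.0,
        "avg_similarity": mean(similarities) if similarities else 0.0,
    }


# Hypothetical scores chosen only to illustrate the calculation.
summary = summarize([0.1, 0.2, 0.3, 0.6, 0.8], threshold=0.5)
print(f"Change rate: {summary['change_rate']:.1%}")
print(f"Avg similarity: {summary['avg_similarity']:.1%}")
```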
Production Validation (OpenRouter)
export OPENROUTER_API_KEY=sk-or-...
llm-diff run \
--model-a meta-llama/llama-3.2-3b-instruct --provider-a openrouter \
--model-b google/gemini-2.0-flash-lite-001 --provider-b openrouter \
--prompts prompts/default.yaml \
--output output/or_emb.html \
--use-embeddings --threshold 0.85
Real output (embeddings only):
Comparison Summary
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
│ Total Prompts    │ 5     │
│ Changes Detected │ 0     │
│ Change Rate      │ 0.0%  │
│ Avg Similarity   │ 91.4% │
└──────────────────┴───────┘
Adding --use-judge brings the average similarity to 91.8% and surfaces reasoning like: "Both responses correctly answer 'yes' and provide essentially the same explanation... Response A is slightly more verbose, but the core meaning is identical."
Local Validation (Ollama)
llm-diff run \
--model-a qwen3:8b --provider-a ollama \
--model-b gemma4:e4b --provider-b ollama \
--prompts prompts/default.yaml \
--output output/report.html \
--use-embeddings --threshold 0.85
CLI Reference
llm-diff --help
Usage: llm-diff [OPTIONS] COMMAND [ARGS]...
LLM Behavior Diff - Model Update Detector
--version Show version information
--help Show this message and exit.
Commands
run Run a comparison between two models.
llm-diff --version
LLM Behavior Diff version 0.1.0
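Key options for llm-diff run, as used in the examples above: --model-a and --model-b, --provider-a and --provider-b (stub, openrouter, or ollama), --prompts, --output, --use-embeddings / --no-use-embeddings, --use-judge, and --threshold.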
Severity buckets applied when a change is detected: combined >= 0.7 is minor, >= 0.4 is moderate, < 0.4 is major.
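The bucket boundaries above translate directly into a small function. This is a sketch of the stated rule, not the project's own code; the name classify_severity is hypothetical, and it should only be called for prompts where a change has already been detected.

```python
def classify_severity(combined: float) -> str:
    """Map a combined similarity score to a severity bucket.

    Boundaries follow the documented rule:
    >= 0.7 is minor, >= 0.4 is moderate, anything lower is major.
    """
    if combined >= 0.7:
        return "minor"
    if combined >= 0.4:
        return "moderate"
    return "major"


for score in (0.82, 0.55, 0.12):
    print(score, classify_severity(score))
```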
Prompt Suite Format
name: "My suite"
version: "1.0.0"
prompts:
- id: "code-001"
text: "Write a Python function reverse_string(s)..."
category: "coding"
tags: ["python"]
expected_behavior: "Short correct function"
IDs must be unique. Category must be one of: reasoning, coding, creativity, safety, instruction_following, factual, conversational.
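The two validation rules can be illustrated with a standard-library sketch. The project itself uses a Pydantic model for this; the dataclass, the validate_suite helper, and the choice to make tags and expected_behavior optional are assumptions made here to keep the example dependency-free. The input is the already-parsed YAML as a plain dict.

```python
from dataclasses import dataclass, field

ALLOWED_CATEGORIES = {
    "reasoning", "coding", "creativity", "safety",
    "instruction_following", "factual", "conversational",
}


@dataclass
class Prompt:
    id: str
    text: str
    category: str
    tags: list[str] = field(default_factory=list)  # assumed optional
    expected_behavior: str = ""                    # assumed optional


def validate_suite(raw: dict) -> list[Prompt]:
    """Build Prompt objects and enforce unique IDs and known categories."""
    prompts = [Prompt(**entry) for entry in raw.get("prompts", [])]
    seen: set[str] = set()
    for prompt in prompts:
        if prompt.id in seen:
            raise ValueError(f"duplicate prompt id: {prompt.id}")
        seen.add(prompt.id)
        if prompt.category not in ALLOWED_CATEGORIES:
            raise ValueError(
                f"unknown category for {prompt.id}: {prompt.category}"
            )
    return prompts


suite = {
    "name": "My suite",
    "version": "1.0.0",
    "prompts": [
        {
            "id": "code-001",
            "text": "Write a Python function reverse_string(s)...",
            "category": "coding",
            "tags": ["python"],
            "expected_behavior": "Short correct function",
        }
    ],
}
print([p.id for p in validate_suite(suite)])
```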
Python API
from llm_behavior_diff.runner import run_prompt_sync
from llm_behavior_diff.models import ModelConfig, ProviderType
resp = run_prompt_sync(
ModelConfig(name="stub-m", provider=ProviderType.STUB),
prompt_id="p1",
prompt_text="hello world",
)
print(resp.text, resp.success)
# -> Model stub-m says: 921fac0c4c True
Pitfall Guide
- Benchmark-Only Validation Trap: Relying exclusively on aggregate benchmark scores masks category-specific regressions. Always run behavioral diffs against your actual user prompt distribution before deployment.
- Token-Level Comparison Over-Flagging: Using Jaccard or exact string matching on LLM outputs generates high false positive rates due to paraphrasing. Default to embedding-based cosine similarity unless lexical precision is explicitly required.
- Uncalibrated Threshold Settings: Applying a fixed --threshold (e.g., 0.85) across all domains causes misclassification. Calibrate thresholds per category; safety and coding prompts typically require stricter thresholds (>0.85) than conversational ones (~0.6).
- Ignoring LLM-Judge Fallback for Ambiguity: Embeddings capture semantic proximity but lack reasoning context. Enable --use-judge for borderline scores (0.4 to 0.7) to prevent false negatives on nuanced instruction-following drift.
- CI/CD Provider Misconfiguration: Failing to inject OPENROUTER_API_KEY or configure Ollama endpoints in CI runners causes silent fallbacks to stub/jaccard modes. Validate provider connectivity in a pre-flight step before diff execution.
- Prompt Suite Version Drift: Not version-controlling prompt suites alongside model artifacts breaks reproducibility. Tag prompt YAMLs with semantic versions and map them to model releases in your CI pipeline.
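The threshold and judge pitfalls can be combined in a small piece of custom automation. The CLI examples in this document pass a single --threshold per run, so the per-category table below is a hypothetical wrapper-level policy, not a built-in feature. The specific values (0.90 for safety and coding, 0.60 for conversational) are illustrative picks consistent with the guidance above and should be replaced by calibrated numbers.

```python
# Hypothetical per-category thresholds; calibrate against your own data.
CATEGORY_THRESHOLDS = {
    "safety": 0.90,
    "coding": 0.90,
    "conversational": 0.60,
}
DEFAULT_THRESHOLD = 0.85
JUDGE_BAND = (0.4, 0.7)  # borderline range where a judge pass is suggested


def is_change(category: str, similarity: float) -> bool:
    """Flag a change when similarity falls below the category's threshold."""
    threshold = CATEGORY_THRESHOLDS.get(category, DEFAULT_THRESHOLD)
    return similarity < threshold


def needs_judge(similarity: float) -> bool:
    """Return True for borderline scores that merit an LLM-judge pass."""
    low, high = JUDGE_BAND
    return low <= similarity <= high


# The same score is a change for a safety prompt but not a conversational one.
print(is_change("safety", 0.80), is_change("conversational", 0.80))
print(needs_judge(0.55))
```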
Deliverables
- Behavioral Diff Blueprint: End-to-end CI/CD integration guide covering stub testing, OpenRouter/Ollama production runs, artifact storage, and threshold calibration workflows.
- Pre-Deployment Checklist: 12-step validation protocol including provider auth verification, embedding model caching, prompt suite versioning, severity bucket alignment, and report archiving.
- Configuration Templates: Production-ready prompts/default.yaml schema, llm-diff run CLI scripts for multi-provider environments, and Python API integration snippets for custom automation pipelines.