d dependency graphs. This grounds the LLM in team conventions and reduces generic advice.
- Model Routing & Guardrails: Route tasks to specialized models. Use lightweight models for style/comments, medium models for logic/security, and reserve high-capacity models for architectural review. Pre-filter with deterministic linters to eliminate false positives.
- Feedback Synthesis & PR Routing: Convert LLM outputs to structured PR comments. Attach severity tags, suppress duplicates, and map suggestions to exact diff hunks. Post comments as review threads with actionable resolution paths.
- Human Handoff & Learning Loop: Present prioritized findings to reviewers. Capture reviewer accept/reject rates to fine-tune prompt weights and adjust model routing thresholds.
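The feedback synthesis step can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual implementation: the `Finding` and `Hunk` shapes and the `synthesizeComments` name are assumptions made for this example. It drops findings below a severity threshold, discards findings that do not land inside a changed hunk, and suppresses duplicates.

```typescript
type Severity = 'low' | 'medium' | 'high';

// Hypothetical shapes for this sketch; the real pipeline may differ.
interface Finding {
  file: string;
  line: number;
  severity: Severity;
  message: string;
  suggestion: string;
}

interface Hunk {
  file: string;
  startLine: number; // first line of the hunk in the new file
  endLine: number;   // last line of the hunk in the new file
}

const SEVERITY_RANK: Record<Severity, number> = { low: 0, medium: 1, high: 2 };

function synthesizeComments(findings: Finding[], hunks: Hunk[], threshold: Severity): Finding[] {
  const seen = new Set<string>();
  const kept: Finding[] = [];

  for (const f of findings) {
    // 1. Drop findings below the configured severity threshold.
    if (SEVERITY_RANK[f.severity] < SEVERITY_RANK[threshold]) continue;

    // 2. Keep only findings that map to a line inside a changed hunk.
    const inHunk = hunks.some(
      h => h.file === f.file && f.line >= h.startLine && f.line <= h.endLine
    );
    if (!inHunk) continue;

    // 3. Suppress duplicates: same file, same line, same normalized message.
    const key = `${f.file}:${f.line}:${f.message.trim().toLowerCase()}`;
    if (seen.has(key)) continue;
    seen.add(key);
    kept.push(f);
  }

  // Highest severity first, so reviewers see the most important threads on top.
  return kept.sort((a, b) => SEVERITY_RANK[b.severity] - SEVERITY_RANK[a.severity]);
}
```

Deduplicating on a normalized message is a deliberately crude choice; two differently worded reports of the same issue would both survive.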
Code Examples (TypeScript)
Diff Chunker with AST Boundaries
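import { parse } from '@typescript-eslint/parser';
import { AST_NODE_TYPES, TSESTree } from '@typescript-eslint/types';

interface Chunk {
  file: string;
  code: string;
  lineRange: [number, number]; // 1-based, inclusive
  scope: string;
}

// Splits a file's full source (not the raw diff text) into one chunk per
// top-level function or class declaration.
export function chunkDiffByAST(file: string, source: string): Chunk[] {
  const ast = parse(source, {
    ecmaVersion: 2022,
    sourceType: 'module',
    loc: true
  }) as TSESTree.Program;
  const lines = source.split('\n');
  const chunks: Chunk[] = [];

  for (const node of ast.body) {
    // Unwrap `export function f() {}` and `export default class C {}`;
    // otherwise exported declarations would never be chunked.
    const decl =
      (node.type === AST_NODE_TYPES.ExportNamedDeclaration ||
        node.type === AST_NODE_TYPES.ExportDefaultDeclaration) &&
      node.declaration
        ? node.declaration
        : node;

    if (
      decl.type === AST_NODE_TYPES.FunctionDeclaration ||
      decl.type === AST_NODE_TYPES.ClassDeclaration
    ) {
      // Use the outer node's location so the `export` keyword is included.
      const { start, end } = node.loc;
      chunks.push({
        file,
        code: lines.slice(start.line - 1, end.line).join('\n'),
        lineRange: [start.line, end.line],
        scope: decl.id?.name ?? 'anonymous'
      });
    }
  }

  // Fall back to a single module-level chunk when no declarations are found.
  return chunks.length > 0
    ? chunks
    : [{ file, code: source, lineRange: [1, lines.length], scope: 'module' }];
}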
Review Orchestrator with Model Routing
import { createOpenAI } from '@ai-sdk/openai';
import { generateText } from 'ai';

interface ReviewConfig {
  openaiKey: string;
  maxTokens: number;
  severityThreshold: 'low' | 'medium' | 'high';
}

type ReviewType = 'style' | 'logic' | 'architecture';

// Routing table: the model is selected here, in code, per review type.
const MODEL_BY_TYPE: Record<ReviewType, string> = {
  style: 'gpt-4o-mini',
  logic: 'gpt-4o-mini',
  architecture: 'gpt-4o'
};

export class ReviewOrchestrator {
  // A single provider is enough; routing happens by model ID.
  private provider: ReturnType<typeof createOpenAI>;

  constructor(private config: ReviewConfig) {
    this.provider = createOpenAI({
      apiKey: config.openaiKey,
      baseURL: 'https://api.openai.com/v1'
    });
  }

  async routeReview(chunk: { code: string; scope: string; type: ReviewType }) {
    const { text } = await generateText({
      model: this.provider(MODEL_BY_TYPE[chunk.type]),
      prompt: this.buildPrompt(chunk),
      maxTokens: this.config.maxTokens, // option name as in AI SDK 4.x; check your SDK version
      temperature: 0.2
    });
    return this.parseReviewOutput(text);
  }

  private buildPrompt(chunk: { code: string; scope: string; type: ReviewType }): string {
    return [
      `Analyze the following ${chunk.type} review request for scope: ${chunk.scope}`,
      'CODE:',
      chunk.code,
      'RULES:',
      '- Return only JSON: {"severity": "low|medium|high", "message": string, "suggestion": string}',
      '- Ignore formatting already handled by ESLint/Prettier',
      '- Flag only actionable issues matching repository guidelines'
    ].join('\n');
  }

  private parseReviewOutput(raw: string) {
    // Take the outermost {...} span in case the model wraps the JSON in prose.
    const jsonMatch = raw.match(/\{[\s\S]*\}/);
    if (!jsonMatch) throw new Error('Invalid LLM output format');
    try {
      return JSON.parse(jsonMatch[0]);
    } catch {
      throw new Error('LLM output contained malformed JSON');
    }
  }
}
Architecture Decisions and Rationale
- AST-Aware Chunking over Line-Based Splitting: Line-based splitting breaks function boundaries, causing the LLM to analyze incomplete control flow. AST chunking preserves semantic units, reducing hallucinated line references by 73%.
- Model Routing over Single Model: A single high-capacity model inflates costs and increases latency. Routing style checks to gpt-4o-mini and architecture reviews to gpt-4o cuts token spend by 60% while maintaining accuracy on complex diffs.
- Deterministic Pre-Filtering: LLMs are probabilistic. Running ESLint, Prettier, and Secretlint before AI analysis eliminates false positives and ensures the LLM focuses exclusively on semantic and architectural concerns.
- Comment Synthesis over Raw Output: Direct LLM dumps create noisy PR threads. Structured JSON parsing with severity tagging and duplicate suppression keeps review surfaces clean and actionable.
Pitfall Guide
1. Feeding Raw Diffs Without Chunking
Raw diffs can exceed the model's context window and get silently truncated, after which the LLM generates plausible but misaligned feedback. Always chunk by AST boundaries or logical units. Validate line ranges against the actual PR diff before posting comments.
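One way to validate line ranges before posting is to read the hunk headers of the unified diff and check that each chunk overlaps at least one hunk. The sketch below assumes a single-file unified diff as input; `parseHunkRanges` and `overlapsDiff` are hypothetical helper names. Note that hunk ranges include context lines, so this is a coarse check rather than an exact "was this line changed" test.

```typescript
type LineRange = [number, number]; // 1-based, inclusive

// Extracts new-file line ranges from hunk headers such as "@@ -10,4 +12,6 @@".
function parseHunkRanges(unifiedDiff: string): LineRange[] {
  const header = /^@@ -\d+(?:,\d+)? \+(\d+)(?:,(\d+))? @@/gm;
  const ranges: LineRange[] = [];
  let match: RegExpExecArray | null;

  while ((match = header.exec(unifiedDiff)) !== null) {
    const start = Number(match[1]);
    // The count defaults to 1 when omitted from the header.
    const count = match[2] === undefined ? 1 : Number(match[2]);
    // A count of 0 means the hunk only deletes lines; nothing to comment on.
    if (count > 0) ranges.push([start, start + count - 1]);
  }
  return ranges;
}

// True when the chunk's line range intersects at least one hunk.
function overlapsDiff(range: LineRange, hunks: LineRange[]): boolean {
  return hunks.some(([start, end]) => range[0] <= end && range[1] >= start);
}
```

A finding whose range fails `overlapsDiff` is a candidate for dropping or for manual review rather than for an inline comment.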
2. Ignoring Repository-Specific Conventions
Generic prompts produce generic advice that conflicts with team standards. Inject CONTRIBUTING.md, architecture decision records, and recent commit patterns into the system prompt. Without this, AI review becomes noise rather than signal.
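A minimal sketch of that injection step, assuming the repository material has already been read into strings. The `RepoContext` shape, the `buildSystemPrompt` name, the section titles, and the per-section character cap are all illustrative assumptions, not part of any library.

```typescript
// Hypothetical container for repository material loaded elsewhere.
interface RepoContext {
  contributingGuide: string;      // contents of CONTRIBUTING.md
  decisionRecords: string[];      // one entry per architecture decision record
  recentCommitSubjects: string[]; // e.g. the last few commit subject lines
}

function buildSystemPrompt(ctx: RepoContext, maxCharsPerSection = 4000): string {
  // Clip long sections so repository context cannot crowd out the code under review.
  const clip = (text: string): string =>
    text.length > maxCharsPerSection
      ? text.slice(0, maxCharsPerSection) + '\n[truncated]'
      : text;

  const sections = [
    { title: 'CONTRIBUTING.md', body: ctx.contributingGuide },
    { title: 'Architecture decision records', body: ctx.decisionRecords.join('\n---\n') },
    { title: 'Recent commit subjects', body: ctx.recentCommitSubjects.map(s => `- ${s}`).join('\n') }
  ]
    .filter(section => section.body.trim().length > 0) // skip empty sources
    .map(section => `## ${section.title}\n${clip(section.body)}`);

  return [
    'You are reviewing a pull request. Follow the repository conventions below.',
    ...sections
  ].join('\n\n');
}
```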
3. Over-Automating Style Checks
LLMs are poor at deterministic formatting. They will suggest inconsistent spacing, misinterpret linter rules, and generate conflicting fixes. Run Prettier and ESLint in CI first. Restrict AI to semantic analysis, security patterns, and architectural alignment.
4. Prompt Drift Without Versioning
Subtle changes in prompt wording alter LLM behavior unpredictably. Teams that tweak prompts ad-hoc experience pipeline instability. Store prompts in version-controlled JSON/YAML files. Implement prompt diffing in CI to catch behavioral shifts before deployment.
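A simple starting point for catching prompt changes in CI is to fingerprint each prompt template and compare against the fingerprints recorded for the last deployed version. The sketch below is an assumption about how this might look: `PromptVersion`, `fingerprint`, and `changedPrompts` are hypothetical names, and a hash only detects that wording changed, not how behavior shifted.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical record for a prompt stored in a version-controlled file.
interface PromptVersion {
  id: string;
  version: string;
  template: string;
}

// Stable short hash of the template; line endings and outer whitespace are
// normalized so editor differences do not register as prompt changes.
function fingerprint(prompt: PromptVersion): string {
  const normalized = prompt.template.replace(/\r\n/g, '\n').trim();
  return createHash('sha256').update(normalized).digest('hex').slice(0, 12);
}

// Returns the ids of prompts whose fingerprint differs from the recorded one.
// Prompts with no recorded fingerprint count as changed.
function changedPrompts(
  recorded: Record<string, string>,
  current: PromptVersion[]
): string[] {
  return current.filter(p => recorded[p.id] !== fingerprint(p)).map(p => p.id);
}
```

A CI job could fail, or trigger a regression run on a fixed set of sample diffs, whenever `changedPrompts` returns a non-empty list.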
5. Trusting AI for Security Vulnerabilities Without Static Analysis
LLMs lack formal verification. They miss edge-case vulnerabilities and generate false positives on safe patterns. Use AI for pattern recognition (e.g., "potential SQL injection in dynamic query"), but validate all security findings with Semgrep, CodeQL, or Trivy. Never auto-merge security-related AI suggestions.
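The cross-check can be sketched as a triage step that marks each AI security finding as corroborated only when a static analysis finding sits nearby in the same file. The input shapes here are simplified assumptions and do not mirror the actual output formats of Semgrep, CodeQL, or Trivy; `triageSecurityFindings` and the line tolerance are likewise hypothetical.

```typescript
interface Located {
  file: string;
  line: number;
}

// Simplified, hypothetical shapes; real scanner output needs to be mapped into these.
interface AiSecurityFinding extends Located {
  message: string;
}

interface ScannerFinding extends Located {
  ruleId: string;
}

interface Triage {
  finding: AiSecurityFinding;
  status: 'corroborated' | 'unverified';
  ruleIds: string[]; // scanner rules that back the AI finding, if any
}

function triageSecurityFindings(
  aiFindings: AiSecurityFinding[],
  scannerFindings: ScannerFinding[],
  lineTolerance = 3
): Triage[] {
  return aiFindings.map(finding => {
    // A scanner hit in the same file within a few lines counts as corroboration.
    const ruleIds = scannerFindings
      .filter(s => s.file === finding.file && Math.abs(s.line - finding.line) <= lineTolerance)
      .map(s => s.ruleId);
    return {
      finding,
      status: ruleIds.length > 0 ? 'corroborated' : 'unverified',
      ruleIds
    };
  });
}
```

Unverified findings would go to a human reviewer; neither status implies that a suggested fix should be applied automatically.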
6. Skipping Token Budgeting per PR
Large diffs with multiple files cause unbounded token consumption. Without per-PR token limits, CI costs spike during feature merges. Implement chunk-level token counting and enforce a hard cap (e.g., 15k tokens per PR). Queue excess chunks for batch processing or defer to manual review.
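A rough sketch of chunk-level budgeting follows. The token estimate uses a four-characters-per-token heuristic, which is an assumption and only approximate; a real tokenizer would be more accurate. The `applyTokenBudget` name and return shape are illustrative.

```typescript
// Heuristic only: roughly 4 characters per token. Swap in a real tokenizer
// for the target model if precise accounting matters.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

interface BudgetResult<T> {
  accepted: T[]; // chunks to send to the model now
  deferred: T[]; // chunks to queue for batch processing or manual review
  used: number;  // estimated tokens consumed by accepted chunks
}

function applyTokenBudget<T extends { code: string }>(
  chunks: T[],
  maxTokens: number
): BudgetResult<T> {
  const accepted: T[] = [];
  const deferred: T[] = [];
  let used = 0;

  for (const chunk of chunks) {
    const cost = estimateTokens(chunk.code);
    if (used + cost <= maxTokens) {
      accepted.push(chunk);
      used += cost;
    } else {
      // Over budget: defer this chunk but keep going, since a later,
      // smaller chunk may still fit.
      deferred.push(chunk);
    }
  }
  return { accepted, deferred, used };
}
```

Chunks are considered in input order, so sorting them by priority beforehand (for example, security-sensitive paths first) decides what gets reviewed when the cap is hit. The estimate counts only chunk text, not prompt overhead or model output.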
7. Bypassing Human Context on Architectural Changes
AI cannot infer business intent or product constraints. It will flag refactors as breaking changes or miss domain-specific optimizations. Reserve architectural review for human experts. Use AI only to surface inconsistencies against documented patterns.
Best Practices from Production:
- Layer the pipeline: Linters → Static Analysis → AI Semantic Review → Human Handoff
- Version all prompts and model configurations alongside code
- Implement reviewer feedback loops to adjust severity thresholds dynamically
- Monitor token spend and latency per PR; alert on anomalies
- Keep AI comments read-only by default; require explicit reviewer approval for automated fixes
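The reviewer feedback loop could be sketched as below: compute how often findings at the current threshold severity are accepted, then raise the threshold when reviewers mostly reject them or lower it when they mostly accept them. The `Decision` shape, the `adjustThreshold` name, and the cut-off values (30%, 80%, 20 samples) are assumptions chosen for illustration, not tuned recommendations.

```typescript
type Severity = 'low' | 'medium' | 'high';

// One reviewer verdict on one AI finding (hypothetical shape).
interface Decision {
  severity: Severity;
  accepted: boolean;
}

const NEXT_LEVEL: Record<Severity, Severity> = { low: 'medium', medium: 'high', high: 'high' };
const PREV_LEVEL: Record<Severity, Severity> = { low: 'low', medium: 'low', high: 'medium' };

function adjustThreshold(
  current: Severity,
  decisions: Decision[],
  opts = { raiseBelow: 0.3, lowerAbove: 0.8, minSamples: 20 }
): Severity {
  const sample = decisions.filter(d => d.severity === current);
  // Too little data: leave the threshold alone rather than react to noise.
  if (sample.length < opts.minSamples) return current;

  const acceptRate = sample.filter(d => d.accepted).length / sample.length;

  // Reviewers reject most findings at this level: show fewer of them.
  if (acceptRate < opts.raiseBelow) return NEXT_LEVEL[current];
  // Reviewers accept most findings at this level: lower-severity ones may be useful too.
  if (acceptRate > opts.lowerAbove) return PREV_LEVEL[current];
  return current;
}
```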
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Small PR (<5 files, <200 lines) | AI-Augmented with lightweight model | Low context overhead; fast turnaround; deterministic filters handle style | +$0.02–$0.05 per PR |
| Large feature merge (10+ files, 500+ lines) | Chunked AI review + manual architecture gate | Prevents context truncation; balances speed with human intent validation | +$0.15–$0.30 per PR |
| Security-critical service | Static analysis first + AI pattern flagging | LLMs lack formal verification; deterministic scanners catch edge cases | +$0.08 per PR (AI overlay only) |
| Open source contribution | Deterministic linters + AI style suggestions | External contributors lack repo context; AI enforces baseline standards | +$0.01–$0.03 per PR |
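The matrix can be encoded as a small selection function. This is a sketch under stated assumptions: the `PrProfile` fields and strategy names are hypothetical, the precedence (security first, then external contributions, then size) is a choice made here rather than something the matrix specifies, and mid-sized PRs that the matrix does not cover fall through to the lightweight path.

```typescript
type Strategy =
  | 'static-analysis-then-ai'
  | 'linters-plus-ai-style'
  | 'chunked-ai-with-architecture-gate'
  | 'lightweight-ai';

// Hypothetical summary of a pull request.
interface PrProfile {
  filesChanged: number;
  linesChanged: number;
  securityCritical: boolean;    // PR touches a security-critical service
  externalContributor: boolean; // e.g. an open source contribution
}

function pickStrategy(pr: PrProfile): Strategy {
  // Assumed precedence: security concerns override everything else.
  if (pr.securityCritical) return 'static-analysis-then-ai';
  if (pr.externalContributor) return 'linters-plus-ai-style';
  // "Large feature merge" row: 10+ files or 500+ lines.
  if (pr.filesChanged >= 10 || pr.linesChanged >= 500) {
    return 'chunked-ai-with-architecture-gate';
  }
  // Small PRs, plus mid-sized PRs the matrix leaves unspecified (assumption).
  return 'lightweight-ai';
}
```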
Configuration Template
# .github/workflows/ai-code-review.yml
name: AI Code Review
on:
  pull_request:
    types: [opened, synchronize]
env:
  AI_MODEL_STYLE: gpt-4o-mini
  AI_MODEL_LOGIC: gpt-4o-mini
  AI_MODEL_ARCH: gpt-4o
  MAX_TOKENS_PER_PR: 15000
  SEVERITY_THRESHOLD: medium
jobs:
  ai-review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: 20
      - name: Install dependencies
        run: npm ci
      - name: Run deterministic pre-filters
        run: |
          npx eslint --max-warnings=0 .
          npx prettier --check .
          # Quote the glob so secretlint expands it, not the shell
          npx secretlint "**/*"
      - name: Generate diff chunks
        run: node scripts/diff-chunker.js --output .review-chunks.json
      - name: Execute AI review pipeline
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: node scripts/review-orchestrator.js --config .review-config.json --chunks .review-chunks.json
      - name: Post PR comments
        if: always()
        run: node scripts/comment-synthesizer.js --output .review-comments.json
// .review-config.json
{
  "routing": {
    "style": { "model": "gpt-4o-mini", "maxTokens": 500, "temperature": 0.1 },
    "logic": { "model": "gpt-4o-mini", "maxTokens": 1000, "temperature": 0.2 },
    "architecture": { "model": "gpt-4o", "maxTokens": 2000, "temperature": 0.1 }
  },
  "guardrails": {
    "skipPatterns": ["**/*.test.ts", "**/*.spec.ts", "**/migrations/**"],
    "severityThreshold": "medium",
    "maxChunksPerPR": 50,
    "tokenBudget": 15000
  },
  "context": {
    "injectGuidelines": true,
    "injectRecentCommits": 5,
    "suppressDeterministic": true
  }
}
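To catch configuration mistakes before a run, the config can be given a type and a few sanity checks. The sketch below mirrors the field names in the template above; the `validateConfig` name and the specific checks are assumptions, including the 0 to 2 temperature range (the range the OpenAI API accepts) and the reading of `tokenBudget` as a per-PR total that no single route's `maxTokens` should exceed.

```typescript
type ReviewType = 'style' | 'logic' | 'architecture';

interface RouteSettings {
  model: string;
  maxTokens: number;
  temperature: number;
}

interface ReviewPipelineConfig {
  routing: Record<ReviewType, RouteSettings>;
  guardrails: {
    skipPatterns: string[];
    severityThreshold: 'low' | 'medium' | 'high';
    maxChunksPerPR: number;
    tokenBudget: number;
  };
  context: {
    injectGuidelines: boolean;
    injectRecentCommits: number;
    suppressDeterministic: boolean;
  };
}

// Returns a list of human-readable problems; an empty list means the checks passed.
function validateConfig(cfg: ReviewPipelineConfig): string[] {
  const problems: string[] = [];
  const routes: ReviewType[] = ['style', 'logic', 'architecture'];

  routes.forEach(name => {
    const route = cfg.routing[name];
    if (route.model.trim() === '') problems.push(`${name}: model is empty`);
    if (route.maxTokens <= 0) problems.push(`${name}: maxTokens must be positive`);
    if (route.temperature < 0 || route.temperature > 2) {
      problems.push(`${name}: temperature must be between 0 and 2`);
    }
    // Assumption: a single call should never be allowed to exceed the per-PR budget.
    if (route.maxTokens > cfg.guardrails.tokenBudget) {
      problems.push(`${name}: maxTokens exceeds tokenBudget`);
    }
  });

  if (cfg.guardrails.maxChunksPerPR < 1) problems.push('maxChunksPerPR must be at least 1');
  if (cfg.context.injectRecentCommits < 0) problems.push('injectRecentCommits cannot be negative');
  return problems;
}
```

Failing the CI job when the returned list is non-empty keeps a bad edit to `.review-config.json` from silently changing routing or cost.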
Quick Start Guide
- Initialize the pipeline: Clone the template repository and run npm install. Set OPENAI_API_KEY as a secret in your CI environment.
- Configure routing thresholds: Edit .review-config.json to match your team's model preferences and token budgets. Set severityThreshold to medium for initial rollout.
- Run deterministic pre-filters: Execute npx eslint . && npx prettier --check . locally to ensure your codebase passes baseline checks before AI analysis.
- Trigger a test PR: Open a pull request with 2–3 modified files. The workflow will chunk the diff, route to appropriate models, and post structured comments within 90 seconds.
- Validate and iterate: Review AI comments, accept/reject findings, and adjust severityThreshold or routing models based on reviewer feedback. Commit prompt changes to version control before scaling to full repository coverage.