hitecture Decisions
- AST Parsing Over Regex: Regex cannot reliably distinguish between a string literal assignment, a function parameter, or a dictionary key. AST parsing guarantees structural accuracy and enables parent-node context extraction.
- Feature Vector Construction: Instead of hardcoding thresholds, the scanner builds a normalized vector. This allows the scoring engine to be swapped between heuristic rules and machine learning models without refactoring the extraction layer.
- Semantic Vocabulary Weighting: The identifier score is derived from a curated lexicon of credential-related terms, abbreviations, and contextual modifiers. This captures the identifier-name signal, which is weighted 0.28 in the scoring function, without requiring full NLP models.
- Contextual Override Layer: Framework patterns (ORM fields, API route parameters) are filtered using parent-node type checking. This prevents false positives from schema definitions.
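The decoupling described under Feature Vector Construction can be sketched as follows. This is a minimal illustration, not the scanner's actual API: the `Scorer` interface, `heuristicScorer`, `makeLinearModelScorer`, and `classify` are hypothetical names, and the logistic "model" only stands in for a trained one (its coefficients are placeholders).

```typescript
// Hypothetical sketch: the extraction layer emits a vector, any Scorer consumes it.
interface FeatureVector {
  identifierScore: number; // 0..1, from the lexicon
  entropy: number;         // Shannon entropy, bits per character
  length: number;          // literal length in characters
  patternMatch: boolean;   // known credential prefix found
  contextType: string;     // e.g. 'variable_assignment'
}

interface Scorer {
  score(vector: FeatureVector): number; // risk in 0..1
}

// Rule-based scorer using the weights from this article.
const heuristicScorer: Scorer = {
  score(v: FeatureVector): number {
    return (
      v.identifierScore * 0.28 +
      Math.min(v.entropy / 4.0, 1.0) * 0.22 +
      (v.patternMatch ? 0.35 : 0.0) +
      (v.length > 16 ? 0.15 : 0.0)
    );
  },
};

// Stand-in for a learned model: a logistic function over the same features.
function makeLinearModelScorer(coef: number[], bias: number): Scorer {
  return {
    score(v: FeatureVector): number {
      const x = [
        v.identifierScore,
        Math.min(v.entropy / 4.0, 1.0),
        v.patternMatch ? 1 : 0,
        v.length > 16 ? 1 : 0,
      ];
      let z = bias;
      for (let i = 0; i < x.length; i++) z += (coef[i] ?? 0) * (x[i] ?? 0);
      return 1 / (1 + Math.exp(-z));
    },
  };
}

// The caller never needs to know which engine produced the score.
function classify(vector: FeatureVector, scorer: Scorer): number {
  return scorer.score(vector);
}
```

Because both engines accept the same `FeatureVector`, switching between them is a one-argument change at the call site and the extraction code stays untouched.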
Implementation
import { parse } from '@typescript-eslint/parser';
import { TSESTree } from '@typescript-eslint/types';
// Semantic lexicon with base risk weights
const CREDENTIAL_LEXICON: Record<string, number> = {
  password: 0.95, passwd: 0.90, pwd: 0.85,
  secret: 0.90, secret_key: 0.92, client_secret: 0.93,
  api_key: 0.88, apikey: 0.87, api_token: 0.89,
  token: 0.75, access_token: 0.85, auth_token: 0.86,
  private_key: 0.94, privkey: 0.91, pem: 0.80,
  credential: 0.88, credentials: 0.89, creds: 0.82,
  database_url: 0.85, db_url: 0.84, connection_string: 0.86,
};
const NON_SENSITIVE_LEXICON: Record<string, number> = {
  checksum: 0.05, hash: 0.08, digest: 0.06, fingerprint: 0.07,
  uuid: 0.04, guid: 0.04, id: 0.15, identifier: 0.12,
  version: 0.03, release: 0.03, build: 0.03,
  color: 0.02, hex: 0.05, integrity: 0.06, signature: 0.10,
};
interface FeatureVector {
  identifierScore: number;
  entropy: number;
  length: number;
  patternMatch: boolean;
  contextType: string;
}
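function calculateShannonEntropy(input: string): number {
  // Split by code point so character counts and total length agree,
  // even when the string contains characters outside the BMP
  const chars = Array.from(input);
  if (chars.length === 0) return 0;
  const freq: Record<string, number> = {};
  for (const char of chars) {
    freq[char] = (freq[char] || 0) + 1;
  }
  let entropy = 0;
  for (const char in freq) {
    const p = freq[char] / chars.length;
    entropy -= p * Math.log2(p);
  }
  return entropy; // bits per character
}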
function extractSemanticScore(identifier: string): number {
  const lower = identifier.toLowerCase();
  const normalized = lower.replace(/[_\-]/g, '');
  // Direct match. Own-property checks keep inherited keys such as
  // "constructor" or "toString" from being treated as lexicon entries.
  if (Object.prototype.hasOwnProperty.call(CREDENTIAL_LEXICON, lower)) {
    return CREDENTIAL_LEXICON[lower];
  }
  if (Object.prototype.hasOwnProperty.call(NON_SENSITIVE_LEXICON, lower)) {
    return NON_SENSITIVE_LEXICON[lower];
  }
  // Substring/abbreviation fallback. Short entries such as 'sk', 'cs', and 'key'
  // also match unrelated names ("tasks", "metrics", "keyboard"), so expect noise.
  const abbreviations = ['pass', 'pwd', 'sk', 'cs', 'tkn', 'cred', 'auth', 'secret', 'key', 'token'];
  for (const abbr of abbreviations) {
    if (normalized.includes(abbr)) return 0.75;
  }
  return 0.30; // Default neutral score
}
function buildFeatureVector(
  node: TSESTree.Literal,
  parent: TSESTree.Node
): FeatureVector {
  const value = String(node.value);
  let identifierScore = 0.30;
  let contextType = 'unknown';
  // Extract identifier from assignment or property
  if (parent.type === 'VariableDeclarator' && parent.id.type === 'Identifier') {
    identifierScore = extractSemanticScore(parent.id.name);
    contextType = 'variable_assignment';
  } else if (parent.type === 'Property' && parent.key.type === 'Identifier') {
    identifierScore = extractSemanticScore(parent.key.name);
    contextType = 'object_property';
  } else if (parent.type === 'AssignmentExpression' && parent.left.type === 'Identifier') {
    identifierScore = extractSemanticScore(parent.left.name);
    contextType = 'assignment_expression';
  }
  return {
    identifierScore,
    entropy: calculateShannonEntropy(value),
    length: value.length,
    patternMatch: /^(sk_live_|ghp_|AKIA|Bearer\s)/.test(value),
    contextType,
  };
}
function evaluateSecretRisk(vector: FeatureVector): { risk: number; action: string } {
  // Weighted scoring reflecting empirical feature importance
  const weightedScore =
    (vector.identifierScore * 0.28) +
    (Math.min(vector.entropy / 4.0, 1.0) * 0.22) +
    (vector.patternMatch ? 0.35 : 0.0) +
    (vector.length > 16 ? 0.15 : 0.0);
  if (weightedScore >= 0.65) {
    return { risk: weightedScore, action: 'BLOCK' };
  } else if (weightedScore >= 0.45) {
    return { risk: weightedScore, action: 'WARN' };
  }
  return { risk: weightedScore, action: 'ALLOW' };
}
export function scanSourceCode(source: string): Array<{ line: number; risk: number; action: string }> {
  const ast = parse(source, { loc: true, range: true });
  const findings: Array<{ line: number; risk: number; action: string }> = [];
  // Simple DFS traversal. Only string Literal nodes are scored; template
  // literals are TemplateLiteral nodes and are not scored by this check.
  function traverse(node: TSESTree.Node, parent?: TSESTree.Node) {
    if (node.type === 'Literal' && typeof node.value === 'string' && parent) {
      const vector = buildFeatureVector(node, parent);
      const result = evaluateSecretRisk(vector);
      if (result.action !== 'ALLOW') {
        findings.push({ line: node.loc?.start.line || 0, ...result });
      }
    }
    for (const key in node) {
      // Skip location metadata and back-references to avoid cycles
      if (key === 'loc' || key === 'range' || key === 'parent') continue;
      const child = (node as any)[key];
      if (child && typeof child === 'object') {
        if (Array.isArray(child)) {
          child.forEach(c => c && typeof c === 'object' && c.type && traverse(c, node));
        } else if (child.type) {
          traverse(child, node);
        }
      }
    }
  }
  traverse(ast);
  return findings;
}
Rationale
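- AST Traversal: Guarantees structural awareness. The scanner knows whether a string sits in a variable declaration, an object property, or an assignment expression; anything else, including function arguments, is tagged unknown. That context is what lets later stages suppress false positives from schema definitions or route handlers.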
- Lexicon-Driven Scoring: The CREDENTIAL_LEXICON and NON_SENSITIVE_LEXICON capture the semantic signal that drives the 0.28 feature importance. Abbreviation fallback handles conventional shorthand without requiring exhaustive pattern lists.
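- Weighted Evaluation: The scoring function mirrors the Random Forest feature importance distribution. In the code as written, pattern matching carries the largest weight (0.35), followed by identifier semantics (0.28), normalized entropy (0.22), and length (0.15). Because entropy contributes at most 0.22, a long high-entropy value with a benign name such as checksum stays below the WARN threshold (0.05 × 0.28 + 0.22 + 0.15 = 0.384), while a credential-like name typically lifts a long, low-entropy value into WARN. Without a pattern match the score tops out at 0.95 × 0.28 + 0.22 + 0.15 = 0.636, so BLOCK in practice requires a known prefix.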
- Contextual Override: The contextType field enables downstream filtering. ORM fields, test fixtures, and configuration templates can be excluded at the CI/CD layer without modifying the core scanner.
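To make the weighting concrete, here is a small worked sketch that re-implements the scoring formula from the Implementation section in a standalone form and runs it on three literals. The helper names (`shannonEntropy`, `scoreLiteral`) and the sample values are illustrative assumptions; the key string is obviously fake.

```typescript
type Action = 'ALLOW' | 'WARN' | 'BLOCK';

// Shannon entropy in bits per character.
function shannonEntropy(input: string): number {
  const counts: Record<string, number> = {};
  for (let i = 0; i < input.length; i++) {
    const ch = input.charAt(i);
    counts[ch] = (counts[ch] || 0) + 1;
  }
  let entropy = 0;
  for (const ch of Object.keys(counts)) {
    const p = (counts[ch] || 0) / input.length;
    entropy -= p * Math.log2(p);
  }
  return entropy;
}

// Same weights and thresholds as evaluateSecretRisk in the article.
function scoreLiteral(identifierScore: number, value: string): { risk: number; action: Action } {
  const patternMatch = /^(sk_live_|ghp_|AKIA|Bearer\s)/.test(value);
  const risk =
    identifierScore * 0.28 +
    Math.min(shannonEntropy(value) / 4.0, 1.0) * 0.22 +
    (patternMatch ? 0.35 : 0.0) +
    (value.length > 16 ? 0.15 : 0.0);
  const action: Action = risk >= 0.65 ? 'BLOCK' : risk >= 0.45 ? 'WARN' : 'ALLOW';
  return { risk, action };
}

// SHA-256 digest of the empty input, used here as a high-entropy non-secret.
const sha256Hex = 'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855';

const cases = [
  { name: 'checksum', identifierScore: 0.05, value: sha256Hex },
  { name: 'password', identifierScore: 0.95, value: 'hunter2' },
  { name: 'data_1', identifierScore: 0.30, value: 'sk_live_EXAMPLEonly0123456789' },
];

for (const c of cases) {
  const r = scoreLiteral(c.identifierScore, c.value);
  console.log(`${c.name}: risk=${r.risk.toFixed(3)} action=${r.action}`);
}
```

Under these weights the checksum is allowed and the prefixed key is blocked even with a generic name. The short password scores about 0.42 and is also allowed, which suggests that short, low-entropy credentials may need a lower WARN threshold or an identifier-only rule if they are in scope.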
Pitfall Guide
1. ORM and Schema Field False Positives
Explanation: Frameworks like Django, Rails, or TypeORM define model attributes named password, token, or secret. These are schema definitions, not credential assignments. Scanners that only check identifier names will flag them.
Fix: Inspect the parent node type. If the assignment target is a class property definition, model field, or decorator argument, suppress the alert. The feature vector should include an isSchemaDefinition flag derived from AST context.
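One way to derive such a flag is to walk a few ancestors of the literal and look for node types that usually mark a declaration of shape. The sketch below is an assumption-laden illustration: `AstNode` is a minimal stand-in for `TSESTree.Node`, and `isSchemaDefinition`, `SCHEMA_CONTEXT_TYPES`, and the depth limit are hypothetical choices.

```typescript
// Minimal node shape for illustration; real code would use TSESTree.Node.
interface AstNode {
  type: string;
}

// Node types that usually indicate a schema declaration rather than a credential assignment.
// Caveat: a class property initialized with a real secret is also a PropertyDefinition,
// so stricter setups may want to keep only 'Decorator' in this list.
const SCHEMA_CONTEXT_TYPES = ['PropertyDefinition', 'Decorator'];

// `ancestors` is ordered from the literal's direct parent outward.
function isSchemaDefinition(ancestors: AstNode[], maxDepth: number = 4): boolean {
  return ancestors
    .slice(0, maxDepth)
    .some(node => SCHEMA_CONTEXT_TYPES.indexOf(node.type) !== -1);
}

// Ancestor chain for the literal in: @Column({ name: 'password' })
const decoratorChain: AstNode[] = [
  { type: 'Property' },
  { type: 'ObjectExpression' },
  { type: 'CallExpression' },
  { type: 'Decorator' },
];

// Ancestor chain for the literal in: const password = 'hunter2';
const assignmentChain: AstNode[] = [
  { type: 'VariableDeclarator' },
  { type: 'VariableDeclaration' },
  { type: 'Program' },
];

console.log(isSchemaDefinition(decoratorChain));  // true
console.log(isSchemaDefinition(assignmentChain)); // false
```

The traversal in the Implementation section passes only the direct parent, so using this check would mean threading an ancestor list through `traverse`.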
2. Obfuscated or Random Identifiers
Explanation: Malicious actors or careless developers may assign credentials to generic names like data_1, temp, or x. The semantic score drops to baseline, relying entirely on entropy and pattern matching.
Fix: Implement a secondary heuristic: if identifierScore < 0.35 but patternMatch === true and entropy > 3.5, escalate to WARN. Additionally, scan configuration files and environment templates where obfuscation is less common.
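The secondary heuristic can be expressed as a small post-processing step that only ever raises the action. This is a sketch under the thresholds quoted above; `escalateForGenericName` and the `Signals` shape are hypothetical names.

```typescript
type Action = 'ALLOW' | 'WARN' | 'BLOCK';

interface Signals {
  identifierScore: number;
  entropy: number;      // bits per character
  patternMatch: boolean;
}

// Raise ALLOW to WARN when the name is uninformative but the value looks like a credential.
// Never downgrades an existing WARN or BLOCK.
function escalateForGenericName(current: Action, s: Signals): Action {
  const genericName = s.identifierScore < 0.35;
  if (current === 'ALLOW' && genericName && s.patternMatch && s.entropy > 3.5) {
    return 'WARN';
  }
  return current;
}

console.log(escalateForGenericName('ALLOW', { identifierScore: 0.30, entropy: 4.2, patternMatch: true }));
```

With the default weights, a pattern match at entropy above 3.5 already contributes 0.35 + 0.1925 = 0.5425 and clears the 0.45 WARN threshold on its own, so this guard mainly acts as a safety net once weights or thresholds are tuned per repository.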
3. Internationalization Gaps
Explanation: The semantic lexicon is English-centric. Identifiers like Passwort, motDePasse, or senha default to the neutral 0.30 score, reducing detection accuracy in multinational codebases.
Fix: Maintain a locale-aware extension map. Allow teams to inject regional vocabulary via configuration. The scoring engine should support dynamic lexicon merging without recompilation.
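Dynamic lexicon merging can be as simple as layering extension maps over the base lexicon at startup. In the sketch below, `mergeLexicons` is a hypothetical helper and the weights given to the regional terms are assumptions for illustration, not calibrated values.

```typescript
type Lexicon = Record<string, number>;

// Returns a new lexicon; later extensions override earlier entries.
// Keys are lower-cased to match the scanner's direct-match lookup.
function mergeLexicons(base: Lexicon, ...extensions: Lexicon[]): Lexicon {
  const merged: Lexicon = { ...base };
  for (const ext of extensions) {
    for (const term of Object.keys(ext)) {
      const weight = ext[term];
      if (typeof weight !== 'number' || weight < 0 || weight > 1) {
        throw new RangeError(`Weight for "${term}" must be a number between 0 and 1`);
      }
      merged[term.toLowerCase()] = weight;
    }
  }
  return merged;
}

const baseLexicon: Lexicon = { password: 0.95, secret: 0.90 };

// Assumed regional vocabulary, e.g. loaded from a team configuration file.
const regionalLexicon: Lexicon = { Passwort: 0.95, senha: 0.95, motDePasse: 0.95 };

const activeLexicon = mergeLexicons(baseLexicon, regionalLexicon);
console.log(activeLexicon['motdepasse']); // 0.95
```

Because the merge returns a fresh object, the base lexicon stays unchanged and different repositories can load different extensions side by side.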
4. Ignoring Parent Node Context
Explanation: A string literal inside a function parameter named token may be a routing parameter, not a credential. Similarly, config.key might reference a feature flag, not an API key.
Fix: Require context validation before scoring. If the parent is a function parameter, route handler, or feature flag definition, apply a context penalty to the identifier score. Use AST type checking to distinguish assignment from declaration.
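A context penalty can be applied as a multiplier on the identifier score before the weighted sum is computed. The sketch below assumes context labels beyond the three the scanner currently emits; `function_parameter`, `route_parameter`, and `feature_flag` are hypothetical, and the multipliers are illustrative rather than tuned.

```typescript
// Hypothetical context labels and multipliers (1.0 means no penalty).
const CONTEXT_PENALTY = new Map<string, number>([
  ['function_parameter', 0.5],
  ['route_parameter', 0.4],
  ['feature_flag', 0.5],
]);

// Maps a parent node type to a context label. 'AssignmentPattern' is the ESTree
// node for a parameter default such as: function f(token = 'abc') {}
function contextFromParentType(parentType: string): string {
  if (parentType === 'AssignmentPattern') return 'function_parameter';
  if (parentType === 'VariableDeclarator') return 'variable_assignment';
  return 'unknown';
}

function applyContextPenalty(identifierScore: number, contextType: string): number {
  const factor = CONTEXT_PENALTY.get(contextType) ?? 1.0;
  return identifierScore * factor;
}

const rawScore = 0.75; // e.g. an identifier named "token"
console.log(applyContextPenalty(rawScore, contextFromParentType('AssignmentPattern')));  // 0.375
console.log(applyContextPenalty(rawScore, contextFromParentType('VariableDeclarator'))); // 0.75
```

Keeping the penalty separate from the lexicon means the same identifier can score differently depending on where it appears, which is the distinction this pitfall is about.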
5. Threshold Tuning Missteps
Explanation: Hardcoding a single risk threshold (e.g., >= 0.6) causes either alert fatigue or missed detections. Different repositories have different risk tolerances.
Fix: Implement tiered thresholds with repository-level overrides. Use BLOCK for high-confidence patterns (patternMatch === true), WARN for semantic-heavy cases, and ALLOW for low-risk contexts. Expose threshold configuration in CI/CD pipelines.
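Tiered thresholds with repository-level overrides might look like the following sketch. `resolveThresholds` and `decide` are hypothetical names; the defaults come from the scoring function in this article, and reserving BLOCK for pattern matches follows the fix described above.

```typescript
type Action = 'ALLOW' | 'WARN' | 'BLOCK';

interface Thresholds {
  block: number;
  warn: number;
}

const DEFAULT_THRESHOLDS: Thresholds = { block: 0.65, warn: 0.45 };

// Layers a repository override on top of the defaults and validates the result.
function resolveThresholds(repoOverride?: Partial<Thresholds>): Thresholds {
  const t = { ...DEFAULT_THRESHOLDS, ...repoOverride };
  if (!(t.warn >= 0 && t.warn <= t.block && t.block <= 1)) {
    throw new RangeError('Thresholds must satisfy 0 <= warn <= block <= 1');
  }
  return t;
}

// BLOCK is reserved for high-confidence pattern matches; semantic-only
// evidence is capped at WARN even when the score is high.
function decide(risk: number, patternMatch: boolean, t: Thresholds): Action {
  if (patternMatch && risk >= t.block) return 'BLOCK';
  if (risk >= t.warn) return 'WARN';
  return 'ALLOW';
}

// A stricter repository lowers the WARN threshold only.
const strict = resolveThresholds({ warn: 0.40 });
console.log(decide(0.42, false, strict));             // WARN
console.log(decide(0.42, false, DEFAULT_THRESHOLDS)); // ALLOW
```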
6. Treating All High-Entropy Strings Equally
Explanation: Base64-encoded images, cryptographic hashes, and serialized payloads all exhibit high entropy. Flagging them creates noise and erodes trust in the scanner.
Fix: Add a content-type heuristic. If the string matches base64 padding patterns, hex digest formats, or known serialization prefixes, apply an entropy discount. The feature vector should include a contentType classification.
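A rough content-type heuristic is sketched below. The category names, regular expressions, and discount factors are assumptions chosen for illustration; in particular, some real API keys are plain 32-character hex strings, so the hex discount should be combined with the identifier score rather than trusted on its own.

```typescript
type ContentType = 'data_uri' | 'hex_digest' | 'base64_blob' | 'unknown';

function classifyContent(value: string): ContentType {
  // data: URIs that carry base64 payloads (inline images, fonts)
  if (/^data:[\w.+-]+\/[\w.+-]+;base64,/.test(value)) return 'data_uri';
  // Hex strings whose length matches MD5 (32), SHA-1 (40), SHA-256 (64), or SHA-512 (128)
  if (/^[0-9a-fA-F]+$/.test(value) && [32, 40, 64, 128].indexOf(value.length) !== -1) {
    return 'hex_digest';
  }
  // Long, padded base64 whose length is a multiple of four
  if (value.length >= 64 && value.length % 4 === 0 && /^[A-Za-z0-9+/]+={1,2}$/.test(value)) {
    return 'base64_blob';
  }
  return 'unknown';
}

// Illustrative multipliers applied to raw entropy before normalization.
const ENTROPY_DISCOUNT: Record<ContentType, number> = {
  data_uri: 0.25,
  hex_digest: 0.5,
  base64_blob: 0.5,
  unknown: 1.0,
};

function discountedEntropy(entropy: number, value: string): number {
  return entropy * ENTROPY_DISCOUNT[classifyContent(value)];
}

const digest = 'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855';
console.log(classifyContent(digest));        // hex_digest
console.log(discountedEntropy(4.0, digest)); // 2
```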
7. Static Rule Hardcoding vs Adaptive Scoring
Explanation: Hardcoding weights and thresholds makes the scanner brittle. As codebases evolve, new naming conventions emerge, and static rules degrade.
Fix: Decouple the scoring engine from the extraction layer. Allow the evaluation function to be swapped with a lightweight ML model or rule engine. Store feature weights in external configuration and version them alongside the scanner.
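Externalized weights can be validated at load time so a bad configuration fails fast instead of silently skewing scores. The sketch below assumes a JSON document with a `version` string and four weights; the field names and the `parseWeightConfig` helper are hypothetical, not part of any published scanner.

```typescript
interface FeatureWeights {
  identifier: number;
  entropy: number;
  pattern: number;
  length: number;
}

interface WeightConfig {
  version: string;
  weights: FeatureWeights;
}

function parseWeightConfig(json: string): WeightConfig {
  const raw: unknown = JSON.parse(json);
  if (typeof raw !== 'object' || raw === null) {
    throw new TypeError('Config must be a JSON object');
  }
  const obj = raw as { version?: unknown; weights?: Record<string, unknown> };
  if (typeof obj.version !== 'string') {
    throw new TypeError('Config needs a string "version"');
  }
  const source = obj.weights ?? {};
  const read = (name: string): number => {
    const value = source[name];
    if (typeof value !== 'number' || !isFinite(value) || value < 0 || value > 1) {
      throw new RangeError(`Weight "${name}" must be a number between 0 and 1`);
    }
    return value;
  };
  const weights: FeatureWeights = {
    identifier: read('identifier'),
    entropy: read('entropy'),
    pattern: read('pattern'),
    length: read('length'),
  };
  const sum = weights.identifier + weights.entropy + weights.pattern + weights.length;
  // Weights are expected to form a convex combination so the risk stays in 0..1.
  if (Math.abs(sum - 1) > 1e-6) {
    throw new RangeError(`Weights must sum to 1, got ${sum}`);
  }
  return { version: obj.version, weights };
}

const config = parseWeightConfig(
  '{"version":"2024-01","weights":{"identifier":0.28,"entropy":0.22,"pattern":0.35,"length":0.15}}'
);
console.log(config.version, config.weights.pattern);
```

Versioning the file alongside the scanner makes it possible to trace any finding back to the exact weights that produced it.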
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Monorepo with mixed frameworks | AST + Semantic Lexicon + Context Override | Prevents ORM/schema false positives across diverse codebases | Low (configuration-driven) |
| Legacy codebase with obfuscated names | Entropy + Pattern Match Fallback | Semantic scores are unreliable, so entropy and known-prefix signals carry the decision | Medium (higher false positives) |
| High-compliance environment (SOC2, HIPAA) | Pre-commit Hook + Tiered Thresholds | Blocks credentials before commit; audit trail meets compliance | Low (developer friction minimal) |
| International team / Non-English codebase | Locale-Extended Lexicon + Dynamic Scoring | Captures regional credential naming conventions | Low (configuration update) |
| CI/CD pipeline integration | Lightweight Scanning Mode + Async Reporting | Reduces pipeline latency; defers detailed analysis to post-merge | Low (infrastructure cost neutral) |
Configuration Template
# secrets-scanner.config.yaml
scanner:
  mode: semantic-hybrid
  threshold:
    block: 0.65
    warn: 0.45
  features:
    identifier_weight: 0.28
    entropy_weight: 0.22
    pattern_weight: 0.35
    length_weight: 0.15
lexicon:
  credentials:
    - password
    - secret_key
    - api_token
    - private_key
    - connection_string
  non_sensitive:
    - checksum
    - uuid
    - version
    - integrity
  abbreviations:
    - pass
    - sk
    - cs
    - tkn
    - cred
context_overrides:
  suppress_on:
    - VariableDeclarator.schema_field
    - Property.route_parameter
    - Decorator.model_attribute
  exclude_directories:
    - __tests__/
    - mocks/
    - fixtures/
ci_integration:
  hook: pre-commit
  timeout_ms: 2000
  report_format: sarif
  allow_override: false
Quick Start Guide
- Install Dependencies: Add @typescript-eslint/parser and @typescript-eslint/types to your project. Ensure Node.js 18+ is available.
- Initialize Configuration: Copy the YAML template into your repository root. Adjust thresholds and lexicon entries to match your team's naming conventions.
- Create Pre-Commit Hook: Use husky or simple-git-hooks to run the scanner before git commit. Configure it to exit with status 1 on BLOCK findings.
- Validate with Test Cases: Run the scanner against a sample file containing known credentials, hashes, and ORM definitions. Verify that semantic scoring suppresses false positives and catches low-entropy secrets.
- Deploy to CI/CD: Add the scanner as a pipeline step with report_format: sarif. Integrate with your security dashboard for trend analysis and threshold tuning.
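The hook and reporting steps above can be sketched as two small functions over the scanner's findings. This is an illustration under assumptions: the tool name `secrets-scanner` and the rule id `hardcoded-credential` are hypothetical, and the SARIF object includes only a minimal subset of SARIF 2.1.0 fields.

```typescript
interface Finding {
  line: number;
  risk: number;
  action: string; // 'BLOCK' | 'WARN' | 'ALLOW'
}

// Exit status for a pre-commit hook: non-zero only when something must be blocked.
function exitCodeFor(findings: Finding[]): number {
  return findings.some(f => f.action === 'BLOCK') ? 1 : 0;
}

// Converts findings for one file into a minimal SARIF 2.1.0 log object.
function toSarif(fileUri: string, findings: Finding[]) {
  return {
    version: '2.1.0',
    runs: [
      {
        tool: {
          driver: { name: 'secrets-scanner', rules: [{ id: 'hardcoded-credential' }] },
        },
        results: findings.map(f => ({
          ruleId: 'hardcoded-credential',
          level: f.action === 'BLOCK' ? 'error' : 'warning',
          message: { text: `Possible hardcoded credential (risk ${f.risk.toFixed(2)})` },
          locations: [
            {
              physicalLocation: {
                artifactLocation: { uri: fileUri },
                // SARIF line numbers start at 1; the scanner uses 0 when loc is missing.
                region: { startLine: Math.max(1, f.line) },
              },
            },
          ],
        })),
      },
    ],
  };
}

const sampleFindings: Finding[] = [
  { line: 12, risk: 0.80, action: 'BLOCK' },
  { line: 0, risk: 0.50, action: 'WARN' },
];

console.log(exitCodeFor(sampleFindings));
console.log(JSON.stringify(toSarif('src/config.ts', sampleFindings), null, 2));
```

In a hook script, the value returned by `exitCodeFor` would be passed to the process exit call so that git aborts the commit on BLOCK findings while WARN findings are only reported.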
The shift from syntactic pattern matching to semantic-aware classification is not a theoretical exercise. It is a direct response to how credentials actually enter codebases. Developers label what they know. Capturing that label transforms secrets detection from a noisy audit into a precise, automated gate. Implement the AST extraction, weight the identifier signal, and enforce pre-commit interception. The architecture scales, the false positives drop, and the security posture strengthens without adding developer friction.