before processing. This decouples ingestion from execution and ensures consistent routing.
import { randomUUID } from 'node:crypto';

// Canonical event shape shared by every downstream stage.
interface SecurityEvent {
  id: string;
  timestamp: string; // ISO 8601, assigned at normalization time
  source: string; // e.g., 'edr', 'cloudtrail', 'siem'
  eventType: string;
  severity: 'low' | 'medium' | 'high' | 'critical';
  entity: {
    type: 'host' | 'user' | 'ip' | 'domain';
    value: string;
    metadata: Record<string, unknown>;
  };
  context: Record<string, unknown>;
}

const SEVERITIES: readonly SecurityEvent['severity'][] = ['low', 'medium', 'high', 'critical'];
const ENTITY_TYPES: readonly SecurityEvent['entity']['type'][] = ['host', 'user', 'ip', 'domain'];

export const normalizeEvent = (raw: unknown): SecurityEvent => {
  if (!raw || typeof raw !== 'object') throw new Error('Invalid event payload');
  const payload = raw as Record<string, unknown>;
  // Unrecognized values fall back to a default instead of being cast through unchecked.
  const severity = SEVERITIES.find((s) => s === payload.severity) ?? 'low';
  const entityType = ENTITY_TYPES.find((t) => t === payload.entity_type) ?? 'host';
  return {
    id: randomUUID(),
    timestamp: new Date().toISOString(),
    source: String(payload.source || 'unknown'),
    eventType: String(payload.event_type || 'unknown'),
    severity,
    entity: {
      type: entityType,
      value: String(payload.entity_value || ''),
      metadata: (payload.metadata || {}) as Record<string, unknown>,
    },
    context: (payload.context || {}) as Record<string, unknown>,
  };
};
Step 2: Enrichment & Triage
Enrichment transforms raw signals into actionable context. Query threat intelligence, asset inventory, and historical behavior databases. Apply deterministic scoring to determine if automation should proceed.
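import { Redis } from 'ioredis';

// process.env values may be undefined, so fall back to a local instance.
const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');

interface ThreatIntel { isMalicious: boolean; confidence: number } // confidence assumed on a 0-100 scale
interface AssetProfile { criticality: number; lastPatch: string; isProductionCritical: boolean }
interface IncidentHistory { previousIncidents: number; avgResponseTime: number }

export abstract class EnrichmentService {
  // Lookups are left abstract; back them with your threat intel, asset inventory, and incident stores.
  protected abstract queryThreatIntel(entity: string): Promise<ThreatIntel>;
  protected abstract queryAssetInventory(entity: string): Promise<AssetProfile>;
  protected abstract queryHistoricalBehavior(entity: string): Promise<IncidentHistory>;

  async enrich(event: SecurityEvent): Promise<SecurityEvent & { riskScore: number; approved: boolean }> {
    // The three lookups are independent, so run them concurrently.
    const [threatIntel, assetProfile, history] = await Promise.all([
      this.queryThreatIntel(event.entity.value),
      this.queryAssetInventory(event.entity.value),
      this.queryHistoricalBehavior(event.entity.value),
    ]);
    const riskScore = this.calculateRiskScore(event, threatIntel, assetProfile, history);
    // Automation proceeds only for high-risk events on assets that are not production-critical.
    const approved = riskScore >= 75 && !assetProfile.isProductionCritical;
    // Cache the enriched event for one hour so downstream stages can read it by ID.
    await redis.setex(`event:${event.id}`, 3600, JSON.stringify({ ...event, riskScore, approved }));
    return { ...event, riskScore, approved };
  }

  private calculateRiskScore(
    event: SecurityEvent,
    threatIntel: ThreatIntel,
    asset: AssetProfile,
    history: IncidentHistory
  ): number {
    let score = 0;
    if (event.severity === 'critical') score += 30;
    if (event.severity === 'high') score += 20;
    if (threatIntel.isMalicious) score += threatIntel.confidence * 0.5;
    if (asset.criticality > 8) score -= 15; // Lower auto-approval for critical assets
    if (history.previousIncidents > 3) score += 10;
    return Math.min(100, Math.max(0, score)); // Clamp to 0-100
  }
}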
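To make the automation gate concrete, the sketch below restates the scoring rules from the service above as a standalone pure function and works through three hypothetical inputs. The 0-100 confidence scale, the 1-10 criticality scale, and the sample values are assumptions for illustration only.

```typescript
type Severity = 'low' | 'medium' | 'high' | 'critical';

interface ScoreInput {
  severity: Severity;
  isMalicious: boolean;
  confidence: number; // assumed 0-100
  criticality: number; // assumed 1-10
  previousIncidents: number;
}

// Same weights as calculateRiskScore, without the service dependencies.
function riskScore(input: ScoreInput): number {
  let score = 0;
  if (input.severity === 'critical') score += 30;
  if (input.severity === 'high') score += 20;
  if (input.isMalicious) score += input.confidence * 0.5;
  if (input.criticality > 8) score -= 15;
  if (input.previousIncidents > 3) score += 10;
  return Math.min(100, Math.max(0, score));
}

const AUTOMATION_THRESHOLD = 75;

// Critical severity, malicious at confidence 90, repeat offender: 30 + 45 + 10 = 85
const repeatOffender: ScoreInput = {
  severity: 'critical', isMalicious: true, confidence: 90, criticality: 5, previousIncidents: 4,
};

// Same event on a highly critical asset: 85 - 15 = 70, which falls below the gate
const crownJewel: ScoreInput = { ...repeatOffender, criticality: 9 };

// High severity with no malicious verdict: 20
const unconfirmed: ScoreInput = {
  severity: 'high', isMalicious: false, confidence: 0, criticality: 3, previousIncidents: 0,
};

for (const [name, input] of Object.entries({ repeatOffender, crownJewel, unconfirmed })) {
  const score = riskScore(input);
  console.log(`${name}: score=${score} automate=${score >= AUTOMATION_THRESHOLD}`);
}
```

Under these weights, an event without a malicious verdict tops out at 40 (30 for critical severity plus 10 for incident history), so severity and history alone never clear the 75-point gate.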
Step 3: Playbook Orchestration
Playbooks must be stateful, idempotent, and support conditional branching. Use a deterministic runner that validates execution prerequisites, applies blast-radius controls, and logs every action.
export interface PlaybookAction {
  id: string;
  type: 'isolate_host' | 'revoke_token' | 'block_ip' | 'create_ticket';
  target: string;
  parameters: Record<string, unknown>;
  rollback?: PlaybookAction;
}

export class PlaybookRunner {
  // In-memory ledger for brevity; persist it in Redis or another durable store so retries survive restarts.
  private readonly executionLog: Map<string, Set<string>> = new Map();

  async execute(eventId: string, actions: PlaybookAction[]): Promise<void> {
    const executed = this.executionLog.get(eventId) ?? new Set<string>();
    for (const action of actions) {
      if (executed.has(action.id)) continue; // Idempotency guard
      try {
        await this.executeAction(action);
        executed.add(action.id);
        this.executionLog.set(eventId, executed);
        console.info(`Executed action ${action.id} for event ${eventId}`);
      } catch (err) {
        console.error(`Action ${action.id} failed: ${err instanceof Error ? err.message : String(err)}`);
        if (action.rollback) {
          // Compensate for a possibly partial execution; a failing rollback must not mask the original error.
          try {
            await this.executeAction(action.rollback);
            console.warn(`Executed rollback for action ${action.id}`);
          } catch (rollbackErr) {
            console.error(`Rollback for action ${action.id} failed: ${String(rollbackErr)}`);
          }
        }
        throw err; // Stop the playbook; remaining actions are not attempted
      }
    }
  }

  private async executeAction(action: PlaybookAction): Promise<void> {
    switch (action.type) {
      case 'isolate_host':
        // EDR API call with timeout & retry
        break;
      case 'revoke_token':
        // IAM/Identity provider call
        break;
      case 'block_ip':
        // WAF/Firewall API call
        break;
      case 'create_ticket':
        // Jira/ServiceNow integration
        break;
      default:
        throw new Error(`Unsupported action type: ${action.type}`);
    }
  }
}
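The action stubs above mention calling external APIs "with timeout & retry". One way that wrapper might look is sketched below. The names `RetryPolicy`, `withTimeout`, and `callWithRetry` are hypothetical, and the linear backoff is an assumption rather than anything the runner prescribes.

```typescript
interface RetryPolicy {
  maxAttempts: number;
  backoffMs: number;
  timeoutMs: number;
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Rejects if `work` has not settled within `ms`. The underlying call is not cancelled.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Retries `call` up to maxAttempts times, waiting backoffMs * attempt between tries.
async function callWithRetry<T>(call: () => Promise<T>, policy: RetryPolicy): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= policy.maxAttempts; attempt++) {
    try {
      return await withTimeout(call(), policy.timeoutMs);
    } catch (err) {
      lastError = err;
      if (attempt < policy.maxAttempts) await sleep(policy.backoffMs * attempt);
    }
  }
  throw lastError;
}

// Usage: a stand-in for an EDR call that fails once, then succeeds.
let attempts = 0;
const flakyIsolate = async (): Promise<string> => {
  attempts += 1;
  if (attempts < 2) throw new Error('transient EDR error');
  return 'isolated';
};

const isolation = callWithRetry(flakyIsolate, { maxAttempts: 2, backoffMs: 10, timeoutMs: 1000 });
isolation.then((status) => console.log(`status=${status} attempts=${attempts}`));
```

Retrying like this is only safe because each action is expected to be idempotent; a timed-out call may still have taken effect on the remote side.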
Architecture Decisions & Rationale
- Event-Driven Decoupling: Ingestion, enrichment, and execution run as independent services communicating via message queues or event buses. This prevents cascading failures and allows independent scaling during alert storms.
- Idempotency Enforcement: Security actions must be safe to retry. The runner tracks executed action IDs per event, preventing duplicate isolations, revocations, or firewall blocks that could cause operational disruption.
- Deterministic Scoring over ML-Only Triage: Machine learning models introduce opacity and drift. A hybrid approach uses deterministic risk scoring for automation gates, reserving ML for anomaly detection and post-incident analysis. This ensures predictable blast radius and compliance auditability.
- Stateless Execution with External State: The runner remains stateless; execution history, risk scores, and playbook states are persisted in Redis or a durable store. This enables horizontal scaling, crash recovery, and audit trail generation without coupling execution logic to storage.
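The idempotency and external-state decisions can be sketched together as an execution ledger that lives outside the runner. `ExecutionLedger`, `InMemoryLedger`, and `runOnce` are hypothetical names; the in-memory class is only a stand-in so the example runs without infrastructure. A Redis-backed version could implement `claim` with an atomic set-if-absent such as `SET key value NX`.

```typescript
// Contract for a durable ledger keyed by (eventId, actionId).
interface ExecutionLedger {
  // Resolves to true only for the first caller that claims the pair.
  claim(eventId: string, actionId: string): Promise<boolean>;
  // Frees the pair again so a failed action can be retried later.
  release(eventId: string, actionId: string): Promise<void>;
}

class InMemoryLedger implements ExecutionLedger {
  private readonly claimed = new Set<string>();

  async claim(eventId: string, actionId: string): Promise<boolean> {
    const key = `${eventId}:${actionId}`;
    if (this.claimed.has(key)) return false;
    this.claimed.add(key);
    return true;
  }

  async release(eventId: string, actionId: string): Promise<void> {
    this.claimed.delete(`${eventId}:${actionId}`);
  }
}

// Runs `action` at most once per (eventId, actionId), regardless of which worker calls it.
async function runOnce(
  ledger: ExecutionLedger,
  eventId: string,
  actionId: string,
  action: () => Promise<void>,
): Promise<'executed' | 'skipped'> {
  if (!(await ledger.claim(eventId, actionId))) return 'skipped';
  try {
    await action();
    return 'executed';
  } catch (err) {
    await ledger.release(eventId, actionId);
    throw err;
  }
}

async function demo(): Promise<string[]> {
  const ledger = new InMemoryLedger();
  const revoke = async (): Promise<void> => { /* identity provider call would go here */ };
  const first = await runOnce(ledger, 'evt-1', 'revoke-session', revoke);
  const second = await runOnce(ledger, 'evt-1', 'revoke-session', revoke);
  return [first, second];
}

const demoResult = demo();
demoResult.then((outcomes) => console.log(outcomes.join(', ')));
```

Because the runner holds no state of its own in this arrangement, any replica can pick up a retried event and still skip the actions that already ran.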
Pitfall Guide
- Automating Without Blast-Radius Controls: Executing containment actions on production-critical assets without validation causes outages. Always implement asset criticality checks and environment-aware routing before triggering remediation.
- Ignoring Alert Storms & Correlation: Running playbooks per raw alert floods APIs and exhausts rate limits. Implement event correlation windows (e.g., 5-minute deduplication) and circuit breakers that pause automation when event velocity exceeds thresholds.
- State Drift & Missing Idempotency: Retrying failed actions without tracking execution state results in duplicate blocks, revoked tokens, or isolated hosts. Enforce idempotency keys and maintain an execution ledger.
- Over-Reliance on Single Enrichment Source: Depending on one threat intelligence feed or asset inventory creates blind spots. Aggregate multiple sources with fallback scoring and cache enrichment results to reduce latency.
- Bypassing Audit Trails: Security automation must be fully auditable. Every decision, score calculation, action execution, and rollback must be logged with timestamps, actor/service identity, and input parameters.
- No Rollback or Compensating Actions: Automation failures leave systems in inconsistent states. Define explicit rollback actions for every containment step and test them during tabletop exercises.
- Misplaced Human-in-the-Loop Gates: Requiring manual approval for low-severity events creates bottlenecks; skipping approval for critical assets introduces risk. Route based on severity, asset criticality, and historical confidence scores.
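The correlation window and circuit breaker from the alert-storm pitfall could look roughly like the sketch below. `CorrelationWindow` and `VelocityBreaker` are hypothetical classes, the deduplication key (source, event type, entity) is an assumption, and timestamps are passed in explicitly so the behavior is easy to test.

```typescript
interface Alert {
  source: string;
  eventType: string;
  entityValue: string;
  receivedAt: number; // epoch milliseconds
}

// Suppresses alerts that repeat an already accepted alert within the window.
class CorrelationWindow {
  private readonly lastAccepted = new Map<string, number>();
  private readonly windowMs: number;

  constructor(windowMs: number) {
    this.windowMs = windowMs;
  }

  isDuplicate(alert: Alert): boolean {
    const key = `${alert.source}|${alert.eventType}|${alert.entityValue}`;
    const previous = this.lastAccepted.get(key);
    if (previous !== undefined && alert.receivedAt - previous < this.windowMs) return true;
    this.lastAccepted.set(key, alert.receivedAt);
    return false;
  }
}

// Pauses automation when more than maxEvents arrive within perMs.
class VelocityBreaker {
  private timestamps: number[] = [];
  private readonly maxEvents: number;
  private readonly perMs: number;

  constructor(maxEvents: number, perMs: number) {
    this.maxEvents = maxEvents;
    this.perMs = perMs;
  }

  // Records the event and reports whether automation may still proceed.
  allow(now: number): boolean {
    this.timestamps = this.timestamps.filter((t) => now - t < this.perMs);
    this.timestamps.push(now);
    return this.timestamps.length <= this.maxEvents;
  }
}

const dedupe = new CorrelationWindow(5 * 60 * 1000); // 5-minute window
const base: Alert = { source: 'edr', eventType: 'credential_theft', entityValue: 'host-1', receivedAt: 0 };

console.log(dedupe.isDuplicate(base)); // first sighting is accepted
console.log(dedupe.isDuplicate({ ...base, receivedAt: 60_000 })); // repeat inside the window
```

A long-running service would also need to evict old keys from the map; that housekeeping is omitted here.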
Best Practices from Production:
- Version control all playbook definitions and enforce peer review before deployment.
- Implement dry-run mode for new playbooks; log actions without executing them for 7-14 days.
- Use structured logging with correlation IDs to trace events from ingestion to remediation.
- Run monthly automation health checks: verify API credentials, validate enrichment cache freshness, and test rollback paths.
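Dry-run mode and correlation-ID logging fit naturally into one wrapper around action execution. The sketch below is a minimal illustration under assumptions: `runWithAudit`, `AuditRecord`, and the field names are hypothetical, and records are emitted as JSON lines to the console rather than to a real log pipeline.

```typescript
interface ActionRequest {
  id: string;
  type: string;
  target: string;
}

interface AuditRecord {
  correlationId: string; // ties the record back to the ingested event
  actionId: string;
  actionType: string;
  target: string;
  mode: 'dry_run' | 'live';
  outcome: 'simulated' | 'executed' | 'failed';
  timestamp: string;
}

type Performer = (action: ActionRequest) => Promise<void>;
type Emitter = (record: AuditRecord) => void;

const jsonLineEmitter: Emitter = (record) => console.log(JSON.stringify(record));

// In dry-run mode the action is logged but never performed.
async function runWithAudit(
  action: ActionRequest,
  correlationId: string,
  dryRun: boolean,
  perform: Performer,
  emit: Emitter = jsonLineEmitter,
): Promise<AuditRecord> {
  let outcome: AuditRecord['outcome'] = 'simulated';
  if (!dryRun) {
    try {
      await perform(action);
      outcome = 'executed';
    } catch {
      outcome = 'failed'; // the caller inspects the record and decides on rollback
    }
  }
  const record: AuditRecord = {
    correlationId,
    actionId: action.id,
    actionType: action.type,
    target: action.target,
    mode: dryRun ? 'dry_run' : 'live',
    outcome,
    timestamp: new Date().toISOString(),
  };
  emit(record);
  return record;
}

const sampleAction: ActionRequest = { id: 'revoke-session', type: 'revoke_token', target: 'alice' };
const dryRunRecord = runWithAudit(sampleAction, 'evt-123', true, async () => {
  throw new Error('must not be called during a dry run');
});
```

Comparing dry-run records against what analysts actually did during the 7-14 day observation period gives a rough read on whether a playbook is ready to go live.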
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Low-severity, high-volume alerts | Fully automated triage & containment | Reduces engineer burnout; predictable blast radius | ↓ 65% operational cost |
| Critical production assets | Human-in-the-loop with auto-enrichment | Prevents service disruption while accelerating decision context | ↑ 15% tooling cost, ↓ 40% breach cost |
| Compliance-driven environments | Deterministic playbooks with dry-run validation | Ensures auditability and regulatory alignment | Neutral to ↓ 10% audit overhead |
| Resource-constrained teams | Rule-based automation + managed enrichment SaaS | Minimizes maintenance while delivering immediate MTTR reduction | ↑ 20% SaaS cost, ↓ 50% headcount pressure |
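The first two rows of the matrix amount to a routing decision, which might be expressed as the small function below. This is a sketch under assumptions: the route names, the `RoutingPolicy` shape, and the default thresholds (risk 75, criticality 7 on a 1-10 scale) are illustrative, and severity is taken to be reflected already in the risk score.

```typescript
type Route = 'auto_remediate' | 'human_approval' | 'analyst_queue';

interface RoutingInput {
  riskScore: number; // 0-100, produced by enrichment
  assetCriticality: number; // assumed 1-10 scale
}

interface RoutingPolicy {
  riskThreshold: number;
  maxAutoCriticality: number;
}

const defaultPolicy: RoutingPolicy = { riskThreshold: 75, maxAutoCriticality: 7 };

function routeEvent(input: RoutingInput, policy: RoutingPolicy = defaultPolicy): Route {
  // Below the risk threshold nothing is contained; an analyst triages the enriched event.
  if (input.riskScore < policy.riskThreshold) return 'analyst_queue';
  // High risk on a critical asset: enrichment is attached, but a human approves the action.
  if (input.assetCriticality > policy.maxAutoCriticality) return 'human_approval';
  // High risk with a contained blast radius: automate.
  return 'auto_remediate';
}

console.log(routeEvent({ riskScore: 80, assetCriticality: 5 }));
console.log(routeEvent({ riskScore: 80, assetCriticality: 9 }));
console.log(routeEvent({ riskScore: 40, assetCriticality: 5 }));
```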
Configuration Template
playbook:
  id: IR-AUTO-042
  name: "Credential Compromise Containment"
  version: "1.3.0"
  triggers:
    - source: "edr"
      event_type: "credential_theft"
      severity: ["high", "critical"]
  conditions:
    asset_criticality: "<= 7"
    environment: ["staging", "dev", "sandbox"]
    risk_threshold: 75
  actions:
    - id: "revoke-session"
      type: "revoke_token"
      target: "{{ entity.value }}"
      parameters:
        provider: "identity_platform"
        scope: "active_sessions"
      rollback:
        id: "restore-session"
        type: "revoke_token"
        target: "{{ entity.value }}"
        parameters:
          provider: "identity_platform"
          scope: "active_sessions"
          action: "restore"
    - id: "notify-channel"
      type: "create_ticket"
      target: "{{ entity.value }}"
      parameters:
        system: "jira"
        project: "SEC"
        priority: "{{ severity }}"
        labels: ["auto-remediated", "credential-theft"]
  gates:
    human_approval: false
    dry_run: false
    audit_log: true
  execution:
    idempotency: true
    timeout_seconds: 30
    retry_policy:
      max_attempts: 2
      backoff_ms: 1000
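How a runner interprets the `conditions` block is not specified by the template, so the following is only one plausible reading: `asset_criticality` is treated as a comparison expression, `environment` as an allow-list, and `risk_threshold` as a minimum score. The `compare` grammar and the `EventFacts` shape are assumptions made for this sketch.

```typescript
interface PlaybookConditions {
  asset_criticality: string; // comparison expression such as "<= 7"
  environment: string[];
  risk_threshold: number;
}

interface EventFacts {
  assetCriticality: number;
  environment: string;
  riskScore: number;
}

// Evaluates "<op> <number>" where op is one of <=, >=, ==, <, >.
function compare(expression: string, actual: number): boolean {
  const match = /^\s*(<=|>=|==|<|>)\s*(-?\d+(?:\.\d+)?)\s*$/.exec(expression);
  if (!match) throw new Error(`Unsupported comparison: ${expression}`);
  const op = match[1];
  const expected = Number(match[2]);
  switch (op) {
    case '<=': return actual <= expected;
    case '>=': return actual >= expected;
    case '<': return actual < expected;
    case '>': return actual > expected;
    default: return actual === expected; // '=='
  }
}

// All three conditions must hold before any action in the playbook runs.
function conditionsMet(conditions: PlaybookConditions, facts: EventFacts): boolean {
  return (
    compare(conditions.asset_criticality, facts.assetCriticality) &&
    conditions.environment.includes(facts.environment) &&
    facts.riskScore >= conditions.risk_threshold
  );
}

// Values copied from the IR-AUTO-042 template.
const templateConditions: PlaybookConditions = {
  asset_criticality: '<= 7',
  environment: ['staging', 'dev', 'sandbox'],
  risk_threshold: 75,
};

console.log(conditionsMet(templateConditions, { assetCriticality: 6, environment: 'staging', riskScore: 80 }));
console.log(conditionsMet(templateConditions, { assetCriticality: 6, environment: 'production', riskScore: 80 }));
```

Rejecting unparseable expressions outright, rather than treating them as false, keeps a typo in a playbook from silently disabling a blast-radius control.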
Quick Start Guide
- Deploy the orchestration runner: Containerize the TypeScript playbook runner and deploy it to your Kubernetes cluster or serverless platform. Configure environment variables for Redis, EDR, and identity provider APIs.
- Connect your ingestion webhook: Route SIEM/EDR alerts to the runner’s /ingest endpoint. Ensure payloads include source, event_type, severity, and entity fields matching the normalized schema.
- Load the baseline playbook: Import the configuration template above via the runner’s /playbooks API. Enable dry-run mode initially to validate scoring and routing without executing actions.
- Execute a controlled test: Trigger a simulated credential theft event using your EDR’s test console or a curl payload. Verify enrichment scoring, idempotency logging, and audit trail generation. Set dry_run: false once validation passes.
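For the controlled test, a payload along the following lines should exercise the credential-theft playbook. This is a sketch: the field names follow what `normalizeEvent` reads (`source`, `event_type`, `severity`, `entity_type`, `entity_value`), while the helper names and the sample user are hypothetical. The script only builds and checks the JSON body; sending it to the /ingest endpoint is left to curl or your HTTP client.

```typescript
type RawAlert = {
  source: string;
  event_type: string;
  severity: 'low' | 'medium' | 'high' | 'critical';
  entity_type: 'host' | 'user' | 'ip' | 'domain';
  entity_value: string;
};

const REQUIRED_FIELDS = ['source', 'event_type', 'severity', 'entity_type', 'entity_value'];

// Builds a simulated alert that matches the template's trigger (edr / credential_theft / high).
function simulatedCredentialTheft(user: string): RawAlert {
  return {
    source: 'edr',
    event_type: 'credential_theft',
    severity: 'high',
    entity_type: 'user',
    entity_value: user,
  };
}

// Lists required fields that are absent or empty, so a bad test payload fails fast.
function missingFields(payload: Record<string, unknown>): string[] {
  return REQUIRED_FIELDS.filter((field) => payload[field] === undefined || payload[field] === '');
}

const testPayload = simulatedCredentialTheft('test-user@example.com');
const problems = missingFields(testPayload);

if (problems.length > 0) {
  console.error(`Payload is missing: ${problems.join(', ')}`);
} else {
  // Paste this body into the request you send to the runner.
  console.log(JSON.stringify(testPayload, null, 2));
}
```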