Parse IaC plan output or state file into a normalized resource graph.
2. Actual State Collection: Query cloud APIs with pagination, rate limiting, and credential rotation.
3. Delta Computation: Compare desired vs actual, filtering dynamic attributes (timestamps, auto-generated IDs, system tags).
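The collection step (item 2) can be sketched as a generic paginator with retries. This is a minimal sketch under assumptions, not a prescribed implementation: `Page`, `withBackoff`, and `collectAllPages` are hypothetical names, and the retry policy (exponential backoff with a capped attempt count) is one reasonable choice among several.

```typescript
// Hypothetical page shape: a batch of items plus an optional continuation token.
interface Page<T> {
  items: T[];
  nextToken?: string;
}

const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

// Retry a call with exponential backoff: baseDelayMs, then 2x, 4x, ...
async function withBackoff<T>(
  call: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 200
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      if (attempt >= maxAttempts) throw err; // give up after the last attempt
      await sleep(baseDelayMs * 2 ** (attempt - 1));
    }
  }
}

// Follow continuation tokens until the API reports no further pages.
async function collectAllPages<T>(
  fetchPage: (token?: string) => Promise<Page<T>>,
  maxAttempts = 5,
  baseDelayMs = 200
): Promise<T[]> {
  const all: T[] = [];
  let token: string | undefined;
  do {
    const page = await withBackoff(() => fetchPage(token), maxAttempts, baseDelayMs);
    all.push(...page.items);
    token = page.nextToken;
  } while (token);
  return all;
}
```

With AWS SDK v3, `fetchPage` would wrap a call such as `DescribeInstancesCommand` and map its `NextToken` to `nextToken`; the SDK also ships paginator helpers (for example `paginateDescribeInstances`) that can replace the hand-written loop.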
Step 3: TypeScript Drift Scanner Implementation
While IaC tools are typically written in Go and configured in HCL, a TypeScript drift reconciliation service offers type safety, native JSON handling, and straightforward CI/CD integration. The following example sketches a drift detector for EC2 instances using AWS SDK v3 and structured diff logic.
```typescript
import { readFileSync } from "fs";
import { EC2Client, DescribeInstancesCommand } from "@aws-sdk/client-ec2";

interface ResourceState {
  id: string;
  type: string;
  tags: Record<string, string>;
  config: Record<string, unknown>;
  lastModified: string;
}

interface DriftReport {
  resource: string;
  driftType: "missing" | "modified" | "unmanaged";
  severity: "critical" | "warning" | "info";
  details: string;
  timestamp: string;
}

// Subset of the `terraform show -json` plan schema used below.
interface PlannedChange {
  address: string;
  type: string;
  change: { after: Record<string, unknown> | null };
}

export class DriftDetector {
  private ec2: EC2Client;
  // Provider-assigned attributes (Terraform names) that never appear in declarations.
  private dynamicFields: Set<string> = new Set([
    "id", "arn", "private_dns", "public_dns", "tags_all"
  ]);
  // Attributes collected by scanActualState; only these can be compared.
  private comparedFields = ["instance_type", "vpc_security_group_ids", "subnet_id"];

  constructor(region: string) {
    this.ec2 = new EC2Client({ region });
  }

  async scanDesiredState(planPath: string): Promise<ResourceState[]> {
    // Expects the output of `terraform show -json <planfile>`; adapt for `pulumi stack export`.
    const plan = JSON.parse(readFileSync(planPath, "utf-8"));
    const changes: PlannedChange[] = plan.resource_changes ?? [];
    return changes
      // This example only collects EC2 instances, so limit desired state to that type
      // and skip resources planned for deletion (`after` is null).
      .filter(rc => rc.type === "aws_instance" && rc.change.after !== null)
      .map(rc => {
        const after = rc.change.after as Record<string, unknown>;
        return {
          id: (after.id as string | undefined) || rc.address,
          type: rc.type,
          tags: (after.tags as Record<string, string> | null | undefined) ?? {},
          config: this.normalizeConfig(after),
          lastModified: new Date().toISOString()
        };
      });
  }

  async scanActualState(): Promise<ResourceState[]> {
    const resources: ResourceState[] = [];
    let nextToken: string | undefined;
    do {
      // DescribeInstances is paginated; follow NextToken until it is exhausted.
      const page = await this.ec2.send(new DescribeInstancesCommand({ NextToken: nextToken }));
      for (const res of page.Reservations ?? []) {
        for (const inst of res.Instances ?? []) {
          if (!inst.InstanceId) continue;
          resources.push({
            id: inst.InstanceId,
            type: "aws_instance",
            tags: Object.fromEntries(
              (inst.Tags ?? []).map(t => [t.Key ?? "", t.Value ?? ""])
            ),
            // Keys use Terraform attribute names so both sides share one schema.
            config: this.normalizeConfig({
              instance_type: inst.InstanceType,
              vpc_security_group_ids: inst.SecurityGroups?.map(sg => sg.GroupId),
              subnet_id: inst.SubnetId
            }),
            lastModified: inst.LaunchTime?.toISOString() ?? ""
          });
        }
      }
      nextToken = page.NextToken;
    } while (nextToken);
    return resources;
  }

  async detectDrift(desired: ResourceState[], actual: ResourceState[]): Promise<DriftReport[]> {
    const reports: DriftReport[] = [];
    const actualMap = new Map(actual.map(r => [r.id, r]));
    for (const d of desired) {
      const a = actualMap.get(d.id);
      if (!a) {
        reports.push({
          resource: d.id,
          driftType: "missing",
          severity: "critical",
          details: "Resource exists in IaC but not in cloud",
          timestamp: new Date().toISOString()
        });
        continue;
      }
      const configDiff = this.compareConfigs(d.config, a.config);
      if (configDiff.length > 0) {
        reports.push({
          resource: d.id,
          driftType: "modified",
          severity: "warning",
          details: `Config drift: ${configDiff.join(", ")}`,
          timestamp: new Date().toISOString()
        });
      }
    }
    // Detect unmanaged resources
    const desiredIds = new Set(desired.map(d => d.id));
    for (const a of actual) {
      if (!desiredIds.has(a.id) && !a.id.startsWith("drift-ignore-")) {
        reports.push({
          resource: a.id,
          driftType: "unmanaged",
          severity: "info",
          details: "Resource exists in cloud but not tracked by IaC",
          timestamp: new Date().toISOString()
        });
      }
    }
    return reports;
  }

  // Drop dynamic and empty attributes; serialize structured values so they compare by content.
  private normalizeConfig(config: Record<string, unknown>): Record<string, unknown> {
    const normalized: Record<string, unknown> = {};
    for (const [key, value] of Object.entries(config)) {
      if (this.dynamicFields.has(key) || value === undefined || value === null) continue;
      if (Array.isArray(value)) {
        // Sort primitive arrays so API ordering differences do not register as drift.
        normalized[key] = JSON.stringify([...value].sort());
      } else {
        normalized[key] = typeof value === "object" ? JSON.stringify(value) : value;
      }
    }
    return normalized;
  }

  private compareConfigs(desired: Record<string, unknown>, actual: Record<string, unknown>): string[] {
    const diffs: string[] = [];
    // The plan carries far more attributes than the scanner reads from the API,
    // so only the collected attributes are compared.
    for (const key of this.comparedFields) {
      const val = desired[key];
      if (val === undefined) continue; // not declared in IaC
      const actualVal = actual[key];
      if (actualVal === undefined) {
        diffs.push(`${key}: missing`);
      } else if (JSON.stringify(val) !== JSON.stringify(actualVal)) {
        diffs.push(`${key}: desired=${JSON.stringify(val)}, actual=${JSON.stringify(actualVal)}`);
      }
    }
    return diffs;
  }
}
```
Step 4: Architecture Decisions & Rationale
- Read-First Pattern: Drift scanners must never modify state. Remediation is gated behind approval workflows or automated pipelines with blast-radius controls.
- Dynamic Field Filtering: Cloud providers inject ephemeral data (timestamps, auto-generated DNS, system tags). Ignoring these fields reduces false positives by 60-70%.
- State Graph Normalization: IaC state files and cloud API responses use different schemas. Normalization into a canonical resource graph enables consistent diffing across providers.
- Idempotent Comparison: Hash-based or deep-equality checks prevent flaky detections caused by API ordering or metadata serialization differences.
- Event vs Polling Hybrid: Scheduled polling covers baseline drift. Control plane events (AWS Config, GCP Audit Logs, Azure Activity Log) trigger immediate scans for high-risk resource classes (IAM, Security Groups, KMS).
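The idempotent comparison decision can be illustrated with a canonical hash. The sketch below is an assumption-laden example: `canonicalize` and `configHash` are hypothetical helpers, and treating every array as an unordered set is a simplification that only suits attributes where order carries no meaning (such as security group IDs).

```typescript
import { createHash } from "crypto";

// Recursively sort object keys and array elements so that two configs with the
// same content always serialize to the same string.
function canonicalize(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map(canonicalize).sort((a, b) => {
      const sa = JSON.stringify(a);
      const sb = JSON.stringify(b);
      return sa < sb ? -1 : sa > sb ? 1 : 0;
    });
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    return Object.fromEntries(
      Object.keys(obj).sort().map(k => [k, canonicalize(obj[k])])
    );
  }
  return value;
}

// SHA-256 over the canonical JSON form; equal hashes mean equal content.
function configHash(config: Record<string, unknown>): string {
  return createHash("sha256")
    .update(JSON.stringify(canonicalize(config)))
    .digest("hex");
}
```

Comparing `configHash(desired)` with `configHash(actual)` gives a cheap equality check; a field-by-field diff is still needed afterwards to report which attribute changed.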
Pitfall Guide
1. Treating All Drift as Malicious
Manual changes often resolve production incidents faster than pipeline cycles. Classifying every delta as a violation creates alert fatigue. Implement drift taxonomy: critical (security/compliance), warning (configuration mismatch), info (cosmetic/untracked). Route only critical/warning to incident channels.
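The taxonomy can be expressed as a small classification function. This is a sketch under stated assumptions: the `SECURITY_SENSITIVE` list and the rule that non-security `missing` or `modified` findings are warnings are illustrative choices, not a fixed policy.

```typescript
type Severity = "critical" | "warning" | "info";
type DriftType = "missing" | "modified" | "unmanaged";

interface DriftFinding {
  resourceType: string;
  driftType: DriftType;
}

// Hypothetical list of security-sensitive resource types; tune per organization.
const SECURITY_SENSITIVE = new Set([
  "aws_iam_role", "aws_iam_policy", "aws_security_group", "aws_kms_key",
]);

function classify(finding: DriftFinding): Severity {
  if (SECURITY_SENSITIVE.has(finding.resourceType)) return "critical"; // security/compliance
  if (finding.driftType === "unmanaged") return "info";                // untracked
  return "warning";                                                    // configuration mismatch
}

// Only critical and warning findings reach incident channels; info is kept for reporting.
function shouldNotifyIncidentChannel(severity: Severity): boolean {
  return severity !== "info";
}
```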
2. Ignoring Dynamic and Ephemeral Attributes
Cloud APIs return auto-generated IDs, timestamps, and system tags that will never match IaC declarations. Failing to filter these fields causes persistent false positives. Maintain a provider-specific ignore list and validate it during platform upgrades.
3. Using State Files as the Sole Source of Truth
State files can be corrupted, manually edited, or desynchronized from the control plane. Always validate state integrity before scanning. Implement state checksums, version pinning, and backup restoration procedures. Drift detection should compare IaC plan output against live infrastructure, not just state vs cloud.
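A state integrity check might look like the following sketch. It assumes a checksum is recorded whenever state is written (a hypothetical workflow) and that the snapshot is a Terraform state file, which carries numeric `version` and `serial` fields at the top level; `verifyState` and `sha256Hex` are illustrative names.

```typescript
import { createHash } from "crypto";

interface StateCheck {
  ok: boolean;
  reason?: string;
}

function sha256Hex(content: string): string {
  return createHash("sha256").update(content, "utf8").digest("hex");
}

// Verify a state snapshot against its recorded checksum, then confirm it parses
// and has the expected top-level fields before any scan uses it.
function verifyState(stateJson: string, expectedSha256: string): StateCheck {
  if (sha256Hex(stateJson) !== expectedSha256) {
    return { ok: false, reason: "checksum mismatch" };
  }
  let parsed: unknown;
  try {
    parsed = JSON.parse(stateJson);
  } catch {
    return { ok: false, reason: "state is not valid JSON" };
  }
  if (parsed === null || typeof parsed !== "object") {
    return { ok: false, reason: "state is not a JSON object" };
  }
  const state = parsed as { version?: unknown; serial?: unknown };
  if (typeof state.version !== "number" || typeof state.serial !== "number") {
    return { ok: false, reason: "missing version or serial" };
  }
  return { ok: true };
}
```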
4. Running Drift Scans Without Concurrency Controls
Parallel terraform plan executions or unthrottled API calls cause state lock contention and provider rate limit exhaustion. Serialize drift scans per workspace, implement exponential backoff, and use read-only service principals with scoped permissions.
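Per-workspace serialization can be sketched with a promise chain keyed by workspace name. `WorkspaceQueue` is a hypothetical helper and only coordinates scans inside a single process; scans launched from separate runners would still need the backend's state lock or an external queue.

```typescript
// Runs tasks for the same workspace one at a time, in submission order.
// Tasks for different workspaces are independent and may overlap.
class WorkspaceQueue {
  private tails = new Map<string, Promise<unknown>>();

  run<T>(workspace: string, task: () => Promise<T>): Promise<T> {
    const previous = this.tails.get(workspace) ?? Promise.resolve();
    // Start the task once the previous one settles, whether it succeeded or failed.
    const next = previous.catch(() => undefined).then(task);
    this.tails.set(workspace, next);
    return next;
  }
}
```

A scheduler would call `queue.run("production", () => runScan("production"))`, where `runScan` stands in for whatever function performs the scan.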
5. Missing Cross-Account and Multi-Region Scope
Drift often occurs in secondary accounts, shared VPCs, or global resources (Route 53, IAM roles, CloudFront). Scanning only primary accounts creates blind spots. Deploy drift scanners with cross-account role assumption and region enumeration loops.
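The account and region enumeration can be sketched as a function that expands the scope into explicit scan targets. The names `ScanTarget` and `buildScanTargets` are hypothetical, and flagging the first listed region for global resources is an assumption made so that services such as IAM or Route 53 are scanned once per account rather than once per region.

```typescript
interface ScanTarget {
  accountId: string;
  roleArn: string;     // role the scanner would assume in that account
  region: string;
  scanGlobal: boolean; // true for exactly one region per account
}

function buildScanTargets(
  accountIds: string[],
  regions: string[],
  roleName: string
): ScanTarget[] {
  const targets: ScanTarget[] = [];
  for (const accountId of accountIds) {
    regions.forEach((region, index) => {
      targets.push({
        accountId,
        roleArn: `arn:aws:iam::${accountId}:role/${roleName}`,
        region,
        scanGlobal: index === 0, // scan global services only once per account
      });
    });
  }
  return targets;
}
```

Each target would then be handed to a scanner that assumes `roleArn` (for example through STS `AssumeRole`) and constructs region-scoped clients.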
6. Auto-Remediating Without Guardrails
Automatically running terraform apply on drift detection can cascade failures, overwrite manual hotfixes, or trigger resource replacement storms. Implement staged remediation: detect → triage → approve → apply → verify. Use policy-as-code (OPA, Sentinel) to block destructive changes automatically.
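The gate between triage and apply can be sketched in plain TypeScript as a stand-in for the kind of rule an OPA or Sentinel policy would express. `gateRemediation` is a hypothetical function; it relies on the Terraform plan JSON convention that each entry in `resource_changes` lists its `change.actions`, with replacements appearing as a `delete` and `create` pair.

```typescript
interface ResourceChange {
  address: string;
  change: { actions: string[] };
}

interface GateDecision {
  autoApply: boolean; // false means a human must approve
  blocked: string[];  // addresses of resources with destructive actions
}

// Block automatic apply whenever any change deletes or replaces a resource.
function gateRemediation(changes: ResourceChange[]): GateDecision {
  const blocked = changes
    .filter(rc => rc.change.actions.includes("delete"))
    .map(rc => rc.address);
  return { autoApply: blocked.length === 0, blocked };
}
```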
7. No Drift Audit Trail or Trend Analysis
Drift is a symptom, not a root cause. Without logging drift frequency, resource classes, and responsible actors, teams cannot address process gaps. Store drift reports in a time-series database or SIEM. Correlate with deployment logs to identify pipeline gaps or console abuse patterns.
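Trend analysis can start with a simple aggregation over stored reports. The sketch below assumes reports are persisted with a resource type and an ISO 8601 timestamp; `StoredDrift` and `driftByDayAndType` are hypothetical names, and a real deployment would more likely run the equivalent query in its time-series database or SIEM.

```typescript
interface StoredDrift {
  resourceType: string;
  driftType: string;
  timestamp: string; // ISO 8601, e.g. 2024-05-01T10:00:00.000Z
}

// Count findings per (day, resource type) so recurring hot spots stand out.
function driftByDayAndType(reports: StoredDrift[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const report of reports) {
    const day = report.timestamp.slice(0, 10); // YYYY-MM-DD prefix
    const key = `${day} ${report.resourceType}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}
```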
Best Practice: Treat drift detection as a continuous compliance control, not a deployment gate. Run scans on a cadence matching your risk tolerance (hourly for regulated workloads, daily for standard). Tag resources with drift-tolerance: strict/relaxed to enable policy-driven scanning intensity.
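Tag-driven scanning intensity could be sketched as a lookup from the drift-tolerance tag to a scan interval. The intervals mirror the cadence suggested above (hourly for strict, daily for relaxed); falling back to relaxed for untagged resources is an assumption, and `scanIntervalMinutes` and `isScanDue` are hypothetical helpers.

```typescript
type Tolerance = "strict" | "relaxed";

const SCAN_INTERVAL_MINUTES: Record<Tolerance, number> = {
  strict: 60,        // hourly
  relaxed: 24 * 60,  // daily
};

// Resolve the scan interval from a resource's tags; untagged or unrecognized
// values fall back to relaxed (an assumption, invert it for a safer default).
function scanIntervalMinutes(tags: Record<string, string>): number {
  const tolerance: Tolerance = tags["drift-tolerance"] === "strict" ? "strict" : "relaxed";
  return SCAN_INTERVAL_MINUTES[tolerance];
}

function isScanDue(tags: Record<string, string>, lastScan: Date, now: Date): boolean {
  const elapsedMinutes = (now.getTime() - lastScan.getTime()) / 60_000;
  return elapsedMinutes >= scanIntervalMinutes(tags);
}
```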
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Startup MVP (Single Account, <50 Resources) | Scheduled Daily Scanning + Manual Triage | Low overhead, sufficient for small blast radius, avoids over-engineering | <$50/mo (API calls + storage) |
| Regulated Enterprise (PCI/HIPAA, Multi-Account) | Event-Driven Continuous + Policy-Gated Auto-Remediation | Compliance requires near-zero MTTD, audit trails, and enforced reconciliation | $200-$800/mo (event streaming, policy engines, SIEM ingestion) |
| Multi-Cloud Platform (AWS/GCP/Azure) | Normalized Drift Graph + Provider-Agnostic Scanner | Unified comparison logic prevents toolchain fragmentation and reduces maintenance | $150-$400/mo (custom scanner runtime, cross-cloud IAM) |
Configuration Template
```yaml
# .github/workflows/drift-detection.yml
name: Infrastructure Drift Detection

on:
  schedule:
    - cron: '0 */6 * * *' # Every 6 hours
  workflow_dispatch:

env:
  TF_WORKSPACE: production
  AWS_REGION: us-east-1

jobs:
  detect-drift:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      id-token: write
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.8.5
          terraform_wrapper: false # keep raw exit codes and clean JSON on stdout
      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::${{ secrets.AWS_ACCOUNT_ID }}:role/drift-scanner
          aws-region: ${{ env.AWS_REGION }}
      - name: Initialize & Plan
        id: plan
        run: |
          terraform init -input=false
          # TF_WORKSPACE (set above) selects the workspace, so no `terraform workspace select` is needed.
          # With -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes present.
          set +e
          terraform plan -detailed-exitcode -no-color -input=false -out=tfplan
          echo "exitcode=$?" >> "$GITHUB_OUTPUT"
      - name: Check Drift Status
        id: drift
        run: |
          # $? does not carry over between steps, so read the exit code recorded by the plan step.
          case "${{ steps.plan.outputs.exitcode }}" in
            0)
              echo "drift_detected=false" >> "$GITHUB_OUTPUT"
              ;;
            2)
              echo "drift_detected=true" >> "$GITHUB_OUTPUT"
              echo "::warning ::Infrastructure drift detected in ${{ env.TF_WORKSPACE }}"
              ;;
            *)
              echo "::error ::terraform plan failed"
              exit 1
              ;;
          esac
      - name: Generate Drift Report
        if: steps.drift.outputs.drift_detected == 'true'
        run: |
          terraform show -json tfplan > drift-report.json
          jq '.resource_changes[] | select(.change.actions != ["no-op"])' drift-report.json > drift-deltas.json
      - name: Upload Drift Report
        if: steps.drift.outputs.drift_detected == 'true'
        uses: actions/upload-artifact@v4
        with:
          name: drift-deltas
          path: drift-deltas.json
      - name: Notify Slack
        if: steps.drift.outputs.drift_detected == 'true'
        uses: slackapi/slack-github-action@v1.26.0
        with:
          payload: |
            {
              "text": "🚨 Drift Detected",
              "blocks": [
                {"type": "section", "text": {"type": "mrkdwn", "text": "*Workspace:* `${{ env.TF_WORKSPACE }}`\n*Region:* `${{ env.AWS_REGION }}`\n*Action:* Review `drift-deltas.json` in workflow artifacts"}},
                {"type": "divider"}
              ]
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_DRIFT_WEBHOOK }}
          SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK
```
Quick Start Guide
- Initialize State Backend: Configure remote state with locking and encryption. Export read-only credentials for the scanner service.
- Deploy Scheduled Workflow: Copy the GitHub Actions template above. Replace role-to-assume and AWS_REGION with your values, and set the AWS_ACCOUNT_ID and SLACK_DRIFT_WEBHOOK repository secrets. Commit to .github/workflows/.
- Validate Detection: Manually trigger the workflow. Intentionally modify a non-critical resource via console. Re-run the workflow to confirm drift reporting and Slack notification.
- Configure Remediation Policy: Add an OPA/Sentinel policy or manual approval gate to your pipeline. Route critical drift alerts to your incident management system. Schedule daily runs for production, hourly for regulated environments.