rade L4 investigation pipeline requires five architectural decisions that prioritize safety, observability, and multi-cloud reach. The following implementation demonstrates a TypeScript-based orchestrator that follows the ReAct pattern (Reason + Act), manages state across tool executions, and enforces strict permission boundaries.
Step 1: Define a Scoped Tool Registry
The agent must only expose commands that align with its operational mandate. Unrestricted CLI access is the fastest path to production outages.
// Contract every tool must satisfy before the agent can call it.
interface ToolDefinition {
  name: string;
  description: string;
  parameters: Record<string, string>;
  execute: (params: Record<string, string>) => Promise<string>;
  // True for tools that must not run without human sign-off.
  requiresApproval: boolean;
}

// Single lookup point for tools; anything not registered here is invisible to the agent.
class ToolRegistry {
  private tools: Map<string, ToolDefinition> = new Map();

  register(tool: ToolDefinition): void {
    this.tools.set(tool.name, tool);
  }

  getTool(name: string): ToolDefinition | undefined {
    return this.tools.get(name);
  }

  listAvailable(): string[] {
    return Array.from(this.tools.keys());
  }
}
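As a usage sketch for the registry, the snippet below registers one read-only tool and one state-changing tool, then lists only those the agent may call without approval. The tool names (`get_pod_logs`, `restart_deployment`), the stubbed `execute` bodies, the duplicate-name check, and the `listUnattended` helper are hypothetical and not part of the original implementation.

```typescript
// Minimal copy of the article's types so the sketch compiles on its own.
interface ToolDefinition {
  name: string;
  description: string;
  parameters: Record<string, string>;
  execute: (params: Record<string, string>) => Promise<string>;
  requiresApproval: boolean;
}

class ToolRegistry {
  private tools: Map<string, ToolDefinition> = new Map();

  register(tool: ToolDefinition): void {
    // Assumption: duplicate names are rejected so a tool cannot be silently replaced.
    if (this.tools.has(tool.name)) {
      throw new Error(`Tool ${tool.name} is already registered`);
    }
    this.tools.set(tool.name, tool);
  }

  // Hypothetical helper: names of tools that can run without a human in the loop.
  listUnattended(): string[] {
    return Array.from(this.tools.values())
      .filter((t) => !t.requiresApproval)
      .map((t) => t.name);
  }
}

const registry = new ToolRegistry();

registry.register({
  name: 'get_pod_logs',
  description: 'Read recent logs for a pod (read-only)',
  parameters: { namespace: 'string', pod: 'string' },
  // Stubbed: a real tool would delegate to the sandbox executor.
  execute: async (p) => `logs for ${p.namespace}/${p.pod}`,
  requiresApproval: false,
});

registry.register({
  name: 'restart_deployment',
  description: 'Restart a deployment (state-changing)',
  parameters: { namespace: 'string', deployment: 'string' },
  execute: async (p) => `restarted ${p.namespace}/${p.deployment}`,
  requiresApproval: true,
});

// Only the read-only tool is offered for unattended use.
const unattended = registry.listUnattended();
```

Keeping the approval flag on the definition itself means the filter lives in one place, rather than being re-implemented by every caller that builds a prompt.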
Step 2: Implement Sandboxed Execution
Every tool call must run in an isolated environment with network egress controls, resource limits, and command allowlists.
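import { spawn } from 'node:child_process';

class SandboxExecutor {
  private maxTimeoutMs: number;
  private allowedNamespaces: string[];

  constructor(config: { timeoutMs: number; namespaces: string[] }) {
    this.maxTimeoutMs = config.timeoutMs;
    this.allowedNamespaces = config.namespaces;
  }

  async runCommand(command: string, namespace: string): Promise<string> {
    if (!this.allowedNamespaces.includes(namespace)) {
      throw new Error(`Namespace ${namespace} not permitted in sandbox`);
    }
    // NOTE: shell: true hands the string to a shell, so `command` must already be
    // validated against the command allowlist before it reaches this method.
    const child = spawn(command, { shell: true, timeout: this.maxTimeoutMs });
    return new Promise((resolve, reject) => {
      let output = '';
      child.stdout?.on('data', (d) => (output += d.toString()));
      child.stderr?.on('data', (d) => (output += d.toString()));
      // Spawn failures emit 'error' instead of 'close'; without this the promise never settles.
      child.on('error', reject);
      // On timeout the child is killed, so `code` is null and `signal` names the kill signal.
      child.on('close', (code, signal) => {
        if (code === 0) resolve(output.trim());
        else reject(new Error(`Command failed (exit code ${code}, signal ${signal}): ${output}`));
      });
    });
  }
}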
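The executor checks the namespace but passes the raw command string to a shell. One way to add the command allowlist described in this step is to validate an argument vector before anything is spawned. The sketch below assumes kubectl-style commands where the verb is the second argument; `validateCommand`, `CommandPolicy`, and the metacharacter list are illustrative, not part of the original code.

```typescript
interface CommandPolicy {
  allowedBinaries: string[];
  allowedVerbs: string[];
  allowedNamespaces: string[];
}

type Verdict = { ok: true } | { ok: false; reason: string };

// Characters that would let a shell chain, redirect, or substitute commands.
const SHELL_METACHARACTERS = /[;&|`$<>\\\n]/;

function validateCommand(argv: string[], namespace: string, policy: CommandPolicy): Verdict {
  const binary = argv[0];
  const verb = argv[1];
  if (binary === undefined || !policy.allowedBinaries.includes(binary)) {
    return { ok: false, reason: `binary not allowed: ${String(binary)}` };
  }
  if (verb === undefined || !policy.allowedVerbs.includes(verb)) {
    return { ok: false, reason: `verb not allowed: ${String(verb)}` };
  }
  if (!policy.allowedNamespaces.includes(namespace)) {
    return { ok: false, reason: `namespace not allowed: ${namespace}` };
  }
  const bad = argv.find((arg) => SHELL_METACHARACTERS.test(arg));
  if (bad !== undefined) {
    return { ok: false, reason: `shell metacharacter in argument: ${bad}` };
  }
  return { ok: true };
}

// Example policy: read-only verbs only, two namespaces.
const policy: CommandPolicy = {
  allowedBinaries: ['kubectl'],
  allowedVerbs: ['get', 'describe', 'logs', 'top'],
  allowedNamespaces: ['production', 'staging'],
};

const allowed = validateCommand(['kubectl', 'get', 'pods'], 'staging', policy);
const blockedVerb = validateCommand(['kubectl', 'delete', 'pod', 'api-0'], 'staging', policy);
const injected = validateCommand(['kubectl', 'get', 'pods; rm -rf /'], 'staging', policy);
const wrongNamespace = validateCommand(['kubectl', 'get', 'pods'], 'kube-system', policy);
```

An argument vector that passes this check could then be handed to `execFile` from `node:child_process`, which does not start a shell by default, instead of `spawn` with `shell: true`. An allowlist of verbs is generally easier to reason about than a blocklist, since new destructive subcommands are rejected by default.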
Step 3: Build the ReAct Investigation Loop
The orchestrator maintains an evidence chain, queries an LLM for the next action, executes it, and updates its internal state until confidence thresholds are met.
interface InvestigationState {
  incidentId: string;
  evidence: string[];
  currentHypothesis: string;
  stepsTaken: number;
  maxSteps: number;
  confidenceThreshold: number;
}
class InvestigationOrchestrator {
  private state: InvestigationState;
  private registry: ToolRegistry;
  private executor: SandboxExecutor;
  private llmClient: any; // Abstracted LLM provider

  constructor(config: { incidentId: string; registry: ToolRegistry; executor: SandboxExecutor; llm: any }) {
    this.state = {
      incidentId: config.incidentId,
      evidence: [],
      currentHypothesis: 'Initial assessment pending',
      stepsTaken: 0,
      maxSteps: 12,
      confidenceThreshold: 0.85,
    };
    this.registry = config.registry;
    this.executor = config.executor;
    this.llmClient = config.llm;
  }
  async run(): Promise<{ hypothesis: string; evidence: string[]; trace: string[] }> {
    const trace: string[] = [];
    while (this.state.stepsTaken < this.state.maxSteps) {
      const prompt = this.buildReasoningPrompt();
      const response = await this.llmClient.generate(prompt);
      const action = this.parseAction(response);
      if (action.type === 'finalize') {
        this.state.currentHypothesis = action.hypothesis;
        trace.push(`[FINAL] Hypothesis: ${action.hypothesis}`);
        break;
      }
      // Count every iteration, including failed ones, so an unknown tool name
      // cannot spin the loop forever.
      this.state.stepsTaken++;
      const tool = this.registry.getTool(action.toolName);
      if (!tool) {
        trace.push(`[ERROR] Tool ${action.toolName} not registered`);
        continue;
      }
      // Tools flagged for approval never run from the read-only loop.
      if (tool.requiresApproval) {
        trace.push(`[BLOCKED] Tool ${action.toolName} requires human approval`);
        continue;
      }
      trace.push(`[STEP ${this.state.stepsTaken}] Calling ${action.toolName} with ${JSON.stringify(action.params)}`);
      try {
        const result = await tool.execute(action.params);
        this.state.evidence.push(result);
        trace.push(`[RESULT] ${result.substring(0, 200)}...`);
      } catch (err) {
        // Record the failure and let the model choose another action instead of aborting.
        trace.push(`[ERROR] ${action.toolName} failed: ${(err as Error).message}`);
      }
    }
    return {
      hypothesis: this.state.currentHypothesis,
      evidence: this.state.evidence,
      trace,
    };
  }
  private buildReasoningPrompt(): string {
    // Include the evidence text itself; a bare item count gives the model nothing to analyze.
    // The 500-character cap per item is an arbitrary limit to keep the prompt bounded.
    const evidence = this.state.evidence
      .map((item, i) => `[${i + 1}] ${item.substring(0, 500)}`)
      .join('\n');
    return `
Incident: ${this.state.incidentId}
Current Hypothesis: ${this.state.currentHypothesis}
Evidence Collected: ${this.state.evidence.length} items
${evidence}
Available Tools: ${this.registry.listAvailable().join(', ')}
Max Steps Remaining: ${this.state.maxSteps - this.state.stepsTaken}
Analyze the current evidence. Either request a tool call to gather missing data, or finalize the root-cause hypothesis if confidence exceeds ${this.state.confidenceThreshold}.
Respond in JSON: {"type": "tool_call"|"finalize", "toolName": "...", "params": {...}, "hypothesis": "..."}
`;
  }
  private parseAction(response: string): any {
    try {
      return JSON.parse(response);
    } catch {
      // On malformed output, finalize with the hypothesis held so far rather than
      // overwriting it with an error message.
      return { type: 'finalize', hypothesis: this.state.currentHypothesis };
    }
  }
}
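Because `parseAction` returns `any`, a response that is valid JSON but structurally wrong (for example a `tool_call` with no `toolName`) reaches the loop unchecked. A discriminated union with a runtime guard is one way to tighten this. The sketch below follows the JSON shape given in the prompt; the `AgentAction` type and the `parseAgentAction` and `isStringRecord` functions are hypothetical names.

```typescript
type AgentAction =
  | { type: 'tool_call'; toolName: string; params: Record<string, string> }
  | { type: 'finalize'; hypothesis: string };

// True only for plain objects whose values are all strings.
function isStringRecord(value: unknown): value is Record<string, string> {
  if (typeof value !== 'object' || value === null || Array.isArray(value)) return false;
  const rec = value as Record<string, unknown>;
  return Object.keys(rec).every((k) => typeof rec[k] === 'string');
}

// Returns null instead of throwing so the caller decides how to handle bad output.
function parseAgentAction(response: string): AgentAction | null {
  let raw: unknown;
  try {
    raw = JSON.parse(response);
  } catch (e) {
    return null;
  }
  if (typeof raw !== 'object' || raw === null) return null;
  const obj = raw as Record<string, unknown>;
  const kind = obj['type'];
  const hypothesis = obj['hypothesis'];
  const toolName = obj['toolName'];
  const params = obj['params'];

  if (kind === 'finalize' && typeof hypothesis === 'string') {
    return { type: 'finalize', hypothesis };
  }
  if (kind === 'tool_call' && typeof toolName === 'string' && isStringRecord(params)) {
    return { type: 'tool_call', toolName, params };
  }
  return null;
}

const good = parseAgentAction('{"type":"tool_call","toolName":"get_pod_logs","params":{"pod":"api-0"}}');
const done = parseAgentAction('{"type":"finalize","hypothesis":"OOMKilled after config change"}');
const missingTool = parseAgentAction('{"type":"tool_call","params":{}}');
const notJson = parseAgentAction('I think the database is down.');
```

A `null` result lets the orchestrator distinguish "the model chose to finalize" from "the model produced unusable output", which could be retried or counted against the step budget.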
Architecture Decisions & Rationale
- ReAct over Chain-of-Thought: Multi-turn tool execution requires stateful reasoning. The ReAct pattern forces the model to interleave reasoning with action, preventing hallucination-heavy monologues that ignore live system state.
- Sandboxed Execution: Direct CLI access is non-negotiable for investigation but dangerous in production. The sandbox enforces namespace isolation, timeout boundaries, and command allowlists, converting arbitrary execution into auditable, bounded operations.
- Evidence Chain over Single Output: Storing every tool response creates a verifiable audit trail. This directly supports FDRT measurement and compliance reviews, which single-shot diagnostics cannot provide.
- Confidence Thresholds: Hard-coding a minimum confidence level prevents premature finalization. The agent continues gathering data until statistical or logical thresholds are met, reducing false RCA assignments.
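In the loop as written, the confidence threshold appears only in the prompt text, so the model can finalize at any point. If the model is asked to report a numeric confidence, the orchestrator can enforce the threshold itself. The sketch below assumes a `confidence` field between 0 and 1 on the finalize action, which the original response schema does not include; `decideFinalize` and the `Decision` values are illustrative names.

```typescript
interface FinalizeAction {
  type: 'finalize';
  hypothesis: string;
  confidence: number; // Assumption: self-reported by the model, in [0, 1].
}

interface LoopBudget {
  stepsTaken: number;
  maxSteps: number;
  confidenceThreshold: number;
}

type Decision = 'accept' | 'continue' | 'accept_low_confidence';

function decideFinalize(action: FinalizeAction, budget: LoopBudget): Decision {
  const c = action.confidence;
  // Reject NaN, Infinity, and out-of-range values before comparing.
  const valid = isFinite(c) && c >= 0 && c <= 1;
  if (valid && c >= budget.confidenceThreshold) return 'accept';
  // Below threshold: keep investigating while step budget remains.
  if (budget.stepsTaken < budget.maxSteps) return 'continue';
  // Budget exhausted: surface the hypothesis, flagged for human review.
  return 'accept_low_confidence';
}

const budget: LoopBudget = { stepsTaken: 3, maxSteps: 12, confidenceThreshold: 0.85 };
const spent: LoopBudget = { stepsTaken: 12, maxSteps: 12, confidenceThreshold: 0.85 };

const confident = decideFinalize({ type: 'finalize', hypothesis: 'h', confidence: 0.9 }, budget);
const unsure = decideFinalize({ type: 'finalize', hypothesis: 'h', confidence: 0.6 }, budget);
const outOfSteps = decideFinalize({ type: 'finalize', hypothesis: 'h', confidence: 0.6 }, spent);
```

A self-reported confidence is not a calibrated probability, so this gate is best treated as a guard against obviously premature answers rather than a statistical guarantee.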
Pitfall Guide
| Pitfall | Explanation | Fix |
|---|---|---|
| Unscoped CLI Permissions | Granting the agent root or cluster-admin access turns diagnostic commands into potential destructive operations. | Implement RBAC-aligned tool wrappers. Restrict to get, describe, logs, and top. Never expose delete, apply, or exec without L5 human approval. |
| Single-Cloud Tool Blindness | Agents trained only on Kubernetes miss incidents originating in cloud control planes, DNS, or CI/CD runners. | Build a multi-provider abstraction layer. Register cloud CLI wrappers (AWS/GCP/Azure), DNS probes, and pipeline status checkers as first-class tools. |
| Hallucination-Driven Remediation | L4 agents may confidently propose fixes that violate infrastructure policies or introduce configuration drift. | Enforce dry-run mode for all remediation suggestions. Require explicit human approval before any state-changing command executes (L5 boundary). |
| Ignoring Agent Observability | Treating the agent as a black box makes it impossible to debug why it chose a specific tool sequence or missed a dependency. | Emit structured logs with trace IDs, step counts, token usage, and tool latency. Integrate with OpenTelemetry for distributed tracing across agent and infra calls. |
| RAG Staleness | Runbooks and past postmortems drift faster than infrastructure. Stale context causes the agent to reference deprecated APIs or retired services. | Automate post-incident ingestion into the vector store. Set TTL policies on RAG documents. Validate context freshness before each investigation run. |
| Metric Misalignment | Optimizing for "time to first response" instead of FDRT encourages quick but inaccurate diagnoses, increasing deployment rework rate. | Track FDRT and rework rate as primary KPIs. Measure agent accuracy against human-validated RCAs. Penalize false positives in performance reviews. |
| Over-Automation at L4 | Pushing directly to L5 without mature approval gates causes policy violations and audit failures in regulated environments. | Start at L4 read-only. Introduce L5 approval workflows incrementally. Require dual-signoff for production remediation until accuracy exceeds 90%. |
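The approval-related fixes in the table (human approval for state-changing commands, dual sign-off for production remediation) can be sketched as a small gate that holds a proposed command until enough distinct approvers have signed. `ApprovalGate`, its method names, and the approver identifiers are hypothetical; the gate only tracks approvals and never executes anything itself.

```typescript
interface PendingAction {
  command: string;
  approvers: Set<string>;
}

class ApprovalGate {
  private pending: Map<string, PendingAction> = new Map();
  private requiredApprovals: number;

  constructor(requiredApprovals: number) {
    this.requiredApprovals = requiredApprovals;
  }

  // Records a proposed state-changing command; nothing is executed here.
  propose(id: string, command: string): void {
    this.pending.set(id, { command, approvers: new Set<string>() });
  }

  // Returns true once the number of distinct approvers reaches the requirement.
  approve(id: string, approver: string): boolean {
    const action = this.pending.get(id);
    if (!action) throw new Error(`No pending action with id ${id}`);
    action.approvers.add(approver); // A Set ignores repeat approvals by the same person.
    return action.approvers.size >= this.requiredApprovals;
  }

  isApproved(id: string): boolean {
    const action = this.pending.get(id);
    return action !== undefined && action.approvers.size >= this.requiredApprovals;
  }
}

// Dual sign-off: two distinct approvers are needed.
const gate = new ApprovalGate(2);
gate.propose('fix-1', 'kubectl rollout restart deployment/api -n production');
const afterFirst = gate.approve('fix-1', 'alice');
const afterRepeat = gate.approve('fix-1', 'alice');
const afterSecond = gate.approve('fix-1', 'bob');
```

Tracking approvers in a set rather than a counter is what prevents one person from satisfying a dual sign-off by approving twice.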
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Kubernetes-native environment with strict compliance | CNCF Sandbox tool (e.g., HolmesGPT) + L4 read-only | Pre-built RBAC, audit trails, and K8s-first tooling reduce integration overhead | Low integration cost, moderate token spend |
| Multi-cloud hybrid (AWS/Azure/GCP + K8s) | Self-hosted agentic framework with multi-provider tool registry | Single control plane across clouds prevents tool fragmentation and context switching | Higher initial setup, lower long-term licensing |
| Enterprise with SOC-2/ISO audit requirements | L4 investigation + L5 approval gateway + full trace logging | Compliance mandates human oversight and verifiable evidence chains | Increased operational overhead, reduced rework rate |
| Startup with limited SRE headcount | Commercial SaaS with managed RAG and auto-remediation | Offloads infrastructure maintenance and accelerates time-to-value | Higher per-incident cost, faster FDRT reduction |
Configuration Template
# investigation-agent-config.yaml
agent:
  mode: L4_READ_ONLY
  max_steps: 12
  confidence_threshold: 0.85
  trace_format: otel_json
sandbox:
  timeout_ms: 30000
  allowed_namespaces:
    - production
    - staging
  blocked_commands:
    - delete
    - apply
    - exec
    - port-forward
tools:
  kubernetes:
    enabled: true
    api_version: v1
    allowed_resources: [pods, deployments, services, events, logs]
  cloud_aws:
    enabled: true
    regions: [us-east-1, eu-west-1]
    allowed_actions: [describe_instances, describe_log_groups, get_metric_data]
  cloud_gcp:
    enabled: true
    allowed_actions: [list_instances, read_logs, get_monitoring_metrics]
rag:
  provider: weaviate
  collection: incident_runbooks
  ttl_days: 90
  ingestion:
    source: postmortem_pipeline
    schedule: "0 2 * * *"
observability:
  tracing:
    enabled: true
    exporter: otlp_http
    endpoint: http://otel-collector:4318
  metrics:
    fdrt_tracking: true
    rework_rate_alert: true
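A config like this is easy to get subtly wrong, for example by enabling read-only mode while forgetting to block a state-changing verb. A small validation pass at startup can catch that. The sketch below assumes the YAML has already been parsed into a plain object by a YAML library of your choice, and that keys nest the way their names suggest (`sandbox.blocked_commands`, `rag.ttl_days`, and so on). The `AgentConfig` type, `validateConfig` function, and the specific rules are illustrative assumptions, not part of the original template.

```typescript
// Only the fields the checks below need; the full template has more sections.
interface AgentConfig {
  agent: { mode: string; max_steps: number; confidence_threshold: number };
  sandbox: { timeout_ms: number; allowed_namespaces: string[]; blocked_commands: string[] };
  rag: { ttl_days: number };
}

// Assumption: verbs that change cluster state and must be blocked in read-only mode.
const STATE_CHANGING = ['delete', 'apply', 'exec'];

// Returns a list of human-readable problems; an empty list means the config passed.
function validateConfig(cfg: AgentConfig): string[] {
  const problems: string[] = [];
  if (cfg.agent.max_steps <= 0) problems.push('agent.max_steps must be positive');
  const t = cfg.agent.confidence_threshold;
  if (!(t > 0 && t <= 1)) problems.push('agent.confidence_threshold must be in (0, 1]');
  if (cfg.sandbox.timeout_ms <= 0) problems.push('sandbox.timeout_ms must be positive');
  if (cfg.sandbox.allowed_namespaces.length === 0) {
    problems.push('sandbox.allowed_namespaces is empty');
  }
  if (cfg.agent.mode === 'L4_READ_ONLY') {
    for (const verb of STATE_CHANGING) {
      if (cfg.sandbox.blocked_commands.indexOf(verb) === -1) {
        problems.push(`L4_READ_ONLY requires "${verb}" in sandbox.blocked_commands`);
      }
    }
  }
  if (cfg.rag.ttl_days <= 0) problems.push('rag.ttl_days must be positive');
  return problems;
}

// Values copied from the template above.
const templateValues: AgentConfig = {
  agent: { mode: 'L4_READ_ONLY', max_steps: 12, confidence_threshold: 0.85 },
  sandbox: {
    timeout_ms: 30000,
    allowed_namespaces: ['production', 'staging'],
    blocked_commands: ['delete', 'apply', 'exec', 'port-forward'],
  },
  rag: { ttl_days: 90 },
};

// Same config with "exec" accidentally left unblocked.
const broken: AgentConfig = {
  agent: templateValues.agent,
  sandbox: {
    timeout_ms: 30000,
    allowed_namespaces: ['production', 'staging'],
    blocked_commands: ['delete', 'apply'],
  },
  rag: templateValues.rag,
};

const templateProblems = validateConfig(templateValues);
const brokenProblems = validateConfig(broken);
```

Collecting every problem instead of throwing on the first one makes a misconfigured deployment cheaper to fix, since all issues show up in a single startup log.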
Quick Start Guide
- Initialize the tool registry: Clone the agentic framework repository, configure investigation-agent-config.yaml with your cloud credentials and namespace restrictions, and run npm run setup-tools to validate connectivity.
- Deploy the sandbox: Use the provided Helm chart or Docker Compose file to launch the execution environment. Verify that blocked commands return permission errors and allowed commands return structured output.
- Connect observability: Point the OpenTelemetry exporter to your existing tracing backend. Run a synthetic incident simulation and confirm that step traces, token counts, and tool latencies appear in your dashboard.
- Validate in read-only mode: Trigger the agent against a staging incident. Review the evidence chain, verify hypothesis accuracy, and confirm that no state-changing commands execute. Once FDRT and accuracy metrics stabilize, proceed to L5 approval gating.
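For the "Connect observability" step, it helps to know what a step record might contain before looking for it in a dashboard. The sketch below builds one structured log entry per tool call with the fields named in the pitfall table (trace ID, step count, token usage, tool latency) and a small roll-up across steps. The field names and the `buildStepRecord` and `summarize` helpers are assumptions, not a prescribed schema or an OpenTelemetry API; exporting the records to a collector is left out.

```typescript
interface StepRecord {
  traceId: string;
  incidentId: string;
  step: number;
  tool: string;
  latencyMs: number;
  promptTokens: number;
  completionTokens: number;
  ok: boolean;
}

interface StepInput {
  traceId: string;
  incidentId: string;
  step: number;
  tool: string;
  startedAtMs: number;
  endedAtMs: number;
  promptTokens: number;
  completionTokens: number;
  ok: boolean;
}

// Derives latency from timestamps so callers cannot report an inconsistent value.
function buildStepRecord(input: StepInput): StepRecord {
  if (input.endedAtMs < input.startedAtMs) {
    throw new Error('endedAtMs precedes startedAtMs');
  }
  return {
    traceId: input.traceId,
    incidentId: input.incidentId,
    step: input.step,
    tool: input.tool,
    latencyMs: input.endedAtMs - input.startedAtMs,
    promptTokens: input.promptTokens,
    completionTokens: input.completionTokens,
    ok: input.ok,
  };
}

// Roll-up for a whole investigation: total tokens and slowest tool call.
function summarize(records: StepRecord[]): { totalTokens: number; maxLatencyMs: number } {
  let totalTokens = 0;
  let maxLatencyMs = 0;
  for (const r of records) {
    totalTokens += r.promptTokens + r.completionTokens;
    if (r.latencyMs > maxLatencyMs) maxLatencyMs = r.latencyMs;
  }
  return { totalTokens, maxLatencyMs };
}

const first = buildStepRecord({
  traceId: 'trace-001', incidentId: 'INC-42', step: 1, tool: 'get_pod_logs',
  startedAtMs: 1000, endedAtMs: 1250, promptTokens: 800, completionTokens: 120, ok: true,
});
const second = buildStepRecord({
  traceId: 'trace-001', incidentId: 'INC-42', step: 2, tool: 'describe_instances',
  startedAtMs: 1300, endedAtMs: 2100, promptTokens: 950, completionTokens: 90, ok: true,
});

// One JSON object per line is a common shape for log shippers to ingest.
const line = JSON.stringify(first);
const summary = summarize([first, second]);
```

Sharing one `traceId` across all steps of an investigation is what lets a tracing backend stitch the tool calls into a single timeline.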