Step 1: Verify Cgroup Accounting Health
Cgroup memory leaks manifest as orphaned entries in /sys/fs/cgroup/memory (v1) or /sys/fs/cgroup (v2, where all controllers share one unified hierarchy). A TypeScript-based diagnostic utility can scan these paths and flag cgroups whose memory usage exceeds a baseline, which narrows the search for stale cgroup references.
import { readFileSync, readdirSync } from 'fs';
import { join } from 'path';

interface CgroupMetric {
  processId: string;
  memoryUsageBytes: number;
  leakProbability: number;
}

// cgroup v1 layout; on v2, scan /sys/fs/cgroup and read memory.current instead
const CGROUP_V1_PATH = '/sys/fs/cgroup/memory';
const THRESHOLD_BYTES = 500 * 1024 * 1024; // 500 MiB baseline

function scanCgroupEntries(): CgroupMetric[] {
  const entries = readdirSync(CGROUP_V1_PATH);
  const metrics: CgroupMetric[] = [];
  for (const entry of entries) {
    const statPath = join(CGROUP_V1_PATH, entry, 'memory.usage_in_bytes');
    try {
      const usage = parseInt(readFileSync(statPath, 'utf-8').trim(), 10);
      if (Number.isNaN(usage)) continue;
      // Entry names are cgroup names, not PIDs; the first number in the
      // name is only a best-effort identifier
      const pid = entry.match(/(\d+)/)?.[1] || 'unknown';
      // Coarse two-level heuristic, not a calibrated probability
      const leakProbability = usage > THRESHOLD_BYTES ? 0.85 : 0.15;
      metrics.push({ processId: pid, memoryUsageBytes: usage, leakProbability });
    } catch {
      // Skip plain files and cgroups whose usage file cannot be read
    }
  }
  return metrics.filter(m => m.leakProbability > 0.7);
}

export { scanCgroupEntries };
export type { CgroupMetric };
This utility isolates leak candidates by comparing the current memory usage of each top-level cgroup against a configurable baseline. Entries whose usage file cannot be read (plain files, inaccessible cgroups) are skipped. The leakProbability value is a coarse two-level heuristic: entries above the baseline are flagged for inspection, since sustained usage in a cgroup with no live tasks often points to an abandoned hierarchy.
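A complementary check for orphaned entries is to compare the kernel's own count of memory cgroups with the number of cgroup directories visible on disk. The sketch below assumes the four-column layout of /proc/cgroups (subsys_name, hierarchy, num_cgroups, enabled) and that a count far above the visible directories points to cgroups that were removed but not yet released. The helper names parseProcCgroups and hiddenCgroupCount are hypothetical, not part of the utility above.

```typescript
// Sketch: compare the kernel's memory-cgroup count with directories visible on disk.
// parseProcCgroups and hiddenCgroupCount are hypothetical helper names.

interface SubsystemInfo {
  hierarchy: number;
  numCgroups: number;
  enabled: boolean;
}

// Parses the text of /proc/cgroups. Assumed columns:
// #subsys_name  hierarchy  num_cgroups  enabled
function parseProcCgroups(text: string): Record<string, SubsystemInfo> {
  const result: Record<string, SubsystemInfo> = {};
  for (const line of text.split('\n')) {
    const trimmed = line.trim();
    if (trimmed === '' || trimmed.charAt(0) === '#') continue; // skip blanks and the header
    const fields = trimmed.split(/\s+/);
    if (fields.length < 4) continue;
    result[fields[0]] = {
      hierarchy: parseInt(fields[1], 10),
      numCgroups: parseInt(fields[2], 10),
      enabled: fields[3] === '1',
    };
  }
  return result;
}

// Returns how many memory cgroups the kernel tracks beyond those visible on disk.
// visibleDirCount should include the root cgroup directory itself.
function hiddenCgroupCount(procCgroupsText: string, visibleDirCount: number): number {
  const memory = parseProcCgroups(procCgroupsText)['memory'];
  if (!memory) throw new Error('memory controller not listed in /proc/cgroups');
  return Math.max(0, memory.numCgroups - visibleDirCount);
}
```

In practice the first argument would be the contents of /proc/cgroups and the second a recursive directory count under the memory hierarchy. A large, steadily growing difference is a stronger leak signal than a single high usage reading.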
Step 2: Correlate Memory Pressure with CPU Throttling
Memory cgroup leaks can push the kernel into direct reclaim, which runs in the context of the allocating process. Its CPU time is therefore charged against the container's CFS quota, and the workload is throttled. A PromQL query can correlate node_vmstat_pgscan_direct and container_cpu_cfs_throttled_seconds_total to validate the relationship.
# Detect correlation between kernel direct reclaim and container CPU throttling.
# node_vmstat_pgscan_direct requires pgscan in node_exporter's --collector.vmstat.fields.
# Assumes both series carry the same instance label; adjust on(...) to a label they share.
(
  rate(node_vmstat_pgscan_direct[5m]) > 0
)
and on (instance)
(
  rate(container_cpu_cfs_throttled_seconds_total{namespace="ml-training"}[5m]) > 0.5
)
This query identifies time windows where direct reclaim activity coincides with container CPU throttling. If both metrics spike together, the bottleneck is more likely cgroup accounting overhead than a compute shortage. Teams should hold off on scaling nodes during this window, since new nodes running the same agents will develop the same leak.
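The same check can be applied offline to series that have already been exported from Prometheus. The following is a minimal sketch, assuming both series are sampled at identical timestamps; the Sample type and the overlappingSpikes and spikeOverlapRatio functions are hypothetical names introduced for illustration.

```typescript
// Sketch: mirror the PromQL `and` on two already-fetched rate series.
// Sample, overlappingSpikes and spikeOverlapRatio are hypothetical names.

interface Sample {
  timestamp: number; // Unix seconds
  value: number;     // per-second rate at that timestamp
}

// Returns the timestamps at which both series exceed their thresholds.
function overlappingSpikes(
  reclaim: Sample[],
  throttle: Sample[],
  reclaimThreshold: number,
  throttleThreshold: number
): number[] {
  const throttleByTs: Record<number, number> = {};
  for (const s of throttle) throttleByTs[s.timestamp] = s.value;

  const hits: number[] = [];
  for (const s of reclaim) {
    const t = throttleByTs[s.timestamp];
    if (t !== undefined && s.value > reclaimThreshold && t > throttleThreshold) {
      hits.push(s.timestamp);
    }
  }
  return hits;
}

// Fraction of throttled samples that coincide with reclaim activity (0 when nothing is throttled).
function spikeOverlapRatio(
  reclaim: Sample[],
  throttle: Sample[],
  reclaimThreshold: number,
  throttleThreshold: number
): number {
  const throttled = throttle.filter(s => s.value > throttleThreshold).length;
  if (throttled === 0) return 0;
  return overlappingSpikes(reclaim, throttle, reclaimThreshold, throttleThreshold).length / throttled;
}
```

A ratio close to 1 means throttling almost always coincides with reclaim, which supports the leak hypothesis. A ratio near 0 suggests the throttling has another cause, such as a CPU limit that is simply too low.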
Step 3: Audit Background Agents
Orphaned platform agents are the primary source of cgroup leaks. A node-level audit script should enumerate running processes, cross-reference them with active workloads, and flag unused services.
#!/usr/bin/env bash
# audit_node_agents.sh
# Flags platform agents running on a node that hosts no active containers
AGENT_LIST=("ecs-agent" "datadog-agent" "fluentd" "node-exporter" "kube-proxy")
# pgrep -f matches process command lines, which do not contain Docker's k8s_
# container names, so look for container shim processes instead.
# Assumes a containerd-based runtime; adjust the pattern for other runtimes.
ACTIVE_PIDS=$(pgrep -f "containerd-shim" || true)
for agent in "${AGENT_LIST[@]}"; do
  # -d ',' joins multiple matching PIDs onto one line
  AGENT_PID=$(pgrep -d ',' -f "$agent" || true)
  if [[ -n "$AGENT_PID" && -z "$ACTIVE_PIDS" ]]; then
    echo "[ALERT] Orphaned agent detected: $agent (PID: $AGENT_PID)"
    echo "  -> No active container workloads found. Review before disabling."
  fi
done
This script checks for known platform agents and whether any containers are running on the node. An agent that is running while the node hosts no workloads is a candidate for disablement. The check is node-wide rather than per-agent, so it also flags components such as kube-proxy or node-exporter that may still be required; review each hit before acting.
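The audit decision can also be expressed as a small pure function, which makes the rule easier to test and to extend with an explicit allowlist. This is a sketch under stated assumptions: the essential flag, the Verdict values, and the classifyAgent and auditAgents names are hypothetical and do not come from the script.

```typescript
// Sketch: classify agents from an already-collected process snapshot.
// AgentStatus, Verdict, classifyAgent and auditAgents are hypothetical names.

interface AgentStatus {
  name: string;
  running: boolean;
  essential: boolean; // assumption: operators mark agents that must never be disabled
}

type Verdict = 'not-running' | 'keep' | 'review';

function classifyAgent(agent: AgentStatus, activeWorkloadCount: number): Verdict {
  if (!agent.running) return 'not-running';
  if (agent.essential) return 'keep';
  // A running, non-essential agent on a node with no workloads needs a human decision.
  return activeWorkloadCount === 0 ? 'review' : 'keep';
}

function auditAgents(agents: AgentStatus[], activeWorkloadCount: number): Record<string, Verdict> {
  const verdicts: Record<string, Verdict> = {};
  for (const agent of agents) {
    verdicts[agent.name] = classifyAgent(agent, activeWorkloadCount);
  }
  return verdicts;
}
```

Returning 'review' instead of disabling anything keeps the destructive step manual, which matters because a node-wide workload count says nothing about whether an individual agent is still needed.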
Step 4: Disable and Validate
Once an orphaned agent is identified, disable it via systemd override or Kubernetes node configuration. Validate that cgroup entries are reclaimed and CPU throttling subsides.
# systemd override for ecs-agent
[Service]
ExecStart=
ExecStart=/bin/true
Restart=no
After applying the override, run systemctl daemon-reload so systemd picks it up, then restart the agent service and monitor /sys/fs/cgroup/memory for entry cleanup. CPU throttling should typically normalize within 2-3 minutes as the kernel reclaims orphaned cgroup references.
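The validation step can be made explicit with a small check over samples collected after the override is applied. This is a minimal sketch: the NodeSample shape, the isRemediated function, and its threshold parameters are hypothetical, and suitable values depend on the workload.

```typescript
// Sketch: decide whether a node has recovered after an agent was disabled.
// NodeSample and isRemediated are hypothetical names.

interface NodeSample {
  cgroupCount: number;  // number of memory cgroup entries at sample time
  throttleRate: number; // throttled seconds per second at sample time
}

function isRemediated(
  baseline: NodeSample,
  window: NodeSample[],
  maxThrottleRate: number,
  stableSamples: number
): boolean {
  // Require enough samples to judge stability.
  if (stableSamples < 1 || window.length < stableSamples) return false;

  // Throttling must stay at or below the limit for the most recent samples.
  const tail = window.slice(window.length - stableSamples);
  const throttlingSettled = tail.every(s => s.throttleRate <= maxThrottleRate);

  // The cgroup entry count must have dropped below the pre-remediation baseline.
  const last = window[window.length - 1];
  const cgroupsReclaimed = last.cgroupCount < baseline.cgroupCount;

  return throttlingSettled && cgroupsReclaimed;
}
```

Requiring several consecutive quiet samples avoids declaring success on a single lucky reading while the kernel is still releasing entries.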
Architecture Decisions and Rationale
- Node-level auditing over pod-level monitoring: Cgroup leaks originate outside the container runtime. Pod metrics cannot detect kernel accounting corruption. Node-level inspection is mandatory.
- TypeScript diagnostic utility over shell scripts: TypeScript provides type safety, structured error handling, and easier integration with existing observability pipelines. Shell scripts are retained for quick node audits due to lower overhead.
- Disable over patch: Legacy agents often lack active maintenance. Patching memory leaks requires kernel-level changes or agent recompilation. Disabling unused agents is faster, safer, and eliminates the root cause.
- PromQL correlation over single-metric alerts: CPU throttling alone is ambiguous. Correlating with kernel memory reclaim metrics confirms the accounting corruption hypothesis before triggering remediation.
Pitfall Guide
1. Mistaking Reclaim-Induced Throttling for a Compute Shortage
Explanation: Teams observe high CPU throttling and assume compute shortage. The actual cause is kernel memory reclaim consuming cycles that count against the container's CPU quota.
Fix: Always correlate container_cpu_cfs_throttled_seconds_total with node_vmstat_pgscan_direct or node_vmstat_pgmajfault. If both spike together, investigate cgroup health before scaling.
2. Ignoring Cgroup Version Differences
Explanation: Cgroup v1 and v2 handle memory accounting differently. v1 exposes memory.usage_in_bytes and memory.limit_in_bytes under a per-controller hierarchy, while v2 exposes memory.current and memory.max under a single unified hierarchy. Scripts that hardcode the v1 paths fail on v2 nodes.
Fix: Detect cgroup version at runtime using stat -fc %T /sys/fs/cgroup (cgroup2fs indicates v2, tmpfs indicates v1). Branch logic accordingly or migrate clusters to v2 for unified accounting.
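The branching described in this fix can be sketched as follows. The function names cgroupVersionFromFsType, memoryFilesFor, and parseLimit are hypothetical; the sketch assumes the filesystem type string has already been obtained from stat -fc %T /sys/fs/cgroup.

```typescript
// Sketch: pick memory accounting file names based on the detected cgroup version.
// cgroupVersionFromFsType, memoryFilesFor and parseLimit are hypothetical names.

type CgroupVersion = 'v1' | 'v2';

// fsType is the output of `stat -fc %T /sys/fs/cgroup`.
function cgroupVersionFromFsType(fsType: string): CgroupVersion {
  const t = fsType.trim();
  if (t === 'cgroup2fs') return 'v2';
  if (t === 'tmpfs') return 'v1'; // v1 (or hybrid) mounts controllers beneath a tmpfs
  throw new Error('unrecognized cgroup filesystem type: ' + t);
}

interface MemoryFiles {
  usage: string;
  limit: string;
}

function memoryFilesFor(version: CgroupVersion): MemoryFiles {
  return version === 'v2'
    ? { usage: 'memory.current', limit: 'memory.max' }
    : { usage: 'memory.usage_in_bytes', limit: 'memory.limit_in_bytes' };
}

// On v2, memory.max holds the literal string "max" when no limit is set.
function parseLimit(raw: string): number {
  const t = raw.trim();
  return t === 'max' ? Infinity : parseInt(t, 10);
}
```

Hybrid setups, where a v2 hierarchy is mounted alongside v1 controllers, report tmpfs and are treated as v1 here because the memory controller still lives in the v1 tree.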
3. Assuming All Sidecars Are Essential
Explanation: Platform teams deploy monitoring, logging, and security agents as sidecars or node-level services. Over time, workloads change, but agents remain. Unused agents leak resources silently.
Fix: Implement quarterly agent audits. Cross-reference active workloads with running agents. Disable or remove services with zero dependency graphs.
4. Overlooking Kernel Reclaim Overhead
Explanation: Direct memory reclaim runs in process context and consumes CPU cycles. Under heavy cgroup leaks, reclaim overhead can exceed 15% of available CPU, causing apparent throttling.
Fix: Monitor node_vmstat_pgscan_direct and node_vmstat_pgsteal_direct. If direct reclaim dominates, reduce cgroup fragmentation by consolidating workloads or disabling leak sources.
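Whether direct reclaim dominates can be estimated from two snapshots of /proc/vmstat taken some interval apart. The sketch below assumes the kernel exposes consolidated pgscan_direct and pgscan_kswapd counters (older kernels split them per zone); parseVmstat and directReclaimShare are hypothetical helper names. The result is the share of scanned pages attributable to direct reclaim, which is not the same thing as a CPU percentage.

```typescript
// Sketch: share of page scanning done by direct reclaim between two /proc/vmstat snapshots.
// parseVmstat and directReclaimShare are hypothetical names.

// Parses "name value" lines into a counter map; malformed lines are ignored.
function parseVmstat(text: string): Record<string, number> {
  const counters: Record<string, number> = {};
  for (const line of text.split('\n')) {
    const parts = line.trim().split(/\s+/);
    if (parts.length !== 2) continue;
    const value = parseInt(parts[1], 10);
    if (!isNaN(value)) counters[parts[0]] = value;
  }
  return counters;
}

// Returns a value in [0, 1]; 0 when no pages were scanned in the interval.
function directReclaimShare(beforeText: string, afterText: string): number {
  const before = parseVmstat(beforeText);
  const after = parseVmstat(afterText);
  const delta = (key: string): number =>
    Math.max(0, (after[key] || 0) - (before[key] || 0));

  const direct = delta('pgscan_direct');
  const kswapd = delta('pgscan_kswapd');
  const total = direct + kswapd;
  return total === 0 ? 0 : direct / total;
}
```

A share well above one half means allocating processes are doing most of the reclaim work themselves instead of kswapd doing it in the background, which is the situation this pitfall describes.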
5. Relying Solely on Metrics-Server
Explanation: Kubernetes metrics-server aggregates pod-level CPU and memory usage. It does not expose kernel reclaim activity, cgroup accounting health, or node-level process anomalies.
Fix: Supplement metrics-server with node exporters, cgroup scrapers, and kernel tracing tools. Build dashboards that correlate container metrics with kernel behavior.
6. Disabling Agents Without Graceful Drain
Explanation: Terminating an agent abruptly can leave dangling cgroup entries or interrupt active log/metric pipelines. The kernel may retain orphaned references, prolonging the leak.
Fix: Use systemctl stop, then remove any leftover empty cgroups with cgdelete (from libcgroup) or rmdir on the cgroup directory. Verify /sys/fs/cgroup/memory before marking the node as healthy.
7. Skipping Post-Remediation Validation
Explanation: Teams disable the agent and assume the problem is resolved. Without validation, residual cgroup entries or secondary leaks can cause recurring throttling.
Fix: Run a 10-minute validation window. Monitor CPU throttling, memory reclaim metrics, and cgroup entry counts. Confirm normalization before closing the incident.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| ML training jobs show CPU throttling with healthy node metrics | Agent audit and cgroup scan | Throttling is likely accounting corruption, not compute shortage | Low (no node scaling required) |
| Cgroup v1 nodes exhibit persistent memory leaks | Migrate to cgroup v2 or apply kernel patches | v1 lacks unified memory accounting, increasing leak probability | Medium (requires cluster upgrade) |
| Multiple platform agents running with zero workload dependencies | Disable non-essential agents via systemd | Reduces kernel reclaim overhead and cgroup fragmentation | Low (immediate CPU recovery) |
| High direct reclaim overhead (>15% CPU) | Consolidate workloads and disable leak sources | Direct reclaim consumes CPU cycles, causing false throttling | Low (optimizes existing capacity) |
| Metrics-server shows normal usage but jobs stall | Deploy kernel-level tracing (bpftrace) | Metrics-server lacks kernel accounting visibility | Medium (requires tracing setup) |
Configuration Template
# kubelet-cgroup-config.yaml
# Aligns the kubelet cgroup driver with systemd. The kubelet detects cgroup v2
# from the host's mounted hierarchy, so no version field or feature gate is set here.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: systemd
memorySwap: {}
# systemd-override-ecs-agent.service
[Service]
ExecStart=
ExecStart=/bin/true
Restart=no
LimitNOFILE=1024
LimitNPROC=512
# prometheus-cgroup-alerts.yaml
groups:
  - name: cgroup-leak-detection
    rules:
      - alert: HighCgroupMemoryLeak
        # Pages scanned per second by direct reclaim; 2560 pages/s is about 10 MiB/s at 4 KiB pages
        expr: rate(node_vmstat_pgscan_direct[5m]) > 2560
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Kernel memory reclaim exceeding threshold"
          description: "Cgroup leak likely causing CPU throttling. Audit node agents."
Quick Start Guide
- Detect cgroup version: Run stat -fc %T /sys/fs/cgroup on each worker node. If the output is cgroup2fs, you are on v2. If it is tmpfs, you are on v1. Adjust diagnostic scripts accordingly.
- Deploy cgroup scanner: Copy the TypeScript diagnostic utility to a monitoring pod or node agent. Execute scanCgroupEntries(), which returns only the entries where leakProbability > 0.7.
- Correlate metrics: Query Prometheus for rate(node_vmstat_pgscan_direct[5m]) and rate(container_cpu_cfs_throttled_seconds_total[5m]). If both exceed thresholds simultaneously, proceed to agent audit.
- Audit and disable: Run audit_node_agents.sh on affected nodes. Identify orphaned services. Apply systemd overrides to disable them. Force cgroup cleanup with cgdelete -r memory:<orphaned_path>, where the path is relative to the memory hierarchy root.
- Validate: Monitor CPU throttling and memory reclaim metrics for 10 minutes. Confirm normalization. Document the agent lifecycle policy to prevent recurrence.