e requires three coordinated layers: kernel-space probes, a portable compilation strategy, and a userspace aggregation agent. The architecture prioritizes stability, verifiability, and seamless integration with existing Prometheus/Grafana stacks.
Step 1: Kernel Prerequisites and BTF Validation
eBPF programs rely on the BPF Type Format (BTF) for binary portability. A BTF-enabled kernel exposes its own type information (struct layouts and field offsets), which lets the loader, such as libbpf or cilium/ebpf, adjust struct field offsets in the program at load time. Without BTF, probes must be compiled against the exact kernel headers of each node, which breaks in auto-scaling or auto-updating managed clusters.
Verify BTF availability before deployment:
# Check that the running kernel exposes its BTF
ls -l /sys/kernel/btf/vmlinux
# Optionally confirm that it parses (requires bpftool)
bpftool btf dump file /sys/kernel/btf/vmlinux format c | head
Managed Kubernetes distributions (GKE, EKS with Amazon Linux 2023, AKS) ship with BTF-enabled kernels. The probe in Step 2 also needs kernel 5.8+ because it uses the BPF ring buffer. If running self-managed clusters, ensure CONFIG_DEBUG_INFO_BTF=y is set during kernel compilation.
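The same prerequisites can be checked programmatically, for example at agent startup. The following is a minimal sketch, not part of the original agent: the function names `kernelAtLeast` and `hasVmlinuxBTF` are hypothetical, and it assumes the kernel release string starts with `major.minor` as in `5.15.0-1034-gke`.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// leadingInt parses the decimal digits at the start of s ("0-rc3" -> 0).
func leadingInt(s string) (int, bool) {
	end := 0
	for end < len(s) && s[end] >= '0' && s[end] <= '9' {
		end++
	}
	if end == 0 {
		return 0, false
	}
	n, err := strconv.Atoi(s[:end])
	return n, err == nil
}

// kernelAtLeast reports whether a release string such as "5.15.0-1034-gke"
// is at or above wantMajor.wantMinor. Unparseable strings return false.
func kernelAtLeast(release string, wantMajor, wantMinor int) bool {
	parts := strings.SplitN(release, ".", 3)
	if len(parts) < 2 {
		return false
	}
	major, okMajor := leadingInt(parts[0])
	minor, okMinor := leadingInt(parts[1])
	if !okMajor || !okMinor {
		return false
	}
	return major > wantMajor || (major == wantMajor && minor >= wantMinor)
}

// hasVmlinuxBTF reports whether the given path exists and is not a directory.
func hasVmlinuxBTF(path string) bool {
	info, err := os.Stat(path)
	return err == nil && !info.IsDir()
}

func main() {
	raw, err := os.ReadFile("/proc/sys/kernel/osrelease")
	if err != nil {
		fmt.Println("cannot read kernel release:", err)
		return
	}
	release := strings.TrimSpace(string(raw))
	fmt.Printf("kernel %s: >=5.8 %v, BTF present %v\n",
		release, kernelAtLeast(release, 5, 8), hasVmlinuxBTF("/sys/kernel/btf/vmlinux"))
}
```

Failing fast with a clear log line on unsupported nodes is usually easier to diagnose than a load error from the eBPF loader.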
Step 2: Portable Probe Design with BPF CO-RE
BPF CO-RE (Compile Once, Run Everywhere) eliminates kernel-version coupling. Instead of hardcoding struct offsets, the BPF_CORE_READ macros record each field access as a relocation that the loader resolves against the target kernel's BTF at load time. The probe compiles once in CI, ships as a container image, and loads safely across heterogeneous node pools.
Below is a minimal C probe that tracks TCP retransmits and extracts the IPv4 source and destination address/port for pod correlation:
// tcp_retransmit_tracker.c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>
#include <bpf/bpf_endian.h>
#include <bpf/bpf_tracing.h>

// Event sent to userspace. Addresses are IPv4 in network byte order;
// ports are in host byte order.
struct retransmit_event {
    __u32 saddr;
    __u32 daddr;
    __u16 sport;
    __u16 dport;
    __u64 timestamp_ns;
};

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024); // buffer size in bytes
} retransmit_ring SEC(".maps");

SEC("tracepoint/tcp/tcp_retransmit_skb")
int handle_tcp_retransmit(struct trace_event_raw_tcp_event_sk_skb *ctx)
{
    struct sock *sk = (struct sock *)ctx->skaddr;
    if (!sk)
        return 0;

    struct retransmit_event *evt = bpf_ringbuf_reserve(&retransmit_ring, sizeof(*evt), 0);
    if (!evt)
        return 0; // ring buffer full: this event is dropped

    evt->saddr = BPF_CORE_READ(sk, __sk_common.skc_rcv_saddr);
    evt->daddr = BPF_CORE_READ(sk, __sk_common.skc_daddr);
    // skc_num is the local port in host byte order; skc_dport is big-endian.
    evt->sport = BPF_CORE_READ(sk, __sk_common.skc_num);
    evt->dport = bpf_ntohs(BPF_CORE_READ(sk, __sk_common.skc_dport));
    evt->timestamp_ns = bpf_ktime_get_ns();
    bpf_ringbuf_submit(evt, 0);
    return 0;
}

char LICENSE[] SEC("license") = "Dual BSD/GPL";
Architecture Rationale:
- BPF_MAP_TYPE_RINGBUF replaces legacy perf buffers. One buffer is shared by all CPUs, event order is preserved, and userspace can consume records in batches, which reduces wakeups and context switches.
- Tracepoint attachment (tcp_retransmit_skb) is more stable across kernel versions than kprobes, which hook internal functions that can be renamed or inlined between releases.
- Network byte order conversion (bpf_ntohs) happens in-kernel, so userspace receives ports in host byte order.
Step 3: Userspace Aggregation and Prometheus Exposition
Kernel probes emit raw events. A userspace DaemonSet agent reads the ring buffer, correlates network addresses to Kubernetes pod metadata via the API server, and aggregates the events into Prometheus counters and histograms.
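// collector.go (userspace agent)
package main

import (
	"bytes"
	"context"
	"encoding/binary"
	"errors"
	"log"
	"net"
	"strconv"

	"github.com/cilium/ebpf/ringbuf"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// RetransmitEvent mirrors struct retransmit_event in the probe.
type RetransmitEvent struct {
	Saddr       [4]byte // network byte order
	Daddr       [4]byte // network byte order
	Sport       uint16
	Dport       uint16
	_           uint32 // compiler padding before the 8-byte timestamp
	TimestampNs uint64
}

var tcpRetransmits = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "node_tcp_retransmits_total",
		Help: "TCP retransmits observed by the eBPF probe.",
	},
	[]string{"pod", "namespace", "service", "dest_port"},
)

func main() {
	ctx := context.Background()
	config, err := rest.InClusterConfig()
	if err != nil {
		log.Fatalf("load in-cluster config: %v", err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatalf("create Kubernetes client: %v", err)
	}
	// ringBufMap (the loaded retransmit_ring map) and resolvePodByIP are
	// defined elsewhere in the agent and omitted for brevity.
	rd, err := ringbuf.NewReader(ringBufMap)
	if err != nil {
		log.Fatalf("open ring buffer: %v", err)
	}
	defer rd.Close()
	for {
		record, err := rd.Read()
		if errors.Is(err, ringbuf.ErrClosed) {
			return
		} else if err != nil {
			log.Printf("read ring buffer: %v", err)
			continue
		}
		// Decode the raw sample using the kernel struct layout (little-endian hosts).
		var evt RetransmitEvent
		if err := binary.Read(bytes.NewReader(record.RawSample), binary.LittleEndian, &evt); err != nil {
			log.Printf("decode event: %v", err)
			continue
		}
		destIP := net.IP(evt.Daddr[:]).String()
		podMeta := resolvePodByIP(ctx, clientset, destIP)
		tcpRetransmits.WithLabelValues(
			podMeta.Name, podMeta.Namespace, podMeta.Labels["app"],
			strconv.Itoa(int(evt.Dport)),
		).Inc()
	}
}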
Architecture Rationale:
- The agent runs as a DaemonSet, ensuring one instance per node. It watches the Kubernetes API for pod IP assignments, maintaining a local cache to resolve daddr to pod/namespace/service labels.
- Prometheus counters/histograms are exposed via /metrics. No external push gateways required.
- Binary event decoding uses encoding/binary with little-endian byte order, matching the kernel struct layout.
For L7 metrics, attach kprobes or syscall tracepoints to the accept, read, and write syscalls. Parse the initial bytes in-kernel to extract HTTP method and status code. Aggregate durations into Prometheus histograms:
http_request_duration_seconds_bucket{pod="checkout-svc-8f2a",method="POST",status="201",le="0.05"} 8420
http_request_duration_seconds_bucket{pod="checkout-svc-8f2a",method="POST",status="201",le="0.1"} 9105
http_request_duration_seconds_bucket{pod="checkout-svc-8f2a",method="POST",status="201",le="+Inf"} 9142
This approach works across Go, Rust, Python, and Node.js because it intercepts the OS boundary, not the runtime. No SDK installation, no environment variable injection, no container rebuilds.
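The first-bytes parsing described above can be prototyped in userspace before it is ported to a probe. The sketch below is only an illustration of that logic: `parseMethod` and `parseStatus` are hypothetical names, it handles plaintext HTTP/1.x only (not HTTP/2 or TLS), and an in-kernel version would need the same checks written in C with bounded reads.

```go
package main

import (
	"bytes"
	"fmt"
	"strconv"
)

// Methods recognized by this sketch; extend as needed.
var httpMethods = [][]byte{
	[]byte("GET"), []byte("POST"), []byte("PUT"), []byte("DELETE"),
	[]byte("PATCH"), []byte("HEAD"), []byte("OPTIONS"),
}

// parseMethod returns the HTTP method if buf starts with "<METHOD> ".
func parseMethod(buf []byte) (string, bool) {
	for _, m := range httpMethods {
		if len(buf) > len(m) && bytes.HasPrefix(buf, m) && buf[len(m)] == ' ' {
			return string(m), true
		}
	}
	return "", false
}

// parseStatus returns the status code if buf starts with "HTTP/1.x NNN".
func parseStatus(buf []byte) (int, bool) {
	// "HTTP/1." + minor digit + space + three status digits = 12 bytes.
	if len(buf) < 12 || !bytes.HasPrefix(buf, []byte("HTTP/1.")) || buf[8] != ' ' {
		return 0, false
	}
	code, err := strconv.Atoi(string(buf[9:12]))
	if err != nil || code < 100 || code > 599 {
		return 0, false
	}
	return code, true
}

func main() {
	method, okMethod := parseMethod([]byte("POST /checkout HTTP/1.1\r\nHost: shop\r\n"))
	status, okStatus := parseStatus([]byte("HTTP/1.1 201 Created\r\n"))
	fmt.Println(method, okMethod, status, okStatus)
}
```

Only the first dozen bytes of each direction are inspected, which keeps the work per request small and the label set limited to method and status.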
Pitfall Guide
| Pitfall | Explanation | Fix |
|---|---|---|
| Assuming eBPF replaces distributed tracing | eBPF captures network and syscall boundaries, not application-level trace context (e.g., traceparent headers). It cannot reconstruct cross-service spans. | Pair eBPF metrics with OpenTelemetry SDKs for trace propagation. Use eBPF for aggregate latency/error rates, OTel for request-scoped traces. |
| TLS/mTLS blindness at socket layer | eBPF probes attached to tcp_sendmsg or tcp_recvmsg see encrypted payloads. HTTP method/status codes are unreadable. | Attach uprobes to TLS library entry points (SSL_read/SSL_write in OpenSSL, rustls equivalents). Maintain per-library, per-version symbol and offset maps, because CO-RE relocations cover kernel types, not userspace libraries. |
| Verifier rejection due to unbounded complexity | The eBPF verifier rejects programs with unbounded loops, excessive stack usage, or unvalidated pointer arithmetic. | Use bounded loops (kernel 5.3+) or the bpf_loop() helper (kernel 5.17+), keep stack usage within the 512-byte limit, use BPF_CORE_READ for all kernel struct access, and validate pointers before dereferencing. |
| Ring buffer overflow under high throughput | If userspace reads slower than the kernel emits events, bpf_ringbuf_reserve() fails and samples are dropped. Metrics become inaccurate during traffic spikes. | Size the ring buffer at 256 KB to 1 MB per node, implement backpressure-aware reading, and count failed bpf_ringbuf_reserve() calls in the probe so drops are visible as a metric. Consider per-CPU maps for extreme throughput. |
| Pod IP cache staleness | Kubernetes rapidly assigns/reclaims IPs. A static lookup table causes metric misattribution or label drift. | Watch v1/pods with resourceVersion, maintain a TTL-based cache (30s), and fall back to node-level labels when pod resolution fails. |
| Metric cardinality explosion | Attaching labels like request_id or full_url to Prometheus metrics creates unbounded series, crashing the TSDB. | Limit labels to pod, namespace, service, method, status. Use exemplars for trace correlation instead of high-cardinality labels. |
| Ignoring kernel version floor | The ring buffer map requires kernel 5.8+, and CO-RE requires a BTF-enabled kernel. Deploying to legacy nodes causes load failures or fallback to non-portable probes. | Validate node kernel versions in admission webhooks or with node labels and DaemonSet node affinity. Ship fallback probes or exclude pre-5.8 nodes from eBPF telemetry. |
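As one way to apply the cardinality advice from the table, a label with a large value space (such as a destination port) can be collapsed onto a small allow-list. This is a sketch under assumptions: `portLabel` and the allow-list contents are hypothetical and would be tuned per cluster.

```go
package main

import (
	"fmt"
	"strconv"
)

// portLabel returns the port as a label value when it is in the allow-list
// and "other" otherwise, so the label takes at most len(allowed)+1 values.
func portLabel(port uint16, allowed map[uint16]bool) string {
	if allowed[port] {
		return strconv.Itoa(int(port))
	}
	return "other"
}

func main() {
	// Hypothetical allow-list of well-known service ports.
	allowed := map[uint16]bool{80: true, 443: true, 5432: true}
	for _, p := range []uint16{443, 5432, 49152} {
		fmt.Println(p, "->", portLabel(p, allowed))
	}
}
```

The same pattern applies to any label whose values come from traffic rather than from a fixed set.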
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Polyglot microservices, strict budget | eBPF DaemonSet + Prometheus | Zero SDK overhead, scales by node, eliminates licensing | ~$0 licensing, minimal compute |
| Strict compliance requiring full request tracing | OpenTelemetry SDKs + eBPF metrics | OTel handles trace context; eBPF handles network health | Moderate SDK overhead, no licensing |
| Legacy cluster (kernel < 5.8) | Sidecar proxy or host-level agent | BTF unavailable, CO-RE fails, kernel compatibility risk | Higher memory/CPU, potential licensing |
| High-throughput gRPC/mTLS services | eBPF + TLS library uprobes | Socket layer blind to ciphertext; uprobes decode at library boundary | Requires version mapping, moderate dev effort |
| Multi-tenant cluster with strict isolation | eBPF with cgroup-based filtering | Prevents cross-tenant metric leakage, enforces namespace boundaries | Requires cgroupv2, moderate config complexity |
Configuration Template
# ebpf-telemetry-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kernel-telemetry-agent
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: ebpf-collector
  template:
    metadata:
      labels:
        app: ebpf-collector
    spec:
      hostNetwork: true
      hostPID: true
      containers:
        - name: collector
          image: registry.internal/ebpf-telemetry:latest
          securityContext:
            privileged: true
          volumeMounts:
            - name: bpf-maps
              mountPath: /sys/fs/bpf
            - name: proc
              mountPath: /host/proc
              readOnly: true
      volumes:
        - name: bpf-maps
          hostPath:
            path: /sys/fs/bpf
        - name: proc
          hostPath:
            path: /proc
---
# prometheus-scrape-config.yaml
scrape_configs:
  - job_name: 'ebpf-node-telemetry'
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):10250'
        target_label: __address__
        replacement: '${1}:9090'
    metrics_path: /metrics
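To see what the relabel rule above does to a discovered target, the rewrite can be reproduced with Go's regexp package. This is a sketch for illustration only: `rewriteAddress` is a hypothetical helper, and it relies on Prometheus anchoring relabel regexes to the full value and leaving non-matching values unchanged for the default replace action.

```go
package main

import (
	"fmt"
	"regexp"
)

// The relabel regex '(.*):10250', anchored to the whole value as Prometheus does.
var kubeletAddr = regexp.MustCompile(`^(?:(.*):10250)$`)

// rewriteAddress swaps the kubelet port for the agent's metrics port and
// returns non-matching addresses unchanged.
func rewriteAddress(addr string) string {
	if !kubeletAddr.MatchString(addr) {
		return addr
	}
	return kubeletAddr.ReplaceAllString(addr, "${1}:9090")
}

func main() {
	fmt.Println(rewriteAddress("10.0.0.5:10250"))
	fmt.Println(rewriteAddress("10.0.0.5:9100"))
}
```

Because the DaemonSet uses hostNetwork, the node address with port 9090 reaches the agent directly.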
Quick Start Guide
- Validate kernel compatibility: Confirm that /sys/kernel/btf/vmlinux exists on a representative node and that the kernel is 5.8 or newer. Proceed only if both hold.
- Compile probes: Build the CO-RE object with clang (-target bpf) and generate loader code with bpftool gen skeleton, or use bpf2go from the cilium/ebpf toolchain. Package the result into a container image with the userspace Go agent.
- Deploy DaemonSet: Apply the YAML manifest. Verify pods reach Running state and mount /sys/fs/bpf successfully.
- Expose metrics: Confirm the agent listens on port 9090 and serves Prometheus exposition format at /metrics.
- Ingest and visualize: Add the scrape config to Prometheus. Import pre-built Grafana dashboards for TCP retransmit rates and HTTP latency histograms. Validate pod label resolution against the Kubernetes API.