Datadog / linux

2 posts

Hardening eBPF for runtime security: Lessons from Datadog Workload Protection | Datadog

Scaling real-time file monitoring across high-traffic environments requires a strategy to process billions of kernel events without exhausting system resources. By leveraging eBPF, organizations can move filtering logic directly into the Linux kernel, drastically reducing the overhead associated with traditional userspace monitoring tools. This approach enables precise observability of file system activity while maintaining the performance necessary for large-scale production workloads.

### Limitations of Traditional Monitoring Tools

* Conventional tools like `auditd` often struggle with performance bottlenecks because they require every event to be copied from the kernel to userspace for evaluation.
* Standard APIs like `fanotify` and `inotify` lack the granularity needed for complex filtering, often resulting in "event storms" during high I/O operations.
* The high frequency of context switching between kernel and userspace when processing billions of events per minute can lead to significant CPU spikes and system instability.

### Architecture of eBPF-Based File Monitoring

* The system hooks into the Virtual File System (VFS) layer using `kprobes` and `tracepoints` to capture actions such as `vfs_read`, `vfs_write`, and `vfs_open`.
* LSM (Linux Security Module) hooks are utilized for security-focused monitoring, providing a stable interface that is less prone to kernel version changes than raw kprobes.
* By executing C-like code within the kernel’s sandboxed environment, the system can inspect file paths and process IDs (PIDs) instantly upon event creation.

### In-Kernel Filtering and Data Management

* High-performance eBPF maps, specifically `BPF_MAP_TYPE_HASH` and `BPF_MAP_TYPE_LPM_TRIE`, are used to store allowlists and denylists for specific directories and file extensions.
* The system implements prefix matching to ignore high-volume, low-value paths like `/proc`, `/sys`, or temporary build directories, discarding these events before they ever leave the kernel.
* To minimize memory contention, per-CPU maps are employed, allowing the eBPF programs to aggregate data locally on each core without the need for expensive global locks.

### Efficient Data Transmission with Ring Buffers

* The implementation uses the BPF ring buffer (`BPF_MAP_TYPE_RINGBUF`) rather than the older per-CPU perf buffer (`BPF_MAP_TYPE_PERF_EVENT_ARRAY`) to transfer data to userspace.
* A ring buffer is a single memory region shared between the kernel and userspace. Compared with per-CPU perf buffers, it uses memory more efficiently and preserves event ordering across CPUs.
* By pushing only "filtered" events, which represent a tiny fraction of the billions of raw kernel events, the system prevents userspace consumers from becoming overwhelmed.

For organizations operating at massive scale, moving from reactive userspace logging to proactive kernel-level filtering is essential. An eBPF-based monitoring stack provides deep visibility into file system changes with minimal performance impact, which makes it a strong fit for modern, high-throughput cloud environments.
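
The prefix-based discard step described above can be modelled in plain Python. This is only a userspace sketch of the decision logic, not eBPF code and not Datadog's implementation: in the kernel the lookup would go through a `BPF_MAP_TYPE_LPM_TRIE` map from a C program. The discard list, the `/tmp/build` path, and all function names are hypothetical and chosen for illustration.

```python
from typing import Iterable, Iterator, Optional, Tuple

# Hypothetical discard list modelled on the paths named in the summary.
# "/tmp/build" stands in for "temporary build directories" (an assumption).
DISCARD_PREFIXES = ("/proc", "/sys", "/tmp/build")


def longest_matching_prefix(
    path: str, prefixes: Iterable[str] = DISCARD_PREFIXES
) -> Optional[str]:
    """Return the longest prefix covering `path`, or None if none matches.

    Matching is done on path-component boundaries, so "/process" is not
    treated as being under "/proc". A real LPM trie matches raw key bits,
    so an in-kernel version would need its own boundary handling.
    """
    best: Optional[str] = None
    for prefix in prefixes:
        base = prefix.rstrip("/")
        if path == base or path.startswith(base + "/"):
            if best is None or len(prefix) > len(best):
                best = prefix
    return best


def should_forward(path: str) -> bool:
    """True if an event for `path` should be sent on to userspace."""
    return longest_matching_prefix(path) is None


def filter_events(
    events: Iterable[Tuple[int, str]]
) -> Iterator[Tuple[int, str]]:
    """Yield only the (pid, path) events that survive the discard list."""
    for pid, path in events:
        if should_forward(path):
            yield pid, path
```

For example, `list(filter_events([(1, "/proc/1/stat"), (2, "/etc/shadow")]))` keeps only the `/etc/shadow` event. The point of doing this inside the kernel rather than in a consumer like the one above is that discarded events never cross the kernel/userspace boundary at all.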

Using the Dirty Pipe vulnerability to break out from containers | Datadog

The Dirty Pipe vulnerability (CVE-2022-0847) is a critical Linux kernel flaw that allows unprivileged processes to write data to any file they can read, effectively bypassing standard write permissions. This primitive is particularly dangerous in containerized environments like Kubernetes, where it can be leveraged to overwrite the host’s container runtime binary. By exploiting how the kernel manages the page cache, an attacker can achieve a full container breakout and gain administrative privileges on the underlying host.

## Container Runtimes and the OCI Specification

* Kubernetes utilizes the Container Runtime Interface (CRI) to manage containers via high-level runtimes like containerd or CRI-O.
* These high-level runtimes rely on low-level Open Container Initiative (OCI) runtimes, most commonly runC, to handle the heavy lifting of namespaces and control groups.
* Isolation is achieved by runC setting up a restricted environment before executing the user-supplied entrypoint via the `execve` system call.

## Evolution of runC Vulnerabilities

* A historical vulnerability, CVE-2019-5736, previously allowed escapes by overwriting the host’s runC binary through the `/proc/self/exe` file descriptor.
* To mitigate this, runC was updated to either clone the binary before execution or mount the host's runC binary as read-only inside the container.
* While the read-only mount improved performance by letting the kernel share page cache pages, it created a target for Dirty Pipe, which operates on the kernel page cache.

## The Dirty Pipe Exploitation Primitive

* Dirty Pipe allows an attacker to overwrite any file they can read, including read-only files, by manipulating the kernel's internal pipe-buffer structures.
* The exploit targets the page cache, meaning the overwrite is non-persistent and resides only in memory; the original file on disk remains unchanged.
* In a container escape scenario, the attacker waits for a runC process to start (triggered by actions like `kubectl exec`) and targets the file descriptor at `/proc/<runC-pid>/exe`.

## Proof-of-Concept Escape Walkthrough

* The attack begins with a standard, unprivileged pod running a malicious script that monitors the system for new runC processes.
* Once a `kubectl exec` command is issued by an administrator, the script identifies the runC PID and applies the Dirty Pipe exploit to the associated executable.
* The exploit overwrites the runC binary in the kernel page cache with a malicious ELF binary.
* Because the host kernel is executing this hijacked binary with root privileges to manage the container, the attacker’s malicious code (e.g., a reverse shell or administrative command) runs with full host-level authority.

To protect against this attack vector, it is essential to patch the Linux kernel to a version that includes the fix for CVE-2022-0847 and ensure that container nodes are running updated distributions.
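
As a rough aid for the patching advice above, the sketch below flags kernel release strings that fall in the upstream range affected by CVE-2022-0847. It assumes the publicly documented upstream boundaries (introduced in 5.8; fixed in 5.10.102, 5.15.25, 5.16.11, and 5.17 onward). Distributions often backport the fix without changing the base version, so a `True` result means "check the vendor advisory", not "confirmed vulnerable". The function names are hypothetical.

```python
import re
from typing import Tuple

# Upstream stable releases that carry the fix, keyed by (major, minor).
# Assumption: these are the boundaries from the public CVE-2022-0847 advisory.
FIXED_PATCH_LEVEL = {(5, 10): 102, (5, 15): 25, (5, 16): 11}

FIRST_AFFECTED = (5, 8)    # series in which the bug was introduced
FIRST_CLEAN = (5, 17)      # series released with the fix already included


def parse_kernel_release(release: str) -> Tuple[int, int, int]:
    """Parse strings such as '5.15.0-91-generic' into (major, minor, patch)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match[1]), int(match[2]), int(match[3] or 0)


def may_be_vulnerable(release: str) -> bool:
    """Return True if the upstream version number falls in the affected range."""
    major, minor, patch = parse_kernel_release(release)
    series = (major, minor)
    if series < FIRST_AFFECTED or series >= FIRST_CLEAN:
        return False
    fixed_patch = FIXED_PATCH_LEVEL.get(series)
    if fixed_patch is None:
        # Affected series with no listed fixed release (e.g. 5.11 to 5.14).
        return True
    return patch < fixed_patch
```

On a node, this could be called as `may_be_vulnerable(platform.release())` after `import platform`. Because the check only looks at the version number, it is best treated as a first-pass filter for finding nodes that need a closer look.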