Datadog

91 posts

www.datadoghq.com/blog/engineering


Hardening eBPF for runtime security: Lessons from Datadog Workload Protection | Datadog

Scaling real-time file monitoring across high-traffic environments requires a strategy for processing billions of kernel events without exhausting system resources. By leveraging eBPF, organizations can move filtering logic directly into the Linux kernel, drastically reducing the overhead of traditional userspace monitoring tools. This approach enables precise observability of file system activity while maintaining the performance necessary for large-scale production workloads.

### Limitations of Traditional Monitoring Tools

* Conventional tools like `auditd` often struggle with performance bottlenecks because they require every event to be copied from the kernel to userspace for evaluation.
* Standard APIs like `fanotify` and `inotify` lack the granularity needed for complex filtering, often resulting in "event storms" during high I/O operations.
* The high frequency of context switching between kernel and userspace when processing billions of events per minute can lead to significant CPU spikes and system instability.

### Architecture of eBPF-Based File Monitoring

* The system hooks into the Virtual File System (VFS) layer using `kprobes` and `tracepoints` to capture actions such as `vfs_read`, `vfs_write`, and `vfs_open`.
* LSM (Linux Security Module) hooks are used for security-focused monitoring, providing a stable interface that is less prone to breakage across kernel versions than raw kprobes.
* By executing restricted C code within the kernel's sandboxed environment, the system can inspect file paths and process IDs (PIDs) the moment an event is created.

### In-Kernel Filtering and Data Management

* High-performance eBPF maps, specifically `BPF_MAP_TYPE_HASH` and `BPF_MAP_TYPE_LPM_TRIE`, store allowlists and denylists for specific directories and file extensions.
* The system implements prefix matching to ignore high-volume, low-value paths like `/proc`, `/sys`, or temporary build directories, discarding these events before they ever leave the kernel.
* To minimize memory contention, per-CPU maps allow the eBPF programs to aggregate data locally on each core without the need for expensive global locks.

### Efficient Data Transmission with Ring Buffers

* The implementation uses `BPF_RINGBUF` rather than the older `BPF_PERF_EVENT_ARRAY` to handle data transfer to userspace.
* Ring buffers provide a shared memory region between the kernel and userspace, offering better memory efficiency and guaranteed event ordering.
* By pushing only "filtered" events, a tiny fraction of the billions of raw kernel events, the system prevents userspace consumers from becoming overwhelmed.

For organizations operating at massive scale, moving from reactive userspace logging to proactive kernel-level filtering is essential. An eBPF-based monitoring stack allows deep visibility into file system changes with minimal performance impact, making it the recommended standard for modern, high-throughput cloud environments.
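The actual filtering described above runs as restricted C compiled to eBPF inside the kernel. As a language-neutral sketch of just the decision logic, the Python model below stands in a tuple of prefixes for the LPM trie map and a bounded deque for the ring buffer; the prefix list, event shapes, and function names are illustrative assumptions, not Datadog's actual code.

```python
# Hypothetical model of the in-kernel filtering path: drop events under
# noisy prefixes before they "leave the kernel", and forward the small
# remainder to userspace through a bounded ring buffer.
from collections import deque

# Stand-in for a BPF_MAP_TYPE_LPM_TRIE keyed on path prefixes (assumed paths).
DENY_PREFIXES = ("/proc/", "/sys/", "/tmp/build/")

# Stand-in for BPF_RINGBUF: a bounded buffer shared with userspace consumers.
RINGBUF = deque(maxlen=4096)

def handle_vfs_event(path: str, pid: int) -> bool:
    """Return True if the event was forwarded to userspace."""
    # Prefix match against the denylist: high-volume, low-value paths are
    # discarded here, so userspace never pays to process them.
    if any(path.startswith(prefix) for prefix in DENY_PREFIXES):
        return False
    RINGBUF.append({"path": path, "pid": pid})
    return True

events = [
    ("/proc/1234/stat", 1234),
    ("/etc/passwd", 42),
    ("/sys/kernel/debug", 7),
    ("/home/app/config.yaml", 42),
]
forwarded = [path for path, pid in events if handle_vfs_event(path, pid)]
print(forwarded)  # only the two paths outside the denied prefixes survive
```

A real LPM trie gives longest-prefix semantics over binary keys rather than a linear scan, but the observable effect is the same: the ring buffer receives only the filtered minority of events.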


Failure is inevitable: Learning from a large outage, and building for reliability in depth at Datadog | Datadog

Following a major 2023 incident that caused a near-total platform outage despite partial infrastructure availability, Datadog shifted its engineering philosophy from "never-fail" architectures to a model of graceful degradation. The company identified that prioritizing absolute data correctness during systemic stress created "square-wave" failures, where the entire platform appeared down if even a portion of data was missing. By moving toward a "fail better" mindset, Datadog now focuses on maintaining core functionality and data persistence even when underlying infrastructure is compromised.

## Limitations of the Never-Fail Approach

* Classical root-cause analysis focused on a legacy, unsupervised global update mechanism that disconnected 50–60% of production Kubernetes nodes.
* While the "precipitating event" was easily identified and disabled, the engineering team realized that fixing the trigger did not address the systemic fragility behind the binary (up/down) failure pattern.
* Prioritizing absolute accuracy meant that systems waited for all data tags to process before displaying results; under stress, this caused the UI to show no data at all rather than "almost correct" data.
* Sequential queuing, aggressive retry logic, and node-specific processing requirements exacerbated the bottleneck, preventing real-time recovery.

## Prioritizing Graceful Degradation

* The incident prompted a shift away from relying solely on redundancy to prevent outages, acknowledging that some level of failure is eventually inevitable at scale.
* Engineering priorities were redefined to ensure that data is never lost (even if delayed) and that real-time data is processed before stale backlogs.
* The platform now aims to serve partial-but-accurate results to customers during an incident, providing visibility rather than a complete blackout.
* Implementation is handled as a company-wide program in which individual product teams adapt these principles to their specific architectural needs.

## Strengthening Data Persistence at Intake

* Analysis revealed that data was lost during the outage because it was stored in memory or on local disks before being replicated to persistent stores.
* The original design favored low-latency responses by acknowledging receipt of data before it was fully replicated, making that data unrecoverable if the node failed.
* Downstream failures caused intake nodes to overflow their local buffers, leading to data loss even on nodes that remained online.
* New architectural changes focus on disk-based persistence at the very beginning of the processing pipeline, ensuring data survives node restarts and downstream congestion.

To build truly resilient systems, engineering teams must move beyond trying to prevent every possible failure trigger. Instead, design services that can survive partial infrastructure loss by prioritizing data persistence and allowing for degraded states that still provide value to the end user.
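The "persist before acknowledge" change at intake can be sketched in miniature. This is a minimal model under stated assumptions, not Datadog's actual intake design: the journal file name, record format, and class names are hypothetical, and it uses a simple fsync'd append-only journal to show why acknowledging only after a durable write lets data survive a node restart.

```python
# Hypothetical sketch: durably journal each record BEFORE acknowledging it,
# so a crash after the ack can no longer make the data unrecoverable.
import json
import os
import tempfile

class Intake:
    def __init__(self, journal_path: str):
        self.journal_path = journal_path

    def receive(self, record: dict) -> bool:
        # Append the record and force it to disk before returning the ack.
        with open(self.journal_path, "a") as journal:
            journal.write(json.dumps(record) + "\n")
            journal.flush()
            os.fsync(journal.fileno())
        return True  # ack is sent only once the write is persistent

    def replay(self) -> list:
        # After a restart, re-read the journal and resume processing from it.
        with open(self.journal_path) as journal:
            return [json.loads(line) for line in journal if line.strip()]

path = os.path.join(tempfile.mkdtemp(), "intake.journal")
intake = Intake(path)
intake.receive({"metric": "cpu", "value": 0.97})
intake.receive({"metric": "mem", "value": 0.41})

# Simulate a node restart: a fresh instance recovers both acknowledged records.
recovered = Intake(path).replay()
print(len(recovered))  # prints 2
```

The trade-off the article describes is visible here: the fsync on the write path adds latency to every ack, which is exactly what the original low-latency design avoided, at the cost of losing in-memory data when a node failed.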