Datadog’s Continuous Profiler timeline view addresses the limitations of traditional aggregate profiling by providing temporal context for resource consumption. It allows developers to visualize how CPU usage, memory allocation, and thread activity evolve over time, making it easier to pinpoint transient performance regressions that are often masked by averages. By correlating execution patterns with specific time windows, teams can move beyond static flame graphs to understand the root causes of latency spikes and resource contention in live environments.

### Moving Beyond Aggregate Profiling

* Traditional flame graphs aggregate data over a period, which can hide short-lived performance issues or intermittent stalls that do not significantly impact the overall average.
* The timeline view introduces a chronological dimension, mapping stack traces to specific timestamps to show exactly when resource-intensive operations occurred.
* This temporal granularity is essential for identifying "noisy neighbors" or periodic background tasks, such as scheduled jobs or cache invalidations, that disrupt request processing.

### Visualizing Thread Activity and Runtime Contention

* The tool visualizes individual thread states, distinguishing between active CPU execution, waiting on locks, and I/O operations.
* Developers can identify stop-the-world garbage collection events or thread starvation by observing gaps in execution or excessive synchronization overhead within the timeline.
* Specific metrics, including lock wait time and file/socket I/O, are overlaid on the timeline to provide a comprehensive view of how code interacts with the underlying runtime and hardware.

### Correlating Profiles with Distributed Traces

* Integration between profiling and tracing allows users to pivot from a slow span in a distributed trace directly to the corresponding timeline view of the execution thread.
* This correlation helps explain "unaccounted for" time in traces—such as time spent waiting for a CPU core or being blocked by a mutex—that traditional tracing cannot capture.
* Filtering capabilities allow teams to isolate performance regressions by service, version, or environment, facilitating faster root-cause analysis during post-mortems.

To optimize production performance effectively, teams should incorporate timeline analysis into their standard debugging workflow for latency spikes rather than relying solely on aggregate metrics. By combining chronological thread analysis with distributed tracing, developers can resolve complex concurrency issues and "tail latency" problems that aggregate profiling often overlooks.
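The "gaps in execution" heuristic described above can be sketched in a few lines. This is a generic illustration, not Datadog's implementation: given the timestamps at which a thread was sampled, any silence longer than a threshold is a candidate stall (for example, a stop-the-world GC pause). The sample data and function name are hypothetical.

```python
def find_execution_gaps(sample_ts, min_gap):
    """Return (start, end) windows in which no profiler samples landed.

    sample_ts: sorted timestamps (seconds) at which a thread was sampled
    min_gap:   smallest silence worth flagging, e.g. a suspected GC pause
    """
    gaps = []
    for prev, cur in zip(sample_ts, sample_ts[1:]):
        if cur - prev >= min_gap:
            gaps.append((prev, cur))
    return gaps

# A thread sampled every ~10 ms, with one suspicious 200 ms silence:
samples = [0.00, 0.01, 0.02, 0.03, 0.23, 0.24]
print(find_execution_gaps(samples, min_gap=0.1))  # [(0.03, 0.23)]
```

A real profiler works on per-thread sample streams, but the core signal is the same: the absence of samples is itself evidence of a runtime-level stall.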
Datadog’s Continuous Profiler timeline view offers a granular look at application performance by mapping code execution directly to a temporal axis. This allows engineers to move beyond aggregate flame graphs to understand exactly when and why specific bottlenecks occur during a request’s lifecycle. By correlating traces with detailed profile data, teams can effectively isolate the root causes of latency spikes and resource exhaustion in live production environments.

### Bridging the Gap Between Tracing and Profiling

* While distributed tracing identifies which service or span is slow, profiling explains the "why" by showing execution at the method and line level.
* The timeline view integrates profile data with specific trace spans, allowing users to zoom into the exact millisecond a performance degradation began.
* By toggling between CPU time and wall time, developers can distinguish between active computation and passive waiting, providing a clearer picture of thread state.

### Visualizing CPU-Bound Inefficiencies

* The tool identifies "hot" methods that consume excessive CPU cycles, such as inefficient regular expressions, heavy JSON serialization, or intensive cryptographic operations.
* It detects transient CPU spikes that might be averaged out or hidden in traditional 60-second aggregate profiles.
* Engineers can correlate CPU usage with specific threads to identify background tasks or "noisy neighbor" processes that impact the responsiveness of the main application logic.

### Diagnosing Wall Time and Runtime Overhead

* Wall time analysis reveals where threads are blocked by external factors like I/O operations, database wait times, or mutex lock contention.
* The view surfaces runtime-specific issues such as garbage collection (GC) pauses and safepoint intervals that halt execution across the entire virtual machine.
* This visibility is critical for troubleshooting synchronization issues where a thread is idle and waiting for a resource, a scenario that often causes high latency without showing up in CPU-only profiles.

To maintain high availability and performance, organizations should integrate continuous profiling into their standard troubleshooting workflows, enabling a seamless transition from detecting a slow trace to identifying the offending line of code or runtime event.
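The CPU-time versus wall-time distinction above can be demonstrated with nothing but the standard library. This is a generic sketch, not Datadog's profiler: `time.perf_counter` measures elapsed wall time, while `time.process_time` counts only the CPU time the process actually consumed.

```python
import time

def measure(fn):
    """Return (wall_seconds, cpu_seconds) spent inside fn()."""
    w0, c0 = time.perf_counter(), time.process_time()
    fn()
    return time.perf_counter() - w0, time.process_time() - c0

def blocked():            # passive waiting: wall time high, CPU time near zero
    time.sleep(0.2)

def busy():               # active computation: CPU time tracks wall time
    x = 0
    for i in range(10**6):
        x += i

wall, cpu = measure(blocked)
print(f"blocked: wall={wall:.2f}s cpu={cpu:.2f}s")  # cpu stays near zero
wall, cpu = measure(busy)
print(f"busy:    wall={wall:.2f}s cpu={cpu:.2f}s")  # cpu close to wall
```

A thread that shows large wall time with little CPU time is waiting (I/O, locks, database calls); one where the two track each other is genuinely compute-bound. That is exactly the toggle the timeline view exposes.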
The March 2023 Datadog outage was triggered by a simultaneous, global failure across multiple cloud providers and regions, caused by an unexpected interaction between a systemd security patch and Ubuntu 22.04’s default networking behavior. While Datadog typically employs rigorous, staged rollouts for infrastructure changes, the automated nature of OS-level security updates bypassed these controls. The incident highlights the hidden risks in system-level defaults and the potential for "unattended upgrades" to create synchronized failures across supposedly isolated environments.

## The systemd-networkd Routing Change

* In December 2020, a change merged into systemd (shipped in version 248) made `systemd-networkd` flush all IP routing policy rules it does not recognize upon startup.
* Version 249 introduced the `ManageForeignRoutingPolicyRules` setting, which defaults to "yes," confirming this management behavior for any rules not explicitly defined in systemd configuration files.
* The setting was backported to earlier versions (v247 and v248) but was notably absent from v245, the version used in Ubuntu 20.04.

## Dormant Risks in the Ubuntu 22.04 Migration

* Datadog began migrating its fleet from Ubuntu 20.04 to 22.04 in late 2022, eventually reaching 90% coverage across its infrastructure.
* Ubuntu 22.04 uses systemd v249, meaning the majority of the fleet was susceptible to the routing-rule flushing behavior.
* The risk remained dormant during the initial rollout because `systemd-networkd` typically only starts during the initial boot sequence, before any complex routing rules have been established.

## The Trigger: Unattended Upgrades and the CVE Patch

* On March 7, 2023, a security patch for a systemd CVE was released to the Ubuntu security repositories.
* Datadog’s fleet used the Ubuntu default configuration for `unattended-upgrades`, which automatically installs security-labeled patches once a day, typically between 06:00 and 07:00 UTC.
* The installation of the patch forced a restart of the `systemd-networkd` service on active, running nodes.
* Upon restarting, the service identified existing IP routing rules (crucial for container networking) as "foreign" and deleted them, effectively severing network connectivity for the nodes.

## Failure of Regional Isolation

* Because the security patch was released globally and the automated upgrade window was synchronized across regions, the failure occurred nearly simultaneously worldwide.
* This automation bypassed Datadog’s standard practice of "baking" changes in staging and experimental clusters for weeks before proceeding to production.
* Nodes still on Ubuntu 20.04 (systemd v245) were unaffected by the patch, as that version of systemd does not flush IP rules on a service restart.

To mitigate similar risks, infrastructure teams should consider explicitly disabling the management of foreign routing rules in systemd-networkd configuration when using third-party networking plugins. Furthermore, while automated security patching is a best practice, organizations must balance the speed of patching with the need for controlled, staged rollouts to prevent global configuration drift or synchronized failures.
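The mitigation recommended above maps to settings in `/etc/systemd/networkd.conf`. The snippet below is a sketch based on the systemd-networkd documentation; verify the option names against the manual page for your systemd version before deploying:

```ini
# /etc/systemd/networkd.conf
[Network]
# Leave routes and routing policy rules installed by other tools
# (e.g. container networking/CNI plugins) untouched when
# systemd-networkd starts or restarts.
ManageForeignRoutes=no
ManageForeignRoutingPolicyRules=no
```

With these set to `no`, a restart of `systemd-networkd` (such as one triggered by an unattended security patch) no longer flushes rules it did not create.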
Datadog replaced its complex, dashboard-specific scheduling system with a generalized, modular query and render scheduler to improve performance across all its web applications. By simplifying query heuristics and leveraging the Browser Scheduling API for renders, the engineering team achieved a more stable backend load and smoother UI interactions. This transition transformed a brittle set of rules into a scalable framework that optimizes resource utilization based on widget visibility and browser availability.

## Limitations of Legacy Scheduling

The original scheduling system was a complex web of over 20 interlinked heuristics that became difficult for developers to maintain or reason about. While it performed better than an unscheduled baseline, it suffered from several structural flaws:

* **Tight Coupling:** Query and render logic were unnecessarily linked; for example, fetches were sometimes delayed based on pending render tasks, even when throttling fetches wasn’t necessary.
* **Lack of Generalization:** The system was hardcoded specifically for dashboards, making it impossible to use the same optimization benefits for other widget-heavy products in the Datadog suite.
* **Inefficient Resource Management:** Renders were often delayed based on arbitrary data-size rules rather than the actual real-time availability of the browser's CPU and memory resources.

## A Simplified Query Algorithm

To create a more predictable and efficient system, the team stripped away redundant rules—such as manual throttling for unfocused tabs, which modern browsers already handle—and moved to a streamlined query model. The new algorithm is governed by only six parameters:

* **Visibility Priority:** Fetches for widgets currently visible in the viewport are executed immediately to ensure a responsive user experience.
* **Fixed Time Windows:** Non-visible queries are ranked by enqueue time and processed in 2,000 ms windows with a limit of 10 tasks per window.
* **Error Reduction:** The more stable distribution of tasks significantly reduced "429 (Too Many Requests)" errors, leading to faster overall data loading since fewer retries are required.
* **Framework Integration:** This simplified logic was moved into a standard data-fetching framework, allowing any Datadog product using generalized components to benefit from the scheduler.

## Render Scheduling with the Browser Scheduling API

While the query scheduler handles data fetching, a separate render scheduler manages the impact on the browser’s main thread. By moving away from legacy heuristics and adopting the Browser Scheduling API, Datadog can now schedule tasks based on native browser priorities:

* **Prioritization:** The API allows developers to categorize tasks as `user-blocking`, `user-visible`, or `background`, ensuring the browser prioritizes critical UI updates while deferring heavy computations to idle periods.
* **Resource Awareness:** Unlike the old system, this API is natively aware of CPU and memory pressure, allowing the browser to manage execution timing more effectively than a JavaScript-based heuristic.
* **Future-Proofing:** Currently supported in Chromium and Firefox Nightly (with polyfills for others), this approach allows for mass updates to task priorities and the ability to abort stale tasks via `TaskController`.

Standardizing on a modular scheduling architecture allows engineering teams to optimize both network traffic and main-thread performance without the maintenance overhead of complex, custom rule sets. For high-density data applications, leveraging native browser APIs for task prioritization is recommended to ensure smooth rendering across varying hardware capabilities.
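The visible-first, fixed-window query policy described above can be sketched as follows. This is a language-agnostic illustration written in Python, not Datadog's frontend code; only the 2,000 ms window and the 10-task cap come from the article, everything else (names, shape) is assumed:

```python
from collections import deque

WINDOW_MS = 2000        # fixed dispatch window, from the article
TASKS_PER_WINDOW = 10   # cap on non-visible fetches per window, from the article

class QueryScheduler:
    """Visible widgets fetch immediately; the rest drain in fixed windows."""

    def __init__(self):
        self.background = deque()   # non-visible fetches, ordered by enqueue time

    def enqueue(self, fetch, visible):
        if visible:
            return fetch()          # viewport widgets are never queued
        self.background.append(fetch)

    def tick(self):
        """Called every WINDOW_MS: run at most TASKS_PER_WINDOW queued fetches."""
        batch = [self.background.popleft()
                 for _ in range(min(TASKS_PER_WINDOW, len(self.background)))]
        return [fetch() for fetch in batch]

sched = QueryScheduler()
for i in range(25):                 # 25 off-screen widgets request data
    sched.enqueue(lambda i=i: i, visible=False)
print(len(sched.tick()), len(sched.tick()), len(sched.tick()))  # 10 10 5
```

Because each window dispatches a bounded batch, backend load stays flat regardless of how many widgets a dashboard contains, which is what drove the reduction in 429 responses.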
Datadog’s container observability team significantly improved the performance of kube-state-metrics (KSM) by contributing core architectural enhancements to the upstream open-source project. Faced with scalability bottlenecks where metrics collection for large clusters took tens of seconds and generated massive data payloads, they revamped the underlying library to achieve a 15x improvement in processing duration. These contributions allowed for high-granularity monitoring at scale, ensuring that the Datadog Agent can efficiently handle millions of metrics across thousands of Kubernetes nodes.

### Challenges with KSM Scalability

* KSM uses the informer pattern to expose cluster-level metadata in the OpenMetrics format, but the volume of data grows steeply with cluster size.
* In high-scale environments, a single node generates approximately nine metrics, while a single pod can generate up to 40 metrics.
* In clusters with thousands of nodes and tens of thousands of pods, the `/metrics` endpoint produced payloads weighing tens of megabytes.
* The time required to crawl these metrics often exceeded 15 seconds, forcing administrators to reduce check frequency and sacrifice real-time data granularity.

### Limitations of Legacy Implementations

* KSM v1 relied on a monolithic loop that instantiated a Builder to track resources via stores, but it lacked efficient hooks for metric generation.
* The original Python-based Datadog Agent check struggled with the "data dump" approach of KSM, where all metrics were processed at once at query time.
* To manage the load, Datadog was forced to split KSM into multiple deployments based on resource types (e.g., separate deployments for pods, nodes, and secondary resources like services or deployments).
* This fragmentation made the infrastructure more complex to manage and did not solve the fundamental issue of inefficient metric serialization.

### Architectural Improvements in KSM v2.0

* Datadog collaborated with the upstream community during the development of KSM v2.0 to introduce a more extensible design.
* The team focused on improving the Builder and metric generation hooks to prevent the system from dumping the entire dataset at query time.
* By moving away from the restrictive v1 library structure, they enabled more efficient reconciliation of metric names and metadata joins.
* The resulting 15x performance gain allows the Datadog Agent to reconcile labels and tags—such as joining deployment labels to specific metrics—without the significant latency overhead previously experienced.

Contributing back to the open-source community proved more effective than maintaining internal forks for scaling Kubernetes infrastructure. Organizations running high-density clusters should prioritize upgrading to KSM v2.0 and optimizing their agent configurations to leverage these architectural improvements for better observability performance.
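A back-of-the-envelope check of the scale numbers cited above: the per-object metric counts come from the article, while the cluster shape (2,000 nodes, 20,000 pods) and the ~50 bytes per OpenMetrics exposition line are assumptions for illustration.

```python
METRICS_PER_NODE = 9     # from the article
METRICS_PER_POD = 40     # upper bound from the article

nodes, pods = 2_000, 20_000   # hypothetical large cluster
bytes_per_line = 50           # rough OpenMetrics text-line size (assumed)

total_metrics = nodes * METRICS_PER_NODE + pods * METRICS_PER_POD
payload_mb = total_metrics * bytes_per_line / 1_000_000

print(f"{total_metrics:,} metric lines, roughly {payload_mb:.0f} MB per scrape")
# 818,000 metric lines, roughly 41 MB per scrape
```

Even with conservative assumptions the single `/metrics` endpoint serves hundreds of thousands of lines per scrape, which is consistent with the tens-of-megabytes payloads and 15-second crawl times the team observed.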
The updated GitLab Security Dashboard addresses the challenge of vulnerability overload by shifting the focus from simple detection to contextual remediation and risk management. By providing integrated trend tracking and sophisticated risk scoring, the platform enables security and development teams to prioritize high-risk projects and measure the actual progress of their security programs. This update transforms raw security data into actionable insights that are tracked directly within the existing DevSecOps workflow.

## Transitioning from Detection to Remediation Context

* Consolidates vulnerability data into a single view that spans projects, groups, and entire business units to eliminate data silos.
* Introduced initial time-based tracking in version 18.6, with version 18.9 adding expanded filters for severity, status, scanner type, and project.
* Provides visualizations for remediation velocity and vulnerability age distribution, moving beyond static raw counts to show how quickly threats are being addressed.

## Data-Driven Prioritization with Risk Scoring

* Utilizes a dynamic risk score calculated from multiple factors, including vulnerability age and repository security posture.
* Integrates external threat intelligence such as Exploit Prediction Scoring System (EPSS) scores and Known Exploited Vulnerabilities (KEV) catalog data to identify the most critical threats.
* Allows teams to monitor risk scores over time to pinpoint specific areas of the infrastructure that require additional resources or immediate intervention.

## Strategic Impact for Security and Development Teams

* Enables security leaders to prove program effectiveness to executives by showing downward trends in Common Weakness Enumeration (CWE) types and shrinking backlogs.
* Streamlines the developer experience by highlighting critical vulnerabilities within active projects, removing the need for external spreadsheets or manual reporting tools.
* Identifies specific teams or departments that may require additional remediation training based on their ability to meet company security policies.

Organizations should leverage these updated dashboard features to transition from manual, reactive security tracking to an automated, risk-based posture. By integrating EPSS and KEV data into daily workflows, teams can ensure they are solving the most dangerous vulnerabilities first while maintaining a clear, measurable record of their security improvements.
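GitLab does not publish its exact formula, but a score combining the factors named above (severity, EPSS probability, KEV membership, age) might look like the following. Everything here is a hypothetical sketch; the weights are illustrative, not GitLab's.

```python
def risk_score(severity, epss, in_kev, age_days):
    """Hypothetical 0-100 risk score; weights are illustrative, not GitLab's.

    severity: CVSS base score (0-10)
    epss:     EPSS exploit probability (0-1)
    in_kev:   True if listed in CISA's KEV catalog
    age_days: days since the finding was first detected
    """
    score = severity * 4             # up to 40 points from severity
    score += epss * 30               # up to 30 points from exploit likelihood
    score += 20 if in_kev else 0     # known-exploited flaws jump the queue
    score += min(age_days / 30, 10)  # up to 10 points for languishing findings
    return round(min(score, 100), 1)

# A known-exploited critical outranks a fresh high-severity finding:
print(risk_score(severity=9.8, epss=0.97, in_kev=True, age_days=45))   # 89.8
print(risk_score(severity=8.0, epss=0.05, in_kev=False, age_days=3))   # 33.6
```

The design point is that exploitability signals (EPSS, KEV) move findings up the queue even when raw severity is similar, which is exactly the prioritization shift the dashboard encourages.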
Hardening eBPF for runtime security: Lessons from Datadog Workload Protection
Scaling real-time file monitoring across high-traffic environments requires a strategy to process billions of kernel events without exhausting system resources. By leveraging eBPF, organizations can move filtering logic directly into the Linux kernel, drastically reducing the overhead associated with traditional userspace monitoring tools. This approach enables precise observability of file system activity while maintaining the performance necessary for large-scale production workloads.

### Limitations of Traditional Monitoring Tools

* Conventional tools like `auditd` often struggle with performance bottlenecks because they require every event to be copied from the kernel to userspace for evaluation.
* Standard APIs like `fanotify` and `inotify` lack the granularity needed for complex filtering, often resulting in "event storms" during high I/O operations.
* The high frequency of context switching between kernel and userspace when processing billions of events per minute can lead to significant CPU spikes and system instability.

### Architecture of eBPF-Based File Monitoring

* The system hooks into the Virtual File System (VFS) layer using kprobes and tracepoints to capture actions such as `vfs_read`, `vfs_write`, and `vfs_open`.
* LSM (Linux Security Module) hooks are used for security-focused monitoring, providing a stable interface that is less prone to kernel-version changes than raw kprobes.
* By executing restricted C code within the kernel’s sandboxed, verified environment, the system can inspect file paths and process IDs (PIDs) the moment an event is created.

### In-Kernel Filtering and Data Management

* High-performance eBPF maps, specifically `BPF_MAP_TYPE_HASH` and `BPF_MAP_TYPE_LPM_TRIE`, are used to store allowlists and denylists for specific directories and file extensions.
* The system implements prefix matching to ignore high-volume, low-value paths like `/proc`, `/sys`, or temporary build directories, discarding these events before they ever leave the kernel.
* To minimize memory contention, per-CPU maps are employed, allowing the eBPF programs to aggregate data locally on each core without the need for expensive global locks.

### Efficient Data Transmission with Ring Buffers

* The implementation uses `BPF_MAP_TYPE_RINGBUF` rather than the older `BPF_MAP_TYPE_PERF_EVENT_ARRAY` to transfer data to userspace.
* Ring buffers provide a shared memory space between the kernel and userspace, offering better memory efficiency and guaranteed event ordering.
* By pushing only "filtered" events—a tiny fraction of the billions of raw kernel events—the system prevents userspace consumers from becoming overwhelmed.

For organizations operating at massive scale, moving from reactive userspace logging to proactive kernel-level filtering is essential. Implementing an eBPF-based monitoring stack allows for deep visibility into file system changes with minimal performance impact, making it the recommended standard for modern, high-throughput cloud environments.
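The prefix-based drop logic described above can be illustrated in userspace. The snippet below is a Python stand-in for the lookup an eBPF program would perform against an LPM-trie map; the denylist entries are hypothetical examples, not a recommended production configuration.

```python
# Userspace illustration of the in-kernel prefix filter: a longest-prefix-
# match lookup lets the eBPF program drop events for noisy paths before
# they ever reach the ring buffer and the userspace consumer.
DENY_PREFIXES = ("/proc/", "/sys/", "/tmp/build/")   # hypothetical denylist

def keep_event(path):
    """Return True if a file event should be forwarded to userspace."""
    return not any(path.startswith(prefix) for prefix in DENY_PREFIXES)

events = [
    "/proc/1234/stat",          # dropped: procfs churn
    "/etc/passwd",              # kept: security-relevant target
    "/tmp/build/obj/main.o",    # dropped: compiler scratch space
    "/var/www/html/index.php",  # kept
]
forwarded = [e for e in events if keep_event(e)]
print(forwarded)  # ['/etc/passwd', '/var/www/html/index.php']
```

In the real system this check runs inside the kernel on every VFS event, so the cost of the billions of dropped events is a map lookup rather than a copy to userspace plus a context switch.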
GitLab 18.9 introduces critical updates designed to provide regulated enterprises with governed, agentic AI capabilities through self-hosted infrastructure and model flexibility. By combining the Duo Agent Platform with Bring Your Own Model (BYOM) support, organizations in sectors like finance and government can now automate complex DevSecOps workflows while maintaining total control over data residency. This release transforms GitLab into a high-security AI control plane that balances the need for advanced automation with the rigid sovereignty requirements of high-compliance environments.

## Self-Hosted Duo Agent Platform for Online Cloud Licenses

The Duo Agent Platform allows engineering teams to automate sequences of tasks, such as hardening CI/CD pipelines and triaging vulnerabilities, but was previously difficult to deploy for customers under strict online cloud licensing. This update makes the platform generally available for these environments, bridging the gap between cloud-based licensing and self-hosted security needs.

* **Usage-Based Billing:** The platform now utilizes GitLab Credits to provide transparent, per-request metering, which is essential for internal chargeback and regulatory reporting.
* **Infrastructure Control:** Enterprises can host models on their own internal infrastructure or within approved cloud environments, ensuring that inference traffic is routed according to internal security policies.
* **Deployment Readiness:** By removing the requirement to route data through external AI vendors, the platform is now a viable option for critical infrastructure and government agencies.

## Bring Your Own Model (BYOM) Integration

Recognizing that many enterprises have already invested in domain-tuned LLMs or air-gapped deployments, GitLab now allows customers to integrate their existing models directly into the Duo Agent Platform. This ensures that organizations are not locked into a specific vendor and can leverage models that have already passed internal risk assessments.

* **AI Gateway Connectivity:** Administrators can connect third-party or internal models via the GitLab AI Gateway, allowing these models to function as enterprise-ready options within the GitLab ecosystem.
* **Granular Model Mapping:** The system provides the ability to map specific models to individual Duo Agent Platform flows or features, giving admins fine-grained control over which agent uses which model.
* **Administrative Ownership:** While GitLab provides the orchestration layer, administrators retain full responsibility for model validation, performance tuning, and risk evaluation for the models they choose to bring.

For organizations operating in high-compliance sectors, these updates offer a path to consolidate fragmented AI tools into a single, governed platform. Engineering leaders should evaluate their current model investments and leverage the GitLab AI Gateway to unify their automation workflows under one secure DevSecOps umbrella.