performance-optimization - Datadog

datadog Apr 21, 2026

Steganography at scale: Embedding share URLs in Datadog widget screenshots

Bartol Rebernjak, Galen Pickard

Sharing a visualization from a Datadog dashboard or notebook is simple. You can copy and paste a widget directly into another dashboard, message, or document. Behind the scenes, the widget is stored with a unique sharing URL, which the Datadog app…

performance-optimization datadog redis watermarking+3

datadog Nov 1, 2024

How we built a Ruby library that saves 50% in testing time

Andrey Marchenko, Gillian McGarvey

Do you know that feeling when the coding is done and the pull request is approved, and you only need a green pipeline for the merge to be complete? Then the dreaded sequence of events occurs: running your test suite takes 20+ minutes and then fa…

performance-optimization datadog cloud-computing continuous-integration+4

datadog Nov 1, 2024

How we built a Ruby library that saves 50% in testing time | Datadog

Lengthy CI pipelines and flaky tests often hinder developer productivity by causing unnecessary wait times and costly infrastructure usage. To address this, Datadog developed a Ruby test impact analysis library that dynamically maps tests to specific source files, allowing the CI runner to skip tests unrelated to the latest code changes. By moving beyond standard coverage tools and utilizing low-level Ruby VM interpreter events, this solution significantly reduces testing time while maintaining high performance and correctness.

## The Strategy of Test Impact Analysis

* Lengthy CI pipelines (often exceeding 20 minutes) increase the likelihood of intermittent "flaky" failures that are unrelated to current code changes.
* While parallelization can reduce time, it increases cloud computing costs and does not mitigate the flakiness of irrelevant tests.
* Test impact analysis generates a dynamic map between each test and the source files executed during its run; if a commit doesn't touch those files, the test is safely skipped.
* Success depends on three pillars: correctness (never skipping a necessary test), performance (low overhead), and seamlessness (no required code changes for the user).

## Limitations of Standard Coverage Tools

* Ruby’s built-in `Coverage` module (enhanced in version 3.1 with `resume`/`suspend` methods) proved incompatible with existing total code coverage tools like `simplecov`.
* Initial prototypes using the `Coverage` module showed a performance overhead of 300%, making the test suite four times slower.
* The `TracePoint` API was also evaluated as an alternative to spy on code execution via the `line` event, but it still produced a significant median overhead of 200% to 400%.
* Benchmarks were conducted using the `rubocop` test suite (a "hard mode" scenario with 20,000+ tests) to ensure the tool could handle high-sensitivity environments.
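To make the `TracePoint` approach mentioned above concrete, here is a minimal sketch of per-test file collection using the `line` event. It is an illustration only, not Datadog's implementation: the class name `FileTracer` and its `trace` method are hypothetical, and this is the high-level variant that the summary reports as too slow for production use.

```ruby
require "set"

# Hypothetical helper (not part of any Datadog library): records which
# source files under a project root execute while a block runs.
class FileTracer
  def initialize(root)
    # Trailing separator so "/app" does not also match "/app-other".
    @root = File.expand_path(root) + File::SEPARATOR
  end

  # Runs the block with a :line TracePoint enabled and returns the set
  # of executed file paths located inside the project root.
  def trace
    files = Set.new
    tracepoint = TracePoint.new(:line) do |event|
      path = event.path
      files << path if path && path.start_with?(@root)
    end

    tracepoint.enable
    begin
      yield
    ensure
      tracepoint.disable
    end
    files
  end
end

# Example: collect the files touched by one "test" body.
executed = FileTracer.new(Dir.pwd).trace { [1, 2, 3].sum }
puts "files executed under project root: #{executed.size}"
```

Filtering by the project root mirrors the behavior described in the summary, where only files inside the project are stored. Because the hook fires on every executed line, the overhead grows with the amount of code each test runs, which is consistent with the slowdown reported for this approach.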
## Implementing a Custom C Extension

* To bypass the limitations of high-level APIs, developers utilized Ruby’s C extension capabilities to hook directly into the Virtual Machine.
* The library uses `rb_add_event_hook2` and `rb_thread_add_event_hook` to subscribe to the `RUBY_EVENT_LINE` event at the interpreter level.
* The implementation involves a C-based `dd_cov_start` function that triggers when a test begins and a `dd_cov_stop` function to collect the results.
* During execution, the tool uses `rb_sourcefile()` to identify the current file and stores it in a Ruby hash only if the file is located within the project’s root directory.

For engineering teams struggling with bloated CI pipelines, adopting test impact analysis is a highly effective way to optimize resources. By utilizing tools like Datadog’s Intelligent Test Runner, which leverages low-level VM events for minimal overhead, teams can cut their testing time in half without sacrificing the reliability of their master branch.
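Once a map from tests to executed files exists, the skip decision itself is a set intersection. The following is a rough sketch under stated assumptions: the method name `tests_to_run`, the shape of `impact_map`, and the sample file names are all hypothetical and do not come from the library described above.

```ruby
require "set"

# Hypothetical selection step: given a map of test name => Set of source
# files that test executed, return the tests affected by a commit.
def tests_to_run(impact_map, changed_files, all_tests)
  changed = changed_files.to_set
  all_tests.select do |test|
    covered = impact_map[test]
    # A test with no recorded map always runs, so a necessary test is
    # never skipped just because coverage data is missing.
    covered.nil? || covered.intersect?(changed)
  end
end

# Illustrative data only.
impact_map = {
  "user_spec"  => Set["lib/user.rb", "lib/auth.rb"],
  "order_spec" => Set["lib/order.rb"]
}
all_tests = ["user_spec", "order_spec", "new_spec"]

selected = tests_to_run(impact_map, ["lib/auth.rb"], all_tests)
puts selected.inspect
```

Running unknown tests by default trades a little speed for the correctness pillar named in the summary: skipping is only allowed when the map positively shows that a test does not touch the changed files.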

performance-optimization datadog continuous-integration test-impact-analysis+4

datadog May 20, 2024

.NET Continuous Profiler: Memory usage

Christophe Nasarre

In Part 1 of this series, I presented a high-level overview of the architecture, implementation, and initialization of Datadog's .NET profiler, which consists of several individual profilers that collect data for particular resources. I went on to discuss prof…

performance-optimization datadog garbage-collection dotnet+4

datadog Apr 18, 2024

How we brought Datadog's data visualization to iOS: A focus on performance

Yassir Ramdani, Austin Lai

At Datadog, we’ve been using SwiftUI since day one. We went from initially using it for prototyping and building internal tools, to adopting it in small features, then to building full products! In 2022, we introduced APM Services with its rich data vis…

performance-optimization data-visualization mobile-app-development swiftui+3

datadog Feb 13, 2024

.NET Continuous Profiler: CPU and wall time profiling | Datadog

Datadog’s Continuous Profiler timeline view offers a granular look at application performance by mapping code execution directly to a temporal axis. This allows engineers to move beyond aggregate flame graphs to understand exactly when and why specific bottlenecks occur during a request’s lifecycle. By correlating traces with detailed profile data, teams can effectively isolate the root causes of latency spikes and resource exhaustion in live production environments.

### Bridging the Gap Between Tracing and Profiling

* While distributed tracing identifies which service or span is slow, profiling explains the "why" by showing execution at the method and line level.
* The timeline view integrates profile data with specific trace spans, allowing users to zoom into the exact millisecond a performance degradation began.
* By toggling between CPU time and wall time, developers can distinguish between active computation and passive waiting, providing a clearer picture of thread state.

### Visualizing CPU-Bound Inefficiencies

* The tool identifies "hot" methods that consume excessive CPU cycles, such as inefficient regular expressions, heavy JSON serialization, or intensive cryptographic operations.
* It detects transient CPU spikes that might be averaged out or hidden in traditional 60-second aggregate profiles.
* Engineers can correlate CPU usage with specific threads to identify background tasks or "noisy neighbor" processes that impact the responsiveness of the main application logic.

### Diagnosing Wall Time and Runtime Overhead

* Wall time analysis reveals where threads are blocked by external factors like I/O operations, database wait times, or mutex lock contention.
* The view surfaces runtime-specific issues such as Garbage Collection (GC) pauses and Safepoint intervals that halt execution across the entire virtual machine.
* This visibility is critical for troubleshooting synchronization issues where a thread is idle and waiting for a resource, a scenario that often causes high latency without showing up in CPU-only profiles.

To maintain high availability and performance, organizations should integrate continuous profiling into their standard troubleshooting workflows, enabling a seamless transition from detecting a slow trace to identifying the offending line of code or runtime event.
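The difference between CPU time and wall time can be seen with a few lines of code. The sketch below is written in Ruby purely for illustration (the profiler in this summary targets .NET), and the helper name `measure` is hypothetical. It reads a monotonic clock for wall time and the process CPU clock for CPU time around a block.

```ruby
# Hypothetical helper: reports wall time and CPU time, in seconds,
# spent while the given block runs.
def measure
  wall_start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  cpu_start  = Process.clock_gettime(Process::CLOCK_PROCESS_CPUTIME_ID)
  yield
  {
    wall: Process.clock_gettime(Process::CLOCK_MONOTONIC) - wall_start,
    cpu:  Process.clock_gettime(Process::CLOCK_PROCESS_CPUTIME_ID) - cpu_start
  }
end

# Passive waiting: wall time passes, but almost no CPU time is consumed.
waiting = measure { sleep 0.2 }

# Active computation: CPU time is consumed while the block runs.
computing = measure { 200_000.times { |i| Math.sqrt(i) } }

puts format("waiting:   wall=%.3fs cpu=%.3fs", waiting[:wall], waiting[:cpu])
puts format("computing: wall=%.3fs cpu=%.3fs", computing[:wall], computing[:cpu])
```

A block that mostly sleeps shows up clearly in the wall-time figure while contributing little CPU time, which is the kind of blocked or idle thread the summary says a CPU-only profile would miss.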

performance-optimization continuous-profiler runtime-diagnosis code-inefficiencies

datadog Apr 17, 2023

Making fetch happen: Building a general-purpose query and render scheduler

Cormac Flynn

Users expect web applications to be fast and responsive, with smooth scrolling and almost instantaneous rendering. Combining complex UI interactions with frequent data fetching, as many Datadog products do, makes optimizing for good runtime performance a challenge.…

performance-optimization database-design query-scheduling browser-scheduling-api+4

datadog Jan 31, 2023

Performance improvements in the Datadog Agent metrics pipeline

Remy Mathieu

Our aspiration for the Datadog Agent is for it to process the maximum amount of data, very quickly, with as low of a CPU as possible. Striking this balance between performance and efficiency is an ongoing challenge for us. We are constantly searching for ways to opt…

performance-optimization go cpu-profiling datadog-agent+3

datadog Apr 24, 2018

Using Datadog APM to improve the performance of Homebrew

Andrew Robert McBurney

As a Software Engineering Intern on the open source team at Datadog, I contribute to various open source projects using Datadog tools to track down bugs and performance related bottlenecks. One of the tools I use is Application Performance Monitoring (APM)…

performance-optimization datadog caching ruby+4