Datadog / performance-optimization

9 posts

datadog

How we built a Ruby library that saves 50% in testing time | Datadog (opens in new tab)

Lengthy CI pipelines and flaky tests often hinder developer productivity by causing unnecessary wait times and costly infrastructure usage. To address this, Datadog developed a Ruby test impact analysis library that dynamically maps tests to specific source files, allowing the CI runner to skip tests unrelated to the latest code changes. By moving beyond standard coverage tools and utilizing low-level Ruby VM interpreter events, this solution significantly reduces testing time while maintaining high performance and correctness. ## The Strategy of Test Impact Analysis * Lengthy CI pipelines (often exceeding 20 minutes) increase the likelihood of intermittent "flaky" failures that are unrelated to current code changes. * While parallelization can reduce time, it increases cloud computing costs and does not mitigate the flakiness of irrelevant tests. * Test impact analysis generates a dynamic map between each test and the source files executed during its run; if a commit doesn't touch those files, the test is safely skipped. * Success depends on three pillars: correctness (never skipping a necessary test), performance (low overhead), and seamlessness (no required code changes for the user). ## Limitations of Standard Coverage Tools * Ruby’s built-in `Coverage` module (enhanced in version 3.1 with `resume`/`suspend` methods) proved incompatible with existing total code coverage tools like `simplecov`. * Initial prototypes using the `Coverage` module showed a performance overhead of 300%, making the test suite four times slower. * The `TracePoint` API was also evaluated as an alternative to spy on code execution via the `line` event, but it still produced a significant median overhead of 200% to 400%. * Benchmarks were conducted using the `rubocop` test suite—a "hard mode" scenario with 20,000+ tests—to ensure the tool could handle high-sensitivity environments. ## Implementing a Custom C Extension * To bypass the limitations of high-level APIs, developers utilized Ruby’s C extension capabilities to hook directly into the Virtual Machine. * The library uses `rb_add_event_hook2` and `rb_thread_add_event_hook` to subscribe to the `RUBY_EVENT_LINE` event at the interpreter level. * The implementation involves a C-based `dd_cov_start` function that triggers when a test begins and a `dd_cov_stop` function to collect the results. * During execution, the tool uses `rb_sourcefile()` to identify the current file and stores it in a Ruby hash only if the file is located within the project’s root directory. For engineering teams struggling with bloated CI pipelines, adopting test impact analysis is a highly effective way to optimize resources. By utilizing tools like Datadog’s Intelligent Test Runner, which leverages low-level VM events for minimal overhead, teams can cut their testing time in half without sacrificing the reliability of their master branch.

datadog

.NET Continuous Profiler: CPU and wall time profiling | Datadog (opens in new tab)

Datadog’s Continuous Profiler timeline view offers a granular look at application performance by mapping code execution directly to a temporal axis. This allows engineers to move beyond aggregate flame graphs to understand exactly when and why specific bottlenecks occur during a request’s lifecycle. By correlating traces with detailed profile data, teams can effectively isolate the root causes of latency spikes and resource exhaustion in live production environments. ### Bridging the Gap Between Tracing and Profiling * While distributed tracing identifies which service or span is slow, profiling explains the "why" by showing execution at the method and line level. * The timeline view integrates profile data with specific trace spans, allowing users to zoom into the exact millisecond a performance degradation began. * By toggling between CPU time and wall time, developers can distinguish between active computation and passive waiting, providing a clearer picture of thread state. ### Visualizing CPU-Bound Inefficiencies * The tool identifies "hot" methods that consume excessive CPU cycles, such as inefficient regular expressions, heavy JSON serialization, or intensive cryptographic operations. * It detects transient CPU spikes that might be averaged out or hidden in traditional 60-second aggregate profiles. * Engineers can correlate CPU usage with specific threads to identify background tasks or "noisy neighbor" processes that impact the responsiveness of the main application logic. ### Diagnosing Wall Time and Runtime Overhead * Wall time analysis reveals where threads are blocked by external factors like I/O operations, database wait times, or mutex lock contention. * The view surfaces runtime-specific issues such as Garbage Collection (GC) pauses and Safepoint intervals that halt execution across the entire virtual machine. * This visibility is critical for troubleshooting synchronization issues where a thread is idle and waiting for a resource, a scenario that often causes high latency without showing up in CPU-only profiles. To maintain high availability and performance, organizations should integrate continuous profiling into their standard troubleshooting workflows, enabling a seamless transition from detecting a slow trace to identifying the offending line of code or runtime event.