How we built a Ruby library that saves 50% in testing time | Datadog

Lengthy CI pipelines and flaky tests often hinder developer productivity by causing unnecessary wait times and costly infrastructure usage. To address this, Datadog developed a Ruby test impact analysis library that dynamically maps tests to specific source files, allowing the CI runner to skip tests unrelated to the latest code changes. By moving beyond standard coverage tools and utilizing low-level Ruby VM interpreter events, this solution significantly reduces testing time while maintaining high performance and correctness.

## The Strategy of Test Impact Analysis

* Lengthy CI pipelines (often exceeding 20 minutes) increase the likelihood of intermittent "flaky" failures that are unrelated to current code changes.
* While parallelization can reduce time, it increases cloud computing costs and does not mitigate the flakiness of irrelevant tests.
* Test impact analysis generates a dynamic map between each test and the source files executed during its run; if a commit doesn't touch those files, the test is safely skipped.
* Success depends on three pillars: correctness (never skipping a necessary test), performance (low overhead), and seamlessness (no required code changes for the user).

## Limitations of Standard Coverage Tools

* Ruby’s built-in `Coverage` module (enhanced in version 3.1 with `resume`/`suspend` methods) proved incompatible with existing total code coverage tools like `simplecov`.
* Initial prototypes using the `Coverage` module showed a performance overhead of 300%, making the test suite four times slower.
* The `TracePoint` API was also evaluated as an alternative to spy on code execution via the `line` event, but it still produced a significant median overhead of 200% to 400%.
* Benchmarks were conducted using the `rubocop` test suite, a "hard mode" scenario with 20,000+ tests, to ensure the tool could handle high-sensitivity environments.
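
The `TracePoint` approach evaluated above can be sketched in a few lines of plain Ruby. This is a minimal illustration under stated assumptions, not Datadog's code: the class name `LineCoverageSketch` and its `start`/`stop` methods are hypothetical, and the project-root filter mirrors the one the post describes for the final implementation.

```ruby
require "set"

# Hypothetical collector (not part of any Datadog library): records which
# files under a project root execute at least one line between start and stop.
class LineCoverageSketch
  def initialize(root)
    # Keep a trailing separator so that "/app" does not also match "/app2".
    @root = File.join(File.expand_path(root), "")
    @files = Set.new
    # The :line event fires for every executed line of Ruby code.
    @trace = TracePoint.new(:line) do |tp|
      path = tp.path
      @files << path if path && path.start_with?(@root)
    end
  end

  # Call when a test begins.
  def start
    @files = Set.new
    @trace.enable
  end

  # Call when a test ends; returns the sorted list of impacted files.
  def stop
    @trace.disable
    @files.to_a.sort
  end
end
```

A test framework hook would call `start` before each test and store the result of `stop` under that test's name. Because the block runs on every executed line, a collector like this is likely to show the kind of overhead the benchmarks above report.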
## Implementing a Custom C Extension

* To bypass the limitations of high-level APIs, developers utilized Ruby’s C extension capabilities to hook directly into the Virtual Machine.
* The library uses `rb_add_event_hook2` and `rb_thread_add_event_hook` to subscribe to the `RUBY_EVENT_LINE` event at the interpreter level.
* The implementation involves a C-based `dd_cov_start` function that triggers when a test begins and a `dd_cov_stop` function to collect the results.
* During execution, the tool uses `rb_sourcefile()` to identify the current file and stores it in a Ruby hash only if the file is located within the project’s root directory.

For engineering teams struggling with bloated CI pipelines, adopting test impact analysis is a highly effective way to optimize resources. By utilizing tools like Datadog’s Intelligent Test Runner, which leverages low-level VM events for minimal overhead, teams can cut their testing time in half without sacrificing the reliability of their master branch.
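
To make the end-to-end idea concrete: once each test has a recorded set of impacted files, the skip decision reduces to a set intersection with the files changed in a commit. The sketch below is illustrative only. The source does not describe how Datadog's selection step is implemented, so `IMPACT_MAP`, `tests_to_run`, and the test identifiers are all hypothetical. Running tests that have no recorded coverage is an assumption made here to respect the correctness pillar.

```ruby
require "set"

# Hypothetical data: test identifier => source files it executed on an earlier run.
IMPACT_MAP = {
  "spec/user_spec.rb[1]" => Set["lib/user.rb", "lib/auth.rb"],
  "spec/cart_spec.rb[1]" => Set["lib/cart.rb"]
}.freeze

# Returns the tests that must run for a given list of changed files.
# A test with no recorded coverage (for example, a new test) is always run,
# so a missing map entry can never cause a necessary test to be skipped.
def tests_to_run(all_tests, impact_map, changed_files)
  changed = changed_files.to_set
  all_tests.select do |test|
    files = impact_map[test]
    files.nil? || files.intersect?(changed)
  end
end

all_tests = ["spec/user_spec.rb[1]", "spec/cart_spec.rb[1]", "spec/new_spec.rb[1]"]
p tests_to_run(all_tests, IMPACT_MAP, ["lib/cart.rb"])
# => ["spec/cart_spec.rb[1]", "spec/new_spec.rb[1]"]
```

Here the user spec is skipped because the commit touches none of its files, while the unmapped new spec still runs.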