Bartol Rebernjak Galen Pickard Sharing a visualization from a Datadog dashboard or notebook is simple. You can copy and paste a widget directly into another dashboard, message, or document. Behind the scenes, the widget is stored with a unique sharing URL, which the Datadog app…
Benjamin Barton We shipped a feature that made perfect sense. It improved a specific type of investigation we had been testing against. Then other investigations started getting worse. Nothing crashed. No tests failed. But the overall quality of the agent had shifted, and we had…
Reilly Wood I work on Datadog's official MCP (Model Context Protocol) server, our first observability interface designed specifically for customers' AI agents. Our first version was a thin wrapper around existing APIs, the kind of thing you can build in a weekend. It worked well…
Kai Zong Khor William Yu Many Datadog products offer a live view of their telemetry, allowing you to access your data in near real time from across your infrastructure. Live views improve responsiveness, but they also introduce strict requirements on data ingestion latency and s…
Gabriel Reid Package delivery services like UPS, FedEx, and the postal service have a tough job. They contend with a never-ending stream of packages to be delivered, with expectations of prompt and reliable delivery. In building the Datadog Log Forwarding feature, we had to cont…
Gabriel Reid When you think about engineering challenges faced in handling incoming log data in Datadog, one of the first things that comes to mind is the scale: quickly and reliably parsing and processing millions of logs per second over thousands of containers. By comparison,…
AJ Stuyvenberg Jordan González Building serverless applications offers developers flexibility, scalability, and a smooth development experience. But while serverless environments automatically scale with demand, they also introduce resource constraints that can impact performanc…
Andrey Marchenko Gillian McGarvey Do you know that feeling when the coding is done and the pull request is approved, and you only need a green pipeline for the merge to be complete? Then the dreaded sequence of events occurs: running your test suite takes 20+ minutes and then fa…
Lengthy CI pipelines and flaky tests often hinder developer productivity by causing unnecessary wait times and costly infrastructure usage. To address this, Datadog developed a Ruby test impact analysis library that dynamically maps tests to specific source files, allowing the CI runner to skip tests unrelated to the latest code changes. By moving beyond standard coverage tools and utilizing low-level Ruby VM interpreter events, this solution significantly reduces testing time while maintaining high performance and correctness.
## The Strategy of Test Impact Analysis
* Lengthy CI pipelines (often exceeding 20 minutes) increase the likelihood of intermittent "flaky" failures that are unrelated to current code changes.
* While parallelization can reduce time, it increases cloud computing costs and does not mitigate the flakiness of irrelevant tests.
* Test impact analysis generates a dynamic map between each test and the source files executed during its run; if a commit doesn't touch those files, the test is safely skipped.
* Success depends on three pillars: correctness (never skipping a necessary test), performance (low overhead), and seamlessness (no required code changes for the user).
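
The skip decision described above can be sketched in a few lines of Ruby. This is a minimal illustration, not Datadog's implementation: the method name `tests_to_run`, the shape of the impact map (test name to list of file paths), and the rule that a test with no recorded map always runs are all assumptions made for the example.

```ruby
require "set"

# Hypothetical helper: decides which tests must run for a given commit.
# impact_map    - Hash of test name => Array of source files the test executed
# changed_files - Array of file paths touched by the commit
# all_tests     - Array of every test name in the suite
def tests_to_run(impact_map, changed_files, all_tests)
  changed = changed_files.to_set
  all_tests.select do |test|
    files = impact_map[test]
    # Assumption: a test with no recorded map is never skipped,
    # so correctness takes priority over speed.
    files.nil? || files.any? { |file| changed.include?(file) }
  end
end

impact_map = {
  "user_spec"  => ["lib/user.rb", "lib/auth.rb"],
  "order_spec" => ["lib/order.rb"]
}
selected = tests_to_run(impact_map, ["lib/auth.rb"], ["user_spec", "order_spec", "new_spec"])
# selected == ["user_spec", "new_spec"]; "order_spec" is skipped
```

Running unknown tests by default is one way to honor the correctness pillar: a test is skipped only when there is positive evidence that the commit cannot affect it.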
## Limitations of Standard Coverage Tools
* Ruby’s built-in `Coverage` module (enhanced in version 3.1 with `resume`/`suspend` methods) conflicted with tools that measure total code coverage, such as `simplecov`.
* Initial prototypes using the `Coverage` module showed a performance overhead of 300%, meaning the test suite took about four times as long to run.
* The `TracePoint` API was also evaluated as a way to observe code execution via the `line` event, but it still produced a significant median overhead of 200% to 400%.
* Benchmarks were conducted using the `rubocop` test suite—a "hard mode" scenario with 20,000+ tests—to ensure the tool could handle high-sensitivity environments.
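
For context, a per-test prototype built on the `Coverage` module might look roughly like the sketch below. It assumes Ruby 3.1 or later, and the helper name `files_touched` is hypothetical; the source does not show the actual prototype. Only files loaded after `Coverage.setup` are measured.

```ruby
require "coverage"

# Assumes Ruby 3.1+: prepare line coverage without starting measurement.
Coverage.setup(lines: true)

# Hypothetical helper: returns the files that had at least one line
# executed while the block was running.
def files_touched
  Coverage.result(stop: false, clear: true) # discard counts from earlier runs
  Coverage.resume
  begin
    yield
  ensure
    Coverage.suspend
  end
  result = Coverage.result(stop: false, clear: true)
  result.select { |_path, cov| cov[:lines].any? { |count| count && count > 0 } }.keys
end
```

A test runner would wrap each test body in `files_touched { ... }` and store the returned paths as that test's entry in the impact map. Because `Coverage` keeps global state, a sketch like this would compete with any other tool that also drives the module, which is consistent with the `simplecov` conflict noted above.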
## Implementing a Custom C Extension
* To bypass the limitations of these high-level APIs, the team wrote a C extension that hooks directly into the Ruby virtual machine.
* The library uses `rb_add_event_hook2` and `rb_thread_add_event_hook` to subscribe to the `RUBY_EVENT_LINE` event at the interpreter level.
* The implementation involves a C-based `dd_cov_start` function that triggers when a test begins and a `dd_cov_stop` function to collect the results.
* During execution, the tool uses `rb_sourcefile()` to identify the current file and stores it in a Ruby hash only if the file is located within the project’s root directory.
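
The C source is not reproduced here. As a rough Ruby-level analogue of the start/stop flow described above, the following sketch substitutes `TracePoint`'s `:line` event for the interpreter-level hook, so it would carry the `TracePoint` overhead rather than the lower cost of the C extension. The class name `LineFileCollector` and its methods are hypothetical.

```ruby
# Hypothetical Ruby stand-in for the start/stop collector; the real one is a C extension.
class LineFileCollector
  def initialize(root)
    # A trailing separator keeps "/app" from matching "/app_other/file.rb".
    @prefix = File.join(File.expand_path(root), "")
    @files = {}
    @trace = TracePoint.new(:line) do |tp|
      path = tp.path
      # Record the file only if it lives under the project root.
      @files[path] = true if path && path.start_with?(@prefix)
    end
  end

  # Analogous to the start step: reset state and begin listening for line events.
  def start
    @files = {}
    @trace.enable
  end

  # Analogous to the stop step: stop listening and return the collected file paths.
  def stop
    @trace.disable
    @files.keys
  end
end
```

Using a hash keyed by path means each file is stored once no matter how many of its lines execute, and the root-directory filter keeps gem and standard library files out of the impact map.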
For engineering teams struggling with bloated CI pipelines, test impact analysis is an effective way to save both time and compute. Tools like Datadog’s Intelligent Test Runner, which relies on low-level VM events to keep overhead minimal, can cut testing time in half without sacrificing the reliability of the master branch.
Tran Le Till Pieper Director, Product Management Gillian McGarvey Writing a postmortem is an essential learning process after an incident is resolved. But documenting important details comprehensively can be cumbersome, especially when responders have already moved on to the nex…
Artem Krylysov May Lee Datadog collects billions of events from millions of hosts every minute and that number keeps growing and fast. Our data volumes grew 30x between 2017 and 2022. On top of that, the kind of queries we receive from our users has changed significantly. Why? B…
Christophe Nasarre In Part 1 of this series, I presented a high-level overview of the architecture, implementation, and initialization of Datadog's .NET profiler, which consists of several individual profilers that collect data for particular resources. I went on to discuss prof…
Joe McCourt Sagar Mohite Austin Lai How do we surface the rich stories hidden within our users' observability data? We can use percentiles to communicate performance for a specific percentage of cases—but for the full shape of performance, we use distribution metrics. These metr…
Austin Lai Marie-Laure Bardonnet In this edition of the Datadog Engineering Spotlight, Austin from the Community team sat down (virtually) with Marie-Laure Bardonnet. She's a Senior Engineering Manager leading engineering for Datadog's Log Management team, and was once an intern…