Datadog’s first global outage, on March 8, 2023, served as a rigorous stress test for its established incident response framework and "you build it, you own it" philosophy. While the outage was triggered by a systemic failure during a routine systemd upgrade, the company's commitment to a blameless culture and decentralized engineering autonomy allowed hundreds of responders to coordinate a complex recovery across multiple regions. Ultimately, the event validated its investment in out-of-band monitoring and rigorous, bi-annual incident training as essential components for managing high-scale system disasters.
## Incident Response Structure and Philosophy
* Datadog employs a decentralized "you build it, you own it" model where individual engineering teams are responsible for the 24/7 health and monitoring of the services they build.
* For high-severity incidents, a specialized rotation is paged, consisting of an Incident Commander to lead the response, a communications lead, and a customer liaison to manage external messaging.
* The organization prioritizes "people over process," empowering engineers to use their judgment to find creative solutions rather than following rigid, pre-written playbooks that may not apply to unprecedented failures.
* A blameless culture is strictly maintained across all levels of the company, ensuring that post-incident investigations focus on systemic improvements rather than assigning fault to individuals.
## Multi-Layered Monitoring Strategy
* Standard telemetry provides internal visibility, but Datadog also maintains "out-of-band" monitoring that operates completely outside its own infrastructure.
* This out-of-band system interacts with Datadog APIs exactly like a customer would, ensuring that engineers are alerted even if the internal monitoring platform itself becomes unavailable; a minimal probe of this kind is sketched after this list.
* Communication is streamlined through a dedicated Slack incident app that automatically generates coordination channels, providing situational awareness to any engineer who joins the effort (see the channel-creation sketch below).
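
Conceptually, an out-of-band probe is just a small script running on infrastructure that shares nothing with the monitored platform, exercising the public API the way a customer would. A minimal sketch, assuming Python with the `requests` library and Datadog's public key-validation endpoint; the `page_oncall` helper is a hypothetical stand-in for an independent paging path:

```python
import os
import requests

# Public key-validation endpoint, reachable the same way a customer reaches it.
DD_API_URL = "https://api.datadoghq.com/api/v1/validate"

def probe_datadog_api(timeout: float = 10.0) -> bool:
    """Hit the public Datadog API from outside its infrastructure and
    report whether it responded successfully."""
    try:
        resp = requests.get(
            DD_API_URL,
            headers={"DD-API-KEY": os.environ.get("DD_API_KEY", "")},
            timeout=timeout,
        )
        return resp.status_code == 200
    except requests.RequestException:
        return False

def page_oncall(message: str) -> None:
    # Hypothetical placeholder: in practice this would go through an
    # independent paging provider, not through the monitored platform.
    print(f"PAGE: {message}")

if __name__ == "__main__":
    if not probe_datadog_api():
        page_oncall("Out-of-band probe: Datadog API is unreachable or unhealthy")
```

Because the probe and its paging path depend on nothing inside the monitored platform, an alert still fires when the platform itself is the thing that is down.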
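
Automatic channel creation of this kind is usually a thin wrapper around the Slack Web API. A minimal sketch, assuming Python with the `slack_sdk` package; the naming convention, `SLACK_BOT_TOKEN` variable, and posted summary are illustrative assumptions rather than Datadog's actual implementation:

```python
import os
from datetime import datetime, timezone

from slack_sdk import WebClient  # pip install slack_sdk

def open_incident_channel(incident_id: int, summary: str) -> str:
    """Create a dedicated coordination channel and post an initial summary
    so any engineer who joins gets immediate situational awareness."""
    client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

    # Illustrative naming convention: one channel per incident.
    name = f"incident-{incident_id}-{datetime.now(timezone.utc):%Y%m%d}"
    channel = client.conversations_create(name=name)
    channel_id = channel["channel"]["id"]

    # Seed the channel with the current summary of the incident.
    client.chat_postMessage(channel=channel_id, text=f":rotating_light: {summary}")
    return channel_id

if __name__ == "__main__":
    open_incident_channel(1234, "Global outage: Kubernetes pods failing to restart")
```

Triggering this the moment an incident is declared gives every responder a single, predictable place to join.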
## Anatomy of the March 8 Outage
* The outage began at 06:00 UTC, triggered by a systemd upgrade that caused widespread Kubernetes failures and prevented pods from restarting correctly.
* The global nature of the outage was diagnosed within 32 minutes of the initial monitoring alerts, leading to the activation of executive on-calls and the customer support management team.
* Responders identified "unattended upgrades" as the incident trigger approximately five and a half hours after the initial failure; a sketch of auditing this setting on a host follows the list.
* Recovery was executed in stages: compute capacity was restored first in the EU1 region, followed by the US1 region, with full infrastructure restoration completed by 19:00 UTC.
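
On Debian and Ubuntu hosts, unattended upgrades are driven by APT's periodic configuration, so one way responders can audit a fleet is to check whether the mechanism is enabled on each node. A minimal sketch, assuming Python on a host where `apt-config` is available; it only reports the setting and deliberately leaves remediation out:

```python
import subprocess

def unattended_upgrades_enabled() -> bool:
    """Return True if APT's periodic unattended-upgrade job is turned on.

    Reads the effective APT configuration (typically set in
    /etc/apt/apt.conf.d/20auto-upgrades) via `apt-config dump`.
    """
    out = subprocess.run(
        ["apt-config", "dump"], capture_output=True, text=True, check=True
    ).stdout
    for line in out.splitlines():
        # Relevant line looks like: APT::Periodic::Unattended-Upgrade "1";
        if line.startswith("APT::Periodic::Unattended-Upgrade"):
            return '"1"' in line
    return False

if __name__ == "__main__":
    state = "enabled" if unattended_upgrades_enabled() else "disabled"
    print(f"unattended-upgrades is {state} on this host")
```

Run across the fleet (for example via existing configuration-management tooling), this makes it quick to confirm which hosts could have picked up the triggering package automatically.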
Organizations should treat incident response as a perishable skill, kept sharp through a low threshold for declaring incidents and regular training. By combining out-of-band monitoring with a culture that empowers individual engineers to act autonomously during a crisis, teams can more effectively navigate the "not if, but when" reality of large-scale system failures.