Datadog / datadog


2023-03-08 incident: A deep dive into the platform-level recovery | Datadog

Following a massive system-wide outage in March 2023, Datadog successfully restored its EU1 region by identifying that a simple node reboot could resolve network connectivity issues caused by a faulty system patch. While the team managed to restore 100 percent of compute capacity within hours, the recovery effort was subsequently hindered by cloud provider infrastructure limits and IP address exhaustion. This post-mortem highlights the complexities of scaling hierarchical Kubernetes environments under extreme pressure and the importance of accounting for "black swan" capacity requirements.

## Hierarchical Kubernetes Recovery

Datadog utilizes a strict hierarchy of Kubernetes clusters to manage its infrastructure, which necessitated a granular, three-tiered recovery approach. Because the outage affected network connectivity via `systemd-networkd`, the team had to restore components in a specific order to regain control of the environment.

* **Parent Control Planes:** Engineers first rebooted the virtual machines hosting the parent clusters, which manage the control planes for all other clusters.
* **Child Control Planes:** Once parent clusters were stable, the team restored the control planes for application clusters, which run as pods within the parent infrastructure.
* **Application Worker Nodes:** Thousands of worker nodes across dozens of clusters were restarted progressively to avoid overwhelming the control planes, reaching full capacity by 12:05 UTC.

## Scaling Bottlenecks and Cloud Quotas

Once the infrastructure was online, the team attempted to scale out rapidly to process a massive backlog of buffered data. This surge in demand triggered previously unencountered limitations within the Google Cloud environment.

* **VPC Peering Limits:** At 14:18 UTC, the platform hit a documented but overlooked limit of 15,500 VM instances within a single network peering group, blocking all further scaling.
* **Provider Intervention:** Datadog worked directly with Google Cloud support to manually raise the peering group limit, which allowed scaling to resume after a nearly four-hour delay.

## IP Address and Subnet Capacity

Even after cloud-level instance quotas were lifted, specific high-traffic clusters processing logs and traces hit a secondary bottleneck related to internal networking.

* **Subnet Exhaustion:** These clusters attempted to scale to more than twice their normal size, quickly exhausting all available IP addresses in their assigned subnets.
* **Capacity Planning Gaps:** While Datadog typically targets a 66% maximum IP usage to allow for a 50% scale-out, the extreme demands of the recovery backlog exceeded these safety margins.
* **Impact on Backlog:** For six hours, the lack of available IPs forced these clusters to process data significantly slower than the rest of the recovered infrastructure.

## Recovery Summary

The EU1 recovery demonstrates that even when hardware is functional, software-defined limits can create cascading delays. Organizations should not only monitor their own resource usage but also maintain visibility into cloud provider quotas and ensure that subnet allocations account for extreme recovery scenarios where workloads may need to double or triple in size momentarily.
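The capacity rule of thumb above (cap IP usage at 66% so a 50% scale-out still fits, since 0.66 × 1.5 ≈ 0.99) can be checked mechanically. A minimal sketch — the function name, CIDR, and address counts are illustrative, not Datadog's actual tooling:

```python
import ipaddress

def subnet_headroom(cidr, addresses_in_use, target_usage=0.66):
    """Return (usage, within_target, max_scale_out) for a pod subnet."""
    net = ipaddress.ip_network(cidr)
    usable = net.num_addresses - 2              # network and broadcast reserved
    usage = addresses_in_use / usable
    max_scale_out = usable / addresses_in_use   # growth factor before exhaustion
    return usage, usage <= target_usage, max_scale_out

# A hypothetical /19 pod subnet (8,190 usable addresses) with 6,000 in use:
usage, ok, scale = subnet_headroom("10.0.0.0/19", 6000)
print(f"usage={usage:.0%} within_target={ok} max_scale_out={scale:.2f}x")
```

At 73% usage the subnet is already past the 66% ceiling: a 1.5x scale-out would need more addresses than exist, which is exactly the failure mode the recovery hit.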

Toto and BOOM: A state-of-the-art open-weights time series foundation model and an observability benchmark | Datadog

Datadog has introduced Toto, a new open-weights foundation model specifically designed for time-series forecasting and anomaly detection within observability contexts. While general-purpose time-series models often struggle with the unique volatility and high-frequency patterns of IT telemetry, Toto is pre-trained on a massive dataset of 500 billion observations to provide superior zero-shot performance. This release, accompanied by the BOOM benchmark, addresses the critical need for specialized AI tools capable of handling the complexity of modern cloud infrastructure.

### Toto Model Architecture and Training

* Toto utilizes a decoder-only transformer architecture, adapting large language model (LLM) principles to the domain of continuous numerical data.
* The model employs a "patching" mechanism, which groups multiple time-series data points into single tokens to improve computational efficiency and allow the model to capture longer historical dependencies.
* It incorporates Rotary Positional Embeddings (RoPE) to better handle sequences of varying lengths and maintain temporal relationships across different frequencies.
* Training was conducted on a curated dataset of 500 billion anonymized data points from real-world observability metrics, including CPU usage, memory consumption, and network traffic.

### Specialized Observability Features

* Unlike existing models like TimesFM or Chronos, which are trained on diverse but general datasets like weather or retail trends, Toto is optimized for the specific "spikiness" and abrupt level shifts common in IT environments.
* The model supports zero-shot forecasting, allowing users to generate predictions for new metrics immediately without the need for expensive or time-consuming fine-tuning.
* Toto is designed to handle varying sampling rates, from one-second intervals to hourly aggregations, making it versatile across different infrastructure layers.
* The open-weights release on Hugging Face allows researchers and engineers to integrate the model into their own AIOps workflows or private cloud environments.

### The BOOM Evaluation Framework

* Datadog released the Benchmarking Observability Models (BOOM) framework to provide a standardized method for evaluating time-series models on infrastructure-specific tasks.
* BOOM focuses on metrics that represent real-world operational challenges, such as seasonal traffic patterns and sudden system failures.
* Comparative testing shows that Toto consistently outperforms general-purpose models in Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) when applied to observability datasets.
* The benchmark provides a transparent way for the industry to measure progress in time-series foundation models, moving beyond generic datasets that do not reflect the realities of microservices and distributed systems.

Organizations looking to automate capacity planning, optimize cloud spend, or implement intelligent alerting should consider adopting Toto for their time-series analysis. By utilizing the open-weights model alongside the BOOM benchmark, teams can achieve high-accuracy forecasting and objective performance validation without the overhead of building specialized models from scratch.
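The patching mechanism described above can be illustrated in a few lines. This is a generic sketch of the idea — grouping consecutive points so each patch is embedded as one token — not Toto's actual tokenizer:

```python
def patchify(series, patch_len):
    """Group consecutive points into fixed-length patches, each of which would
    be embedded as a single token; a ragged tail is dropped for simplicity."""
    n = (len(series) // patch_len) * patch_len
    return [series[i:i + patch_len] for i in range(0, n, patch_len)]

# One hour of 1-second CPU samples becomes 56 tokens instead of 3,600,
# so the same context window covers ~64x more history:
patches = patchify(list(range(3600)), 64)
print(len(patches), len(patches[0]))  # 56 64
```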

Our journey taking Kubernetes state metrics to the next level | Datadog

Datadog’s container observability team significantly improved the performance of kube-state-metrics (KSM) by contributing core architectural enhancements to the upstream open-source project. Faced with scalability bottlenecks where metrics collection for large clusters took tens of seconds and generated massive data payloads, they revamped the underlying library to achieve a 15x improvement in processing duration. These contributions allowed for high-granularity monitoring at scale, ensuring that the Datadog Agent can efficiently handle millions of metrics across thousands of Kubernetes nodes.

### Challenges with KSM Scalability

* KSM uses the informer pattern to expose cluster-level metadata via the OpenMetrics format, but the volume of data grows rapidly with cluster size.
* A single node generates approximately nine metrics, while a single pod can generate up to 40 metrics.
* In clusters with thousands of nodes and tens of thousands of pods, the `/metrics` endpoint produced payloads weighing tens of megabytes.
* The time required to crawl these metrics often exceeded 15 seconds, forcing administrators to reduce check frequency and sacrifice real-time data granularity.

### Limitations of Legacy Implementations

* KSM v1 relied on a monolithic loop that instantiated a Builder to track resources via stores, but it lacked efficient hooks for metric generation.
* The original Python-based Datadog Agent check struggled with the "data dump" approach of KSM, where all metrics were processed at once during query time.
* To manage the load, Datadog was forced to split KSM into multiple deployments based on resource types (e.g., separate deployments for pods, nodes, and secondary resources like services or deployments).
* This fragmentation made the infrastructure more complex to manage and did not solve the fundamental issue of inefficient metric serialization.

### Architectural Improvements in KSM v2.0

* Datadog collaborated with the upstream community during the development of KSM v2.0 to introduce a more extensible design.
* The team focused on improving the Builder and metric generation hooks to prevent the system from dumping the entire dataset at query time.
* By moving away from the restrictive v1 library structure, they enabled more efficient reconciliation of metric names and metadata joins.
* The resulting 15x performance gain allows the Datadog Agent to reconcile labels and tags — such as joining deployment labels to specific metrics — without the significant latency overhead previously experienced.

Contributing back to the open-source community proved more effective than maintaining internal forks for scaling Kubernetes infrastructure. Organizations running high-density clusters should prioritize upgrading to KSM v2.0 and optimizing their agent configurations to leverage these architectural improvements for better observability performance.
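The per-resource ratios cited above make it easy to see why payloads ballooned. A back-of-the-envelope sketch (the cluster sizes and the ~100-bytes-per-line figure are hypothetical):

```python
# Metric ratios cited in the post: ~9 series per node, up to 40 per pod.
METRICS_PER_NODE = 9
METRICS_PER_POD = 40

def ksm_series_count(nodes, pods):
    """Time series exposed on /metrics for nodes and pods alone; real clusters
    add services, deployments, and other resources on top of this."""
    return nodes * METRICS_PER_NODE + pods * METRICS_PER_POD

# A 1,000-node cluster averaging 10 pods per node:
series = ksm_series_count(1000, 10_000)
payload_mb = series * 100 / 1_000_000  # at roughly 100 bytes per exposition line
print(series, f"{payload_mb:.0f} MB")  # 409000 41 MB
```

Even this modest cluster serves hundreds of thousands of series and tens of megabytes per scrape, which is why crawl times stretched past 15 seconds.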

Making fetch happen: Building a general-purpose query and render scheduler | Datadog

Datadog replaced its complex, dashboard-specific scheduling system with a generalized, modular query and render scheduler to improve performance across all its web applications. By simplifying query heuristics and leveraging the Browser Scheduling API for renders, the engineering team achieved a more stable backend load and smoother UI interactions. This transition transformed a brittle set of rules into a scalable framework that optimizes resource utilization based on widget visibility and browser availability.

## Limitations of Legacy Scheduling

The original scheduling system was a complex web of over 20 interlinked heuristics that became difficult for developers to maintain or reason about. While it performed better than an unscheduled baseline, it suffered from several structural flaws:

* **Tight Coupling:** Query and render logic were unnecessarily linked; for example, fetches were sometimes delayed based on pending render tasks, even when throttling fetches wasn’t necessary.
* **Lack of Generalization:** The system was hardcoded specifically for dashboards, making it impossible to use the same optimization benefits for other widget-heavy products in the Datadog suite.
* **Inefficient Resource Management:** Renders were often delayed based on arbitrary data size rules rather than the actual real-time availability of the browser's CPU and memory resources.

## A Simplified Query Algorithm

To create a more predictable and efficient system, the team stripped away redundant rules — such as manual throttling for unfocused tabs, which modern browsers already handle — and moved to a streamlined query model. The new algorithm is governed by only six parameters:

* **Visibility Priority:** Fetches for widgets currently visible in the viewport are executed immediately to ensure a responsive user experience.
* **Fixed Time Windows:** Non-visible queries are ranked by enqueue time and processed in 2000ms windows with a limit of 10 tasks per window.
* **Error Reduction:** The more stable distribution of tasks significantly reduced "429 (Too Many Requests)" errors, leading to faster overall data loading since fewer retries are required.
* **Framework Integration:** This simplified logic was moved into a standard data-fetching framework, allowing any Datadog product using generalized components to benefit from the scheduler.

## Render Scheduling with the Browser Scheduling API

While the query scheduler handles data fetching, a separate render scheduler manages the impact on the browser’s main thread. By moving away from legacy heuristics and adopting the Browser Scheduling API, Datadog can now schedule tasks based on native browser priorities:

* **Prioritization:** The API allows developers to categorize tasks as `user-blocking`, `user-visible`, or `background`, ensuring the browser prioritizes critical UI updates while deferring heavy computations to idle periods.
* **Resource Awareness:** Unlike the old system, this API is natively aware of CPU and memory pressure, allowing the browser to manage execution timing more effectively than a JavaScript-based heuristic.
* **Future-Proofing:** Currently supported in Chromium and Firefox Nightly (with polyfills for others), this approach allows for mass updates to task priorities and the ability to abort stale tasks via `TaskController`.

Standardizing on a modular scheduling architecture allows engineering teams to optimize both network traffic and main-thread performance without the maintenance overhead of complex, custom rule sets. For high-density data applications, leveraging native browser APIs for task prioritization is recommended to ensure smooth rendering across varying hardware capabilities.
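The visibility-first, fixed-window query policy can be sketched language-agnostically. This toy model assumes tasks arrive pre-sorted by enqueue time and ignores in-flight completion; it only illustrates how the 2000ms/10-task windows spread out off-screen fetches:

```python
WINDOW_MS = 2000        # fixed scheduling window
TASKS_PER_WINDOW = 10   # cap on deferred fetches per window

def schedule(tasks):
    """Assign each fetch a dispatch time: fetches for visible widgets run
    immediately, while the rest drain in enqueue order, ten per 2000 ms window.

    `tasks` is a list of (enqueue_ms, visible) pairs, already sorted by
    enqueue time; returns a dispatch time in ms for each task.
    """
    dispatch = []
    deferred = 0
    for enqueue_ms, visible in tasks:
        if visible:
            dispatch.append(enqueue_ms)            # viewport widgets: now
        else:
            window = deferred // TASKS_PER_WINDOW  # 0, 1, 2, ... per ten tasks
            dispatch.append(window * WINDOW_MS)
            deferred += 1
    return dispatch

# One visible widget plus 25 off-screen ones, all enqueued at t=0:
times = schedule([(0, True)] + [(0, False)] * 25)
print(times[0], times[1], times[11], times[21])  # 0 0 2000 4000
```

Spreading deferred fetches across windows is what flattens the backend load spike and, per the post, cut the 429 retry churn.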

How we built a Ruby library that saves 50% in testing time | Datadog

Lengthy CI pipelines and flaky tests often hinder developer productivity by causing unnecessary wait times and costly infrastructure usage. To address this, Datadog developed a Ruby test impact analysis library that dynamically maps tests to specific source files, allowing the CI runner to skip tests unrelated to the latest code changes. By moving beyond standard coverage tools and utilizing low-level Ruby VM interpreter events, this solution significantly reduces testing time while maintaining high performance and correctness.

## The Strategy of Test Impact Analysis

* Lengthy CI pipelines (often exceeding 20 minutes) increase the likelihood of intermittent "flaky" failures that are unrelated to current code changes.
* While parallelization can reduce time, it increases cloud computing costs and does not mitigate the flakiness of irrelevant tests.
* Test impact analysis generates a dynamic map between each test and the source files executed during its run; if a commit doesn't touch those files, the test is safely skipped.
* Success depends on three pillars: correctness (never skipping a necessary test), performance (low overhead), and seamlessness (no required code changes for the user).

## Limitations of Standard Coverage Tools

* Ruby’s built-in `Coverage` module (enhanced in version 3.1 with `resume`/`suspend` methods) proved incompatible with existing total code coverage tools like `simplecov`.
* Initial prototypes using the `Coverage` module showed a performance overhead of 300%, making the test suite four times slower.
* The `TracePoint` API was also evaluated as an alternative to spy on code execution via the `line` event, but it still produced a significant median overhead of 200% to 400%.
* Benchmarks were conducted using the `rubocop` test suite — a "hard mode" scenario with 20,000+ tests — to ensure the tool could handle high-sensitivity environments.

## Implementing a Custom C Extension

* To bypass the limitations of high-level APIs, developers utilized Ruby’s C extension capabilities to hook directly into the Virtual Machine.
* The library uses `rb_add_event_hook2` and `rb_thread_add_event_hook` to subscribe to the `RUBY_EVENT_LINE` event at the interpreter level.
* The implementation involves a C-based `dd_cov_start` function that triggers when a test begins and a `dd_cov_stop` function to collect the results.
* During execution, the tool uses `rb_sourcefile()` to identify the current file and stores it in a Ruby hash only if the file is located within the project’s root directory.

For engineering teams struggling with bloated CI pipelines, adopting test impact analysis is a highly effective way to optimize resources. By utilizing tools like Datadog’s Intelligent Test Runner, which leverages low-level VM events for minimal overhead, teams can cut their testing time in half without sacrificing the reliability of their master branch.
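The core idea — run a test with a per-line hook and record which project files execute — can be shown without a C extension. This is a pure-Python analogue using `sys.settrace` (far slower than the interpreter-level hook the post describes, but the same mapping logic):

```python
import sys

def run_with_coverage(test_fn, project_root):
    """Run one test while recording every project file its code touches --
    a pure-Python analogue of hooking RUBY_EVENT_LINE from a C extension."""
    touched = set()

    def tracer(frame, event, arg):
        if event == "line":
            path = frame.f_code.co_filename
            # Keep only files under the project root, as the post describes
            # dd_cov doing with rb_sourcefile().
            if path.startswith(project_root):
                touched.add(path)
        return tracer  # keep receiving per-line events in this frame

    sys.settrace(tracer)   # analogue of dd_cov_start
    try:
        test_fn()
    finally:
        sys.settrace(None)  # analogue of dd_cov_stop
    return touched

def sample_test():
    assert sum([1, 2, 3]) == 6

# With an empty root every executed file matches, so the set is non-empty:
files = run_with_coverage(sample_test, project_root="")
print(len(files) > 0)  # True
```

The resulting test-to-files map is what lets the CI runner skip any test whose recorded files are untouched by the current commit.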

How we use Vale to improve our documentation editing process | Datadog

To manage a high volume of technical content across dozens of products, Datadog’s documentation team has automated its editorial process using the open-source linting tool Vale. By integrating these checks directly into their CI/CD pipeline via GitHub Actions, the team ensures prose consistency and clarity while significantly reducing the manual burden on technical writers. This "shift-left" approach empowers both internal and external contributors to identify and fix style issues independently before a formal human review begins.

### Scaling Documentation Workflows

* The Datadog documentation team operates at a 200:1 developer-to-writer ratio, managing over 1,400 contributors and 35 distinct products.
* In 2023 alone, the team merged over 20,000 pull requests covering 650 integrations, 400 security rules, and 65 API endpoints.
* On-call writers review an average of 40 pull requests per day, necessitating automation to handle triaging and style enforcement efficiently.

### Automated Prose Review with Vale

* Vale is implemented as a command-line tool and a GitHub Action that scans Markdown and HTML files for style violations.
* When a contributor opens a pull request, the linter provides automated comments in the "Files Changed" tab, flagging long sentences, wordy phrasing, or legacy formatting habits.
* This automation reduces the "mental toll" on writers by filtering out repetitive errors before they reach the human review stage.

### Codifying Style Guides into Rules

* The team transitioned from static editorial guidelines stored in Confluence and wikis to a codified repository called `datadog-vale`.
* Style rules are defined using Vale’s YAML specification, allowing the team to update global standards in a single location that is immediately active in the CI pipeline.
* Custom regular expressions are used to exclude specific content from validation, such as Hugo shortcodes or technical snippets that do not follow standard prose rules.

### Implementation of Specific Linting Rules

* **Jargon and Filler Words:** A `words.yml` file flags "cruft" such as "easily" or "simply" to maintain a professional, objective tone.
* **Oxford Comma Enforcement:** The `oxfordcomma.yml` rule uses regex to identify lists missing a serial comma and provides a suggestion to the author.
* **Latin Abbreviations:** The `abbreviations.yml` rule identifies terms like "e.g." or "i.e." and suggests plain English alternatives like "for example" or "that is."
* **Timelessness:** Rules flag words like "currently" or "now" to ensure documentation remains relevant without frequent updates.

By open-sourcing their Vale configurations, Datadog provides a framework for other organizations to automate their style guides and foster a more efficient, collaborative documentation culture. Teams looking to improve prose quality should consider adopting a similar "docs-as-code" approach to shift editorial effort toward the beginning of the contribution lifecycle.
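A rule in the spirit of `words.yml` takes only a few lines of Vale's YAML spec. This is an illustrative reconstruction, not a copy of the rules in the `datadog-vale` repository:

```yaml
# Flags filler words; Vale substitutes the matched token for %s in the message.
extends: existence
message: "Avoid filler words like '%s'; state the fact directly."
level: warning
ignorecase: true
tokens:
  - easily
  - simply
```

Because rules like this live in one repository consumed by the CI pipeline, updating a token list changes the standard for every pull request immediately.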

How we optimized our Akka application using Datadog’s Continuous Profiler | Datadog

Datadog engineers discovered a significant 20–30% CPU overhead in their Akka-based Java applications caused by inefficient thread management within the `ForkJoinPool`. Through continuous profiling, the team found that irregular task flows were forcing the runtime to waste cycles constantly parking and unparking threads. By migrating bursty actors to a dispatcher with a more stable workload, they achieved a major performance gain, illustrating how high-level framework abstractions can mask low-level resource bottlenecks.

### Identifying the Performance Bottleneck

* While running A/B tests on a new log-parsing algorithm, the team noticed that expected CPU reductions did not materialize; in some cases, performance actually degraded.
* Flame graphs revealed that the application was spending a disproportionate amount of CPU time inside the `ForkJoinPool.scan()` and `Unsafe.park()` methods.
* A summary table of CPU usage by thread showed that the "work" pool was only using 1% of the CPU, while the default Akka dispatcher was the primary consumer of resources.
* The investigation narrowed the cause down to the `LatencyReportActor`, which handled latency metrics for log events.

### Analyzing the Root Cause of Thread Fluctuations

* The `ForkJoinPool` manages worker threads dynamically, calling `Unsafe.park()` to suspend idle threads and `Unsafe.unpark()` to resume them when tasks increase.
* The `LatencyReportActor` exhibited an irregular task flow, processing several hundred events in milliseconds and then remaining idle until the next second.
* Because the default dispatcher was configured to use a thread pool equal to the number of processor cores (32), the system was waking up 32 threads every second for a tiny burst of work.
* This constant cycle of waking and suspending threads created massive CPU overhead through expensive native calls to the operating system's thread scheduler.

### Implementing a Configuration-Based Fix

* The solution involved moving the `LatencyReportActor` from the default Akka dispatcher to the main "work" dispatcher.
* Because the "work" dispatcher already maintained a consistent flow of log processing tasks, the threads remained active and did not trigger the frequent park/unpark logic.
* A single-line configuration change was used to route the actor to the stable dispatcher.
* Following the change, the default dispatcher’s thread pool shrank from 32 to 2 threads, and overall service CPU usage dropped by an average of 30%.

To maintain optimal performance in applications using `ForkJoinPool` or Akka, developers should monitor the `ForkJoinPool.scan()` method; if it accounts for more than 10–15% of CPU usage, the thread pool is likely unstable. Recommendations for remediation include limiting the number of actor instances, capping the maximum threads in a pool, and utilizing task queues to buffer short spikes. The ultimate goal is to ensure a stable count of active threads and avoid the performance tax of frequent thread state transitions.
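In Akka, the "single-line configuration change" typically means assigning the actor to an existing dispatcher in the deployment configuration. A hypothetical sketch — the actor path and dispatcher name are invented, and the post does not show the actual config:

```hocon
# Route the bursty actor onto the already-busy "work" dispatcher so its
# threads stay warm instead of parking/unparking every second.
akka.actor.deployment {
  /latency-report-actor {
    dispatcher = work-dispatcher
  }
}
```

The same effect can be achieved in code via `Props.withDispatcher`; the configuration route keeps the tuning decision out of the actor's source.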

2023-03-08 incident: A deep dive into our incident response | Datadog

Datadog’s first global outage on March 8, 2023, served as a rigorous stress test for their established incident response framework and "you build it, you own it" philosophy. While the outage was triggered by a systemic failure during a routine systemd upgrade, the company's commitment to blameless culture and decentralized engineering autonomy allowed hundreds of responders to coordinate a complex recovery across multiple regions. Ultimately, the event validated their investment in out-of-band monitoring and rigorous, twice-yearly incident training as essential components for managing high-scale system disasters.

## Incident Response Structure and Philosophy

* Datadog employs a decentralized "you build it, you own it" model where individual engineering teams are responsible for the 24/7 health and monitoring of the services they build.
* For high-severity incidents, a specialized rotation is paged, consisting of an Incident Commander to lead the response, a communications lead, and a customer liaison to manage external messaging.
* The organization prioritizes "people over process," empowering engineers to use their judgment to find creative solutions rather than following rigid, pre-written playbooks that may not apply to unprecedented failures.
* A blameless culture is strictly maintained across all levels of the company, ensuring that post-incident investigations focus on systemic improvements rather than assigning fault to individuals.

## Multi-Layered Monitoring Strategy

* Standard telemetry provides internal visibility, but Datadog also maintains "out-of-band" monitoring that operates completely outside its own infrastructure.
* This out-of-band system interacts with Datadog APIs exactly like a customer would, ensuring that engineers are alerted even if the internal monitoring platform itself becomes unavailable.
* Communication is streamlined through a dedicated Slack incident app that automatically generates coordination channels, providing situational awareness to any engineer who joins the effort.

## Anatomy of the March 8 Outage

* The outage began at 06:00 UTC, triggered by a systemd upgrade that caused widespread Kubernetes failures and prevented pods from restarting correctly.
* The global nature of the outage was diagnosed within 32 minutes of the initial monitoring alerts, leading to the activation of executive on-calls and the customer support management team.
* Responders identified "unattended upgrades" as the incident trigger approximately five and a half hours after the initial failure.
* Recovery was executed in stages: compute capacity was restored first in the EU1 region, followed by the US1 region, with full infrastructure restoration completed by 19:00 UTC.

Organizations should treat incident response as a perishable skill that requires constant practice through a low threshold for declaring incidents and regular training. By combining out-of-band monitoring with a culture that empowers individual engineers to act autonomously during a crisis, teams can more effectively navigate the "not if, but when" reality of large-scale system failures.
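The out-of-band principle — probe the platform's public API exactly as a customer would, from infrastructure that shares nothing with it — can be sketched simply. Everything here (the URL, response shape, and injected transport) is illustrative, not Datadog's actual probe:

```python
import json

def out_of_band_check(fetch, api_url, timeout_s=10):
    """Probe a public API as a customer would and return (healthy, detail).

    `fetch` is injected so the transport can live entirely outside the
    monitored platform (or be stubbed, as below); a failed call is itself
    a signal, since the probe must not depend on the platform being up.
    """
    try:
        status, body = fetch(api_url, timeout_s)
    except Exception as exc:  # no response at all is a strong alert signal
        return False, f"unreachable: {exc}"
    if status != 200:
        return False, f"HTTP {status}"
    return json.loads(body).get("status") == "ok", body

# Stub transport standing in for a real HTTPS request from external infra:
healthy, _ = out_of_band_check(lambda url, t: (200, '{"status": "ok"}'),
                               "https://api.example.com/v1/validate")
print(healthy)  # True
```

The alerting path for such a probe must also be independent (a separate pager or phone tree), or the check fails exactly when it is needed most.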

How Datadog uses Datadog to gain visibility into the Datadog user experience | Datadog

Datadog leverages its own monitoring tools to bridge the gap between qualitative user interviews and quantitative performance data. By "dogfooding" features like Real User Monitoring (RUM) and Logs, the product design team makes evidence-based UI/UX adjustments while gaining firsthand empathy for the user experience. This approach allows them to identify exactly how users interact with specific components and where current designs fail to meet user expectations.

**Optimizing Font Consistency via CSS API Tracking**

* To ensure visual precision in information-dense views like the Log Explorer, the team needed to transition from a generic system font stack to a standardized monospace font.
* Designers used the CSS Font Loading API’s `document.fonts` interface via Datadog RUM to collect data on which specific fonts were actually being rendered on users' machines.
* By analyzing a dashboard of these results, the team selected Roboto Mono as the standard, ensuring the new font’s optical size matched what the plurality of users were already seeing to avoid breaking embedded tables.

**Simplifying Components through Interaction Logging**

* The `DraggablePane` component, used for resizing adjacent panels, was suffering from UI clutter due to dedicated buttons for minimizing and maximizing content.
* The team implemented custom loggers within Datadog Logs to track how frequently users clicked these specific controls versus interacting with the draggable handle.
* The data revealed that the buttons were almost never used; consequently, the team removed them and replaced the functionality with a double-click event, significantly streamlining the interface.

**Refining Syntax Support through Error Analysis**

* When introducing the `DateRangePicker` for custom time frames, the team needed to expand the component's logic to support natural language strings.
* By aggregating "invalid inputs" in Datadog Logs, the team could see the exact strings users were typing — such as "last 2 weeks" — that the system failed to parse.
* Analyzing these common patterns allowed the team to update the parsing logic for high-demand keywords, which resulted in the component’s error rate dropping from 10 percent to approximately 5 percent.

Leveraging internal monitoring tools allows design teams to move beyond guesswork and create highly functional interfaces. For organizations managing complex technical products, tracking specific component failures and interaction frequencies is an essential strategy for prioritizing the design roadmap and improving user retention.
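Extending a parser to cover patterns surfaced in the invalid-input logs might look like the following. The grammar here is invented for illustration — the real `DateRangePicker` logic is not shown in the post:

```python
import re
from datetime import timedelta

# One pattern family surfaced in the logs: "last <n> <unit>".
PATTERN = re.compile(r"last\s+(\d+)\s+(minute|hour|day|week|month)s?", re.I)

UNIT = {"minute": timedelta(minutes=1), "hour": timedelta(hours=1),
        "day": timedelta(days=1), "week": timedelta(weeks=1),
        "month": timedelta(days=30)}  # calendar months approximated

def parse_range(text):
    """Return a lookback window for recognized phrases, else None.
    Unparsed strings would keep flowing to the invalid-input log."""
    m = PATTERN.fullmatch(text.strip())
    if not m:
        return None
    count, unit = int(m.group(1)), m.group(2).lower()
    return UNIT[unit] * count

print(parse_range("last 2 weeks"))  # 14 days, 0:00:00
print(parse_range("yesterday"))     # None
```

The loop the post describes is the key design point: every `None` is logged, aggregated, and used to decide which pattern family to support next.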

How we migrated our acceptance tests to use Synthetic Monitoring | Datadog

Datadog’s Frontend Developer Experience team migrated their massive codebase from a fragile, custom Puppeteer-based acceptance testing framework to Datadog Synthetic Monitoring to address persistent flakiness and high maintenance overhead. By leveraging a record-and-play approach and integrating it into their CI/CD pipelines via the `datadog-ci` tool, they successfully reduced developer friction and improved testing reliability for over 300 engineers. This transition demonstrates how replacing manual browser scripting with specialized monitoring tools can significantly streamline high-scale frontend workflows.

### Limitations of Puppeteer-Based Testing

* Custom runners built on Puppeteer suffered from inherent flakiness because they relied on a complex chain of virtual graphic engines, browser manipulation, and network stability that frequently failed unexpectedly.
* Writing tests was unintuitive, requiring engineers to manually script interaction details — such as verifying if a button is present and enabled before clicking — which became exponentially more complex for custom elements like dropdowns.
* The testing infrastructure was slow and expensive, with CI jobs taking up to 35 minutes of machine time per commit to cover the application's 565 tests and 100,000 lines of test code.
* Maintenance was a constant burden; every product update required a corresponding manual update to the scripts, making the process as labor-intensive as writing new features.

### Adopting Synthetic Monitoring and Tooling

* The team moved to Synthetic Monitoring, which allows engineers to record browser interactions directly rather than writing code, significantly lowering the barrier to entry for creating tests.
* To integrate these tests into the development lifecycle, the team developed `datadog-ci`, a CLI tool designed to trigger tests and poll result statuses directly from the CI environment.
* The new system uses a specific file format (`.synthetics.json`) to identify tests within the codebase, allowing for configuration overrides and human-readable output in the build logs.
* This transition turned an internal need into a product improvement, as the `datadog-ci` tool was generalized to help all Datadog users execute commands from within their CI/CD scripts.

### Strategies for High-Scale Migration and Adoption

* The team utilized comprehensive documentation and internal "frontend gatherings" to educate 300 engineers on how to record tests and why the new system required less maintenance.
* To build developer trust, the team initially implemented the new tests as non-blocking CI jobs, surfacing failures as PR comments rather than breaking builds.
* Migration was treated as a distributed effort, with 565 individual tests tracked via Jira and assigned to their respective product teams to ensure ownership and a steady pace.
* By progressively sunsetting the old platform as tests were migrated, the team managed a year-long transition without disrupting the daily output of 160 authors pushing 90 new PRs every day.

To successfully migrate large-scale testing infrastructures, organizations should prioritize developer trust by introducing new tools through non-blocking pipelines and providing comprehensive documentation. Transitioning from manual browser scripting to automated recording tools not only reduces technical debt but also empowers engineers to maintain high-quality codebases without the burden of managing complex testing infrastructure.
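A minimal `.synthetics.json` might look like the following — the test ID and start URL are placeholders, and the real schema supports additional override fields:

```json
{
  "tests": [
    {
      "id": "abc-def-ghi",
      "config": {
        "startUrl": "https://app.example.com/dashboard"
      }
    }
  ]
}
```

Checking files like this into the codebase is what lets `datadog-ci` discover which recorded tests belong to a given branch, trigger them, and poll for results during the build.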

Cheering on coworkers: Building culture with Datadog dashboards | Datadog

Datadog engineers developed a real-time tracking dashboard to monitor a colleague’s progress during an 850km, six-day ultra-marathon challenge. By scraping public race statistics and piping the data into their monitoring platform, the team created a centralized visualization tool to provide remote support and office-wide engagement.

### Data Extraction and Parsing

The team needed to harvest race data that was only available as plain HTML on the event’s official website.

* A crawler was built using the Python `Requests` library to automate the retrieval of the webpage's source code.
* The team utilized `BeautifulSoup` to parse the HTML and isolate specific data points, such as the runner's current ranking and total distance covered.

### Ingesting Metrics with StatsD

Once the data was structured, it was converted into telemetry using the Datadog agent and the `statsd` Python library.

* The script utilized `dog.gauge` to emit three primary metrics: `runner.distance`, `runner.ranking`, and `runner.elapsed_time`.
* Each metric was assigned a "name" tag corresponding to the runner, allowing the team to filter data and compare participants within the Datadog interface.
* The data was updated periodically to ensure the dashboard reflected the most current race standings.

### Dashboard Visualization and Results

The final phase involved synthesizing the metrics into a high-visibility dashboard displayed in the company’s New York and Paris offices.

* The dashboard combined technical performance graphs with multimedia elements, including live video feeds and GIFs, to create an interactive cheering station.
* The system successfully tracked the athlete's 47km lead in real-time, providing the team with immediate updates on his physical progress and elapsed time over the 144-hour event.

This project demonstrates how standard observability tools can be repurposed for creative "life-graphing" applications. By combining simple web scraping with metric ingestion, engineers can quickly build custom monitoring solutions for any public data source.
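The crawl-parse-emit pipeline fits in a short script. This stdlib-only sketch stubs both the page (the team fetched a live one with `Requests` and parsed it with `BeautifulSoup`) and the emitter (`dog.gauge` in the original); the HTML shape is invented, while the metric and tag names match those cited above:

```python
import re

# Invented stand-in for the race site's plain-HTML results table:
SAMPLE_HTML = """
<tr><td class="name">Runner A</td><td class="rank">1</td>
<td class="distance">612.4</td></tr>
"""

emitted = []

def gauge(metric, value, tags):
    """Stub for statsd-style emission (dog.gauge in the original script)."""
    emitted.append((metric, value, tags))

def scrape(html):
    """Pull ranking and distance out of the results row and emit gauges,
    tagging each metric with the runner's name for per-runner filtering."""
    name = re.search(r'class="name">([^<]+)', html).group(1)
    rank = int(re.search(r'class="rank">(\d+)', html).group(1))
    distance = float(re.search(r'class="distance">([\d.]+)', html).group(1))
    gauge("runner.ranking", rank, tags=[f"name:{name}"])
    gauge("runner.distance", distance, tags=[f"name:{name}"])

scrape(SAMPLE_HTML)
print(emitted)
```

Run on a schedule (cron or a simple loop), each pass refreshes the gauges and the dashboard redraws with the latest standings.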