datadog

How we migrated our acceptance tests to use Synthetic Monitoring | Datadog

Datadog’s Frontend Developer Experience team migrated the acceptance tests for its large codebase from a fragile, custom Puppeteer-based framework to Datadog Synthetic Monitoring to address persistent flakiness and high maintenance overhead. By adopting a record-and-play approach and integrating it into their CI/CD pipelines via the `datadog-ci` tool, they reduced developer friction and improved testing reliability for over 300 engineers. This transition demonstrates how replacing manual browser scripting with specialized monitoring tools can streamline high-scale frontend workflows.

### Limitations of Puppeteer-Based Testing

* Custom runners built on Puppeteer suffered from inherent flakiness because they relied on a complex chain of virtual graphic engines, browser manipulation, and network stability, any part of which could fail unexpectedly.
* Writing tests was unintuitive, requiring engineers to manually script interaction details (such as verifying that a button is present and enabled before clicking it), which became considerably more complex for custom elements like dropdowns.
* The testing infrastructure was slow and expensive, with CI jobs taking up to 35 minutes of machine time per commit to cover the application's 565 tests and 100,000 lines of test code.
* Maintenance was a constant burden; every product update required a corresponding manual update to the scripts, making the process as labor-intensive as writing new features.

### Adopting Synthetic Monitoring and Tooling

* The team moved to Synthetic Monitoring, which allows engineers to record browser interactions directly rather than writing code, significantly lowering the barrier to entry for creating tests.
* To integrate these tests into the development lifecycle, the team developed `datadog-ci`, a CLI tool designed to trigger tests and poll result statuses directly from the CI environment.
* The new system uses a specific file format (`.synthetics.json`) to identify tests within the codebase, allowing for configuration overrides and human-readable output in the build logs.
* This transition turned an internal need into a product improvement, as the `datadog-ci` tool was generalized to help all Datadog users execute commands from within their CI/CD scripts.

### Strategies for High-Scale Migration and Adoption

* The team used comprehensive documentation and internal "frontend gatherings" to educate 300 engineers on how to record tests and why the new system required less maintenance.
* To build developer trust, the team initially implemented the new tests as non-blocking CI jobs, surfacing failures as PR comments rather than breaking builds.
* Migration was treated as a distributed effort, with 565 individual tests tracked via Jira and assigned to their respective product teams to ensure ownership and a steady pace.
* By progressively sunsetting the old platform as tests were migrated, the team managed a year-long transition without disrupting the daily output of 160 authors pushing 90 new PRs every day.

To successfully migrate large-scale testing infrastructures, organizations should prioritize developer trust by introducing new tools through non-blocking pipelines and providing comprehensive documentation. Transitioning from manual browser scripting to automated recording tools not only reduces technical debt but also empowers engineers to maintain high-quality codebases without the burden of managing complex testing infrastructure.
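The trigger-and-poll behavior described for `datadog-ci` above can be illustrated with a minimal Python sketch. This is not the tool's implementation or API: the `trigger` and `get_status` callables, the status strings, and the default intervals are all hypothetical stand-ins for whatever calls a real CI integration would make.

```python
import time
from typing import Callable, Dict, Iterable

# Hypothetical status values; not taken from datadog-ci or the Datadog API.
PENDING, PASSED, FAILED = "pending", "passed", "failed"


def run_and_poll(
    test_ids: Iterable[str],
    trigger: Callable[[str], str],
    get_status: Callable[[str], str],
    poll_interval: float = 5.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, str]:
    """Trigger every test, then poll until each finishes or the timeout expires."""
    # Map each test ID to the run ID returned by the (hypothetical) trigger call.
    runs = {test_id: trigger(test_id) for test_id in test_ids}
    results = {test_id: PENDING for test_id in runs}
    deadline = clock() + timeout

    while True:
        # Only re-check runs that have not reached a final status yet.
        for test_id, run_id in runs.items():
            if results[test_id] == PENDING:
                results[test_id] = get_status(run_id)
        if PENDING not in results.values() or clock() >= deadline:
            return results
        sleep(poll_interval)


def exit_code(results: Dict[str, str]) -> int:
    """Return 0 only when every test passed, so a CI job can gate on it."""
    return 0 if all(status == PASSED for status in results.values()) else 1
```

A non-blocking CI job, as used during the migration, could ignore `exit_code` and report `results` as a PR comment instead, while a blocking job would pass the value to the CI runner.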

datadog

2023-03-08 incident: A deep dive into the platform-level recovery | Datadog

Following a massive system-wide outage in March 2023, Datadog successfully restored its EU1 region by identifying that a simple node reboot could resolve network connectivity issues caused by a faulty system patch. While the team managed to restore 100 percent of compute capacity within hours, the recovery effort was subsequently hindered by cloud provider infrastructure limits and IP address exhaustion. This post-mortem highlights the complexities of scaling hierarchical Kubernetes environments under extreme pressure and the importance of accounting for "black swan" capacity requirements.

## Hierarchical Kubernetes Recovery

Datadog utilizes a strict hierarchy of Kubernetes clusters to manage its infrastructure, which necessitated a granular, three-tiered recovery approach. Because the outage affected network connectivity via `systemd-networkd`, the team had to restore components in a specific order to regain control of the environment.

* **Parent Control Planes:** Engineers first rebooted the virtual machines hosting the parent clusters, which manage the control planes for all other clusters.
* **Child Control Planes:** Once parent clusters were stable, the team restored the control planes for application clusters, which run as pods within the parent infrastructure.
* **Application Worker Nodes:** Thousands of worker nodes across dozens of clusters were restarted progressively to avoid overwhelming the control planes, reaching full capacity by 12:05 UTC.

## Scaling Bottlenecks and Cloud Quotas

Once the infrastructure was online, the team attempted to scale out rapidly to process a massive backlog of buffered data. This surge in demand triggered previously unencountered limitations within the Google Cloud environment.

* **VPC Peering Limits:** At 14:18 UTC, the platform hit a documented but overlooked limit of 15,500 VM instances within a single network peering group, blocking all further scaling.
* **Provider Intervention:** Datadog worked directly with Google Cloud support to manually raise the peering group limit, which allowed scaling to resume after a nearly four-hour delay.

## IP Address and Subnet Capacity

Even after cloud-level instance quotas were lifted, specific high-traffic clusters processing logs and traces hit a secondary bottleneck related to internal networking.

* **Subnet Exhaustion:** These clusters attempted to scale to more than twice their normal size, quickly exhausting all available IP addresses in their assigned subnets.
* **Capacity Planning Gaps:** While Datadog typically targets a 66% maximum IP usage to allow for a 50% scale-out, the extreme demands of the recovery backlog exceeded these safety margins.
* **Impact on Backlog:** For six hours, the lack of available IPs forced these clusters to process data significantly slower than the rest of the recovered infrastructure.

## Recovery Summary

The EU1 recovery demonstrates that even when hardware is functional, software-defined limits can create cascading delays. Organizations should not only monitor their own resource usage but also maintain visibility into cloud provider quotas and ensure that subnet allocations account for extreme recovery scenarios where workloads may need to double or triple in size momentarily.
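The subnet arithmetic behind the 66% target can be worked through in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and it assumes IP demand grows linearly with cluster size, which the summary does not state explicitly.

```python
def usage_after_scale_out(baseline_usage: float, scale_factor: float) -> float:
    """Fraction of subnet IPs in use after growing the cluster by scale_factor."""
    if not 0 < baseline_usage <= 1:
        raise ValueError("baseline_usage must be in (0, 1]")
    if scale_factor <= 0:
        raise ValueError("scale_factor must be positive")
    # Assumption: IP consumption scales linearly with cluster size.
    return baseline_usage * scale_factor


def fits_in_subnet(baseline_usage: float, scale_factor: float) -> bool:
    """True when the scaled cluster still fits in its assigned subnet."""
    return usage_after_scale_out(baseline_usage, scale_factor) <= 1.0


def max_baseline_usage(scale_factor: float) -> float:
    """Highest normal usage that still leaves room for the given scale-out."""
    if scale_factor <= 0:
        raise ValueError("scale_factor must be positive")
    return 1.0 / scale_factor


# A 66% baseline leaves room for a 50% scale-out (about 99% usage) ...
print(fits_in_subnet(0.66, 1.5))
# ... but doubling the cluster would need about 132% of the subnet.
print(fits_in_subnet(0.66, 2.0))
# Room to double or triple requires a baseline of at most 1/2 or 1/3.
print(max_baseline_usage(2.0), max_baseline_usage(3.0))
```

Under that linear assumption, planning for a recovery scenario where workloads double or triple means keeping normal IP usage at or below roughly 50% or 33% of the subnet, rather than 66%.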

datadog

2023-03-08 incident: A deep dive into our incident response | Datadog

Datadog’s first global outage on March 8, 2023, served as a rigorous stress test for their established incident response framework and "you build it, you own it" philosophy. While the outage was triggered by a systemic failure during a routine systemd upgrade, the company's commitment to blameless culture and decentralized engineering autonomy allowed hundreds of responders to coordinate a complex recovery across multiple regions. Ultimately, the event validated their investment in out-of-band monitoring and rigorous, bi-annual incident training as essential components for managing high-scale system disasters.

## Incident Response Structure and Philosophy

* Datadog employs a decentralized "you build it, you own it" model where individual engineering teams are responsible for the 24/7 health and monitoring of the services they build.
* For high-severity incidents, a specialized rotation is paged, consisting of an Incident Commander to lead the response, a communications lead, and a customer liaison to manage external messaging.
* The organization prioritizes "people over process," empowering engineers to use their judgment to find creative solutions rather than following rigid, pre-written playbooks that may not apply to unprecedented failures.
* A blameless culture is strictly maintained across all levels of the company, ensuring that post-incident investigations focus on systemic improvements rather than assigning fault to individuals.

## Multi-Layered Monitoring Strategy

* Standard telemetry provides internal visibility, but Datadog also maintains "out-of-band" monitoring that operates completely outside its own infrastructure.
* This out-of-band system interacts with Datadog APIs exactly like a customer would, ensuring that engineers are alerted even if the internal monitoring platform itself becomes unavailable.
* Communication is streamlined through a dedicated Slack incident app that automatically generates coordination channels, providing situational awareness to any engineer who joins the effort.

## Anatomy of the March 8 Outage

* The outage began at 06:00 UTC, triggered by a systemd upgrade that caused widespread Kubernetes failures and prevented pods from restarting correctly.
* The global nature of the outage was diagnosed within 32 minutes of the initial monitoring alerts, leading to the activation of executive on-calls and the customer support management team.
* Responders identified "unattended upgrades" as the incident trigger approximately five and a half hours after the initial failure.
* Recovery was executed in stages: compute capacity was restored first in the EU1 region, followed by the US1 region, with full infrastructure restoration completed by 19:00 UTC.

Organizations should treat incident response as a perishable skill that requires constant practice through a low threshold for declaring incidents and regular training. By combining out-of-band monitoring with a culture that empowers individual engineers to act autonomously during a crisis, teams can more effectively navigate the "not if, but when" reality of large-scale system failures.
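The out-of-band monitoring idea described above, probing a public API the way a customer would and alerting through a channel that does not depend on the monitored platform, might look roughly like the following Python sketch. It is an assumption-laden illustration, not Datadog's system: the `probe` function, the injected `fetch` and `alert` callables, and any URL passed in are hypothetical.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional


def http_status(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status for url, or None if the request could not complete."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (4xx or 5xx).
        return err.code
    except OSError:
        # Covers URLError, timeouts, and connection failures.
        return None


def probe(
    url: str,
    fetch: Callable[[str], Optional[int]] = http_status,
    alert: Callable[[str], None] = print,
) -> bool:
    """Check one endpoint from outside the platform; alert if it is unhealthy."""
    status = fetch(url)
    healthy = status is not None and 200 <= status < 300
    if not healthy:
        # In a real setup, alert would page through an independent channel.
        alert(f"out-of-band probe failed for {url}: status={status}")
    return healthy
```

The important design choice is that both the probe and its alerting path run on infrastructure separate from the platform being checked, so a platform-wide failure cannot silence the alert.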