datadog

2023-03-08 incident: A deep dive into the platform-level impact | Datadog

The March 2023 Datadog outage was a simultaneous, global failure across multiple cloud providers and regions, caused by an unexpected interaction between a systemd security patch and Ubuntu 22.04’s default networking behavior. While Datadog typically employs rigorous, staged rollouts for infrastructure changes, the automated nature of OS-level security updates bypassed these controls. The incident highlights the hidden risks in system-level defaults and the potential for "unattended upgrades" to create synchronized failures across supposedly isolated environments.

## The systemd-networkd Routing Change

* In December 2020, systemd version 248 introduced a change where `systemd-networkd` flushes all IP routing rules it does not recognize upon startup.
* Version 249 introduced the `ManageForeignRoutingPolicyRules` setting, which defaults to "yes," confirming this management behavior for any rules not explicitly defined in systemd configuration files.
* These changes were backported to earlier versions (v247 and v248) but were notably absent from v245, the version used in Ubuntu 20.04.

## Dormant Risks in the Ubuntu 22.04 Migration

* Datadog began migrating its fleet from Ubuntu 20.04 to 22.04 in late 2022, eventually reaching 90% coverage across its infrastructure.
* Ubuntu 22.04 uses systemd v249, meaning the majority of the fleet was susceptible to the routing rule flushing behavior.
* The risk remained dormant during the initial rollout because `systemd-networkd` typically starts only during the initial boot sequence, before any complex routing rules have been established.

## The Trigger: Unattended Upgrades and the CVE Patch

* On March 7, 2023, a security patch for a systemd CVE was released to the Ubuntu security repositories.
* Datadog’s fleet used the Ubuntu default configuration for `unattended-upgrades`, which automatically installs security-labeled patches once a day, typically between 06:00 and 07:00 UTC.
* The installation of the patch forced a restart of the `systemd-networkd` service on active, running nodes.
* Upon restarting, the service identified existing IP routing rules (crucial for container networking) as "foreign" and deleted them, effectively severing network connectivity for the nodes.

## Failure of Regional Isolation

* Because the security patch was released globally and the automated upgrade window was synchronized across regions, the failure occurred nearly simultaneously worldwide.
* This automation bypassed Datadog’s standard practice of "baking" changes in staging and experimental clusters for weeks before proceeding to production.
* Nodes on the older Ubuntu 20.04 (systemd v245) were unaffected by the patch, as that version of systemd does not flush IP rules upon a service restart.

To mitigate similar risks, infrastructure teams should consider explicitly disabling the management of foreign routing rules in systemd-networkd configuration when using third-party networking plugins. Furthermore, while automated security patching is a best practice, organizations must balance the speed of patching with the need for controlled, staged rollouts to prevent global configuration drift or synchronized failures.
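
The mitigation described above can be expressed as a small drop-in for `systemd-networkd`. This is a minimal sketch, assuming systemd v249 or later, where `ManageForeignRoutingPolicyRules` is an option in the `[Network]` section of `networkd.conf`. The drop-in file name is illustrative, and the fragment is not taken from Datadog's own configuration.

```ini
# /etc/systemd/networkd.conf.d/10-keep-foreign-rules.conf  (file name is an assumption)
[Network]
# Leave routing policy rules installed by other software
# (for example a container networking plugin) in place on start or restart.
ManageForeignRoutingPolicyRules=no
```

The setting is read when `systemd-networkd` starts, so it needs to be in place before the next restart of the service. It is worth checking the `networkd.conf` man page for the systemd version actually deployed before relying on it.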

coupang

Coupang Rocket Delivery: A

Coupang transitioned its Rocket Delivery management from a text-based zip code system to a spatial index-based system using Uber’s H3 library. This shift addresses the limitations of zip codes, which became too coarse for high-density delivery areas, by enabling precise, map-based visualization and manipulation of delivery zones. By adopting a hexagonal grid-based approach, Coupang has improved operational flexibility and its ability to handle complex urban delivery environments.

### The Limitations of Zip Code Systems

* Zip codes originally served as the base unit for Rocket Delivery, but as delivery volumes scaled, individual codes became too large for a single driver to manage.
* Sub-dividing these areas (e.g., splitting a zip code into specific apartment complexes or even individual buildings) required the manual expertise of senior managers because text-based addresses lack inherent spatial intelligence.
* The previous reliance on text made it difficult to visualize delivery boundaries or reassign areas quickly in response to changes in order volume.

### Implementing H3 for Geospatial Indexing

* To modernize the system, Coupang adopted H3, a hexagonal hierarchical geospatial indexing system that converts geographic coordinates into unique cell identifiers.
* Hexagons were selected over square grids because they provide uniform distances between the center of a cell and all its neighbors, which minimizes distortion in distance-based calculations.
* The system uses H3’s hierarchical structure to manage different levels of detail, allowing the platform to aggregate small hexagonal units into larger, custom-defined delivery polygons.

### Technical Challenges in System Redesign

* A primary engineering hurdle was selecting the optimal grid resolution to ensure cells were small enough to capture individual building footprints without creating excessive data overhead.
* The team developed algorithms to transform groups of hexagonal indices into filled polygons, enabling camp managers to "draw" and modify delivery zones directly on a digital map.
* By basing the system on spatial coordinates rather than administrative text, the platform can dynamically adjust to urban changes, such as the construction of new high-rises or the demolition of old structures.

Transitioning from text-based addressing to hexagonal indexing allows logistics platforms to move beyond the constraints of administrative boundaries. For high-density urban delivery services, adopting a spatial-first infrastructure like H3 is a necessary step to ensure scalability and operational precision.
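
Two of the points above, uniform neighbor distances and turning a group of cells into a zone outline, can be illustrated without the H3 library. The sketch below is a simplified stand-in that uses a flat hexagonal grid in axial coordinates. It is not H3's indexing scheme or Coupang's algorithm, and every function name in it is hypothetical.

```python
import math

# Axial (q, r) offsets of the six neighbours of a hexagonal cell.
HEX_NEIGHBORS = [(1, 0), (1, -1), (0, -1), (-1, 0), (-1, 1), (0, 1)]


def hex_center(q, r, size=1.0):
    """Centre of a pointy-top hexagon at axial (q, r); size is the circumradius."""
    return (size * math.sqrt(3) * (q + r / 2), size * 1.5 * r)


def hex_neighbor_distances(size=1.0):
    """Centre-to-centre distance from cell (0, 0) to each of its six neighbours."""
    origin = hex_center(0, 0, size)
    return [math.dist(origin, hex_center(dq, dr, size)) for dq, dr in HEX_NEIGHBORS]


def square_neighbor_distances(side=1.0):
    """Same measurement on a square grid, counting all eight surrounding cells."""
    return [
        math.hypot(dx * side, dy * side)
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    ]


def boundary_sides(zone):
    """Sides of cells in `zone` that face a cell outside it.

    Each entry is (cell, outside_neighbour); joining these sides end to end
    would trace the outline of the zone.
    """
    cells = set(zone)
    sides = []
    for q, r in sorted(cells):
        for dq, dr in HEX_NEIGHBORS:
            neighbour = (q + dq, r + dr)
            if neighbour not in cells:
                sides.append(((q, r), neighbour))
    return sides


zone = [(0, 0), (1, 0), (0, 1)]  # three mutually adjacent cells
print(sorted({round(d, 6) for d in hex_neighbor_distances()}))     # one distinct distance
print(sorted({round(d, 6) for d in square_neighbor_distances()}))  # two distinct distances
print(len(boundary_sides(zone)))                                   # 12 outward-facing sides
```

The hexagonal grid yields a single neighbor distance, while the square grid yields two (edge and diagonal neighbors). The three-cell zone exposes 12 outward-facing sides, which are the pieces an outline-drawing step would stitch into a polygon.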

datadog

Making fetch happen: Building a general-purpose query and render scheduler | Datadog

Datadog replaced its complex, dashboard-specific scheduling system with a generalized, modular query and render scheduler to improve performance across all its web applications. By simplifying query heuristics and leveraging the Browser Scheduling API for renders, the engineering team achieved a more stable backend load and smoother UI interactions. This transition transformed a brittle set of rules into a scalable framework that optimizes resource utilization based on widget visibility and browser availability.

## Limitations of Legacy Scheduling

The original scheduling system was a complex web of over 20 interlinked heuristics that became difficult for developers to maintain or reason about. While it performed better than an unscheduled baseline, it suffered from several structural flaws:

* **Tight Coupling:** Query and render logic were unnecessarily linked; for example, fetches were sometimes delayed based on pending render tasks, even when throttling fetches wasn’t necessary.
* **Lack of Generalization:** The system was hardcoded specifically for dashboards, making it impossible to use the same optimization benefits for other widget-heavy products in the Datadog suite.
* **Inefficient Resource Management:** Renders were often delayed based on arbitrary data size rules rather than the actual real-time availability of the browser's CPU and memory resources.

## A Simplified Query Algorithm

To create a more predictable and efficient system, the team stripped away redundant rules (such as manual throttling for unfocused tabs, which modern browsers already handle) and moved to a streamlined query model. The new algorithm is governed by only six parameters; its key behaviors and outcomes are:

* **Visibility Priority:** Fetches for widgets currently visible in the viewport are executed immediately to ensure a responsive user experience.
* **Fixed Time Windows:** Non-visible queries are ranked by enqueue time and processed in 2000ms windows with a limit of 10 tasks per window.
* **Error Reduction:** The more stable distribution of tasks significantly reduced "429 (Too many requests)" errors, leading to faster overall data loading since fewer retries are required.
* **Framework Integration:** This simplified logic was moved into a standard data-fetching framework, allowing any Datadog product using generalized components to benefit from the scheduler.

## Render Scheduling with the Browser Scheduling API

While the query scheduler handles data fetching, a separate render scheduler manages the impact on the browser’s main thread. By moving away from legacy heuristics and adopting the Browser Scheduling API, Datadog can now schedule tasks based on native browser priorities:

* **Prioritization:** The API allows developers to categorize tasks as `user-blocking`, `user-visible`, or `background`, ensuring the browser prioritizes critical UI updates while deferring heavy computations to idle periods.
* **Resource Awareness:** Unlike the old system, this API is natively aware of CPU and memory pressure, allowing the browser to manage execution timing more effectively than a JavaScript-based heuristic.
* **Future-Proofing:** Currently supported in Chromium and Firefox Nightly (with polyfills for others), this approach allows for mass updates to task priorities and the ability to abort stale tasks via `TaskController`.

Standardizing on a modular scheduling architecture allows engineering teams to optimize both network traffic and main-thread performance without the maintenance overhead of complex, custom rule sets. For high-density data applications, leveraging native browser APIs for task prioritization is recommended to ensure smooth rendering across varying hardware capabilities.
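
The query rules summarized above (visible widgets fetch immediately; non-visible queries are ordered by enqueue time and released in 2000ms windows of at most 10) can be sketched as a small planning function. This is an illustrative model in Python, not Datadog's browser implementation. The `Query` type and `plan_dispatch` function are hypothetical. The sketch also assumes that the first window opens immediately and that visible fetches do not count against the per-window limit, neither of which is stated in the summary.

```python
from dataclasses import dataclass

WINDOW_MS = 2000      # window length stated in the text
MAX_PER_WINDOW = 10   # per-window task limit stated in the text


@dataclass(frozen=True)
class Query:
    widget_id: str
    visible: bool
    enqueued_ms: int


def plan_dispatch(queries, now_ms=0):
    """Return {widget_id: dispatch time in ms} for a batch of queued queries."""
    plan = {}
    hidden = []
    for query in queries:
        if query.visible:
            plan[query.widget_id] = now_ms  # visible widgets fetch right away
        else:
            hidden.append(query)

    # Non-visible queries: oldest first, ten per 2000 ms window.
    hidden.sort(key=lambda query: query.enqueued_ms)
    for position, query in enumerate(hidden):
        window_index = position // MAX_PER_WINDOW
        plan[query.widget_id] = now_ms + window_index * WINDOW_MS
    return plan


queries = [Query("visible-a", True, 5), Query("visible-b", True, 9)]
queries += [Query(f"hidden-{i}", False, enqueued_ms=i) for i in range(25)]
plan = plan_dispatch(queries)
print(plan["visible-a"], plan["hidden-0"], plan["hidden-10"], plan["hidden-24"])
# prints: 0 0 2000 4000
```

Spreading the 25 hidden queries over three windows is what keeps the request rate flat, which is the mechanism the summary credits for the drop in 429 responses.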