datadog

2023-03-08 incident: A deep dive into the platform-level impact | Datadog (opens in new tab)

The March 2023 Datadog outage was triggered by a simultaneous, global failure across multiple cloud providers and regions, caused by an unexpected interaction between a systemd security patch and Ubuntu 22.04’s default networking behavior. While Datadog typically employs rigorous, staged rollouts for infrastructure changes, the automated nature of OS-level security updates bypassed these controls. The incident highlights the hidden risks in system-level defaults and the potential for "unattended upgrades" to create synchronized failures across supposedly isolated environments.

The systemd-networkd Routing Change

  • In December 2020, systemd version 248 introduced a change where systemd-networkd flushes all IP routing rules it does not recognize upon startup.
  • Version 249 introduced the ManageForeignRoutingPolicyRules setting, which defaults to "yes," confirming this management behavior for any rules not explicitly defined in systemd configuration files.
  • These changes were backported to earlier versions (v247 and v248) but were notably absent from v245, the version used in Ubuntu 20.04.

Dormant Risks in the Ubuntu 22.04 Migration

  • Datadog began migrating its fleet from Ubuntu 20.04 to 22.04 in late 2022, eventually reaching 90% coverage across its infrastructure.
  • Ubuntu 22.04 utilizes systemd v249, meaning the majority of the fleet was susceptible to the routing rule flushing behavior.
  • The risk remained dormant during the initial rollout because systemd-networkd typically only starts during the initial boot sequence when no complex routing rules have been established yet.

The Trigger: Unattended Upgrades and the CVE Patch

  • On March 7, 2023, a security patch for a systemd CVE was released to the Ubuntu security repositories.
  • Datadog’s fleet used the Ubuntu default configuration for unattended-upgrades, which automatically installs security-labeled patches once a day, typically between 06:00 and 07:00 UTC.
  • The installation of the patch forced a restart of the systemd-networkd service on active, running nodes.
  • Upon restarting, the service identified existing IP routing rules (crucial for container networking) as "foreign" and deleted them, effectively severing network connectivity for the nodes.

Failure of Regional Isolation

  • Because the security patch was released globally and the automated upgrade window was synchronized across regions, the failure occurred nearly simultaneously worldwide.
  • This automation bypassed Datadog’s standard practice of "baking" changes in staging and experimental clusters for weeks before proceeding to production.
  • Nodes on the older Ubuntu 20.04 (systemd v245) were unaffected by the patch, as that version of systemd does not flush IP rules upon a service restart.

To mitigate similar risks, infrastructure teams should consider explicitly disabling the management of foreign routing rules in systemd-networkd configuration when using third-party networking plugins. Furthermore, while automated security patching is a best practice, organizations must balance the speed of patching with the need for controlled, staged rollouts to prevent global configuration drift or synchronized failures.