datadog The March 2023 Datadog outage was triggered by a simultaneous, global failure across multiple cloud providers and regions, caused by an unexpected interaction between a systemd security patch and Ubuntu 22.04’s default networking behavior. While Datadog typically employs rigorous, staged rollouts for infrastructure changes, the automated nature of OS-level security updates bypassed these controls. The incident highlights the hidden risks in system-level defaults and the potential for "unattended upgrades" to create synchronized failures across supposedly isolated environments.
## The systemd-networkd Routing Change
* In December 2020, systemd version 248 introduced a change where `systemd-networkd` flushes all IP routing rules it does not recognize upon startup.
* Version 249 introduced the `ManageForeignRoutingPolicyRules` setting, which defaults to "yes," confirming this management behavior for any rules not explicitly defined in systemd configuration files.
* These changes were backported to earlier versions (v247 and v248) but were notably absent from v245, the version used in Ubuntu 20.04.
## Dormant Risks in the Ubuntu 22.04 Migration
* Datadog began migrating its fleet from Ubuntu 20.04 to 22.04 in late 2022, eventually reaching 90% coverage across its infrastructure.
* Ubuntu 22.04 utilizes systemd v249, meaning the majority of the fleet was susceptible to the routing rule flushing behavior.
* The risk remained dormant during the initial rollout because `systemd-networkd` typically only starts during the initial boot sequence when no complex routing rules have been established yet.
## The Trigger: Unattended Upgrades and the CVE Patch
* On March 7, 2023, a security patch for a systemd CVE was released to the Ubuntu security repositories.
* Datadog’s fleet used the Ubuntu default configuration for `unattended-upgrades`, which automatically installs security-labeled patches once a day, typically between 06:00 and 07:00 UTC.
* The installation of the patch forced a restart of the `systemd-networkd` service on active, running nodes.
* Upon restarting, the service identified existing IP routing rules (crucial for container networking) as "foreign" and deleted them, effectively severing network connectivity for the nodes.
## Failure of Regional Isolation
* Because the security patch was released globally and the automated upgrade window was synchronized across regions, the failure occurred nearly simultaneously worldwide.
* This automation bypassed Datadog’s standard practice of "baking" changes in staging and experimental clusters for weeks before proceeding to production.
* Nodes on the older Ubuntu 20.04 (systemd v245) were unaffected by the patch, as that version of systemd does not flush IP rules upon a service restart.
To mitigate similar risks, infrastructure teams should consider explicitly disabling the management of foreign routing rules in systemd-networkd configuration when using third-party networking plugins. Furthermore, while automated security patching is a best practice, organizations must balance the speed of patching with the need for controlled, staged rollouts to prevent global configuration drift or synchronized failures.