2023-03-08 incident: A deep dive into the platform-level impact | Datadog (opens in new tab)
The March 2023 Datadog outage was triggered by a simultaneous, global failure across multiple cloud providers and regions, caused by an unexpected interaction between a systemd security patch and Ubuntu 22.04’s default networking behavior. While Datadog typically employs rigorous, staged rollouts for infrastructure changes, the automated nature of OS-level security updates bypassed these controls. The incident highlights the hidden risks in system-level defaults and the potential for "unattended upgrades" to create synchronized failures across supposedly isolated environments.
The systemd-networkd Routing Change
- In December 2020, systemd version 248 introduced a change where
systemd-networkdflushes all IP routing rules it does not recognize upon startup. - Version 249 introduced the
ManageForeignRoutingPolicyRulessetting, which defaults to "yes," confirming this management behavior for any rules not explicitly defined in systemd configuration files. - These changes were backported to earlier versions (v247 and v248) but were notably absent from v245, the version used in Ubuntu 20.04.
Dormant Risks in the Ubuntu 22.04 Migration
- Datadog began migrating its fleet from Ubuntu 20.04 to 22.04 in late 2022, eventually reaching 90% coverage across its infrastructure.
- Ubuntu 22.04 utilizes systemd v249, meaning the majority of the fleet was susceptible to the routing rule flushing behavior.
- The risk remained dormant during the initial rollout because
systemd-networkdtypically only starts during the initial boot sequence when no complex routing rules have been established yet.
The Trigger: Unattended Upgrades and the CVE Patch
- On March 7, 2023, a security patch for a systemd CVE was released to the Ubuntu security repositories.
- Datadog’s fleet used the Ubuntu default configuration for
unattended-upgrades, which automatically installs security-labeled patches once a day, typically between 06:00 and 07:00 UTC. - The installation of the patch forced a restart of the
systemd-networkdservice on active, running nodes. - Upon restarting, the service identified existing IP routing rules (crucial for container networking) as "foreign" and deleted them, effectively severing network connectivity for the nodes.
Failure of Regional Isolation
- Because the security patch was released globally and the automated upgrade window was synchronized across regions, the failure occurred nearly simultaneously worldwide.
- This automation bypassed Datadog’s standard practice of "baking" changes in staging and experimental clusters for weeks before proceeding to production.
- Nodes on the older Ubuntu 20.04 (systemd v245) were unaffected by the patch, as that version of systemd does not flush IP rules upon a service restart.
To mitigate similar risks, infrastructure teams should consider explicitly disabling the management of foreign routing rules in systemd-networkd configuration when using third-party networking plugins. Furthermore, while automated security patching is a best practice, organizations must balance the speed of patching with the need for controlled, staged rollouts to prevent global configuration drift or synchronized failures.