AWS Interconnect is now generally available, with a new option to simplify last-mile connectivity Today, we’re announcing the general availability of AWS Interconnect – multicloud, a managed private connectivity service that connects your Amazon Virtual Private Cloud (Amazon VPC…
Laurent Bernaille On March 8, 2023, Datadog experienced an outage that affected all services across multiple regions. In a separate blog post we describe how our products were impacted. This event was highly unusual from the start. An outage that affects services across multiple…
The March 2023 Datadog outage was triggered by a simultaneous, global failure across multiple cloud providers and regions, caused by an unexpected interaction between a systemd security patch and Ubuntu 22.04’s default networking behavior. While Datadog typically employs rigorous, staged rollouts for infrastructure changes, the automated nature of OS-level security updates bypassed these controls. The incident highlights the hidden risks in system-level defaults and the potential for "unattended upgrades" to create synchronized failures across supposedly isolated environments.
## The systemd-networkd Routing Change
* In December 2020, systemd version 248 introduced a change where `systemd-networkd` flushes all IP routing rules it does not recognize upon startup.
* Version 249 introduced the `ManageForeignRoutingPolicyRules` setting, which defaults to "yes," confirming this management behavior for any rules not explicitly defined in systemd configuration files.
* These changes were backported to earlier versions (v247 and v248) but were notably absent from v245, the version used in Ubuntu 20.04.
## Dormant Risks in the Ubuntu 22.04 Migration
* Datadog began migrating its fleet from Ubuntu 20.04 to 22.04 in late 2022, eventually reaching 90% coverage across its infrastructure.
* Ubuntu 22.04 utilizes systemd v249, meaning the majority of the fleet was susceptible to the routing rule flushing behavior.
* The risk remained dormant during the initial rollout because `systemd-networkd` typically only starts during the initial boot sequence when no complex routing rules have been established yet.
## The Trigger: Unattended Upgrades and the CVE Patch
* On March 7, 2023, a security patch for a systemd CVE was released to the Ubuntu security repositories.
* Datadog’s fleet used the Ubuntu default configuration for `unattended-upgrades`, which automatically installs security-labeled patches once a day, typically between 06:00 and 07:00 UTC.
* The installation of the patch forced a restart of the `systemd-networkd` service on active, running nodes.
* Upon restarting, the service identified existing IP routing rules (crucial for container networking) as "foreign" and deleted them, effectively severing network connectivity for the nodes.
## Failure of Regional Isolation
* Because the security patch was released globally and the automated upgrade window was synchronized across regions, the failure occurred nearly simultaneously worldwide.
* This automation bypassed Datadog’s standard practice of "baking" changes in staging and experimental clusters for weeks before proceeding to production.
* Nodes on the older Ubuntu 20.04 (systemd v245) were unaffected by the patch, as that version of systemd does not flush IP rules upon a service restart.
To mitigate similar risks, infrastructure teams should consider explicitly disabling the management of foreign routing rules in systemd-networkd configuration when using third-party networking plugins. Furthermore, while automated security patching is a best practice, organizations must balance the speed of patching with the need for controlled, staged rollouts to prevent global configuration drift or synchronized failures.