2023-03-08 incident: A deep dive into the platform-level recovery | Datadog
Following a massive system-wide outage in March 2023, Datadog restored its EU1 region after identifying that a simple node reboot could resolve the network connectivity issues caused by a faulty system patch. While the team restored 100 percent of compute capacity within hours, the recovery effort was subsequently hindered by cloud provider infrastructure limits and IP address exhaustion. This post-mortem highlights the complexities of scaling hierarchical Kubernetes environments under extreme pressure and the importance of accounting for "black swan" capacity requirements.

## Hierarchical Kubernetes Recovery

Datadog runs a strict hierarchy of Kubernetes clusters to manage its infrastructure, which necessitated a granular, three-tiered recovery approach. Because the outage broke network connectivity via `systemd-networkd`, the team had to restore components in a specific order to regain control of the environment.

* **Parent Control Planes:** Engineers first rebooted the virtual machines hosting the parent clusters, which manage the control planes for all other clusters.
* **Child Control Planes:** Once the parent clusters were stable, the team restored the control planes of the application clusters, which run as pods within the parent infrastructure.
* **Application Worker Nodes:** Thousands of worker nodes across dozens of clusters were restarted progressively to avoid overwhelming the control planes, reaching full capacity by 12:05 UTC.

## Scaling Bottlenecks and Cloud Quotas

Once the infrastructure was back online, the team attempted to scale out rapidly to process a massive backlog of buffered data. The surge in demand ran into limits in the Google Cloud environment that the platform had never hit before.

* **VPC Peering Limits:** At 14:18 UTC, the platform hit a documented but overlooked limit of 15,500 VM instances within a single network peering group, blocking all further scaling.
* **Provider Intervention:** Datadog worked directly with Google Cloud support to manually raise the peering group limit, which allowed scaling to resume after a nearly four-hour delay.

## IP Address and Subnet Capacity

Even after the cloud-level instance quotas were lifted, high-traffic clusters processing logs and traces hit a secondary bottleneck in internal networking.

* **Subnet Exhaustion:** These clusters attempted to scale to more than twice their normal size, quickly exhausting all available IP addresses in their assigned subnets.
* **Capacity Planning Gaps:** Datadog typically targets a maximum of 66% IP usage so that a 50% scale-out remains possible, but the extreme demands of the recovery backlog exceeded these safety margins (a worked example of this sizing rule follows the summary below).
* **Impact on Backlog:** For six hours, the lack of available IPs forced these clusters to process data significantly more slowly than the rest of the recovered infrastructure.

## Recovery Summary

The EU1 recovery demonstrates that even when hardware is functional, software-defined limits can create cascading delays. Organizations should not only monitor their own resource usage but also maintain visibility into cloud provider quotas, and they should ensure that subnet allocations account for extreme recovery scenarios in which workloads may temporarily need to double or triple in size.
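
To make the IP capacity-planning rule above concrete, here is a minimal sketch of the arithmetic: keeping steady-state IP usage at or below roughly 66% of a subnet leaves room for a 50% scale-out. The `/19` subnet size and the one-IP-per-node assumption are illustrative choices for this example, not values taken from the post-mortem.

```python
import ipaddress

def max_steady_state_nodes(cidr: str, scale_out_factor: float) -> int:
    """Largest steady-state node count that still leaves room for a
    scale-out of `scale_out_factor` (e.g. 1.5 for +50%) within the subnet,
    assuming one IP per node (illustrative assumption)."""
    # Total addresses in the subnet; real cloud providers reserve a handful
    # of addresses per subnet, which is ignored here for simplicity.
    usable_ips = ipaddress.ip_network(cidr).num_addresses
    return int(usable_ips / scale_out_factor)

# Hypothetical example: a /19 subnet (8,192 addresses) with a +50% budget.
# 8,192 / 1.5 ≈ 5,461 nodes, i.e. roughly 66% steady-state utilization,
# which matches the sizing target described in the post-mortem.
if __name__ == "__main__":
    steady_state = max_steady_state_nodes("10.0.0.0/19", 1.5)
    print(f"steady-state ceiling: {steady_state} nodes")
    print(f"utilization at ceiling: {steady_state / 8192:.0%}")
```

The same arithmetic shows why the recovery exhausted the subnets: clusters that try to grow to more than twice their normal size need a scale-out factor of 2 or higher, which implies a steady-state utilization target closer to 50% or below.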