
2023-03-08 incident: A deep dive into our incident response | Datadog


Datadog’s first global outage, on March 8, 2023, served as a stress test for its established incident response framework and "you build it, you own it" philosophy. The outage was triggered by a systemic failure during a routine systemd upgrade, but the company's commitment to blameless culture and decentralized engineering autonomy allowed hundreds of responders to coordinate a complex recovery across multiple regions. The event validated Datadog's investment in out-of-band monitoring and in incident training refreshed twice a year as essential components for managing large-scale system failures.

Incident Response Structure and Philosophy

Datadog employs a decentralized "you build it, you own it" model where individual engineering teams are responsible for the 24/7 health and monitoring of the services they build.
For high-severity incidents, a specialized rotation is paged, consisting of an Incident Commander to lead the response, a communications lead, and a customer liaison to manage external messaging.
The organization prioritizes "people over process," empowering engineers to use their judgment to find creative solutions rather than following rigid, pre-written playbooks that may not apply to unprecedented failures.
A blameless culture is strictly maintained across all levels of the company, ensuring that post-incident investigations focus on systemic improvements rather than assigning fault to individuals.

Multi-Layered Monitoring Strategy

Standard telemetry provides internal visibility, but Datadog also maintains "out-of-band" monitoring that operates completely outside its own infrastructure.
This out-of-band system interacts with Datadog APIs exactly like a customer would, ensuring that engineers are alerted even if the internal monitoring platform itself becomes unavailable.
Communication is streamlined through a dedicated Slack incident app that automatically generates coordination channels, providing situational awareness to any engineer who joins the effort.
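
The out-of-band idea described above can be sketched as a small probe that runs on infrastructure separate from the system it watches and calls a public API the way a customer would. This is a minimal illustration, not Datadog's implementation: the URL, the `probe` and `should_page` helpers, and the three-failure threshold are all hypothetical.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical endpoint: a placeholder, not a real Datadog URL.
PROBE_URL = "https://api.example.com/health"


@dataclass
class ProbeResult:
    ok: bool
    status: Optional[int]
    latency_s: float
    error: Optional[str] = None


def http_get_status(url: str, timeout_s: float) -> int:
    """Fetch the URL over the public network and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return resp.status


def probe(
    url: str,
    fetch: Callable[[str, float], int] = http_get_status,
    timeout_s: float = 5.0,
) -> ProbeResult:
    """Call the API as an external customer would and record the outcome."""
    start = time.monotonic()
    try:
        status = fetch(url, timeout_s)
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx/5xx status.
        return ProbeResult(False, exc.code, time.monotonic() - start, str(exc))
    except Exception as exc:
        # Timeouts, DNS failures, refused connections: no usable answer at all.
        return ProbeResult(False, None, time.monotonic() - start, repr(exc))
    return ProbeResult(200 <= status < 300, status, time.monotonic() - start)


def should_page(results: List[ProbeResult], threshold: int = 3) -> bool:
    """Page only when the most recent `threshold` probes all failed."""
    if len(results) < threshold:
        return False
    return all(not r.ok for r in results[-threshold:])
```

The `fetch` parameter is injectable so the probe logic can be exercised without network access. In a real deployment, the scheduler and the paging path would also need to live outside the monitored platform, otherwise the probe shares its failure modes.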

Anatomy of the March 8 Outage

The outage began at 06:00 UTC, triggered by a systemd upgrade that caused widespread Kubernetes failures and prevented pods from restarting correctly.
The global nature of the outage was diagnosed within 32 minutes of the initial monitoring alerts, leading to the activation of executive on-calls and the customer support management team.
Responders identified "unattended upgrades" (automatic package updates, which had applied the systemd upgrade) as the incident trigger approximately five and a half hours after the initial failure.
Recovery was executed in stages: compute capacity was restored first in the EU1 region, followed by the US1 region, with full infrastructure restoration completed by 19:00 UTC.
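
The durations implied by the timeline above can be worked out with a short calculation. This sketch uses only the times stated in this summary. The trigger-identification time is approximate because the summary gives it as "approximately five and a half hours". The 32-minute diagnosis figure is left out because it is measured from the first monitoring alerts, whose timestamp is not given here.

```python
from datetime import datetime, timedelta, timezone

# Times stated in the summary (UTC, 2023-03-08).
OUTAGE_START = datetime(2023, 3, 8, 6, 0, tzinfo=timezone.utc)
INFRA_RESTORED = datetime(2023, 3, 8, 19, 0, tzinfo=timezone.utc)

# "Approximately five and a half hours" after the initial failure.
TRIGGER_FOUND_AFTER = timedelta(hours=5, minutes=30)


def fmt(delta: timedelta) -> str:
    """Format a timedelta as hours and minutes."""
    hours, minutes = divmod(int(delta.total_seconds() // 60), 60)
    return f"{hours}h {minutes:02d}m"


trigger_found_at = OUTAGE_START + TRIGGER_FOUND_AFTER
time_to_restore = INFRA_RESTORED - OUTAGE_START

print("Trigger identified at about", trigger_found_at.strftime("%H:%M UTC"))
print("Start to full infrastructure restoration:", fmt(time_to_restore))
```

Under these figures, the trigger was identified at roughly 11:30 UTC, and infrastructure restoration took 13 hours from the start of the outage.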

Organizations should treat incident response as a perishable skill that requires constant practice through a low threshold for declaring incidents and regular training. By combining out-of-band monitoring with a culture that empowers individual engineers to act autonomously during a crisis, teams can more effectively navigate the "not if, but when" reality of large-scale system failures.