Failure is inevitable: Learning from a large outage, and building for reliability in depth at Datadog
Laura de Vesine, Rob Thomas, Maciej Kowalewski
In March 2023, Datadog experienced a rare, widespread incident that left large parts of our infrastructure only partially functional, but from a customer’s perspective, our platform looked completely down. This square-wave failure pat…