Breaking up a monolith: How we’re unwinding a shared database at scale | Datadog
Debug PostgreSQL query latency faster with EXPLAIN ANALYZE in Datadog Database Monitoring
Cilium configuration for Kubernetes operations at scale
Following a massive system-wide outage in March 2023, Datadog successfully restored its EU1 region by identifying that a simple node reboot could resolve network connectivity issues caused by a faulty system patch. While the team managed to restore 100 percent of compute capacity within hours, the recovery effort was subsequently hindered by cloud provider infrastructure limits and IP address exhaustion. This post-mortem highlights the complexities of scaling hierarchical Kubernetes environments under extreme pressure and the importance of accounting for "black swan" capacity requirements.

## Hierarchical Kubernetes Recovery

Datadog utilizes a strict hierarchy of Kubernetes clusters to manage its infrastructure, which necessitated a granular, three-tiered recovery approach. Because the outage affected network connectivity via `systemd-networkd`, the team had to restore components in a specific order to regain control of the environment.

* **Parent Control Planes:** Engineers first rebooted the virtual machines hosting the parent clusters, which manage the control planes for all other clusters.
* **Child Control Planes:** Once parent clusters were stable, the team restored the control planes for application clusters, which run as pods within the parent infrastructure.
* **Application Worker Nodes:** Thousands of worker nodes across dozens of clusters were restarted progressively to avoid overwhelming the control planes, reaching full capacity by 12:05 UTC.

## Scaling Bottlenecks and Cloud Quotas

Once the infrastructure was online, the team attempted to scale out rapidly to process a massive backlog of buffered data. This surge in demand triggered previously unencountered limitations within the Google Cloud environment.

* **VPC Peering Limits:** At 14:18 UTC, the platform hit a documented but overlooked limit of 15,500 VM instances within a single network peering group, blocking all further scaling.
* **Provider Intervention:** Datadog worked directly with Google Cloud support to manually raise the peering group limit, which allowed scaling to resume after a nearly four-hour delay.

## IP Address and Subnet Capacity

Even after cloud-level instance quotas were lifted, specific high-traffic clusters processing logs and traces hit a secondary bottleneck related to internal networking.

* **Subnet Exhaustion:** These clusters attempted to scale to more than twice their normal size, quickly exhausting all available IP addresses in their assigned subnets.
* **Capacity Planning Gaps:** While Datadog typically targets a 66% maximum IP usage to allow for a 50% scale-out, the extreme demands of the recovery backlog exceeded these safety margins.
* **Impact on Backlog:** For six hours, the lack of available IPs forced these clusters to process data significantly slower than the rest of the recovered infrastructure.

## Recovery Summary

The EU1 recovery demonstrates that even when hardware is functional, software-defined limits can create cascading delays. Organizations should not only monitor their own resource usage but also maintain visibility into cloud provider quotas and ensure that subnet allocations account for extreme recovery scenarios where workloads may need to double or triple in size momentarily.
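The subnet headroom figures in the IP capacity section follow from simple arithmetic: a subnet that is 66% full can grow by a factor of at most 1/0.66 (about 1.5x), so a 50% scale-out fits but a doubling does not. The sketch below works through that calculation in Python. The function names, the `10.0.0.0/20` subnet, and the address counts are hypothetical values chosen for illustration, not figures from the incident, and the check ignores the handful of addresses a cloud provider typically reserves per subnet.

```python
import ipaddress


def max_scale_factor(usage_fraction: float) -> float:
    """Largest multiple of the current size that still fits in the subnet."""
    if not 0 < usage_fraction <= 1:
        raise ValueError("usage_fraction must be in (0, 1]")
    return 1.0 / usage_fraction


def max_usage_for_scale(scale: float) -> float:
    """Highest steady-state usage that still leaves room to grow by `scale`."""
    if scale < 1:
        raise ValueError("scale must be >= 1")
    return 1.0 / scale


def fits_after_scaling(cidr: str, ips_in_use: int, scale: float) -> bool:
    """Check whether `ips_in_use * scale` addresses fit in the subnet.

    Uses the raw address count as an upper bound; real subnets lose a
    few addresses to provider reservations.
    """
    total = ipaddress.ip_network(cidr).num_addresses
    return ips_in_use * scale <= total


if __name__ == "__main__":
    # A 66% target leaves room for roughly a 1.5x scale-out.
    print(round(max_scale_factor(0.66), 3))
    # Hypothetical /20 subnet (4096 addresses) at about 66% usage.
    print(fits_after_scaling("10.0.0.0/20", 2703, 1.5))
    print(fits_after_scaling("10.0.0.0/20", 2703, 2.0))
    # Room to double requires steady-state usage of at most 50%.
    print(max_usage_for_scale(2.0))
```

Read the other way around, planning for a recovery that doubles or triples a cluster means holding steady-state usage at or below 50% or about 33% of the subnet, respectively.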
Managing Datadog with Terraform
Engineering VP spotlight: Ivo Dimitrov
Understanding data lineage
Route your monitor alerts with Datadog monitor notification rules
Patterns for safe and efficient cache purging in CI/CD pipelines
Following a major 2023 incident that caused a near-total platform outage despite partial infrastructure availability, Datadog shifted its engineering philosophy from "never-fail" architectures to a model of graceful degradation. The company identified that prioritizing absolute data correctness during systemic stress created "square-wave" failures, where the entire platform appeared down if even a portion of data was missing. By moving toward a "fail better" mindset, Datadog now focuses on maintaining core functionality and data persistence even when underlying infrastructure is compromised.

## Limitations of the Never-Fail Approach

* Classical root-cause analysis focused on a legacy, unsupervised global update mechanism that disconnected 50–60% of production Kubernetes nodes.
* While the "precipitating event" was easily identified and disabled, the engineering team realized that fixing the trigger did not address the systemic fragility that caused a binary (up/down) failure pattern.
* Prioritizing absolute accuracy meant that systems would wait for all data tags to process before displaying results; under stress, this caused the UI to show no data at all rather than "almost correct" data.
* Sequential queuing, aggressive retry logic, and node-specific processing requirements exacerbated the bottleneck, preventing real-time recovery.

## Prioritizing Graceful Degradation

* The incident prompted a shift away from relying solely on redundancy to prevent outages, acknowledging that some level of failure is eventually inevitable at scale.
* Engineering priorities were redefined to ensure that data is never lost (even if delayed) and that real-time data is processed before stale backlogs.
* The platform now aims to serve partial-but-accurate results to customers during an incident, providing visibility rather than a complete blackout.
* Implementation is handled as a company-wide program where individual product teams adapt these principles to their specific architectural needs.

## Strengthening Data Persistence at Intake

* Analysis revealed that data was lost during the outage because it was stored in memory or on local disks before being replicated to persistent stores.
* The original design favored low-latency responses by acknowledging receipt of data before it was fully replicated, making that data unrecoverable if the node failed.
* Downstream failures caused intake nodes to overflow their local buffers, leading to data loss even on nodes that remained online.
* New architectural changes focus on implementing disk-based persistence at the very beginning of the processing pipeline to ensure data survives node restarts and downstream congestion.

To build truly resilient systems, engineering teams must move beyond trying to prevent every possible failure trigger. Instead, focus on designing services that can survive partial infrastructure loss by prioritizing data persistence and allowing for degraded states that still provide value to the end user.
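The intake change described above, persisting data to disk before acknowledging it, can be illustrated with a small append-only log. This is a minimal Python sketch of the general persist-before-acknowledge pattern, not Datadog's implementation: the `DurableIntakeBuffer` class, the `replay` helper, and the JSON-lines file format are all assumptions made for the example.

```python
import json
import os


class DurableIntakeBuffer:
    """Hypothetical intake buffer that acknowledges only after a durable write."""

    def __init__(self, path: str) -> None:
        self._path = path
        # Append mode: records from before a restart are preserved.
        self._file = open(path, "ab")

    def accept(self, payload: dict) -> bool:
        """Write one record to disk, then acknowledge it.

        Returning True is the acknowledgment; it happens only after the
        bytes have been flushed and fsync'ed, so an acknowledged record
        survives a process restart.
        """
        line = json.dumps(payload).encode("utf-8") + b"\n"
        self._file.write(line)
        self._file.flush()              # push Python's buffer to the OS
        os.fsync(self._file.fileno())   # ask the OS to write to the device
        return True

    def close(self) -> None:
        self._file.close()


def replay(path: str) -> list:
    """Read back every persisted record, for example after a restart."""
    with open(path, "rb") as f:
        return [json.loads(line) for line in f if line.strip()]
```

The trade-off is the one the summary names: acknowledging only after `fsync` adds latency to every request, in exchange for records that remain recoverable when the node restarts or downstream consumers fall behind.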
Datadog’s Continuous Profiler timeline view addresses the challenge of diagnosing performance bottlenecks in production by providing a granular, time-sequenced visualization of code execution. By correlating thread activity with resource consumption, it enables engineers to move beyond high-level metrics and identify the exact lines of code responsible for latency spikes or CPU saturation. This visibility ensures that teams can optimize application performance and resolve complex runtime issues without the overhead of manual reproduction.

### Visualizing Thread Activity and CPU Utilization

* The timeline view displays a breakdown of thread states, allowing developers to distinguish between "Running," "Runnable," "Blocked," and "Waiting" statuses.
* By comparing wall time (total elapsed time) against CPU time (active processing), users can identify if a process is bottlenecked by intensive calculations or external dependencies.
* Hovering over specific time slices reveals the associated stack traces, providing immediate context into which functions were active during a performance anomaly.

### Detecting Garbage Collection and Runtime Overhead

* The profiler highlights runtime-specific events, such as Garbage Collection (GC) pauses, directly within the execution timeline.
* This correlation allows teams to see if a spike in latency was caused by "Stop-the-World" events or inefficient memory allocation patterns that trigger frequent GC cycles.
* By visualizing these events alongside application logic, engineers can determine whether to optimize their code or tune the underlying runtime configuration.

### Correlating Profiling Data with Distributed Traces

* The timeline view integrates with Application Performance Monitoring (APM) to link specific slow traces to their corresponding profile data.
* This "trace-to-profile" workflow allows developers to pivot from a high-latency request directly to the exact thread behavior occurring at that moment.
* This integration eliminates guesswork when investigating "P99" latency outliers, as it shows exactly where time was spent, whether on lock contention, I/O wait, or complex algorithmic execution.

### Streamlining Production Troubleshooting

* The tool enables a proactive approach to performance management by identifying "silent" inefficiencies that do not necessarily trigger errors but degrade the user experience.
* Using the timeline view during post-mortem investigations provides a factual record of thread behavior, reducing the Mean Time to Resolution (MTTR) for intermittent production issues.

For organizations running high-scale distributed systems, adopting a continuous profiling strategy with a focus on timeline analysis is recommended. This approach transforms observability from simple monitoring into a deep diagnostic capability, allowing for precise optimizations that lower infrastructure costs and improve application responsiveness.
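The wall time versus CPU time comparison mentioned above can be reproduced by hand with Python's standard library, which helps build intuition for what the timeline view is showing. The sketch below is not how the Datadog profiler works internally; `measure`, `classify`, and the 0.5 threshold are hypothetical names and values chosen for illustration.

```python
import time
from typing import Callable, Tuple


def measure(fn: Callable[[], object]) -> Tuple[float, float]:
    """Run fn once and return (wall_seconds, cpu_seconds).

    time.perf_counter() tracks elapsed real time, including time spent
    sleeping or waiting. time.process_time() counts only CPU time used
    by the current process (user + system, summed across its threads).
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    fn()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu


def classify(wall: float, cpu: float, threshold: float = 0.5) -> str:
    """Label a measurement by its CPU-to-wall ratio.

    The threshold is an arbitrary illustrative cutoff. In a
    multi-threaded process the ratio can exceed 1.0.
    """
    ratio = cpu / wall if wall > 0 else 0.0
    return "cpu-bound" if ratio >= threshold else "waiting"


if __name__ == "__main__":
    # Sleeping consumes wall time but almost no CPU time.
    print(classify(*measure(lambda: time.sleep(0.1))))
    # A tight loop consumes CPU for most of its wall time.
    print(classify(*measure(lambda: sum(i * i for i in range(2_000_000)))))
```

A low ratio points at external dependencies such as I/O, locks, or downstream calls, while a ratio near 1.0 points at the computation itself.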