Datadog / database-design

5 posts

2023-03-08 incident: A deep dive into the platform-level recovery | Datadog

Following a massive system-wide outage in March 2023, Datadog restored its EU1 region after identifying that a simple node reboot could resolve the network connectivity issues caused by a faulty system patch. While the team restored 100 percent of compute capacity within hours, the recovery effort was subsequently hindered by cloud provider infrastructure limits and IP address exhaustion. This post-mortem highlights the complexities of scaling hierarchical Kubernetes environments under extreme pressure and the importance of accounting for "black swan" capacity requirements.

## Hierarchical Kubernetes Recovery

Datadog runs a strict hierarchy of Kubernetes clusters to manage its infrastructure, which necessitated a granular, three-tiered recovery approach. Because the outage affected network connectivity via `systemd-networkd`, the team had to restore components in a specific order to regain control of the environment.

* **Parent Control Planes:** Engineers first rebooted the virtual machines hosting the parent clusters, which manage the control planes for all other clusters.
* **Child Control Planes:** Once the parent clusters were stable, the team restored the control planes for the application clusters, which run as pods within the parent infrastructure.
* **Application Worker Nodes:** Thousands of worker nodes across dozens of clusters were restarted progressively to avoid overwhelming the control planes, reaching full capacity by 12:05 UTC.

## Scaling Bottlenecks and Cloud Quotas

Once the infrastructure was back online, the team attempted to scale out rapidly to process a massive backlog of buffered data. This surge in demand triggered previously unencountered limits within the Google Cloud environment.

* **VPC Peering Limits:** At 14:18 UTC, the platform hit a documented but overlooked limit of 15,500 VM instances within a single network peering group, blocking all further scaling.
* **Provider Intervention:** Datadog worked directly with Google Cloud support to manually raise the peering group limit, which allowed scaling to resume after a nearly four-hour delay.

## IP Address and Subnet Capacity

Even after the cloud-level instance quotas were lifted, specific high-traffic clusters processing logs and traces hit a secondary bottleneck in internal networking.

* **Subnet Exhaustion:** These clusters attempted to scale to more than twice their normal size, quickly exhausting all available IP addresses in their assigned subnets.
* **Capacity Planning Gaps:** While Datadog typically targets 66% maximum IP usage to leave room for a 50% scale-out, the extreme demands of the recovery backlog exceeded these safety margins.
* **Impact on Backlog:** For six hours, the lack of available IPs forced these clusters to process data significantly slower than the rest of the recovered infrastructure.

## Recovery Summary

The EU1 recovery demonstrates that even when hardware is functional, software-defined limits can create cascading delays. Organizations should not only monitor their own resource usage but also maintain visibility into cloud provider quotas and ensure that subnet allocations account for extreme recovery scenarios where workloads may need to double or triple in size temporarily.
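The capacity-planning rule above follows from simple arithmetic: if a subnet must absorb a given scale-out factor, the steady-state usage ceiling is `1 / (1 + factor)`. A minimal sketch of that relationship (the function name and scenarios are illustrative, not Datadog's actual tooling):

```go
package main

import "fmt"

// maxSteadyStateUsage returns the fraction of a subnet's IPs that can be
// consumed in steady state while still leaving room to scale out by
// scaleOutFactor (e.g. 0.5 for a 50% scale-out).
func maxSteadyStateUsage(scaleOutFactor float64) float64 {
	return 1.0 / (1.0 + scaleOutFactor)
}

func main() {
	// Datadog's stated target: ~66% usage leaves headroom for a 50% scale-out.
	fmt.Printf("50%% scale-out -> %.1f%% max usage\n", 100*maxSteadyStateUsage(0.5))
	// A recovery scenario where workloads double (a 100% scale-out) would
	// require capping steady-state usage at 50% instead.
	fmt.Printf("100%% scale-out -> %.1f%% max usage\n", 100*maxSteadyStateUsage(1.0))
}
```

This makes the post's point concrete: a subnet sized for a 50% burst simply cannot host a recovery backlog that demands clusters more than double in size.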

Our journey taking Kubernetes state metrics to the next level | Datadog

Datadog’s container observability team significantly improved the performance of kube-state-metrics (KSM) by contributing core architectural enhancements to the upstream open-source project. Faced with scalability bottlenecks in which metrics collection for large clusters took tens of seconds and generated massive data payloads, they revamped the underlying library to achieve a 15x improvement in processing duration. These contributions allowed for high-granularity monitoring at scale, ensuring that the Datadog Agent can efficiently handle millions of metrics across thousands of Kubernetes nodes.

### Challenges with KSM Scalability

* KSM uses the informer pattern to expose cluster-level metadata in the OpenMetrics format, but the volume of data grows rapidly with cluster size.
* In high-scale environments, a single node generates approximately nine metrics, while a single pod can generate up to 40 metrics.
* In clusters with thousands of nodes and tens of thousands of pods, the `/metrics` endpoint produced payloads weighing tens of megabytes.
* The time required to crawl these metrics often exceeded 15 seconds, forcing administrators to reduce check frequency and sacrifice real-time data granularity.

### Limitations of Legacy Implementations

* KSM v1 relied on a monolithic loop that instantiated a Builder to track resources via stores, but it lacked efficient hooks for metric generation.
* The original Python-based Datadog Agent check struggled with KSM's "data dump" approach, in which all metrics were processed at once at query time.
* To manage the load, Datadog was forced to split KSM into multiple deployments based on resource type (e.g., separate deployments for pods, nodes, and secondary resources such as services or deployments).
* This fragmentation made the infrastructure more complex to manage and did not solve the fundamental issue of inefficient metric serialization.

### Architectural Improvements in KSM v2.0

* Datadog collaborated with the upstream community during the development of KSM v2.0 to introduce a more extensible design.
* The team focused on improving the Builder and metric generation hooks to prevent the system from dumping the entire dataset at query time.
* By moving away from the restrictive v1 library structure, they enabled more efficient reconciliation of metric names and metadata joins.
* The resulting 15x performance gain allows the Datadog Agent to reconcile labels and tags, such as joining deployment labels to specific metrics, without the significant latency overhead previously experienced.

Contributing back to the open-source community proved more effective than maintaining internal forks for scaling Kubernetes infrastructure. Organizations running high-density clusters should prioritize upgrading to KSM v2.0 and optimizing their agent configurations to leverage these architectural improvements for better observability performance.
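The per-resource figures above (roughly nine metrics per node, up to 40 per pod) make it easy to see why large clusters overwhelm a single `/metrics` endpoint. A back-of-envelope sketch using those numbers (the cluster sizes in `main` are illustrative):

```go
package main

import "fmt"

// Approximate per-resource metric counts cited in the post.
const (
	metricsPerNode = 9
	metricsPerPod  = 40
)

// estimateSeries returns a rough count of the time series KSM exposes
// for a cluster of the given size, counting only nodes and pods.
func estimateSeries(nodes, pods int) int {
	return nodes*metricsPerNode + pods*metricsPerPod
}

func main() {
	// A large cluster: 3,000 nodes and 20,000 pods.
	series := estimateSeries(3000, 20000)
	// 27,000 node series + 800,000 pod series, before counting services,
	// deployments, and other secondary resources.
	fmt.Printf("~%d series per scrape\n", series)
}
```

At close to a million series per scrape, serializing the entire dataset at query time is what pushed payloads into the tens of megabytes and crawl times past 15 seconds.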

What product designers can learn from explanatory journalism | Datadog

Product designers can significantly improve their impact by adopting the techniques of explanatory journalism, which prioritizes deep context over the constant noise of new information. By shifting the focus from simply presenting features to explaining the "why" and "how" behind them, designers can better navigate the complex needs of various stakeholders. This approach fosters more rigorous decision-making and ensures that product solutions are grounded in a comprehensive understanding of the problem space.

### Prioritizing Impact Over Recency

* Designers often face a "newness bias" in which the latest support ticket or customer call carries disproportionate weight compared to long-term goals.
* To counteract this, designers should aggregate feedback from diverse sources, such as high-value customers and recurring requests, to identify and prioritize what is truly important rather than what is merely recent.
* Effective prioritization requires a centralized system to track the frequency and source of feedback, allowing for a more objective weighting of product requirements.

### Mitigating Context Collapse

* In a large organization, "context collapse" occurs when information is shared across different teams (Sales, Support, Research, Executives) without accounting for their unique perspectives or goals.
* A designer's role involves assembling disparate pieces of data, including interview notes, sales requirements, and executive goals, into a single, cohesive narrative.
* Beyond just presenting work, designers must frame their solutions specifically for each audience, explaining how the design addresses that audience's context or why certain requests were triaged out.

### Leveraging the Unlimited Design Papertrail

* The design process should cycle through "expansion," where research and data are gathered without space constraints, and "contraction," where that information is distilled into actionable insights.
* Developing a thorough "papertrail" of documentation helps the designer master the subject matter, making their eventual summaries more concise and authoritative.
* This documentation should include organized interview notes, categorized by job role and company size, as well as competitive research, to serve as a permanent "canon" for all design decisions.

To produce more effective work, designers should embrace the role of an "explainer" by meticulously documenting their research and expansion phases. Building a robust, regularly updated papertrail not only clarifies the designer's own thinking but also provides the necessary evidence to defend usability and interaction design choices in a fast-moving product environment.

How Datadog uses Datadog to gain visibility into the Datadog user experience | Datadog

Datadog leverages its own monitoring tools to bridge the gap between qualitative user interviews and quantitative performance data. By "dogfooding" features like Real User Monitoring (RUM) and Logs, the product design team makes evidence-based UI/UX adjustments while gaining firsthand empathy for the user experience. This approach allows them to identify exactly how users interact with specific components and where current designs fail to meet user expectations.

**Optimizing Font Consistency via CSS API Tracking**

* To ensure visual precision in information-dense views like the Log Explorer, the team needed to transition from a generic system font stack to a standardized monospace font.
* Designers used the `document.fonts` interface of the CSS Font Loading API, surfaced via Datadog RUM, to collect data on which fonts were actually being rendered on users' machines.
* By analyzing a dashboard of these results, the team selected Roboto Mono as the standard, ensuring the new font's optical size matched what the plurality of users were already seeing, so as not to break embedded tables.

**Simplifying Components through Interaction Logging**

* The `DraggablePane` component, used for resizing adjacent panels, suffered from UI clutter due to physical buttons for minimizing and maximizing content.
* The team implemented custom loggers within Datadog Logs to track how frequently users clicked these specific controls versus interacting with the draggable handle.
* The data revealed that the buttons were almost never used; consequently, the team removed them and replaced the functionality with a double-click event, significantly streamlining the interface.

**Refining Syntax Support through Error Analysis**

* When introducing the `DateRangePicker` for custom time frames, the team needed to expand the component's logic to support natural language strings.
* By aggregating "invalid inputs" in Datadog Logs, the team could see the exact strings users were typing, such as "last 2 weeks", that the system failed to parse.
* Analyzing these common patterns allowed the team to update the parsing logic for high-demand keywords, which cut the component's error rate from 10 percent to approximately 5 percent.

Leveraging internal monitoring tools allows design teams to move beyond guesswork and create highly functional interfaces. For organizations managing complex technical products, tracking specific component failures and interaction frequencies is an essential strategy for prioritizing the design roadmap and improving user retention.

Performance improvements in the Datadog Agent metrics pipeline | Datadog

Datadog engineers recently optimized the Datadog Agent's metric processing pipeline to achieve higher throughput and lower CPU overhead. After identifying metric context generation—the process of creating unique keys for metrics—as a primary bottleneck, they implemented a series of algorithmic changes and Go runtime optimizations. These improvements allow the Agent to process significantly more metrics using the same computational resources.

### Identifying Bottlenecks via CPU Profiling

* Developers used Go's native profiling tools to capture CPU usage during high-volume metric ingestion via DogStatsD.
* Flamegraph analysis revealed that the `addSample` and `trackContext` functions were the most CPU-intensive components of the pipeline.
* The profiling data specifically pointed to tag sorting and deduplication as the underlying operations consuming the most processing time.

### The Challenges of Metric Context Generation

* The Agent must generate a unique hash (context) for every metric received in order to address it within an in-memory hash table.
* To ensure the same metric always generates the same key, the original algorithm had to sort all tags and ensure their uniqueness.
* The computational cost of repeatedly sorting tag lists for every incoming message created a performance ceiling for the entire metrics pipeline.

### Specialization and Runtime Optimization

* **Algorithmic Specialization:** The team implemented specialized sorting logic that adjusts based on the number of tags, optimizing the "hot path" for the most common metric structures.
* **Hashing Efficiency:** Micro-benchmarks identified Murmur3 as the most efficient hash implementation for balancing speed and collision resistance in this use case.
* **Leveraging the Go Runtime:** The team transitioned from 128-bit hashes to 64-bit metric contexts. This change allowed the Agent to use Go's internal `mapassign_fast64` and `mapaccess2_fast64` functions, which provide optimized map operations for 64-bit keys.

### Redesigning for Performance

* The original design followed a rigid "hash metric name -> sort tags -> deduplicate tags -> iterative hash" workflow.
* Recognizing that sorting was the primary architectural bottleneck, the team moved toward a new design intended to minimize or eliminate the overhead of traditional list sorting during context generation.

To achieve similar performance gains in high-throughput Go applications, developers should profile their applications under realistic load and look for opportunities to leverage runtime-specific optimizations, such as using 64-bit map keys to trigger specialized compiler paths.
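The original workflow described above (sort tags, deduplicate, hash the result into a 64-bit key) can be sketched roughly as follows. This sketch uses the standard library's FNV-1a rather than Murmur3 (the hash the post says Datadog benchmarked as fastest), and the `contextKey` function is illustrative, not the Agent's actual implementation:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// contextKey produces a 64-bit key for a metric so that the same metric
// name and tag set always map to the same hash-table entry, regardless
// of the order in which tags arrive.
func contextKey(name string, tags []string) uint64 {
	// Copy and sort so that ["env:prod", "app:web"] and
	// ["app:web", "env:prod"] yield identical keys. This per-sample sort
	// is exactly the cost the redesign aims to minimize.
	sorted := append([]string(nil), tags...)
	sort.Strings(sorted)

	h := fnv.New64a()
	h.Write([]byte(name))
	prev := ""
	for _, t := range sorted {
		if t == prev { // skip duplicate tags (sorted, so duplicates are adjacent)
			continue
		}
		prev = t
		h.Write([]byte{0}) // separator to avoid ambiguous concatenations
		h.Write([]byte(t))
	}
	return h.Sum64()
}

func main() {
	a := contextKey("requests.count", []string{"env:prod", "app:web"})
	b := contextKey("requests.count", []string{"app:web", "env:prod", "env:prod"})
	fmt.Println(a == b) // tag order and duplicates do not change the key
}
```

Returning a plain `uint64` is what lets a Go map keyed on the context take the runtime's `mapassign_fast64`/`mapaccess2_fast64` fast paths, which is the optimization the post describes.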