Datadog / observability

2 posts

Robust statistical distances for machine learning | Datadog

Datadog has introduced Toto, a new open-weights foundation model designed for time-series forecasting and anomaly detection in observability contexts. While general-purpose time-series models often struggle with the volatility and high-frequency patterns of IT telemetry, Toto is pre-trained on a massive dataset of 500 billion observations to deliver strong zero-shot performance. The release, accompanied by the BOOM benchmark, addresses the need for specialized AI tools capable of handling the complexity of modern cloud infrastructure.

### Toto Model Architecture and Training

* Toto uses a decoder-only transformer architecture, adapting large language model (LLM) principles to the domain of continuous numerical data.
* The model employs a "patching" mechanism that groups multiple time-series data points into single tokens, improving computational efficiency and allowing the model to capture longer historical dependencies.
* It incorporates Rotary Positional Embeddings (RoPE) to better handle sequences of varying lengths and maintain temporal relationships across different frequencies.
* Training was conducted on a curated dataset of 500 billion anonymized data points from real-world observability metrics, including CPU usage, memory consumption, and network traffic.

### Specialized Observability Features

* Unlike existing models such as TimesFM or Chronos, which are trained on diverse but general datasets like weather or retail trends, Toto is optimized for the "spikiness" and abrupt level shifts common in IT environments.
* The model supports zero-shot forecasting, so users can generate predictions for new metrics immediately, without expensive or time-consuming fine-tuning.
* Toto is designed to handle varying sampling rates, from one-second intervals to hourly aggregations, making it versatile across different infrastructure layers.
* The open-weights release on Hugging Face allows researchers and engineers to integrate the model into their own AIOps workflows or private cloud environments.

### The BOOM Evaluation Framework

* Datadog released the Benchmarking Observability Models (BOOM) framework to provide a standardized way to evaluate time-series models on infrastructure-specific tasks.
* BOOM focuses on metrics that represent real-world operational challenges, such as seasonal traffic patterns and sudden system failures.
* Comparative testing shows that Toto consistently outperforms general-purpose models in Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) on observability datasets.
* The benchmark gives the industry a transparent way to measure progress in time-series foundation models, moving beyond generic datasets that do not reflect the realities of microservices and distributed systems.

Organizations looking to automate capacity planning, optimize cloud spend, or implement intelligent alerting should consider adopting Toto for their time-series analysis. By using the open-weights model alongside the BOOM benchmark, teams can achieve high-accuracy forecasting and objective performance validation without the overhead of building specialized models from scratch.
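The patching mechanism described above can be sketched in a few lines of NumPy. The patch length and padding strategy here are illustrative assumptions, not Toto's actual hyperparameters; a real model would additionally project each patch into an embedding vector.

```python
import numpy as np

def patchify(series: np.ndarray, patch_len: int) -> np.ndarray:
    """Group consecutive time-series points into fixed-length "patch" tokens.

    Padding with the last observed value (an assumption for this sketch)
    keeps a final partial patch usable.
    """
    pad = (-len(series)) % patch_len
    padded = np.pad(series, (0, pad), mode="edge")
    return padded.reshape(-1, patch_len)

# A 512-point CPU-usage window with patch length 64 becomes 8 tokens,
# so the transformer attends over 8 positions instead of 512.
window = np.random.default_rng(0).random(512)
tokens = patchify(window, 64)
print(tokens.shape)  # (8, 64)
```

Shrinking the sequence this way is what lets a patched model cover much longer histories for the same attention cost.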

Our journey taking Kubernetes state metrics to the next level | Datadog

Datadog’s container observability team significantly improved the performance of kube-state-metrics (KSM) by contributing core architectural enhancements to the upstream open-source project. Faced with scalability bottlenecks, where metrics collection for large clusters took tens of seconds and generated massive payloads, they revamped the underlying library to achieve a 15x improvement in processing duration. These contributions enabled high-granularity monitoring at scale, ensuring that the Datadog Agent can efficiently handle millions of metrics across thousands of Kubernetes nodes.

### Challenges with KSM Scalability

* KSM uses the informer pattern to expose cluster-level metadata in the OpenMetrics format, but the volume of data grows rapidly with cluster size.
* In high-scale environments, a single node generates approximately nine metrics, while a single pod can generate up to 40.
* In clusters with thousands of nodes and tens of thousands of pods, the `/metrics` endpoint produced payloads weighing tens of megabytes.
* Crawling these metrics often took more than 15 seconds, forcing administrators to reduce check frequency and sacrifice real-time data granularity.

### Limitations of Legacy Implementations

* KSM v1 relied on a monolithic loop that instantiated a Builder to track resources via stores, but it lacked efficient hooks for metric generation.
* The original Python-based Datadog Agent check struggled with KSM's "data dump" approach, in which all metrics were processed at once at query time.
* To manage the load, Datadog had to split KSM into multiple deployments by resource type (e.g., separate deployments for pods, nodes, and secondary resources such as services or deployments).
* This fragmentation made the infrastructure more complex to manage and did not solve the fundamental issue of inefficient metric serialization.
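The per-resource metric counts above make it easy to see how payloads reach tens of megabytes. A back-of-the-envelope estimate, where the cluster size and the average bytes per OpenMetrics exposition line are illustrative assumptions rather than figures from the post:

```python
# Rough estimate of a KSM /metrics payload for a large cluster.
# Per-resource metric counts come from the post; the cluster size and
# bytes per exposition line are assumptions for this sketch.
NODES, PODS = 1_000, 10_000
METRICS_PER_NODE, METRICS_PER_POD = 9, 40
BYTES_PER_LINE = 100  # assumed: metric name + label set + value

total_series = NODES * METRICS_PER_NODE + PODS * METRICS_PER_POD
payload_mb = total_series * BYTES_PER_LINE / (1024 ** 2)

print(f"{total_series:,} series, ~{payload_mb:.0f} MB per scrape")
# Hundreds of thousands of series and a payload on the order of tens
# of megabytes, matching the bottleneck described above.
```

Parsing a payload of that size on every check interval is what pushed scrape times past 15 seconds.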
### Architectural Improvements in KSM v2.0

* Datadog collaborated with the upstream community during the development of KSM v2.0 to introduce a more extensible design.
* The team focused on improving the Builder and metric generation hooks so the system no longer dumps the entire dataset at query time.
* By moving away from the restrictive v1 library structure, they enabled more efficient reconciliation of metric names and metadata joins.
* The resulting 15x performance gain allows the Datadog Agent to reconcile labels and tags, such as joining deployment labels to specific metrics, without the significant latency overhead previously experienced.

Contributing back to the open-source community proved more effective than maintaining internal forks for scaling Kubernetes infrastructure. Organizations running high-density clusters should prioritize upgrading to KSM v2.0 and optimizing their agent configurations to take advantage of these architectural improvements for better observability performance.
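The shift from dumping the entire dataset at query time to per-resource generation hooks can be illustrated with a toy analogue. KSM itself is written in Go, and the hook and store names below are invented for illustration; the point is that a scrape streams lines lazily, resource by resource, instead of materializing one giant payload.

```python
from typing import Callable, Dict, Iterator, List

# Toy analogue of the v2-style design: each resource type registers a
# metric generation hook, and the scrape handler yields one exposition
# line at a time rather than building the full payload in memory.
MetricHook = Callable[[dict], Iterator[str]]

HOOKS: Dict[str, MetricHook] = {}

def register_hook(kind: str, hook: MetricHook) -> None:
    HOOKS[kind] = hook

def scrape(store: Dict[str, List[dict]]) -> Iterator[str]:
    """Yield metric lines lazily from informer-cached objects."""
    for kind, objects in store.items():
        hook = HOOKS.get(kind)
        if hook is None:
            continue
        for obj in objects:
            yield from hook(obj)

# Example hook: a pod readiness metric derived from cached pod state.
register_hook("pod", lambda pod: iter([
    f'kube_pod_status_ready{{pod="{pod["name"]}"}} {int(pod["ready"])}',
]))

store = {"pod": [{"name": "web-0", "ready": True}]}
for line in scrape(store):
    print(line)  # kube_pod_status_ready{pod="web-0"} 1
```

Because generation is driven by hooks, a consumer such as an agent can also intercept each metric to join in extra labels, which is the kind of reconciliation the 15x improvement made cheap.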