Our journey taking Kubernetes state metrics to the next level | Datadog
Datadog’s container observability team significantly improved the performance of kube-state-metrics (KSM) by contributing core architectural enhancements to the upstream open-source project. Faced with scalability bottlenecks where metrics collection for large clusters took tens of seconds and generated massive data payloads, they revamped the underlying library to achieve a 15x improvement in processing time. These contributions enable high-granularity monitoring at scale, ensuring that the Datadog Agent can efficiently handle millions of metrics across thousands of Kubernetes nodes.
Challenges with KSM Scalability
- KSM uses the informer pattern to expose cluster-level metadata via the OpenMetrics format, but the volume of data grows quickly with cluster size, since every node and pod contributes its own set of series.
- In high-scale environments, a single node generates approximately nine metrics, while a single pod can generate up to 40 metrics.
- In clusters with thousands of nodes and tens of thousands of pods, the /metrics endpoint produced payloads weighing tens of megabytes.
- The time required to crawl these metrics often exceeded 15 seconds, forcing administrators to reduce check frequency and sacrifice real-time data granularity.
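The scale problem above is simple arithmetic. A minimal sketch, using the per-object figures quoted in this section (~9 metrics per node, up to ~40 per pod) and an assumed average of ~100 bytes per exposition line, shows how the /metrics payload reaches tens of megabytes; the byte-per-line figure is an illustrative assumption, not a measured value.

```python
# Per-object series counts quoted in the text above.
METRICS_PER_NODE = 9
METRICS_PER_POD = 40
# Assumed average size of one OpenMetrics exposition line (illustrative).
BYTES_PER_LINE = 100

def estimate_series(nodes: int, pods: int) -> int:
    """Rough count of time series exposed on the /metrics endpoint."""
    return nodes * METRICS_PER_NODE + pods * METRICS_PER_POD

def estimate_payload_mb(nodes: int, pods: int) -> float:
    """Approximate payload size of one scrape, in megabytes."""
    return estimate_series(nodes, pods) * BYTES_PER_LINE / 1_000_000

# A cluster with 1,000 nodes and 10,000 pods:
series = estimate_series(1_000, 10_000)       # 409,000 series
payload = estimate_payload_mb(1_000, 10_000)  # ~41 MB per scrape
```

Even this mid-sized example lands in the tens-of-megabytes range, and the payload must be regenerated and re-crawled on every check interval.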
Limitations of Legacy Implementations
- KSM v1 relied on a monolithic loop that instantiated a Builder to track resources via stores, but it lacked efficient hooks for metric generation.
- The original Python-based Datadog Agent check struggled with the "data dump" approach of KSM, where all metrics were processed at once during query time.
- To manage the load, Datadog was forced to split KSM into multiple deployments based on resource types (e.g., separate deployments for pods, nodes, and secondary resources like services or deployments).
- This fragmentation made the infrastructure more complex to manage and did not solve the fundamental issue of inefficient metric serialization.
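To make the "data dump" cost concrete, here is a minimal sketch (not the actual Agent check) of what query-time processing looks like: the entire OpenMetrics payload is re-parsed on every check interval, even when only a handful of series changed since the last scrape.

```python
def parse_openmetrics(payload: str) -> dict[str, float]:
    """Parse every sample line of an OpenMetrics-style payload.

    This is a simplified parser for illustration: it ignores HELP/TYPE
    metadata and timestamps, and does O(payload) work on every call.
    """
    samples = {}
    for line in payload.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/metadata lines
        # Everything before the last space is the series identity
        # (metric name plus label set); after it is the sample value.
        name_and_labels, _, value = line.rpartition(" ")
        samples[name_and_labels] = float(value)
    return samples

payload = """\
# HELP kube_pod_status_phase The pods current phase.
# TYPE kube_pod_status_phase gauge
kube_pod_status_phase{namespace="default",pod="web-0",phase="Running"} 1
kube_node_status_condition{node="node-1",condition="Ready",status="true"} 1
"""
samples = parse_openmetrics(payload)
```

With hundreds of thousands of lines per scrape, doing this full pass on every check is exactly the per-query overhead that splitting KSM into multiple deployments reduced but never eliminated.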
Architectural Improvements in KSM v2.0
- Datadog collaborated with the upstream community during the development of KSM v2.0 to introduce a more extensible design.
- The team focused on improving the Builder and metric generation hooks to prevent the system from dumping the entire dataset at query time.
- By moving away from the restrictive v1 library structure, they enabled more efficient reconciliation of metric names and metadata joins.
- The resulting 15x performance gain allows the Datadog Agent to reconcile labels and tags—such as joining deployment labels to specific metrics—without the significant latency overhead previously experienced.
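The hook-based design described above can be sketched as follows. This is an illustrative model, not the actual kube-state-metrics v2 API: the names `MetricsStore` and `generate_pod_metrics` are hypothetical. The key idea is that metric text is rendered once, when the informer observes an object change, so the /metrics handler only concatenates cached chunks at query time instead of regenerating the whole dataset.

```python
class MetricsStore:
    """Caches pre-rendered metric text per object (illustrative)."""

    def __init__(self, generate):
        self.generate = generate  # hook: object -> rendered metric lines
        self.cache = {}           # object UID -> cached text chunk

    def on_add_or_update(self, obj):
        # Generation hook runs on informer events, not at query time.
        self.cache[obj["uid"]] = self.generate(obj)

    def on_delete(self, obj):
        self.cache.pop(obj["uid"], None)

    def write_all(self) -> str:
        # Query time is a cheap join of cached chunks.
        return "".join(self.cache.values())

def generate_pod_metrics(pod) -> str:
    # Hypothetical generator for a single pod object.
    return (f'kube_pod_status_phase{{namespace="{pod["ns"]}",'
            f'pod="{pod["name"]}",phase="{pod["phase"]}"}} 1\n')

store = MetricsStore(generate_pod_metrics)
store.on_add_or_update({"uid": "a1", "ns": "default",
                        "name": "web-0", "phase": "Running"})
output = store.write_all()
```

Amortizing generation across informer events is what turns the expensive query-time dump into a near-constant-cost read, which is consistent with the 15x gain described above.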
Contributing back to the open-source community proved more effective than maintaining internal forks for scaling Kubernetes infrastructure. Organizations running high-density clusters should prioritize upgrading to KSM v2.0 and optimizing their agent configurations to leverage these architectural improvements for better observability performance.