Datadog / python

2 posts

Our journey taking Kubernetes state metrics to the next level | Datadog

Datadog’s container observability team significantly improved the performance of kube-state-metrics (KSM) by contributing core architectural enhancements to the upstream open-source project. Faced with scalability bottlenecks where metrics collection for large clusters took tens of seconds and generated massive data payloads, they revamped the underlying library to achieve a 15x improvement in processing duration. These contributions enabled high-granularity monitoring at scale, ensuring that the Datadog Agent can efficiently handle millions of metrics across thousands of Kubernetes nodes.

### Challenges with KSM Scalability

* KSM uses the informer pattern to expose cluster-level metadata in the OpenMetrics format, but the volume of data grows rapidly with cluster size.
* In high-scale environments, a single node generates approximately nine metrics, while a single pod can generate up to 40.
* In clusters with thousands of nodes and tens of thousands of pods, the `/metrics` endpoint produced payloads weighing tens of megabytes.
* The time required to crawl these metrics often exceeded 15 seconds, forcing administrators to reduce check frequency and sacrifice real-time data granularity.

### Limitations of Legacy Implementations

* KSM v1 relied on a monolithic loop that instantiated a Builder to track resources via stores, but it lacked efficient hooks for metric generation.
* The original Python-based Datadog Agent check struggled with KSM's "data dump" approach, in which all metrics were processed at once at query time.
* To manage the load, Datadog was forced to split KSM into multiple deployments by resource type (e.g., separate deployments for pods, nodes, and secondary resources such as services or deployments).
* This fragmentation made the infrastructure more complex to manage and did not solve the fundamental issue of inefficient metric serialization.
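The per-resource counts above make the payload problem easy to sanity-check with simple arithmetic. The sketch below uses the figures from the article (about nine metrics per node, up to 40 per pod); the cluster sizes and the average bytes-per-line figure are assumptions for illustration only.

```python
# Back-of-the-envelope estimate of a KSM /metrics payload.
# Per-resource metric counts come from the article; the average
# bytes-per-line figure and cluster sizes are assumed.

METRICS_PER_NODE = 9    # from the article
METRICS_PER_POD = 40    # upper bound from the article
AVG_LINE_BYTES = 100    # assumed average OpenMetrics line length

def estimate_payload(nodes: int, pods: int) -> tuple[int, float]:
    """Return (metric line count, payload size in MB) for one scrape."""
    lines = nodes * METRICS_PER_NODE + pods * METRICS_PER_POD
    return lines, lines * AVG_LINE_BYTES / 1e6

# A hypothetical cluster with 2,000 nodes and 10,000 pods:
lines, mb = estimate_payload(2_000, 10_000)
print(f"{lines:,} metric lines ≈ {mb:.0f} MB per scrape")
```

Even with conservative assumptions, a mid-size cluster lands in the "tens of megabytes per scrape" range the article describes, which is why per-scrape serialization cost dominated.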
### Architectural Improvements in KSM v2.0

* Datadog collaborated with the upstream community during the development of KSM v2.0 to introduce a more extensible design.
* The team focused on improving the Builder and metric generation hooks so that the system no longer dumps the entire dataset at query time.
* By moving away from the restrictive v1 library structure, they enabled more efficient reconciliation of metric names and metadata joins.
* The resulting 15x performance gain allows the Datadog Agent to reconcile labels and tags, such as joining deployment labels to specific metrics, without the significant latency overhead previously experienced.

Contributing back to the open-source community proved more effective than maintaining internal forks for scaling Kubernetes infrastructure. Organizations running high-density clusters should prioritize upgrading to KSM v2.0 and optimizing their agent configurations to leverage these architectural improvements for better observability performance.
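The shift away from dumping the whole dataset at query time can be illustrated with a small sketch. KSM itself is written in Go, and the class and method names below are invented for illustration: the idea is that metric text is rendered once per informer event and cached, so the scrape handler only concatenates precomputed chunks.

```python
# Illustrative sketch (not the real KSM code, which is Go):
# metric lines are (re)rendered when an informer event arrives,
# so serving /metrics is a cheap concatenation of cached strings.

class MetricsCache:
    def __init__(self) -> None:
        self._chunks: dict[str, str] = {}  # resource UID -> rendered metric lines

    def on_add_or_update(self, uid: str, rendered: str) -> None:
        # Called from the informer event handler: render once, cache.
        self._chunks[uid] = rendered

    def on_delete(self, uid: str) -> None:
        self._chunks.pop(uid, None)

    def scrape(self) -> str:
        # Query time does no per-object work, just joins cached text.
        return "".join(self._chunks.values())

cache = MetricsCache()
cache.on_add_or_update("pod-a", 'kube_pod_info{pod="a"} 1\n')
cache.on_add_or_update("pod-b", 'kube_pod_info{pod="b"} 1\n')
cache.on_delete("pod-a")
print(cache.scrape())  # only pod-b's line remains
```

Moving the rendering cost from scrape time to event time is what lets scrape latency stay flat even as cluster size grows.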

Cheering on coworkers: Building culture with Datadog dashboards | Datadog

Datadog engineers developed a real-time tracking dashboard to monitor a colleague’s progress during an 850km, six-day ultra-marathon challenge. By scraping public race statistics and piping the data into their monitoring platform, the team created a centralized visualization tool to provide remote support and office-wide engagement.

### Data Extraction and Parsing

The team needed to harvest race data that was only available as plain HTML on the event’s official website.

* A crawler was built using the Python `Requests` library to automate retrieval of the webpage's source code.
* The team used `BeautifulSoup` to parse the HTML and isolate specific data points, such as the runner's current ranking and total distance covered.

### Ingesting Metrics with StatsD

Once the data was structured, it was converted into telemetry using the Datadog Agent and the `statsd` Python library.

* The script used `dog.gauge` to emit three primary metrics: `runner.distance`, `runner.ranking`, and `runner.elapsed_time`.
* Each metric was assigned a "name" tag corresponding to the runner, allowing the team to filter data and compare participants within the Datadog interface.
* The data was updated periodically so that the dashboard reflected the most current race standings.

### Dashboard Visualization and Results

The final phase involved synthesizing the metrics into a high-visibility dashboard displayed in the company’s New York and Paris offices.

* The dashboard combined technical performance graphs with multimedia elements, including live video feeds and GIFs, to create an interactive cheering station.
* The system successfully tracked the athlete's 47km lead in real time, providing the team with immediate updates on his physical progress and elapsed time over the 144-hour event.

This project demonstrates how standard observability tools can be repurposed for creative "life-graphing" applications.
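The extraction step can be sketched in a few lines. The original crawler used `Requests` and `BeautifulSoup`; the version below uses only the standard library's `html.parser` so it runs without third-party dependencies, and the HTML snippet and cell layout are invented stand-ins for the race site's results table.

```python
# Hedged sketch of the HTML-parsing step. The original script used
# Requests + BeautifulSoup; this version is stdlib-only, and the
# HTML snippet and column order are invented for illustration.
from html.parser import HTMLParser

class RaceStatsParser(HTMLParser):
    """Collect the text of every <td> cell in document order."""
    def __init__(self):
        super().__init__()
        self._in_td = False
        self.cells: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td and data.strip():
            self.cells.append(data.strip())

# Invented stand-in for the results table fetched by the crawler:
html = "<table><tr><td>3</td><td>Runner Name</td><td>612 km</td></tr></table>"
parser = RaceStatsParser()
parser.feed(html)
ranking, name, distance = parser.cells
print(ranking, name, distance)  # 3 Runner Name 612 km
```

With BeautifulSoup, the same extraction collapses to a `find_all("td")` call; the structure of the pipeline (fetch, parse, pick cells) is identical.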
By combining simple web scraping with metric ingestion, engineers can quickly build custom monitoring solutions for any public data source.
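As a worked example of the ingestion half of that pattern: gauge calls like the article's `dog.gauge` ultimately reduce to small UDP datagrams in the DogStatsD text format. The sketch below shows that wire format; the metric name and `name` tag come from the article, while the helper functions are invented (a real script would use the `datadog` client library).

```python
# Minimal sketch of the DogStatsD wire format behind gauge calls such
# as dog.gauge("runner.distance", 612, tags=["name:runner"]).
# The helper functions here are invented for illustration.
import socket

def format_gauge(metric: str, value: float, tags: list[str]) -> str:
    """Render a gauge as a DogStatsD datagram: metric:value|g|#tag1,tag2"""
    return f"{metric}:{value}|g|#{','.join(tags)}"

def send_gauge(metric, value, tags, host="127.0.0.1", port=8125):
    # The Datadog Agent's DogStatsD server listens on UDP 8125 by
    # default; metrics are fire-and-forget datagrams.
    payload = format_gauge(metric, value, tags).encode()
    socket.socket(socket.AF_INET, socket.SOCK_DGRAM).sendto(payload, (host, port))

print(format_gauge("runner.distance", 612, ["name:runner"]))
# runner.distance:612|g|#name:runner
```

Because the transport is plain UDP text, any periodic script (a cron job, a loop with `time.sleep`) can feed a dashboard this way, which is what makes the "life-graphing" pattern so cheap to set up.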