metrics-collection

2 posts

naver

Naver TV

This technical session from NAVER ENGINEERING DAY 2025 details the transition from traditional open-source exporters to a Telegraf-based architecture for collecting custom system metrics. By evaluating various monitoring tools through rigorous benchmarking, the developers demonstrate how Telegraf provides a more flexible, higher-performance framework for infrastructure observability. The presentation concludes that adopting Telegraf streamlines the metric collection pipeline and offers superior scalability for complex, large-scale service environments.

### Context and Motivation for Replacing Open-Source Exporters

* The project originated from the need to overcome the limitations of standard open-source exporters, which lacked support for specific internal business logic.
* Engineers sought a unified way to collect diverse data points without managing dozens of fragmented, single-purpose agents.
* The primary goal was a solution that could handle high-frequency data ingestion while keeping resource overhead low on production servers.

### Benchmark Testing for Metric Collection

* A comparative analysis of several open-source monitoring agents was conducted to determine their efficiency under load.
* Testing focused on critical performance indicators, including CPU and memory footprint during peak metric throughput.
* The results highlighted Telegraf's stability and consistent performance compared with other exporter-based alternatives, leading to its selection as the primary collection tool.

### Telegraf Architecture and Customization

* Telegraf operates as a plugin-driven agent built around four plugin categories: Input, Processor, Aggregator, and Output.
* The development team shared their experience writing custom collection plugins on top of Telegraf's modular Go-based framework (see the sketch after this summary).
* This approach allowed raw data to be transformed into various formats (such as Prometheus or InfluxDB) from a single, unified configuration.

### Operational Gains and Technical Options

* Post-implementation, operational complexity dropped significantly because the various metric streams were consolidated into a single agent.
* Specific Telegraf options were used to fine-tune the collection interval and batch size, balancing data granularity against network load.
* The migration improved the reliability of metric delivery through built-in retry mechanisms and internal buffers that prevent data loss during transient network failures.

For teams currently managing a sprawling array of open-source exporters, migrating to a Telegraf-based architecture is recommended to centralize metric collection. The plugin-based system not only reduces the maintenance burden but also provides the extensibility needed to support specialized custom metrics as service requirements evolve.
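As a rough companion to the summary above, here is a minimal sketch of what a custom Input plugin can look like against Telegraf's Go plugin interfaces (`telegraf.Input` and `telegraf.Accumulator`), assuming a recent Telegraf version. The plugin name `custom_business`, the `endpoint` setting, and the `queue_depth` field are illustrative placeholders, not details from the session.

```go
package custom_business

// Minimal sketch of a custom Telegraf input plugin. The measurement name,
// config option, and field below are hypothetical, not from the talk.

import (
	"github.com/influxdata/telegraf"
	"github.com/influxdata/telegraf/plugins/inputs"
)

// CustomBusiness collects a hypothetical internal business metric.
type CustomBusiness struct {
	Endpoint string `toml:"endpoint"` // where the internal value would be read from
}

// SampleConfig returns the TOML snippet shown by `telegraf config`.
func (c *CustomBusiness) SampleConfig() string {
	return `
[[inputs.custom_business]]
  ## Internal endpoint to poll (illustrative)
  endpoint = "http://localhost:8080/internal/stats"
`
}

// Gather is called once per collection interval; it pushes one point into
// the accumulator, where downstream Output plugins (Prometheus, InfluxDB,
// ...) serialize it into their own formats.
func (c *CustomBusiness) Gather(acc telegraf.Accumulator) error {
	// A real plugin would fetch this value from c.Endpoint.
	fields := map[string]interface{}{"queue_depth": 42}
	tags := map[string]string{"source": "internal"}
	acc.AddFields("custom_business", fields, tags)
	return nil
}

func init() {
	// Register the plugin so it can be enabled from telegraf.conf.
	inputs.Add("custom_business", func() telegraf.Input { return &CustomBusiness{} })
}
```

A plugin like this is registered in its `init` function and compiled into the agent binary; once enabled in `telegraf.conf`, agent-level settings such as `interval` and `metric_batch_size` are the kind of knobs the session refers to for trading data granularity against network load.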

datadog

Our journey taking Kubernetes state metrics to the next level | Datadog

Datadog’s container observability team significantly improved the performance of kube-state-metrics (KSM) by contributing core architectural enhancements to the upstream open-source project. Faced with scalability bottlenecks where metrics collection for large clusters took tens of seconds and generated massive data payloads, they revamped the underlying library to achieve a 15x improvement in processing duration. These contributions enable high-granularity monitoring at scale, ensuring that the Datadog Agent can efficiently handle millions of metrics across thousands of Kubernetes nodes.

### Challenges with KSM Scalability

* KSM uses the informer pattern to expose cluster-level metadata in the OpenMetrics format, but the volume of data grows quickly with the number of objects in the cluster (an illustrative informer sketch follows this summary).
* In high-scale environments, a single node generates approximately nine metrics, while a single pod can generate up to 40; a cluster with 3,000 nodes and 30,000 pods therefore already implies well over a million metrics per scrape.
* In clusters with thousands of nodes and tens of thousands of pods, the `/metrics` endpoint produced payloads weighing tens of megabytes.
* Crawling these metrics often took more than 15 seconds, forcing administrators to reduce check frequency and sacrifice real-time data granularity.

### Limitations of Legacy Implementations

* KSM v1 relied on a monolithic loop that instantiated a Builder to track resources via stores, but it lacked efficient hooks for metric generation.
* The original Python-based Datadog Agent check struggled with KSM's "data dump" approach, in which all metrics were processed at once at query time.
* To manage the load, Datadog had to split KSM into multiple deployments based on resource types (e.g., separate deployments for pods, nodes, and secondary resources such as services or deployments).
* This fragmentation made the infrastructure more complex to manage and did not solve the fundamental issue of inefficient metric serialization.

### Architectural Improvements in KSM v2.0

* Datadog collaborated with the upstream community during the development of KSM v2.0 to introduce a more extensible design.
* The team focused on improving the Builder and metric generation hooks so the system no longer dumps the entire dataset at query time.
* Moving away from the restrictive v1 library structure enabled more efficient reconciliation of metric names and metadata joins.
* The resulting 15x performance gain allows the Datadog Agent to reconcile labels and tags, such as joining deployment labels to specific metrics, without the significant latency overhead previously experienced.

Contributing back to the open-source community proved more effective than maintaining internal forks for scaling Kubernetes infrastructure. Organizations running high-density clusters should prioritize upgrading to KSM v2.0 and tuning their agent configuration to take advantage of these architectural improvements for better observability performance.
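For readers unfamiliar with the informer pattern mentioned above, the sketch below shows the underlying client-go mechanism: a shared informer keeps a local cache of Pod objects in sync with the API server and fires handlers on changes, so an exporter can serve metrics from memory instead of relisting the cluster on every scrape. This is only an illustration of the pattern, not kube-state-metrics or Datadog Agent code.

```go
package main

// Illustrative sketch of the informer pattern: a shared informer mirrors
// Pod state into a local cache and reacts to add/delete events. A KSM-style
// exporter would regenerate metric strings for just the affected object in
// these handlers rather than rebuilding the whole payload at query time.

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// Assume an in-cluster deployment; out of cluster you would build the
	// rest.Config from a kubeconfig file instead.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	// Shared informer factory with a periodic resync as a safety net.
	factory := informers.NewSharedInformerFactory(clientset, 10*time.Minute)
	podInformer := factory.Core().V1().Pods().Informer()

	// Event handlers fire as the watch stream delivers changes.
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			pod := obj.(*corev1.Pod)
			fmt.Printf("pod added: %s/%s phase=%s\n", pod.Namespace, pod.Name, pod.Status.Phase)
		},
		DeleteFunc: func(obj interface{}) {
			// Deletes may arrive as tombstones, so check the type assertion.
			if pod, ok := obj.(*corev1.Pod); ok {
				fmt.Printf("pod deleted: %s/%s\n", pod.Namespace, pod.Name)
			}
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	// Block until the local cache mirrors the current cluster state.
	cache.WaitForCacheSync(stop, podInformer.HasSynced)
	select {}
}
```

Keeping this cache in memory is cheap for the API server; as the post explains, the expensive part at scale is serializing the cached objects into a tens-of-megabytes OpenMetrics payload at query time, which is the behavior the KSM v2.0 metric generation hooks were designed to avoid.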