Introducing new FigJam prices and a more open platform | Figma Blog
Introducing FigJam Inside Figma Design Product updates Collaboration FigJam News
Datadog’s container observability team significantly improved the performance of kube-state-metrics (KSM) by contributing core architectural enhancements to the upstream open-source project. Faced with scalability bottlenecks where metrics collection for large clusters took tens of seconds and generated massive data payloads, they revamped the underlying library to achieve a 15x improvement in processing duration. These contributions allow high-granularity monitoring at scale, so that the Datadog Agent can efficiently handle millions of metrics across thousands of Kubernetes nodes.

### Challenges with KSM Scalability

* KSM uses the informer pattern to expose cluster-level metadata in the OpenMetrics format, but the volume of data grows with the number of objects in the cluster.
* A single node generates approximately nine metrics, while a single pod can generate up to 40 metrics.
* In clusters with thousands of nodes and tens of thousands of pods, the `/metrics` endpoint produced payloads weighing tens of megabytes.
* The time required to crawl these metrics often exceeded 15 seconds, forcing administrators to reduce check frequency and sacrifice real-time data granularity.

### Limitations of Legacy Implementations

* KSM v1 relied on a monolithic loop that instantiated a Builder to track resources via stores, but it lacked efficient hooks for metric generation.
* The original Python-based Datadog Agent check struggled with the "data dump" approach of KSM, where all metrics were processed at once at query time.
* To manage the load, Datadog was forced to split KSM into multiple deployments based on resource types (e.g., separate deployments for pods, nodes, and secondary resources like services or deployments).
* This fragmentation made the infrastructure more complex to manage and did not solve the fundamental issue of inefficient metric serialization.
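The per-object figures quoted under "Challenges with KSM Scalability" (about nine metrics per node, up to 40 per pod) can be turned into a rough back-of-envelope estimate of how many series a single `/metrics` scrape has to carry. The sketch below is illustrative only: the function names and the example cluster size are assumptions, and the per-object counts are the approximate upper figures from the summary, not measured values.

```python
# Approximate per-object metric counts taken from the summary above.
METRICS_PER_NODE = 9   # "approximately nine metrics" per node
METRICS_PER_POD = 40   # "up to 40 metrics" per pod (upper bound)


def estimate_series(nodes: int, pods: int) -> int:
    """Rough upper estimate of series exposed for nodes and pods only."""
    if nodes < 0 or pods < 0:
        raise ValueError("object counts must be non-negative")
    return nodes * METRICS_PER_NODE + pods * METRICS_PER_POD


def pod_share(nodes: int, pods: int) -> float:
    """Fraction of the estimated series that comes from pods."""
    total = estimate_series(nodes, pods)
    return 0.0 if total == 0 else (pods * METRICS_PER_POD) / total


if __name__ == "__main__":
    # Hypothetical cluster: 2,000 nodes and 20,000 pods.
    nodes, pods = 2_000, 20_000
    print(f"estimated series: {estimate_series(nodes, pods):,}")
    print(f"share from pods:  {pod_share(nodes, pods):.1%}")
```

For the hypothetical cluster above, the estimate comes to 818,000 series, almost all of them from pods, which is consistent with the payload being dominated by pod-level data. Other resource types (services, deployments, and so on) are ignored here, so a real scrape would be larger.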
### Architectural Improvements in KSM v2.0

* Datadog collaborated with the upstream community during the development of KSM v2.0 to introduce a more extensible design.
* The team focused on improving the Builder and metric generation hooks to prevent the system from dumping the entire dataset at query time.
* By moving away from the restrictive v1 library structure, they enabled more efficient reconciliation of metric names and metadata joins.
* The resulting 15x performance gain allows the Datadog Agent to reconcile labels and tags (for example, joining deployment labels to specific metrics) without the significant latency overhead previously experienced.

Contributing back to the open-source community proved more effective than maintaining internal forks for scaling Kubernetes infrastructure. Organizations running high-density clusters should prioritize upgrading to KSM v2.0 and optimizing their agent configurations to leverage these architectural improvements for better observability performance.
Charly Fontaine, Cedric Lamoriniere, Ahmed Mezghani
We contributed to kube-state-metrics, a popular open source Kubernetes service that listens to the Kubernetes API server and generates metrics about the state of the objects. It focuses on monitoring the health of deployments…
Behind the feature: The hidden challenges of autosave Inside Figma Product updates Engineering Behind the scenes
Introducing branching: space to iterate and explore freely Inside Figma Product updates Figma Design News
Vladimir Zhuk
Performance bottlenecks are not always (or some might say, never) where you expect them. We have all been there: knowing that there was latency, but not finding it in any of the expected places. There is nothing worse than seeing that there's latency and having…
Datadog engineers discovered a significant 20–30% CPU overhead in their Akka-based Java applications caused by inefficient thread management within the `ForkJoinPool`. Through continuous profiling, the team found that irregular task flows were forcing the runtime to waste cycles constantly parking and unparking threads. By migrating bursty actors to a dispatcher with a more stable workload, they achieved a major performance gain, illustrating how high-level framework abstractions can mask low-level resource bottlenecks.

### Identifying the Performance Bottleneck

* While running A/B tests on a new log-parsing algorithm, the team noticed that expected CPU reductions did not materialize; in some cases, performance actually degraded.
* Flame graphs revealed that the application was spending a disproportionate amount of CPU time inside the `ForkJoinPool.scan()` and `Unsafe.park()` methods.
* A summary table of CPU usage by thread showed that the "work" pool was only using 1% of the CPU, while the default Akka dispatcher was the primary consumer of resources.
* The investigation narrowed the cause down to the `LatencyReportActor`, which handled latency metrics for log events.

### Analyzing the Root Cause of Thread Fluctuations

* The `ForkJoinPool` manages worker threads dynamically, calling `Unsafe.park()` to suspend idle threads and `Unsafe.unpark()` to resume them when tasks increase.
* The `LatencyReportActor` exhibited an irregular task flow, processing several hundred events in milliseconds and then remaining idle until the next second.
* Because the default dispatcher was configured to use a thread pool equal to the number of processor cores (32), the system was waking up 32 threads every second for a tiny burst of work.
* This constant cycle of waking and suspending threads created massive CPU overhead through expensive native calls to the operating system's thread scheduler.
### Implementing a Configuration-Based Fix

* The solution involved moving the `LatencyReportActor` from the default Akka dispatcher to the main "work" dispatcher.
* Because the "work" dispatcher already maintained a consistent flow of log processing tasks, the threads remained active and did not trigger the frequent park/unpark logic.
* A single-line configuration change was used to route the actor to the stable dispatcher.
* Following the change, the default dispatcher’s thread pool shrank from 32 to 2 threads, and overall service CPU usage dropped by an average of 30%.

To maintain optimal performance in applications using `ForkJoinPool` or Akka, developers should monitor the `ForkJoinPool.scan()` method; if it accounts for more than 10–15% of CPU usage, the thread pool is likely unstable. Recommendations for remediation include limiting the number of actor instances, capping the maximum threads in a pool, and utilizing task queues to buffer short spikes. The ultimate goal is to ensure a stable count of active threads and avoid the performance tax of frequent thread state transitions.
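The 10–15% rule of thumb for `ForkJoinPool.scan()` can be sketched as a small check over profiler output. This is a minimal illustration, assuming CPU samples have already been exported from a profiler as a mapping from method name to sample count; the function names, the input shape, and the sample numbers are all hypothetical and do not correspond to any particular profiler's API.

```python
from typing import Mapping

# Frame name to look for; the exact label depends on the profiler (assumption).
SCAN_FRAME = "ForkJoinPool.scan"


def scan_cpu_share(samples_by_method: Mapping[str, int]) -> float:
    """Return the fraction of CPU samples attributed to ForkJoinPool.scan()."""
    total = sum(samples_by_method.values())
    if total == 0:
        return 0.0
    return samples_by_method.get(SCAN_FRAME, 0) / total


def pool_looks_unstable(samples_by_method: Mapping[str, int],
                        threshold: float = 0.10) -> bool:
    """Flag the pool when scan() exceeds the threshold (lower bound of 10-15%)."""
    return scan_cpu_share(samples_by_method) > threshold


if __name__ == "__main__":
    # Hypothetical sample counts; "LogParser.parse" is an invented frame name.
    samples = {
        "ForkJoinPool.scan": 180,
        "Unsafe.park": 120,
        "LogParser.parse": 700,
    }
    print(f"scan share: {scan_cpu_share(samples):.0%}")
    print(f"unstable:   {pool_looks_unstable(samples)}")
```

The default threshold uses the lower end of the 10–15% range so that borderline pools are flagged for a closer look; passing `threshold=0.15` gives the more conservative reading.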
What’s new in Figma: August 2021 Inside Figma Product updates Plugins & tooling Design systems
Redesigning Dropbox’s ways of working Maker Stories Case study Figma Design
Reflections on Config, our first user conference Inside Figma Config Events News
Securing internal web apps Inside Figma Security Engineering
FIT’s principles for fostering a collaborative classroom Best practices from professors at the Fashion Institute of Technology on how to construct a virtual classroom in Figma. Working Well Career & education Leadership Case study Collaboration
Rethinking a design thinking workshop for good How Design the Future used FigJam to bring one hundred high school students together to solve challenges for people with disabilities. Maker Stories Social impact Leadership Case study Design thinking Meetings