
datadog Sep 30, 2021

How we optimized our Akka application using Datadog’s Continuous Profiler

Vladimir Zhuk

Performance bottlenecks are not always (or some might say, never) where you expect them. We have all been there, knowing that there was a latency, but not finding it in any of the expected places. There is nothing worse than seeing that there's a latency and having…



Datadog engineers discovered a significant 20–30% CPU overhead in their Akka-based Java applications caused by inefficient thread management within the `ForkJoinPool`. Through continuous profiling, the team found that irregular task flows were forcing the runtime to waste cycles constantly parking and unparking threads. By migrating bursty actors to a dispatcher with a more stable workload, they achieved a major performance gain, illustrating how high-level framework abstractions can mask low-level resource bottlenecks.

### Identifying the Performance Bottleneck

* While running A/B tests on a new log-parsing algorithm, the team noticed that expected CPU reductions did not materialize; in some cases, performance actually degraded.
* Flame graphs revealed that the application was spending a disproportionate amount of CPU time inside the `ForkJoinPool.scan()` and `Unsafe.park()` methods.
* A summary table of CPU usage by thread showed that the "work" pool was only using 1% of the CPU, while the default Akka dispatcher was the primary consumer of resources.
* The investigation narrowed the cause down to the `LatencyReportActor`, which handled latency metrics for log events.

### Analyzing the Root Cause of Thread Fluctuations

* The `ForkJoinPool` manages worker threads dynamically, calling `Unsafe.park()` to suspend idle threads and `Unsafe.unpark()` to resume them when tasks increase.
* The `LatencyReportActor` exhibited an irregular task flow, processing several hundred events in milliseconds and then remaining idle until the next second.
* Because the default dispatcher was configured to use a thread pool equal to the number of processor cores (32), the system was waking up 32 threads every second for a tiny burst of work.
* This constant cycle of waking and suspending threads created massive CPU overhead through expensive native calls to the operating system's thread scheduler.
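
The burst-then-idle pattern described above can be reproduced in isolation with the JDK's `ForkJoinPool`. What follows is a minimal sketch, not code from the Datadog service: the class name `BurstyLoadSketch`, the burst size of 300 tasks, and the one-second idle gap are illustrative assumptions chosen to mirror the description of `LatencyReportActor`. It counts how many distinct worker threads a tiny burst wakes up; running it under a profiler is one way to look for time spent in `ForkJoinPool.scan()` and `Unsafe.park()`.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class; none of these names come from the original article.
class BurstyLoadSketch {

    /** Outcome of one burst: tasks completed and distinct worker threads used. */
    static final class BurstResult {
        final int completed;
        final int workersUsed;

        BurstResult(int completed, int workersUsed) {
            this.completed = completed;
            this.workersUsed = workersUsed;
        }
    }

    /** Submits `tasks` trivial tasks at once and waits until all have run. */
    static BurstResult runBurst(ForkJoinPool pool, int tasks) {
        Set<String> workers = ConcurrentHashMap.newKeySet();
        AtomicInteger completed = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(tasks);
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                // Record which worker thread picked up this task.
                workers.add(Thread.currentThread().getName());
                completed.incrementAndGet();
                done.countDown();
            });
        }
        try {
            if (!done.await(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("burst did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for burst", e);
        }
        return new BurstResult(completed.get(), workers.size());
    }

    public static void main(String[] args) throws InterruptedException {
        // One worker per core, as with the default dispatcher described above.
        ForkJoinPool pool = new ForkJoinPool(Runtime.getRuntime().availableProcessors());
        try {
            for (int second = 0; second < 3; second++) {
                BurstResult r = runBurst(pool, 300);
                System.out.printf("burst %d: %d tasks on %d worker(s)%n",
                        second, r.completed, r.workersUsed);
                // Idle gap: workers that find no tasks are expected to park.
                Thread.sleep(1000);
            }
        } finally {
            pool.shutdown();
        }
    }
}
```

The number of workers reported per burst varies from run to run and from machine to machine, so the sketch prints it rather than assuming a value.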

### Implementing a Configuration-Based Fix

* The solution involved moving the `LatencyReportActor` from the default Akka dispatcher to the main "work" dispatcher.
* Because the "work" dispatcher already maintained a consistent flow of log processing tasks, the threads remained active and did not trigger the frequent park/unpark logic.
* A single-line configuration change was used to route the actor to the stable dispatcher.
* Following the change, the default dispatcher’s thread pool shrank from 32 to 2 threads, and overall service CPU usage dropped by an average of 30%.

To maintain optimal performance in applications using `ForkJoinPool` or Akka, developers should monitor the `ForkJoinPool.scan()` method; if it accounts for more than 10–15% of CPU usage, the thread pool is likely unstable. Recommendations for remediation include limiting the number of actor instances, capping the maximum threads in a pool, and utilizing task queues to buffer short spikes. The ultimate goal is to ensure a stable count of active threads and avoid the performance tax of frequent thread state transitions.
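
The 10–15% rule of thumb can be expressed as a small check over profiler output. This is a minimal sketch under stated assumptions: it supposes the profile has already been exported as a map from method name to CPU sample count. The class `ScanShareCheck`, that map format, and the method key string are hypothetical and do not correspond to any Datadog API.

```java
import java.util.Map;

// Hypothetical helper; the input format is an assumption, not a profiler API.
class ScanShareCheck {

    // Assumed key under which the profile export reports ForkJoinPool.scan().
    static final String SCAN_METHOD = "java.util.concurrent.ForkJoinPool.scan";

    /** Fraction of all CPU samples attributed to `method`; 0.0 if there are no samples. */
    static double share(Map<String, Long> samplesByMethod, String method) {
        long total = 0;
        for (long n : samplesByMethod.values()) {
            total += n;
        }
        if (total == 0) {
            return 0.0;
        }
        return (double) samplesByMethod.getOrDefault(method, 0L) / total;
    }

    /** True when scan() exceeds the given share of CPU samples (e.g. 0.15 for 15%). */
    static boolean looksUnstable(Map<String, Long> samplesByMethod, double threshold) {
        return share(samplesByMethod, SCAN_METHOD) > threshold;
    }
}
```

For example, a profile with 250 of 1,000 samples in `scan()` gives a share of 0.25, which `looksUnstable(profile, 0.15)` would flag. The threshold is passed in as a parameter because the text gives a range rather than a single cutoff.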
