
How we optimized our Akka application using Datadog’s Continuous Profiler | Datadog

Datadog engineers discovered a significant 20–30% CPU overhead in their Akka-based Java applications caused by inefficient thread management within the `ForkJoinPool`. Through continuous profiling, the team found that irregular task flows were forcing the runtime to waste cycles constantly parking and unparking threads. By migrating bursty actors to a dispatcher with a more stable workload, they achieved a major performance gain, illustrating how high-level framework abstractions can mask low-level resource bottlenecks.

### Identifying the Performance Bottleneck

* While running A/B tests on a new log-parsing algorithm, the team noticed that the expected CPU reductions did not materialize; in some cases, performance actually degraded.
* Flame graphs revealed that the application was spending a disproportionate amount of CPU time inside the `ForkJoinPool.scan()` and `Unsafe.park()` methods.
* A summary table of CPU usage by thread showed that the "work" pool was using only 1% of the CPU, while the default Akka dispatcher was the primary consumer of resources.
* The investigation narrowed the cause down to the `LatencyReportActor`, which handled latency metrics for log events.

### Analyzing the Root Cause of Thread Fluctuations

* The `ForkJoinPool` manages worker threads dynamically, calling `Unsafe.park()` to suspend idle threads and `Unsafe.unpark()` to resume them when the task load increases.
* The `LatencyReportActor` exhibited an irregular task flow, processing several hundred events within milliseconds and then remaining idle until the next second.
* Because the default dispatcher was configured with a thread pool equal to the number of processor cores (32), the system was waking up 32 threads every second for a tiny burst of work.
* This constant cycle of waking and suspending threads created massive CPU overhead through expensive native calls to the operating system's thread scheduler.
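The bursty pattern described above can be sketched in plain Java, without Akka. This is a minimal illustration, not Datadog's code: the pool size, burst size, and idle-gap length are assumed values, and `runBursts` is a hypothetical helper. The point is the shape of the workload: a spike of tiny tasks, then silence, repeated.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BurstyPoolDemo {
    // Submit short bursts of tiny tasks separated by idle gaps -- the task
    // pattern the article attributes to LatencyReportActor. The pool size,
    // burst size, and gap length here are illustrative assumptions.
    public static int runBursts(int bursts, int tasksPerBurst) throws InterruptedException {
        AtomicInteger processed = new AtomicInteger();
        ForkJoinPool pool = new ForkJoinPool(32); // sized like a 32-core default dispatcher

        for (int b = 0; b < bursts; b++) {
            for (int i = 0; i < tasksPerBurst; i++) {
                pool.execute(processed::incrementAndGet); // tiny unit of work
            }
            // Idle gap: the pool parks its now-idle workers (Unsafe.park under
            // the hood); the next burst must unpark them all again, which is
            // where the profiled CPU overhead comes from.
            TimeUnit.MILLISECONDS.sleep(200);
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        return processed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("tasks processed: " + runBursts(3, 300));
    }
}
```

Attaching a profiler to a run like this would be expected to show time accumulating in worker park/unpark transitions rather than in the tasks themselves, since each task does almost no work.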
### Implementing a Configuration-Based Fix

* The solution was to move the `LatencyReportActor` from the default Akka dispatcher to the main "work" dispatcher.
* Because the "work" dispatcher already maintained a consistent flow of log-processing tasks, its threads remained active and did not trigger the frequent park/unpark logic.
* A single-line configuration change was used to route the actor to the stable dispatcher.
* Following the change, the default dispatcher’s thread pool shrank from 32 to 2 threads, and overall service CPU usage dropped by an average of 30%.

To maintain optimal performance in applications using `ForkJoinPool` or Akka, developers should monitor the `ForkJoinPool.scan()` method; if it accounts for more than 10–15% of CPU usage, the thread pool is likely unstable. Recommended remediations include limiting the number of actor instances, capping the maximum threads in a pool, and using task queues to buffer short spikes. The ultimate goal is a stable count of active threads, avoiding the performance tax of frequent thread state transitions.
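As a concrete illustration of the kind of one-line dispatcher routing described above, here is a sketch of Akka HOCON configuration. The dispatcher id `work-dispatcher` and the actor path `/latencyReportActor` are assumed names for illustration; the article does not show its actual configuration.

```hocon
# Hypothetical application.conf sketch -- names are illustrative, not Datadog's.
work-dispatcher {
  type = Dispatcher
  executor = "fork-join-executor"
  fork-join-executor {
    parallelism-min = 8
    parallelism-max = 32   # cap threads so the pool size stays stable
  }
  throughput = 100
}

akka.actor.deployment {
  /latencyReportActor {
    # the "single-line change": route the actor onto the busy, stable dispatcher
    dispatcher = work-dispatcher
  }
}
```

Because the "work" dispatcher's threads are kept busy by the steady log-processing load, the relocated actor's once-a-second bursts no longer force a park/unpark cycle across an otherwise-idle pool.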