How we optimized our Akka application using Datadog’s Continuous Profiler | Datadog
Datadog engineers discovered a significant 20–30% CPU overhead in their Akka-based Java applications caused by inefficient thread management within the ForkJoinPool. Through continuous profiling, the team found that irregular task flows were forcing the runtime to waste cycles constantly parking and unparking threads. By migrating bursty actors to a dispatcher with a more stable workload, they achieved a major performance gain, illustrating how high-level framework abstractions can mask low-level resource bottlenecks.
Identifying the Performance Bottleneck
- While running A/B tests on a new log-parsing algorithm, the team noticed that expected CPU reductions did not materialize; in some cases, performance actually degraded.
- Flame graphs revealed that the application was spending a disproportionate amount of CPU time inside the ForkJoinPool.scan() and Unsafe.park() methods.
- A summary table of CPU usage by thread showed that the "work" pool was using only 1% of the CPU, while the default Akka dispatcher was the primary consumer of resources.
- The investigation narrowed the cause down to the LatencyReportActor, which handled latency metrics for log events.
Analyzing the Root Cause of Thread Fluctuations
- The ForkJoinPool manages worker threads dynamically, calling Unsafe.park() to suspend idle threads and Unsafe.unpark() to resume them when tasks increase.
- The LatencyReportActor exhibited an irregular task flow, processing several hundred events in milliseconds and then remaining idle until the next second.
- Because the default dispatcher was configured with a thread pool equal to the number of processor cores (32), the system was waking up 32 threads every second for a tiny burst of work.
- This constant cycle of waking and suspending threads created massive CPU overhead through expensive native calls to the operating system's thread scheduler.
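The park/unpark cycle described above can be reproduced in miniature with a plain ForkJoinPool. This is a sketch, not Datadog's code: the pool size, task count, and timings are illustrative. It shows workers waking for a short burst of tiny tasks and parking again once the pool goes quiet.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.TimeUnit;

public class BurstDemo {
    public static void main(String[] args) throws Exception {
        // Fixed parallelism, analogous to the 32-thread default dispatcher
        // (scaled down here for the example).
        ForkJoinPool pool = new ForkJoinPool(4);

        // Burst: a few hundred tiny tasks arrive at once, waking the workers.
        for (int i = 0; i < 300; i++) {
            pool.execute(() -> { /* trivial work, finishes in microseconds */ });
        }
        pool.awaitQuiescence(1, TimeUnit.SECONDS);

        // Idle period: with nothing to do, the workers park again and the
        // active-thread estimate drops back to zero.
        Thread.sleep(500);
        System.out.println("active after idle: " + pool.getActiveThreadCount());
        pool.shutdown();
    }
}
```

In a bursty actor like the one profiled, this wake/park transition happens every second, and the native calls behind it are what showed up in the flame graphs.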
Implementing a Configuration-Based Fix
- The solution involved moving the LatencyReportActor from the default Akka dispatcher to the main "work" dispatcher.
- Because the "work" dispatcher already maintained a consistent flow of log-processing tasks, the threads remained active and did not trigger the frequent park/unpark logic.
- A single-line configuration change was used to route the actor to the stable dispatcher.
- Following the change, the default dispatcher’s thread pool shrank from 32 to 2 threads, and overall service CPU usage dropped by an average of 30%.
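The article does not show the exact change, but in Akka a dispatcher reassignment like this is typically a one-line entry in application.conf. The actor path and dispatcher name below are hypothetical:

```hocon
akka.actor.deployment {
  # Route the bursty actor onto the already-busy "work" dispatcher
  # (path and dispatcher name are illustrative, not from the article).
  /latencyReportActor {
    dispatcher = work-dispatcher
  }
}
```

The equivalent can also be done in code with Props.withDispatcher("work-dispatcher"), which keeps the actor's logic untouched while changing which thread pool runs it.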
To maintain optimal performance in applications using ForkJoinPool or Akka, developers should monitor the ForkJoinPool.scan() method; if it accounts for more than 10–15% of CPU usage, the thread pool is likely unstable. Recommendations for remediation include limiting the number of actor instances, capping the maximum threads in a pool, and utilizing task queues to buffer short spikes. The ultimate goal is to ensure a stable count of active threads and avoid the performance tax of frequent thread state transitions.
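For the "cap the maximum threads in a pool" recommendation, Akka's fork-join-executor settings provide the knobs directly. A hedged sketch, with illustrative values rather than recommendations:

```hocon
# Hypothetical dispatcher for bursty actors: caps worker threads so a
# short spike cannot wake dozens of cores at once.
bursty-dispatcher {
  type = Dispatcher
  executor = "fork-join-executor"
  fork-join-executor {
    parallelism-min = 1
    parallelism-factor = 0.25   # threads = available cores * factor
    parallelism-max = 2         # hard ceiling regardless of core count
  }
  throughput = 100  # let one thread drain up to 100 queued messages before yielding
}
```

A small parallelism-max keeps the active-thread count stable, and a higher throughput lets a single thread absorb a burst from its queue instead of fanning the work out across freshly unparked threads.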