Making fetch happen: Building a general-purpose query and render scheduler | Datadog

Datadog replaced its complex, dashboard-specific scheduling system with a generalized, modular query and render scheduler to improve performance across all its web applications. By simplifying query heuristics and leveraging the Browser Scheduling API for renders, the engineering team achieved a more stable backend load and smoother UI interactions. This transition transformed a brittle set of rules into a scalable framework that optimizes resource utilization based on widget visibility and browser availability.

## Limitations of Legacy Scheduling

The original scheduling system was a complex web of over 20 interlinked heuristics that became difficult for developers to maintain or reason about. While it performed better than an unscheduled baseline, it suffered from several structural flaws:

* **Tight Coupling:** Query and render logic were unnecessarily linked; for example, fetches were sometimes delayed based on pending render tasks, even when throttling fetches wasn’t necessary.
* **Lack of Generalization:** The system was hardcoded specifically for dashboards, making it impossible to use the same optimization benefits for other widget-heavy products in the Datadog suite.
* **Inefficient Resource Management:** Renders were often delayed based on arbitrary data size rules rather than the actual real-time availability of the browser's CPU and memory resources.

## A Simplified Query Algorithm

To create a more predictable and efficient system, the team stripped away redundant rules (such as manual throttling for unfocused tabs, which modern browsers already handle) and moved to a streamlined query model. The new algorithm is governed by only six parameters:

* **Visibility Priority:** Fetches for widgets currently visible in the viewport are executed immediately to ensure a responsive user experience.
* **Fixed Time Windows:** Non-visible queries are ranked by enqueue time and processed in 2000ms windows with a limit of 10 tasks per window.
* **Error Reduction:** The more stable distribution of tasks significantly reduced "429 (Too many requests)" errors, leading to faster overall data loading since fewer retries are required.
* **Framework Integration:** This simplified logic was moved into a standard data-fetching framework, allowing any Datadog product using generalized components to benefit from the scheduler.

## Render Scheduling with the Browser Scheduling API

While the query scheduler handles data fetching, a separate render scheduler manages the impact on the browser’s main thread. By moving away from legacy heuristics and adopting the Browser Scheduling API, Datadog can now schedule tasks based on native browser priorities:

* **Prioritization:** The API allows developers to categorize tasks as `user-blocking`, `user-visible`, or `background`, ensuring the browser prioritizes critical UI updates while deferring heavy computations to idle periods.
* **Resource Awareness:** Unlike the old system, this API is natively aware of CPU and memory pressure, allowing the browser to manage execution timing more effectively than a JavaScript-based heuristic.
* **Future-Proofing:** Currently supported in Chromium and Firefox Nightly (with polyfills for others), this approach allows for mass updates to task priorities and the ability to abort stale tasks via `TaskController`.

Standardizing on a modular scheduling architecture allows engineering teams to optimize both network traffic and main-thread performance without the maintenance overhead of complex, custom rule sets. For high-density data applications, leveraging native browser APIs for task prioritization is recommended to ensure smooth rendering across varying hardware capabilities.
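To make the query algorithm described under "A Simplified Query Algorithm" more concrete, here is a minimal sketch of the two rules the summary spells out: visible widgets are fetched immediately, and non-visible queries are ranked by enqueue time and released in 2000ms windows of at most 10 tasks. This is not Datadog's implementation. The names `QueryTask`, `QueryScheduler`, and `tick` are hypothetical, and the sketch assumes that visible fetches do not count against the per-window limit, which the summary does not state.

```typescript
// Hypothetical shape of a pending fetch; not taken from Datadog's code.
interface QueryTask {
  id: string;
  visible: boolean;   // is the widget currently in the viewport?
  enqueuedAt: number; // enqueue timestamp in milliseconds
}

// The two values given in the text.
const WINDOW_MS = 2000;
const MAX_TASKS_PER_WINDOW = 10;

class QueryScheduler {
  private queue: QueryTask[] = [];
  private windowStart = Number.NEGATIVE_INFINITY;
  private dispatchedInWindow = 0;

  enqueue(task: QueryTask): void {
    this.queue.push(task);
  }

  /** Returns the tasks that may be fetched at time `now` and removes them from the queue. */
  tick(now: number): QueryTask[] {
    // Open a fresh window once the current one has elapsed.
    if (now - this.windowStart >= WINDOW_MS) {
      this.windowStart = now;
      this.dispatchedInWindow = 0;
    }

    // Visible widgets are released immediately.
    // Assumption: they do not consume the per-window budget.
    const visible = this.queue.filter((t) => t.visible);

    // Non-visible tasks are ranked oldest first and capped per window.
    const hidden = this.queue
      .filter((t) => !t.visible)
      .sort((a, b) => a.enqueuedAt - b.enqueuedAt);
    const budget = Math.max(0, MAX_TASKS_PER_WINDOW - this.dispatchedInWindow);
    const batch = hidden.slice(0, budget);
    this.dispatchedInWindow += batch.length;

    const ready = visible.concat(batch);
    this.queue = this.queue.filter((t) => ready.indexOf(t) === -1);
    return ready;
  }
}
```

A caller would invoke `tick(Date.now())` on a timer or whenever widget visibility changes and issue a fetch for each returned task. Passing the clock in as an argument keeps the sketch deterministic and easy to test.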
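The render-scheduling approach described under "Render Scheduling with the Browser Scheduling API" can be illustrated with `scheduler.postTask` and `TaskController`, the browser API that provides the three priorities named above. The following is a rough sketch, not Datadog's code. The `WidgetRenderState` type, the `priorityFor` mapping, and the `scheduleRender` helper are hypothetical, and the fallback branch is a placeholder that simply runs the task asynchronously where the API is missing (a real application would more likely load a polyfill).

```typescript
// The three priority levels named in the text.
type RenderPriority = "user-blocking" | "user-visible" | "background";

// Hypothetical description of a widget waiting to be rendered.
interface WidgetRenderState {
  visible: boolean;         // in the viewport
  userInteracting: boolean; // e.g. currently hovered or dragged
}

// Hypothetical mapping; the source does not say how Datadog assigns priorities.
function priorityFor(w: WidgetRenderState): RenderPriority {
  if (w.visible && w.userInteracting) return "user-blocking";
  if (w.visible) return "user-visible";
  return "background";
}

interface RenderHandle {
  done: Promise<void>;
  reprioritize(p: RenderPriority): void;
  cancel(): void;
}

function scheduleRender(render: () => void, priority: RenderPriority): RenderHandle {
  // Accessed through `any` because not every TypeScript lib version types this API.
  const g = globalThis as any;
  if (g.scheduler && typeof g.scheduler.postTask === "function" &&
      typeof g.TaskController === "function") {
    const controller = new g.TaskController({ priority });
    // Only the signal is passed, so the priority stays adjustable later
    // through controller.setPriority().
    const done: Promise<void> = g.scheduler.postTask(render, { signal: controller.signal });
    return {
      done, // rejects with an AbortError if the task is cancelled before it runs
      reprioritize: (p) => controller.setPriority(p),
      cancel: () => controller.abort(),
    };
  }
  // Placeholder fallback: no real prioritization, just asynchronous execution.
  let cancelled = false;
  const done = Promise.resolve().then(() => {
    if (!cancelled) render();
  });
  return {
    done,
    reprioritize: () => {},
    cancel: () => { cancelled = true; },
  };
}
```

One possible use: call `scheduleRender(draw, priorityFor(state))` when data arrives, `reprioritize("user-visible")` when the widget scrolls into view, and `cancel()` when the result has gone stale. Sharing a single `TaskController` signal across several tasks is what would allow their priorities to be updated together.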