scheduling-algorithms

2 posts

google

Solving virtual machine puzzles: How AI is optimizing cloud computing

Google researchers have developed LAVA, a scheduling framework designed to optimize virtual machine (VM) allocation in large-scale data centers by accurately predicting and adapting to VM lifespans. By moving beyond static, one-time predictions toward a "continuous re-prediction" model based on survival analysis, the system significantly improves resource efficiency and reduces fragmentation. This approach allows cloud providers to solve the complex "bin packing" problem more effectively, leading to better capacity utilization and easier system maintenance.

### The Challenge of Long-Tailed VM Distributions

* Cloud workloads exhibit an extreme long-tailed distribution: while 88% of VMs live for less than an hour, these short-lived jobs consume only 2% of total resources.
* The rare VMs that run for 30 days or longer account for a massive fraction of compute resources, meaning their placement has a disproportionate impact on host availability.
* Poor allocation leads to "resource stranding," where a server's remaining capacity is too small or unbalanced to host new VMs, effectively wasting expensive hardware.
* Traditional machine learning models that provide only a single prediction at VM creation are often fragile, as a single misprediction can block a physical host from being cleared for maintenance or new tasks.

### Continuous Re-prediction via Survival Analysis

* Instead of predicting a single average lifetime, LAVA uses an ML model to generate a probability distribution over a VM's expected duration.
* The system employs "continuous re-prediction," asking how much longer a VM is expected to run given how long it has already survived (e.g., a VM that has run for five days is assigned a different remaining lifespan than a brand-new one); a small sketch of this conditional calculation appears after this summary.
* This adaptive approach allows the scheduling logic to automatically correct for initial mispredictions as more data about the VM's actual behavior becomes available over time.

### Novel Scheduling and Rescheduling Algorithms

* **Non-Invasive Lifetime Aware Scheduling (NILAS):** Currently deployed on Google’s Borg cluster manager, this algorithm ranks potential hosts by grouping VMs with similar expected exit times to increase the frequency of "empty hosts" available for maintenance.
* **Lifetime-Aware VM Allocation (LAVA):** This algorithm fills resource gaps on hosts containing long-lived VMs with jobs that are at least an order of magnitude shorter, so the short-lived VMs exit quickly without extending the host's overall occupation time (see the host-scoring sketch below).
* **Lifetime-Aware Rescheduling (LARS):** To minimize disruptions during defragmentation, LARS identifies and migrates the longest-lived VMs first while allowing short-lived VMs to finish their tasks naturally on the original host.

By integrating survival-analysis-based predictions into the core logic of data center management, cloud providers can transition from reactive scheduling to a proactive model. This system not only maximizes resource density but also ensures that the physical infrastructure remains flexible enough to handle large, resource-intensive provisioning requests and essential system updates.
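To make the continuous re-prediction idea concrete, here is a minimal Python sketch, not the production model, that treats the predicted lifetime as a discrete probability distribution over duration buckets and computes the expected *remaining* lifetime conditioned on how long the VM has already survived. The bucket edges, probabilities, and function name are illustrative assumptions, not details from the post.

```python
from bisect import bisect_right

def expected_remaining_lifetime(bucket_edges, probs, age_hours):
    """Expected remaining lifetime (hours) given the VM has survived `age_hours`.

    bucket_edges: upper edges of lifetime buckets, in hours (ascending).
    probs:        predicted probability mass for each bucket (sums to ~1).
    age_hours:    how long the VM has already been running.

    Survival-analysis view: the conditional distribution keeps only the mass of
    buckets the VM has not yet outlived, renormalised to sum to 1.
    """
    # Keep only buckets whose upper edge is beyond the VM's current age.
    start = bisect_right(bucket_edges, age_hours)
    surviving_mass = sum(probs[start:])
    if surviving_mass == 0:
        # The VM has outlived every bucket; assume it is about to end.
        return 0.0
    remaining = 0.0
    for edge, p in zip(bucket_edges[start:], probs[start:]):
        # Use the bucket's upper edge as a point estimate of total lifetime.
        remaining += (edge - age_hours) * (p / surviving_mass)
    return remaining

# Illustrative distribution: most VMs are short-lived, a few run for weeks.
edges = [1, 6, 24, 24 * 7, 24 * 30]          # 1h, 6h, 1d, 1w, 30d
mass  = [0.70, 0.15, 0.08, 0.04, 0.03]

print(expected_remaining_lifetime(edges, mass, age_hours=0))    # brand-new VM
print(expected_remaining_lifetime(edges, mass, age_hours=120))  # already ran 5 days
```

The key behavior shows in the two calls: the brand-new VM's expected remaining lifetime is dominated by the short buckets, while the five-day-old VM's estimate jumps sharply because the short buckets have been ruled out by its own survival.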

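As a rough illustration of the lifetime-aware placement idea, grouping VMs by expected exit time and filling gaps on long-lived hosts with much shorter jobs, the following Python sketch scores candidate hosts for a new VM. The `Host`/`VM` structures and the scoring heuristic are invented for illustration; the actual NILAS/LAVA scoring used in Borg is not spelled out in this summary.

```python
from dataclasses import dataclass, field

@dataclass
class VM:
    vcpus: int
    expected_exit_hours: float   # predicted remaining lifetime

@dataclass
class Host:
    capacity: int
    vms: list = field(default_factory=list)

    def free(self) -> int:
        return self.capacity - sum(vm.vcpus for vm in self.vms)

    def expected_empty_at(self) -> float:
        # The host becomes empty when its longest-lived VM exits.
        return max((vm.expected_exit_hours for vm in self.vms), default=0.0)

def score_host(host: Host, new_vm: VM) -> float:
    """Lower score = better placement (illustrative heuristic, not Borg's)."""
    if host.free() < new_vm.vcpus:
        return float("inf")                     # does not fit at all
    empty_at = host.expected_empty_at()
    if new_vm.expected_exit_hours <= empty_at:
        # The new VM exits before the host would empty anyway: it fills a gap
        # without extending the host's occupation time. Hosts anchored by much
        # longer-lived VMs (order-of-magnitude gap) score best.
        return new_vm.expected_exit_hours / max(empty_at, 1e-9)
    # Otherwise the new VM extends the host's lifetime; penalise by how much.
    return 1.0 + (new_vm.expected_exit_hours - empty_at)

def place(hosts: list, new_vm: VM) -> Host:
    best = min(hosts, key=lambda h: score_host(h, new_vm))
    best.vms.append(new_vm)
    return best

hosts = [Host(capacity=32, vms=[VM(16, 24 * 30)]),   # anchored by a month-long VM
         Host(capacity=32, vms=[VM(8, 2)])]          # about to empty out
chosen = place(hosts, VM(vcpus=4, expected_exit_hours=1))
print(hosts.index(chosen))   # the short job lands on the long-lived host (index 0)
```

The design choice mirrors the summary: the one-hour job is steered onto the host already occupied for a month, leaving the nearly empty host free to drain for maintenance or large provisioning requests.
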
google

Load balancing with random job arrivals

Research from Google explores the competitive ratio of online load balancing when tasks arrive in a uniformly random order rather than an adversarial one. By analyzing a "tree balancing game" in which edges must be oriented to minimize node indegree, the authors show that random arrival orders still impose significant mathematical limitations on deterministic algorithms. The study concludes that no online algorithm can achieve a competitive ratio significantly better than $\sqrt{\log n}$, establishing new theoretical boundaries for efficient cluster management.

### The Online Load Balancing Challenge

* Modern cluster management systems, such as Google’s Borg, must distribute hundreds of thousands of jobs across machines to maximize utilization and minimize the maximum load (makespan).
* In the online version of this problem, jobs arrive one by one, and the system must assign each immediately without knowing what future jobs will look like.
* These algorithms are traditionally evaluated with "competitive analysis," which compares the performance of an online algorithm against an optimal offline algorithm that knows the full job sequence in advance.

### The Tree Balancing Game

* The problem is modeled as a game in which an adversary presents the edges of a tree (representing jobs and machines) one at a time.
* For every undirected edge $(u, v)$ presented, the algorithm must choose an orientation ($u \to v$ or $v \to u$), with the goal of minimizing the maximum number of edges pointing at any single node.
* Under a worst-case adversarial arrival order, it has been known since the 1990s that no deterministic algorithm can guarantee a maximum indegree below $\log n$, where $n$ is the number of nodes.

### Performance Under Random Arrival Orders

* The research focuses on "random order arrivals," where every permutation of the job sequence is equally likely, modeling a more natural arrival pattern than a malicious adversary.
* While it had been assumed that a simple greedy algorithm (assigning each job to the machine with the currently lower load) performs better in this model, this research proves a new, stricter lower bound; a toy simulation of the greedy rule follows this summary.
* The authors demonstrate that even with random arrivals, any online algorithm still incurs a maximum load of at least order $\sqrt{\log n}$.
* For more general load balancing scenarios beyond trees, the researchers established a lower bound of $\sqrt{\log \log n}$.

### Practical Implications

These findings suggest that while random job arrival provides a slight performance advantage over adversarial scenarios, system designers cannot rely on randomness alone to eliminate load imbalances. Because the maximum load grows predictably according to the $\sqrt{\log n}$ limit, large-scale systems must be architected to handle this inherent growth in resource pressure to maintain high utilization and stability.
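To make the tree balancing game concrete, here is a small Python simulation of the natural greedy rule, orienting each arriving edge toward the endpoint with the currently lower indegree, on a random tree whose edges arrive in uniformly random order. The tree construction and the greedy rule are standard textbook choices used only for illustration; the lower-bound constructions in the paper are far more intricate than anything shown here.

```python
import random

def greedy_max_indegree(n: int, seed: int = 0) -> int:
    """Play the tree balancing game greedily on a random tree with n nodes.

    Edges arrive in uniformly random order; each edge (u, v) is oriented
    toward whichever endpoint currently has the smaller indegree. Returns
    the maximum indegree at the end (the online algorithm's "load").
    """
    rng = random.Random(seed)
    # Build a random tree: attach each new node to a uniformly random earlier node.
    edges = [(rng.randrange(v), v) for v in range(1, n)]
    rng.shuffle(edges)                     # uniformly random arrival order
    indegree = [0] * n
    for u, v in edges:
        # Greedy rule: point the edge at the less-loaded endpoint.
        target = u if indegree[u] <= indegree[v] else v
        indegree[target] += 1
    return max(indegree)

for n in (1_000, 10_000, 100_000):
    print(n, greedy_max_indegree(n))
```

Offline, any tree can be oriented so that every node has indegree at most 1 (point each edge away from an arbitrary root), so the greedy algorithm's final maximum indegree here is essentially its competitive ratio; the result summarized above says that no online rule can keep this value below roughly $\sqrt{\log n}$ even in this random-order setting.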