ai-infrastructure

2 posts

meta

Zoomer: Powering AI Performance at Meta's Scale Through Intelligent Debugging and Optimization - Engineering at Meta

Zoomer is Meta's centralized, automated platform for diagnosing and resolving performance bottlenecks and GPU underutilization across its massive AI training and inference infrastructure. By integrating deep analytics with scalable data collection, the tool has become the internal standard for optimizing workloads ranging from Llama 3 training to large-scale ads recommendation engines. Ultimately, Zoomer enables significant energy savings and hardware efficiency gains, allowing Meta to accelerate model iteration and increase throughput across its global fleet of GPUs.

### The Three-Layered Architecture

* **Infrastructure and Platform Layer:** This foundation uses Meta's Manifold blob storage for trace data and fault-tolerant processing pipelines to manage massive trace files across thousands of hosts.
* **Analytics and Insights Engine:** This layer performs deep analysis using specialized tools such as Kineto for GPU traces, NVIDIA DCGM for hardware metrics, and Strobelight for CPU profiling. It automatically detects performance anti-patterns and provides actionable optimization recommendations.
* **Visualization and User Interface Layer:** The presentation layer transforms complex data into interactive timelines and heat maps. It integrates with Perfetto for kernel-level inspection and provides drill-down dashboards that highlight outliers across distributed GPU deployments.

### Automated Profiling and Data Capture

* **Trigger Mechanisms:** To ensure data accuracy, Zoomer automatically triggers profiling for training workloads during stable states (typically around iteration 550) to avoid startup noise, while inference workloads use on-demand or benchmark-integrated triggers (see the profiler sketch after this section).
* **Comprehensive Metrics:** The platform simultaneously collects GPU SM utilization, Tensor Core usage, memory bandwidth, and power consumption via DCGM (a polling sketch follows below).
* **System-Level Telemetry:** Beyond the GPU, Zoomer captures host-level data including CPU utilization, storage access patterns, and network I/O through dyno telemetry.
* **Distributed Communication:** For large-scale training, the tool analyzes NCCL collective operations and inter-node communication patterns to identify stragglers and network bottlenecks (a straggler heuristic is sketched below).

### Inference and Training Optimization

* **Inference Performance:** Zoomer tracks request/response latency, GPU memory allocation patterns, and Thrift request-level profiling to identify bottlenecks in serving user requests at scale.
* **Workflow Acceleration:** By correlating application-level annotations (such as forward/backward passes and optimizer steps) with hardware performance, developers can pinpoint exactly which part of a model's execution is inefficient.
* **Operational Impact:** These insights have led to significant improvements in Queries Per Second (QPS) for recommendation models and reduced training times for generative AI features by eliminating resource waste.

For organizations managing large-scale AI clusters, the Zoomer model suggests that the key to efficiency is moving away from manual, reactive debugging toward an "always-on" automated profiling system. Correlating high-level software phases with low-level hardware telemetry is essential for maximizing the return on investment for expensive GPU resources and maintaining rapid iteration cycles.
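Zoomer itself is internal to Meta, but the stable-state trigger and phase-annotation ideas map directly onto public tooling. Below is a minimal sketch using PyTorch's `torch.profiler` (built on Kineto, the trace source the post names): it skips the noisy startup iterations, captures a few steady-state steps, and labels the forward/backward/optimizer phases so trace events can be tied back to model code. The iteration count matches the post's ~550 figure; everything else (model, loader, output file) is illustrative.

```python
import torch
from torch.profiler import (ProfilerActivity, profile, record_function,
                            schedule)

STABLE_ITERATION = 550  # the post cites ~iteration 550 as a stable state

def train_with_profiling(model, optimizer, loader):
    # Skip startup noise, then profile 3 steady-state iterations (1 warmup).
    sched = schedule(skip_first=STABLE_ITERATION, wait=0, warmup=1,
                     active=3, repeat=1)
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        schedule=sched,
        on_trace_ready=lambda p: p.export_chrome_trace("trace.json"),
    ) as prof:
        for x, y in loader:
            # Application-level annotations: these phase names appear in the
            # trace, so hardware events correlate with model phases.
            with record_function("forward"):
                loss = torch.nn.functional.cross_entropy(model(x), y)
            with record_function("backward"):
                loss.backward()
            with record_function("optimizer_step"):
                optimizer.step()
                optimizer.zero_grad()
            prof.step()  # advance the profiling schedule by one iteration
```

The resulting Chrome trace can be opened in Perfetto, the same viewer the post mentions for kernel-level inspection.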
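On the metrics side, DCGM has its own bindings and field IDs, which this sketch does not reproduce; as a rough stand-in it polls the same classes of signals (SM utilization, memory, power) through NVML's Python bindings. The sampling loop and interval are arbitrary choices.

```python
import time

import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU on the host

for _ in range(10):  # sample once per second for ~10 seconds
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # percent busy
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)         # bytes
    watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W
    print(f"sm={util.gpu}% mem_ctrl={util.memory}% "
          f"mem_used={mem.used / 2**30:.1f}GiB power={watts:.0f}W")
    time.sleep(1)

pynvml.nvmlShutdown()
```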
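For the straggler analysis, here is a simple heuristic in the spirit of what the post describes (not Zoomer's actual method): all-gather each rank's step time and flag ranks far above the median. It assumes `torch.distributed` is already initialized; the 1.5x threshold is arbitrary.

```python
import torch.distributed as dist

def flag_stragglers(step_seconds: float, threshold: float = 1.5):
    """All-gather per-rank step times; rank 0 reports slow outliers."""
    world = dist.get_world_size()
    times = [None] * world
    dist.all_gather_object(times, step_seconds)  # one float per rank
    if dist.get_rank() == 0:
        median = sorted(times)[world // 2]
        for rank, t in enumerate(times):
            if t > threshold * median:
                print(f"straggler: rank {rank} took {t:.3f}s "
                      f"(median {median:.3f}s)")
```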

google

Exploring a space-based, scalable AI infrastructure system design

Project Suncatcher is a Google moonshot initiative aimed at scaling machine learning infrastructure by deploying solar-powered satellite constellations equipped with Tensor Processing Units (TPUs). By harvesting the near-continuous solar energy available in certain orbits and linking satellites with high-bandwidth free-space optical links, the project seeks to bypass the resource constraints of terrestrial data centers. Early research suggests that a modular, tightly clustered satellite design can achieve the compute density and communication speeds required for modern AI workloads.

### Data-Center Bandwidth via Optical Links

* To match terrestrial performance, inter-satellite links must support tens of terabits per second using multi-channel dense wavelength-division multiplexing (DWDM) and spatial multiplexing.
* The design closes the link budget (the accounting of signal power from transmitter to receiver) by keeping satellites in extremely close proximity, kilometers or less, versus the long ranges of traditional satellite deployments; a worked example of the distance scaling follows this section.
* Initial bench-scale demonstrations have achieved 800 Gbps each-way transmission (1.6 Tbps total) using a single transceiver pair, validating the feasibility of high-speed optical networking.

### Orbital Mechanics of Compact Constellations

* The proposed system uses a sun-synchronous low-Earth orbit (LEO) at an altitude of approximately 650 km to maximize solar exposure and minimize the weight of onboard batteries.
* Researchers use the Hill-Clohessy-Wiltshire equations and JAX-based differentiable models to manage the gravitational perturbations and atmospheric drag affecting satellites flying in tight 100–200 m formations (the linearized dynamics are sketched below).
* Simulations of 81-satellite clusters indicate that only modest station-keeping maneuvers are required to maintain stable, "free-fall" trajectories within the orbital plane.

### Hardware Resilience in Space Environments

* The project specifically tests Google's Trillium (v6e) Cloud TPUs to determine whether terrestrial AI accelerators can survive the radiation found in LEO.
* Hardware is subjected to 67 MeV proton beams to analyze the impact of Total Ionizing Dose (TID) and Single Event Effects (SEEs) on processing reliability.
* Preliminary testing indicates promising radiation tolerance for high-performance accelerators, suggesting that standard TPU architectures may be viable for orbital deployment with minimal modification.

While still in the research and development phase, Project Suncatcher suggests that the future of massive AI scaling may involve shifting infrastructure away from terrestrial limits and toward modular, energy-rich orbital environments. Organizations should monitor progress in free-space optical communication and radiation-tolerant accelerators, as these technologies will be the primary gatekeepers for space-based computation.
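The close-formation argument can be made concrete with the standard free-space path loss formula: received power between isotropic antennas falls off as 20·log10(distance), so every 10x reduction in separation buys back 20 dB of link margin. Real FSO terminals use highly directional optics, but the relative scaling is the point; the 1550 nm wavelength is a common telecom-band choice assumed here, not a figure from the paper.

```python
import math

WAVELENGTH_M = 1550e-9  # common telecom-band wavelength (assumption)

def fspl_db(distance_m: float, wavelength_m: float = WAVELENGTH_M) -> float:
    """Free-space path loss in dB: 20*log10(4*pi*d / lambda)."""
    return 20 * math.log10(4 * math.pi * distance_m / wavelength_m)

for d_km in (1, 10, 100, 1000):
    print(f"{d_km:>5} km: FSPL = {fspl_db(d_km * 1e3):.0f} dB")
# A ~1 km cluster enjoys ~60 dB more margin than a ~1000 km crosslink,
# which is why tight formation flying makes tens of Tbps plausible.
```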
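The formation-flying claim can likewise be sanity-checked with the Hill-Clohessy-Wiltshire equations the post names. This sketch integrates the linearized in-plane dynamics in plain Python (the team uses JAX-based differentiable models; this is only the textbook linear model), starting a satellite 100 m from the reference with the classic no-drift initial velocity, which yields a bounded relative ellipse rather than secular drift.

```python
import math

MU = 3.986004418e14       # Earth's gravitational parameter, m^3/s^2
ALT = 650e3               # ~650 km LEO altitude, from the post
A = 6371e3 + ALT          # orbit radius, spherical-Earth approximation
N = math.sqrt(MU / A**3)  # mean motion of the reference orbit, rad/s

def hcw_step(x, y, vx, vy, dt):
    """One Euler step of the in-plane HCW equations.
    x is the radial offset, y the along-track offset (meters)."""
    ax = 3 * N**2 * x + 2 * N * vy
    ay = -2 * N * vx
    return x + vx * dt, y + vy * dt, vx + ax * dt, vy + ay * dt

# 100 m radial offset with vy0 = -2*N*x0, the no-drift condition: the
# relative motion stays a bounded ellipse instead of drifting away.
x, y, vx, vy = 100.0, 0.0, 0.0, -2 * N * 100.0
for _ in range(int(2 * math.pi / N)):  # ~one orbital period in 1 s steps
    x, y, vx, vy = hcw_step(x, y, vx, vy, 1.0)
print(f"after one orbit: x = {x:.1f} m, y = {y:.1f} m (still bounded)")
```

Residual drag and higher-order gravity terms, which this linear model omits, are exactly what the post says demands modest station-keeping on top of the idealized free-fall solution.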