Zoomer: Powering AI Performance at Meta's Scale Through Intelligent Debugging and Optimization (Engineering at Meta)

Zoomer is Meta’s centralized, automated platform for diagnosing and resolving performance bottlenecks and GPU underutilization across its massive AI training and inference infrastructure. By combining deep analytics with scalable data collection, it has become the internal standard for optimizing workloads ranging from Llama 3 training to large-scale ads recommendation engines. The payoff is significant energy savings and hardware efficiency gains, allowing Meta to accelerate model iteration and increase throughput across its global GPU fleet.

The Three-Layered Architecture

  • Infrastructure and Platform Layer: This foundation utilizes Meta’s Manifold blob storage for trace data and employs fault-tolerant processing pipelines to manage massive trace files across thousands of hosts.
  • Analytics and Insights Engine: This layer performs deep analysis using specialized tools such as Kineto for GPU traces, NVIDIA DCGM for hardware metrics, and StrobeLight for CPU profiling. It automatically detects performance anti-patterns and provides actionable optimization recommendations.
  • Visualization and User Interface Layer: The presentation layer transforms complex data into interactive timelines and heat maps. It integrates with Perfetto for kernel-level inspection and provides drill-down dashboards that highlight outliers across distributed GPU deployments.
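
Outside Meta, the kernel-level inspection this layer provides can be approximated with Perfetto's open-source Python API. The sketch below is a minimal example under assumptions: it loads a hypothetical Chrome/JSON-format GPU trace (gpu_trace.json, e.g., as exported by Kineto) and ranks slices by duration, roughly the kind of drill-down query a Zoomer dashboard would surface automatically.

```python
# Minimal sketch: inspect the heaviest slices in a GPU trace with Perfetto's
# trace_processor (pip install perfetto). The trace path is hypothetical;
# Zoomer's Manifold-backed ingestion is internal to Meta.
from perfetto.trace_processor import TraceProcessor

tp = TraceProcessor(trace="gpu_trace.json")  # Chrome/JSON or .pftrace both work

# PerfettoSQL view of the trace: every event is a row in the `slice` table,
# so outlier hunting becomes an ordinary SQL query.
rows = tp.query("""
    SELECT name, dur AS duration_ns
    FROM slice
    ORDER BY dur DESC
    LIMIT 10
""")
for row in rows:
    print(f"{row.name}: {row.duration_ns / 1e6:.2f} ms")

tp.close()
```

Exposing traces as SQL tables is what makes ad-hoc outlier queries cheap to build, which is presumably part of why a dashboard layer benefits from sitting on top of Perfetto.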

Automated Profiling and Data Capture

  • Trigger Mechanisms: To ensure data accuracy, Zoomer automatically triggers profiling for training workloads once they reach a stable state (typically around iteration 550) to avoid startup noise, while inference workloads use on-demand or benchmark-integrated triggers (see the profiler sketch after this list).
  • Comprehensive Metrics: The platform simultaneously collects GPU SM utilization, Tensor Core usage, memory bandwidth, and power consumption via DCGM.
  • System-Level Telemetry: Beyond the GPU, Zoomer captures host-level data including CPU utilization, storage access patterns, and network I/O through dyno telemetry.
  • Distributed Communication: For large-scale training, the tool analyzes NCCL collective operations and inter-node communication patterns to identify stragglers and network bottlenecks.
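
Outside Meta, the stable-state trigger pattern from the first bullet can be approximated with PyTorch's Kineto-backed profiler. The sketch below is a toy example, not Zoomer's actual trigger code: torch.profiler.schedule skips the first 550 iterations, then captures a short window and writes a trace file for later inspection.

```python
# Minimal sketch of an iteration-based profiling trigger using PyTorch's
# Kineto-backed profiler; the tiny linear model stands in for a real job.
import torch
from torch.profiler import (
    ProfilerActivity, profile, schedule, tensorboard_trace_handler,
)

# Skip startup noise and only capture once training is steady
# (the article cites roughly iteration 550 for training workloads).
prof_schedule = schedule(skip_first=550, wait=0, warmup=1, active=3, repeat=1)

model = torch.nn.Linear(128, 128)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],  # CUDA assumes a GPU host
    schedule=prof_schedule,
    on_trace_ready=tensorboard_trace_handler("./traces"),  # emits a Kineto JSON trace
) as prof:
    for step in range(560):
        x = torch.randn(32, 128)
        loss = model(x).sum()
        loss.backward()
        opt.step()
        opt.zero_grad()
        prof.step()  # advances the schedule; capture activates after iteration 550
```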

Inference and Training Optimization

  • Inference Performance: Zoomer tracks request/response latency, GPU memory allocation patterns, and Thrift request-level profiling to identify bottlenecks in serving user requests at scale.
  • Workflow Acceleration: By correlating application-level annotations (such as forward/backward passes and optimizer steps) with hardware performance, developers can pinpoint exactly which part of a model's execution is inefficient; a minimal annotation sketch follows this list.
  • Operational Impact: These insights have led to significant improvements in Queries Per Second (QPS) for recommendation models and reduced training times for generative AI features by eliminating resource waste.
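
To illustrate the annotation-to-hardware correlation outside Meta, torch.profiler.record_function attaches named spans to a profiler trace. The sketch below (a toy model, not Meta's instrumentation) labels the forward, backward, and optimizer phases so their costs can be aggregated and compared.

```python
# Minimal sketch: annotate training phases so they appear as named regions
# alongside kernel activity in the profiler output. Toy model, not Zoomer.
import torch
from torch.profiler import ProfilerActivity, profile, record_function

model = torch.nn.Linear(256, 256)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    x = torch.randn(64, 256)
    with record_function("forward"):
        loss = model(x).sum()
    with record_function("backward"):
        loss.backward()
    with record_function("optimizer_step"):
        opt.step()
        opt.zero_grad()

# Per-phase aggregation: the named spans make it obvious where an iteration
# actually spends its time.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```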

For organizations managing large-scale AI clusters, the Zoomer model suggests that the key to efficiency is moving away from manual, reactive debugging toward an "always-on" automated profiling system. Correlating high-level software phases with low-level hardware telemetry is essential for maximizing the return on investment for expensive GPU resources and maintaining rapid iteration cycles.