distributed-training

3 posts

aws

Introducing checkpointless and elastic training on Amazon SageMaker HyperPod | AWS News Blog

Amazon SageMaker HyperPod has introduced checkpointless and elastic training features that accelerate AI model development by minimizing infrastructure-related downtime. These advancements replace slow, traditional checkpoint-restart cycles with peer-to-peer state recovery and let training workloads scale dynamically with available compute capacity. By decoupling training progress from static hardware configurations, organizations can significantly reduce model time-to-market while maximizing cluster utilization.

**Checkpointless Training and Rapid State Recovery**

* Replaces the traditional five-stage recovery process (job termination, network setup, checkpoint retrieval, and so on), which can take up to an hour on self-managed clusters.
* Uses peer-to-peer state replication and in-process recovery so that healthy nodes can restore the model state almost instantly without restarting the entire job; a conceptual sketch follows this summary.
* Incorporates technical optimizations such as faster collective-communications initialization and memory-mapped data loading for efficient data caching.
* Reduces recovery downtime by over 80% in internal studies on clusters with up to 2,000 GPUs, and was a core technology in the development of the Amazon Nova models.

**Elastic Training and Automated Cluster Scaling**

* Allows AI workloads to automatically expand into idle cluster capacity as it becomes available and contract when resources are needed for higher-priority tasks.
* Reduces the need for manual intervention, saving hours of engineering time previously spent reconfiguring training jobs to match fluctuating compute availability.
* Optimizes total cost of ownership by keeping training momentum even as inference volumes peak and pull resources away from the training pool.
* Orchestrates these transitions through the HyperPod training operator, so model development is not disrupted by infrastructure changes.

For teams managing large-scale AI workloads, adopting these features can reclaim significant development time and lower operational costs by preventing idle cluster periods. Organizations scaling to thousands of accelerators should prioritize checkpointless training to mitigate the impact of hardware faults and maintain continuous training momentum.
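To make the peer-to-peer recovery idea more concrete, here is a minimal Python sketch of buddy-style state replication under simplifying assumptions: each rank keeps an in-memory copy of one peer's training state, so a replacement node restores from that peer instead of pulling a checkpoint from storage. The `Rank`, `buddy_rank`, and `recover` names are hypothetical illustrations, not the SageMaker HyperPod API; the managed service coordinates recovery through the HyperPod training operator.

```python
# Conceptual sketch of peer-to-peer state replication; not the actual
# SageMaker HyperPod implementation. All names here are hypothetical.

def buddy_rank(rank: int, world_size: int) -> int:
    """Each rank's state is replicated to the next rank, wrapping around."""
    return (rank + 1) % world_size

class Rank:
    def __init__(self, rank: int, world_size: int):
        self.rank = rank
        self.world_size = world_size
        self.state = {"step": 0, "model_shard": f"params-{rank}"}
        self.peer_replica = None  # copy of the previous rank's state

    def replicate(self, ranks):
        # After each training step, push a copy of local state to the buddy rank.
        ranks[buddy_rank(self.rank, self.world_size)].peer_replica = dict(self.state)

def recover(failed: int, ranks):
    # A replacement node restores the failed rank's state from the peer that
    # holds its replica, skipping the slow checkpoint-retrieval path.
    holder = ranks[buddy_rank(failed, len(ranks))]
    ranks[failed].state = dict(holder.peer_replica)

if __name__ == "__main__":
    world = 4
    ranks = [Rank(r, world) for r in range(world)]
    for r in ranks:
        r.replicate(ranks)
    recover(failed=2, ranks=ranks)  # rank 2 restored from rank 3's in-memory copy
    print(ranks[2].state)
```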

meta

Zoomer: Powering AI Performance at Meta's Scale Through Intelligent Debugging and Optimization - Engineering at Meta

Zoomer is Meta’s centralized, automated platform for diagnosing performance bottlenecks and GPU underutilization across its massive AI training and inference infrastructure. By integrating deep analytics with scalable data collection, the tool has become the internal standard for optimizing workloads ranging from Llama 3 training to large-scale ads recommendation engines. Ultimately, Zoomer enables significant energy savings and hardware-efficiency gains, allowing Meta to accelerate model iteration and increase throughput across its global fleet of GPUs.

### The Three-Layered Architecture

* **Infrastructure and Platform Layer:** The foundation uses Meta’s Manifold blob storage for trace data and fault-tolerant processing pipelines to manage massive trace files across thousands of hosts.
* **Analytics and Insights Engine:** This layer performs deep analysis using specialized tools such as Kineto for GPU traces, NVIDIA DCGM for hardware metrics, and StrobeLight for CPU profiling. It automatically detects performance anti-patterns and provides actionable optimization recommendations.
* **Visualization and User Interface Layer:** The presentation layer transforms complex data into interactive timelines and heat maps. It integrates with Perfetto for kernel-level inspection and provides drill-down dashboards that highlight outliers across distributed GPU deployments.

### Automated Profiling and Data Capture

* **Trigger Mechanisms:** To ensure data accuracy, Zoomer automatically triggers profiling for training workloads during stable states (typically around iteration 550) to avoid startup noise, while inference workloads use on-demand or benchmark-integrated triggers; see the sketch after this summary.
* **Comprehensive Metrics:** The platform simultaneously collects GPU SM utilization, Tensor Core usage, memory bandwidth, and power consumption via DCGM.
* **System-Level Telemetry:** Beyond the GPU, Zoomer captures host-level data, including CPU utilization, storage access patterns, and network I/O, through dyno telemetry.
* **Distributed Communication:** For large-scale training, the tool analyzes NCCL collective operations and inter-node communication patterns to identify stragglers and network bottlenecks.

### Inference and Training Optimization

* **Inference Performance:** Zoomer tracks request/response latency, GPU memory-allocation patterns, and Thrift request-level profiling to identify bottlenecks in serving user requests at scale.
* **Workflow Acceleration:** By correlating application-level annotations, such as forward/backward passes and optimizer steps, with hardware performance, developers can pinpoint exactly which part of a model's execution is inefficient.
* **Operational Impact:** These insights have led to significant improvements in queries per second (QPS) for recommendation models and reduced training times for generative AI features by eliminating resource waste.

For organizations managing large-scale AI clusters, the Zoomer model suggests that the key to efficiency is moving away from manual, reactive debugging toward an "always-on" automated profiling system. Correlating high-level software phases with low-level hardware telemetry is essential for maximizing the return on investment in expensive GPU resources and maintaining rapid iteration cycles.
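As a rough illustration of the stable-state trigger idea, the sketch below profiles only a short window of iterations starting at step 550, skipping the noisy startup phase. The `Profiler` class and `training_loop` helper are hypothetical stand-ins; Zoomer's real collectors (Kineto, DCGM, StrobeLight) are Meta-internal and driven by its own orchestration.

```python
import time

# Hypothetical sketch of a stable-state profiling trigger, loosely modeled on
# the post's description of capturing training traces around iteration 550
# to avoid startup noise. None of this is Zoomer's actual code.

TRIGGER_STEP = 550   # first "stable" iteration to capture
CAPTURE_STEPS = 5    # how many consecutive iterations to trace

class Profiler:
    """Minimal stand-in that records wall-clock time per traced step."""
    def __init__(self):
        self.records = []

    def traced(self, iteration, step_fn):
        start = time.perf_counter()
        result = step_fn()
        self.records.append((iteration, time.perf_counter() - start))
        return result

def training_loop(train_step, num_steps):
    profiler = Profiler()
    for it in range(num_steps):
        if TRIGGER_STEP <= it < TRIGGER_STEP + CAPTURE_STEPS:
            profiler.traced(it, train_step)   # detailed capture inside the window
        else:
            train_step()                      # run untraced outside the window
    return profiler.records

if __name__ == "__main__":
    records = training_loop(lambda: sum(i * i for i in range(10_000)), num_steps=600)
    print(records)  # timings for iterations 550-554 only
```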

google

Differentially private machine learning at scale with JAX-Privacy

Google DeepMind and Google Research have announced the release of JAX-Privacy 1.0, a high-performance library designed to scale differentially private (DP) machine learning. By leveraging JAX’s native parallelization and functional programming model, the toolkit enables researchers to train large-scale foundation models while maintaining rigorous privacy guarantees. This version introduces modular components for advanced algorithms and empirical auditing, making private training both computationally efficient and verifiable across distributed environments.

### Scaling Differential Privacy with JAX

* The library is built directly on the JAX ecosystem, integrating seamlessly with Flax for neural network architectures and Optax for optimization.
* It utilizes JAX’s `vmap` for automatic vectorization and `shard_map` for single-program multiple-data (SPMD) parallelization, allowing DP primitives to scale across multiple accelerators; a minimal sketch of the `vmap` pattern follows this summary.
* By using just-in-time (JIT) compilation, the library mitigates the traditional performance overhead associated with per-example gradient clipping and noise addition.

### Core Components and Advanced Algorithms

* The toolkit provides fundamental building blocks for implementing standard DP algorithms like DP-SGD and DP-FTRL, including specialized modules for data batch construction.
* It supports state-of-the-art methods such as DP matrix factorization, which improves performance by injecting correlated noise across training iterations.
* Features like micro-batching and padding are included to handle the massive, variable-sized batches often required to achieve an optimal balance between privacy and model utility.

### Verification and Privacy Auditing

* JAX-Privacy incorporates rigorous privacy accounting based on Rényi Differential Privacy to provide precise tracking of privacy budgets.
* The library includes tools for empirical auditing, allowing developers to validate their privacy guarantees through techniques like membership inference attacks and data poisoning.
* The design ensures correctness in distributed settings, specifically focusing on consistent noise generation and gradient synchronization across clusters.

JAX-Privacy 1.0 is a robust solution for researchers and engineers who need to deploy production-grade private models. Its modular architecture and integration with high-performance computing primitives make it a primary choice for training foundation models on sensitive datasets without compromising on scalability or security.
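To show how `vmap` maps onto per-example gradient clipping, here is a minimal DP-SGD-style noisy gradient step written with raw JAX primitives. It is a sketch under simplifying assumptions (a toy linear model, a fixed clipping norm, Gaussian noise scaled by a noise multiplier) and does not reproduce JAX-Privacy's actual module layout or API.

```python
import jax
import jax.numpy as jnp

def loss_fn(params, x, y):
    # Toy per-example squared-error loss; a stand-in for a real model's loss.
    pred = x @ params["w"] + params["b"]
    return (pred - y) ** 2

def dp_noisy_grad(params, batch_x, batch_y, key, clip_norm=1.0, noise_mult=1.1):
    # Per-example gradients via vmap over the leading batch dimension.
    per_example_grads = jax.vmap(jax.grad(loss_fn), in_axes=(None, 0, 0))(
        params, batch_x, batch_y)

    # Clip each example's gradient to an L2 norm of at most clip_norm.
    def clip(g):
        leaves = jax.tree_util.tree_leaves(g)
        norm = jnp.sqrt(sum(jnp.sum(leaf ** 2) for leaf in leaves))
        scale = jnp.minimum(1.0, clip_norm / (norm + 1e-12))
        return jax.tree_util.tree_map(lambda leaf: leaf * scale, g)

    clipped = jax.vmap(clip)(per_example_grads)

    # Sum the clipped gradients, then add calibrated Gaussian noise per leaf.
    summed = jax.tree_util.tree_map(lambda g: jnp.sum(g, axis=0), clipped)
    leaves, treedef = jax.tree_util.tree_flatten(summed)
    keys = jax.random.split(key, len(leaves))
    noisy = jax.tree_util.tree_unflatten(treedef, [
        leaf + noise_mult * clip_norm * jax.random.normal(k, leaf.shape)
        for leaf, k in zip(leaves, keys)
    ])

    # Average over the batch to produce the update direction.
    batch_size = batch_x.shape[0]
    return jax.tree_util.tree_map(lambda g: g / batch_size, noisy)

if __name__ == "__main__":
    key = jax.random.PRNGKey(0)
    params = {"w": jnp.zeros((4,)), "b": jnp.zeros(())}
    x, y = jnp.ones((8, 4)), jnp.ones((8,))
    grads = dp_noisy_grad(params, x, y, key)
    print(jax.tree_util.tree_map(lambda g: g.shape, grads))
```

In the library itself, the batch-construction modules, privacy accounting, and `shard_map`-based distribution described in the post would take the place of this hand-rolled loop.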