fault-tolerance

2 posts

aws

Introducing checkpointless and elastic training on Amazon SageMaker HyperPod | AWS News Blog

Amazon SageMaker HyperPod has introduced checkpointless and elastic training features to accelerate AI model development by minimizing infrastructure-related downtime. These advancements replace traditional, slow checkpoint-restart cycles with peer-to-peer state recovery and let training workloads scale dynamically with available compute capacity. By decoupling training progress from static hardware configurations, organizations can significantly reduce model time-to-market while maximizing cluster utilization.

**Checkpointless Training and Rapid State Recovery**

* Replaces the traditional five-stage recovery process (job termination, network setup, checkpoint retrieval, and so on), which can take up to an hour on self-managed clusters.
* Uses peer-to-peer state replication and in-process recovery so that healthy nodes can restore model state immediately without restarting the entire job.
* Incorporates technical optimizations such as collective-communications initialization and memory-mapped data loading to enable efficient data caching.
* Reduces recovery downtime by over 80% in internal studies of clusters with up to 2,000 GPUs, and was a core technology in the development of Amazon Nova models.

**Elastic Training and Automated Cluster Scaling**

* Lets AI workloads automatically expand into idle cluster capacity as it becomes available and contract when resources are needed for higher-priority tasks.
* Reduces manual intervention, saving hours of engineering time previously spent reconfiguring training jobs to match fluctuating compute availability.
* Optimizes total cost of ownership by keeping training momentum even as inference volumes peak and pull resources away from the training pool.
* Orchestrates these transitions through the HyperPod training operator, so model development is not disrupted by infrastructure changes.
For teams managing large-scale AI workloads, adopting these features can reclaim significant development time and lower operational costs by preventing idle cluster periods. Organizations scaling to thousands of accelerators should prioritize checkpointless training to mitigate the impact of hardware faults and maintain continuous training momentum.
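The peer-to-peer recovery idea described above can be illustrated with a toy simulation. This is a minimal sketch, not the HyperPod API: the `Rank`, `replicate`, and `recover` names are hypothetical, and real systems replicate sharded model and optimizer state over the cluster fabric rather than in-process dictionaries. The point is only the mechanism: a failed rank's state is restored from a peer's replica instead of a checkpoint reloaded from storage.

```python
# Illustrative sketch (hypothetical names, not the HyperPod API):
# peer-to-peer state recovery. Each rank keeps a replica of a neighbor's
# state; when a rank fails, it is restored from the surviving replica
# instead of reloading a checkpoint from remote storage.

import copy


class Rank:
    def __init__(self, rank_id, state):
        self.rank_id = rank_id
        self.state = state          # model/optimizer state for this rank
        self.peer_replica = None    # copy of a peer's state


def replicate(ranks):
    """Ring-replicate: each rank stores a copy of its left neighbor's state."""
    n = len(ranks)
    for i, rank in enumerate(ranks):
        rank.peer_replica = copy.deepcopy(ranks[(i - 1) % n].state)


def recover(ranks, failed_id):
    """Restore a failed rank's state from the right neighbor holding its replica."""
    holder = ranks[(failed_id + 1) % len(ranks)]
    ranks[failed_id].state = copy.deepcopy(holder.peer_replica)


# Simulate: 4 ranks mid-training, rank 2 loses its state to a hardware fault.
ranks = [Rank(i, {"step": 100, "shard": i}) for i in range(4)]
replicate(ranks)
ranks[2].state = None               # fault wipes rank 2
recover(ranks, failed_id=2)
print(ranks[2].state)               # → {'step': 100, 'shard': 2}
```

The ring layout is just one replica-placement choice; the contrast with checkpoint-restart is that no job teardown or storage round-trip sits on the recovery path.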

google

A colorful quantum future

Google Quantum AI researchers have successfully implemented "color codes" for quantum error correction on the superconducting Willow chip, presenting a more efficient alternative to the standard surface code. This approach uses a triangular geometry to reduce the number of physical qubits required per logical qubit while dramatically increasing the speed of logical operations. The results demonstrate that the system has crossed the performance threshold at which increasing the code distance suppresses logical error rates.

## Resource Efficiency through Triangular Geometry

* Unlike the square-shaped surface code, the color code uses a hexagonal tiling arranged in a triangular patch to encode logical information.
* This geometry requires significantly fewer physical qubits than a surface code to achieve the same "distance" (the number of physical errors needed to cause a logical error).
* Experiments comparing distance-3 and distance-5 color codes showed a 1.56× suppression in logical error rates at the higher distance, confirming the code's viability on current hardware.
* While the color code requires more complex decoding algorithms and deeper physical circuits, recent advances in decoders such as AlphaQubit have enabled the system to operate below the error-correction threshold.

## Accelerating Logical Gates

* Color codes allow many single-qubit logical operations to be executed in a single step (transversal gates), whereas surface codes often require multiple error-correction cycles.
* A logical Hadamard gate, for instance, executes in approximately 20 ns with a color code, nearly 1,000 times faster than the same operation on a surface code.
* Faster execution reduces the number of error-correction cycles an algorithm must endure, which indirectly lowers the physical-qubit requirements for maintaining logical stability.
* The research team verified these improvements through "logical randomized benchmarking," confirming high-fidelity execution of logical operations.

## Logical State Injection and Magic States

* The researchers demonstrated "state injection": preparing a physical qubit in a specific state and then expanding it into a protected logical state.
* This process is essential for creating "magic states" (T-states), which are needed for the arbitrary qubit rotations required by complex quantum algorithms.
* By moving states from the physical to the logical level, the color-code architecture provides a clear path toward the universal gate sets needed to outperform classical computers.

While the color code currently exhibits a lower error-suppression factor than the surface code, its advantages in hardware efficiency and gate speed suggest it may be the superior architecture for large-scale, fault-tolerant quantum computing as device hardware continues to improve.
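To get a rough sense of why a fixed suppression factor matters, the reported 1.56× per distance step (d → d+2) can be extrapolated to larger codes. This is back-of-envelope arithmetic, not data from the post: the distance-3 base error rate below is a made-up placeholder, and the model assumes the suppression factor stays constant as distance grows.

```python
# Back-of-envelope: how a constant error-suppression factor compounds
# with code distance. LAMBDA = 1.56 is the factor reported for the
# distance-3 -> distance-5 color-code comparison; EPS_D3 is a
# hypothetical placeholder, not a measured value.

LAMBDA = 1.56   # suppression per distance step (d -> d+2)
EPS_D3 = 1e-2   # hypothetical distance-3 logical error rate


def logical_error_rate(d, eps_d3=EPS_D3, lam=LAMBDA):
    """Project the logical error rate at odd code distance d >= 3."""
    steps = (d - 3) // 2
    return eps_d3 / lam ** steps


for d in (3, 5, 7, 11):
    print(f"d = {d:2d}: eps ~ {logical_error_rate(d):.2e}")
```

Because the suppression is exponential in distance, even a modest factor like 1.56 shrinks the logical error rate steadily as the code grows; a larger factor (as the surface code currently has) shrinks it faster, which is the trade-off against the color code's qubit and gate-speed advantages.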