We’re sharing details of the role backend aggregation (BAG) plays in building Meta’s gigawatt-scale AI clusters like Prometheus. BAG allows us to seamlessly connect thousands of GPUs across multiple data centers and regions. Our BAG implementation is connecting two different net…
Kakao has introduced Kanana-2, a series of language models that use a Mixture of Experts (MoE) architecture to keep inference costs low while retaining the capability of a much larger model. To stabilize pre-training of the largest model, at 155B parameters, the team adopted the Muon optimizer together with MuonClip, which guards against training instabilities. These developments reflect a strategic focus on balancing large-scale performance with "high-efficiency, low-cost" engineering.
### MoE Architecture and Scaling Strategy
* Kanana-2 models such as the 30B version (Kanana-2-30b-a3b) activate only 3B parameters during inference, which cuts compute per token without giving up the capacity of the larger model.
* The team is currently training a massive 155B parameter version (Kanana-2-155b-a17b) using FP8 training infrastructure, MuonClip, and Hyperparameter Transfer to ensure stable convergence.
* Custom-developed MoE kernels were integrated to reduce memory usage and increase training speed, and the loss curve remained highly stable even during constant learning rate phases.
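
The sparse activation described above can be illustrated with a minimal top-k routing sketch. This is a generic MoE router, not Kakao's implementation: the function names, the expert count, and the parameter sizes are all hypothetical and chosen only to show why the active parameter count is a fraction of the total.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of router logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Return [(expert_index, weight)] for the k highest-scoring experts.

    Weights are renormalized so the selected experts sum to 1.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def active_fraction(num_experts, k, expert_params, shared_params):
    """Fraction of all parameters that a single token actually uses."""
    total = shared_params + num_experts * expert_params
    active = shared_params + k * expert_params
    return active / total

# Hypothetical router output for one token over 8 experts.
routes = route_top_k([0.1, 2.0, -1.0, 1.5, 0.3, 0.0, -0.5, 0.9], k=2)

# Hypothetical sizes (arbitrary units): 8 experts of 10 each, 20 shared.
fraction = active_fraction(num_experts=8, k=2, expert_params=10, shared_params=20)
```

Only the selected experts run for a given token, so compute scales with `k` rather than with the total number of experts.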
### A Controlled Testbed for Mid- and Post-Training
* The Kanana-2-30b-a3b-base-2601 model was intentionally released without synthetic reasoning data to serve as a "clean" base for research.
* This model allows researchers to investigate phenomena like "Reasoning Trace Distribution Mismatch" and "Spurious Rewards" by providing a baseline unaffected by post-training interventions.
* By offering a high-quality Korean base model, Kakao aims to support the local AI community in conducting more rigorous experiments on mathematical and logical reasoning.
### Optimization with Muon and Polar Express
* Kakao shifted from the industry-standard AdamW optimizer to Muon, which updates parameters by orthogonalizing gradients rather than performing element-wise updates.
* To achieve more accurate orthogonalization, they implemented the Polar Express iterative algorithm instead of the standard Newton-Schulz method, aiming to reduce noise in weight updates during the latter stages of large-scale training.
* The optimization process also involved detailed adjustments to RMSNorm parameterization and learning rate (LR) management to ensure the model scales effectively.
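
To make "orthogonalizing gradients" concrete, here is a minimal sketch using the classical cubic Newton-Schulz iteration on a small matrix. It is not the Polar Express algorithm and not Kakao's code: Polar Express uses different iteration coefficients that are not reproduced here, and production Muon implementations run on GPU tensors rather than Python lists. The helper names and the example matrix are assumptions for illustration.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def newton_schulz_orthogonalize(g, steps=30):
    """Approximate the orthogonal polar factor of g.

    Uses the classical iteration X <- 1.5 X - 0.5 X X^T X, after scaling g
    by its Frobenius norm so that every singular value starts at or below 1.
    """
    fro = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(row_x, row_c)]
             for row_x, row_c in zip(x, xxt_x)]
    return x

# Hypothetical 2x2 "gradient" with unequal singular values.
grad = [[3.0, 1.0], [0.5, 2.0]]
update = newton_schulz_orthogonalize(grad)

# For an orthogonal result, update^T @ update is close to the identity.
gram = matmul(transpose(update), update)
```

The iteration pushes every singular value toward 1, so the resulting update has the same scale in every direction instead of being dominated by a few large gradient components.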
### Training Stability via MuonClip
* To address potential "logit explosion" in large-scale models, the team utilized MuonClip, a technique that clips attention logits to maintain stability.
* Because standard Flash Attention stores Max Logit values only on-chip, the team modified the Flash Attention kernels to extract and return these values for monitoring and clipping purposes.
* Stress tests run at deliberately high learning rates showed that MuonClip prevents training divergence while maintaining model performance.
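
The summary does not spell out the exact clipping rule, so the following is only a sketch under an assumption: the maximum attention logit is monitored, and when it exceeds a threshold the query and key projections are each rescaled by the same factor so the maximum lands back on the threshold. The function names and the threshold value of 100.0 are hypothetical.

```python
import math

def max_attention_logit(queries, keys):
    """Largest scaled dot-product logit over all query/key pairs."""
    d = len(queries[0])
    return max(
        sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
        for q in queries
        for k in keys
    )

def clip_scale(max_logit, threshold):
    """Factor to apply to both the query and key projections.

    Logits are bilinear in q and k, so scaling each side by sqrt(threshold /
    max_logit) scales every logit by threshold / max_logit.
    """
    if max_logit <= threshold:
        return 1.0
    return math.sqrt(threshold / max_logit)

# Hypothetical single-head example with one oversized query/key pair.
queries = [[30.0, 40.0]]
keys = [[30.0, 40.0], [1.0, -1.0]]
threshold = 100.0  # assumed value, not taken from the source

before = max_attention_logit(queries, keys)
s = clip_scale(before, threshold)
scaled_q = [[s * v for v in q] for q in queries]
scaled_k = [[s * v for v in k] for k in keys]
after = max_attention_logit(scaled_q, scaled_k)
```

This also shows why the kernel change matters: the rule needs the observed maximum logit, which a fused attention kernel would otherwise never write out.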
The development of Kanana-2 demonstrates that scaling to hundreds of billions of parameters requires more than just data; it necessitates deep architectural optimizations and custom kernel engineering. For organizations looking to train large-scale MoE models, adopting sophisticated orthogonalization optimizers and logit clipping mechanisms is highly recommended to ensure predictable and stable model convergence.
2025 was a breakout year for early-stage startups: founders launched more companies and generated revenue faster than ever. Delaware C corporations grew an average of 41% year over year for the past 6 months, according to the Delaware Division of Corporations. US investors deplo…
Zoomer is Meta’s centralized, automated platform designed to solve performance bottlenecks and GPU underutilization across its massive AI training and inference infrastructure. By integrating deep analytics with scalable data collection, the tool has become the internal standard for optimizing workloads ranging from Llama 3 training to large-scale ads recommendation engines. Ultimately, Zoomer enables significant energy savings and hardware efficiency gains, allowing Meta to accelerate model iteration and increase throughput across its global fleet of GPUs.
### The Three-Layered Architecture
* **Infrastructure and Platform Layer:** This foundation utilizes Meta’s Manifold blob storage for trace data and employs fault-tolerant processing pipelines to manage massive trace files across thousands of hosts.
* **Analytics and Insights Engine:** This layer performs deep analysis using specialized tools such as Kineto for GPU traces, NVIDIA DCGM for hardware metrics, and StrobeLight for CPU profiling. It automatically detects performance anti-patterns and provides actionable optimization recommendations.
* **Visualization and User Interface Layer:** The presentation layer transforms complex data into interactive timelines and heat maps. It integrates with Perfetto for kernel-level inspection and provides drill-down dashboards that highlight outliers across distributed GPU deployments.
### Automated Profiling and Data Capture
* **Trigger Mechanisms:** To ensure data accuracy, Zoomer automatically triggers profiling for training workloads during stable states (typically around iteration 550) to avoid startup noise, while inference workloads use on-demand or benchmark-integrated triggers.
* **Comprehensive Metrics:** The platform simultaneously collects GPU SM utilization, Tensor Core usage, memory bandwidth, and power consumption via DCGM.
* **System-Level Telemetry:** Beyond the GPU, Zoomer captures host-level data including CPU utilization, storage access patterns, and network I/O through dyno telemetry.
* **Distributed Communication:** For large-scale training, the tool analyzes NCCL collective operations and inter-node communication patterns to identify stragglers and network bottlenecks.
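
As a rough illustration of two ideas in this list, the sketch below gates profiling on a stable-state iteration window and flags ranks that arrive late at a collective. It is not Zoomer's logic: the function names, the window length, the tolerance, and the per-rank timing dictionary are hypothetical.

```python
from statistics import median

def should_profile(iteration, trigger_iteration=550, window=10):
    """True only inside a short window after the job has reached steady state."""
    return trigger_iteration <= iteration < trigger_iteration + window

def find_stragglers(arrival_ms, tolerance=1.2):
    """Return ranks whose arrival time exceeds tolerance * median.

    arrival_ms maps rank -> time (ms) the rank took to reach the collective.
    """
    typical = median(arrival_ms.values())
    return sorted(rank for rank, t in arrival_ms.items() if t > tolerance * typical)

# Hypothetical timings for four ranks entering the same collective.
arrival_ms = {0: 100.0, 1: 102.0, 2: 98.0, 3: 180.0}
stragglers = find_stragglers(arrival_ms)
```

Comparing against the median rather than the mean keeps one very slow rank from raising the baseline it is measured against.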
### Inference and Training Optimization
* **Inference Performance:** Zoomer tracks request/response latency, GPU memory allocation patterns, and Thrift request-level profiling to identify bottlenecks in serving user requests at scale.
* **Workflow Acceleration:** By correlating application-level annotations—such as forward/backward passes and optimizer steps—with hardware performance, developers can pinpoint exactly which part of a model's execution is inefficient.
* **Operational Impact:** These insights have led to significant improvements in Queries Per Second (QPS) for recommendation models and reduced training times for generative AI features by eliminating resource waste.
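
Correlating application-level annotations with hardware metrics can be sketched as a join between phase intervals and timestamped utilization samples. This is a simplified illustration, not Zoomer's data model: the tuple layouts, names, and sample values are hypothetical.

```python
def utilization_by_phase(phases, samples):
    """Average GPU utilization per annotated phase.

    phases:  list of (name, start_ms, end_ms), end exclusive
    samples: list of (timestamp_ms, utilization_percent)
    Phases with no samples are omitted from the result.
    """
    result = {}
    for name, start, end in phases:
        values = [util for t, util in samples if start <= t < end]
        if values:
            result[name] = sum(values) / len(values)
    return result

# Hypothetical annotations for one training iteration.
phases = [("forward", 0, 10), ("backward", 10, 30), ("optimizer", 30, 35)]
# Hypothetical utilization samples taken every 5 ms.
samples = [(0, 90.0), (5, 80.0), (10, 95.0), (15, 95.0),
           (20, 90.0), (25, 92.0), (30, 20.0)]

per_phase = utilization_by_phase(phases, samples)
least_utilized = min(per_phase, key=per_phase.get)
```

In this made-up trace the optimizer step stands out, which is the kind of signal that tells a developer where to look first.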
For organizations managing large-scale AI clusters, the Zoomer model suggests that the key to efficiency is moving away from manual, reactive debugging toward an "always-on" automated profiling system. Correlating high-level software phases with low-level hardware telemetry is essential for maximizing the return on investment for expensive GPU resources and maintaining rapid iteration cycles.
Project Suncatcher is a Google moonshot initiative aimed at scaling machine learning infrastructure by deploying solar-powered satellite constellations equipped with Tensor Processing Units (TPUs). By leveraging the nearly continuous energy of the sun in specific orbits and utilizing high-bandwidth free-space optical links, the project seeks to bypass the resource constraints of terrestrial data centers. Early research suggests that a modular, tightly clustered satellite design can achieve the necessary compute density and communication speeds required for modern AI workloads.
### Data-Center Bandwidth via Optical Links
* To match terrestrial performance, inter-satellite links must support tens of terabits per second using multi-channel dense wavelength-division multiplexing (DWDM) and spatial multiplexing.
* The system closes the link budget (the accounting of signal power gained and lost between transmitter and receiver) by flying satellites in extremely close proximity, kilometers or less, rather than at the long ranges of traditional satellite deployments.
* Initial bench-scale demonstrations have successfully achieved 800 Gbps each-way transmission (1.6 Tbps total) using a single transceiver pair, validating the feasibility of high-speed optical networking.
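
The benefit of close formation flying can be checked with simple arithmetic, assuming received power in free space falls off with the square of the distance. The sketch below compares a hypothetical 1,000 km link with a 1 km link and counts how many 800 Gbps each-way links would be needed for a hypothetical 10 Tbps each-way target. Both distances and the target are illustrative assumptions, not figures from the project.

```python
import math

def path_loss_change_db(d_far_m, d_near_m):
    """Extra free-space loss (dB) at d_far relative to d_near.

    Assumes inverse-square scaling of received power with distance.
    """
    return 20 * math.log10(d_far_m / d_near_m)

def links_needed(target_gbps, per_link_gbps):
    """Number of parallel links needed to reach a target one-way rate."""
    return math.ceil(target_gbps / per_link_gbps)

# Hypothetical comparison: 1,000 km versus 1 km separation.
extra_loss_db = path_loss_change_db(1_000_000, 1_000)

# Hypothetical 10 Tbps each-way target using 800 Gbps each-way links.
parallel_links = links_needed(10_000, 800)
```

Under these assumptions, shrinking the separation by a factor of 1,000 recovers a factor of one million in received power, which is what makes the multiplexed, high-rate links plausible.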
### Orbital Mechanics of Compact Constellations
* The proposed system utilizes a sun-synchronous low Earth orbit (LEO) at an altitude of approximately 650 km to maximize solar exposure and minimize the weight of onboard batteries.
* Researchers use the Hill-Clohessy-Wiltshire equations to describe the relative motion of satellites flying in tight 100–200 m formations, together with JAX-based differentiable models that handle the gravitational perturbations and atmospheric drag acting on them.
* Simulations of 81-satellite clusters indicate that only modest station-keeping maneuvers are required to maintain stable, "free-fall" trajectories within the orbital plane.
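
For intuition about formation flying, the sketch below evaluates the standard closed-form in-plane solution of the Hill-Clohessy-Wiltshire equations for a circular reference orbit at 650 km. It ignores perturbations and drag entirely and is not the project's JAX model. The 100 m radial offset, the function names, and the use of Earth's mean radius are assumptions for illustration.

```python
import math

MU_EARTH = 3.986004418e14  # Earth's gravitational parameter, m^3/s^2
R_EARTH = 6_371_000.0      # mean Earth radius, m (assumed for the example)

def mean_motion(altitude_m):
    """Angular rate (rad/s) of a circular orbit at the given altitude."""
    a = R_EARTH + altitude_m
    return math.sqrt(MU_EARTH / a ** 3)

def hcw_position(t, n, x0, y0, vx0, vy0):
    """In-plane relative position at time t.

    x is radial and y is along-track, both relative to the reference orbit.
    """
    s, c = math.sin(n * t), math.cos(n * t)
    x = (4 - 3 * c) * x0 + (s / n) * vx0 + (2 / n) * (1 - c) * vy0
    y = (6 * (s - n * t) * x0 + y0
         - (2 / n) * (1 - c) * vx0
         + ((4 * s - 3 * n * t) / n) * vy0)
    return x, y

n = mean_motion(650_000.0)
period_s = 2 * math.pi / n

# Choosing vy0 = -2 * n * x0 cancels the along-track drift term, so the
# neighbor traces a closed 2:1 ellipse around the reference satellite.
x0 = 100.0
vy0 = -2 * n * x0
quarter = hcw_position(period_s / 4, n, x0, 0.0, 0.0, vy0)
full = hcw_position(period_s, n, x0, 0.0, 0.0, vy0)
```

In this idealized model a satellite started with the drift-free condition returns to its initial relative position every orbit with no thrust, which is the sense in which the cluster is in "free fall"; real station-keeping only has to correct for what the linearized model leaves out.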
### Hardware Resilience in Space Environments
* The project specifically tests Google’s Trillium (v6e) Cloud TPUs to determine if terrestrial AI accelerators can survive the radiation found in LEO.
* Hardware is subjected to 67 MeV proton beams to analyze the impact of Total Ionizing Dose (TID) and Single Event Effects (SEEs) on processing reliability.
* Preliminary testing indicates promising results for the radiation tolerance of high-performance accelerators, suggesting that standard TPU architectures may be viable for orbital deployment with minimal modification.
While still in the research and development phase, Project Suncatcher suggests that the future of massive AI scaling may involve shifting infrastructure away from terrestrial limits and toward modular, energy-rich orbital environments. Organizations should monitor the progress of free-space optical communication and radiation-hardened accelerators as these technologies will be the primary gatekeepers for space-based computation.
Hack Week 2025: How these engineers liquid-cooled a GPU server Hack Week 2025 at Dropbox centered on the theme “Keep It Simple,” offering opportunities for innovation, experimentation, and finding smart solutions to complex challenges. With in-person hubs in San Francisco, Seatt…