Kanana-2 Development Story ( (opens in new tab)
Kakao has introduced Kanana-2, a series of language models utilizing a Mixture of Experts (MoE) architecture to achieve high intelligence while maintaining low inference costs. To support the stable pre-training of their largest 155B parameter model, the team implemented advanced technical stacks including the Muon optimizer and MuonClip to prevent training instabilities. These developments reflect a strategic focus on balancing large-scale performance with "high-efficiency, low-cost" engineering. ### MoE Architecture and Scaling Strategy * Kanana-2 models, such as the 32B version, activate only 3B parameters during inference to maximize computational efficiency without sacrificing the intelligence of a larger model. * The team is currently training a massive 155B parameter version (Kanana-2-155b-a17b) using FP8 training infrastructure, MuonClip, and Hyperparameter Transfer to ensure stable convergence. * Custom-developed MoE kernels were integrated to reduce memory usage and increase training speed, resulting in a highly stable Loss Curve even during constant learning rate phases. ### A Controlled Testbed for Mid- and Post-Training * The Kanana-2-30b-a3b-base-2601 model was intentionally released without synthetic reasoning data to serve as a "clean" base for research. * This model allows researchers to investigate phenomena like "Reasoning Trace Distribution Mismatch" and "Spurious Rewards" by providing a baseline unaffected by post-training interventions. * By offering a high-quality Korean base model, Kakao aims to support the local AI community in conducting more rigorous experiments on mathematical and logical reasoning. ### Optimization with Muon and Polar Express * Kakao shifted from the industry-standard AdamW optimizer to Muon, which updates parameters by orthogonalizing gradients rather than performing element-wise updates. * To achieve more accurate orthogonalization, they implemented the Polar Express iterative algorithm instead of the standard Newton-Schulz method, aiming to reduce noise in weight updates during the latter stages of large-scale training. * The optimization process also involved detailed adjustments to RMSNorm parameterization and learning rate (LR) management to ensure the model scales effectively. ### Training Stability via MuonClip * To address potential "logit explosion" in large-scale models, the team utilized MuonClip, a technique that clips attention logits to maintain stability. * Because standard Flash Attention stores Max Logit values only on-chip, the team modified the Flash Attention kernels to extract and return these values for monitoring and clipping purposes. * Stress tests conducted with high learning rates proved that MuonClip prevents training divergence and maintains performance levels even when the model is pushed to its limits. The development of Kanana-2 demonstrates that scaling to hundreds of billions of parameters requires more than just data; it necessitates deep architectural optimizations and custom kernel engineering. For organizations looking to train large-scale MoE models, adopting sophisticated orthogonalization optimizers and logit clipping mechanisms is highly recommended to ensure predictable and stable model convergence.