mixture-of-experts

3 posts

kakao

Kanana-2 Development Story

Kakao has introduced Kanana-2, a series of language models utilizing a Mixture of Experts (MoE) architecture to achieve high intelligence while maintaining low inference costs. To support the stable pre-training of their largest 155B parameter model, the team implemented advanced technical stacks including the Muon optimizer and MuonClip to prevent training instabilities. These developments reflect a strategic focus on balancing large-scale performance with "high-efficiency, low-cost" engineering.

### MoE Architecture and Scaling Strategy

* Kanana-2 models, such as the 32B version, activate only 3B parameters during inference to maximize computational efficiency without sacrificing the intelligence of a larger model.
* The team is currently training a massive 155B parameter version (Kanana-2-155b-a17b) using FP8 training infrastructure, MuonClip, and Hyperparameter Transfer to ensure stable convergence.
* Custom-developed MoE kernels were integrated to reduce memory usage and increase training speed, resulting in a highly stable Loss Curve even during constant learning rate phases.

### A Controlled Testbed for Mid- and Post-Training

* The Kanana-2-30b-a3b-base-2601 model was intentionally released without synthetic reasoning data to serve as a "clean" base for research.
* This model allows researchers to investigate phenomena like "Reasoning Trace Distribution Mismatch" and "Spurious Rewards" by providing a baseline unaffected by post-training interventions.
* By offering a high-quality Korean base model, Kakao aims to support the local AI community in conducting more rigorous experiments on mathematical and logical reasoning.

### Optimization with Muon and Polar Express

* Kakao shifted from the industry-standard AdamW optimizer to Muon, which updates parameters by orthogonalizing gradients rather than performing element-wise updates.
* To achieve more accurate orthogonalization, they implemented the Polar Express iterative algorithm instead of the standard Newton-Schulz method, aiming to reduce noise in weight updates during the latter stages of large-scale training.
* The optimization process also involved detailed adjustments to RMSNorm parameterization and learning rate (LR) management to ensure the model scales effectively.

### Training Stability via MuonClip

* To address potential "logit explosion" in large-scale models, the team utilized MuonClip, a technique that clips attention logits to maintain stability.
* Because standard Flash Attention stores Max Logit values only on-chip, the team modified the Flash Attention kernels to extract and return these values for monitoring and clipping purposes.
* Stress tests conducted with high learning rates proved that MuonClip prevents training divergence and maintains performance levels even when the model is pushed to its limits.

The development of Kanana-2 demonstrates that scaling to hundreds of billions of parameters requires more than just data; it necessitates deep architectural optimizations and custom kernel engineering. For organizations looking to train large-scale MoE models, adopting sophisticated orthogonalization optimizers and logit clipping mechanisms is highly recommended to ensure predictable and stable model convergence.
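
To make the "orthogonalizing gradients" idea from the Muon section concrete, here is a minimal NumPy sketch of the classic cubic Newton-Schulz iteration, the baseline method the post says Kakao replaced with Polar Express. It is not Kakao's code: the function name, the toy gradient, the step count, and the learning rate are all assumptions chosen for illustration, and production Muon implementations typically use tuned higher-order polynomials rather than this textbook form.

```python
import numpy as np

def newton_schulz_orthogonalize(grad, steps=20):
    """Approximate the orthogonal polar factor of `grad` (hypothetical helper).

    Each step maps every singular value s to 1.5*s - 0.5*s**3, which
    converges to 1 for any s in (0, sqrt(3)).
    """
    # Dividing by the Frobenius norm guarantees all singular values are <= 1.
    x = grad / (np.linalg.norm(grad) + 1e-12)
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T) @ x
    return x

# Toy 2x3 "gradient" matrix (illustrative values only).
grad = np.array([[2.0, 0.0, 1.0],
                 [0.0, 1.0, -1.0]])
update = newton_schulz_orthogonalize(grad)

# Reference answer: the exact polar factor U @ V^T from an SVD.
u, _, vt = np.linalg.svd(grad, full_matrices=False)
exact = u @ vt

# A Muon-style step applies the orthogonalized matrix instead of the raw
# gradient; the learning rate 0.02 is an arbitrary illustrative value.
weights = np.zeros_like(grad)
weights -= 0.02 * update
```

The iteration only rescales singular values and leaves the singular vectors untouched, which is why the result matches the SVD-based polar factor. Replacing the SVD with a handful of matrix multiplications is what makes this kind of update practical at scale; how accurately and quickly the polynomial drives the singular values to 1 is the part that methods such as Polar Express aim to improve.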

kakao

Smarter and More

Kakao has released Kanana-2, a high-performance open-source language model specifically engineered to power Agentic AI by enhancing tool-calling and instruction-following capabilities. Surpassing its predecessors and rivaling global frontier models like Qwen3, Kanana-2 offers a versatile suite of variants designed for practical, high-efficiency application in complex service environments.

### Optimized Model Lineup: Base, Instruct, and Thinking

* **Kanana-2-30b-a3b-base:** Provided as a foundational model with pre-training weights, allowing researchers to fine-tune the model using their own datasets.
* **Kanana-2-30b-a3b-instruct:** A version optimized through post-training to maximize the model's ability to follow complex user instructions accurately.
* **Kanana-2-30b-a3b-thinking:** Kakao’s first reasoning-specialized model, designed for tasks requiring high-level logical thinking, such as mathematics and coding.

### Strengthening Agentic AI Capabilities

* **Tool Calling:** Multi-turn tool-calling performance has improved more than threefold compared to Kanana-1.5, significantly enhancing its utility with the Model Context Protocol (MCP).
* **Instruction Following:** The model's ability to understand and execute multi-step, complex user requirements has been refined to ensure reliable task completion.
* **Reasoning-Tool Integration:** Unlike many reasoning models that lose instruction-following quality during deep thought, the "Thinking" variant maintains high performance in both logical deduction and tool use.

### High-Efficiency Architecture for Scale

* **MLA (Multi-head Latent Attention):** Compresses memory usage to handle long contexts more efficiently, reducing the resources needed for extensive data processing.
* **MoE (Mixture of Experts):** Activates only the necessary parameters during inference, maintaining high performance while drastically reducing computational costs and response times.
* **Improved Tokenization:** A newly trained tokenizer has improved Korean language token efficiency by 30%, enabling faster throughput and lower latency in high-traffic environments like KakaoTalk.

### Expanded Multilingual Support

* **Broad Linguistic Reach:** The model has expanded its support from just Korean and English to include six languages: Korean, English, Japanese, Chinese, Thai, and Vietnamese.

By open-sourcing Kanana-2, Kakao provides a robust foundation for developers seeking to build responsive, tool-integrated AI services. Its focus on practical efficiency and advanced reasoning makes it an ideal choice for implementing agentic workflows in real-world applications where speed and accuracy are critical.
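
The MoE bullet above says only the necessary parameters are activated at inference time. As a rough illustration of what that means mechanically, the following NumPy sketch routes one token to its top-k experts and mixes their outputs. Everything here is hypothetical: the function name, the tiny dimensions, the single-layer tanh "experts", and the choice of k = 2 out of 16 experts are assumptions for readability and do not reflect Kanana-2's actual router or configuration.

```python
import numpy as np

def moe_forward(x, router_w, expert_ws, k=2):
    """Route one token vector `x` to its top-k experts (illustrative sketch)."""
    logits = x @ router_w                    # one routing score per expert
    top = np.argsort(logits)[-k:]            # indices of the k highest scores
    z = logits[top] - logits[top].max()      # shift for a stable softmax
    gates = np.exp(z) / np.exp(z).sum()      # mixing weights over chosen experts
    # Only the selected experts' weights are touched; the rest stay idle.
    out = sum(g * np.tanh(x @ expert_ws[i]) for g, i in zip(gates, top))
    return out, top, gates

rng = np.random.default_rng(0)
d_model, num_experts, k = 8, 16, 2           # toy sizes, not Kanana-2's
x = rng.standard_normal(d_model)
router_w = rng.standard_normal((d_model, num_experts))
expert_ws = rng.standard_normal((num_experts, d_model, d_model))

out, top, gates = moe_forward(x, router_w, expert_ws, k)

# Fraction of expert parameters used for this token.
active_fraction = k / num_experts
```

In this toy setup each token uses 2 of 16 experts, so one eighth of the expert parameters participate in the computation while the full set still has to be stored. The same principle is behind model names such as Kanana-2-30b-a3b, where the "a3b" suffix denotes the roughly 3B parameters active per token.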