Getting a Voice AI Model into Production: The Kanana-O Serving Optimization Journey
Kakao has introduced Kanana-2, a series of language models built on a Mixture of Experts (MoE) architecture to achieve high intelligence at low inference cost. To support stable pre-training of their largest 155B-parameter model, the team adopted an advanced technical stack, including the Muon optimizer and MuonClip, to prevent training instabilities. These choices reflect a strategic focus on balancing large-scale performance with high-efficiency, low-cost engineering.

### MoE Architecture and Scaling Strategy

* Kanana-2 models, such as the 30B version, activate only 3B parameters during inference, maximizing computational efficiency without sacrificing the intelligence of a larger model.
* The team is currently training a 155B-parameter version (Kanana-2-155b-a17b) using FP8 training infrastructure, MuonClip, and Hyperparameter Transfer to ensure stable convergence.
* Custom-developed MoE kernels were integrated to reduce memory usage and increase training speed, yielding a highly stable loss curve even during constant-learning-rate phases.

### A Controlled Testbed for Mid- and Post-Training

* The Kanana-2-30b-a3b-base-2601 model was intentionally released without synthetic reasoning data to serve as a "clean" base for research.
* Because this baseline is unaffected by post-training interventions, it lets researchers investigate phenomena such as "Reasoning Trace Distribution Mismatch" and "Spurious Rewards".
* By offering a high-quality Korean base model, Kakao aims to help the local AI community run more rigorous experiments on mathematical and logical reasoning.

### Optimization with Muon and Polar Express

* Kakao shifted from the industry-standard AdamW optimizer to Muon, which updates parameters by orthogonalizing gradients rather than performing element-wise updates.
* For more accurate orthogonalization, they implemented the Polar Express iterative algorithm in place of the standard Newton-Schulz iteration, reducing noise in weight updates during the later stages of large-scale training.
* The optimization work also involved detailed adjustments to RMSNorm parameterization and learning-rate management so that the model scales effectively.

### Training Stability via MuonClip

* To address potential "logit explosion" in large-scale models, the team used MuonClip, a technique that clips attention logits to maintain stability.
* Because standard Flash Attention keeps max-logit values only on-chip, the team modified the Flash Attention kernels to extract and return these values for monitoring and clipping.
* Stress tests with deliberately high learning rates showed that MuonClip prevents training divergence and preserves performance even when the model is pushed to its limits.

The development of Kanana-2 demonstrates that scaling to hundreds of billions of parameters requires more than data: it demands deep architectural optimization and custom kernel engineering. For organizations training large-scale MoE models, adopting orthogonalization-based optimizers and logit-clipping mechanisms is highly recommended for predictable, stable convergence.
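The orthogonalized update at the heart of Muon can be sketched in a few lines. This is a minimal NumPy sketch using the quintic Newton-Schulz coefficients from the public Muon reference implementation; per the post, Kanana-2 swaps this iteration for Polar Express, which is not reproduced here:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately orthogonalize a gradient matrix, as Muon does before
    applying the update. The coefficients (a, b, c) are the quintic
    iteration from the public Muon reference implementation; Kanana-2
    reportedly replaces this step with the Polar Express iteration."""
    X = G / (np.linalg.norm(G) + eps)    # Frobenius-normalize so singular values are <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                       # iterate on the wide orientation
        X = X.T
    a, b, c = 3.4445, -4.7750, 2.0315
    for _ in range(steps):
        A = X @ X.T
        # Odd matrix polynomial: acts on each singular value independently,
        # pushing all of them toward 1.
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

# A raw gradient has widely varying singular values; the orthogonalized
# update has them squeezed into a narrow band around 1.
G = np.random.default_rng(0).normal(size=(8, 16))
O = newton_schulz_orthogonalize(G)
```

Because the polynomial acts independently on each singular value, this replaces AdamW's element-wise update with a spectrum-equalized one, which is the property the post credits for Muon's behavior at scale.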
Kakao's development of the Kanana-2 model family represents a strategic shift toward Agentic AI, prioritizing complex reasoning and execution capabilities over simple conversational fluency. By implementing a sophisticated post-training pipeline, including a specialized Mid-training stage and refined reinforcement learning, the team successfully enhanced the model's instruction-following and tool-calling performance. This methodology ensures that the 30B-parameter models excel in logical tasks and real-world agentic environments while maintaining high linguistic stability in both English and Korean.

## Mid-training and Catastrophic Forgetting Prevention

* A 250B-token Mid-training stage was introduced between Pre-training and Post-training to bridge the gap in reasoning, coding, and tool-calling capabilities.
* The dataset comprised 200B tokens of high-quality reasoning data (Chain-of-Thought math and code) and 50B tokens of "replay" data drawn from the original pre-training set.
* This replay strategy specifically targeted catastrophic forgetting, preventing the model from losing its Korean linguistic nuances and its performance on benchmarks like KoMT-bench while it gained English-heavy reasoning skills.
* Experimental results indicated that Mid-training serves as a foundational "force multiplier," leading to faster convergence and higher performance ceilings during subsequent Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) stages.

## Enhanced Instruction Following and Tool Calling

* To optimize for Agentic AI, the developers focused on instruction following (IFEval) by synthesizing high-quality, long-form responses that strictly adhere to complex constraints.
* Tool-calling capabilities were improved through "Rejection Sampling" (iterative SFT), in which model-generated trajectories are validated in a real execution environment and only successful outcomes are retained for training.
* The training data was categorized into distinct buckets (Chat, Math, Code, and Tool Calling), allowing a more balanced recipe than previous Kanana versions.
* This approach specifically addressed multi-turn and multi-tool scenarios, ensuring the model can handle the recursive logic required by autonomous agents.

## Parallel Reinforcement Learning and Calibration Tuning

* A "Parallel RL" framework was adopted to optimize different capabilities simultaneously: the "Chat" track focused on helpfulness and safety, while the "Logic" track focused on accuracy in math and programming.
* The pipeline moved beyond standard SFT to include Reinforcement Learning from Human Feedback (RLHF), utilizing DPO- and PPO-style methods to align the model with human preferences.
* A final "Calibration Tuning" step was implemented to ensure the model's internal confidence matches its actual accuracy, reducing hallucinations and improving reliability in technical tasks.
* Comparative benchmarks show that the Kanana-2 Instruct and Thinking models significantly outperform earlier versions and rival larger open-source models on reasoning and coding benchmarks such as HumanEval and GSM8K.

The Kanana-2 development cycle demonstrates that achieving "agentic" performance requires more than scaled data; it requires a structured transition from general language understanding to execution-verified reasoning. For organizations building AI agents, the Kanana-2 post-training recipe suggests that integrating environment-validated feedback and balancing reasoning data with foundational language "replays" is critical for creating reliable, multi-functional models.
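The "Rejection Sampling" (iterative SFT) loop for tool calling described above can be sketched as follows; `generate` and `execute` are hypothetical stand-ins for the model and the tool execution environment, neither of which the post specifies:

```python
def rejection_sample_sft(prompts, generate, execute, n_samples=4):
    """Sketch of execution-verified data collection: sample up to
    n_samples trajectories per prompt, run each in the execution
    environment, and keep only verified successes as new SFT data."""
    kept = []
    for prompt in prompts:
        for _ in range(n_samples):
            trajectory = generate(prompt)
            ok, result = execute(trajectory)
            if ok:  # only execution-verified outcomes enter the dataset
                kept.append({"prompt": prompt,
                             "trajectory": trajectory,
                             "result": result})
                break
    return kept

# Toy stand-ins: the "model" produces one malformed call before valid ones,
# and the "environment" accepts anything that looks like an add() call.
attempts = iter(["oops", "add(1, 2)", "add(3, 4)"])

def toy_generate(prompt):
    return next(attempts)

def toy_execute(trajectory):
    ok = trajectory.startswith("add(")
    return ok, trajectory if ok else None

kept = rejection_sample_sft(["1+2", "3+4"], toy_generate, toy_execute)
# Only the two verified trajectories survive; the malformed one is discarded.
```

The key design point the post highlights is that success is judged by the environment, not the model, so each retained example carries an executable guarantee into the next SFT round.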
Kakao's Kanana-v-4b-hybrid is a multimodal language model designed to transcend simple image-to-text conversion by integrating logical reasoning and self-verification directly into its response process. By employing a hybrid architecture that handles both intuitive dialogue and complex visual reasoning within a single model, it achieves high accuracy and reliability on sophisticated tasks. This approach allows the model to maintain a consistent user experience while excelling in Korean-specific contexts, as evidenced by its record-breaking score of 92.8 on the KoNET evaluation.

### Integrated Hybrid Architecture

* Consolidates intuitive tasks (such as OCR and summarization) and logical tasks (complex reasoning) into a single model, reducing system complexity and maintenance costs.
* Eliminates the need for external routing between specialized models, ensuring a consistent tone, response format, and safety policy throughout a conversation session.
* Uses a refined training recipe that balances data ratios and visual-reasoning training so that improvements in multimodal understanding benefit all types of user queries.

### Visual Reasoning and Self-Reflection

* Follows a natural logical flow: synthesizing information from images and text, applying conditions, verifying candidates, and finally concluding the response.
* Features a "Reflection" mechanism in which the model actively monitors its own thought process to catch "small but fatal" errors, such as calculation mistakes or missed constraints.
* Excels at high-stakes visual tasks like receipt auditing, table filtering, and mathematical problem-solving by double-checking intermediate results against the original image data.

### Native Korean Logical Processing

* Prioritizes "thinking in Korean" to accurately preserve the nuances of complex constraints, such as "except for X" or "only in cases of Y," which are often lost during internal translation.
* Develops a native Korean rationale process to prevent logical drift, keeping the internal reasoning steps aligned with the linguistic structure of the user's query.
* Addresses the difficulty of processing information scattered across Korean-language documents or exam papers by synthesizing data without language-conversion overhead.

Kanana-v-4b-hybrid marks a shift toward "verifiable AI" that provides evidence-based answers rather than merely plausible text. For applications in education, finance, or complex document processing, the model offers a blueprint for building trust through transparent reasoning and self-correction.
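The draft-check-revise flow of the "Reflection" mechanism might look like the following outline. Everything here is an illustrative assumption rather than Kakao's implementation: `draft`, `checks`, and `revise` stand in for internal model steps the post does not detail.

```python
def answer_with_reflection(question, draft, checks, revise, max_rounds=2):
    """Sketch of a reflection loop: draft an answer, re-verify it against
    the source data, and revise whenever any self-check fails."""
    answer = draft(question)
    for _ in range(max_rounds):
        failed = [name for name, check in checks if not check(question, answer)]
        if not failed:
            return answer, True    # all self-checks passed
        answer = revise(question, answer, failed)
    return answer, False           # could not verify within the budget

# Toy receipt-auditing example: the draft total is off by one, the "sum"
# self-check catches it against the line items, and the revision fixes it.
receipt = {"items": [3, 4, 5]}
draft = lambda q: sum(q["items"]) + 1                   # deliberate mistake
checks = [("sum", lambda q, a: a == sum(q["items"]))]   # verify against source data
revise = lambda q, a, failed: sum(q["items"])           # recompute from the items
answer, verified = answer_with_reflection(receipt, draft, checks, revise)
# → answer == 12, verified is True
```

Returning an explicit verified flag mirrors the post's emphasis on evidence-based answers: a caller can surface unverified responses differently instead of presenting every output with equal confidence.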