Kakao / llm

5 posts

kakao

Kanana-2 Development Story (1): Focusing on Decisions Made in Pre-training

Kakao has introduced Kanana-2, a series of language models that use a Mixture of Experts (MoE) architecture to achieve high intelligence while keeping inference costs low. To keep pre-training of the largest model, at 155B parameters, stable, the team adopted a technical stack that includes the Muon optimizer and MuonClip. These developments reflect a strategic focus on balancing large-scale performance with "high-efficiency, low-cost" engineering.

### MoE Architecture and Scaling Strategy

* Kanana-2 models, such as the 32B version, activate only 3B parameters during inference to maximize computational efficiency without sacrificing the intelligence of a larger model.
* The team is currently training a 155B parameter version (Kanana-2-155b-a17b) using FP8 training infrastructure, MuonClip, and Hyperparameter Transfer to ensure stable convergence.
* Custom-developed MoE kernels were integrated to reduce memory usage and increase training speed, and the loss curve remained highly stable even during constant learning rate phases.

### A Controlled Testbed for Mid- and Post-Training

* The Kanana-2-30b-a3b-base-2601 model was intentionally released without synthetic reasoning data to serve as a "clean" base for research.
* This model allows researchers to investigate phenomena like "Reasoning Trace Distribution Mismatch" and "Spurious Rewards" by providing a baseline unaffected by post-training interventions.
* By offering a high-quality Korean base model, Kakao aims to support the local AI community in conducting more rigorous experiments on mathematical and logical reasoning.

### Optimization with Muon and Polar Express

* Kakao shifted from the industry-standard AdamW optimizer to Muon, which updates parameters by orthogonalizing gradients rather than performing element-wise updates.
* To achieve more accurate orthogonalization, they implemented the Polar Express iterative algorithm instead of the standard Newton-Schulz method, aiming to reduce noise in weight updates during the latter stages of large-scale training.
* The optimization process also involved detailed adjustments to RMSNorm parameterization and learning rate (LR) management to ensure the model scales effectively.

### Training Stability via MuonClip

* To address potential "logit explosion" in large-scale models, the team used MuonClip, a technique that clips attention logits to maintain stability.
* Because standard Flash Attention stores Max Logit values only on-chip, the team modified the Flash Attention kernels to extract and return these values for monitoring and clipping purposes.
* Stress tests conducted with high learning rates showed that MuonClip prevents training divergence and maintains performance levels even when the model is pushed to its limits.

The development of Kanana-2 demonstrates that scaling to hundreds of billions of parameters requires more than just data; it also requires deep architectural optimizations and custom kernel engineering. For organizations looking to train large-scale MoE models, orthogonalizing optimizers and logit clipping mechanisms are recommended as a way to keep convergence predictable and stable.
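
As a rough illustration of the gradient orthogonalization that the Muon bullets above describe, the sketch below applies the textbook cubic Newton-Schulz iteration to a toy matrix. It is an assumption-laden stand-in, not Kakao's implementation: Muon implementations typically use a tuned higher-order polynomial, Polar Express uses its own coefficient schedule (not reproduced here), and the matrix size, step count, and function names are all hypothetical.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def newton_schulz_orthogonalize(grad, steps=30):
    """Approximate the orthogonal polar factor of `grad`.

    Uses the classic cubic iteration X <- 1.5*X - 0.5*X*X^T*X, which converges
    when every singular value of the starting matrix lies in (0, sqrt(3)).
    """
    # Dividing by the Frobenius norm bounds every singular value by 1.
    norm = sum(v * v for row in grad for v in row) ** 0.5
    x = [[v / norm for v in row] for row in grad]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)] for r1, r2 in zip(x, xxt_x)]
    return x

random.seed(0)
# Toy 4x6 "gradient"; a real optimizer would use the momentum-averaged gradient.
grad = [[random.gauss(0.0, 1.0) for _ in range(6)] for _ in range(4)]
update = newton_schulz_orthogonalize(grad)

# For an orthogonalized update U with fewer rows than columns, U U^T is close to I.
gram = matmul(update, transpose(update))
max_error = max(
    abs(gram[i][j] - (1.0 if i == j else 0.0)) for i in range(4) for j in range(4)
)
print(f"max |U U^T - I| = {max_error:.2e}")
```

The point of the exercise is that every singular value of the update is pushed toward 1, so no single direction dominates the step, in contrast to an element-wise rule such as AdamW.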

kakao

Kanana-2 Development Story (2): Focusing on the Improved Post-training Recipe

Kakao's development of the Kanana-2 model family represents a strategic shift toward Agentic AI, prioritizing complex reasoning and execution capabilities over simple conversational fluency. By implementing a post-training pipeline that includes a specialized Mid-training stage and refined reinforcement learning, the team enhanced the model's instruction-following and tool-calling performance. This methodology is intended to make the 30B parameter models strong at logical tasks and real-world agentic environments while maintaining high linguistic stability in both English and Korean.

## Mid-training and Catastrophic Forgetting Prevention

* A 250B token Mid-training stage was introduced between Pre-training and Post-training to bridge the gap in reasoning, coding, and tool-calling capabilities.
* The dataset comprised 200B tokens of high-quality reasoning data (Chain-of-Thought math and code) and 50B tokens of "replay" data from the original pre-training set.
* This replay strategy specifically targeted "Catastrophic Forgetting," preventing the model from losing its Korean linguistic nuances and performance on benchmarks like KoMT-bench while it gained English-heavy reasoning skills.
* Experimental results indicated that Mid-training serves as a foundational "force multiplier," leading to faster convergence and higher performance ceilings during subsequent Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) stages.

## Enhanced Instruction Following and Tool Calling

* To optimize for Agentic AI, the developers focused on Instruction Following (IFEval) by synthesizing high-quality, long-form responses that strictly adhere to complex constraints.
* Tool-calling capabilities were improved using "Rejection Sampling" (Iterative SFT), where model-generated trajectories are validated in a real execution environment; only successful outcomes are retained for training.
* The training data was categorized into distinct buckets (such as Chat, Math, Code, and Tool Calling), allowing for a more balanced recipe compared to previous Kanana versions.
* This approach specifically addressed multi-turn and multi-tool scenarios, ensuring the model can handle the recursive logic required for autonomous agents.

## Parallel Reinforcement Learning and Calibration Tuning

* A "Parallel RL" framework was adopted to optimize different capabilities simultaneously: the "Chat" track focused on helpfulness and safety, while the "Logic" track focused on accuracy in math and programming.
* The pipeline moved beyond standard SFT to include Reinforcement Learning from Human Feedback (RLHF), using DPO and PPO-style methods to align the model with human preferences.
* A final "Calibration Tuning" step was implemented so that the model's internal confidence levels match its actual accuracy, reducing hallucinations and improving reliability in technical tasks.
* Comparative benchmarks show that the Kanana-2 Instruct and Thinking models significantly outperform earlier versions and rival larger open-source models in reasoning and coding benchmarks like HumanEval and GSM8K.

The Kanana-2 development cycle demonstrates that achieving "Agentic" performance requires more than just scaling data; it requires a structured transition from general language understanding to execution-verified reasoning. For organizations building AI agents, the Kanana-2 post-training recipe suggests that integrating environment-validated feedback and balancing reasoning data with foundational language "replays" is critical for creating reliable, multi-functional models.
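
The execution-validated rejection sampling described in the tool-calling section above can be sketched as follows. Everything here is hypothetical and chosen only for illustration: the calculator "tool", the random stand-in for the model, and the task format are not taken from Kakao's pipeline, which validates trajectories in a real execution environment.

```python
import random

def run_tool_call(call):
    """Hypothetical execution environment: a tiny calculator with one tool."""
    if call["tool"] != "add":
        raise ValueError(f"unknown tool: {call['tool']}")
    a, b = call["args"]
    return a + b

def sample_trajectory(task, rng):
    """Stand-in for the model: sometimes picks a wrong tool or wrong arguments."""
    tool = rng.choice(["add", "add", "multiply"])
    args = list(task["args"]) if rng.random() < 0.7 else [task["args"][0], 0]
    return {"tool": tool, "args": args}

def rejection_sample(tasks, samples_per_task, rng):
    """Keep only trajectories whose executed result matches the expected answer."""
    kept = []
    for task in tasks:
        for _ in range(samples_per_task):
            call = sample_trajectory(task, rng)
            try:
                result = run_tool_call(call)
            except ValueError:
                continue  # the call failed to execute, so it is rejected
            if result == task["expected"]:
                kept.append({"prompt": task["prompt"], "call": call})
    return kept

rng = random.Random(0)
tasks = [
    {"prompt": "What is 2 + 3?", "args": (2, 3), "expected": 5},
    {"prompt": "What is 10 + 7?", "args": (10, 7), "expected": 17},
]
sft_data = rejection_sample(tasks, samples_per_task=8, rng=rng)
print(f"kept {len(sft_data)} of {len(tasks) * 8} sampled trajectories")
```

In an iterative SFT loop, the retained examples would be used to fine-tune the model, and the fine-tuned model would then generate the next round of candidates.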

kakao

Open-Sourcing Kanana-2: Smarter and More Efficient

Kakao has released Kanana-2, a high-performance open-source language model specifically engineered to power Agentic AI by enhancing tool-calling and instruction-following capabilities. Surpassing its predecessors and rivaling global frontier models like Qwen3, Kanana-2 offers a versatile suite of variants designed for practical, high-efficiency application in complex service environments.

### Optimized Model Lineup: Base, Instruct, and Thinking

* **Kanana-2-30b-a3b-base:** Provided as a foundational model with pre-training weights, allowing researchers to fine-tune the model using their own datasets.
* **Kanana-2-30b-a3b-instruct:** A version optimized through post-training to maximize the model's ability to follow complex user instructions accurately.
* **Kanana-2-30b-a3b-thinking:** Kakao's first reasoning-specialized model, designed for tasks requiring high-level logical thinking, such as mathematics and coding.

### Strengthening Agentic AI Capabilities

* **Tool Calling:** Multi-turn tool-calling performance has improved more than threefold compared to Kanana-1.5, significantly enhancing its utility with the Model Context Protocol (MCP).
* **Instruction Following:** The model's ability to understand and execute multi-step, complex user requirements has been refined to ensure reliable task completion.
* **Reasoning-Tool Integration:** Unlike many reasoning models that lose instruction-following quality during deep thought, the "Thinking" variant maintains high performance in both logical deduction and tool use.

### High-Efficiency Architecture for Scale

* **MLA (Multi-head Latent Attention):** Compresses memory usage to handle long contexts more efficiently, reducing the resources needed for extensive data processing.
* **MoE (Mixture of Experts):** Activates only the necessary parameters during inference, maintaining high performance while drastically reducing computational costs and response times.
* **Improved Tokenization:** A newly trained tokenizer has improved Korean language token efficiency by 30%, enabling faster throughput and lower latency in high-traffic environments like KakaoTalk.

### Expanded Multilingual Support

* **Broad Linguistic Reach:** The model has expanded its support from just Korean and English to include six languages: Korean, English, Japanese, Chinese, Thai, and Vietnamese.

By open-sourcing Kanana-2, Kakao provides a robust foundation for developers seeking to build responsive, tool-integrated AI services. Its focus on practical efficiency and advanced reasoning makes it an ideal choice for implementing agentic workflows in real-world applications where speed and accuracy are critical.
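
To make the MoE point above concrete, here is a minimal sketch of top-k expert routing for a single token. The summary does not describe Kanana-2's actual router, so the number of experts, the value of k, the router scores, and the function names are all assumptions; the scalar "experts" merely stand in for feed-forward blocks.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Return {expert_index: weight} for the k highest-scoring experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalise over the selected experts
    return {i: probs[i] / norm for i in top}

def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the selected experts' outputs; the other experts never run."""
    weights = route_top_k(router_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())

# Eight toy "experts", each a scalar function standing in for a feed-forward block.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
# Made-up router scores for one token (hypothetical values).
router_logits = [0.1, 2.0, -1.0, 0.3, 1.5, -0.5, 0.0, 0.2]

weights = route_top_k(router_logits, k=2)
output = moe_forward(3.0, experts, router_logits, k=2)
print(f"active experts: {sorted(weights)} of {len(experts)}, output = {output:.3f}")
```

Only two of the eight experts are evaluated for this token, which is the mechanism behind a model holding 30B parameters in total while activating roughly 3B per token.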

kakao

Korean and Images at Once: Developing Kakao's Multimodal Embedding Model

Kakao has developed Kanana-v-embedding, a specialized multimodal embedding model designed to bridge the gap between Korean text and visual data within a unified semantic space. Built on a Vision-Language Model (VLM) framework, the model enables search and recommendation across various combinations of text and images, offering a significant performance boost over existing English-centric models like CLIP. This development provides a technical foundation for enhancing Kakao's services, including RAG-based systems and localized content discovery.

### Unified Multimodal Meaning Space

* The model maps text and images into a single vector space where semantic similarity is measured via cosine similarity.
* Unlike traditional CLIP models that use independent encoders, this architecture treats text and images as a single sequence, allowing for "text + image" combined queries.
* It supports four primary interaction modes: Text-to-Text, Text-to-Image, Image-to-Image, and (Text+Image)-to-(Text+Image).

### VLM-Based Architecture and Instruction Tuning

* The system uses a VLM consisting of an LLM and an image encoder, extracting embeddings from the final hidden state of the [EOS] token.
* It employs instruction-based query embedding, where specific prompts (e.g., "Find an image matching this caption") guide the model to generate embeddings tailored to the specific task, such as retrieval or classification.
* The model is optimized for the Korean language and cultural context, addressing the limitations of previous models that struggled with non-English data.

### Advanced Training for Scalability and Precision

* **Gradient Caching:** To overcome GPU memory limitations, this technique allows the model to train with effectively large batch sizes, which is critical for the InfoNCE loss used in contrastive learning.
* **Matryoshka Representation Learning (MRL):** The model supports flexible embedding sizes ranging from 64 to 2,048 dimensions. This allows services to choose between low latency (smaller dimensions) and high precision (larger dimensions) without retraining.
* **Hard Negative Mining:** The training process incorporates "hard negatives" (items that are similar but incorrect) to sharpen the model's ability to distinguish between subtle differences in data.

### Performance Benchmarks and Efficiency

* Kanana-v-embedding significantly outperforms CLIP and VLM2Vec on the KoEmbed benchmark, particularly in Korean Text-to-Image and Image-to-Text retrieval tasks.
* On M-BEIR (Multimodal Benchmark for Instructed Retrieval), the model demonstrated superior performance in multimodal document retrieval and image-to-text tasks compared to established open-source models.
* Evaluation of MRL showed that the model retains high accuracy even when dimensions are reduced to 256 or 512, providing a 4x to 8x improvement in storage and search efficiency with minimal loss in quality.

For organizations looking to implement multimodal RAG or advanced recommendation systems in Korean-language environments, Kanana-v-embedding offers a highly adaptable solution. Its ability to balance computational cost and retrieval quality through Matryoshka learning makes it particularly suitable for large-scale production environments where latency is a primary concern.
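
The Matryoshka trade-off described above comes down to keeping a prefix of each embedding and re-normalizing it before computing cosine similarity. The sketch below shows that mechanic and the storage arithmetic behind the 4x and 8x figures. The vectors are random stand-ins rather than outputs of Kanana-v-embedding, and the helper names are hypothetical; truncation preserves retrieval quality only for models actually trained with MRL.

```python
import math
import random

def l2_normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def truncate_embedding(vec, dim):
    """Keep the first `dim` coordinates and re-normalise to unit length."""
    return l2_normalize(vec[:dim])

def cosine(a, b):
    # Both inputs are unit vectors, so the dot product equals the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

FULL_DIM = 2048
rng = random.Random(0)
# Random stand-ins for a query and a related document embedding.
raw_query = [rng.gauss(0.0, 1.0) for _ in range(FULL_DIM)]
raw_doc = [q + 0.5 * rng.gauss(0.0, 1.0) for q in raw_query]
query, doc = l2_normalize(raw_query), l2_normalize(raw_doc)

for dim in (2048, 512, 256, 64):
    sim = cosine(truncate_embedding(query, dim), truncate_embedding(doc, dim))
    saving = FULL_DIM // dim  # 2048 / 512 = 4, 2048 / 256 = 8
    print(f"dim={dim:4d}  storage saving={saving}x  cosine={sim:.3f}")
```

Because the index stores one float per dimension, shrinking 2,048 dimensions to 512 or 256 cuts both storage and per-comparison work by the same factor of 4 or 8.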

kakao

[AI_TOP_100] Reflections on Writing the Problems: Asking About People, Not Technology

The AI TOP 100 contest was designed to shift the focus from evaluating AI model performance to measuring human proficiency in solving real-world problems through AI collaboration. By prioritizing the "problem-solving process" over mere final output, the organizers sought to identify individuals who can define clear goals and navigate the technical limitations of current AI tools. The organizers conclude that true AI literacy is defined by the ability to maintain a "human-in-the-loop" workflow where human intuition guides AI execution and verification.

### Core Philosophy of Human-AI Collaboration

* **Human-in-the-Loop:** The contest emphasizes a cycle of human analysis, AI problem-solving, and human verification. This ensures that the human remains the "pilot" who directs the AI engine and takes responsibility for the quality of the result.
* **Strategic Intervention:** Participants were encouraged to provide AI with structural context it might struggle to perceive (like complex table relationships) and to perform data pre-processing to improve AI accuracy.
* **Task Delegation:** For complex iterative tasks, such as generating images for a montage, solvers were expected to build automated pipelines using AI agents to handle repetitive feedback loops while focusing human effort on higher-level strategy.

### Designing Against "One-Shot" Solutions

* **Low Barrier, High Ceiling:** Problems were designed to be intuitive enough for anyone to understand but complex enough to prevent "one-shot" solutions (the "click-and-solve" trap).
* **Targeting Technical Weaknesses:** Organizers intentionally embedded technical hurdles that current LLMs struggle with, forcing participants to demonstrate how they bridge the gap between AI limitations and a correct answer.
* **The Difficulty Ladder:** To account for varying domain expertise (e.g., OCR experience), problems used a multi-part structure. This included "Easy" starting questions to build momentum and "Medium" hint questions that guided participants toward solving the more difficult "Killer" components.

### The 4-Pattern Problem Framework

* **P1 - Insight (Analysis & Definition):** Identifying meaningful opportunities or problems within complex, unstructured data.
* **P2 - Action (Implementation & Automation):** Developing functional code or workflows to execute a defined solution.
* **P3 - Persuasion (Strategy & Creativity):** Generating logical and creative content to communicate technical solutions to non-technical stakeholders.
* **P4 - Decision (Optimization):** Making optimal choices and running simulations to maximize goals under specific constraints.

### Quality Assurance and Score Calibration

* **4-Stage Pipeline:** Problems moved from Ideation to Drafting (testing for one-shot immunity), then to Candidate (analyzing abuse vulnerabilities), and finally to a Final selection based on difficulty balance.
* **Cross-Model Validation:** Internal and alpha testers solved problems using various models including Claude, GPT, and Gemini to ensure that no single tool could bypass the intended human-led process.
* **Effort-Based Scoring:** Instead of uniform points, scores were calibrated based on the "effort cost" and human competency required to solve them. This resulted in varying total points per problem to better reflect the true difficulty of the task.

In the era of rapidly evolving AI, the ability to "use" a tool is becoming less valuable than the ability to "collaborate" with it. This shift requires a move toward building automated pipelines and using a "difficulty ladder" approach to tackle complex, multi-stage problems that AI cannot yet solve in a single iteration.