Kanana-2 Development Story (2)
Kakao’s development of the Kanana-2 model family represents a strategic shift toward Agentic AI, prioritizing complex reasoning and execution capabilities over simple conversational fluency. By implementing a sophisticated post-training pipeline, including a specialized Mid-training stage and refined reinforcement learning, the team enhanced the models’ instruction-following and tool-calling performance. This methodology ensures that the 30B-parameter models excel in logical tasks and real-world agentic environments while maintaining high linguistic stability in both English and Korean.
Mid-training and Catastrophic Forgetting Prevention
- A 250B-token Mid-training stage was introduced between Pre-training and Post-training to bridge the gap in reasoning, coding, and tool-calling capabilities.
- The dataset comprised 200B tokens of high-quality reasoning data (Chain-of-Thought math and code) and 50B tokens of "replay" data drawn from the original pre-training set; the resulting sampling ratio is sketched in the code after this list.
- This replay strategy specifically targeted "Catastrophic Forgetting," preventing the model from losing its Korean linguistic nuances and performance on benchmarks like KoMT-bench while it gained English-heavy reasoning skills.
- Experimental results indicated that Mid-training serves as a foundational "force multiplier," leading to faster convergence and higher performance ceilings during subsequent Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) stages.
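To make the replay recipe above concrete, here is a minimal Python sketch of token-weighted interleaving at the stated 200B:50B (i.e., 0.8:0.2) ratio. The stream names and the `interleave_batches` helper are illustrative assumptions, not Kakao's actual data loader; the point is only that replay draws keep re-exposing the model to its original distribution.

```python
import random

# Hypothetical token-weighted mixture for the Mid-training stage:
# 200B reasoning tokens vs. 50B replay tokens => 0.8 / 0.2 sampling weights.
MIXTURE_WEIGHTS = {"reasoning": 0.8, "replay": 0.2}

def interleave_batches(reasoning_stream, replay_stream, num_batches, seed=0):
    """Yield batches drawn from each source in proportion to its token share.

    Replay draws re-expose the model to original pre-training data (e.g.,
    Korean web text), mitigating catastrophic forgetting while the reasoning
    draws add CoT math/code capability.
    """
    rng = random.Random(seed)
    sources = {"reasoning": reasoning_stream, "replay": replay_stream}
    names, weights = zip(*MIXTURE_WEIGHTS.items())
    for _ in range(num_batches):
        choice = rng.choices(names, weights=weights, k=1)[0]
        yield choice, next(sources[choice])

if __name__ == "__main__":
    reasoning = iter(f"reasoning_batch_{i}" for i in range(1000))
    replay = iter(f"replay_batch_{i}" for i in range(1000))
    for source, batch in interleave_batches(reasoning, replay, num_batches=5):
        print(source, batch)
```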
Enhanced Instruction Following and Tool Calling
- To optimize for Agentic AI, the developers focused on Instruction Following (IFEval) by synthesizing high-quality, long-form responses that strictly adhere to complex constraints.
- Tool-calling capabilities were improved using "Rejection Sampling" (iterative SFT), where model-generated trajectories are validated in a real execution environment and only successful outcomes are retained for training (see the sketch after this list).
- The training data was categorized into distinct buckets—such as Chat, Math, Code, and Tool Calling—allowing for a more balanced recipe compared to previous Kanana versions.
- This approach specifically addressed multi-turn and multi-tool scenarios, ensuring the model can handle the recursive logic required for autonomous agents.
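The rejection-sampling bullet above describes a generate-execute-filter loop. The following sketch shows one minimal way such a loop could look; the `generate`, `execute`, and `is_success` callables are hypothetical stand-ins for the model, the real execution environment, and the outcome check, since the post does not document these interfaces.

```python
from typing import Any, Callable, Dict, Iterable, List

def rejection_sample_sft(
    generate: Callable[[str], Any],          # samples one tool-calling trajectory
    execute: Callable[[Any], Any],           # replays the calls in a real environment
    is_success: Callable[[str, Any], bool],  # outcome-level validation per prompt
    prompts: Iterable[str],
    num_samples: int = 8,
) -> List[Dict[str, Any]]:
    """Iterative-SFT data collection via rejection sampling.

    Sample several candidate trajectories per prompt, replay each in the
    execution environment, and keep only trajectories whose final outcome
    the environment validates; the accepted pairs seed the next SFT round.
    """
    accepted = []
    for prompt in prompts:
        for _ in range(num_samples):
            trajectory = generate(prompt)
            outcome = execute(trajectory)
            if is_success(prompt, outcome):
                accepted.append({"prompt": prompt, "target": trajectory})
                break  # one verified trajectory per prompt in this sketch
    return accepted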
Parallel Reinforcement Learning and Calibration Tuning
- A "Parallel RL" framework was adopted to optimize different capabilities simultaneously: the "Chat" track focused on helpfulness and safety, while the "Logic" track focused on accuracy in math and programming.
- The pipeline moved beyond standard SFT to include Reinforcement Learning from Human Feedback (RLHF), utilizing DPO- and PPO-style methods to align the model with human preferences; the standard DPO objective is sketched after this list.
- A final "Calibration Tuning" step was implemented to ensure the model’s internal confidence levels match its actual accuracy, effectively reducing hallucinations and improving reliability in technical tasks.
- Comparative benchmarks show that the Kanana-2 Instruct and Thinking models significantly outperform earlier Kanana versions and rival larger open-source models on coding and math benchmarks such as HumanEval and GSM8K.
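The post names DPO among its preference-optimization methods; as a reference point, here is the standard DPO objective (Rafailov et al., 2023) in PyTorch. The function signature is illustrative, and Kakao's exact configuration (e.g., the `beta` value) is not disclosed.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective: push the policy to prefer the chosen response.

    Each argument is the summed log-probability of a full response under the
    policy or the frozen reference model; beta scales the implicit reward.
    """
    chosen_reward = policy_chosen_logps - ref_chosen_logps
    rejected_reward = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_reward - rejected_reward)).mean()

# Toy usage with dummy log-probs for a batch of two preference pairs.
pol_c = torch.tensor([-12.0, -9.5]); pol_r = torch.tensor([-14.0, -9.0])
ref_c = torch.tensor([-12.5, -9.8]); ref_r = torch.tensor([-13.0, -9.1])
print(dpo_loss(pol_c, pol_r, ref_c, ref_r))
```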
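The post does not specify how calibration is evaluated, but Expected Calibration Error (ECE) is the standard metric for the confidence-accuracy gap that Calibration Tuning targets. This sketch computes ECE with equal-width bins; a well-calibrated model scores close to zero.

```python
import numpy as np

def expected_calibration_error(confidences, correct, num_bins=10):
    """ECE: the bin-weighted gap between mean confidence and accuracy.

    confidences: per-answer confidence scores in [0, 1].
    correct: 1.0 if the corresponding answer was right, else 0.0.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, num_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by the bin's share of samples
    return ece

# Toy usage: an overconfident model has a visibly nonzero ECE.
print(expected_calibration_error([0.9, 0.8, 0.95, 0.7], [1, 0, 1, 0]))
```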
The Kanana-2 development cycle demonstrates that achieving "Agentic" performance requires more than just scaling data; it requires a structured transition from general language understanding to execution-verified reasoning. For organizations building AI agents, the Kanana-2 post-training recipe suggests that integrating environment-validated feedback and balancing reasoning data with foundational language "replays" is critical for creating reliable, multi-functional models.