Kakao’s development of the Kanana-2 model family represents a strategic shift toward Agentic AI, prioritizing complex reasoning and execution capabilities over simple conversational fluency. By implementing a sophisticated post-training pipeline—including a specialized Mid-training stage and refined reinforcement learning—the team successfully enhanced the models' instruction-following and tool-calling performance. This methodology ensures that the 30B-parameter models excel in logical tasks and real-world agentic environments while maintaining high linguistic stability in both English and Korean.
## Mid-training and Catastrophic Forgetting Prevention
* A 250B token Mid-training stage was introduced between Pre-training and Post-training to bridge the gap in reasoning, coding, and tool-calling capabilities.
* The dataset comprised 200B tokens of high-quality reasoning data (Chain-of-Thought math and code) and 50B tokens of "replay" data from the original pre-training set.
* This replay strategy specifically targeted "Catastrophic Forgetting," preventing the model from losing its Korean linguistic nuances and performance on benchmarks like KoMT-bench while it gained English-heavy reasoning skills.
* Experimental results indicated that Mid-training serves as a foundational "force multiplier," leading to faster convergence and higher performance ceilings during subsequent Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) stages.
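The replay recipe above can be sketched as a simple sampling mixture. The 200B/50B token split comes from the post; the corpus names and the `sampling_weights` helper are illustrative stand-ins, not Kakao's actual data pipeline.

```python
import random

# Token budgets for the mid-training mixture described above
# (200B CoT reasoning tokens + 50B pre-training "replay" tokens).
MIXTURE_TOKENS = {
    "reasoning_cot": 200e9,   # Chain-of-Thought math and code
    "pretrain_replay": 50e9,  # replay against catastrophic forgetting
}

def sampling_weights(mixture):
    """Turn token budgets into per-example sampling probabilities."""
    total = sum(mixture.values())
    return {name: budget / total for name, budget in mixture.items()}

def sample_source(weights, rng=random):
    """Pick which corpus the next training example is drawn from."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

weights = sampling_weights(MIXTURE_TOKENS)
print(weights)  # {'reasoning_cot': 0.8, 'pretrain_replay': 0.2}
```

Sampling per-example from fixed weights (rather than concatenating the corpora) keeps the replay proportion stable throughout the stage, so the forgetting-prevention signal never drops out of late batches.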
## Enhanced Instruction Following and Tool Calling
* To optimize for Agentic AI, the developers focused on Instruction Following (IFEval) by synthesizing high-quality, long-form responses that strictly adhere to complex constraints.
* Tool-calling capabilities were improved using "Rejection Sampling" (Iterative SFT), where model-generated trajectories are validated in a real execution environment; only successful outcomes are retained for training.
* The training data was categorized into distinct buckets—such as Chat, Math, Code, and Tool Calling—allowing for a more balanced recipe compared to previous Kanana versions.
* This approach specifically addressed multi-turn and multi-tool scenarios, ensuring the model can handle the recursive logic required for autonomous agents.
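The rejection-sampling loop can be sketched as below. `DummyModel` and `DummyEnv` are hypothetical stand-ins for the policy model and the real execution environment; Kakao's actual interfaces are not described in the post.

```python
import itertools

class DummyModel:
    """Hypothetical stand-in for the policy model; cycles through tool
    calls deterministically so the example is reproducible."""
    def __init__(self):
        self._tools = itertools.cycle(range(4))

    def generate(self, prompt):
        return f"{prompt} -> call(tool_{next(self._tools)})"

class DummyEnv:
    """Hypothetical execution environment: only tool_0 calls succeed."""
    def run(self, trajectory):
        return trajectory.endswith("tool_0)")

def collect_sft_data(model, env, prompts, samples_per_prompt=4):
    """Rejection sampling: sample trajectories, execute them, and keep
    only the ones that actually succeed in the environment."""
    accepted = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            trajectory = model.generate(prompt)
            if env.run(trajectory):  # validate by real execution
                accepted.append({"prompt": prompt, "completion": trajectory})
    return accepted

data = collect_sft_data(DummyModel(), DummyEnv(), ["lookup weather", "send email"])
print(len(data))  # 2: one verified trajectory per prompt
```

In the iterative-SFT framing, each round fine-tunes on the accepted set and then regenerates, so the acceptance rate (and data quality) rises round over round.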
## Parallel Reinforcement Learning and Calibration Tuning
* A "Parallel RL" framework was adopted to optimize different capabilities simultaneously: the "Chat" track focused on helpfulness and safety, while the "Logic" track focused on accuracy in math and programming.
* Beyond standard SFT, the pipeline incorporated preference optimization, using DPO and PPO-style RLHF methods to align the model with human preferences.
* A final "Calibration Tuning" step was implemented to ensure the model’s internal confidence levels match its actual accuracy, effectively reducing hallucinations and improving reliability in technical tasks.
* Comparative benchmarks show that the Kanana-2 Instruct and Thinking models significantly outperform earlier versions and rival larger open-source models on coding and math benchmarks such as HumanEval and GSM8K.
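Calibration tuning presupposes a way to measure the gap between confidence and accuracy. Expected calibration error (ECE) is the standard metric for this; the post does not specify Kakao's exact measure, so the sketch below is illustrative rather than their implementation.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, then average the gap between
    each bin's accuracy and its mean confidence, weighted by bin size.
    A model whose confidence matches its accuracy has ECE near 0."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# Overconfident model: reports 90% confidence but is right only 60% of the time.
print(round(expected_calibration_error([0.9] * 10, [1]*6 + [0]*4), 3))  # 0.3
```

A calibration-tuning step would then train the model so that this gap shrinks, which is what makes its stated confidence a usable hallucination signal in technical tasks.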
The Kanana-2 development cycle demonstrates that achieving "Agentic" performance requires more than just scaling data; it requires a structured transition from general language understanding to execution-verified reasoning. For organizations building AI agents, the Kanana-2 post-training recipe suggests that integrating environment-validated feedback and balancing reasoning data with foundational language "replays" is critical for creating reliable, multi-functional models.