supervised-fine-tuning

2 posts

kakao

Kanana-2 Development Story (2)

Kakao’s development of the Kanana-2 model family represents a strategic shift toward Agentic AI, prioritizing complex reasoning and execution capabilities over simple conversational fluency. By implementing a sophisticated post-training pipeline, including a specialized Mid-training stage and refined reinforcement learning, the team enhanced the models' instruction-following and tool-calling performance. This methodology ensures that the 30B-parameter models excel in logical tasks and real-world agentic environments while maintaining high linguistic stability in both English and Korean.

## Mid-training and Catastrophic Forgetting Prevention

* A 250B-token Mid-training stage was introduced between Pre-training and Post-training to bridge the gap in reasoning, coding, and tool-calling capabilities.
* The dataset comprised 200B tokens of high-quality reasoning data (Chain-of-Thought math and code) and 50B tokens of "replay" data from the original pre-training set.
* This replay strategy specifically targeted catastrophic forgetting, preventing the model from losing its Korean linguistic nuances and performance on benchmarks like KoMT-bench while it gained English-heavy reasoning skills.
* Experimental results indicated that Mid-training serves as a foundational "force multiplier," leading to faster convergence and higher performance ceilings during subsequent Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) stages.

## Enhanced Instruction Following and Tool Calling

* To optimize for Agentic AI, the developers focused on Instruction Following (IFEval) by synthesizing high-quality, long-form responses that strictly adhere to complex constraints.
* Tool-calling capabilities were improved using "Rejection Sampling" (Iterative SFT), where model-generated trajectories are validated in a real execution environment and only successful outcomes are retained for training (a minimal sketch of this loop follows this summary).
* The training data was categorized into distinct buckets, such as Chat, Math, Code, and Tool Calling, allowing for a more balanced recipe compared to previous Kanana versions.
* This approach specifically addressed multi-turn and multi-tool scenarios, ensuring the model can handle the recursive logic required for autonomous agents.

## Parallel Reinforcement Learning and Calibration Tuning

* A "Parallel RL" framework was adopted to optimize different capabilities simultaneously: the "Chat" track focused on helpfulness and safety, while the "Logic" track focused on accuracy in math and programming.
* The pipeline moved beyond standard SFT to include Reinforcement Learning from Human Feedback (RLHF), utilizing DPO and PPO-style methods to align the model with human preferences.
* A final "Calibration Tuning" step was implemented to ensure the model’s internal confidence levels match its actual accuracy, effectively reducing hallucinations and improving reliability in technical tasks.
* Comparative benchmarks show that the Kanana-2 Instruct and Thinking models significantly outperform earlier versions and rival larger open-source models in reasoning and coding benchmarks like HumanEval and GSM8K.

The Kanana-2 development cycle demonstrates that achieving "Agentic" performance requires more than just scaling data; it requires a structured transition from general language understanding to execution-verified reasoning.
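To make the rejection-sampling loop above concrete, here is a minimal sketch under assumed interfaces; `generate_trajectories`, `run_in_sandbox`, `passes_checks`, and `finetune` are hypothetical stand-ins, not Kakao's actual pipeline or APIs:

```python
# Minimal sketch of rejection sampling ("iterative SFT") for tool calling.
# generate_trajectories, run_in_sandbox, passes_checks, and finetune are
# hypothetical placeholders for a model sampler, a tool-execution environment,
# an outcome validator, and a standard SFT step.

def collect_verified_trajectories(model, tasks, n_samples=8):
    """Keep only tool-call trajectories whose executed result passes validation."""
    accepted = []
    for task in tasks:
        # Sample several candidate trajectories (reasoning + tool calls) per task.
        for trajectory in generate_trajectories(model, task.prompt, n=n_samples):
            # Actually execute the tool calls in a real (sandboxed) environment.
            result = run_in_sandbox(trajectory.tool_calls)
            # Retain the trajectory only if the final outcome checks out.
            if passes_checks(result, task.expected_outcome):
                accepted.append({"prompt": task.prompt, "response": trajectory.text})
                break  # one verified trajectory per task is enough here
    return accepted

def iterative_sft(model, tasks, rounds=3):
    """Alternate between collecting execution-verified data and fine-tuning on it."""
    for _ in range(rounds):
        data = collect_verified_trajectories(model, tasks)
        model = finetune(model, data)  # ordinary SFT on the accepted trajectories
    return model
```

The key design point is that acceptance is decided by executing the tool calls rather than by judging the text, so only execution-verified trajectories feed the next SFT round.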
For organizations building AI agents, the Kanana-2 post-training recipe suggests that integrating environment-validated feedback and balancing reasoning data with foundational language "replays" is critical for creating reliable, multi-functional models.
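As a rough illustration of that reasoning-versus-replay balance (200B reasoning tokens to 50B replay tokens, roughly 4:1), a hypothetical mid-training data mixer might look like the sketch below; the corpus iterators are placeholders, not Kakao's data loaders:

```python
# Hypothetical sketch of a mid-training mixture that interleaves new reasoning
# data with "replay" samples from the original pre-training corpus to limit
# catastrophic forgetting. The 0.8 / 0.2 split mirrors the reported
# 200B : 50B token budget.
import random

MIX_WEIGHTS = {"reasoning": 0.8, "replay": 0.2}

def mid_training_stream(reasoning_corpus, pretrain_corpus, seed=0):
    """Yield training examples drawn roughly 80/20 from reasoning vs. replay data."""
    rng = random.Random(seed)
    sources = {"reasoning": reasoning_corpus, "replay": pretrain_corpus}
    while True:
        name = rng.choices(list(MIX_WEIGHTS), weights=list(MIX_WEIGHTS.values()))[0]
        yield next(sources[name])
```

A production pipeline would budget at the token level rather than per example, but the point stands: the replay stream never disappears during mid-training.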

netflix

Post-Training Generative Recommenders with Advantage-Weighted Supervised Finetuning | by Netflix Technology Blog | Netflix TechBlog

Netflix is evolving its recommendation systems by moving beyond simple behavior imitation toward generative recommenders that better align with true user preferences. While generative models like HSTU and OneRec effectively capture sequential user patterns, they often struggle to distinguish between habitual clicks and genuine satisfaction. To bridge this gap, Netflix developed Advantage-Weighted Supervised Fine-tuning (A-SFT), a post-training method that leverages noisy reward signals to refine model performance without the need for complex counterfactual data.

### The Shift to Generative Recommenders

* Modern generative recommenders (GRs), such as HSTU and OneRec, use transformer architectures to treat recommendation as a sequential transduction task.
* The models are typically trained using next-item prediction, where the system learns to imitate the chronological sequence of a user’s activities.
* A significant drawback of this "behavior cloning" approach is that it captures external trends and noise rather than long-term user satisfaction, potentially recommending content the user finished but did not actually enjoy.

### Barriers to Reinforcement Learning in RecSys

* Traditional post-training methods used for Large Language Models, such as Proximal Policy Optimization (PPO) or Direct Preference Optimization (DPO), require counterfactual feedback that is difficult to obtain in recommendation contexts.
* Because user sequences span weeks or years, it is impractical to generate hypothetical, counterfactual experiences and validate them with users in real time.
* Reward signals in recommendation systems are inherently noisy; for instance, high watch time might indicate interest, but it can also result from external circumstances, making it an unreliable metric for optimization.

### Advantage-Weighted Supervised Fine-tuning (A-SFT)

* A-SFT is a hybrid approach that sits between offline reinforcement learning and standard supervised fine-tuning.
* The algorithm incorporates an advantage function to weight training examples, allowing the model to prioritize actions that lead to higher rewards while filtering out noise from the reward model (a loss-function sketch follows this summary).
* The method is specifically designed to handle high-variance reward signals, using them as directional guides rather than absolute truth, which prevents the model from over-exploiting inaccurate data.
* Benchmarks against other representative methods show that A-SFT achieves superior alignment between the generative recommendation policy and the underlying reward model.

For organizations managing large-scale recommendation engines, A-SFT offers a practical path to post-training improvements. By focusing on advantage-weighted signals, developers can improve recommendation quality using existing implicit feedback, such as watch time and clicks, without the infrastructure hurdles of online reinforcement learning.
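To illustrate what advantage weighting can look like in practice, here is a hedged PyTorch sketch of an advantage-weighted next-item loss. The exponentiated-and-clipped weighting is borrowed from the common advantage-weighted-regression formulation and is an assumption; Netflix's exact A-SFT objective may differ.

```python
# Hedged sketch of an advantage-weighted SFT loss for a generative recommender.
# The exp(advantage / beta) weighting, clipped at max_weight, is one common
# formulation; it is not claimed to be Netflix's exact loss.
import torch
import torch.nn.functional as F

def advantage_weighted_sft_loss(logits, target_items, rewards, baseline,
                                beta=1.0, max_weight=20.0):
    """
    logits:       (batch, num_items) next-item scores from the generative recommender
    target_items: (batch,) item the user actually engaged with (imitation target)
    rewards:      (batch,) noisy reward for that engagement, e.g. watch time
    baseline:     (batch,) expected reward (value estimate or batch-mean reward)
    """
    # Standard next-item prediction (pure behavior cloning) term.
    nll = F.cross_entropy(logits, target_items, reduction="none")

    # Advantage: how much better this interaction was than expected.
    advantage = rewards - baseline

    # Use the noisy advantage as a directional guide, not absolute truth:
    # exponentiate, then clip so a single over-estimated reward cannot dominate.
    weights = torch.clamp(torch.exp(advantage / beta), max=max_weight)

    # Up-weight interactions the reward signal prefers; down-weight the rest.
    return (weights.detach() * nll).mean()
```

The `baseline` could come from a learned value head or simply the batch-mean reward; detaching the weights keeps the gradient flowing only through the imitation term, and the clip caps how much any single noisy reward can steer the update.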