reinforcement-learning

5 posts

kakao

Kanana-2 Development Story (2)

Kakao’s development of the Kanana-2 model family represents a strategic shift toward Agentic AI, prioritizing complex reasoning and execution capabilities over simple conversational fluency. By implementing a sophisticated post-training pipeline—including a specialized Mid-training stage and refined reinforcement learning—the team successfully enhanced the model's instruction-following and tool-calling performance. This methodology ensures that the 30B parameter models excel in logical tasks and real-world agentic environments while maintaining high linguistic stability in both English and Korean.

## Mid-training and Catastrophic Forgetting Prevention

* A 250B token Mid-training stage was introduced between Pre-training and Post-training to bridge the gap in reasoning, coding, and tool-calling capabilities.
* The dataset comprised 200B tokens of high-quality reasoning data (Chain-of-Thought math and code) and 50B tokens of "replay" data from the original pre-training set.
* This replay strategy specifically targeted "Catastrophic Forgetting," preventing the model from losing its Korean linguistic nuances and performance on benchmarks like KoMT-bench while it gained English-heavy reasoning skills.
* Experimental results indicated that Mid-training serves as a foundational "force multiplier," leading to faster convergence and higher performance ceilings during subsequent Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) stages.

## Enhanced Instruction Following and Tool Calling

* To optimize for Agentic AI, the developers focused on Instruction Following (IFEval) by synthesizing high-quality, long-form responses that strictly adhere to complex constraints.
* Tool-calling capabilities were improved using "Rejection Sampling" (Iterative SFT), where model-generated trajectories are validated in a real execution environment; only successful outcomes are retained for training.
* The training data was categorized into distinct buckets—such as Chat, Math, Code, and Tool Calling—allowing for a more balanced recipe compared to previous Kanana versions.
* This approach specifically addressed multi-turn and multi-tool scenarios, ensuring the model can handle the recursive logic required for autonomous agents.

## Parallel Reinforcement Learning and Calibration Tuning

* A "Parallel RL" framework was adopted to optimize different capabilities simultaneously: the "Chat" track focused on helpfulness and safety, while the "Logic" track focused on accuracy in math and programming.
* The pipeline moved beyond standard SFT to include Reinforcement Learning from Human Feedback (RLHF), utilizing DPO and PPO-style methods to align the model with human preferences.
* A final "Calibration Tuning" step was implemented to ensure the model’s internal confidence levels match its actual accuracy, effectively reducing hallucinations and improving reliability in technical tasks.
* Comparative benchmarks show that the Kanana-2 Instruct and Thinking models significantly outperform earlier versions and rival larger open-source models in reasoning and coding benchmarks like HumanEval and GSM8K.

The Kanana-2 development cycle demonstrates that achieving "Agentic" performance requires more than just scaling data; it requires a structured transition from general language understanding to execution-verified reasoning. For organizations building AI agents, the Kanana-2 post-training recipe suggests that integrating environment-validated feedback and balancing reasoning data with foundational language "replays" is critical for creating reliable, multi-functional models.
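The rejection-sampling (Iterative SFT) loop described above can be sketched in a few lines. This is a minimal illustration, not Kakao's implementation: the "model", the `add` tool, and the environment check below are toy stand-ins, but the core idea is the same, keeping only trajectories that a real execution environment verifies as successful.

```python
import random

def rejection_sample_sft_data(tasks, generate, execute, n_samples=8):
    """Iterative-SFT data collection: sample candidate tool-calling
    trajectories and retain only those the execution environment
    verifies as successful, for use in the next SFT round."""
    kept = []
    for task in tasks:
        for _ in range(n_samples):
            trajectory = generate(task)
            if execute(task, trajectory):  # environment-validated feedback
                kept.append((task, trajectory))
    return kept

# Toy stand-ins: "tasks" are target sums, the "model" guesses arguments
# for a hypothetical `add` tool, and the "environment" checks the result.
random.seed(0)
tasks = [3, 7, 9]
generate = lambda target: [("add", random.randint(0, 5), random.randint(0, 5))]
execute = lambda target, traj: traj[0][1] + traj[0][2] == target
sft_pairs = rejection_sample_sft_data(tasks, generate, execute, n_samples=32)
```

Every pair in `sft_pairs` is guaranteed correct by construction, which is what makes the retained data safe to train on even though the generator itself is unreliable.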

aws

Amazon Bedrock adds reinforcement fine-tuning simplifying how developers build smarter, more accurate AI models | AWS News Blog

Amazon Bedrock has introduced reinforcement fine-tuning, a new model customization capability that allows developers to build more accurate and cost-effective AI models using feedback-driven training. By moving away from the requirement for massive labeled datasets in favor of reward signals, the platform enables average accuracy gains of 66% while automating the complex infrastructure typically associated with advanced machine learning. This approach allows organizations to optimize smaller, faster models for specific business needs without sacrificing performance or incurring the high costs of larger model variants.

**Challenges of Traditional Model Customization**

* Traditional fine-tuning often requires massive, high-quality labeled datasets and expensive human annotation, which can be a significant barrier for many organizations.
* Developers previously had to choose between settling for generic "out-of-the-box" results and managing the high costs and complexity of large-scale infrastructure.
* The high barrier to entry for advanced reinforcement learning techniques often required specialized ML expertise that many development teams lack.

**Mechanics of Reinforcement Fine-Tuning**

* The system uses an iterative feedback loop where models improve based on reward signals that judge the quality of responses against specific business requirements.
* Reinforcement Learning with Verifiable Rewards (RLVR) utilizes rule-based graders to provide objective feedback for tasks such as mathematics or code generation.
* Reinforcement Learning from AI Feedback (RLAIF) uses AI-driven evaluations to help models understand preference and quality without manual human intervention.
* The workflow can be powered by existing API logs within Amazon Bedrock or by uploading training datasets, eliminating the need for complex infrastructure setup.

**Performance and Security Advantages**

* The technique achieves an average accuracy improvement of 66% over base models, enabling smaller models to perform at the level of much larger alternatives.
* Current support includes the Amazon Nova 2 Lite model, which helps developers optimize for both speed and price-to-performance.
* All training data and customization processes remain within the secure AWS environment, ensuring that proprietary data is protected and compliant with organizational security standards.

Developers should consider reinforcement fine-tuning as a primary strategy for optimizing smaller models like Amazon Nova 2 Lite to achieve high-tier performance at a lower cost. This capability is particularly recommended for specialized tasks like reasoning and coding where objective reward functions can be used to rapidly iterate and improve model accuracy.
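The RLVR idea, rule-based graders supplying objective rewards without human labels, can be illustrated with a small sketch. The grader below is hypothetical (the post does not show Bedrock's actual grader interface); it scores a math response by comparing the final number in the model's output against a reference answer.

```python
import re

def math_grader(reference_answer: str, model_output: str) -> float:
    """Rule-based RLVR-style grader: reward 1.0 iff the last number in
    the model's output matches the reference answer, else 0.0.
    Deterministic and objective, so no human annotation is needed."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    if not numbers:
        return 0.0  # no answer found at all
    return 1.0 if numbers[-1] == reference_answer else 0.0

# The reward signal that would drive the fine-tuning loop:
reward = math_grader("42", "Let's compute: 6 * 7 = 42")
```

In a real pipeline this function would be called once per sampled response, and the resulting scalar rewards would steer the policy update; the same pattern extends to code generation by replacing the regex check with a unit-test run.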

aws

New serverless customization in Amazon SageMaker AI accelerates model fine-tuning | AWS News Blog

Amazon SageMaker AI has introduced a new serverless customization capability designed to accelerate the fine-tuning of popular models like Llama, DeepSeek, and Amazon Nova. By automating resource provisioning and providing an intuitive interface for advanced reinforcement learning techniques, this feature reduces the model customization lifecycle from months to days. This end-to-end workflow allows developers to focus on model performance rather than infrastructure management, from initial training through to final deployment.

**Automated Infrastructure and Model Support**

* The service provides a serverless environment where SageMaker AI automatically selects and provisions compute resources based on the specific model architecture and dataset size.
* Supported models include a broad range of high-performance options such as Amazon Nova, DeepSeek, GPT-OSS, Meta Llama, and Qwen.
* The feature is accessible directly through the Amazon SageMaker Studio interface, allowing users to manage their entire model catalog in one location.

**Advanced Customization and Reinforcement Learning**

* Users can choose from several fine-tuning techniques, including traditional Supervised Fine-Tuning (SFT) and more advanced methods.
* The platform supports modern optimization techniques such as Direct Preference Optimization (DPO), Reinforcement Learning from Verifiable Rewards (RLVR), and Reinforcement Learning from AI Feedback (RLAIF).
* To simplify the process, SageMaker AI provides recommended defaults for hyperparameters like batch size, learning rate, and epochs based on the selected tuning technique.

**Experiment Tracking and Security**

* The workflow introduces a serverless MLflow application, enabling seamless experiment tracking and performance monitoring without additional setup.
* Advanced configuration options allow for fine-grained control over network encryption and storage volume encryption to ensure data security.
* The "Continue customization" feature allows for iterative tuning, where users can adjust hyperparameters or apply different techniques to an existing customized model.

**Evaluation and Deployment Flexibility**

* Built-in evaluation tools allow developers to compare the performance of their customized models against the original base models to verify improvements.
* Once a model is finalized, it can be deployed with a few clicks to either Amazon SageMaker or Amazon Bedrock.
* A centralized "My Models" dashboard tracks all custom iterations, providing detailed logs and status updates for every training and evaluation job.

This serverless approach is highly recommended for teams that need to adapt large language models to specific domains quickly without the operational overhead of managing GPU clusters. By utilizing the integrated evaluation and multi-platform deployment options, organizations can transition from experimentation to production-ready AI more efficiently.
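Of the techniques listed above, DPO has a particularly compact objective, which helps explain why it can be offered as a managed option with sensible hyperparameter defaults. The sketch below is a from-scratch illustration of the per-pair DPO loss, not SageMaker's internal implementation; `beta` stands in for the KL-strength hyperparameter a recipe would expose alongside batch size and learning rate.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    logp_*     : summed log-probs of the chosen/rejected response
                 under the policy being tuned
    ref_logp_* : the same quantities under the frozen reference model
    beta       : strength of the implicit KL constraint (hyperparameter)
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log sigmoid(margin): small when the policy prefers the chosen
    # response more strongly than the reference model does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy favors the chosen response relative to the reference -> low loss;
# flipping the pair makes the loss large.
low = dpo_loss(-10.0, -20.0, -12.0, -18.0)
high = dpo_loss(-20.0, -10.0, -18.0, -12.0)
```

No reward model or sampling loop is needed, only log-probabilities from two forward passes, which is what makes DPO comparatively cheap to run as a managed fine-tuning technique.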

netflix

Post-Training Generative Recommenders with Advantage-Weighted Supervised Finetuning | by Netflix Technology Blog | Netflix TechBlog

Netflix is evolving its recommendation systems by moving beyond simple behavior imitation toward generative recommenders that better align with true user preferences. While generative models like HSTU and OneRec effectively capture sequential user patterns, they often struggle to distinguish between habitual clicks and genuine satisfaction. To bridge this gap, Netflix developed Advantage-Weighted Supervised Fine-tuning (A-SFT), a post-training method that leverages noisy reward signals to refine model performance without the need for complex counterfactual data.

### The Shift to Generative Recommenders

* Modern generative recommenders (GRs), such as HSTU and OneRec, utilize transformer architectures to treat recommendation as a sequential transduction task.
* The models are typically trained using next-item prediction, where the system learns to imitate the chronological sequence of a user’s activities.
* A significant drawback of this "behavior cloning" approach is that it captures external trends and noise rather than long-term user satisfaction, potentially recommending content the user finished but did not actually enjoy.

### Barriers to Reinforcement Learning in RecSys

* Traditional post-training methods used in Large Language Models, such as Proximal Policy Optimization (PPO) or Direct Preference Optimization (DPO), require counterfactual feedback that is difficult to obtain in recommendation contexts.
* Because user sequences span weeks or years, it is impractical to generate and test hypothetical, counterfactual experiences for real-time user validation.
* Reward signals in recommendation systems are inherently noisy; for instance, high watch time might indicate interest, but it can also be a result of external circumstances, making it an unreliable metric for optimization.

### Advantage-Weighted Supervised Fine-tuning (A-SFT)

* A-SFT is a hybrid approach that sits between offline reinforcement learning and standard supervised fine-tuning.
* The algorithm incorporates an advantage function to weight training examples, allowing the model to prioritize actions that lead to higher rewards while filtering out noise from the reward model.
* This method is specifically designed to handle high-variance reward signals, using them as directional guides rather than absolute truth, which prevents the model from over-exploiting inaccurate data.
* Benchmarks against other representative methods show that A-SFT achieves superior alignment between the generative recommendation policy and the underlying reward model.

For organizations managing large-scale recommendation engines, A-SFT offers a practical path to implementing post-training improvements. By focusing on advantage-weighted signals, developers can improve recommendation quality using existing implicit feedback—like watch time and clicks—without the infrastructure hurdles of online reinforcement learning.
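The post does not spell out the exact A-SFT objective, but an advantage-weighted SFT loss in the spirit it describes, using noisy rewards as directional guides rather than absolute truth, can be sketched as follows. The batch-mean baseline, exponential weighting, temperature, and clipping here are illustrative assumptions, not Netflix's published formulation.

```python
import math

def asft_weights(rewards, temperature=1.0, clip=20.0):
    """Advantage-style weights for supervised fine-tuning examples.

    Advantage = reward minus a batch-mean baseline; the exponentiated,
    clipped weight up-weights high-reward interactions and down-weights
    (rather than discards) low-reward ones, limiting over-exploitation
    of an inaccurate reward model."""
    baseline = sum(rewards) / len(rewards)
    advantages = [r - baseline for r in rewards]
    return [min(math.exp(a / temperature), clip) for a in advantages]

def asft_loss(logps, rewards):
    """Weighted negative log-likelihood over observed user interactions
    (standard SFT is recovered when all weights equal 1)."""
    weights = asft_weights(rewards)
    return -sum(w * lp for w, lp in zip(weights, logps)) / len(logps)

# Interactions with above-average reward dominate the objective.
weights = asft_weights([1.0, 0.0, 0.5])
```

Because the loss only reweights the existing next-item log-likelihood, it trains on logged implicit feedback exactly as SFT does, with no counterfactual generation or online interaction required.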

google

A collaborative approach to image generation

Google Research has introduced PASTA (Preference Adaptive and Sequential Text-to-image Agent), a reinforcement learning agent designed to transform image generation from a single-prompt task into a collaborative, multi-turn dialogue. By learning individual user preferences through sequential interactions, the system reduces the trial-and-error prompting otherwise needed to achieve a specific creative vision.

## Data Strategy and User Simulation

* Researchers collected a foundational dataset featuring over 7,000 human interactions, using Gemini Flash for prompt expansion and Stable Diffusion XL (SDXL) for image generation.
* To overcome the scarcity of real-world interaction data, the team developed a user simulator that generated over 30,000 additional interaction trajectories.
* The simulator is built on two primary components: a utility model that predicts how much a user will like an image, and a choice model that predicts which image a user will select from a given set.

## Latent Preference Discovery

* The architecture utilizes pre-trained CLIP encoders paired with user-specific components to capture nuanced aesthetic tastes.
* An expectation-maximization (EM) algorithm is employed to identify "user types," allowing the system to cluster users with similar interests, such as a preference for specific artistic styles or subject matter like "Food" or "Animals."
* This approach enables the model to generalize preferences quickly, allowing it to adapt to new users based on minimal initial feedback.

## The Collaborative Generation Loop

* PASTA operates as a value-based reinforcement learning model that aims to maximize cumulative user satisfaction across an entire interaction session.
* The workflow begins with a candidate generator creating diverse prompt expansions; a candidate selector then picks an optimal "slate" of four variations to present to the user.
* Each user selection provides a feedback signal that guides the agent’s next set of suggestions, iteratively narrowing the gap between the generated output and the user's intent.

## Training and Performance Validation

* The agent was trained using Implicit Q-learning (IQL) to optimize decision-making without requiring online interaction during the training phase.
* Performance was measured using several metrics, including Pick-a-Pic accuracy, Spearman’s rank correlation, and cross-turn accuracy.
* Results indicated that agents trained on a combination of real-world and simulated data significantly outperformed baseline models and versions trained on only one data type.

PASTA demonstrates that integrating iterative feedback loops and reinforcement learning can effectively bridge the "intent gap" in generative AI. For developers building creative tools, this research suggests that moving away from static prompting toward adaptive, simulation-trained agents can provide a more satisfying and intuitive user experience.
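The core trick of IQL, fitting the value function with an asymmetric expectile loss so the offline agent never has to evaluate out-of-distribution actions, can be shown in isolation. This is a generic IQL ingredient, not PASTA's training code, and the `tau` value is illustrative; the post does not give PASTA's hyperparameters.

```python
def expectile_loss(q_value, v_value, tau=0.7):
    """Asymmetric expectile regression loss from Implicit Q-learning.

    With tau > 0.5, underestimation of V (u > 0) is penalized more than
    overestimation, so V is pushed toward an upper expectile of Q,
    approximating the best actions present in the offline dataset
    without querying actions the data never contains."""
    u = q_value - v_value
    weight = tau if u > 0 else (1.0 - tau)
    return weight * u * u

# Same-magnitude errors, asymmetric penalties:
under = expectile_loss(1.0, 0.0)  # V too low  -> heavier penalty
over = expectile_loss(0.0, 1.0)   # V too high -> lighter penalty
```

Setting `tau=0.5` recovers ordinary mean-squared regression; raising it toward 1.0 makes the learned value estimate increasingly optimistic about the best logged behavior, which is what lets an agent like PASTA train entirely from recorded and simulated interaction trajectories.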