instruction-following

2 posts

kakao

Kanana-2 Development Story (2)

Kakao’s development of the Kanana-2 model family represents a strategic shift toward Agentic AI, prioritizing complex reasoning and execution capabilities over simple conversational fluency. By implementing a sophisticated post-training pipeline, including a specialized mid-training stage and refined reinforcement learning, the team substantially improved the models' instruction-following and tool-calling performance. This methodology ensures that the 30B-parameter models excel at logical tasks and in real-world agentic environments while maintaining high linguistic stability in both English and Korean.

## Mid-training and Catastrophic Forgetting Prevention

* A 250B-token mid-training stage was introduced between pre-training and post-training to bridge the gap in reasoning, coding, and tool-calling capabilities.
* The dataset comprised 200B tokens of high-quality reasoning data (chain-of-thought math and code) and 50B tokens of "replay" data drawn from the original pre-training set.
* This replay strategy specifically targeted catastrophic forgetting, preventing the model from losing its Korean linguistic nuance and its performance on benchmarks such as KoMT-Bench while it gained English-heavy reasoning skills.
* Experimental results indicated that mid-training acts as a foundational "force multiplier," leading to faster convergence and higher performance ceilings during the subsequent supervised fine-tuning (SFT) and reinforcement learning (RL) stages.

## Enhanced Instruction Following and Tool Calling

* To optimize for Agentic AI, the developers improved instruction following (measured on IFEval) by synthesizing high-quality, long-form responses that strictly adhere to complex constraints.
* Tool-calling capabilities were improved via rejection sampling (iterative SFT): model-generated trajectories are validated in a real execution environment, and only successful outcomes are retained for training.
* The training data was organized into distinct buckets, such as Chat, Math, Code, and Tool Calling, allowing a more balanced recipe than in previous Kanana versions.
* This approach specifically addressed multi-turn and multi-tool scenarios, ensuring the model can handle the recursive logic required by autonomous agents.

## Parallel Reinforcement Learning and Calibration Tuning

* A "parallel RL" framework was adopted to optimize different capabilities simultaneously: a "Chat" track focused on helpfulness and safety, while a "Logic" track focused on accuracy in math and programming.
* The pipeline moved beyond standard SFT to include reinforcement learning from human feedback (RLHF), using DPO- and PPO-style methods to align the model with human preferences.
* A final calibration-tuning step was implemented to ensure the model's internal confidence matches its actual accuracy, reducing hallucinations and improving reliability on technical tasks.
* Comparative benchmarks show that the Kanana-2 Instruct and Thinking models significantly outperform earlier versions and rival larger open-source models on reasoning and coding benchmarks such as HumanEval and GSM8K.

The Kanana-2 development cycle demonstrates that achieving "agentic" performance requires more than scaling data; it requires a structured transition from general language understanding to execution-verified reasoning. For organizations building AI agents, the Kanana-2 post-training recipe suggests that integrating environment-validated feedback and balancing reasoning data with foundational language "replays" is critical to creating reliable, multi-functional models.
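The rejection-sampling (iterative SFT) loop for tool calling can be sketched roughly as follows. The post does not publish Kakao's implementation, so every name here (`generate_trajectory`, `execute_in_sandbox`, the success check) is a hypothetical stand-in; only the overall pattern — sample several trajectories, validate each by execution, keep verified successes for the next fine-tuning round — comes from the article.

```python
import random

# Hypothetical sketch of rejection sampling (iterative SFT) for tool calling:
# sample candidate trajectories per prompt, validate each by actually running
# its tool calls, and keep only the verified successes as SFT data.

def generate_trajectory(prompt, seed):
    """Stand-in for sampling one tool-call trajectory from the current model."""
    random.seed(seed)
    return {"prompt": prompt, "tool_calls": [], "score": random.random()}

def execute_in_sandbox(trajectory):
    """Stand-in for replaying the trajectory's tool calls in a real
    execution environment and checking whether the task succeeded."""
    return trajectory["score"] > 0.5  # placeholder success criterion

def rejection_sample(prompts, n_samples=8):
    accepted = []
    for prompt in prompts:
        for seed in range(n_samples):
            traj = generate_trajectory(prompt, seed)
            if execute_in_sandbox(traj):  # discard failed trajectories
                accepted.append(traj)
    return accepted  # becomes the SFT set for the next iteration
```

In a real pipeline the accepted trajectories would be fed back into SFT and the loop repeated, which is what makes the procedure "iterative" rather than a one-shot filter.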

kakao

The Evolution of Kanana-o

Kakao has significantly advanced its integrated multimodal model, Kanana-o, by enhancing its ability to process complex instructions across text, image, and audio inputs while enriching its emotional vocal expression. By developing specialized datasets and sophisticated training techniques for prosody, the team has bridged the performance gap between the text and audio modalities. The result is a more natural, human-like AI capable of nuanced interaction and strong instruction following, particularly in the Korean linguistic context.

## Advancing Multimodal Instruction Following

* Addressed the "modality gap," whereby multimodal models often show decreased reasoning performance when processing audio inputs compared to text.
* Constructed a structured, high-quality dataset featuring complex, multi-step instructions, such as summarizing a passage and then translating it into a specific language or style.
* Leveraged Speech-KoMT-Bench for evaluation, showing that Kanana-o significantly outperforms global competitors of similar scale on Korean-specific tasks.
* Focused on domain generalization so that the model's core intelligence remains stable regardless of whether the input is text, audio, or a combination of both.

## Image-Audio-Text Modality Alignment

* Developed integrated datasets to ensure that reasoning capabilities learned in text-image or text-audio contexts generalize to complex image-audio scenarios.
* Trained the model to handle tasks where users ask questions about visual information by voice, requiring the simultaneous alignment of three different data types.
* Prioritized retention of world knowledge during multimodal training so that adding new modalities does not degrade the model's factual accuracy.

## Enhancing Vocal Expressiveness and Prosody

* Focused on prosody (the rhythm, pitch, and stress of speech) to move beyond robotic, flat text-to-speech (TTS) output.
* Implemented a system of descriptive tokens and emotion tags (e.g., "warm voice," "excited tone") during training to give the model fine-grained control over its vocal persona.
* Incorporated natural human speech elements, such as realistic breathing patterns and contextual variation in speech speed, to make interactions feel more intuitive and less synthetic.
* Refined the model's ability to interpret the user's emotional state from their voice and to respond with matching emotional intensity.

The evolution of Kanana-o highlights a shift from simply maximizing generic benchmarks to optimizing real-world user experience through multimodal alignment and emotional intelligence. The model's success underscores the need for high-quality, structured instruction data and fine-grained control over output style to create conversational AI that feels truly natural to the user.
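Descriptive-token conditioning of the kind described above is commonly implemented by prepending style tags to the training text so the model learns to condition its audio output on them. Kakao has not published Kanana-o's tag vocabulary or sample format, so this sketch assumes a simple bracketed-tag scheme and a hypothetical helper:

```python
# Hypothetical sketch of attaching prosody/emotion control tags to a TTS
# training sample. The tag vocabulary and the bracketed format are
# assumptions for illustration, not Kakao's published specification.

STYLE_TAGS = {"warm voice", "excited tone", "calm", "whisper"}

def tag_sample(text, tags):
    """Prepend validated style tags so the model can learn fine-grained
    control over its vocal persona from plain text-audio pairs."""
    unknown = set(tags) - STYLE_TAGS
    if unknown:
        raise ValueError(f"unknown style tags: {sorted(unknown)}")
    prefix = "".join(f"[{t}]" for t in tags)
    return f"{prefix} {text}"

print(tag_sample("It's so good to see you again!", ["warm voice", "excited tone"]))
# -> [warm voice][excited tone] It's so good to see you again!
```

Validating tags against a fixed vocabulary matters because an unseen tag at inference time would be an out-of-distribution token the model was never trained to interpret.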