# The Evolution of Kanana-o

Kakao has significantly advanced its integrated multimodal model, Kanana-o, enhancing its ability to follow complex instructions across text, image, and audio inputs while enriching its emotional vocal expression. By developing specialized datasets and training techniques for prosody, the team has bridged the performance gap between text and audio modalities. The result is a more natural, human-like AI capable of nuanced interaction and reliable instruction following, particularly within the Korean linguistic context.

## Advancing Multimodal Instruction Following

* Addressed the "modality gap," in which multimodal models often show degraded reasoning performance when processing audio inputs compared to text.
* Constructed a structured, high-quality dataset of complex, multi-step instructions, such as summarizing a context and then translating it into a specific language or style.
* Leveraged Speech-KoMT-Bench to evaluate performance, showing that Kanana-o significantly outperforms global competitors of similar scale on Korean-specific tasks.
* Focused on domain generalization so that the model's core intelligence remains stable whether the input is text, audio, or a combination of both.

## Image-Audio-Text Modality Alignment

* Developed integrated datasets to ensure that reasoning capabilities learned in text-image or text-audio contexts generalize to complex image-audio scenarios.
* Trained the model to handle tasks where users ask questions about visual information by voice, which requires aligning all three data types simultaneously.
* Prioritized the preservation of world knowledge during multimodal training so that adding new modalities does not degrade the model's factual accuracy.

## Enhancing Vocal Expressiveness and Prosody

* Focused on prosody (the rhythm, pitch, and stress of speech) to move beyond robotic, flat text-to-speech (TTS) output.
* Implemented descriptive tokens and emotion tags (e.g., "warm voice," "excited tone") during training to give the model fine-grained control over its vocal persona; a hypothetical sample format is sketched at the end of this post.
* Incorporated natural human speech elements, such as realistic breathing patterns and context-dependent variations in speaking speed, to make interactions feel more intuitive and less synthetic.
* Refined the model's ability to interpret the user's emotional state from their voice and respond with matching emotional intensity.

The evolution of Kanana-o highlights a shift from simply maximizing generic benchmarks to optimizing real-world user experience through multimodal alignment and emotional intelligence. Its success underscores the need for high-quality, structured instruction data and fine-grained control over output style to create conversational AI that feels truly natural to the user.
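To make the dataset and tagging ideas above concrete, here is a minimal sketch of what a single multimodal instruction-tuning record might look like, assuming a JSON-style sample with optional image and audio inputs, a multi-step instruction, and a response annotated with emotion and prosody tags. The field names, tag vocabulary, and `build_sample` helper are illustrative assumptions and do not reflect Kakao's actual data format.

```python
# Illustrative sketch only: the field names, tag strings, and builder
# function below are assumptions for explanation, not Kakao's actual
# Kanana-o data schema or tooling.
import json
from typing import Optional


def build_sample(instruction: str,
                 response_text: str,
                 emotion: str = "neutral",
                 prosody: Optional[str] = None,
                 image_path: Optional[str] = None,
                 audio_path: Optional[str] = None) -> dict:
    """Assemble one training record: a (possibly multi-step) instruction
    paired with a response prefixed by style-control tags."""
    tags = [f"<emotion:{emotion}>"]
    if prosody:
        tags.append(f"<prosody:{prosody}>")
    return {
        # Optional non-text inputs; either, both, or neither may be present.
        "image": image_path,
        "audio": audio_path,
        # The instruction may chain several steps (e.g., summarize, then translate).
        "instruction": instruction,
        # Style tags prefix the target so the model can learn fine-grained
        # control over the emotion and prosody of its spoken response.
        "response": " ".join(tags) + " " + response_text,
    }


if __name__ == "__main__":
    sample = build_sample(
        instruction=("Listen to the attached recording, summarize the key "
                     "decisions in two sentences, and then translate the "
                     "summary into formal Korean."),
        response_text="(two-sentence summary followed by its Korean translation)",
        emotion="warm",
        prosody="calm, moderate pace with natural pauses",
        audio_path="meeting_recording.wav",
    )
    print(json.dumps(sample, ensure_ascii=False, indent=2))
```

The sketch simply combines the two ideas described above in one record: an instruction that chains several steps, and explicit style tags that the model can learn to condition its vocal output on.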