Kakao / kanana

2 posts

kakao

Developing Kanana-v-4b-hybrid, Kakao's hybrid multimodal language model that "thinks before it answers"

Kakao's Kanana-v-4b-hybrid is a multimodal language model designed to go beyond simple image-to-text conversion by integrating logical reasoning and self-verification directly into its response process. By handling both intuitive dialogue and complex visual reasoning within a single hybrid model, it achieves high accuracy and reliability on sophisticated tasks. This approach lets the model keep the user experience consistent while excelling in Korean-specific contexts, as evidenced by its record-breaking 92.8 score on the KoNET evaluation.

### Integrated Hybrid Architecture

* Consolidates intuitive tasks (such as OCR and summarization) and logical tasks (complex reasoning) into a single model to reduce system complexity and maintenance costs.
* Eliminates the need for external routing between specialized models, ensuring a consistent tone, response format, and safety policy throughout a conversation session (see the first sketch after this summary).
* Uses a refined training recipe that balances data ratios and visual reasoning training so that improvements in multimodal understanding benefit all types of user queries.

### Visual Reasoning and Self-Reflection

* Follows a natural logical flow: synthesizing information from images and text, applying conditions, verifying candidates, and finally concluding the response.
* Features a "Reflection" mechanism in which the model actively monitors its own thought process to catch "small but fatal" errors, such as calculation mistakes or missed constraints.
* Excels in high-stakes visual tasks like receipt auditing, table filtering, and mathematical problem-solving by double-checking intermediate results against the original image data (see the second sketch after this summary).

### Native Korean Logical Processing

* Prioritizes "thinking in Korean" to accurately preserve the nuances of complex constraints, such as "except for X" or "only in cases of Y," which are often lost during internal translation.
* Develops a native Korean rationale process to prevent logical drift, keeping the internal reasoning steps aligned with the linguistic structure of the user's query.
* Addresses the difficulty of processing information scattered across Korean-language documents or exam papers by synthesizing the data without language-conversion overhead.

Kanana-v-4b-hybrid marks a shift toward "verifiable AI" that provides evidence-based answers rather than merely plausible text. For applications in education, finance, or complex document processing, the model offers a blueprint for building trust through transparent reasoning and self-correction.
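The post's architectural point is that one set of weights serves both quick, intuitive queries and multi-step visual reasoning, so nothing has to route a conversation between specialized models. The first sketch below is a minimal illustration of that serving pattern under my own assumptions; the `KananaVHybrid` wrapper, the `thinking` flag, and the keyword heuristic that enables it are hypothetical, not Kakao's implementation.

```python
from dataclasses import dataclass


@dataclass
class Reply:
    text: str
    rationale: str | None  # populated only when the reasoning pass runs


class KananaVHybrid:
    """Hypothetical wrapper around a single hybrid multimodal model.

    The same weights answer intuitive requests (OCR, captioning, summarization)
    and reasoning-heavy ones; only the decoding mode changes, so tone, response
    format, and safety policy stay identical across a session.
    """

    def generate(self, image: bytes, query: str, thinking: bool) -> Reply:
        if thinking:
            # Synthesize image and text, apply conditions, verify candidates,
            # then conclude, mirroring the flow the post describes.
            rationale = self._reason(image, query)
            return Reply(text=self._conclude(rationale), rationale=rationale)
        return Reply(text=self._answer_directly(image, query), rationale=None)

    # Placeholders standing in for actual model calls.
    def _reason(self, image: bytes, query: str) -> str: ...
    def _conclude(self, rationale: str) -> str: ...
    def _answer_directly(self, image: bytes, query: str) -> str: ...


def handle_turn(model: KananaVHybrid, image: bytes, query: str) -> Reply:
    # Naive stand-in heuristic: constraint-laden queries get the thinking pass.
    # In practice the mode decision would be learned, not keyword-based.
    needs_reasoning = "조건" in query or "계산" in query or len(query) > 80
    return model.generate(image, query, thinking=needs_reasoning)
```

Because both paths live in one model, there is no seam where tone or safety behavior could change when a user switches from a simple lookup to a hard reasoning question mid-conversation.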

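The "Reflection" behavior comes down to double-checking intermediate results before committing to an answer. As a concrete stand-in for the receipt-auditing use case mentioned above, here is a small consistency check written as plain Python; the function name, tolerance, and example values are my own illustration, not model behavior taken from the post.

```python
def audit_receipt_total(line_item_prices: list[float], printed_total: float,
                        tolerance: float = 0.01) -> dict:
    """Reflection-style cross-check: recompute an intermediate result and
    compare it against the value read off the image before answering."""
    recomputed = round(sum(line_item_prices), 2)
    consistent = abs(recomputed - printed_total) <= tolerance
    if consistent:
        conclusion = (f"The printed total of {printed_total} matches "
                      f"the sum of the line items.")
    else:
        # Instead of silently trusting either the OCR output or the arithmetic,
        # the mismatch is surfaced so the original image can be re-checked.
        conclusion = (f"The line items sum to {recomputed}, but the receipt "
                      f"prints {printed_total}; the image should be re-read.")
    return {
        "recomputed_total": recomputed,
        "printed_total": printed_total,
        "consistent": consistent,
        "conclusion": conclusion,
    }


# Example: three OCR-extracted line items checked against a printed total of 13500.
print(audit_receipt_total([4500.0, 3000.0, 6000.0], 13500.0))
```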
kakao

The evolution of Kanana-o toward smarter answers and richer emotional expression

Kakao has significantly advanced its integrated multimodal model, Kanana-o, by enhancing its ability to process complex instructions across text, image, and audio inputs while enriching its emotional vocal expression. By developing specialized datasets and sophisticated training techniques for prosody, the team has bridged the performance gap between text and audio modalities. The result is a more natural, human-like AI capable of nuanced interaction and strong instruction following, particularly within the Korean linguistic context.

## Advancing Multimodal Instruction Following

* Addressed the "modality gap," in which multimodal models often show degraded reasoning performance when processing audio inputs compared to text.
* Constructed a structured, high-quality dataset of complex, multi-step instructions, such as summarizing a context and then translating it into a specific language or style.
* Used the Speech-KoMT-Bench to evaluate performance, showing that Kanana-o significantly outperforms global competitors of similar scale on Korean-specific tasks.
* Focused on domain generalization so that the model's core intelligence remains stable regardless of whether the input is text, audio, or a combination of both.

## Image-Audio-Text Modality Alignment

* Developed integrated datasets to ensure that reasoning capabilities learned in text-image or text-audio contexts generalize to complex image-audio scenarios.
* Trained the model to handle tasks where users ask questions about visual information by voice, which requires aligning three different data types simultaneously.
* Prioritized preserving world knowledge during multimodal training so that adding new modalities does not degrade the model's factual accuracy.

## Enhancing Vocal Expressiveness and Prosody

* Focused on prosody (the rhythm, pitch, and stress of speech) to move beyond robotic, flat text-to-speech (TTS) output.
* Implemented descriptive tokens and emotion tags (e.g., "warm voice," "excited tone") during training to give the model fine-grained control over its vocal persona (see the sketch after this summary).
* Incorporated natural human speech elements, such as realistic breathing patterns and contextual variations in speaking speed, to make interactions feel more intuitive and less synthetic.
* Refined the model's ability to interpret the user's emotional state from their voice and respond with matching emotional intensity.

The evolution of Kanana-o highlights a shift from simply maximizing generic benchmarks to optimizing real-world user experiences through multimodal alignment and emotional intelligence. Its success underscores the need for high-quality, structured instruction data and fine-grained control over output style to build conversational AI that feels truly natural to the user.
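The prosody work amounts to conditioning speech generation on explicit style descriptors: emotion tags, voice descriptions, speaking rate, and breathing. The snippet below sketches how such an annotated sample or generation request might be structured; the tag vocabulary, field names, and `render_control_prefix` helper are assumptions for illustration, not the format Kakao actually uses.

```python
from dataclasses import dataclass


@dataclass
class SpeechStyle:
    """Hypothetical fine-grained controls for a spoken response."""
    emotion: str = "neutral"               # e.g. "excited", "comforting"
    voice_description: str = "warm voice"  # descriptive token for the vocal persona
    speech_rate: float = 1.0               # 1.0 = normal, <1.0 slower, >1.0 faster
    breaths: bool = True                   # insert natural breathing pauses


def render_control_prefix(style: SpeechStyle) -> str:
    """Flatten the style controls into tag-like tokens prepended to the target
    transcript, so the model learns to tie tags to acoustic style."""
    tags = [
        f"<emotion:{style.emotion}>",
        f"<voice:{style.voice_description}>",
        f"<rate:{style.speech_rate:.1f}>",
    ]
    if style.breaths:
        tags.append("<breath:natural>")
    return " ".join(tags)


# One annotated pair: the user's audio sounds excited, so the target speech
# is tagged with a matching emotion and a slightly faster rate.
sample = {
    "user_audio": "user_turn.wav",
    "target_text": "와, 정말 잘됐네요! 축하드려요!",
    "style_prefix": render_control_prefix(SpeechStyle(emotion="excited", speech_rate=1.1)),
}
print(sample["style_prefix"])  # <emotion:excited> <voice:warm voice> <rate:1.1> <breath:natural>
```

Whether such controls are injected as literal tokens or as a learned conditioning signal is an implementation detail the summary does not specify; the point is that vocal style becomes an explicit, controllable input rather than an accident of the training data.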