kakao Dec 11, 2025

Korean and Images at Once (opens in new tab)

llm rag contrastive-learning multimodal-embedding vlm matryoshka-representation-learning gradient-caching hard-negative-mining

Kakao has developed Kanana-v-embedding, a specialized multimodal embedding model designed to bridge the gap between Korean text and visual data within a unified semantic space. By leveraging a Vision-Language Model (VLM) framework, the model enables seamless search and recommendation across various combinations of text and images, offering a significant performance boost over existing English-centric models like CLIP. This development provides a robust technical foundation for enhancing Kakao’s services, including RAG-based systems and localized content discovery.

Unified Multimodal Meaning Space

The model maps text and images into a single vector space where semantic similarity is measured via cosine similarity.
Unlike traditional CLIP models that use independent encoders, this architecture treats text and images as a single sequence, allowing for "text + image" combined queries.
It supports four primary interaction modes: Text-to-Text, Text-to-Image, Image-to-Image, and (Text+Image)-to-(Text+Image).

VLM-Based Architecture and Instruction Tuning

The system utilizes a VLM consisting of an LLM and an image encoder, extracting embeddings from the final hidden state of the [EOS] token.
It employs instruction-based query embedding, where specific prompts (e.g., "Find an image matching this caption") guide the model to generate embeddings tailored to the specific task, such as retrieval or classification.
The model is optimized for the Korean language and cultural context, addressing the limitations of previous models that struggled with non-English data.

Advanced Training for Scalability and Precision

Gradient Caching: To overcome GPU memory limitations, this technique allows the model to train with effectively large batch sizes, which is critical for the InfoNCE loss used in contrastive learning.
Matryoshka Representation Learning (MRL): The model supports flexible embedding sizes ranging from 64 to 2,048 dimensions. This allows services to choose between low-latency (smaller dimensions) or high-precision (larger dimensions) without retraining.
Hard Negative Mining: The training process incorporates "hard negatives"—items that are similar but incorrect—to sharpen the model’s ability to distinguish between subtle differences in data.

Performance Benchmarks and Efficiency

Kanana-v-embedding significantly outperforms CLIP and VLM2Vec on the KoEmbed benchmark, particularly in Korean Text-to-Image and Image-to-Text retrieval tasks.
In the M-BEIR (Multimodal Benchmark for Retrieval), the model demonstrated superior performance in multimodal document retrieval and image-to-text tasks compared to established open-source models.
Evaluation of MRL showed that the model retains high accuracy even when dimensions are reduced to 256 or 512, providing a 4x to 8x improvement in storage and search efficiency with minimal loss in quality.

For organizations looking to implement multimodal RAG or advanced recommendation systems in Korean-language environments, Kanana-v-embedding offers a highly adaptable solution. Its ability to balance computational cost and retrieval quality through Matryoshka learning makes it particularly suitable for large-scale production environments where latency is a primary concern.