kakao

Korean and Images at Once

Kakao has developed Kanana-v-embedding, a specialized multimodal embedding model designed to bridge the gap between Korean text and visual data within a unified semantic space. By leveraging a Vision-Language Model (VLM) framework, the model enables seamless search and recommendation across various combinations of text and images, offering a significant performance boost over existing English-centric models like CLIP. This development provides a robust technical foundation for enhancing Kakao’s services, including RAG-based systems and localized content discovery.

Unified Multimodal Meaning Space

  • The model maps text and images into a single vector space where semantic similarity is measured via cosine similarity.
  • Unlike traditional CLIP models that use independent encoders, this architecture treats text and images as a single sequence, allowing for "text + image" combined queries.
  • It supports four primary interaction modes: Text-to-Text, Text-to-Image, Image-to-Image, and (Text+Image)-to-(Text+Image).
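
To make the shared vector space concrete, here is a minimal sketch of ranking mixed candidates by cosine similarity. The vectors are made-up 3-dimensional stand-ins for real embeddings, and `cosine_similarity` and `rank_candidates` are hypothetical helper names, not part of any Kakao API.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_candidates(query_vec, candidates):
    """Return candidate ids sorted by descending similarity to the query."""
    scored = [(cosine_similarity(query_vec, vec), cid) for cid, vec in candidates.items()]
    scored.sort(reverse=True)
    return [cid for _, cid in scored]


# Toy vectors (invented values). Because text, images, and text+image pairs
# live in one space, they can all sit in the same candidate pool.
query = [0.9, 0.1, 0.0]  # stands in for an embedded text query
candidates = {
    "image_a": [0.8, 0.2, 0.1],
    "text_b": [0.0, 1.0, 0.0],
    "image_plus_text_c": [0.1, 0.1, 0.9],
}
ranking = rank_candidates(query, candidates)
```

In this toy setup the candidate whose vector points in nearly the same direction as the query ranks first, regardless of its modality.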

VLM-Based Architecture and Instruction Tuning

  • The system utilizes a VLM consisting of an LLM and an image encoder, extracting embeddings from the final hidden state of the [EOS] token.
  • It employs instruction-based query embedding, where specific prompts (e.g., "Find an image matching this caption") guide the model to generate embeddings tailored to the specific task, such as retrieval or classification.
  • The model is optimized for the Korean language and cultural context, addressing the limitations of previous models that struggled with non-English data.
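
The two ideas above, instruction-prefixed queries and [EOS] pooling, can be sketched in a few lines. This is only an illustration under stated assumptions: the prompt template in `build_instructed_query` is hypothetical (the actual template is not given here), the hidden states are toy numbers rather than model output, and sequences are assumed to be right-padded so that the last unmasked position holds [EOS].

```python
def build_instructed_query(instruction, query_text):
    """Prepend a task instruction to the query (this template is hypothetical)."""
    return f"Instruction: {instruction}\nQuery: {query_text}"


def eos_pool(hidden_states, attention_mask):
    """Return the hidden state at the last non-padding position.

    Assumes right-padded sequences whose final real token is [EOS].
    """
    last_real = max(i for i, keep in enumerate(attention_mask) if keep == 1)
    return hidden_states[last_real]


# Example caption in Korean: "night view of the Han River".
prompt = build_instructed_query("Find an image matching this caption", "한강의 야경")

# Toy final-layer hidden states: 4 positions x 2 dims; the last position is padding.
hidden_states = [[0.1, 0.3], [0.4, 0.2], [0.7, 0.9], [0.0, 0.0]]
attention_mask = [1, 1, 1, 0]
embedding = eos_pool(hidden_states, attention_mask)
```

Changing only the instruction string would yield a different embedding for the same query in a real model, which is what lets one network serve retrieval and classification.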

Advanced Training for Scalability and Precision

  • Gradient Caching: To work around GPU memory limits, this technique lets the model train with a large effective batch size. That matters for the InfoNCE loss used in contrastive learning, because every additional example in the batch serves as another negative.
  • Matryoshka Representation Learning (MRL): The model supports flexible embedding sizes ranging from 64 to 2,048 dimensions. This allows services to choose between low latency (smaller dimensions) and high precision (larger dimensions) without retraining.
  • Hard Negative Mining: The training process incorporates "hard negatives"—items that are similar but incorrect—to sharpen the model’s ability to distinguish between subtle differences in data.
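
The following sketch shows, for a single query, how the InfoNCE loss reacts to an easy negative versus a hard one. It assumes unit-length vectors (so a dot product equals cosine similarity) and a temperature of 0.1; both the temperature and the toy vectors are illustrative choices, not values reported for Kanana-v-embedding.

```python
import math


def info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE for one query: negative log-softmax of the positive's score."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Index 0 is the positive; the rest are negatives.
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, neg) / temperature for neg in negatives]

    # Numerically stable log-sum-exp over all candidates.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]


# Unit-length toy vectors (invented values).
query = [1.0, 0.0]
positive = [0.8, 0.6]
easy_negative = [-1.0, 0.0]  # points away from the query
hard_negative = [0.6, 0.8]   # almost as close to the query as the positive

loss_easy = info_nce(query, positive, [easy_negative])
loss_hard = info_nce(query, positive, [hard_negative])
```

The easy negative contributes almost nothing to the loss, while the hard negative produces a clearly larger value, so it is the hard cases that supply most of the training signal.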

Performance Benchmarks and Efficiency

  • Kanana-v-embedding significantly outperforms CLIP and VLM2Vec on the KoEmbed benchmark, particularly in Korean Text-to-Image and Image-to-Text retrieval tasks.
  • On M-BEIR (Multimodal BEnchmark for Instructed Retrieval), the model outperformed established open-source models in multimodal document retrieval and image-to-text tasks.
  • Evaluation of MRL showed that the model retains high accuracy even when embeddings are reduced to 512 or 256 dimensions. Relative to the full 2,048 dimensions, that is a 4x to 8x reduction in storage and search cost with minimal loss in quality.
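
The 4x and 8x figures follow directly from the dimension ratio, as this small sketch shows. It assumes the usual way Matryoshka embeddings are consumed (keep the leading components, then re-normalize to unit length) and 4-byte float32 storage; both are assumptions for illustration, and the input vector is a placeholder rather than real model output.

```python
import math


def truncate_embedding(vec, dim):
    """Keep the first `dim` components and rescale to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]


def storage_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw index size, assuming float32 values (4 bytes each)."""
    return num_vectors * dim * bytes_per_value


FULL_DIM = 2048
# Placeholder vector standing in for a full-size embedding.
full = [math.sin(i + 1) for i in range(FULL_DIM)]
small = truncate_embedding(full, 256)

num_items = 1_000_000  # hypothetical catalog size
ratio_512 = storage_bytes(num_items, FULL_DIM) / storage_bytes(num_items, 512)
ratio_256 = storage_bytes(num_items, FULL_DIM) / storage_bytes(num_items, 256)
```

Under these assumptions a one-million-item index shrinks from about 8.2 GB at 2,048 dimensions to about 1.0 GB at 256, and brute-force similarity cost drops by the same factor.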

For organizations looking to implement multimodal RAG or advanced recommendation systems in Korean-language environments, Kanana-v-embedding offers a highly adaptable solution. Its ability to balance computational cost and retrieval quality through Matryoshka learning makes it particularly suitable for large-scale production environments where latency is a primary concern.