How does long-term user modeling change in a "local" super app?

Hello! We're Hawke and Ben.Kim, ML Engineers on the ML Foundation team at Daangn (당근). Our team builds the foundational technology for improving personalized recommendations. In this post, we share our journey of training a Transformer on users' long-term behavior logs to produce user embeddings, applying them to recommendation models such as the home feed and ads, and achieving large gains in online metrics. Why is long-term user modeling needed…
Kakao has developed Kanana-v-embedding, a specialized multimodal embedding model designed to bridge the gap between Korean text and visual data within a unified semantic space. By leveraging a Vision-Language Model (VLM) framework, the model enables seamless search and recommendation across various combinations of text and images, offering a significant performance boost over existing English-centric models like CLIP. This development provides a robust technical foundation for enhancing Kakao’s services, including RAG-based systems and localized content discovery.
### Unified Multimodal Semantic Space
* The model maps text and images into a single vector space where semantic similarity is measured via cosine similarity.
* Unlike traditional CLIP models that use independent encoders, this architecture treats text and images as a single sequence, allowing for "text + image" combined queries.
* It supports four primary interaction modes: Text-to-Text, Text-to-Image, Image-to-Image, and (Text+Image)-to-(Text+Image).
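The similarity computation at the core of this shared space can be sketched in a few lines. The random vectors below are stand-ins for real model outputs; only the 2,048-dimension width comes from the article, everything else is illustrative.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings: in the real model these would come from
# encoding a Korean caption and an image into the same vector space.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=2048)
image_emb = rng.normal(size=2048)

sim = cosine_similarity(text_emb, image_emb)
```

Because every modality lands in the same space, the same `cosine_similarity` call serves all four interaction modes listed above.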
### VLM-Based Architecture and Instruction Tuning
* The system utilizes a VLM consisting of an LLM and an image encoder, extracting embeddings from the final hidden state of the [EOS] token.
* It employs instruction-based query embedding, where specific prompts (e.g., "Find an image matching this caption") guide the model to generate embeddings tailored to the specific task, such as retrieval or classification.
* The model is optimized for the Korean language and cultural context, addressing the limitations of previous models that struggled with non-English data.
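The [EOS]-token pooling step described above can be sketched as follows. The array shapes and the `eos_pooling` helper are illustrative assumptions, not the model's actual API; the idea is simply to grab the final hidden state at the last real token of each sequence.

```python
import numpy as np

def eos_pooling(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Pool the final hidden state at the last non-padding token, which is
    where the [EOS] token sits in a left-to-right LLM.
    hidden_states: (batch, seq_len, dim); attention_mask: (batch, seq_len)."""
    last_idx = attention_mask.sum(axis=1) - 1        # [EOS] position per example
    batch_idx = np.arange(hidden_states.shape[0])
    emb = hidden_states[batch_idx, last_idx]         # (batch, dim)
    # L2-normalize so cosine similarity reduces to a dot product
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)

# Toy batch with padding (dims are illustrative, not the model's)
h = np.random.default_rng(1).normal(size=(2, 5, 8))
mask = np.array([[1, 1, 1, 0, 0],
                 [1, 1, 1, 1, 1]])
q = eos_pooling(h, mask)
```

The instruction prefix (e.g. "Find an image matching this caption") would be prepended to the input before the forward pass, so the same pooling produces task-specific embeddings.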
### Advanced Training for Scalability and Precision
* **Gradient Caching:** This technique splits the encoder's forward and backward passes into memory-friendly chunks while still computing the contrastive loss over the full batch, letting the model train with effectively large batch sizes despite GPU memory limits. Large batches are critical for the InfoNCE loss used in contrastive learning, since every other item in the batch serves as a negative.
* **Matryoshka Representation Learning (MRL):** The model supports flexible embedding sizes ranging from 64 to 2,048 dimensions. This allows services to choose between low-latency (smaller dimensions) or high-precision (larger dimensions) without retraining.
* **Hard Negative Mining:** The training process incorporates "hard negatives"—items that are similar but incorrect—to sharpen the model’s ability to distinguish between subtle differences in data.
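The in-batch InfoNCE objective that motivates the large batch sizes can be sketched as follows; the temperature value is a common default for contrastive training, not a figure from the Kanana work.

```python
import numpy as np

def info_nce_loss(q: np.ndarray, d: np.ndarray, temperature: float = 0.05) -> float:
    """In-batch InfoNCE: query q[i] should match document d[i]; every other
    d[j] in the batch acts as a negative, so bigger batches give a harder,
    more informative training signal. q, d: (batch, dim), L2-normalized."""
    logits = (q @ d.T) / temperature                       # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)            # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))             # targets on the diagonal

# Perfectly aligned pairs give near-zero loss; shuffled pairs give a large loss.
q = np.eye(4)
aligned = info_nce_loss(q, q)
shuffled = info_nce_loss(q, np.roll(q, 1, axis=0))
```

Hard negatives would be appended as extra columns of the similarity matrix, sharpening the distinction between near-duplicates.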
### Performance Benchmarks and Efficiency
* Kanana-v-embedding significantly outperforms CLIP and VLM2Vec on the KoEmbed benchmark, particularly in Korean Text-to-Image and Image-to-Text retrieval tasks.
* In the M-BEIR benchmark (Multimodal BEnchmark for Instruction Retrieval), the model demonstrated superior performance in multimodal document retrieval and image-to-text tasks compared to established open-source models.
* Evaluation of MRL showed that the model retains high accuracy even when dimensions are reduced to 256 or 512, providing a 4x to 8x improvement in storage and search efficiency with minimal loss in quality.
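At inference time, Matryoshka-style dimension reduction amounts to truncating and re-normalizing; the sketch below assumes 2,048-dimensional unit-norm embeddings cut down to 256 dimensions, the 8x storage saving mentioned above.

```python
import numpy as np

def truncate_embedding(emb: np.ndarray, dim: int) -> np.ndarray:
    """Matryoshka-style truncation: keep the first `dim` coordinates and
    re-normalize, so cosine similarity stays meaningful at the smaller size."""
    small = emb[..., :dim]
    return small / np.linalg.norm(small, axis=-1, keepdims=True)

# Full-size embeddings (toy values), reduced for cheaper storage and search
full = np.random.default_rng(2).normal(size=(10, 2048))
full /= np.linalg.norm(full, axis=1, keepdims=True)
small = truncate_embedding(full, 256)
```

Because MRL trains the leading coordinates to carry the most information, no re-encoding is needed when a service later switches between the low-latency and high-precision settings.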
For organizations looking to implement multimodal RAG or advanced recommendation systems in Korean-language environments, Kanana-v-embedding offers a highly adaptable solution. Its ability to balance computational cost and retrieval quality through Matryoshka learning makes it particularly suitable for large-scale production environments where latency is a primary concern.
SensorLM is a new family of foundation models designed to bridge the gap between high-dimensional wearable sensor data and natural language descriptions. By training on a massive dataset of nearly 60 million hours of de-identified health data, the models learn to interpret complex physiological signals to provide meaningful context for human activities. This research demonstrates that integrating multimodal sensor signals with language models enables sophisticated health insights, such as zero-shot activity recognition and automated health captioning, that significantly outperform general-purpose large language models.
## Dataset Scale and Automated Annotation
* The models were pre-trained on an unprecedented 59.7 million hours of multimodal sensor data collected from over 103,000 individuals across 127 countries.
* To overcome the high cost of manual annotation, researchers developed a hierarchical pipeline that automatically generates text descriptions by calculating statistics and identifying trends within the raw sensor streams.
* Data was sourced from Fitbit and Pixel Watch devices, representing nearly 2.5 million person-days of activity and health information.
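A toy version of the statistics-then-trends annotation step might look like the following; the `caption_from_sensor` helper and its wording are invented for illustration and are far simpler than the paper's hierarchical pipeline.

```python
import numpy as np

def caption_from_sensor(heart_rate: np.ndarray, steps: np.ndarray) -> str:
    """Auto-generate a text description from raw sensor streams by computing
    summary statistics and a simple trend, with no human annotator involved."""
    hr_mean = heart_rate.mean()
    trend = "rising" if heart_rate[-1] > heart_rate[0] else "falling"
    total_steps = int(steps.sum())
    return (f"Mean heart rate {hr_mean:.0f} bpm with a {trend} trend; "
            f"{total_steps} steps recorded.")

hr = np.array([72, 75, 80, 95, 110, 118])   # toy heart-rate samples (bpm)
st = np.array([0, 40, 120, 160, 150, 140])  # toy step counts per window
caption = caption_from_sensor(hr, st)
# → "Mean heart rate 92 bpm with a rising trend; 610 steps recorded."
```

Scaling this idea to 59.7 million hours of data is what makes training without manual labels feasible.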
## Hybrid Training Architecture
* SensorLM unifies two primary multimodal strategies: contrastive learning and generative pre-training.
* Through contrastive learning, the model learns to discriminate between different states—such as a "light swim" versus a "strength workout"—by matching sensor segments to corresponding text descriptions.
* The generative component allows the model to "speak" for the sensors, producing nuanced, context-aware natural language captions directly from high-dimensional biometric signals.
## Activity Recognition and Cross-Modal Capabilities
* The model demonstrates state-of-the-art performance in zero-shot human activity recognition, accurately classifying 20 different activities without any specific fine-tuning.
* Its few-shot learning capabilities allow the model to adapt to new tasks or individual user patterns with only a handful of examples.
* SensorLM facilitates cross-modal retrieval, enabling users or experts to find specific sensor patterns using natural language queries or to generate descriptions based on specific sensor inputs.
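In a shared sensor-text embedding space, zero-shot recognition reduces to nearest-neighbor search over activity descriptions; the label list and toy embeddings below are placeholders for SensorLM's actual encoders.

```python
import numpy as np

def zero_shot_classify(sensor_emb: np.ndarray,
                       label_embs: np.ndarray,
                       labels: list[str]) -> str:
    """Pick the activity whose text embedding is closest (by cosine
    similarity) to the sensor-segment embedding; no fine-tuning needed."""
    s = sensor_emb / np.linalg.norm(sensor_emb)
    L = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    return labels[int(np.argmax(L @ s))]

labels = ["light swim", "strength workout", "outdoor run"]
label_embs = np.eye(3, 8)                                # toy text embeddings
sensor_emb = np.array([0.1, 0.9, 0.0, 0, 0, 0, 0, 0])   # nearest to labels[1]
predicted = zero_shot_classify(sensor_emb, label_embs, labels)
```

The same nearest-neighbor machinery, run in the opposite direction, supports the cross-modal retrieval described above: a natural-language query retrieves the sensor segments whose embeddings land closest to it.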
## Generative Health Captioning
* Beyond simple classification, the model can generate hierarchical captions that describe the statistical, structural, and semantic dimensions of a user’s data.
* Experimental results using metrics like BERTScore show that SensorLM produces captions that are more factually correct and coherent than those created by powerful non-specialist LLMs.
* This capability allows for the translation of abstract data points, such as heart rate variability or step counts, into readable summaries that explain the "why" behind physiological changes.
By providing a framework where wearable data can be understood through the lens of human language, SensorLM paves the way for more intuitive and personalized health monitoring. This technology holds the potential to transform raw biometric streams into actionable insights, helping users better understand the relationship between their activities and their overall physical well-being.