Daangn transitioned its user behavior log management from a manual, code-based Git workflow to a centralized UI platform called Event Center to improve data consistency and operational efficiency. By automating schema creation and enforcing standardized naming conventions, the platform reduced the technical barriers for developers and analysts while ensuring high data quality for downstream analysis. This transition has streamlined the entire data lifecycle, from collection in the mobile app to structured storage in BigQuery.
### Challenges of Code-Based Schema Management
Prior to Event Center, Daangn managed its event schemas—definitions that describe the ownership, domain, and custom parameters of a log—using Git and manual JSON files. This approach created several bottlenecks for the engineering team:
* **High Entry Barrier**: Users were required to write complex Spark `StructType` JSON files, which involved managing nested structures and specific metadata fields like `nullable` and `type`.
* **Inconsistent Naming**: Without a central enforcement mechanism, event names followed different patterns (e.g., `item_click` vs. `click_item`), making it difficult for analysts to discover relevant data.
* **Operational Friction**: Every schema change required a Pull Request (PR), manual review by the data team, and a series of CI checks, leading to slow iteration cycles and frequent communication overhead.
### The User Behavior Log Pipeline
To support data-driven decision-making, Daangn runs a pipeline that processes millions of events daily through three stages:
* **Collection and Validation**: Events are sent from the mobile SDK to an event server, which performs initial validation before passing data to GCP Pub/Sub.
* **Streaming Processing**: GCP Dataflow handles real-time deduplication, field validation, and data transformation (flattening) to prepare logs for storage.
* **Storage and Accessibility**: Data is stored in Google Cloud Storage and BigQuery, where custom parameters defined in the schema are automatically expanded into searchable columns, removing the need for complex JSON parsing in SQL.
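The deduplication and flattening steps above can be sketched as follows. This is not Daangn's Dataflow code; the key-joining scheme (underscore-separated paths) is an assumption chosen to mirror how nested parameters become searchable columns.

```go
package main

import "fmt"

// event is a simplified log record: a unique ID plus nested parameters.
type event struct {
	ID     string
	Params map[string]any
}

// flatten recursively turns nested params into flat, column-style keys
// (e.g. user -> id becomes "user_id"). The separator is an illustrative
// assumption, not Daangn's actual column naming.
func flatten(prefix string, in map[string]any, out map[string]any) {
	for k, v := range in {
		key := k
		if prefix != "" {
			key = prefix + "_" + k
		}
		if nested, ok := v.(map[string]any); ok {
			flatten(key, nested, out)
		} else {
			out[key] = v
		}
	}
}

// dedupe drops events whose ID has already been seen in this window.
func dedupe(events []event) []event {
	seen := map[string]bool{}
	var out []event
	for _, e := range events {
		if seen[e.ID] {
			continue
		}
		seen[e.ID] = true
		out = append(out, e)
	}
	return out
}

func main() {
	evts := dedupe([]event{
		{ID: "a1", Params: map[string]any{"user": map[string]any{"id": "u7"}}},
		{ID: "a1", Params: map[string]any{"user": map[string]any{"id": "u7"}}}, // duplicate delivery
	})
	cols := map[string]any{}
	flatten("", evts[0].Params, cols)
	fmt.Println(len(evts), cols["user_id"]) // duplicate removed; nested param expanded
}
```

The payoff of the flattening step is on the query side: analysts can filter on `user_id` directly in BigQuery instead of calling JSON extraction functions on a raw payload column.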
### Standardizing Discovery via Event Center
The Event Center platform was designed to transform log management into a user-friendly, UI-driven experience while maintaining technical rigor.
* **Standardized Naming Conventions**: The platform enforces a strict "Action-Object-Service" naming rule, ensuring that all events are categorized logically across the entire organization.
* **Recursive Schema Builder**: To handle the complexity of nested JSON data, the team built a UI component that uses a recursive tree structure, allowing users to define deep data hierarchies without writing code.
* **Centralized Dictionary**: The platform serves as a "single source of truth" where any employee can search for events, view their descriptions, and identify the team responsible for specific data points.
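A naming rule like "Action-Object-Service" is only useful if it is enforced mechanically. The sketch below shows one way to do that; the exact segment rules (three snake_case parts, lowercase letters only) are assumptions for illustration, since the post's summary does not spell out the full convention.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// segment matches one lowercase, letters-only name segment.
var segment = regexp.MustCompile(`^[a-z]+$`)

// validateEventName enforces a three-part action_object_service name.
// Segment count and character rules here are illustrative assumptions.
func validateEventName(name string) error {
	parts := strings.Split(name, "_")
	if len(parts) != 3 {
		return fmt.Errorf("%q: want action_object_service (3 segments), got %d", name, len(parts))
	}
	for _, p := range parts {
		if !segment.MatchString(p) {
			return fmt.Errorf("%q: segment %q must be lowercase letters", name, p)
		}
	}
	return nil
}

func main() {
	for _, n := range []string{"click_item_home", "item_click", "Click_Item_Home"} {
		fmt.Printf("%-16s valid=%v\n", n, validateEventName(n) == nil)
	}
}
```

Running a check like this at save time is what turns a written convention into a guarantee: `item_click` is rejected before it can ever diverge from `click_item_home` in the warehouse.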
### Technical Implementation and Integration
The system architecture was built to bridge the gap between a modern web UI and the existing Git-based infrastructure.
* **Tech Stack**: The backend is powered by Go (Gin framework) and PostgreSQL (GORM), while the frontend utilizes React, TypeScript, and TanStack Query for state management.
* **Automated Git Sync**: When a user saves a schema in Event Center, the system automatically triggers a GitHub Action that generates the necessary JSON files and pushes them to the repository, keeping Git as the system of record for the pipeline configuration while hiding that complexity from users.
* **Real-time Validation**: The UI provides immediate feedback on data types and naming errors, preventing invalid schemas from reaching the production pipeline.
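The bridge between the recursive UI tree and the Git-synced JSON files can be sketched as a tree-to-schema conversion. The `node` type and its field names are hypothetical stand-ins for whatever the Event Center backend actually stores; the output shape follows Spark's StructType JSON format.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// node is the recursive tree a UI schema builder might hold: leaves
// carry a primitive type, inner nodes carry children. The struct and
// its field names are illustrative assumptions, not Daangn's model.
type node struct {
	Name     string
	Type     string // "string", "long", ... or "struct"
	Nullable bool
	Children []node
}

// toSparkField converts one tree node into the Spark StructType JSON
// shape ({"name","type","nullable","metadata"}), recursing into the
// children of struct-typed nodes.
func toSparkField(n node) map[string]any {
	var typ any = n.Type
	if n.Type == "struct" {
		fields := make([]map[string]any, 0, len(n.Children))
		for _, c := range n.Children {
			fields = append(fields, toSparkField(c))
		}
		typ = map[string]any{"type": "struct", "fields": fields}
	}
	return map[string]any{
		"name":     n.Name,
		"type":     typ,
		"nullable": n.Nullable,
		"metadata": map[string]any{},
	}
}

func main() {
	tree := node{Name: "params", Type: "struct", Nullable: true, Children: []node{
		{Name: "screen_name", Type: "string", Nullable: true},
	}}
	out, _ := json.Marshal(toSparkField(tree))
	fmt.Println(string(out))
}
```

With generation centralized like this, the JSON files that the GitHub Action commits are a deterministic function of what the user built in the UI, which is what makes it safe to keep Git in the loop without anyone hand-editing schema files.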
Implementing a dedicated log management platform like Event Center is highly recommended for organizations scaling their data operations. Moving away from manual file management to a UI-based system not only reduces the risk of human error but also democratizes data access by allowing non-engineers to define and discover the logs they need for analysis.