Daangn / data-engineering

2 posts

daangn

Daangn's user behavior log management platform: building Event Center. Why we turned user behavior logs managed in code into a platform

Daangn transitioned its user behavior log management from a manual, code-based Git workflow to a centralized UI platform called Event Center to improve data consistency and operational efficiency. By automating schema creation and enforcing standardized naming conventions, the platform reduced the technical barriers for developers and analysts while ensuring high data quality for downstream analysis. This transition has streamlined the entire data lifecycle, from collection in the mobile app to structured storage in BigQuery.

### Challenges of Code-Based Schema Management

Prior to Event Center, Daangn managed its event schemas (definitions that describe the ownership, domain, and custom parameters of a log) using Git and manually written JSON files. This approach created several bottlenecks for the engineering team:

* **High Entry Barrier**: Users were required to write complex Spark `StructType` JSON files, which involved managing nested structures and specific metadata fields like `nullable` and `type`.
* **Inconsistent Naming**: Without a central enforcement mechanism, event names followed different patterns (e.g., `item_click` vs. `click_item`), making it difficult for analysts to discover relevant data.
* **Operational Friction**: Every schema change required a pull request, manual review by the data team, and a series of CI checks, leading to slow iteration cycles and frequent communication overhead.
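To make the first pain point concrete, here is a minimal sketch of the kind of nested Spark `StructType` definition contributors previously had to author as JSON by hand. The event and field names are invented for illustration and are not Daangn's actual schema.

```python
from pyspark.sql.types import (
    ArrayType, LongType, StringType, StructField, StructType,
)

# Illustrative only: the field names are invented, but the shape mirrors what the
# post describes -- nested structures plus per-field `type` and `nullable` metadata.
event_schema = StructType([
    StructField("event_name", StringType(), nullable=False),
    StructField("user_id", LongType(), nullable=False),
    StructField("params", StructType([          # custom parameters for this event
        StructField("item_id", LongType(), nullable=True),
        StructField("tags", ArrayType(StringType()), nullable=True),
    ]), nullable=True),
])

# The JSON document below is what previously had to be written and reviewed by
# hand in the Git repository for every new or changed event.
print(event_schema.json())
```

Even this small example shows why hand-editing such files was error-prone: every level of nesting multiplies the amount of boilerplate that reviewers had to check.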
### The User Behavior Log Pipeline

To support data-driven decision-making, Daangn employs a robust pipeline that processes millions of events daily through several critical stages:

* **Collection and Validation**: Events are sent from the mobile SDK to an event server, which performs initial validation before passing data to GCP Pub/Sub.
* **Streaming Processing**: GCP Dataflow handles real-time deduplication, field validation, and data transformation (flattening) to prepare logs for storage.
* **Storage and Accessibility**: Data is stored in Google Cloud Storage and BigQuery, where custom parameters defined in the schema are automatically expanded into searchable columns, removing the need for complex JSON parsing in SQL.

### Standardizing Discovery via Event Center

The Event Center platform was designed to transform log management into a user-friendly, UI-driven experience while maintaining technical rigor.

* **Standardized Naming Conventions**: The platform enforces a strict "Action-Object-Service" naming rule, ensuring that all events are categorized logically across the entire organization.
* **Recursive Schema Builder**: To handle the complexity of nested JSON data, the team built a UI component that uses a recursive tree structure, allowing users to define deep data hierarchies without writing code.
* **Centralized Dictionary**: The platform serves as a "single source of truth" where any employee can search for events, view their descriptions, and identify the team responsible for specific data points.

### Technical Implementation and Integration

The system architecture was built to bridge the gap between a modern web UI and the existing Git-based infrastructure.

* **Tech Stack**: The backend is powered by Go (Gin framework) and PostgreSQL (GORM), while the frontend uses React, TypeScript, and TanStack Query for state management.
* **Automated Git Sync**: When a user saves a schema in Event Center, the system automatically triggers a GitHub Action that generates the necessary JSON files and pushes them to the repository, keeping the codebase as the ultimate source of truth while abstracting away the complexity (a rough sketch of this generation step follows at the end of this summary).
* **Real-time Validation**: The UI provides immediate feedback on data types and naming errors, preventing invalid schemas from reaching the production pipeline.

Implementing a dedicated log management platform like Event Center is highly recommended for organizations scaling their data operations. Moving away from manual file management to a UI-based system not only reduces the risk of human error but also democratizes data access by allowing non-engineers to define and discover the logs they need for analysis.
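As a rough illustration of how the recursive schema builder and the automated Git sync could fit together, the sketch below renders a tree of UI-defined fields into Spark `StructType`-style JSON. Daangn's actual implementation is a Go backend plus a GitHub Action; this Python version, including the `SchemaNode` type and function names, is purely hypothetical.

```python
import json
from dataclasses import dataclass, field
from typing import List

# Hypothetical sketch: Daangn's real backend is written in Go and the JSON files
# are generated inside a GitHub Action. This only illustrates how a recursive
# tree of UI-defined fields could be flattened into StructType-style JSON.

@dataclass
class SchemaNode:
    name: str
    type: str                                   # e.g. "string", "long", "struct"
    nullable: bool = True
    children: List["SchemaNode"] = field(default_factory=list)

def to_spark_field(node: SchemaNode) -> dict:
    """Recursively convert one UI tree node into a StructField JSON object."""
    if node.type == "struct":
        dtype = {"type": "struct",
                 "fields": [to_spark_field(c) for c in node.children]}
    else:
        dtype = node.type
    return {"name": node.name, "type": dtype,
            "nullable": node.nullable, "metadata": {}}

def to_schema_json(root_fields: List[SchemaNode]) -> str:
    """Produce the file content a sync job could push back to the Git repository."""
    schema = {"type": "struct",
              "fields": [to_spark_field(f) for f in root_fields]}
    return json.dumps(schema, indent=2)

# Example: a "click_item" event (Action-Object naming) with one nested parameter.
print(to_schema_json([
    SchemaNode("item_id", "long", nullable=False),
    SchemaNode("context", "struct", children=[SchemaNode("screen", "string")]),
]))
```

The point of the design is that users only ever touch the tree in the UI, while the repository keeps receiving the same machine-generated JSON it always consumed.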

daangn

Why did Daangn build User Activation as a company-wide common data layer?

Daangn transitioned from manually calculating user activation metrics to a centralized "Activation Layer" built on DBT to solve inconsistencies and high operational overhead. By standardizing the definitions of user states and transitions, the team provides a reliable foundation for analyzing why active user counts fluctuate rather than just reporting the final numbers. This common data layer improves data reliability and cost efficiency while allowing various teams to reuse the same logic for different core user behaviors.

### The Role of User Activation Analysis

* While Active User counts show "what" happened, User Activation explains "why" by breaking users down into specific categories.
* The system tracks **Activation States**, classifying users as New, Retained, Reactivated, or Inactive at any given time.
* It monitors **State Transitions** to identify how users move between categories, such as "New to Retained" or "Reactivated to Inactive."
* The layer provides granular behavioral metadata, including continuous activity streaks, the interval between visits, and the duration of churned periods.

### Ensuring Reliability via Fact Models

* Raw event logs are often tied to specific UI elements and contain "noise" that makes them unreliable for direct activation analysis.
* To ensure consistency, the Activation Layer uses **Fact Models** as its primary input: refined datasets where business logic and core behaviors are already defined.
* A strict naming convention (`fact_name_activation_time_grain`) is enforced so that users can immediately identify which specific behavior is being analyzed.
* This structure ensures that "Active" status is interpreted identically across the entire organization, regardless of which team is performing the analysis.

### Incremental Processing for Cost Efficiency

* Recomputing the entire history of user activity every day is computationally expensive and leads to high cloud infrastructure costs.
* The architecture uses a **FirstLast model** to store only the essential metadata for each user: the date of their very first activity and their most recent activity.
* By joining daily activity logs with this lightweight FirstLast table, the system calculates new states and transitions incrementally (a simplified sketch of this step follows at the end of this summary).
* This approach maintains idempotency and ensures high performance even as the volume of user interaction data grows.

### Scaling with DBT Macros

* To support various metrics, such as app visits, item sales, or community posts, the team encapsulated the complex transition logic in **DBT macros**.
* This abstraction lets data engineers generate a new activation model by simply specifying the source Fact model and the desired time grain (daily, weekly, or monthly).
* Centralizing the logic in macros ensures that any bug fixes or improvements to the activation calculation are automatically reflected across all related data models.
* The standardized output format allows for universal dashboards and analysis templates that work for any tracked behavior.

Centralizing User Activation logic into a common data layer allows organizations to move beyond surface-level vanity metrics and gain deep, actionable behavioral insights. By combining DBT's macro capabilities with incremental modeling, teams can maintain high data quality and operational efficiency even as the variety of tracked user behaviors expands.
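The incremental FirstLast join is the core trick, so here is a simplified sketch of that step. Daangn implements this as DBT models and macros in SQL; the Python below, including the 28-day inactivity window and the exact boundary rules, is an assumption used only to illustrate the state logic at a daily grain.

```python
from datetime import date, timedelta
from typing import NamedTuple, Optional, Tuple

# Assumed threshold for illustration only; the post does not state Daangn's
# actual inactivity window.
INACTIVITY_WINDOW = timedelta(days=28)

class FirstLast(NamedTuple):
    """Lightweight per-user metadata carried between incremental runs."""
    first_active: Optional[date]    # date of the very first activity
    last_active: Optional[date]     # date of the most recent activity

def daily_state(today: date, active_today: bool, fl: FirstLast) -> str:
    """Classify one user for `today` without rescanning the full activity history."""
    if active_today:
        if fl.first_active is None:
            return "new"                                  # first ever activity
        if fl.last_active and today - fl.last_active > INACTIVITY_WINDOW:
            return "reactivated"                          # back after a churned period
        return "retained"
    if fl.last_active is None or today - fl.last_active > INACTIVITY_WINDOW:
        return "inactive"
    return "retained"                                     # idle today, still in window (assumed rule)

def advance(today: date, active_today: bool, fl: FirstLast) -> Tuple[str, FirstLast]:
    """One incremental step: emit today's state and the updated FirstLast row.
    Transitions such as "new to retained" or "reactivated to inactive" fall out
    of comparing the states emitted on consecutive runs."""
    state = daily_state(today, active_today, fl)
    if active_today:
        fl = FirstLast(fl.first_active or today, today)
    return state, fl

# Example: a brand-new user on day one who comes back the next day.
s1, fl = advance(date(2024, 6, 1), True, FirstLast(None, None))
s2, fl = advance(date(2024, 6, 2), True, fl)
print(s1, s2)   # new retained
```

In the actual layer, the same idea is parameterized through DBT macros that accept the source Fact model and time grain, so every tracked behavior reuses one implementation of this logic.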