Daangn / bigquery

5 posts

daangn

Daangn's User Behavior Log Management Platform: Building Event Center. Why We Turned Code-Managed User Behavior Logs into a Platform

Daangn transitioned its user behavior log management from a manual, code-based Git workflow to a centralized UI platform called Event Center to improve data consistency and operational efficiency. By automating schema creation and enforcing standardized naming conventions, the platform lowered the technical barrier for developers and analysts while ensuring high data quality for downstream analysis. The transition has streamlined the entire data lifecycle, from collection in the mobile app to structured storage in BigQuery.

### Challenges of Code-Based Schema Management

Prior to Event Center, Daangn managed its event schemas (definitions describing the ownership, domain, and custom parameters of a log) with Git and hand-written JSON files. This approach created several bottlenecks for the engineering team:

* **High Entry Barrier**: Users had to write complex Spark `StructType` JSON files, managing nested structures and metadata fields such as `nullable` and `type` (see the `StructType` sketch below).
* **Inconsistent Naming**: Without central enforcement, event names followed different patterns (e.g., `item_click` vs. `click_item`), making it difficult for analysts to discover relevant data.
* **Operational Friction**: Every schema change required a Pull Request (PR), manual review by the data team, and a series of CI checks, leading to slow iteration cycles and frequent communication overhead.

### The User Behavior Log Pipeline

To support data-driven decision-making, Daangn runs a pipeline that processes millions of events daily through several stages:

* **Collection and Validation**: Events are sent from the mobile SDK to an event server, which performs initial validation before passing data to GCP Pub/Sub.
* **Streaming Processing**: GCP Dataflow handles real-time deduplication, field validation, and data transformation (flattening) to prepare logs for storage.
* **Storage and Accessibility**: Data lands in Google Cloud Storage and BigQuery, where custom parameters defined in the schema are automatically expanded into searchable columns, removing the need for complex JSON parsing in SQL.

### Standardizing Discovery via Event Center

Event Center was designed to turn log management into a user-friendly, UI-driven experience while maintaining technical rigor.

* **Standardized Naming Conventions**: The platform enforces a strict "Action-Object-Service" naming rule, ensuring that all events are categorized consistently across the organization.
* **Recursive Schema Builder**: To handle nested JSON data, the team built a UI component based on a recursive tree structure, allowing users to define deep data hierarchies without writing code.
* **Centralized Dictionary**: The platform serves as a single source of truth where any employee can search for events, view their descriptions, and identify the team responsible for specific data points.
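For context, here is a minimal sketch of the kind of Spark `StructType` JSON schema users previously had to write by hand; the field names are hypothetical, not Daangn's actual event schema.

```python
# A hypothetical event schema expressed as a Spark StructType; serializing
# it with .json() yields the nested JSON format that was checked into Git.
from pyspark.sql.types import LongType, StringType, StructField, StructType

schema = StructType([
    StructField("event_name", StringType(), nullable=False),
    StructField("params", StructType([  # nested custom parameters
        StructField("item_id", LongType(), nullable=True),
        StructField("screen_name", StringType(), nullable=True),
    ]), nullable=True),
])

# Each StructField carries the `type` and `nullable` metadata mentioned above.
print(schema.json())
```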
### Technical Implementation and Integration

The system architecture bridges a modern web UI and the existing Git-based infrastructure.

* **Tech Stack**: The backend is built with Go (Gin framework) and PostgreSQL (GORM); the frontend uses React, TypeScript, and TanStack Query for state management.
* **Automated Git Sync**: When a user saves a schema in Event Center, the system triggers a GitHub Action that generates the necessary JSON files and pushes them to the repository, keeping the codebase the ultimate source of truth while hiding the complexity (see the dispatch sketch after this list).
* **Real-time Validation**: The UI gives immediate feedback on data types and naming errors, preventing invalid schemas from reaching the production pipeline.

Implementing a dedicated log management platform like Event Center is highly recommended for organizations scaling their data operations. Moving from manual file management to a UI-based system not only reduces the risk of human error but also democratizes data access by letting non-engineers define and discover the logs they need for analysis.
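As a rough illustration of the Git sync step, this hedged sketch fires a `workflow_dispatch` event through GitHub's REST API after a schema is saved. The article's backend is Go; Python is used here for consistency with the other sketches, and the organization, repository, workflow file, and inputs are all placeholders.

```python
import requests


def trigger_schema_sync(event_name: str, schema_json: str, token: str) -> None:
    """Ask CI to regenerate and commit the schema JSON for one event."""
    resp = requests.post(
        # Hypothetical repo and workflow file; the real names are not public.
        "https://api.github.com/repos/example-org/event-schemas"
        "/actions/workflows/sync-schema.yml/dispatches",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        # workflow_dispatch inputs must be strings.
        json={"ref": "main",
              "inputs": {"event_name": event_name, "schema": schema_json}},
        timeout=10,
    )
    resp.raise_for_status()
```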

daangn

Daangn Pay's Journey to an AI-Powered FDS: From Building a Rule Engine to Applying LLMs

Daangn Pay has evolved its Fraud Detection System (FDS) from a traditional rule-based architecture to an AI-powered framework to better protect user assets and combat evolving financial scams. By implementing a modular rule engine and integrating Large Language Models (LLMs), the platform has significantly reduced manual review times and improved its response to emerging fraud trends. This transition allows for consistent, context-aware risk assessment while maintaining compliance with strict financial regulations.

### Modular Rule Engine Architecture

* The system is built on a "Lego-like" structure with three components: Conditions (basic units such as account age or transfer frequency), Rules (logical combinations of conditions), and Policies (groups of rules with specific sanction levels); see the sketch after this entry.
* This modularity lets non-developers adjust thresholds in real time, such as changing a "30-day membership" requirement to "70 days", to respond to sudden shifts in fraud patterns.
* Data flows through two distinct paths: a synchronous API for immediate blocking decisions (e.g., during a live transfer) and an asynchronous stream for high-volume, real-time monitoring where slight latency is acceptable.

### Risk Evaluation and Post-Processing

* Events pass through a structured pipeline beginning with ingestion, followed by multi-layered evaluation in the rule engine to determine the final risk score.
* The post-processing phase incorporates LLM analysis to evaluate behavioral context, which is then used to trigger alerts for human operators or apply automated user sanctions.
* The engine led to a measurable decrease in information requests from financial and investigative authorities, indicating a higher rate of internal prevention.

### LLM Integration for Contextual Analysis

* To solve the inconsistency and time lag of manual reviews, which previously took between 5 and 20 minutes per case, Daangn Pay integrated Claude 3.5 Sonnet via AWS Bedrock.
* The system satisfies strict financial network-isolation regulations through an "Innovative Financial Service" designation, allowing the use of cloud-based generative AI within a regulated environment.
* A specialized data collector pulls fraud history from BigQuery into a Redis cache to build structured, multi-step prompts for the LLM.
* The AI returns evaluations in a structured JSON format (see the Bedrock sketch below), assessing whether a transaction is fraudulent against specific criteria and providing the reasoning behind the decision.

The combination of a flexible rule-based foundation and context-aware LLM analysis demonstrates how fintech companies can scale security operations. For organizations facing high-volume fraud, the modular approach ensures immediate technical agility, while AI integration provides the nuanced judgment needed to handle complex social engineering tactics.
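To make the "Lego-like" composition concrete, here is a minimal sketch of Conditions, Rules, and Policies in plain Python; the names, thresholds, and sanction levels are hypothetical, not Daangn Pay's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable

# A Condition is the basic unit: a predicate over one event's attributes.
Condition = Callable[[dict], bool]


def account_age_below(days: int) -> Condition:
    return lambda event: event["account_age_days"] < days


def transfers_in_last_hour_at_least(limit: int) -> Condition:
    return lambda event: event["transfers_last_hour"] >= limit


@dataclass
class Rule:
    name: str
    conditions: list[Condition]  # all must hold for the rule to fire (AND)

    def fires(self, event: dict) -> bool:
        return all(cond(event) for cond in self.conditions)


@dataclass
class Policy:
    name: str
    rules: list[Rule]
    sanction: str  # e.g. "block_transfer" or "flag_for_review"

    def evaluate(self, event: dict) -> str | None:
        # The policy's sanction applies if any of its rules fires.
        return self.sanction if any(r.fires(event) for r in self.rules) else None


# Thresholds are data, so operators can retune them (e.g. 30 -> 70 days)
# without a deploy.
policy = Policy(
    name="new_account_rapid_transfers",
    rules=[Rule("young_account_burst",
                [account_age_below(70), transfers_in_last_hour_at_least(5)])],
    sanction="flag_for_review",
)

print(policy.evaluate({"account_age_days": 10, "transfers_last_hour": 7}))
# -> "flag_for_review"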

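And a hedged sketch of the LLM call itself: invoking Claude 3.5 Sonnet through AWS Bedrock and asking for a structured JSON verdict. The `invoke_model` call and message format follow the public Bedrock API; the prompt, criteria, and region are illustrative assumptions, not Daangn Pay's production prompt.

```python
import json

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")


def assess_transaction(case_summary: str) -> dict:
    """Ask the model for a fraud verdict as machine-readable JSON."""
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1024,
        "messages": [{
            "role": "user",
            "content": (
                "Assess the following transfer for fraud. Reply with JSON only: "
                '{"is_fraud": bool, "confidence": float, "reasoning": str}\n\n'
                + case_summary
            ),
        }],
    }
    resp = bedrock.invoke_model(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
        body=json.dumps(body),
    )
    payload = json.loads(resp["body"].read())
    # The sketch assumes the model complied with the JSON-only instruction.
    return json.loads(payload["content"][0]["text"])
```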
daangn

Drawing Daangn's Data Map: Building Column-Level Lineage

Daangn's data governance team addressed the lack of transparency in their data pipelines by building a column-level lineage system based on SQL parsing. By analyzing BigQuery query logs with specialized parsing tools, they mapped intricate data dependencies that standard table-level tracking could not capture. The system now enables precise impact analysis and significantly improves data reliability and troubleshooting speed across the organization.

### The Necessity of Column-Level Visibility

* Table-level lineage, while easily accessible via BigQuery's `JOBS` view, fails to show how specific fields, such as PII or calculated metrics, propagate through downstream systems.
* Without granular lineage, the team faced cascading failures in which a single pipeline error triggered a chain of broken tables that were difficult to trace manually.
* Schema migrations, such as modifying a source MySQL column, were historically high-risk because the impact on derivative BigQuery tables and columns was unknown.

### Evaluating Extraction Strategies

* BigQuery's native `INFORMATION_SCHEMA` proved insufficient because it does not expose column-level detail and often obscures the original source tables when Views are involved.
* Frameworks like OpenLineage were considered but rejected due to high operational costs; requiring every team to instrument its own Airflow jobs or notebooks was impractical for a central governance team.
* The team chose a centralized SQL parsing approach, leveraging the fact that nearly all data transformations in the company are executed as SQL queries in BigQuery.

### Technical Implementation and Tech Stack

* **sqlglot**: This library serves as the core engine, parsing SQL strings into abstract syntax trees (ASTs) to programmatically identify source and destination columns (see the first sketch after this list).
* **Data Collection**: The system pulls raw query text from `INFORMATION_SCHEMA.JOBS` across all Google Cloud projects to ensure comprehensive coverage (see the second sketch below).
* **Processing and Orchestration**: Spark handles the parallel parsing of massive query logs, while Airflow schedules regular updates to the lineage data.
* **Storage**: The resulting mappings are stored in a centralized BigQuery table (`data_catalog.lineage`), making the dependency map easily accessible for impact analysis and data cataloging.

By centralizing lineage extraction through SQL parsing rather than per-job instrumentation, organizations can achieve comprehensive visibility without placing an integration burden on individual developers. This approach is highly effective for BigQuery-centric environments where SQL is the primary language for data movement and transformation.
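A minimal sketch of the core idea, using `sqlglot` to pull output columns and source tables out of a BigQuery query's AST; the query is illustrative, and Daangn's production parser necessarily handles far more cases (CTEs, views, nested fields).

```python
import sqlglot
from sqlglot import exp

sql = """
    SELECT u.user_id, SUM(o.amount) AS total_spent
    FROM shop.orders AS o
    JOIN shop.users AS u ON o.user_id = u.user_id
    GROUP BY u.user_id
"""

tree = sqlglot.parse_one(sql, read="bigquery")

# Destination columns: the name each SELECT expression is exposed under.
out_columns = [e.alias_or_name for e in tree.expressions]

# Source tables: every table reference found anywhere in the tree.
src_tables = sorted({
    ".".join(part for part in (t.catalog, t.db, t.name) if part)
    for t in tree.find_all(exp.Table)
})

print(out_columns)  # ['user_id', 'total_spent']
print(src_tables)   # ['shop.orders', 'shop.users']
```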

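For the collection side, a hedged sketch of pulling recent query text from `INFORMATION_SCHEMA.JOBS` with the official BigQuery Python client; the project and region are placeholders, and the real system fans out across all projects via Spark rather than a single loop.

```python
from google.cloud import bigquery

client = bigquery.Client(project="example-project")  # placeholder project

JOBS_SQL = """
    SELECT job_id, user_email, query, creation_time
    FROM `region-us`.INFORMATION_SCHEMA.JOBS
    WHERE job_type = 'QUERY'
      AND state = 'DONE'
      AND creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
"""

# Each row's `query` field is the raw SQL handed to the sqlglot parser above.
for row in client.query(JOBS_SQL).result():
    print(row.job_id, row.query[:80])
```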
daangn

Why Dump Everything Every Time? Building Daangn's MongoDB CDC Pipeline

As Daangn's data volume grew, the traditional full-dump approach using Spark for MongoDB began causing significant CPU spikes and missing the two-hour data delivery Service Level Objective (SLO). To resolve this, the team implemented a Change Data Capture (CDC) pipeline using Flink CDC to synchronize data efficiently without resource-intensive full table scans. This transition stabilized database performance and ensured timely data availability in BigQuery by focusing on incremental change logs rather than repeated bulk extracts.

### Limitations of Traditional Dump Methods

* The previous Spark Connector method required full table scans, creating a direct conflict between service stability and data freshness.
* Attempts to lower DB load caused the pipeline to miss the 2-hour SLO, while meeting the SLO pushed CPU usage to dangerous levels.
* Standard incremental loading was ruled out because it relied on `updated_at` fields, which were not consistently updated across all business logic or schemas.
* The team targeted the top five largest and most frequently updated collections for the initial CDC transition to maximize performance gains.

### Advantages of Flink CDC

* Flink CDC provides native support for MongoDB Change Streams, allowing the system to use resume tokens and Flink checkpoints for seamless recovery after failures.
* It guarantees exactly-once processing by periodically saving pipeline state to distributed storage, ensuring data integrity across restarts.
* Unlike tools such as Debezium, which require separate systems for data processing, Flink handles the entire Extract-Transform-Load (ETL) lifecycle within a single job.
* The architecture scales horizontally: adding TaskManagers lets the pipeline absorb surges in event volume with near-linear performance gains.

### Pipeline Architecture and Implementation

* The system reads the MongoDB Oplog to capture write operations (inserts, updates, and deletes) in real time, which Flink then processes (see the Flink SQL sketch after this entry).
* The backend pipeline runs on an hourly batch cycle that extracts the latest change events, deduplicates them, and merges them into raw JSON tables in BigQuery.
* A schema evolution step automatically detects and adds missing fields to BigQuery tables, bridging the gap between NoSQL flexibility and SQL structure.
* While Flink captures data in real time, the team opted for hourly materialization to maintain idempotency, simplify error recovery, and meet existing business requirements without unnecessary architectural complexity.

For organizations managing large-scale MongoDB instances, moving from bulk extracts to a CDC-based model is a critical step in balancing database health with analytical needs. A unified framework like Flink CDC not only reduces load on operational databases but also simplifies the management of complex transformations and schema changes.
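As a rough sketch of what such a source looks like, here is a MongoDB CDC table defined through Flink SQL (the `mongodb-cdc` connector), shown via PyFlink for consistency with the other examples. The hosts, credentials, database, and fields are placeholders, the connector JARs must be on the classpath, and the article does not disclose Daangn's exact job definition.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Streaming table environment; checkpoint and state-backend config omitted.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE TABLE articles_cdc (
        _id STRING,
        title STRING,
        price BIGINT,
        PRIMARY KEY (_id) NOT ENFORCED
    ) WITH (
        'connector'  = 'mongodb-cdc',
        'hosts'      = 'mongo-host:27017',
        'username'   = 'flink_user',
        'password'   = '***',
        'database'   = 'marketplace',
        'collection' = 'articles'
    )
""")

# Downstream, an INSERT INTO a sink table would land the change stream as
# raw JSON for the hourly dedup-and-merge step in BigQuery.
```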

daangn

Why Dump Everything Every Time? Building Daangn's MongoDB CDC Pipeline

To optimize data synchronization and ensure production stability, Daangn's data engineering team transitioned their MongoDB data pipeline from a resource-intensive full-dump method to a Change Data Capture (CDC) architecture. By leveraging Flink CDC, the team reduced database CPU usage to under 60% while consistently meeting a two-hour data delivery Service Level Objective (SLO). This shift enables efficient, schema-agnostic data replication to BigQuery, supporting high-scale analysis without compromising the performance of live services.

### Limitations of Traditional Dump Methods

* The previous Spark Connector-based approach required full table scans, forcing a direct trade-off between hitting delivery deadlines and maintaining database health.
* Growing data volumes caused significant CPU spikes, threatening the stability of transaction processing in production.
* Standard incremental loads were unreliable because many collections lacked consistent `updated_at` fields or required tracking hard deletes, which full dumps handle poorly at scale.

### Advantages of Flink CDC for MongoDB

* Flink CDC provides native support for MongoDB Change Streams, allowing the system to read the Oplog directly and use resume tokens to restart from specific failure points.
* The framework's checkpointing mechanism ensures exactly-once processing by periodically saving pipeline state to distributed storage such as GCS or S3.
* Unlike standalone tools such as Debezium, Flink allows an integrated Extract-Transform-Load (ETL) flow within a single job, reducing operational complexity and the need for intermediate message queues.
* The architecture is horizontally scalable: TaskManagers can be added to handle sudden bursts in event volume without re-architecting the pipeline.

### Pipeline Architecture and Processing Logic

* The core engine monitors MongoDB write operations (insert, update, delete) in real time via Change Streams and transmits them to BigQuery.
* An hourly batch process is used rather than pure real-time streaming to prioritize operational stability, idempotency, and easier recovery from failures.
* The downstream pipeline includes a schema evolution step that automatically detects and adds new fields to BigQuery tables, making the NoSQL-to-SQL transition seamless (see the sketch after this entry).
* Processing deduplicates recent change events and merges them into a raw JSON table before materializing a final structured table for end users.

For organizations managing large-scale MongoDB clusters, Flink CDC is a powerful way to balance analytical requirements with database performance. Prioritizing a robust, batch-integrated CDC flow lets teams meet strict delivery targets and maintain data integrity without the infrastructure overhead of a fully real-time streaming system.
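A minimal sketch of that schema evolution step under stated assumptions: compare incoming field types against the live BigQuery schema and append whatever is missing, since BigQuery allows additive schema changes. The table and field names are placeholders, and the production logic surely also handles nested fields and type conflicts.

```python
from google.cloud import bigquery

client = bigquery.Client()


def evolve_schema(table_id: str, incoming_fields: dict) -> None:
    """incoming_fields maps field name -> BigQuery type, e.g. {'price': 'INT64'}."""
    table = client.get_table(table_id)
    existing = {field.name for field in table.schema}

    new_fields = [
        bigquery.SchemaField(name, field_type, mode="NULLABLE")
        for name, field_type in incoming_fields.items()
        if name not in existing
    ]
    if new_fields:
        # Additive change only: extend the schema and push just that property.
        table.schema = list(table.schema) + new_fields
        client.update_table(table, ["schema"])


# Hypothetical raw table and newly observed document fields.
evolve_schema("example-project.raw.articles",
              {"price": "INT64", "region": "STRING"})
```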