Daangn / mongodb

2 posts

daangn

No need to pull it all every time, right? How Daangn built its MongoDB CDC pipeline

As Daangn’s data volume grew, the traditional full-dump approach using Spark for MongoDB began causing significant CPU spikes and missing the two-hour data delivery Service Level Objective (SLO). To resolve this, the team implemented a Change Data Capture (CDC) pipeline using Flink CDC to synchronize data efficiently without resource-intensive full table scans. This transition stabilized database performance and ensured timely data availability in BigQuery by relying on incremental change logs rather than repeated bulk extracts.

### Limitations of Traditional Dump Methods

* The previous Spark Connector method required full table scans, creating a direct conflict between service stability and data freshness.
* Attempts to lower DB load resulted in missing the two-hour SLO, while meeting the SLO pushed CPU usage to dangerous levels.
* Standard incremental loading was ruled out because it relied on `updated_at` fields, which were not consistently updated across all business logic or schemas.
* The team targeted the five largest and most frequently updated collections for the initial CDC transition to maximize performance gains.

### Advantages of Flink CDC

* Flink CDC provides native support for MongoDB Change Streams, allowing the system to use resume tokens and Flink checkpoints for seamless recovery after failures.
* It guarantees exactly-once processing by periodically saving the pipeline state to distributed storage, ensuring data integrity during restarts (a minimal sketch follows this summary).
* Unlike tools such as Debezium that need separate systems for data processing, Flink handles the entire extract-transform-load (ETL) lifecycle within a single job.
* The architecture is horizontally scalable: increasing the number of TaskManagers allows the pipeline to handle surges in event volume with roughly linear performance gains.

### Pipeline Architecture and Implementation

* The system uses the MongoDB oplog to capture write operations (inserts, updates, and deletes) in real time, which Flink then processes.
* The backend pipeline runs on an hourly batch cycle that extracts the latest change events, deduplicates them, and merges them into raw JSON tables in BigQuery.
* A schema evolution step automatically detects and adds missing fields to BigQuery tables, bridging the gap between NoSQL flexibility and SQL structure.
* While Flink captures data in real time, the team opted for hourly materialization to maintain idempotency, simplify error recovery, and meet existing business requirements without unnecessary architectural complexity.

For organizations managing large-scale MongoDB instances, moving from bulk extracts to a CDC-based model is a critical step in balancing database health with analytical needs. A unified framework like Flink CDC not only reduces the load on operational databases but also simplifies the management of complex data transformations and schema changes.
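The summary above contains no code, so the following is only a rough, hypothetical sketch of a Flink CDC job of this shape, assuming the `flink-connector-mongodb-cdc` dependency. The host, credentials, database/collection names, and GCS checkpoint bucket are placeholders, and a `print()` sink stands in for the real BigQuery-bound sink.

```java
import com.ververica.cdc.connectors.mongodb.source.MongoDBSource;
import com.ververica.cdc.debezium.JsonDebeziumDeserializationSchema;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MongoCdcToBigQueryJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Exactly-once processing hinges on checkpoints: the change-stream resume token
        // and operator state are periodically persisted to distributed storage so the
        // job can resume from the last committed position after a failure.
        env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE);
        env.getCheckpointConfig().setCheckpointStorage("gs://example-bucket/flink-checkpoints");

        // Change Streams source over one collection; events arrive as Debezium-style JSON.
        MongoDBSource<String> source = MongoDBSource.<String>builder()
                .hosts("mongodb.example.internal:27017")   // placeholder host
                .databaseList("marketplace")               // placeholder database
                .collectionList("marketplace.articles")    // placeholder collection
                .username("cdc_reader")                    // placeholder credentials
                .password("change-me")
                .deserializer(new JsonDebeziumDeserializationSchema())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "mongodb-cdc")
           .print();  // stand-in for the sink that lands change events for BigQuery

        env.execute("mongo-cdc-example");
    }
}
```

Scaling out is then a matter of raising source parallelism and adding TaskManagers; the resume token stored in each checkpoint is what lets the job pick up the change stream exactly where it left off.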

daangn

No need to pull it all every time, right? How Daangn built its MongoDB CDC pipeline

To optimize data synchronization and keep production stable, Daangn’s data engineering team transitioned their MongoDB data pipeline from a resource-intensive full-dump method to a Change Data Capture (CDC) architecture. By leveraging Flink CDC, the team reduced database CPU usage to under 60% while consistently meeting a two-hour data delivery Service Level Objective (SLO). This shift enables efficient, schema-agnostic data replication to BigQuery, supporting high-scale analysis without compromising the performance of live services.

### Limitations of Traditional Dump Methods

* The previous Spark Connector-based approach required full table scans, forcing a direct trade-off between hitting delivery deadlines and maintaining database health.
* Growing data volumes caused significant CPU spikes, threatening the stability of transaction processing in production.
* Standard incremental loads were unreliable because many collections lacked consistent `updated_at` fields, and hard deletes cannot be detected by `updated_at`-based loading at all.

### Advantages of Flink CDC for MongoDB

* Flink CDC provides native support for MongoDB Change Streams, allowing the system to consume the oplog-backed change stream and use resume tokens to restart from the exact point of failure.
* The framework’s checkpointing mechanism ensures exactly-once processing by periodically saving the pipeline state to distributed storage such as GCS or S3.
* Unlike standalone tools such as Debezium, Flink supports an integrated extract-transform-load (ETL) flow within a single job, reducing operational complexity and removing the need for intermediate message queues.
* The architecture is horizontally scalable: TaskManagers can be added to absorb sudden bursts in event volume without re-architecting the pipeline.

### Pipeline Architecture and Processing Logic

* The core engine monitors MongoDB write operations (insert, update, delete) in real time via Change Streams and ships them to BigQuery.
* An hourly batch process is used rather than pure real-time streaming to prioritize operational stability, idempotency, and easier recovery from failures.
* The downstream pipeline includes a schema evolution step that automatically detects and adds new fields to BigQuery tables, keeping the NoSQL-to-SQL transition seamless.
* Each run deduplicates recent change events and merges them into a raw JSON table before materializing a final structured table for end users (see the merge sketch after this summary).

For organizations managing large-scale MongoDB clusters, Flink CDC is a practical way to balance analytical requirements with database performance. Prioritizing a robust, batch-integrated CDC flow lets teams meet strict delivery targets and maintain data integrity without the infrastructure overhead of a fully real-time streaming system.
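To make the hourly dedup-and-merge step concrete, here is a rough sketch using the BigQuery Java client. The dataset, table, and column names (`document_id`, `op`, `payload`, `event_ts`) are illustrative assumptions rather than Daangn's actual schema, and the real pipeline presumably runs this from its orchestrator rather than a standalone `main`.

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;

public class HourlyChangelogMerge {

    public static void main(String[] args) throws Exception {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Keep only the latest event per document from the last hour, then merge it into
        // the raw table. Re-running the same window produces the same result (idempotent).
        // All table and column names below are hypothetical placeholders.
        String mergeSql =
            "MERGE `analytics.articles_raw` AS t\n"
          + "USING (\n"
          + "  SELECT * EXCEPT(rn)\n"
          + "  FROM (\n"
          + "    SELECT *,\n"
          + "           ROW_NUMBER() OVER (PARTITION BY document_id ORDER BY event_ts DESC) AS rn\n"
          + "    FROM `analytics.articles_changelog`\n"
          + "    WHERE event_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)\n"
          + "  )\n"
          + "  WHERE rn = 1\n"
          + ") AS s\n"
          + "ON t.document_id = s.document_id\n"
          + "WHEN MATCHED AND s.op = 'delete' THEN DELETE\n"
          + "WHEN MATCHED THEN UPDATE SET payload = s.payload, event_ts = s.event_ts\n"
          + "WHEN NOT MATCHED AND s.op != 'delete' THEN\n"
          + "  INSERT (document_id, payload, event_ts) VALUES (s.document_id, s.payload, s.event_ts)";

        bigquery.query(QueryJobConfiguration.newBuilder(mergeSql).build());
    }
}
```

Because the MERGE keys on `document_id` and the source is deduplicated to the latest event per document, replaying the same hourly window is idempotent, which is the recovery property the team prioritized over a fully real-time streaming setup.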