daangn

You don't need to fetch it

As Daangn’s data volume grew, the traditional full-dump approach using the MongoDB Spark Connector began causing significant CPU spikes and failing to meet the two-hour data delivery Service Level Objective (SLO). To resolve this, the team implemented a Change Data Capture (CDC) pipeline using Flink CDC, synchronizing data efficiently without resource-intensive full table scans. The transition stabilized database performance and ensured timely data availability in BigQuery by relying on incremental change logs rather than repeated bulk extracts.

### Limitations of Traditional Dump Methods

* The previous Spark Connector method required full table scans, creating a direct conflict between service stability and data freshness.
* Attempts to lower DB load resulted in missing the 2-hour SLO, while meeting the SLO pushed CPU usage to dangerous levels.
* Standard incremental loading was ruled out because it relied on `updated_at` fields, which were not consistently updated across all business logic or schemas.
* The team targeted the top five largest and most frequently updated collections for the initial CDC transition to maximize performance gains.

### Advantages of Flink CDC

* Flink CDC provides native support for MongoDB Change Streams, allowing the system to use resume tokens and Flink checkpoints for seamless recovery after failures (see the first sketch at the end of this post).
* It guarantees exactly-once processing by periodically saving the pipeline state to distributed storage, ensuring data integrity during restarts.
* Unlike tools such as Debezium, which require separate systems for data processing, Flink handles the entire extract-transform-load (ETL) lifecycle within a single job.
* The architecture is horizontally scalable: increasing the number of TaskManagers allows the pipeline to handle surges in event volume with linear performance improvements.

### Pipeline Architecture and Implementation

* The system uses the MongoDB oplog to capture real-time write operations (inserts, updates, and deletes), which are then processed by Flink.
* The backend pipeline operates on an hourly batch cycle to extract the latest change events, deduplicate them, and merge them into raw JSON tables in BigQuery (see the second sketch below).
* A schema evolution step automatically detects and adds missing fields to BigQuery tables, bridging the gap between NoSQL flexibility and SQL structure (see the third sketch below).
* While Flink captures data in real time, the team opted for hourly materialization to maintain idempotency, simplify error recovery, and meet existing business requirements without unnecessary architectural complexity.

For organizations managing large-scale MongoDB instances, moving from bulk extracts to a CDC-based model is a critical step in balancing database health with analytical needs. A unified framework like Flink CDC not only reduces the load on the operational database but also simplifies the management of complex data transformations and schema changes.
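To make the source side more concrete, here is a minimal sketch of a Flink job that reads a MongoDB change stream with the Flink CDC connector. This is not Daangn’s actual code: the host, credentials, and the `user_db.orders` collection are placeholders, and it assumes the Ververica `flink-connector-mongodb-cdc` 2.x incremental-source API. The key point is that checkpointing persists the change-stream resume token, which is what lets the job resume after a failure without a full re-read.

```java
import com.ververica.cdc.connectors.mongodb.source.MongoDBSource;
import com.ververica.cdc.debezium.JsonDebeziumDeserializationSchema;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MongoCdcSourceSketch {

    public static void main(String[] args) throws Exception {
        // Change-stream source: each insert/update/delete arrives as a Debezium-style JSON string.
        // Host, credentials, and collection names are placeholders, not real endpoints.
        MongoDBSource<String> source =
                MongoDBSource.<String>builder()
                        .hosts("mongodb.internal.example:27017")
                        .databaseList("user_db")
                        .collectionList("user_db.orders")
                        .username("cdc_reader")
                        .password("********")
                        .deserializer(new JsonDebeziumDeserializationSchema())
                        .build();

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Periodic checkpoints persist operator state, including the change-stream resume token,
        // to durable storage; this is what enables exactly-once recovery after a restart.
        env.enableCheckpointing(60_000L);

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "MongoDB CDC Source")
                // In the real pipeline this would write raw JSON change events out for the
                // hourly BigQuery merge; print() keeps the sketch self-contained.
                .print();

        env.execute("mongodb-cdc-sketch");
    }
}
```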
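The post describes the hourly materialization only at a high level, so the second sketch shows just one plausible shape for the dedup-and-merge step. The table names (`analytics.raw_orders_changelog`, `analytics.raw_orders`) and columns (`doc_id`, `op`, `payload`, `event_ts`) are hypothetical. It keeps the newest event per document from the batch window and upserts or deletes accordingly, submitted through the Google Cloud BigQuery Java client; a production job would parameterize the window instead of using `CURRENT_TIMESTAMP()` so reruns stay idempotent.

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;

public class HourlyChangelogMergeSketch {

    // Hypothetical merge: collapse the last hour of change events to the newest event per
    // document, then upsert (or delete on a 'delete' op) into the raw table.
    private static final String MERGE_SQL = """
            MERGE `analytics.raw_orders` AS target
            USING (
              SELECT * EXCEPT(rn)
              FROM (
                SELECT doc_id, op, payload, event_ts,
                       ROW_NUMBER() OVER (PARTITION BY doc_id ORDER BY event_ts DESC) AS rn
                FROM `analytics.raw_orders_changelog`
                WHERE event_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
              )
              WHERE rn = 1
            ) AS source
            ON target.doc_id = source.doc_id
            WHEN MATCHED AND source.op = 'delete' THEN DELETE
            WHEN MATCHED THEN
              UPDATE SET payload = source.payload, event_ts = source.event_ts
            WHEN NOT MATCHED AND source.op != 'delete' THEN
              INSERT (doc_id, payload, event_ts)
              VALUES (source.doc_id, source.payload, source.event_ts)
            """;

    public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        // Run the merge as a standard-SQL query job and block until it finishes.
        bigquery.query(QueryJobConfiguration.newBuilder(MERGE_SQL)
                .setUseLegacySql(false)
                .build());
    }
}
```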
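Finally, the schema-evolution step can be approximated with the BigQuery client's table-update API: read the current schema, append any field seen in incoming documents but missing from the table, and push the widened schema back. The field and table names here are placeholders, and this only covers the additive case (new nullable columns), which is the kind of NoSQL-to-SQL bridging the post describes.

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardSQLTypeName;
import com.google.cloud.bigquery.StandardTableDefinition;
import com.google.cloud.bigquery.Table;
import com.google.cloud.bigquery.TableId;

import java.util.ArrayList;
import java.util.List;

public class SchemaEvolutionSketch {

    /** Adds a nullable column to the table if it is not already present (placeholder names). */
    static void addColumnIfMissing(BigQuery bigquery, TableId tableId,
                                   String columnName, StandardSQLTypeName type) {
        Table table = bigquery.getTable(tableId);
        Schema schema = table.<StandardTableDefinition>getDefinition().getSchema();

        // Skip the update when the field already exists in the current schema.
        boolean exists = schema.getFields().stream()
                .anyMatch(f -> f.getName().equals(columnName));
        if (exists) {
            return;
        }

        // Append the new field and push the widened schema back to BigQuery.
        List<Field> fields = new ArrayList<>(schema.getFields());
        fields.add(Field.of(columnName, type));
        Schema widened = Schema.of(fields.toArray(new Field[0]));
        table.toBuilder()
                .setDefinition(StandardTableDefinition.of(widened))
                .build()
                .update();
    }

    public static void main(String[] args) {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        // Hypothetical example: a new "coupon_code" field observed in the change stream.
        addColumnIfMissing(bigquery, TableId.of("analytics", "raw_orders"),
                "coupon_code", StandardSQLTypeName.STRING);
    }
}
```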