daangn

No Need to Fetch Everything Every Time

To optimize data synchronization and ensure production stability, Daangn’s data engineering team transitioned their MongoDB data pipeline from a resource-intensive full-dump method to a Change Data Capture (CDC) architecture. By leveraging Flink CDC, the team successfully reduced database CPU usage to under 60% while consistently meeting a two-hour data delivery Service Level Objective (SLO). This shift enables efficient, schema-agnostic data replication to BigQuery, facilitating high-scale analysis without compromising the performance of live services.

Limitations of Traditional Dump Methods

  • The previous Spark Connector-based approach required full collection scans, creating a direct trade-off between hitting delivery deadlines and preserving database health.
  • Increasing data volumes caused significant CPU spikes, threatening the stability of transaction processing in production environments.
  • Standard incremental loads were unreliable: many collections lacked a consistent updated_at field, and hard deletes leave no trace for a timestamp-based query to pick up, so only a full dump could capture them, at a cost that does not scale (illustrated in the sketch after this list).
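To make the hard-delete gap concrete, here is a minimal sketch of a timestamp-based incremental pull using the MongoDB Java driver; the connection string, the shop.orders collection, and the updated_at field are hypothetical stand-ins, not Daangn's schema. A deleted document simply never matches the filter, so the downstream copy keeps it forever.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;

import java.time.Instant;
import java.util.Date;

public class NaiveIncrementalDump {
    public static void main(String[] args) {
        // Hypothetical connection string and collection name.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> orders =
                    client.getDatabase("shop").getCollection("orders");

            // Only documents touched since the last run match the filter.
            Date lastRun = Date.from(Instant.parse("2024-01-01T00:00:00Z"));
            for (Document doc : orders.find(Filters.gte("updated_at", lastRun))) {
                System.out.println(doc.toJson());
                // Hard-deleted documents never appear here, and documents
                // missing updated_at are skipped entirely -- the two gaps
                // that keep pushing teams back toward full dumps.
            }
        }
    }
}
```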

Advantages of Flink CDC for MongoDB

  • Flink CDC provides native support for MongoDB Change Streams, which are built on the oplog, so the pipeline can consume change events and use resume tokens to restart from the exact point of a failure (see the sketch after this list).
  • Flink's checkpointing mechanism provides exactly-once processing guarantees by periodically saving the pipeline state to distributed storage such as GCS or S3.
  • Unlike standalone tools like Debezium, Flink allows for an integrated "Extract-Transform-Load" (ETL) flow within a single job, reducing operational complexity and the need for intermediate message queues.
  • The architecture is horizontally scalable, meaning TaskManagers can be increased to handle sudden bursts in event volume without re-architecting the pipeline.
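As a rough illustration of these points, the following is a minimal sketch of a Flink CDC job wired against the open-source MongoDB connector's builder API; the hosts, credentials, database and collection names, checkpoint interval, and GCS bucket are placeholders, not Daangn's actual configuration.

```java
import com.ververica.cdc.connectors.mongodb.source.MongoDBSource;
import com.ververica.cdc.debezium.JsonDebeziumDeserializationSchema;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MongoCdcJob {
    public static void main(String[] args) throws Exception {
        // Change Stream source: emits insert/update/delete events as Debezium-style JSON.
        MongoDBSource<String> source = MongoDBSource.<String>builder()
                .hosts("mongo-replica-set:27017")          // placeholder hosts
                .databaseList("shop")                      // placeholder database
                .collectionList("shop.orders")             // placeholder collection
                .username("cdc_reader")
                .password("secret")
                .deserializer(new JsonDebeziumDeserializationSchema())
                .build();

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Periodic checkpoints persist pipeline state (including resume positions)
        // so the job can restart from the last consistent point after a failure.
        env.enableCheckpointing(60_000L);
        env.getCheckpointConfig().setCheckpointStorage("gs://example-bucket/flink-checkpoints");

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "mongodb-cdc-source")
           .setParallelism(2)   // scale out by adding TaskManager slots
           .print();            // stand-in for the transform + BigQuery sink in a real job

        env.execute("mongodb-cdc-to-bigquery");
    }
}
```

Because extraction, transformation, and loading live in one job, there is no intermediate message queue to operate, and scaling up means raising the source parallelism and adding TaskManagers.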

Pipeline Architecture and Processing Logic

  • The core engine monitors MongoDB write operations (Insert, Update, Delete) in real-time via Change Streams and transmits them to BigQuery.
  • An hourly batch process is utilized rather than pure real-time streaming to prioritize operational stability, idempotency, and easier recovery from failures.
  • The downstream pipeline includes a Schema Evolution step that automatically detects and adds new fields to BigQuery tables, ensuring the NoSQL-to-SQL transition is seamless.
  • Data processing deduplicates the most recent change events and merges them into a raw JSON table before materializing a final structured table for end users (see the merge sketch after this list).
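The dedup-and-merge step can be pictured as a MERGE over the hourly change events that keeps only the latest event per document and propagates deletes; below is a hedged sketch submitted through the BigQuery Java client, where the cdc.raw_changes and cdc.orders_raw tables and the document_id, event_ts, op, and payload columns are assumptions rather than Daangn's real schema.

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;

public class HourlyMergeStep {
    public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Hypothetical tables: cdc.raw_changes holds the hour's change events,
        // cdc.orders_raw is the merged raw JSON table.
        String mergeSql =
                "MERGE `cdc.orders_raw` AS target " +
                "USING ( " +
                "  SELECT * EXCEPT(rn) FROM ( " +
                "    SELECT *, ROW_NUMBER() OVER ( " +
                "      PARTITION BY document_id ORDER BY event_ts DESC) AS rn " +
                "    FROM `cdc.raw_changes` " +
                "    WHERE event_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR) " +
                "  ) WHERE rn = 1 " +                       // keep only the latest event per document
                ") AS source " +
                "ON target.document_id = source.document_id " +
                "WHEN MATCHED AND source.op = 'delete' THEN DELETE " +   // propagate hard deletes
                "WHEN MATCHED THEN UPDATE SET target.payload = source.payload, " +
                "                             target.event_ts = source.event_ts " +
                "WHEN NOT MATCHED AND source.op != 'delete' THEN " +
                "  INSERT (document_id, payload, event_ts) " +
                "  VALUES (source.document_id, source.payload, source.event_ts)";

        bigquery.query(QueryJobConfiguration.newBuilder(mergeSql).build());
    }
}
```

Because the merge is keyed on the document id and sourced from a bounded window of events, re-running the same hour is idempotent, which fits the batch-oriented recovery model described above.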

For organizations managing large-scale MongoDB clusters, Flink CDC is a powerful way to balance analytical requirements against database performance. Prioritizing a robust, batch-integrated CDC flow lets teams meet strict delivery targets and preserve data integrity without the infrastructure overhead of a fully real-time streaming system.