mongodb

3 posts

kakao

12 Reasons to Upgrade to MongoDB

MongoDB 8.0 marks a significant shift in the database's evolution, moving away from simple feature expansion to prioritize architectural stability and substantial performance gains. By addressing historical criticisms regarding write latency and query overhead, this release establishes a robust foundation for enterprise-scale applications requiring high throughput and long-term reliability.

### Extended Support and Release Strategy

* MongoDB 8.0 is designated for five years of support (until October 2029), offering a stable "LTS-like" window that reduces the resource burden of frequent major upgrades.
* The "Rapid Release" policy, previously exclusive to MongoDB Atlas, now extends to on-premise environments, allowing self-managed users to access minor-release features and improvements more quickly.
* This policy change gives DBAs greater strategic flexibility to choose between prioritizing stability and adopting new features.

### Optimized "Majority" Write Concern

* The criterion for "majority" write acknowledgment has shifted from `lastApplied` (when data is written to the data files) to `lastWritten` (when the entry is recorded in the `oplog.rs` collection).
* This change bypasses the wait for secondary nodes to physically apply changes to their storage engines, yielding a 30–47% improvement in write throughput.
* While this improves speed, applications that read from secondaries immediately after a write may need to use Causally Consistent Sessions to ensure they see the most recent data (see the sketch after this summary).

### Efficient Bulk Operations

* A new database-level `bulkWrite` command allows operations across multiple collections within a single request, reducing network round-trip costs.
* The system now groups multiple document inserts (up to a default of 500) into a single oplog entry instead of creating individual entries for every document.
* This grouping aligns the oplog process with the WiredTiger storage engine's internal batching, significantly reducing replication lag and improving overall write efficiency.

### High-Speed Indexed Queries with the Express Plan

* MongoDB 8.0 introduces the "Express Plan" to optimize high-frequency, simple queries by bypassing the traditional multi-stage query optimizer.
* Queries are eligible for this fast-track execution if they are point queries on the `_id` field or equality searches on fields with unique indexes (or queries using `limit: 1`).
* By skipping the overhead of query parsing, normalization, and plan-stage construction, the Express Plan maximizes CPU efficiency for the most common database interaction patterns.

For organizations managing large-scale production environments, MongoDB 8.0 is a highly recommended upgrade. The combination of a five-year support lifecycle and fundamental improvements to replication and query execution makes it the most performant and operationally sound version of the database to date.
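The read-your-own-write caveat above can be handled with a causally consistent session. Below is a minimal PyMongo sketch, not taken from the original post: the connection URI, database, and collection names are hypothetical placeholders.

```python
from pymongo import MongoClient
from pymongo.read_preferences import ReadPreference
from pymongo.write_concern import WriteConcern

# Hypothetical replica-set connection; URI and names are placeholders.
client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
orders = client.get_database("shop", write_concern=WriteConcern("majority")).orders

# In 8.0 a w:"majority" write is acknowledged at lastWritten (oplog entry)
# rather than lastApplied (data files), so a follow-up read from a secondary
# is not automatically guaranteed to see it. A causally consistent session
# carries the operation time forward, so the secondary read below waits
# until it can observe the preceding write.
with client.start_session(causal_consistency=True) as session:
    orders.insert_one({"_id": 1001, "status": "paid"}, session=session)
    doc = orders.with_options(
        read_preference=ReadPreference.SECONDARY_PREFERRED
    ).find_one({"_id": 1001}, session=session)
    print(doc)
```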

daangn

You don't need to fetch it

As Daangn’s data volume grew, their traditional full-dump approach, which used Spark to extract MongoDB data, began causing significant CPU spikes and failing to meet the two-hour data delivery Service Level Objective (SLO). To resolve this, the team implemented a Change Data Capture (CDC) pipeline using Flink CDC to synchronize data efficiently without resource-intensive full table scans. This transition stabilized database performance and ensured timely data availability in BigQuery by focusing on incremental change logs rather than repeated bulk extracts.

### Limitations of Traditional Dump Methods

* The previous Spark Connector method required full table scans, creating a direct conflict between service stability and data freshness.
* Attempts to lower DB load resulted in missing the 2-hour SLO, while meeting the SLO pushed CPU usage to dangerous levels.
* Standard incremental loading was ruled out because it relied on `updated_at` fields, which were not consistently updated across all business logic or schemas.
* The team targeted the top five largest and most frequently updated collections for the initial CDC transition to maximize performance gains.

### Advantages of Flink CDC

* Flink CDC provides native support for MongoDB Change Streams, allowing the system to use resume tokens and Flink checkpoints for seamless recovery after failures.
* It guarantees "exactly-once" processing by periodically saving the pipeline state to distributed storage, ensuring data integrity during restarts.
* Unlike tools such as Debezium that require separate systems for data processing, Flink handles the entire "Extract-Transform-Load" (ETL) lifecycle within a single job.
* The architecture is horizontally scalable; increasing the number of TaskManagers allows the pipeline to handle surges in event volume with linear performance improvements.

### Pipeline Architecture and Implementation

* The system uses the MongoDB oplog to capture real-time write operations (inserts, updates, and deletes), which are then processed by Flink (a minimal source definition is sketched below).
* The backend pipeline operates on an hourly batch cycle to extract the latest change events, deduplicate them, and merge them into raw JSON tables in BigQuery.
* A "Schema Evolution" step automatically detects and adds missing fields to BigQuery tables, bridging the gap between NoSQL flexibility and SQL structure.
* While Flink captures data in real time, the team opted for hourly materialization to maintain idempotency, simplify error recovery, and meet existing business requirements without unnecessary architectural complexity.

For organizations managing large-scale MongoDB instances, moving from bulk extracts to a CDC-based model is a critical step in balancing database health with analytical needs. Implementing a unified framework like Flink CDC not only reduces the load on operational databases but also simplifies the management of complex data transformations and schema changes.
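To make the source side concrete, here is a minimal PyFlink Table API sketch of a MongoDB CDC source, assuming the open-source `mongodb-cdc` connector jar is on the job's classpath. Host, database, collection, and column names are assumptions for illustration, not Daangn's actual configuration.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Periodic checkpoints persist operator state (including the Change Stream
# resume token), which backs the exactly-once restart behaviour described above.
t_env.get_config().set("execution.checkpointing.interval", "60 s")

# Source table over the MongoDB oplog / Change Stream.
# Hosts, database, collection, and columns are placeholders.
t_env.execute_sql("""
    CREATE TABLE articles_cdc (
        _id STRING,
        status STRING,
        price BIGINT,
        PRIMARY KEY (_id) NOT ENFORCED
    ) WITH (
        'connector'  = 'mongodb-cdc',
        'hosts'      = 'mongo-rs0.internal:27017',
        'database'   = 'marketplace',
        'collection' = 'articles'
    )
""")

# Downstream, the same job can transform these change rows and write them to a
# sink (e.g. a BigQuery staging table), keeping ETL inside a single Flink job.
t_env.execute_sql("SELECT _id, status, price FROM articles_cdc").print()
```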

daangn

No Need to Fetch Everything Every Time

To optimize data synchronization and protect production stability, Daangn’s data engineering team transitioned their MongoDB data pipeline from a resource-intensive full-dump method to a Change Data Capture (CDC) architecture. By leveraging Flink CDC, the team reduced database CPU usage to under 60% while consistently meeting a two-hour data delivery Service Level Objective (SLO). This shift enables efficient, schema-agnostic data replication to BigQuery, facilitating high-scale analysis without compromising the performance of live services.

### Limitations of Traditional Dump Methods

* The previous Spark Connector-based approach required full table scans, leading to a direct trade-off between hitting delivery deadlines and maintaining database health.
* Increasing data volumes caused significant CPU spikes, threatening the stability of transaction processing in production.
* Standard incremental loads were unreliable because many collections lacked consistent `updated_at` fields or involved hard deletes, which timestamp-based extraction cannot detect and which full dumps handle poorly at scale.

### Advantages of Flink CDC for MongoDB

* Flink CDC provides native support for MongoDB Change Streams, allowing the system to read the oplog directly and use resume tokens to restart from specific failure points.
* The framework’s checkpointing mechanism ensures "exactly-once" processing by periodically saving the pipeline state to distributed storage such as GCS or S3.
* Unlike standalone tools such as Debezium, Flink allows an integrated "Extract-Transform-Load" (ETL) flow within a single job, reducing operational complexity and the need for intermediate message queues.
* The architecture is horizontally scalable, meaning TaskManagers can be added to handle sudden bursts in event volume without re-architecting the pipeline.

### Pipeline Architecture and Processing Logic

* The core engine monitors MongoDB write operations (insert, update, delete) in real time via Change Streams and transmits them to BigQuery.
* An hourly batch process is used rather than pure real-time streaming to prioritize operational stability, idempotency, and easier recovery from failures.
* The downstream pipeline includes a Schema Evolution step that automatically detects and adds new fields to BigQuery tables, keeping the NoSQL-to-SQL transition seamless.
* Data processing deduplicates recent change events and merges them into a raw JSON table before materializing a final structured table for end users (a sketch of this dedup-and-merge step follows below).

For organizations managing large-scale MongoDB clusters, Flink CDC serves as a powerful way to balance analytical requirements with database performance. Prioritizing a robust, batch-integrated CDC flow allows teams to meet strict delivery targets and maintain data integrity without the infrastructure overhead of a fully real-time streaming system.
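As a complement, the hourly deduplicate-and-merge step might look roughly like the following google-cloud-bigquery sketch. The dataset, table, and column names (`_id`, `op`, `payload`, `event_ts`) are assumptions for illustration, not the team's actual schema.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Keep only the newest change event per document from the last hour, then
# merge it into the raw table. Re-running the same statement produces the
# same result, which is what keeps the hourly batch idempotent.
MERGE_SQL = """
MERGE `analytics.articles_raw` AS target
USING (
  SELECT * EXCEPT (rn)
  FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY _id ORDER BY event_ts DESC) AS rn
    FROM `analytics.articles_changes`
    WHERE event_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
  )
  WHERE rn = 1
) AS source
ON target._id = source._id
WHEN MATCHED AND source.op = 'delete' THEN
  DELETE
WHEN MATCHED THEN
  UPDATE SET payload = source.payload, event_ts = source.event_ts
WHEN NOT MATCHED AND source.op != 'delete' THEN
  INSERT (_id, payload, event_ts) VALUES (source._id, source.payload, source.event_ts)
"""

client.query(MERGE_SQL).result()
```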