Naver / mysql

2 posts

naver

Smart Store Center's Zero-Downtime Migration from Oracle to MySQL

Smart Store Center successfully migrated its legacy platform from Oracle to MySQL to overcome performance instability caused by resource contention and to reduce high licensing costs. By implementing a "dual write" strategy, the team achieved a zero-downtime transition while retaining the ability to roll back immediately without data loss. The write-up highlights the use of proxy data sources and transaction synchronization to ensure data integrity across the two database environments.

## Zero-Downtime Migration via Dual Writing

* The migration strategy relied on "dual writing," where all Create, Update, and Delete (CUD) operations are performed on both the legacy Oracle database and the new MySQL database.
* In the pre-migration phase, Oracle served as the primary source for all traffic while MySQL recorded writes in the background to build a synchronized state.
* Once data was fully migrated and verified, primary traffic was shifted to MySQL, with background writes continuing to Oracle to allow an instantaneous rollback if performance issues occurred.
* This approach decoupled the database switch from application deployment, providing a safety net against critical failures that a simple redeploy could not fix.

## Technical Implementation for JPA

* To capture and replicate queries, the team used the `datasource-proxy` library, which allowed them to intercept Oracle queries and execute them against a separate MySQL DataSource (see the listener sketch after this summary).
* To prevent MySQL write failures from impacting the primary Oracle transactions, writes to the secondary database were managed with `TransactionSynchronizationManager`.
* By executing MySQL queries during the `afterCommit` phase, the team ensured that the primary service remained stable even if the secondary database encountered errors or performance bottlenecks (see the synchronization sketch below).
* The transition required modifying JPA entity configurations, such as changing primary key generation from Oracle sequences to MySQL's `IDENTITY` (auto-increment) and adjusting `columnDefinition` for types like `text`, `longtext`, and `decimal` (see the entity sketch below).

## Centralized MyBatis Strategy

* To avoid modifying thousands of business logic points in a 10-year-old codebase, the team sought a way to implement dual writing for MyBatis at the architectural level.
* The implementation focused on the MyBatis `Configuration` and `MappedStatement` objects to capture SQL execution without requiring manual updates to individual repository interfaces (see the interceptor sketch below).
* This centralized approach kept the business logic untouched and ensured that the dual-write logic could be removed cleanly once the migration was fully stabilized.

For organizations managing large-scale legacy migrations, the dual-write pattern combined with asynchronous transaction synchronization is a highly recommended safety mechanism. Prioritizing the isolation of secondary database failures ensures that the user experience remains unaffected while technical validation is performed in real time.
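The post names `datasource-proxy` as the interception point for replicating Oracle-bound writes. Below is a minimal sketch of that idea, assuming a Spring-managed Oracle `DataSource` and a hypothetical `MySqlReplayService` that buffers captured statements for the MySQL side; the article's real filtering and parameter handling are more involved.

```java
import java.util.List;
import javax.sql.DataSource;

import net.ttddyy.dsproxy.ExecutionInfo;
import net.ttddyy.dsproxy.QueryInfo;
import net.ttddyy.dsproxy.listener.QueryExecutionListener;
import net.ttddyy.dsproxy.support.ProxyDataSourceBuilder;

public class DualWriteDataSourceFactory {

    // Hypothetical collaborator that buffers captured statements for MySQL replay.
    public interface MySqlReplayService {
        void enqueue(QueryInfo queryInfo);
    }

    // Wraps the primary Oracle DataSource so every executed query is observable.
    public static DataSource wrap(DataSource oracleDataSource, MySqlReplayService replayService) {
        return ProxyDataSourceBuilder
                .create(oracleDataSource)
                .name("oracle-dual-write-proxy")
                .listener(new QueryExecutionListener() {
                    @Override
                    public void beforeQuery(ExecutionInfo execInfo, List<QueryInfo> queryInfoList) {
                        // no-op: only queries that actually ran are interesting
                    }

                    @Override
                    public void afterQuery(ExecutionInfo execInfo, List<QueryInfo> queryInfoList) {
                        if (!execInfo.isSuccess()) {
                            return; // replicate only statements that succeeded on Oracle
                        }
                        for (QueryInfo queryInfo : queryInfoList) {
                            String sql = queryInfo.getQuery().trim().toLowerCase();
                            // Replicate only CUD statements; SELECTs stay Oracle-only.
                            if (sql.startsWith("insert") || sql.startsWith("update") || sql.startsWith("delete")) {
                                replayService.enqueue(queryInfo);
                            }
                        }
                    }
                })
                .build();
    }
}
```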
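The isolation of MySQL failures comes from deferring the secondary write to the `afterCommit` phase of the Oracle transaction. A sketch of that pattern, assuming Spring Framework 5.3+ (where `TransactionSynchronization` has default methods) and a `JdbcTemplate` bound to the MySQL DataSource; `CapturedStatement` is a hypothetical stand-in for whatever the proxy listener collected.

```java
import java.util.List;

import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.transaction.support.TransactionSynchronization;
import org.springframework.transaction.support.TransactionSynchronizationManager;

public class MySqlAfterCommitWriter {

    // Hypothetical value object holding the SQL and bind parameters captured by the proxy.
    public record CapturedStatement(String sql, Object[] params) {}

    private final JdbcTemplate mysqlJdbcTemplate;

    public MySqlAfterCommitWriter(JdbcTemplate mysqlJdbcTemplate) {
        this.mysqlJdbcTemplate = mysqlJdbcTemplate;
    }

    // Call this while the Oracle transaction is still active.
    public void scheduleReplay(List<CapturedStatement> statements) {
        if (!TransactionSynchronizationManager.isSynchronizationActive()) {
            replay(statements); // no surrounding transaction: write immediately
            return;
        }
        TransactionSynchronizationManager.registerSynchronization(new TransactionSynchronization() {
            @Override
            public void afterCommit() {
                // Runs only after Oracle commits, so a MySQL failure can never roll back
                // or delay the primary transaction.
                replay(statements);
            }
        });
    }

    private void replay(List<CapturedStatement> statements) {
        for (CapturedStatement stmt : statements) {
            try {
                mysqlJdbcTemplate.update(stmt.sql(), stmt.params());
            } catch (RuntimeException e) {
                // Secondary-write failures must not surface to the caller; the real
                // implementation would record these for later reconciliation.
                System.err.println("MySQL dual-write failed: " + e.getMessage());
            }
        }
    }
}
```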
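The JPA changes the post mentions are mostly annotation-level. A minimal sketch with a hypothetical `Product` entity, shown with `jakarta.persistence` (the `javax` variant differs only in the package); actual column names and types vary per table.

```java
import java.math.BigDecimal;

import jakarta.persistence.Column;
import jakarta.persistence.Entity;
import jakarta.persistence.GeneratedValue;
import jakarta.persistence.GenerationType;
import jakarta.persistence.Id;

@Entity
public class Product {

    // Previously this would typically be GenerationType.SEQUENCE backed by an Oracle
    // sequence; MySQL auto-increment maps to IDENTITY instead.
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    // Oracle CLOB-style columns are remapped to MySQL text/longtext.
    @Column(columnDefinition = "longtext")
    private String description;

    // Oracle NUMBER columns become explicit MySQL decimal definitions.
    @Column(columnDefinition = "decimal(19,2)")
    private BigDecimal price;

    protected Product() {} // required by JPA
}
```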
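For MyBatis, the summary only says the dual write was centralized around `Configuration` and `MappedStatement` rather than individual mappers; the exact mechanism is not shown. One plausible shape is a standard MyBatis plugin that intercepts `Executor.update` (the entry point for all mapped INSERT/UPDATE/DELETE statements) and hands the resolved SQL to the same replay path as the JPA side. Treat this as an illustration under those assumptions, not the article's actual code.

```java
import java.util.Properties;

import org.apache.ibatis.executor.Executor;
import org.apache.ibatis.mapping.BoundSql;
import org.apache.ibatis.mapping.MappedStatement;
import org.apache.ibatis.plugin.Interceptor;
import org.apache.ibatis.plugin.Intercepts;
import org.apache.ibatis.plugin.Invocation;
import org.apache.ibatis.plugin.Plugin;
import org.apache.ibatis.plugin.Signature;

// Executor.update() covers every mapped INSERT/UPDATE/DELETE, so one plugin
// registration on the Configuration replaces per-mapper changes.
@Intercepts(@Signature(type = Executor.class, method = "update",
        args = {MappedStatement.class, Object.class}))
public class DualWriteMyBatisInterceptor implements Interceptor {

    // Hypothetical collaborator shared with the JPA path (see the earlier sketches).
    public interface SqlReplayBuffer {
        void capture(String sql, Object parameter);
    }

    private final SqlReplayBuffer replayBuffer;

    public DualWriteMyBatisInterceptor(SqlReplayBuffer replayBuffer) {
        this.replayBuffer = replayBuffer;
    }

    @Override
    public Object intercept(Invocation invocation) throws Throwable {
        MappedStatement mappedStatement = (MappedStatement) invocation.getArgs()[0];
        Object parameter = invocation.getArgs()[1];

        // Let Oracle execute first; capture the statement only if it succeeds.
        Object result = invocation.proceed();

        BoundSql boundSql = mappedStatement.getBoundSql(parameter);
        replayBuffer.capture(boundSql.getSql(), parameter);
        return result;
    }

    @Override
    public Object plugin(Object target) {
        return Plugin.wrap(target, this);
    }

    @Override
    public void setProperties(Properties properties) {
        // no configurable properties in this sketch
    }
}
```

Registering it once, for example via `sqlSessionFactory.getConfiguration().addInterceptor(new DualWriteMyBatisInterceptor(buffer))`, applies the capture to every mapper without touching business code, and removing that single registration retires the dual write once the migration stabilizes.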

naver

Replacing a DB CDC Replication Tool That Handles Tens of Trillions of Won a Year, with Zero Downtime and Zero Failures, in 6 Months

Naver Pay successfully transitioned its core database replication system from a legacy tool to "ergate," a high-performance CDC (Change Data Capture) solution built on Apache Flink and Spring. This strategic overhaul was designed to improve maintainability for backend developers while resolving rigid schema dependencies that previously caused operational bottlenecks. By leveraging a modern stream-processing architecture, the system now manages massive transaction volumes with sub-second latency and enhanced reliability.

### Limitations of the Legacy System

* **Maintenance Barriers:** The previous tool, mig-data, was written in pure Java by database core specialists, making it difficult for standard backend developers to maintain or extend.
* **Strict Schema Dependency:** Developers were forced to follow a rigid DDL execution order (Target DB before Source DB) to avoid replication halts, complicating database operations.
* **Blocking Failures:** Because the legacy system prioritized bi-directional data integrity, a single failed record could stall the entire replication pipeline for a specific shard.
* **Operational Risk:** Recovery procedures were manual and restricted to a small group of specialized personnel, increasing the time to recovery during outages.

### Technical Architecture and Stack

* **Apache Flink (LTS 2.0.0):** Selected for its high availability, low latency, and native Kafka integration, allowing the team to focus on replication logic rather than infrastructure.
* **Kubernetes Session Mode:** Used to manage 12 concurrent jobs (6 replication, 6 verification) through a single Job Manager endpoint for streamlined monitoring and deployment.
* **Hybrid Framework Approach:** The team isolated high-speed replication logic within Flink while using Spring (Kotlin) for complex recovery modules to leverage developer familiarity.
* **Data Pipeline:** The system captures MySQL binlogs via `nbase-cdc`, publishes them to Kafka, and uses Flink `jdbc-sink` jobs to apply changes to the Target DBs (nBase-T and Oracle). A minimal pipeline sketch follows this summary.

### Three-Tier Operational Model: Replication, Verification, and Recovery

* **Real-time Replication:** Processes incoming Kafka records and appends custom metadata columns (`ergate_yn`, `rpc_time`) to track the replication source and original commit time.
* **Delayed Verification:** A dedicated "verifier" Flink job consumes the same Kafka topic with a 2-minute delay to check Target DB consistency against the source record.
* **Secondary Logic:** To prevent false positives caused by rapid successive updates, the verifier re-queries the Source DB live if a mismatch is initially detected (a sketch of this check follows the pipeline sketch below).
* **Multi-Stage Recovery:**
  * **Automatic Short-term:** Retries transient failures after 5 minutes.
  * **Automatic Long-term:** Uses batch processes to resolve persistent discrepancies.
  * **Manual:** Provides an admin interface for developers to trigger targeted reconciliations via API.

### Improvements in Schema Management and Performance

* **DDL Independence:** By implementing query and schema caching, ergate allows Source and Target tables to be updated in any order without halting the pipeline.
* **Performance Scaling:** The new system is designed to handle 10x the current peak QPS, ensuring stability even during high-traffic events such as major sales and promotions.
* **Metadata Tracking:** The inclusion of specific replication identifiers allows a clear distinction between automated replication and manual force-sync actions during troubleshooting.
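As a rough illustration of the replication path described above (Kafka-sourced binlog events written to the Target DB with `ergate_yn` and `rpc_time` metadata), here is a minimal DataStream sketch. It uses the 1.x-style `JdbcSink.sink` helper from `flink-connector-jdbc` for brevity, whereas ergate runs on Flink 2.0 with its newer sink API; the topic, table, columns, and `ChangeEvent` shape are hypothetical, and sharding, multi-table routing, and error handling are omitted.

```java
import java.sql.Timestamp;

import com.fasterxml.jackson.databind.ObjectMapper;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.jdbc.JdbcConnectionOptions;
import org.apache.flink.connector.jdbc.JdbcExecutionOptions;
import org.apache.flink.connector.jdbc.JdbcSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ReplicationJob {

    // Hypothetical shape of one CDC record published by the binlog capture layer.
    public static class ChangeEvent {
        public long id;
        public String status;
        public long commitTimeMillis; // original commit time on the Source DB
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")            // hypothetical endpoint
                .setTopics("pay-cdc-orders")                  // hypothetical topic
                .setGroupId("ergate-replication")
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        ObjectMapper mapper = new ObjectMapper();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "cdc-events")
                .map(json -> mapper.readValue(json, ChangeEvent.class))
                .returns(ChangeEvent.class)
                .addSink(JdbcSink.<ChangeEvent>sink(
                        // Metadata columns mark rows written by ergate and preserve the
                        // original commit time for the delayed verifier.
                        "REPLACE INTO orders (id, status, ergate_yn, rpc_time) VALUES (?, ?, 'Y', ?)",
                        (statement, event) -> {
                            statement.setLong(1, event.id);
                            statement.setString(2, event.status);
                            statement.setTimestamp(3, new Timestamp(event.commitTimeMillis));
                        },
                        JdbcExecutionOptions.builder()
                                .withBatchSize(200)
                                .withBatchIntervalMs(200)
                                .withMaxRetries(3)
                                .build(),
                        new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
                                .withUrl("jdbc:mysql://target-db:3306/pay")  // hypothetical target
                                .withDriverName("com.mysql.cj.jdbc.Driver")
                                .withUsername("ergate")
                                .withPassword("***")
                                .build()));

        env.execute("ergate-replication-sketch");
    }
}
```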
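The interesting wrinkle in verification is the false-positive guard: when the delayed check finds a mismatch, the verifier re-queries the Source DB live before raising a discrepancy. A plain-Java sketch of that decision logic, assuming `javax.sql.DataSource` handles to both databases and a single hypothetical `orders` table; in ergate this runs inside the dedicated verifier Flink job, and mismatches feed the short-term, long-term, or manual recovery stages.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Objects;
import java.util.Optional;

import javax.sql.DataSource;

public class ConsistencyVerifier {

    public enum Result { MATCH, RESOLVED_ON_RECHECK, MISMATCH }

    private final DataSource sourceDb;
    private final DataSource targetDb;

    public ConsistencyVerifier(DataSource sourceDb, DataSource targetDb) {
        this.sourceDb = sourceDb;
        this.targetDb = targetDb;
    }

    // Called roughly two minutes after the original CDC record was produced.
    public Result verify(long orderId, String statusFromCdcRecord) throws SQLException {
        Optional<String> targetStatus = readStatus(targetDb, orderId);

        // Happy path: the target row already reflects the captured change.
        if (targetStatus.isPresent() && Objects.equals(targetStatus.get(), statusFromCdcRecord)) {
            return Result.MATCH;
        }

        // Secondary logic: the row may simply have been updated again since the CDC
        // record was captured, so compare against the current source row before
        // declaring a discrepancy.
        Optional<String> currentSourceStatus = readStatus(sourceDb, orderId);
        if (Objects.equals(currentSourceStatus, targetStatus)) {
            return Result.RESOLVED_ON_RECHECK;
        }

        // Genuine divergence: hand off to the recovery pipeline.
        return Result.MISMATCH;
    }

    private Optional<String> readStatus(DataSource db, long orderId) throws SQLException {
        try (Connection conn = db.getConnection();
             PreparedStatement ps = conn.prepareStatement("SELECT status FROM orders WHERE id = ?")) {
            ps.setLong(1, orderId);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? Optional.of(rs.getString("status")) : Optional.empty();
            }
        }
    }
}
```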
The ergate project demonstrates that a hybrid architecture—combining the high-throughput processing of Apache Flink with the robust logic handling of Spring—is highly effective for mission-critical financial systems. Organizations managing large-scale data replication should consider decoupling complex recovery logic from the main processing stream to ensure both performance and developer productivity.