mysql

5 posts

naver

Smart Store Center's

Smart Store Center successfully migrated its legacy platform from Oracle to MySQL to overcome performance instability caused by resource contention and to reduce high licensing costs. By implementing a "dual write" strategy, the team achieved a zero-downtime transition while maintaining the ability to roll back immediately without data loss. This technical journey highlights the use of proxy data sources and transaction synchronization to ensure data integrity across disparate database environments.

## Zero-Downtime Migration via Dual Writing

* The migration strategy relied on "dual writing," where all Create, Update, and Delete (CUD) operations are performed on both the legacy Oracle and the new MySQL databases.
* In the pre-migration phase, Oracle served as the primary source for all traffic while MySQL recorded writes in the background to build a synchronized state.
* Once data was fully migrated and verified, the primary traffic was shifted to MySQL, with background writes continuing to Oracle to allow for an instantaneous rollback if performance issues occurred.
* This approach decoupled the database switch from application deployment, providing a safety net against critical failures that a simple redeploy could not fix.

## Technical Implementation for JPA

* To capture and replicate queries, the team utilized the `datasource-proxy` library, which allowed them to intercept Oracle queries and execute them against a separate MySQL DataSource.
* To prevent MySQL write failures from impacting the primary Oracle transactions, writes to the secondary database were managed using `TransactionSynchronizationManager`.
* By executing MySQL queries during the `afterCommit` phase, the team ensured that the primary service remained stable even if the secondary database encountered errors or performance bottlenecks (see the first sketch after this summary).
* The transition required modifying JPA Entity configurations, such as changing primary key generation from Oracle Sequences to MySQL's `IDENTITY` (auto-increment) and adjusting `columnDefinition` for types like `text`, `longtext`, and `decimal` (see the entity sketch below).

## Centralized MyBatis Strategy

* To avoid modifying thousands of business logic points in a 10-year-old codebase, the team sought a way to implement dual writing for MyBatis at the architectural level.
* The implementation focused on the MyBatis `Configuration` and `MappedStatement` objects to capture SQL execution without requiring manual updates to individual repository interfaces (see the interceptor sketch below).
* This centralized approach maintained the purity of the business logic and ensured that the dual-write logic could be easily removed once the migration was fully stabilized.

For organizations managing large-scale legacy migrations, the dual-write pattern combined with asynchronous transaction synchronization is a highly recommended safety mechanism. Prioritizing the isolation of secondary database failures ensures that the user experience remains unaffected while technical validation is performed in real time.
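A minimal Spring/Java sketch of the `afterCommit` approach described above: the replayed SQL is assumed to arrive from the `datasource-proxy` listener, and the `secondaryMysqlJdbcTemplate` bean and method names are illustrative rather than the team's actual code.

```java
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.transaction.support.TransactionSynchronization;
import org.springframework.transaction.support.TransactionSynchronizationManager;

public class DualWriteReplayer {

    // Hypothetical JdbcTemplate wired to the secondary MySQL DataSource.
    private final JdbcTemplate secondaryMysqlJdbcTemplate;

    public DualWriteReplayer(JdbcTemplate secondaryMysqlJdbcTemplate) {
        this.secondaryMysqlJdbcTemplate = secondaryMysqlJdbcTemplate;
    }

    /**
     * Registers the captured SQL to be replayed against MySQL only after the
     * primary Oracle transaction commits, so a MySQL failure cannot roll back
     * or delay the primary write.
     */
    public void replayAfterCommit(String sql, Object... params) {
        if (!TransactionSynchronizationManager.isSynchronizationActive()) {
            return; // No active primary transaction to piggyback on.
        }
        // Spring 5.3+: TransactionSynchronization methods have defaults, so only afterCommit is overridden.
        TransactionSynchronizationManager.registerSynchronization(new TransactionSynchronization() {
            @Override
            public void afterCommit() {
                try {
                    secondaryMysqlJdbcTemplate.update(sql, params);
                } catch (Exception e) {
                    // Secondary failures are logged and reconciled later;
                    // they never propagate to the caller.
                }
            }
        });
    }
}
```

Because the synchronization only runs after the Oracle commit succeeds, a slow or failing MySQL write can at worst lag behind; it can never roll back the primary transaction.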
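The JPA entity adjustments listed above can be sketched roughly as follows; the entity, fields, and exact column definitions are illustrative assumptions (using `jakarta.persistence`; older stacks would use `javax.persistence`), not the actual Smart Store schema.

```java
import java.math.BigDecimal;

import jakarta.persistence.Column;
import jakarta.persistence.Entity;
import jakarta.persistence.GeneratedValue;
import jakarta.persistence.GenerationType;
import jakarta.persistence.Id;

@Entity
public class OrderMemo {

    // Before: @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "ORDER_MEMO_SEQ")
    // After: rely on MySQL AUTO_INCREMENT instead of an Oracle sequence.
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    // Oracle CLOB-style columns are remapped to MySQL text types via columnDefinition.
    @Column(columnDefinition = "longtext")
    private String content;

    // Oracle NUMBER columns with scale become MySQL decimal columns.
    @Column(columnDefinition = "decimal(19,2)")
    private BigDecimal amount;
}
```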
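The post centralizes MyBatis dual writing around `Configuration` and `MappedStatement` rather than individual mappers; one hedged way to reach the `MappedStatement` for every write is a MyBatis plugin on `Executor.update`, sketched below. The replay hook is a placeholder, and the exact mechanism used in the post may differ.

```java
import java.util.Properties;

import org.apache.ibatis.executor.Executor;
import org.apache.ibatis.mapping.BoundSql;
import org.apache.ibatis.mapping.MappedStatement;
import org.apache.ibatis.plugin.Interceptor;
import org.apache.ibatis.plugin.Intercepts;
import org.apache.ibatis.plugin.Invocation;
import org.apache.ibatis.plugin.Plugin;
import org.apache.ibatis.plugin.Signature;

// Intercepts every INSERT/UPDATE/DELETE executed through MyBatis, so dual writing
// is applied in one place instead of in thousands of mappers.
@Intercepts(@Signature(type = Executor.class, method = "update",
        args = {MappedStatement.class, Object.class}))
public class DualWriteInterceptor implements Interceptor {

    @Override
    public Object intercept(Invocation invocation) throws Throwable {
        MappedStatement ms = (MappedStatement) invocation.getArgs()[0];
        Object parameter = invocation.getArgs()[1];

        // Let the original Oracle write run first.
        Object result = invocation.proceed();

        // Resolve the SQL that just ran and hand it to the dual-write replayer
        // (for example, the afterCommit mechanism sketched earlier).
        BoundSql boundSql = ms.getBoundSql(parameter);
        String sql = boundSql.getSql();
        // dualWriteReplayer.replayAfterCommit(sql, ...);  // placeholder

        return result;
    }

    @Override
    public Object plugin(Object target) {
        return Plugin.wrap(target, this);
    }

    @Override
    public void setProperties(Properties properties) {
        // No configuration needed for this sketch.
    }
}
```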

toss

From Legacy Payment Ledger to Scalable

Toss Payments successfully modernized a 20-year-old legacy payment ledger by transitioning to a decoupled, MySQL-based architecture designed for high scalability and consistency. By implementing strategies like INSERT-only immutability and event-driven domain isolation, they overcame structural limitations such as the inability to handle split payments. Ultimately, the project demonstrates that robust system design must be paired with resilient operational recovery mechanisms to manage the complexities of large-scale financial migrations.

### Legacy Ledger Challenges

* **Inconsistent Schemas:** Different payment methods used entirely different table structures; for instance, a table named `REFUND` unexpectedly contained only account transfer data rather than all refund types.
* **Domain Coupling:** Multiple domains (settlement, accounting, and payments) shared the same tables and columns, meaning a single schema change required impact analysis across several teams.
* **Structural Limits:** A rigid 1:1 relationship between a payment and its method prevented the implementation of modern features like split payments or "Dutch pay" models.

### New Ledger Architecture

* **Data Immutability:** The system shifted from updating existing rows to an **INSERT-only** principle, ensuring a reliable audit trail and preventing database deadlocks.
* **Event-Driven Decoupling:** Instead of direct database access, the system uses Kafka to publish payment events, allowing independent domains to consume data without tight coupling.
* **Payment-Approval Separation:** By separating the "Payment" (the transaction intent) from the "Approval" (the specific financial method), the system now supports multiple payment methods per transaction (see the model sketch after this summary).

### Safe Migration and Data Integrity

* **Asynchronous Mirroring:** To maintain zero downtime, data was initially written to the legacy system and then asynchronously loaded into the new MySQL ledger.
* **Resource Tuning:** Developers used dedicated migration servers within the same AWS Availability Zone to minimize latency and implemented **Bulk Inserts** to handle hundreds of millions of rows efficiently (see the bulk-load sketch below).
* **Verification Batches:** A separate batch process ran every five minutes against a Read-Only (RO) database to identify and correct any data gaps caused by asynchronous processing failures.

### Operational Resilience and Incident Response

* **Query Optimization:** During a load spike, the MySQL optimizer chose full scans over indexes; the team resolved this by implementing SQL hints and keeping a five-version Docker image history for rapid rollbacks.
* **Network Cancellation:** To handle timeouts between Toss and external card issuers, the system uses specific logic to automatically send cancellation requests and synchronize states.
* **Timeout Standardization:** Discrepancies between microservices were resolved by calculating the maximum processing time of approval servers and aligning all upstream timeout settings to prevent merchant response mismatches.
* **Reliable Event Delivery:** While using the **Outbox pattern** for events, the team added log-based recovery (Elasticsearch and local disk) and idempotency keys in event headers to handle both missing and duplicate messages (see the consumer sketch below).

For organizations tackling significant technical debt, this transition highlights that initial design is only half the battle. True system reliability comes from building "self-healing" structures, such as automated correction batches and standardized timeout chains, that can survive the unpredictable nature of live production environments.
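A minimal model sketch of the Payment/Approval split and the INSERT-only principle; all names are illustrative assumptions and persistence concerns are deliberately ignored.

```java
import java.math.BigDecimal;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// One Payment (the transaction intent) now owns many Approvals (concrete methods),
// which is what makes split payments possible.
record Approval(String approvalKey, String method, BigDecimal amount, Instant approvedAt) {}

final class Payment {
    private final String paymentKey;
    private final List<Approval> approvals = new ArrayList<>();

    Payment(String paymentKey) {
        this.paymentKey = paymentKey;
    }

    // INSERT-only: state changes are recorded by appending new rows (for example a
    // negative "reversal" approval) rather than updating or deleting existing ones.
    void append(Approval approval) {
        approvals.add(approval);
    }

    BigDecimal balance() {
        return approvals.stream()
                .map(Approval::amount)
                .reduce(BigDecimal.ZERO, BigDecimal::add);
    }

    String paymentKey() {
        return paymentKey;
    }
}
```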
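The "Bulk Insert" step can be pictured with a plain JDBC batch sketch like the one below; the table, columns, credentials, and batch size are assumptions, and on MySQL the `rewriteBatchedStatements=true` connection flag is what collapses the batch into multi-row INSERT statements.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

public class LedgerBulkLoader {

    private static final int BATCH_SIZE = 1_000;

    // Loads migrated ledger rows in large batches instead of row-by-row INSERTs.
    public static void load(List<Object[]> rows) throws Exception {
        String url = "jdbc:mysql://new-ledger:3306/ledger?rewriteBatchedStatements=true";
        String sql = "INSERT INTO ledger_entry (payment_key, amount, created_at) VALUES (?, ?, ?)";

        try (Connection conn = DriverManager.getConnection(url, "migrator", "password");
             PreparedStatement ps = conn.prepareStatement(sql)) {
            conn.setAutoCommit(false);
            int count = 0;
            for (Object[] row : rows) {
                ps.setObject(1, row[0]);
                ps.setObject(2, row[1]);
                ps.setObject(3, row[2]);
                ps.addBatch();
                if (++count % BATCH_SIZE == 0) {
                    ps.executeBatch(); // flush a full batch to keep memory bounded
                    conn.commit();
                }
            }
            ps.executeBatch();         // flush the final partial batch
            conn.commit();
        }
    }
}
```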
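For the duplicate-delivery side of the Outbox discussion, here is a hedged sketch of consumer-side idempotency keyed on an event header; the topic name, header name, and in-memory key set are illustrative only (a real implementation would keep processed keys in durable storage).

```java
import java.time.Duration;
import java.util.HashSet;
import java.util.List;
import java.util.Properties;
import java.util.Set;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.header.Header;
import org.apache.kafka.common.serialization.StringDeserializer;

public class IdempotentLedgerConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");
        props.put("group.id", "ledger-consumer");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        // Illustrative in-memory store of already-processed idempotency keys.
        Set<String> processedKeys = new HashSet<>();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("payment-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    Header keyHeader = record.headers().lastHeader("idempotency-key");
                    if (keyHeader == null) {
                        continue; // Events without a key cannot be deduplicated.
                    }
                    String idempotencyKey = new String(keyHeader.value());
                    if (!processedKeys.add(idempotencyKey)) {
                        continue; // Duplicate delivery: skip instead of double-posting.
                    }
                    // Apply the event to the ledger here.
                }
            }
        }
    }
}
```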

naver

Replacing a DB CDC replication tool that processes

Naver Pay successfully transitioned its core database replication system from a legacy tool to "ergate," a high-performance CDC (Change Data Capture) solution built on Apache Flink and Spring. This strategic overhaul was designed to improve maintainability for backend developers while resolving rigid schema dependencies that previously caused operational bottlenecks. By leveraging a modern stream-processing architecture, the system now manages massive transaction volumes with sub-second latency and enhanced reliability.

### Limitations of the Legacy System

* **Maintenance Barriers:** The previous tool, mig-data, was written in pure Java by database core specialists, making it difficult for standard backend developers to maintain or extend.
* **Strict Schema Dependency:** Developers were forced to follow a rigid DDL execution order (Target DB before Source DB) to avoid replication halts, complicating database operations.
* **Blocking Failures:** Because the legacy system prioritized bi-directional data integrity, a single failed record could stall the entire replication pipeline for a specific shard.
* **Operational Risk:** Recovery procedures were manual and restricted to a small group of specialized personnel, increasing the time-to-recovery during outages.

### Technical Architecture and Stack

* **Apache Flink (LTS 2.0.0):** Selected for its high availability, low latency, and native Kafka integration, allowing the team to focus on replication logic rather than infrastructure.
* **Kubernetes Session Mode:** Used to manage 12 concurrent jobs (6 replication, 6 verification) through a single Job Manager endpoint for streamlined monitoring and deployment.
* **Hybrid Framework Approach:** The team isolated high-speed replication logic within Flink while using Spring (Kotlin) for complex recovery modules to leverage developer familiarity.
* **Data Pipeline:** The system captures MySQL binlogs via `nbase-cdc`, publishes them to Kafka, and uses Flink `jdbc-sink` jobs to apply changes to Target DBs (nBase-T and Oracle); a rough sketch of this leg follows the summary.

### Three-Tier Operational Model: Replication, Verification, and Recovery

* **Real-time Replication:** Processes incoming Kafka records and appends custom metadata columns (`ergate_yn`, `rpc_time`) to track the replication source and original commit time.
* **Delayed Verification:** A dedicated "verifier" Flink job consumes the same Kafka topic with a 2-minute delay to check Target DB consistency against the source record.
* **Secondary Logic:** To prevent false positives from rapid updates, the verifier performs a live re-query of the Source DB if a mismatch is initially detected.
* **Multi-Stage Recovery:**
  * **Automatic Short-term:** Retries transient failures after 5 minutes.
  * **Automatic Long-term:** Uses batch processes to resolve persistent discrepancies.
  * **Manual:** Provides an admin interface for developers to trigger targeted reconciliations via API.

### Improvements in Schema Management and Performance

* **DDL Independence:** By implementing query and schema caching, ergate allows Source and Target tables to be updated in any order without halting the pipeline.
* **Performance Scaling:** The new system is designed to handle 10x the current peak QPS, ensuring stability even during high-traffic events like major sales or promotions.
* **Metadata Tracking:** The inclusion of specific replication identifiers allows for a clear distinction between automated replication and manual force-sync actions during troubleshooting.

The ergate project demonstrates that a hybrid architecture, combining the high-throughput processing of Apache Flink with the robust logic handling of Spring, is highly effective for mission-critical financial systems. Organizations managing large-scale data replication should consider decoupling complex recovery logic from the main processing stream to ensure both performance and developer productivity.
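As a rough illustration of the replication leg (Kafka in, JDBC out, with the `ergate_yn` and `rpc_time` metadata columns), a minimal Flink DataStream sketch follows. The topic, table, and connection details are invented, CDC payload parsing is omitted, and the connector calls shown come from the standard Flink Kafka/JDBC connector APIs in their 1.x-era DataStream style, not from ergate itself.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.jdbc.JdbcConnectionOptions;
import org.apache.flink.connector.jdbc.JdbcExecutionOptions;
import org.apache.flink.connector.jdbc.JdbcSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ReplicationJobSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Change records captured from the source DB binlog are read from Kafka.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setTopics("pay-cdc-changes")
                .setGroupId("ergate-replicator")
                // Start from the beginning for the sketch; a real job resumes from checkpoints.
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "cdc-source")
                // In the real pipeline the CDC payload is parsed into columns; here the raw
                // record is written as-is, and rpc_time would come from the source commit time.
                .addSink(JdbcSink.sink(
                        "INSERT INTO target_table (payload, ergate_yn, rpc_time) VALUES (?, 'Y', NOW(3))",
                        (statement, record) -> statement.setString(1, record),
                        JdbcExecutionOptions.builder().withBatchSize(200).build(),
                        new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
                                .withUrl("jdbc:mysql://target-db:3306/pay")
                                .withDriverName("com.mysql.cj.jdbc.Driver")
                                .build()));

        env.execute("ergate-replication-sketch");
    }
}
```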

line

Replacing the Payment System DB Handling

The LINE Billing Platform successfully migrated its large-scale payment database from Nbase-T to Vitess to handle high-traffic global transactions. While initially exploring gRPC for its performance reputation, the team transitioned to the MySQL protocol to ensure stability and reduce CPU overhead within their Java-based environment. This implementation demonstrates how Vitess can manage complex sharding requirements while maintaining high availability through automated recovery tools.

### Protocol Selection and Implementation

- The team initially attempted to use the gRPC protocol but encountered `http2: frame too large` errors and significant CPU overhead during performance testing.
- Manual mapping of query results to Java objects proved cumbersome with the Vitess gRPC client, leading to a shift toward the more mature and recommended MySQL protocol.
- Using the MySQL protocol allowed the team to leverage standard database drivers while benefiting from Vitess's routing capabilities via VTGate (see the JDBC sketch after this summary).

### Keyspace Architecture and Data Routing

- The system utilizes a dual-keyspace strategy: a "Global Keyspace" for unsharded metadata and a "Service Keyspace" for sharded transaction data.
- The Global Keyspace manages sharding keys using a "sequence" table type to ensure unique, auto-incrementing identifiers across the platform.
- The Service Keyspace is partitioned into N shards using a hash-based Vindex, which distributes coin balances and transaction history.
- VTGate automatically routes queries to the correct shard by analyzing the sharding key in the `WHERE` clause or `INSERT` statement, minimizing cross-shard overhead.

### MySQL Compatibility and Transaction Logic

- Vitess maintains `REPEATABLE READ` isolation for single-shard transactions, while multi-shard transactions default to `READ COMMITTED`.
- Advanced features like Two-Phase Commit (2PC) are available for handling distributed transactions across multiple shards.
- Query execution plans are analyzed using `VEXPLAIN` and `VTEXPLAIN`, often managed through the VTAdmin web interface for better visibility.
- Certain limitations apply, such as temporary tables only being supported in unsharded keyspaces and specific unsupported SQL cases documented in the Vitess core.

### Automated Operations and Monitoring

- The team employs VTOrc (based on Orchestrator) to automatically detect and repair database failures, such as unreachable primaries or replication stops.
- Monitoring is centralized via Prometheus, which scrapes metrics from VTOrc, VTGate, and VTTablet components at dedicated ports (e.g., 16000).
- Real-time alerts are routed through Slack and email, using `tablet_alias` to specifically identify which MySQL node or VTTablet is experiencing issues.
- A web-based recovery dashboard provides a history of automated fixes, allowing operators to track the health of the cluster over time.

For organizations migrating high-traffic legacy systems to a cloud-native sharding solution, prioritizing the MySQL protocol over gRPC is recommended for better compatibility with existing application frameworks and reduced operational complexity.
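Because VTGate speaks the MySQL protocol, the application side of this setup is essentially ordinary JDBC. The sketch below shows a single-shard lookup routed by the sharding key; the host, keyspace, table, and column names are placeholders rather than LINE's actual configuration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class VtgateQueryExample {

    public static void main(String[] args) throws Exception {
        // The standard MySQL driver connects to VTGate as if it were a single MySQL server.
        String url = "jdbc:mysql://vtgate:3306/service_keyspace";

        try (Connection conn = DriverManager.getConnection(url, "app_user", "app_password")) {
            // Because user_id is the sharding key (hash Vindex), VTGate can route this
            // query to exactly one shard instead of scattering it to all of them.
            String sql = "SELECT balance FROM coin_balance WHERE user_id = ?";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setLong(1, 12345L);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println("balance = " + rs.getLong("balance"));
                    }
                }
            }
        }
    }
}
```

Prefixing the same statement with `VEXPLAIN PLAN` over the same connection is one way to confirm that the query really stays on a single shard.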

line

Replacing the Database of a Payment System

The LINE Billing Platform team recently migrated its core payment database from Nbase-T to Vitess to address rising licensing costs while maintaining the high availability required for financial transactions. After a rigorous Proof of Concept (PoC) evaluating Apache ShardingSphere, TiDB, and Vitess, the team selected Vitess for its mature sharding capabilities and its ability to provide a stable, scalable environment on bare-metal infrastructure. This migration ensures the platform can handle large-scale traffic efficiently without the financial burden of proprietary license fees.

### Evaluation of Alternative Sharding Solutions

Before settling on Vitess, the team analyzed other prominent distributed database technologies to determine their fit for a high-stakes payment system:

* **Apache ShardingSphere:** While it offers flexible Proxy and JDBC layers, it was excluded because it requires significant manual effort for data resharding and rebalancing. The management overhead for implementing shard-key logic across various components (API, batch, admin) was deemed too high.
* **TiDB:** This MySQL-compatible distributed database uses a decoupled architecture consisting of TiDB (SQL layer), PD (metadata management), and TiKV (row-based storage). Its primary advantage is automatic rebalancing and the lack of a required shard key, which significantly reduces DBA operational costs.
* **Nbase-T:** The legacy system provided the highest performance efficiency per resource unit; however, the shift from a free to a paid licensing model necessitated the move to an open-source alternative.

### Vitess Architecture and Core Components

Vitess was chosen for its proven track record at companies like YouTube and GitHub, offering a robust abstraction layer that makes a clustered database appear as a single instance to the application. The system relies on several specialized components:

* **VTGate:** A proxy server that routes queries to the correct VTTablet, manages distributed transactions, and hides the physical topology of the database from the application.
* **VTTablet:** A sidecar process running alongside each MySQL instance that manages query execution, data replication, and connection pooling.
* **VTOrc and Topology Server:** High availability is managed by VTOrc (an automated failover tool), while metadata about shard locations and node status is synchronized via a topology server using ZooKeeper or etcd.

### PoC Performance and Environment Setup

The team conducted performance testing by simulating real payment API scenarios (a mix of reads and writes) on standardized hardware (8 vCPUs, 16 GB RAM).

* **Comparison Metrics:** The tests focused on Transactions Per Second (TPS) and resource utilization as thread counts increased.
* **Infrastructure Strategy:** Because payment systems cannot tolerate even brief failover delays, the team opted for a bare-metal deployment rather than a containerized one to ensure maximum stability and performance.
* **Resource Efficiency:** While Nbase-T showed the best raw efficiency, Vitess demonstrated the necessary scalability and management features required to replace the legacy system effectively within the new cost constraints.

### Practical Recommendation

For organizations managing critical core systems that require horizontal scaling without proprietary lock-in, Vitess is a highly recommended solution. While it requires a deep understanding of its various components (like VTGate and VTTablet) and careful configuration of its topology server, the trade-off is a mature, cloud-native-ready architecture that supports massive scale and automated failover on both bare-metal and cloud environments.