toss

From Legacy Payment Ledger to Scalable

Toss Payments successfully modernized a 20-year-old legacy payment ledger by transitioning to a decoupled, MySQL-based architecture designed for high scalability and consistency. By implementing strategies like INSERT-only immutability and event-driven domain isolation, they overcame structural limitations such as the inability to handle split payments. Ultimately, the project demonstrates that robust system design must be paired with resilient operational recovery mechanisms to manage the complexities of large-scale financial migrations.

Legacy Ledger Challenges

  • Inconsistent Schemas: Different payment methods used entirely different table structures; for instance, a table named REFUND unexpectedly contained only account transfer data rather than all refund types.
  • Domain Coupling: Multiple domains (settlement, accounting, and payments) shared the same tables and columns, meaning a single schema change required impact analysis across several teams.
  • Structural Limits: A rigid 1:1 relationship between a payment and its method prevented the implementation of modern features like split payments or "Dutch pay" models.

New Ledger Architecture

  • Data Immutability: The system shifted from updating existing rows to an INSERT-only principle: every state change is recorded as a new row, which preserves a reliable audit trail and avoids the deadlocks that concurrent updates to the same rows can cause.
  • Event-Driven Decoupling: Instead of direct database access, the system uses Kafka to publish payment events, allowing independent domains to consume data without tight coupling.
  • Payment-Approval Separation: By separating the "Payment" (the transaction intent) from the "Approval" (the specific financial method), the system now supports multiple payment methods per transaction.
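The INSERT-only and Payment-Approval ideas above can be illustrated with a minimal in-memory sketch. It is not the Toss Payments schema: the `LedgerEntry` fields, the `Ledger` class, and the method values such as "CARD" and "POINT" are all hypothetical names chosen for illustration. The point is only that a cancellation is appended as a new row rather than applied as an update, and that one payment can carry several approvals.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import List, Tuple


@dataclass(frozen=True)
class LedgerEntry:
    """One immutable ledger row (hypothetical shape, not the real schema)."""
    payment_id: str    # the transaction intent
    approval_id: str   # one approval per payment method used
    method: str        # illustrative values: "CARD", "POINT"
    entry_type: str    # "APPROVE" or "CANCEL"
    amount: Decimal    # always positive; entry_type carries the sign


class Ledger:
    def __init__(self) -> None:
        self._rows: List[LedgerEntry] = []

    def append(self, entry: LedgerEntry) -> None:
        # INSERT-only: appending is the only write this class offers.
        self._rows.append(entry)

    def history(self, payment_id: str) -> Tuple[LedgerEntry, ...]:
        # The full audit trail for one payment, in insertion order.
        return tuple(r for r in self._rows if r.payment_id == payment_id)

    def net_amount(self, payment_id: str) -> Decimal:
        # Current state is derived by summing rows, never stored by UPDATE.
        total = Decimal("0")
        for row in self.history(payment_id):
            sign = 1 if row.entry_type == "APPROVE" else -1
            total += sign * row.amount
        return total


ledger = Ledger()
# A split payment: one payment, two approvals with different methods.
ledger.append(LedgerEntry("pay-1", "apr-1", "CARD", "APPROVE", Decimal("7000")))
ledger.append(LedgerEntry("pay-1", "apr-2", "POINT", "APPROVE", Decimal("3000")))
# Cancelling the point portion adds a row instead of modifying apr-2.
ledger.append(LedgerEntry("pay-1", "apr-2", "POINT", "CANCEL", Decimal("3000")))

net = ledger.net_amount("pay-1")
```

Here `net` ends up as 7000 while `ledger.history("pay-1")` still returns all three rows, so the cancelled approval remains visible for auditing.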

Safe Migration and Data Integrity

  • Asynchronous Mirroring: To maintain zero downtime, data was initially written to the legacy system and then asynchronously loaded into the new MySQL ledger.
  • Resource Tuning: Developers used dedicated migration servers within the same AWS Availability Zone to minimize latency and implemented Bulk Inserts to handle hundreds of millions of rows efficiently.
  • Verification Batches: A separate batch process ran every five minutes against a Read-Only (RO) database to identify and correct any data gaps caused by asynchronous processing failures.
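A verification batch of the kind described above might look roughly like the following sketch. It uses Python's built-in `sqlite3` with two in-memory databases purely as stand-ins for the legacy store (read through a replica) and the new MySQL ledger. The table names `legacy_payment` and `ledger_entry`, the columns, and the integer timestamps are assumptions made for this example, not details from the source.

```python
import sqlite3
from typing import List


def find_missing_ids(legacy_ro: sqlite3.Connection, new_db: sqlite3.Connection,
                     since: int, until: int) -> List[str]:
    """Return IDs present in the legacy window but absent from the new ledger."""
    window = (since, until)
    legacy_ids = {row[0] for row in legacy_ro.execute(
        "SELECT id FROM legacy_payment WHERE created_at >= ? AND created_at < ?",
        window)}
    new_ids = {row[0] for row in new_db.execute(
        "SELECT id FROM ledger_entry WHERE created_at >= ? AND created_at < ?",
        window)}
    return sorted(legacy_ids - new_ids)


def backfill(legacy_ro: sqlite3.Connection, new_db: sqlite3.Connection,
             ids: List[str]) -> int:
    """Copy the missing rows into the new ledger with one bulk insert."""
    if not ids:
        return 0
    placeholders = ",".join("?" * len(ids))
    rows = legacy_ro.execute(
        f"SELECT id, amount, created_at FROM legacy_payment "
        f"WHERE id IN ({placeholders})", ids).fetchall()
    with new_db:  # commit the whole batch as a single transaction
        new_db.executemany(
            "INSERT INTO ledger_entry (id, amount, created_at) VALUES (?, ?, ?)",
            rows)
    return len(rows)


# Demo data: the asynchronous load "lost" payment p2.
legacy = sqlite3.connect(":memory:")
legacy.execute(
    "CREATE TABLE legacy_payment (id TEXT PRIMARY KEY, amount INTEGER, created_at INTEGER)")
legacy.executemany("INSERT INTO legacy_payment VALUES (?, ?, ?)",
                   [("p1", 1000, 10), ("p2", 2000, 20), ("p3", 3000, 30)])

new = sqlite3.connect(":memory:")
new.execute(
    "CREATE TABLE ledger_entry (id TEXT PRIMARY KEY, amount INTEGER, created_at INTEGER)")
new.executemany("INSERT INTO ledger_entry VALUES (?, ?, ?)",
                [("p1", 1000, 10), ("p3", 3000, 30)])

missing_before = find_missing_ids(legacy, new, since=0, until=300)
copied = backfill(legacy, new, missing_before)
missing_after = find_missing_ids(legacy, new, since=0, until=300)
```

In a scheduled job, the `since` and `until` arguments would slide forward with each five-minute run. The sketch compares only IDs; a real batch would presumably also compare amounts and statuses.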

Operational Resilience and Incident Response

  • Query Optimization: During a load spike, the MySQL optimizer chose full table scans over the available indexes; the team corrected the query plans with SQL hints and relied on a retained history of five Docker image versions for rapid rollbacks.
  • Network Cancellation: To handle timeouts between Toss and external card issuers, the system uses specific logic to automatically send cancellation requests and synchronize states.
  • Timeout Standardization: Mismatched timeout settings between microservices were resolved by calculating the maximum processing time of the approval servers and aligning all upstream timeout settings to it, preventing cases where the response returned to the merchant does not match the actual result.
  • Reliable Event Delivery: While using the Outbox pattern for events, the team added log-based recovery (Elasticsearch and local disk) and idempotency keys in event headers to handle both missing and duplicate messages.
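Two of the mechanisms above, the aligned timeout chain and duplicate-safe event handling, can be sketched as follows. Everything specific here is an assumption for illustration: the hop names, the fixed safety margin added per hop, the `idempotency-key` header name, and the in-memory set of seen keys (a production consumer would need a durable store).

```python
from typing import Callable, Dict, List, Set


def aligned_timeouts(hops: List[str], max_processing_ms: int,
                     margin_ms: int) -> Dict[str, int]:
    """Derive a timeout per hop, ordered from innermost to outermost.

    Each caller waits slightly longer than the service it calls, so an
    outer layer does not give up while an inner one is still working.
    """
    timeouts: Dict[str, int] = {}
    budget = max_processing_ms
    for hop in hops:
        budget += margin_ms
        timeouts[hop] = budget
    return timeouts


class IdempotentConsumer:
    """Applies each event at most once, keyed by a header value."""

    def __init__(self, handler: Callable[[str], None]) -> None:
        self._handler = handler
        self._seen: Set[str] = set()  # stand-in for a durable key store

    def consume(self, headers: Dict[str, str], payload: str) -> bool:
        key = headers["idempotency-key"]  # hypothetical header name
        if key in self._seen:
            return False  # duplicate delivery: skip without side effects
        self._handler(payload)
        self._seen.add(key)
        return True


# Timeout chain: approval servers are assumed to need at most 5000 ms.
timeouts = aligned_timeouts(
    ["approval-client", "payment-api", "gateway"],
    max_processing_ms=5000, margin_ms=500)

# Event handling: the second message is a redelivery of the first.
applied: List[str] = []
consumer = IdempotentConsumer(applied.append)
results = [
    consumer.consume({"idempotency-key": "evt-1"}, "pay-1 approved"),
    consumer.consume({"idempotency-key": "evt-1"}, "pay-1 approved"),
    consumer.consume({"idempotency-key": "evt-2"}, "pay-2 approved"),
]
```

With these inputs the outermost hop waits 6500 ms, and the duplicate `evt-1` is ignored, so `applied` holds only two entries. Recording the key after the handler succeeds means a crash mid-handling leads to a retry rather than a silently dropped event, which in turn requires the handler itself to tolerate being re-run.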

For organizations tackling significant technical debt, this transition highlights that initial design is only half the battle. True system reliability comes from building "self-healing" structures—such as automated correction batches and standardized timeout chains—that can survive the unpredictable nature of live production environments.