kafka

2 posts

toss

From Legacy Payment Ledger to Scalable

Toss Payments modernized a 20-year-old legacy payment ledger by moving to a decoupled, MySQL-based architecture designed for high scalability and consistency. By adopting strategies such as INSERT-only immutability and event-driven domain isolation, the team overcame structural limitations like the inability to handle split payments. The project demonstrates that robust system design must be paired with resilient operational recovery mechanisms to manage the complexity of large-scale financial migrations.

### Legacy Ledger Challenges

* **Inconsistent Schemas:** Different payment methods used entirely different table structures; for instance, a table named `REFUND` unexpectedly contained only account-transfer data rather than all refund types.
* **Domain Coupling:** Multiple domains (settlement, accounting, and payments) shared the same tables and columns, so a single schema change required impact analysis across several teams.
* **Structural Limits:** A rigid 1:1 relationship between a payment and its method prevented modern features such as split payments or "Dutch pay" models.

### New Ledger Architecture

* **Data Immutability:** The system shifted from updating existing rows to an **INSERT-only** principle, ensuring a reliable audit trail and avoiding database deadlocks.
* **Event-Driven Decoupling:** Instead of direct database access, the system publishes payment events through Kafka, allowing independent domains to consume data without tight coupling.
* **Payment-Approval Separation:** By separating the "Payment" (the transaction intent) from the "Approval" (the specific financial method), the system now supports multiple payment methods per transaction (a sketch of this split follows at the end of this summary).

### Safe Migration and Data Integrity

* **Asynchronous Mirroring:** To maintain zero downtime, data was first written to the legacy system and then asynchronously loaded into the new MySQL ledger.
* **Resource Tuning:** The team used dedicated migration servers in the same AWS Availability Zone to minimize latency and relied on **bulk inserts** to move hundreds of millions of rows efficiently.
* **Verification Batches:** A separate batch process ran every five minutes against a read-only (RO) database to detect and correct data gaps caused by asynchronous processing failures.

### Operational Resilience and Incident Response

* **Query Optimization:** During a load spike, the MySQL optimizer chose full scans over indexes; the team resolved this with SQL hints and kept a history of the last five Docker image versions for rapid rollbacks.
* **Network Cancellation:** To handle timeouts between Toss and external card issuers, the system automatically sends cancellation requests and synchronizes states.
* **Timeout Standardization:** Discrepancies between microservices were resolved by calculating the maximum processing time of the approval servers and aligning all upstream timeout settings, preventing mismatched responses to merchants.
* **Reliable Event Delivery:** While using the **Outbox pattern** for events, the team added log-based recovery (Elasticsearch and local disk) and idempotency keys in event headers to handle both missing and duplicate messages (see the producer sketch at the end of this summary).

For organizations tackling significant technical debt, this transition highlights that initial design is only half the battle. True system reliability comes from building "self-healing" structures, such as automated correction batches and standardized timeout chains, that can survive the unpredictable nature of live production environments.
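The payment-approval separation is easiest to see as a data model. Below is a minimal Java sketch; the type and field names (`Payment`, `Approval`, and their fields) are illustrative assumptions, not Toss's actual schema. The key idea is one transaction intent owning a list of approvals, which is what unlocks split payments.

```java
// A minimal sketch of the payment/approval split described above. The type
// and field names are illustrative assumptions, not Toss's actual schema.
import java.math.BigDecimal;
import java.util.List;

record Approval(String approvalId, String method, BigDecimal amount) {}

// One Payment (the transaction intent) can hold several Approvals (the
// financial methods), replacing the legacy 1:1 payment-to-method link.
record Payment(String paymentId, String merchantId, List<Approval> approvals) {
    BigDecimal totalApproved() {
        return approvals.stream()
                .map(Approval::amount)
                .reduce(BigDecimal.ZERO, BigDecimal::add);
    }
}

public class LedgerModelDemo {
    public static void main(String[] args) {
        // A split payment: one intent settled by a card approval plus points.
        Payment payment = new Payment("pay-1", "merchant-42", List.of(
                new Approval("apr-1", "CARD", new BigDecimal("30000")),
                new Approval("apr-2", "POINTS", new BigDecimal("5000"))));
        System.out.println(payment.totalApproved()); // prints 35000
    }
}
```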
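For the reliable-delivery point, here is a minimal Java sketch of attaching an idempotency key as a Kafka record header, assuming a plain `kafka-clients` producer; the topic name, key, and payload are placeholders, and the post's actual outbox relay, log-based recovery, and consumer-side dedup are not shown.

```java
// A hedged sketch of the idempotency-key header described above; topic name,
// payload, and key scheme are assumptions, not details from the Toss post.
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.nio.charset.StandardCharsets;
import java.util.Properties;
import java.util.UUID;

public class OutboxRelay {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // In an outbox relay, this key would be stored with the outbox row
            // so retries republish the event with the *same* key.
            String idempotencyKey = UUID.randomUUID().toString();
            ProducerRecord<String, String> record = new ProducerRecord<>(
                    "payment-events", "payment-123", "{\"status\":\"APPROVED\"}");
            record.headers().add("idempotency-key",
                    idempotencyKey.getBytes(StandardCharsets.UTF_8));
            producer.send(record);
            // Consumers read the header and skip events whose key they have
            // already processed, making duplicate delivery harmless.
        }
    }
}
```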

naver

Things to know when using Kafka in a

The Apache Kafka ecosystem is undergoing a significant architectural shift with the introduction of Consumer Group Protocol v2, defined in KIP-848. The update addresses long-standing performance bottlenecks and stability issues in the original client-side rebalancing logic by moving partition assignment to the broker. This change eliminates the "stop-the-world" effect during rebalances and significantly improves the scalability of large consumer groups.

### Limitations of the Legacy Consumer Group Protocol (v1)

* **Heavy Client-Side Logic:** In v1, the "group leader" (a designated consumer instance) computes partition assignments, which burdens the client and leads to inconsistent behavior across programming-language implementations.
* **Stop-the-World Rebalancing:** Whenever a member joins or leaves the group, all consumers must stop processing until the new assignment is synchronized, causing significant latency spikes.
* **Sensitivity to Processing Delays:** Because heartbeats and data processing often share the same thread, a slow consumer can trigger a session timeout, causing an unnecessary and disruptive group rebalance.

### Architectural Improvements in Protocol v2

* **Server-Side Reconciliation:** Assignment logic moves to the Group Coordinator on the broker, simplifying the client and ensuring partition assignment is managed centrally and consistently.
* **Incremental Rebalancing:** Unlike v1's "eager" rebalancing, the new protocol lets consumers keep their existing partitions while negotiating new ones, so data processing continues throughout.
* **Decoupled Heartbeats:** The heartbeat mechanism is separated from the main processing loop, preventing "zombie member" scenarios where a busy consumer is incorrectly marked dead.

### Performance and Scalability Gains

* **Reduced Rebalance Latency:** With assignment offloaded to the broker, the time to stabilize a group after a membership change drops from seconds to milliseconds.
* **Large-Scale Group Support:** The protocol is designed to handle thousands of partitions and hundreds of consumers in a single group without the exponential degradation seen in v1.
* **Stable Deployments:** During rolling restarts or deployments, the group remains stable and avoids the "rebalance storms" that typically occur when multiple instances cycle at once.

### Migration and Practical Implementation

* **Configuration Requirements:** Users opt in by setting the `group.protocol` configuration to `consumer`, introduced as early access in Kafka 3.7 and standard in 4.0 (see the consumer sketch at the end of this summary).
* **Compatibility:** The new protocol requires updated brokers and clients, but it supports a transition phase so organizations can migrate their workloads gradually.
* **New Tooling:** Updated command-line tools and metrics make it possible to monitor the server-side assignment process and track group state more granularly.

Organizations experiencing frequent rebalance issues or managing high-throughput Kafka clusters should plan a migration to Consumer Group Protocol v2. Moving to this server-side assignment model stabilizes production environments and reduces the operational overhead of consumer group management.
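To make the opt-in concrete, here is a minimal consumer sketch, assuming Kafka clients and brokers at 3.7+ (early access) or 4.0+ (GA); the bootstrap address, topic, and group names are placeholders.

```java
// A minimal sketch of opting into Consumer Group Protocol v2 (KIP-848).
// Assumes a broker that supports the new protocol; names are placeholders.
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class V2Consumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-group");
        // Opt in to the v2 protocol; the default "classic" keeps v1 behavior.
        props.put("group.protocol", "consumer");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                consumer.poll(Duration.ofMillis(500)).forEach(r ->
                        System.out.printf("%s:%d %s%n",
                                r.topic(), r.partition(), r.value()));
            }
        }
    }
}
```

Note that under the `consumer` protocol, some client-side rebalancing settings from v1 (such as `partition.assignment.strategy`) no longer apply, since the broker performs assignment server-side.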