Benjamin Barton We shipped a feature that made perfect sense. It improved a specific type of investigation we had been testing against. Then other investigations started getting worse. Nothing crashed. No tests failed. But the overall quality of the agent had shifted, and we had…
Introduction Hello, I'm Hyotaek Jang, and I work on Global E-Commerce development at LINE Plus. Migrating an existing system to a new environment, or bringing it in-house, is practically a developer's fate. The most vexing situation is when there is no specification behind the existing logic, or when the system is a black box whose source code cannot even be consulted. We faced exactly this problem while internalizing various modules that had depended on external systems. To answer the question, "Does the code we've built really behave the same as the old system?"…
Toss Payments successfully modernized a 20-year-old legacy payment ledger by transitioning to a decoupled, MySQL-based architecture designed for high scalability and consistency. By implementing strategies like INSERT-only immutability and event-driven domain isolation, they overcame structural limitations such as the inability to handle split payments. Ultimately, the project demonstrates that robust system design must be paired with resilient operational recovery mechanisms to manage the complexities of large-scale financial migrations.
### Legacy Ledger Challenges
* **Inconsistent Schemas:** Different payment methods used entirely different table structures; for instance, a table named `REFUND` unexpectedly contained only account transfer data rather than all refund types.
* **Domain Coupling:** Multiple domains (settlement, accounting, and payments) shared the same tables and columns, meaning a single schema change required impact analysis across several teams.
* **Structural Limits:** A rigid 1:1 relationship between a payment and its method prevented the implementation of modern features like split payments or "Dutch pay" models.
### New Ledger Architecture
* **Data Immutability:** The system shifted from updating existing rows to an **INSERT-only** principle, ensuring a reliable audit trail and preventing database deadlocks.
* **Event-Driven Decoupling:** Instead of direct database access, the system uses Kafka to publish payment events, allowing independent domains to consume data without tight coupling.
* **Payment-Approval Separation:** By separating the "Payment" (the transaction intent) from the "Approval" (the specific financial method), the system now supports multiple payment methods per transaction.
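The INSERT-only principle and the payment/approval split described above can be sketched with a toy schema. All table and column names here are illustrative, not Toss's actual schema: the point is that status changes are appended as new rows, and one payment intent can fan out to several approvals.

```python
import sqlite3

# Toy INSERT-only ledger: state changes are appended as new rows,
# never applied as UPDATEs, so the full history stays auditable.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE payment (
    payment_id TEXT,
    seq        INTEGER,            -- monotonically increasing per payment
    status     TEXT,               -- REQUESTED / APPROVED / CANCELED ...
    PRIMARY KEY (payment_id, seq)
);
CREATE TABLE approval (            -- one payment may have N approvals
    approval_id TEXT PRIMARY KEY,
    payment_id  TEXT,              -- links back to the payment intent
    method      TEXT,              -- CARD / TRANSFER / POINTS ...
    amount      INTEGER
);
""")

def append_status(payment_id, status):
    # Append a new row instead of updating the latest one.
    (last,) = conn.execute(
        "SELECT COALESCE(MAX(seq), 0) FROM payment WHERE payment_id = ?",
        (payment_id,)).fetchone()
    conn.execute("INSERT INTO payment VALUES (?, ?, ?)",
                 (payment_id, last + 1, status))

def current_status(payment_id):
    # The newest row for a payment is its current state.
    return conn.execute(
        "SELECT status FROM payment WHERE payment_id = ? "
        "ORDER BY seq DESC LIMIT 1", (payment_id,)).fetchone()[0]

# A single payment split across two methods ("Dutch pay" style).
append_status("p-1", "REQUESTED")
conn.execute("INSERT INTO approval VALUES ('a-1', 'p-1', 'CARD', 7000)")
conn.execute("INSERT INTO approval VALUES ('a-2', 'p-1', 'POINTS', 3000)")
append_status("p-1", "APPROVED")
```

Because rows are never updated in place, concurrent writers contend far less on row locks, which is the deadlock-avoidance benefit the bullet above alludes to.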
### Safe Migration and Data Integrity
* **Asynchronous Mirroring:** To maintain zero downtime, data was initially written to the legacy system and then asynchronously loaded into the new MySQL ledger.
* **Resource Tuning:** Developers used dedicated migration servers within the same AWS Availability Zone to minimize latency and implemented **Bulk Inserts** to handle hundreds of millions of rows efficiently.
* **Verification Batches:** A separate batch process ran every five minutes against a Read-Only (RO) database to identify and correct any data gaps caused by asynchronous processing failures.
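The verification-batch idea reduces to a simple reconciliation loop. This is a minimal sketch with invented names, standing in for a batch that would really run against an RO replica: compare the legacy source of truth against the asynchronously mirrored ledger and backfill whatever the mirror dropped.

```python
# Toy verification batch: compare a legacy source of truth against the
# asynchronously mirrored new ledger and backfill any missing rows.

def verify_and_correct(legacy_rows, new_rows, backfill):
    """legacy_rows / new_rows: dict of key -> record; backfill: callable
    invoked for each record present in legacy but absent from the mirror."""
    missing = [k for k in legacy_rows if k not in new_rows]
    for key in missing:
        backfill(key, legacy_rows[key])   # re-load the dropped record
    return missing

legacy = {"p-1": {"amount": 100}, "p-2": {"amount": 250}, "p-3": {"amount": 40}}
mirrored = {"p-1": {"amount": 100}, "p-3": {"amount": 40}}  # p-2 was lost in flight

fixed = verify_and_correct(legacy, mirrored,
                           lambda k, v: mirrored.__setitem__(k, v))
```

Running such a loop on a short cadence (five minutes in the post) bounds how long an asynchronous mirroring failure can leave the two stores divergent.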
### Operational Resilience and Incident Response
* **Query Optimization:** During a load spike, the MySQL optimizer began choosing full table scans over indexes; the team resolved this by adding SQL index hints, and a retained history of the last five Docker image versions enabled rapid rollbacks.
* **Network Cancellation:** To handle timeouts between Toss and external card issuers, the system uses specific logic to automatically send cancellation requests and synchronize states.
* **Timeout Standardization:** Discrepancies between microservices were resolved by calculating the maximum processing time of approval servers and aligning all upstream timeout settings to prevent merchant response mismatches.
* **Reliable Event Delivery:** While using the **Outbox pattern** for events, the team added log-based recovery (Elasticsearch and local disk) and idempotency keys in event headers to handle both missing and duplicate messages.
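The duplicate-message half of the last bullet can be sketched as an idempotent consumer. The event shape and key names below are illustrative: each event carries an idempotency key in its headers, and a consumer that may see redeliveries (outbox retries, log-based replay) skips keys it has already applied.

```python
# Toy idempotent consumer: events may be redelivered, so each carries an
# idempotency key in its headers and already-seen keys are dropped.

processed_keys = set()   # in production this would be durable state
ledger = []

def handle(event):
    key = event["headers"]["idempotency-key"]
    if key in processed_keys:        # duplicate delivery: ignore it
        return False
    processed_keys.add(key)
    ledger.append(event["payload"])  # apply the effect exactly once
    return True

events = [
    {"headers": {"idempotency-key": "k-1"}, "payload": "approved p-1"},
    {"headers": {"idempotency-key": "k-1"}, "payload": "approved p-1"},  # replayed
    {"headers": {"idempotency-key": "k-2"}, "payload": "canceled p-2"},
]
results = [handle(e) for e in events]
```

Paired with the outbox pattern (which guarantees at-least-once publication), consumer-side deduplication like this yields effectively exactly-once application of each event.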
For organizations tackling significant technical debt, this transition highlights that initial design is only half the battle. True system reliability comes from building "self-healing" structures—such as automated correction batches and standardized timeout chains—that can survive the unpredictable nature of live production environments.
The Apache Kafka ecosystem is undergoing a significant architectural shift with the introduction of Consumer Group Protocol v2, as outlined in KIP-848. This update addresses long-standing performance bottlenecks and stability issues inherent in the original client-side rebalancing logic by moving the responsibility of partition assignment to the broker. This change effectively eliminates the "stop-the-world" effect during rebalances and significantly improves the scalability of large-scale consumer groups.
### Limitations of the Legacy Consumer Group Protocol (v1)
* **Heavy Client-Side Logic:** In v1, the "Group Leader" (a specific consumer instance) is responsible for calculating partition assignments, which creates a heavy burden on the client and leads to inconsistent behavior across different programming language implementations.
* **Stop-the-World Rebalancing:** Whenever a member joins or leaves the group, all consumers must stop processing data until the new assignment is synchronized, leading to significant latency spikes.
* **Sensitivity to Processing Delays:** Because heartbeats and data processing often share the same thread, a slow consumer can trigger a session timeout, causing an unnecessary and disruptive group rebalance.
### Architectural Improvements in Protocol v2
* **Server-Side Reconciliation:** The reconciliation logic is moved to the Group Coordinator on the broker, simplifying the client and ensuring that partition assignment is managed centrally and consistently.
* **Incremental Rebalancing:** Unlike the "eager" rebalancing of v1, the new protocol allows consumers to keep their existing partitions while negotiating new ones, ensuring continuous data processing.
* **Decoupled Heartbeats:** The heartbeat mechanism is separated from the main processing loop, preventing "zombie member" scenarios where a busy consumer is incorrectly marked as dead.
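The incremental-rebalancing idea can be illustrated with a small simulation. This is a toy model of the reconciliation outcome, not the broker's actual assignor: when a member joins, existing members keep their partitions and release only the surplus above the new fair share, instead of v1's revoke-everything "eager" rebalance.

```python
import math

def incremental_rebalance(assignment, new_member):
    """assignment: dict of member -> list of owned partition ids.
    Mutates it in place to admit new_member, moving only the surplus."""
    total = sum(len(parts) for parts in assignment.values())
    cap = math.ceil(total / (len(assignment) + 1))  # fair share after join
    released = []
    for member, parts in assignment.items():
        while len(parts) > cap:
            released.append(parts.pop())  # give up only the excess
    assignment[new_member] = released     # newcomer takes the released set
    return assignment

# Two consumers own 6 partitions; a third joins and takes only 2 of them,
# so c1 and c2 keep processing 4 partitions without interruption.
after = incremental_rebalance({"c1": [0, 1, 2], "c2": [3, 4, 5]}, "c3")
```

Contrast this with v1, where all six partitions would be revoked and reassigned, pausing every consumer until the new assignment is synchronized.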
### Performance and Scalability Gains
* **Reduced Rebalance Latency:** By offloading the assignment logic to the broker, the time required to stabilize a group after a membership change is reduced from seconds to milliseconds.
* **Large-Scale Group Support:** The new protocol is designed to handle thousands of partitions and hundreds of consumers within a single group without the exponential performance degradation seen in v1.
* **Stable Deployments:** During rolling restarts or deployments, the group remains stable and avoids the "rebalance storms" that typically occur when multiple instances cycle at once.
### Migration and Practical Implementation
* **Configuration Requirements:** Users can opt in to the new protocol by setting the `group.protocol` configuration to `consumer` (introduced as early access in Kafka 3.7 and standard in 4.0).
* **Compatibility:** While the new protocol requires updated brokers and clients, it is designed to support a transition phase to allow organizations to migrate their workloads gradually.
* **New Tooling:** Updated command-line tools and metrics are provided to monitor the server-side assignment process and track group state more granularly.
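Opting in is a client-side configuration change, sketched below as consumer properties for the Java client; the broker-side values (`bootstrap.servers`, `group.id`) are placeholders, and the cluster must run a KIP-848-capable broker version.

```properties
# consumer.properties
bootstrap.servers=broker-1:9092
group.id=orders-pipeline
# Opt in to the KIP-848 server-side protocol ("classic" is the v1 default).
group.protocol=consumer
# Optionally pick a server-side assignor by name (e.g. uniform or range).
group.remote.assignor=uniform
```

Note that client-side assignor settings such as `partition.assignment.strategy` no longer apply once the group runs the new protocol, since assignment happens on the broker.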
Organizations experiencing frequent rebalance issues or managing high-throughput Kafka clusters should plan for a migration to Consumer Group Protocol v2. Transitioning to this server-side assignment model is highly recommended for stabilizing production environments and reducing the operational overhead associated with consumer group management.
Sanketh Balakrishna Andrew Zhang At Datadog, we operate thousands of services that rely on consistent, low-latency data access. Moving data between diverse systems—quickly and reliably—is essential, but complex. Each application may have its own requirements for data freshness,…
Kai Zong Khor William Yu Many Datadog products offer a live view of their telemetry, allowing you to access your data in near real time from across your infrastructure. Live views improve responsiveness, but they also introduce strict requirements on data ingestion latency and s…
Gabriel Reid Package delivery services like UPS, FedEx, and the postal service have a tough job. They contend with a never-ending stream of packages to be delivered, with expectations of prompt and reliable delivery. In building the Datadog Log Forwarding feature, we had to cont…
Gabriel Reid When you think about engineering challenges faced in handling incoming log data in Datadog, one of the first things that comes to mind is the scale: quickly and reliably parsing and processing millions of logs per second over thousands of containers. By comparison,…
Guillaume Bort At Datadog, we use Apache Kafka as a durable, flexible buffer in our Data Platform. Ingesting hundreds of trillions of observability events each day requires massive Kafka deployments: hundreds of clusters, thousands of topics, and millions of partitions spanning…
Artem Krylysov May Lee Datadog collects billions of events from millions of hosts every minute, and that number keeps growing, fast. Our data volumes grew 30x between 2017 and 2022. On top of that, the kinds of queries we receive from our users have changed significantly. Why? B…
Daniel Intskirveli Cecilia Watt We introduced Husky in a previous blog post (Introducing Husky, Datadog's Third-Generation Event Store) as Datadog's third-generation event store. To recap, Husky is a distributed, time-series oriented, columnar store optimized for streaming inges…
Richard Artoul Cecilia Watt This is the story of "Husky", a new event storage system we built at Datadog. Building a new storage system is a fun and exciting undertaking—and one that shouldn’t be taken lightly. Most importantly, it does not happen in a vacuum. To understand the…
Jamie Alquiza For a company with data in the name, it’s no surprise that we ingest large amounts of it. Kafka is our messaging persistence layer of choice underlying many of our high-traffic services. Consequently, our Kafka usage is quite high: the intake of trillions of data p…