Customers Never Wait: How to
Toss Payments addressed the challenge of serving rapidly growing transaction data within a microservices architecture (MSA) by evolving their data platform from simple Elasticsearch indexing to a robust CQRS pattern. While Apache Druid initially provided high-performance time-series aggregation and significant cost savings, the team eventually integrated StarRocks to overcome limitations in data consistency and complex join operations. This architectural journey highlights the necessity of balancing real-time query performance with operational scalability and domain decoupling.
Transitioning to MSA and Early Search Solutions
- The shift from a monolithic structure to MSA decoupled application logic but created "data silos" where joining ledgers across domains became difficult.
- The initial solution utilized Elasticsearch to index specific fields for merchant transaction lookups and basic refunds.
- As transaction volumes doubled between 2022 and 2024, the need for complex OLAP-style aggregations led to the adoption of a CQRS (Command Query Responsibility Segregation) architecture.
Adopting Apache Druid for Time-Series Data
- Druid was selected for its optimization toward time-series data, offering low-latency aggregation for massive datasets.
- Druid SQL support kept the learning curve low, and Druid automatically builds bitmap indexes on all columns, including nested JSON keys.
- The system decoupled reads from writes, allowing the data team to serve billions of records without consuming resources on the primary transaction databases.
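
As a rough illustration of the kind of time-series aggregation described above, the sketch below builds (but does not send) a request for Druid's SQL HTTP endpoint, `/druid/v2/sql`. The `payments` datasource, the `merchant_id` and `amount` columns, and the router URL are assumptions made for the example; they are not taken from the Toss Payments schema.

```python
import json
import urllib.request

# Hypothetical schema: "payments", "merchant_id" and "amount" are illustrative names.
DAILY_TOTALS_SQL = """
SELECT
  TIME_FLOOR(__time, 'P1D') AS day_start,
  COUNT(*) AS tx_count,
  SUM(amount) AS total_amount
FROM payments
WHERE merchant_id = ?
  AND __time >= CURRENT_TIMESTAMP - INTERVAL '7' DAY
GROUP BY 1
ORDER BY 1
"""


def build_druid_sql_request(router_url: str, merchant_id: str) -> urllib.request.Request:
    """Build a POST request for Druid's SQL API without sending it."""
    body = {
        "query": DAILY_TOTALS_SQL,
        # The "?" placeholder in the query is bound through the parameters list.
        "parameters": [{"type": "VARCHAR", "value": merchant_id}],
    }
    return urllib.request.Request(
        url=router_url.rstrip("/") + "/druid/v2/sql",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would run the query against a live cluster. Because the query runs on the read side, it never touches the transactional databases.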
Data Ingestion: Message Publishing over CDC
- The team chose a message publishing approach via Kafka rather than Change Data Capture (CDC) to minimize domain dependency.
- In this model, domain teams publish finalized records, so the data team does not have to maintain the internal business logic of more than 20 payment methods.
- This strategy simplified system dependencies and leveraged Druid’s ability to automatically index incoming JSON fields.
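
The publishing model can be sketched as follows: the domain team serializes a finalized event to JSON and hands it to a Kafka producer as a key/value pair. Every field name here (`transaction_id`, `method`, `version`, `details`, and so on) is hypothetical, and the actual producer call is left out so that the example stays library-free.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass(frozen=True)
class PaymentEvent:
    """A finalized event as a domain team might publish it (illustrative fields only)."""
    transaction_id: str
    method: str          # e.g. "CARD"; one of the many payment methods
    status: str          # e.g. "APPROVED", "CANCELED"
    amount: int
    occurred_at: str     # ISO-8601 timestamp
    version: int         # increases with every state change of the transaction
    details: dict[str, Any] = field(default_factory=dict)  # method-specific nested JSON


def to_kafka_record(event: PaymentEvent) -> tuple[bytes, bytes]:
    """Return (key, value) bytes; keying by transaction keeps its events in one partition."""
    key = event.transaction_id.encode("utf-8")
    value = json.dumps(asdict(event), sort_keys=True, ensure_ascii=False).encode("utf-8")
    return key, value
```

The method-specific logic stays with the publishing team. The consumer only sees a self-describing JSON document whose nested keys can be indexed as they arrive.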
Infrastructure and Cost Optimization in AWS
- The architecture separates computing and storage, using AWS S3 for deep storage to keep costs low.
- Performance was optimized by using instances with high-performance local storage instead of network-attached EBS, resulting in up to 9x faster I/O.
- The team utilized Spot Instances for development and testing environments, contributing to a monthly cloud cost reduction of approximately 50 million KRW.
Operational Challenges and Druid’s Limitations
- Idempotency and Consistency: Druid has no native support for idempotent writes, so the team needed complex "Merge on Read" logic to handle duplicate messages and state changes.
- Data Fragmentation: Transaction cancellations often targeted old partitions, causing fragmentation; the team implemented a 60-second detection process to trigger automatic compaction.
- Join Constraints: While Druid supports joins, its capabilities are limited, making it difficult to link complex lifecycles across payment, purchase, and settlement domains.
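
A minimal sketch of the "Merge on Read" idea, assuming each row carries a hypothetical `transaction_id` and a monotonically increasing `version`: at query time, duplicates are dropped and only the newest state per transaction survives. This illustrates the technique in general and is not the team's actual implementation.

```python
from typing import Iterable


def merge_on_read(rows: Iterable[dict]) -> list[dict]:
    """Collapse stored rows to the latest version per transaction at read time."""
    latest: dict[str, dict] = {}
    for row in rows:
        current = latest.get(row["transaction_id"])
        # Strict ">" means a redelivered message with the same version is ignored.
        if current is None or row["version"] > current["version"]:
            latest[row["transaction_id"]] = row
    return list(latest.values())
```

Every read pays the cost of this merge. That cost is one reason a store that deduplicates at write time is attractive.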
Hybrid Search and Rollup Performance
- To ensure high-speed lookups across 10 billion records, a hybrid architecture was built: Elasticsearch handles specific keyword searches to retrieve IDs, which are then used to fetch full details from Druid.
- Druid’s "Rollup" feature was utilized to pre-aggregate data at ingestion time.
- Implementing Rollup cut average query response times for aggregate views from tens of seconds to under 1 second, a latency reduction of roughly 99%.
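
The two ideas in this section can be sketched in plain Python. `rollup_hourly` mimics ingestion-time pre-aggregation by truncating timestamps to the hour and summing per dimension combination. `hybrid_lookup` mimics the two-step search, with plain dictionaries standing in for Elasticsearch and Druid. All field names and both stand-in stores are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime


def rollup_hourly(events: list[dict]) -> dict[tuple, dict]:
    """Pre-aggregate raw events into one row per (hour, merchant_id, method)."""
    buckets: dict[tuple, dict] = defaultdict(lambda: {"count": 0, "total_amount": 0})
    for event in events:
        # Truncate the event time to the start of its hour.
        hour = datetime.fromisoformat(event["timestamp"]).replace(
            minute=0, second=0, microsecond=0
        )
        key = (hour.isoformat(), event["merchant_id"], event["method"])
        buckets[key]["count"] += 1
        buckets[key]["total_amount"] += event["amount"]
    return dict(buckets)


def hybrid_lookup(keyword: str, search_index: dict, detail_store: dict) -> list[dict]:
    """Step 1: resolve keyword to IDs (search engine). Step 2: fetch details by ID."""
    ids = search_index.get(keyword, [])
    return [detail_store[tx_id] for tx_id in ids if tx_id in detail_store]
```

Aggregate queries then scan far fewer rows than the raw event stream contains. The trade-off is that individual events can no longer be recovered from the rolled-up rows.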
Moving Toward StarRocks
- To solve Druid's limitations regarding idempotency and multi-table joins, Toss Payments began transitioning to StarRocks.
- StarRocks provides a more stable environment for managing inconsistent events and simplifies the data flow by aligning with existing analytical infrastructure.
- This shift supports the need for a "Unified Ledger" that can track the entire lifecycle of a transaction—from payment to net profit—across disparate database sources.
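
To make the "Unified Ledger" idea concrete, the sketch below left-joins three domain datasets on a shared transaction ID. It is written in plain Python rather than StarRocks SQL, and every field name is hypothetical. In particular, `net_profit = merchant_fee - acquirer_fee` is an assumed formula chosen only for illustration, not the one Toss Payments uses.

```python
def build_unified_ledger(
    payments: list[dict], purchases: list[dict], settlements: list[dict]
) -> list[dict]:
    """Left-join purchase and settlement rows onto each payment by transaction_id."""
    purchases_by_tx = {row["transaction_id"]: row for row in purchases}
    settlements_by_tx = {row["transaction_id"]: row for row in settlements}
    ledger = []
    for payment in payments:
        tx_id = payment["transaction_id"]
        purchase = purchases_by_tx.get(tx_id)
        settlement = settlements_by_tx.get(tx_id)
        complete = purchase is not None and settlement is not None
        ledger.append({
            "transaction_id": tx_id,
            "amount": payment["amount"],
            "purchased": purchase is not None,
            "settled": settlement is not None,
            # Assumed formula; only defined once the full lifecycle is present.
            "net_profit": (
                settlement["merchant_fee"] - purchase["acquirer_fee"] if complete else None
            ),
        })
    return ledger
```

A transaction that has not yet been purchased or settled still appears in the ledger, with `net_profit` left as `None` until the later stages arrive.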