고객은 절대 기다려주지 않는다: 빠른 데이터 서빙으로 고객 만족도를 수직 상승 시키는 법 (opens in new tab)
Toss Payments addressed the challenge of serving rapidly growing transaction data within a microservices architecture (MSA) by evolving their data platform from simple Elasticsearch indexing to a robust CQRS pattern. While Apache Druid initially provided high-performance time-series aggregation and significant cost savings, the team eventually integrated StarRocks to overcome limitations in data consistency and complex join operations. This architectural journey highlights the necessity of balancing real-time query performance with operational scalability and domain decoupling. ### Transitioning to MSA and Early Search Solutions * The shift from a monolithic structure to MSA decoupled application logic but created "data silos" where joining ledgers across domains became difficult. * The initial solution utilized Elasticsearch to index specific fields for merchant transaction lookups and basic refunds. * As transaction volumes doubled between 2022 and 2024, the need for complex OLAP-style aggregations led to the adoption of a CQRS (Command Query Responsibility Segregation) architecture. ### Adopting Apache Druid for Time-Series Data * Druid was selected for its optimization toward time-series data, offering low-latency aggregation for massive datasets. * It provided a low learning curve by supporting Druid SQL and featured automatic bitmap indexing for all columns, including nested JSON keys. * The system decoupled reads from writes, allowing the data team to serve billions of records without impacting the primary transaction databases' resources. ### Data Ingestion: Message Publishing over CDC * The team chose a message publishing approach via Kafka rather than Change Data Capture (CDC) to minimize domain dependency. * In this model, domain teams publish finalized data packets, reducing the data team's need to maintain complex internal business logic for over 20 different payment methods. * This strategy simplified system dependencies and leveraged Druid’s ability to automatically index incoming JSON fields. ### Infrastructure and Cost Optimization in AWS * The architecture separates computing and storage, using AWS S3 for deep storage to keep costs low. * Performance was optimized by using instances with high-performance local storage instead of network-attached EBS, resulting in up to 9x faster I/O. * The team utilized Spot Instances for development and testing environments, contributing to a monthly cloud cost reduction of approximately 50 million KRW. ### Operational Challenges and Druid’s Limitations * **Idempotency and Consistency:** Druid struggled with native idempotency, requiring complex "Merge on Read" logic to handle duplicate messages or state changes. * **Data Fragmentation:** Transaction cancellations often targeted old partitions, causing fragmentation; the team implemented a 60-second detection process to trigger automatic compaction. * **Join Constraints:** While Druid supports joins, its capabilities are limited, making it difficult to link complex lifecycles across payment, purchase, and settlement domains. ### Hybrid Search and Rollup Performance * To ensure high-speed lookups across 10 billion records, a hybrid architecture was built: Elasticsearch handles specific keyword searches to retrieve IDs, which are then used to fetch full details from Druid. * Druid’s "Rollup" feature was utilized to pre-aggregate data at ingestion time. * Implementing Rollup reduced average query response times from tens of seconds to under 1 second, representing a 99% performance improvement for aggregate views. ### Moving Toward StarRocks * To solve Druid's limitations regarding idempotency and multi-table joins, Toss Payments began transitioning to StarRocks. * StarRocks provides a more stable environment for managing inconsistent events and simplifies the data flow by aligning with existing analytical infrastructure. * This shift supports the need for a "Unified Ledger" that can track the entire lifecycle of a transaction—from payment to net profit—across disparate database sources.