toss Dec 15, 2025

Customers Never Wait: How to (opens in new tab)

apache-kafka data-architecture elasticsearch apache-druid starrocks olap cqrs msa

Toss Payments addressed the challenge of serving rapidly growing transaction data within a microservices architecture (MSA) by evolving their data platform from simple Elasticsearch indexing to a robust CQRS pattern. While Apache Druid initially provided high-performance time-series aggregation and significant cost savings, the team eventually integrated StarRocks to overcome limitations in data consistency and complex join operations. This architectural journey highlights the necessity of balancing real-time query performance with operational scalability and domain decoupling.

Transitioning to MSA and Early Search Solutions

The shift from a monolithic structure to MSA decoupled application logic but created "data silos" where joining ledgers across domains became difficult.
The initial solution utilized Elasticsearch to index specific fields for merchant transaction lookups and basic refunds.
As transaction volumes doubled between 2022 and 2024, the need for complex OLAP-style aggregations led to the adoption of a CQRS (Command Query Responsibility Segregation) architecture.

Adopting Apache Druid for Time-Series Data

Druid was selected for its optimization toward time-series data, offering low-latency aggregation for massive datasets.
It provided a low learning curve by supporting Druid SQL and featured automatic bitmap indexing for all columns, including nested JSON keys.
The system decoupled reads from writes, allowing the data team to serve billions of records without impacting the primary transaction databases' resources.

Data Ingestion: Message Publishing over CDC

The team chose a message publishing approach via Kafka rather than Change Data Capture (CDC) to minimize domain dependency.
In this model, domain teams publish finalized data packets, reducing the data team's need to maintain complex internal business logic for over 20 different payment methods.
This strategy simplified system dependencies and leveraged Druid’s ability to automatically index incoming JSON fields.

Infrastructure and Cost Optimization in AWS

The architecture separates computing and storage, using AWS S3 for deep storage to keep costs low.
Performance was optimized by using instances with high-performance local storage instead of network-attached EBS, resulting in up to 9x faster I/O.
The team utilized Spot Instances for development and testing environments, contributing to a monthly cloud cost reduction of approximately 50 million KRW.

Operational Challenges and Druid’s Limitations

Idempotency and Consistency: Druid struggled with native idempotency, requiring complex "Merge on Read" logic to handle duplicate messages or state changes.
Data Fragmentation: Transaction cancellations often targeted old partitions, causing fragmentation; the team implemented a 60-second detection process to trigger automatic compaction.
Join Constraints: While Druid supports joins, its capabilities are limited, making it difficult to link complex lifecycles across payment, purchase, and settlement domains.

Hybrid Search and Rollup Performance

To ensure high-speed lookups across 10 billion records, a hybrid architecture was built: Elasticsearch handles specific keyword searches to retrieve IDs, which are then used to fetch full details from Druid.
Druid’s "Rollup" feature was utilized to pre-aggregate data at ingestion time.
Implementing Rollup reduced average query response times from tens of seconds to under 1 second, representing a 99% performance improvement for aggregate views.

Moving Toward StarRocks

To solve Druid's limitations regarding idempotency and multi-table joins, Toss Payments began transitioning to StarRocks.
StarRocks provides a more stable environment for managing inconsistent events and simplifies the data flow by aligning with existing analytical infrastructure.
This shift supports the need for a "Unified Ledger" that can track the entire lifecycle of a transaction—from payment to net profit—across disparate database sources.