starrocks

2 posts

toss

Customers Never Wait: How to (opens in new tab)

Toss Payments addressed the challenge of serving rapidly growing transaction data within a microservices architecture (MSA) by evolving their data platform from simple Elasticsearch indexing to a robust CQRS pattern. While Apache Druid initially provided high-performance time-series aggregation and significant cost savings, the team eventually integrated StarRocks to overcome limitations in data consistency and complex join operations. This architectural journey highlights the necessity of balancing real-time query performance with operational scalability and domain decoupling. ### Transitioning to MSA and Early Search Solutions * The shift from a monolithic structure to MSA decoupled application logic but created "data silos" where joining ledgers across domains became difficult. * The initial solution utilized Elasticsearch to index specific fields for merchant transaction lookups and basic refunds. * As transaction volumes doubled between 2022 and 2024, the need for complex OLAP-style aggregations led to the adoption of a CQRS (Command Query Responsibility Segregation) architecture. ### Adopting Apache Druid for Time-Series Data * Druid was selected for its optimization toward time-series data, offering low-latency aggregation for massive datasets. * It provided a low learning curve by supporting Druid SQL and featured automatic bitmap indexing for all columns, including nested JSON keys. * The system decoupled reads from writes, allowing the data team to serve billions of records without impacting the primary transaction databases' resources. ### Data Ingestion: Message Publishing over CDC * The team chose a message publishing approach via Kafka rather than Change Data Capture (CDC) to minimize domain dependency. * In this model, domain teams publish finalized data packets, reducing the data team's need to maintain complex internal business logic for over 20 different payment methods. * This strategy simplified system dependencies and leveraged Druid’s ability to automatically index incoming JSON fields. ### Infrastructure and Cost Optimization in AWS * The architecture separates computing and storage, using AWS S3 for deep storage to keep costs low. * Performance was optimized by using instances with high-performance local storage instead of network-attached EBS, resulting in up to 9x faster I/O. * The team utilized Spot Instances for development and testing environments, contributing to a monthly cloud cost reduction of approximately 50 million KRW. ### Operational Challenges and Druid’s Limitations * **Idempotency and Consistency:** Druid struggled with native idempotency, requiring complex "Merge on Read" logic to handle duplicate messages or state changes. * **Data Fragmentation:** Transaction cancellations often targeted old partitions, causing fragmentation; the team implemented a 60-second detection process to trigger automatic compaction. * **Join Constraints:** While Druid supports joins, its capabilities are limited, making it difficult to link complex lifecycles across payment, purchase, and settlement domains. ### Hybrid Search and Rollup Performance * To ensure high-speed lookups across 10 billion records, a hybrid architecture was built: Elasticsearch handles specific keyword searches to retrieve IDs, which are then used to fetch full details from Druid. * Druid’s "Rollup" feature was utilized to pre-aggregate data at ingestion time. * Implementing Rollup reduced average query response times from tens of seconds to under 1 second, representing a 99% performance improvement for aggregate views. ### Moving Toward StarRocks * To solve Druid's limitations regarding idempotency and multi-table joins, Toss Payments began transitioning to StarRocks. * StarRocks provides a more stable environment for managing inconsistent events and simplifies the data flow by aligning with existing analytical infrastructure. * This shift supports the need for a "Unified Ledger" that can track the entire lifecycle of a transaction—from payment to net profit—across disparate database sources.

naver

Iceberg Low-Latency Queries with Materialized Views (opens in new tab)

This technical session from NAVER ENGINEERING DAY 2025 explores the architectural journey of building a low-latency query system for real-time transaction reports. The project focuses on resolving the tension between high data freshness, massive scalability, and rapid response times for complex, multi-dimensional filtering. By leveraging Apache Iceberg in conjunction with StarRocks’ materialized views, the team established a performant data pipeline that meets the demands of modern business intelligence. ### Challenges in Real-Time Transaction Reporting * **Query Latency vs. Data Freshness:** Traditional architectures often struggle to provide immediate visibility into transaction data while maintaining sub-second query speeds across diverse filter conditions. * **High-Dimensional Filtering:** Users require the ability to query reports based on numerous variables, necessitating an engine that can handle complex aggregations without pre-defining every possible index. * **Scalability Requirements:** The system must handle increasing transaction volumes without degrading performance or requiring significant manual intervention in the underlying storage layer. ### Optimized Architecture with Iceberg and StarRocks * **Apache Iceberg Integration:** Iceberg serves as the open table format, providing a reliable foundation for managing large-scale data snapshots and ensuring consistency during concurrent reads and writes. * **StarRocks for Query Acceleration:** The team selected StarRocks as the primary OLAP engine to take advantage of its high-speed vectorized execution and native support for Iceberg tables. * **Spark-Based Processing:** Apache Spark is utilized for the initial data ingestion and transformation phases, preparing the transaction data for efficient storage and downstream consumption. ### Enhancing Performance via Materialized Views * **Pre-computed Aggregations:** By implementing Materialized Views, the system pre-calculates intensive transaction summaries, significantly reducing the computational load during active user queries. * **Automatic Query Rewrite:** The architecture utilizes StarRocks' ability to automatically route queries to the most efficient materialized view, ensuring that even ad-hoc reports benefit from pre-computed results. * **Balanced Refresh Strategies:** The research focused on optimizing the refresh intervals of these views to maintain high "freshness" while minimizing the overhead on the cluster resources. The adoption of a modern lakehouse architecture combining Apache Iceberg with a high-performance OLAP engine like StarRocks is a recommended strategy for organizations dealing with high-volume, real-time reporting. This approach effectively decouples storage and compute while providing the low-latency response times necessary for interactive data analysis.