토스 / elasticsearch

2 posts

toss

고객은 절대 기다려주지 않는다: 빠른 데이터 서빙으로 고객 만족도를 수직 상승 시키는 법 (opens in new tab)

Toss Payments addressed the challenge of serving rapidly growing transaction data within a microservices architecture (MSA) by evolving their data platform from simple Elasticsearch indexing to a robust CQRS pattern. While Apache Druid initially provided high-performance time-series aggregation and significant cost savings, the team eventually integrated StarRocks to overcome limitations in data consistency and complex join operations. This architectural journey highlights the necessity of balancing real-time query performance with operational scalability and domain decoupling. ### Transitioning to MSA and Early Search Solutions * The shift from a monolithic structure to MSA decoupled application logic but created "data silos" where joining ledgers across domains became difficult. * The initial solution utilized Elasticsearch to index specific fields for merchant transaction lookups and basic refunds. * As transaction volumes doubled between 2022 and 2024, the need for complex OLAP-style aggregations led to the adoption of a CQRS (Command Query Responsibility Segregation) architecture. ### Adopting Apache Druid for Time-Series Data * Druid was selected for its optimization toward time-series data, offering low-latency aggregation for massive datasets. * It provided a low learning curve by supporting Druid SQL and featured automatic bitmap indexing for all columns, including nested JSON keys. * The system decoupled reads from writes, allowing the data team to serve billions of records without impacting the primary transaction databases' resources. ### Data Ingestion: Message Publishing over CDC * The team chose a message publishing approach via Kafka rather than Change Data Capture (CDC) to minimize domain dependency. * In this model, domain teams publish finalized data packets, reducing the data team's need to maintain complex internal business logic for over 20 different payment methods. * This strategy simplified system dependencies and leveraged Druid’s ability to automatically index incoming JSON fields. ### Infrastructure and Cost Optimization in AWS * The architecture separates computing and storage, using AWS S3 for deep storage to keep costs low. * Performance was optimized by using instances with high-performance local storage instead of network-attached EBS, resulting in up to 9x faster I/O. * The team utilized Spot Instances for development and testing environments, contributing to a monthly cloud cost reduction of approximately 50 million KRW. ### Operational Challenges and Druid’s Limitations * **Idempotency and Consistency:** Druid struggled with native idempotency, requiring complex "Merge on Read" logic to handle duplicate messages or state changes. * **Data Fragmentation:** Transaction cancellations often targeted old partitions, causing fragmentation; the team implemented a 60-second detection process to trigger automatic compaction. * **Join Constraints:** While Druid supports joins, its capabilities are limited, making it difficult to link complex lifecycles across payment, purchase, and settlement domains. ### Hybrid Search and Rollup Performance * To ensure high-speed lookups across 10 billion records, a hybrid architecture was built: Elasticsearch handles specific keyword searches to retrieve IDs, which are then used to fetch full details from Druid. * Druid’s "Rollup" feature was utilized to pre-aggregate data at ingestion time. * Implementing Rollup reduced average query response times from tens of seconds to under 1 second, representing a 99% performance improvement for aggregate views. ### Moving Toward StarRocks * To solve Druid's limitations regarding idempotency and multi-table joins, Toss Payments began transitioning to StarRocks. * StarRocks provides a more stable environment for managing inconsistent events and simplifies the data flow by aligning with existing analytical infrastructure. * This shift supports the need for a "Unified Ledger" that can track the entire lifecycle of a transaction—from payment to net profit—across disparate database sources.

toss

100년 가는 프론트엔드 코드, SDK (opens in new tab)

Toss Payments evolved its Payment SDK to solve the inherent complexities of integrating payment systems, where developers must navigate UI implementation, security flows, and exception handling. By transitioning from V1 to V2, the team moved beyond simply providing a library to building a robust, architecture-driven system that ensures stability and scalability across diverse merchant environments. The core conclusion is that a successful SDK must be treated as a critical infrastructure layer, relying on modular design and deep observability to handle the unpredictable nature of third-party runtimes. ## The Unique Challenges of SDK Development * SDK code lives within the merchant's runtime environment, meaning it shares the same lifecycle and performance constraints as the merchant’s own code. * Internal logging can inadvertently create bottlenecks; for instance, adding network logs to a frequently called method can lead to "self-DDoS" scenarios that crash the merchant's payment page. * Type safety is a major hurdle, as merchants may pass unexpected data types (e.g., a number instead of a string), causing fatal runtime errors like `startsWith is not a function`. * The SDK acts as a bridge for technical communication, requiring it to function as both an API consumer for internal systems and an API provider for external developers. ## Ensuring Stability through Observability * To manage the unpredictable ways merchants use the SDK, Toss implemented over 300 unit tests and 500 E2E integration tests based on real-world use cases. * The team utilizes a "Global Trace ID" to track a single payment journey across both the frontend and backend, allowing for seamless debugging across the entire system. * A custom Monitoring CLI was developed to compare payment success rates before and after deployments, categorized by merchant and runtime environment (e.g., PC Chrome vs. Android WebView). * This observability infrastructure enables the team to quickly identify edge-case failures—such as a specific merchant's checkout failing only on mobile WebViews—which are often missed by standard QA processes. ## Scaling with Modular Architecture * To avoid "if-statement hell" caused by merchant-specific requirements (e.g., fixing installment months or custom validation for a specific store), Toss moved to a "Lego-block" architecture. * The SDK is organized into three distinct layers based on the "reason for change" principle: * **Public Interface Layer:** Manages the contract with the merchant, validating inputs and translating them into internal domain models. * **Domain Layer:** Encapsulates core business logic and payment policies, keeping them isolated from external changes. * **External Service Layer:** Handles dependencies like Server APIs and Web APIs, ensuring technical shifts don't leak into the business logic. * This separation allows the team to implement custom merchant logic by swapping specific blocks without modifying the core codebase, reducing the risk of regressions and lowering maintenance costs. For developers building SDKs or integration tools, the shift from monolithic logic to a layered, observable architecture is essential. Prioritizing the separation of domain logic from public interfaces and investing in environment-specific monitoring allows for a highly flexible product that remains stable even as the client-side environment grows increasingly complex.