Toss / database-design

11 posts

toss

Operating StarRocks: Isolating Multi-Tenant Workloads with Resource Groups

Hello, I'm 이유진 from the Toss Data Online Processing (DOP) team. Toss adopted StarRocks as a real-time OLAP engine to handle both service lookups and analytical queries quickly on a single platform. As workloads with very different characteristics piled up on one cluster, "whose queries do we protect first?" became the central operational question. In this post (part 1), I share what we learned operating our StarRocks cluster, focusing on how we classified workloads with Resource Groups and how CPU priorities…


Adopting Post-Quantum Cryptography for the Quantum Computing Era: Why Apply It to Our Services 10 Years Early?

Hello, Toss Tech readers. I'm 하태호, Head of Technology at Toss Payments. While writing the legacy overhaul series, one question came up often: "What was the hardest part of overhauling a 20-year-old legacy system?" Many assumed the technical challenges must have been the hardest part. There were certainly difficult and painful moments, but thanks to the deep thought and dedication of the engineers involved, we pulled off seemingly impossible improvements one by one and were able to improve a great deal. The truly hard part, however, was keeping in step with tens of thousands of merchants…


Introducing "PANDA", the Toss Place Data Bot: How Every Team Member Can Work Like a Data Expert

Hello. We are 김윤아, Data Analysis Team Leader, and 정이을, Data Analytics Engineer, who planned and built "PANDA" at Toss Place. Have you ever had a moment at work like this? "If only I could check this one piece of data right now." What if you could pull up the data you need immediately, without opening a dashboard or leaving a request for someone and waiting? The Toss Place Data organization believed we could use AI to build such an environment and realize data democracy…


Tuning Apache Flink + RocksDB to Extend Real-Time Ad Frequency Capping Aggregation to a Full Week

Hello, we are 이승민 and 최원용 from the Toss Data Service Platform Team. Our team provides sliding-window aggregation of ad impression counts. Short windows (1 minute to 1 hour) were handled by Flink, while long windows ran as Airflow batches. This post records what we went through while extending Flink to cover the long windows as well. It's the story of building a system that aggregates how many times a user has seen an ad, in real time, across windows from 1 minute to 7 days, and serves the result with a single lookup at serving time. If the aggregation is inaccurate, advertiser budgets…
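The teaser above describes counting impressions over sliding windows from 1 minute to 7 days and serving the count with a single lookup. As a rough illustration of the general technique only (an in-memory toy, not the team's actual Flink/RocksDB pipeline; all names are assumptions), impressions can be kept in per-minute buckets that are summed over the requested window at query time:

```python
from collections import defaultdict

class SlidingImpressionCounter:
    """Toy sketch: sliding-window frequency counting via minute buckets.

    Each impression increments a per-minute bucket; a window query sums
    the buckets that overlap the window. Illustrative only.
    """

    BUCKET_SECONDS = 60  # one bucket per minute

    def __init__(self):
        # (user_id, ad_id) -> {bucket_start_ts: count}
        self._buckets = defaultdict(lambda: defaultdict(int))

    def record(self, user_id: str, ad_id: str, ts: float) -> None:
        bucket = int(ts) // self.BUCKET_SECONDS * self.BUCKET_SECONDS
        self._buckets[(user_id, ad_id)][bucket] += 1

    def count(self, user_id: str, ad_id: str,
              window_seconds: int, now: float) -> int:
        cutoff = now - window_seconds
        return sum(
            c for bucket, c in self._buckets[(user_id, ad_id)].items()
            if bucket + self.BUCKET_SECONDS > cutoff  # bucket overlaps window
        )

counter = SlidingImpressionCounter()
now = 1_700_000_000.0
counter.record("u1", "ad42", now - 30)         # 30 seconds ago
counter.record("u1", "ad42", now - 3600 * 5)   # 5 hours ago
counter.record("u1", "ad42", now - 86400 * 8)  # 8 days ago, outside 7d

print(counter.count("u1", "ad42", 3600, now))       # last hour
print(counter.count("u1", "ad42", 86400 * 7, now))  # last 7 days
```

A real implementation must also expire old buckets and persist state (the post's subject: RocksDB-backed Flink state), but the bucketing-and-summing shape is the same.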


Metric Review: Driving Execution

Hello, I'm 박종익, leading the Data Platform Team at Toss Place. "The insight is clearly there, so why is execution slow?" If you work in a data organization, you run into this question often. Analyses accumulate and dashboards fill up, yet the pace of actual change in the product or the business often falls short of expectations. We wrestled with the same problem for a long time, and Metric Review is what grew out of it. Today I'd like to share why we started Metric Review and how…


Automating Service Vulnerability Analysis with LLMs #2

*This post is based on work carried out on an isolated R&D network. Hello, I'm 표상영, a Security Researcher at Toss. In the previous post, I briefly introduced the problems we ran into while automating service vulnerability analysis with LLMs and how we addressed them. Three months have already passed since that post, and in just those few months, AI's vulnerability analysis capability has reached a remarkably high level. This steep pace of progress has changed a great deal about how I think about and approach AI. In this post…


Foreign User Research: Why Did Canadian User "B" Fail Toss Identity Verification?

Do you know how Korea's financial system looks to foreigners? Search "Korean Banking" on the popular American community Reddit and you can see exactly how foreigners perceive it: hard to understand without someone's help, and a complicated experience overall. Perhaps that's why many foreign users who signed up for Toss were unable to use it properly. If "finance for everyone" is Toss's vision, we believed that foreigners should not be excluded from that "everyone". So that foreigners, too, could comfortab…


Overhauling a Legacy Settlement System: From Rolling Out the New System to Large-Scale Batch Operations Know-How

Toss Payments recently overhauled its 20-year-old legacy settlement system to overcome deep-seated technical debt and prepare for massive transaction growth. By shifting from monolithic SQL queries and aggregated data to a granular, object-oriented architecture, the team significantly improved system maintainability, traceability, and batch processing performance. The transition focused on breaking down complex dependencies and ensuring that every transaction is verifiable and reproducible.

### Replacing Monolithic SQL with Object-Oriented Logic

* The legacy system relied on a "giant common query" filled with nested `DECODE`, `CASE WHEN`, and complex joins, making it nearly impossible to identify the impact of small changes.
* The team applied a "Divide and Conquer" strategy, splitting the massive query into distinct domains and refined sub-functions.
* Business logic was moved from the database layer into Kotlin-based objects (e.g., `SettlementFeeCalculator`), making business rules explicit and easier to test.
* This modular approach allowed for "Incremental Migration", where specific features (like exchange rate conversions) could be upgraded to the new system independently.

### Improving Traceability through Granular Data Modeling

* The old system stored data in an aggregated state (Sum), which prevented developers from tracing errors back to specific transactions or reusing data for different reporting needs.
* The new architecture manages data at the minimum transaction unit (1:1), ensuring that every settlement result corresponds to a specific transaction.
* "Setting Snapshots" were introduced to store the exact contract conditions (fee rates, VAT status) at the time of calculation, allowing the system to reconstruct the context of past settlements.
* A state-based processing model was implemented to enable selective retries for failed transactions, significantly reducing recovery time compared to the previous "all-or-nothing" transaction approach.

### Optimizing High-Resolution Data and Query Performance

* Managing data at the transaction level led to an explosion in data volume, necessitating specialized database strategies.
* The team implemented date-based Range Partitioning and composite indexing on settlement dates to maintain high query speeds despite the increased scale.
* To balance write performance and read needs, they created query-specific tables that offload the processing burden from the main batch system.
* Complex administrative queries were delegated to a separate high-performance data serving platform, maintaining a clean separation between core settlement logic and flexible data analysis.

### Resolving Batch Performance and I/O Bottlenecks

* The legacy batch system struggled with long processing times that scaled poorly with transaction growth due to heavy I/O and single-threaded processing.
* I/O was minimized by caching merchant contract information in memory at the start of a batch step, eliminating millions of redundant database lookups.
* The team optimized the `ItemProcessor` in Spring Batch by implementing bulk lookups (using a wrapper structure) to handle multiple records at once rather than querying the database for every individual item.

This modernization demonstrates that scaling a financial system requires moving beyond "convenient" aggregations toward a granular, state-driven architecture. By decoupling business logic from the database and prioritizing data traceability, Toss Payments has built a foundation capable of handling the next generation of transaction volumes.
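The granular, snapshot-based model summarized above can be sketched in a few lines. This is a hypothetical illustration in Python, not the post's actual Kotlin code: the class name `SettlementFeeCalculator` follows the summary, but every field, rate, and method here is an assumption. It shows settlement computed per transaction (1:1) against a frozen copy of the contract terms, so each result is traceable and a failed item could be retried individually:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SettingSnapshot:
    """Contract terms frozen at calculation time (fields are assumed)."""
    fee_rate: float
    vat_applied: bool

@dataclass
class Transaction:
    tx_id: str
    amount: int  # in KRW

@dataclass
class SettlementResult:
    tx_id: str
    fee: int
    snapshot: SettingSnapshot  # exact terms used, for reconstruction
    status: str                # e.g. "DONE" / "FAILED", for selective retry

class SettlementFeeCalculator:
    """Per-transaction fee calculation, in the spirit of the post's
    Kotlin objects; the rules below are illustrative assumptions."""
    VAT_RATE = 0.1

    def calculate(self, tx: Transaction,
                  snapshot: SettingSnapshot) -> SettlementResult:
        fee = round(tx.amount * snapshot.fee_rate)
        if snapshot.vat_applied:
            fee += round(fee * self.VAT_RATE)
        return SettlementResult(tx.tx_id, fee, snapshot, "DONE")

# One verifiable settlement row per transaction, each carrying the
# snapshot it was computed from.
snapshot = SettingSnapshot(fee_rate=0.033, vat_applied=True)
calc = SettlementFeeCalculator()
results = [calc.calculate(tx, snapshot)
           for tx in [Transaction("t-1", 10_000), Transaction("t-2", 55_000)]]
for r in results:
    print(r.tx_id, r.fee, r.status)
```

Because the rule lives in a testable object rather than a nested `CASE WHEN`, changing one fee rule no longer requires impact analysis on a giant shared query.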


Turning a Legacy Payment Ledger into a Scalable System

Toss Payments successfully modernized a 20-year-old legacy payment ledger by transitioning to a decoupled, MySQL-based architecture designed for high scalability and consistency. By implementing strategies like INSERT-only immutability and event-driven domain isolation, they overcame structural limitations such as the inability to handle split payments. Ultimately, the project demonstrates that robust system design must be paired with resilient operational recovery mechanisms to manage the complexities of large-scale financial migrations.

### Legacy Ledger Challenges

* **Inconsistent Schemas:** Different payment methods used entirely different table structures; for instance, a table named `REFUND` unexpectedly contained only account transfer data rather than all refund types.
* **Domain Coupling:** Multiple domains (settlement, accounting, and payments) shared the same tables and columns, meaning a single schema change required impact analysis across several teams.
* **Structural Limits:** A rigid 1:1 relationship between a payment and its method prevented the implementation of modern features like split payments or "Dutch pay" models.

### New Ledger Architecture

* **Data Immutability:** The system shifted from updating existing rows to an **INSERT-only** principle, ensuring a reliable audit trail and preventing database deadlocks.
* **Event-Driven Decoupling:** Instead of direct database access, the system uses Kafka to publish payment events, allowing independent domains to consume data without tight coupling.
* **Payment-Approval Separation:** By separating the "Payment" (the transaction intent) from the "Approval" (the specific financial method), the system now supports multiple payment methods per transaction.

### Safe Migration and Data Integrity

* **Asynchronous Mirroring:** To maintain zero downtime, data was initially written to the legacy system and then asynchronously loaded into the new MySQL ledger.
* **Resource Tuning:** Developers used dedicated migration servers within the same AWS Availability Zone to minimize latency and implemented **Bulk Inserts** to handle hundreds of millions of rows efficiently.
* **Verification Batches:** A separate batch process ran every five minutes against a Read-Only (RO) database to identify and correct any data gaps caused by asynchronous processing failures.

### Operational Resilience and Incident Response

* **Query Optimization:** During a load spike, the MySQL optimizer chose full scans over indexes; the team resolved this by implementing SQL hints and utilizing a 5-version Docker image history for rapid rollbacks.
* **Network Cancellation:** To handle timeouts between Toss and external card issuers, the system uses specific logic to automatically send cancellation requests and synchronize states.
* **Timeout Standardization:** Discrepancies between microservices were resolved by calculating the maximum processing time of approval servers and aligning all upstream timeout settings to prevent merchant response mismatches.
* **Reliable Event Delivery:** While using the **Outbox pattern** for events, the team added log-based recovery (Elasticsearch and local disk) and idempotency keys in event headers to handle both missing and duplicate messages.

For organizations tackling significant technical debt, this transition highlights that initial design is only half the battle. True system reliability comes from building "self-healing" structures, such as automated correction batches and standardized timeout chains, that can survive the unpredictable nature of live production environments.
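The INSERT-only principle and idempotency keys summarized above combine naturally: every state change appends a new row instead of updating an old one, and a re-delivered event is a no-op. The sketch below is an in-memory stand-in under stated assumptions (not the actual MySQL/Kafka implementation; all field and status names are hypothetical):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LedgerEntry:
    payment_id: str
    seq: int
    status: str           # e.g. "APPROVED", "CANCELLED" (assumed names)
    idempotency_key: str  # carried in the event header, per the post

class InsertOnlyLedger:
    """Toy append-only ledger: rows are never updated or deleted."""

    def __init__(self):
        self._rows: list[LedgerEntry] = []
        self._seen_keys: set[str] = set()

    def append(self, payment_id: str, status: str,
               idempotency_key: str) -> bool:
        # Duplicate delivery (e.g. a re-sent event) is silently ignored.
        if idempotency_key in self._seen_keys:
            return False
        self._seen_keys.add(idempotency_key)
        seq = sum(1 for r in self._rows if r.payment_id == payment_id) + 1
        self._rows.append(LedgerEntry(payment_id, seq, status,
                                      idempotency_key))
        return True

    def current_status(self, payment_id: str) -> str:
        # Current state is derived from history, never overwritten in place.
        rows = [r for r in self._rows if r.payment_id == payment_id]
        return rows[-1].status if rows else "UNKNOWN"

    def history(self, payment_id: str) -> list[str]:
        return [r.status for r in self._rows if r.payment_id == payment_id]

ledger = InsertOnlyLedger()
ledger.append("p-100", "APPROVED", "evt-1")
ledger.append("p-100", "APPROVED", "evt-1")   # duplicate event: ignored
ledger.append("p-100", "CANCELLED", "evt-2")
print(ledger.history("p-100"))        # full audit trail survives
print(ledger.current_status("p-100"))
```

Deriving the current state from the row history is what preserves the audit trail; in the real system, uniqueness of the idempotency key would be enforced by the database, not by application memory.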


Working as a Silo QA at Toss Place

Toss Place implements a dual-role QA structure where managers are embedded directly within product Silos from the initial planning stages to final deployment. This shift moves QA from a final-stage bottleneck to a proactive partner that enhances delivery speed and stability through deep historical context and early risk mitigation. Consequently, the organization has transitioned to a culture where quality is viewed as a shared team responsibility rather than a siloed functional task.

### Integrating QA into Product Silos

* QA managers belong to both a central functional team and specific product units (Silos) to ensure they are involved in the entire product lifecycle.
* Participation begins at the OKR design phase, allowing QA to align testing strategies with specific product intentions and business goals.
* Early involvement enables accurate risk assessment and scope estimation, preventing the "shallow testing" that often occurs when QA only sees the final product.

### Optimizing Spec Reviews and Sanity Testing

* The team introduced a structured flow consisting of Spec Reviews followed by Q&A sessions to reduce repetitive discussions and information gaps.
* All specification changes are centralized in shared design tools (such as Deus) or messenger threads to ensure transparency across all roles.
* "Sanity Test" criteria were established where developers and QA agree on "Happy Case" validations and minimum spec requirements before development begins, ensuring everyone starts from the same baseline.

### Collaborative Live Monitoring

* Post-release checklists were developed to involve the entire Silo in live monitoring, overcoming the limitations of having a single QA manager per unit.
* This collaborative approach encourages non-technical roles to interact with the live product, reinforcing the culture that quality is a collective team responsibility.

### Streamlining Issue Tracking and Communication

* The team implemented a "Send to Notion" workflow to instantly capture messenger-based feedback and ideas into a structured, prioritized backlog.
* To reduce communication fragmentation, they transitioned from Jira to integrated Messenger Lists and Canvases, which allowed for centralized discussions and faster issue resolution.
* Backlogs are prioritized based on user experience impact and release urgency, ensuring that critical bugs are addressed while minor improvements are tracked for future cycles.

The success of these initiatives demonstrates that QA effectiveness is driven by integration and autonomy rather than rigid adherence to specific tools. To achieve both high velocity and high quality, organizations should empower QA professionals to act as product peers who can flexibly adapt their processes to the unique needs and data-driven goals of their specific product teams.