Toss


toss.tech


Will developers be replaced by AI?

The current AI hype cycle is a significant economic bubble where massive infrastructure investments of $560 billion far outweigh the modest $35 billion in generated revenue. However, drawing parallels to the 1995 dot-com era, the author argues that while short-term expectations are overblown, the long-term transformation of the developer role is inevitable. The conclusion is that developers won't be replaced but will instead evolve into "Code Creative Directors" who manage AI through the lens of technical abstraction and delegation.

### The Economic Bubble and Amara's Law

* The industry is experiencing a 16:1 imbalance between AI investment and revenue, with 95% of generative AI implementations reportedly failing to deliver clear efficiency improvements.
* Amara's Law suggests that we are overestimating AI's short-term impact while potentially underestimating its long-term necessity.
* Much of the current "AI-driven" job market contraction is actually a result of companies cutting personnel costs to fund expensive GPU infrastructure and AI research.

### Jevons Paradox and the Evolution of Roles

* Jevons Paradox indicates that as the "cost" of producing code drops due to AI efficiency, the total demand for software and the complexity of systems will paradoxically increase.
* The developer's identity is shifting from "code producer" to "system architect," focusing on agent orchestration, result verification, and high-level design.
* AI functions as a "power tool" similar to game engines, allowing small teams to achieve professional-grade output while amplifying the capabilities of senior engineers.

### Delegation as a Form of Abstraction

* Delegating a task to AI is an act of "work abstraction," which involves choosing which low-level details a developer can afford to ignore.
* The technical boundary of what is "hard to delegate" is constantly shifting; for example, a complex RAG (Retrieval-Augmented Generation) pipeline built for GPT-4 might become obsolete with the release of a more capable model like GPT-5.
* The focus for developers must shift from "what is easy to delegate" to "what *should* be delegated," distinguishing between routine boilerplate and critical human judgment.

### The Risks of Premature Abstraction

* Abstraction does not eliminate complexity; it simply moves it into the future. If the underlying assumptions of an AI-generated system change, the abstraction "leaks" or breaks.
* Sudden shifts in scaling (traffic surges), regulation (GDPR updates), or security (zero-day vulnerabilities) expose the limitations of AI-delegated work, requiring senior intervention.
* Poorly managed AI delegation can lead to "abstraction debt," where the cost of fixing a broken AI-generated system exceeds the cost of having written it manually from the start.

To thrive in this environment, developers should embrace AI not as a replacement, but as a layer of abstraction. Success requires mastering the ability to define clear boundaries for AI—delegating routine CRUD operations and boilerplate while retaining human control over architecture, security, and complex business logic.

How I Tore Down Our Legacy

Toss Payments modernized its inherited legacy infrastructure by building an OpenStack-based private cloud to operate alongside public cloud providers in an Active-Active hybrid configuration. By overcoming extreme technical debt—including servers burdened with nearly 2,000 manual routing entries—the team achieved a cloud-agnostic deployment environment that ensures high availability and cost efficiency. The transformation demonstrates how a small team can successfully implement complex open-source infrastructure through automation and the rigorous technical internalization of Cluster API and OpenStack.

### The Challenge of Legacy Networking

- The inherited infrastructure relied on server-side routing rather than network equipment, meaning every server carried its own routing table.
- Some legacy servers contained 1,997 individual routing entries, making manual management nearly impossible and preventing efficient scaling.
- Initial attempts to solve this via public cloud (AWS) faced limitations, including rising costs due to exchange rates, lack of deep visibility for troubleshooting, and difficulties in disaster recovery (DR) configuration between public and on-premise environments.

### Scaling OpenStack with a Two-Person Team

- Despite having only two engineers with no prior OpenStack experience, the team chose the open-source platform to maintain 100% control over the infrastructure.
- The team internalized the technology by installing three different versions of OpenStack dozens of times and simulating various failure scenarios.
- Automation was prioritized using Ansible and Terraform to manage the lifecycle of VMs and load balancers, enabling new instance creation in under 10 seconds.
- Deep technical tuning was applied, such as modifying the source code of the Octavia load balancer to output custom log formats required for their specific monitoring needs.

### High Availability and Monitoring Strategy

- To ensure reliability, the team built three independent OpenStack clusters operating in an Active-Active configuration.
- This architecture allows for immediate traffic redirection if a specific cluster fails, minimizing the impact on service availability.
- A comprehensive monitoring stack was implemented using Zabbix, Prometheus, Mimir, and Grafana to collect and visualize every essential metric across the private cloud.

### Managing Kubernetes with Cluster API

- To replicate the convenience of public cloud PaaS (like EKS), the team implemented Cluster API to manage the Kubernetes lifecycle.
- Cluster API treats Kubernetes clusters themselves as resources within a management cluster, allowing for standardized and rapid deployment across the private environment.
- This approach ensures that developers can deploy applications without needing to distinguish between the underlying cloud providers, fulfilling the goal of "cloud-agnostic" infrastructure.

### Practical Recommendation

For organizations dealing with massive technical debt or high public cloud costs, the Toss Payments model suggests that a "Private-First" hybrid approach is viable even with limited headcount. The key is to avoid proprietary black-box solutions and instead invest in the technical internalization of open-source tools like OpenStack and Cluster API, backed by an infrastructure-as-code philosophy to ensure scalability and reliability.

Toss Income QA Platform

Toss's QA team developed an internal "QA Platform" to solve the high barrier to entry associated with using Swagger for manual testing and data setup. By transforming complex, multi-step API calls into a simple, button-based GUI, the team successfully empowered non-QA members to perform self-verification. This shift effectively moved quality assurance from a final-stage bottleneck to a continuous, integrated part of the development process, significantly increasing product delivery speed.

### Lowering the Barrier to Test APIs

* Existing Swagger documentation was functionally complete but difficult for developers or planners to use due to the need for manual JSON editing and sequential API execution.
* The QA Platform does not create new APIs; instead, it provides a GUI layer over existing Swagger Test APIs to make them accessible without technical documentation.
* The system offers two distinct interfaces: "Normal Mode" for simplified, one-click testing and "Swagger Mode" for granular control over request bodies and parameters.

### From Manual Clicks to Automation and Management

* Phase 1 focused on visual accessibility, allowing users to trigger complex data states via buttons rather than manual API orchestration.
* Phase 2 integrates existing automation scripts into the platform, removing the need for local environment setups and allowing anyone to execute automated test suites.
* The final phase aims to transition into a comprehensive Test Management System (TMS) tailored to the team's specific workflow, reducing reliance on third-party external tools.

### Redefining Quality as a Design Choice

* By reducing the time and mental effort required to run a test, verification became a frequent, daily habit for the entire product team rather than a chore for the QA department.
* Lowering the "cost" of testing replaced guesswork with data-driven confidence, allowing the team to move faster during development.
* This initiative reflects a philosophical shift where quality is no longer viewed as a final checklist item but as a core structural element designed into the development lifecycle.

The primary takeaway for engineering teams is that the speed of a product is often limited by the friction of its testing process. By building internal tools that democratize testing capabilities—making them available to anyone regardless of their technical role—organizations can eliminate verification delays and foster a culture where quality is a shared responsibility.
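A rough sketch of the "one button over existing Swagger test APIs" idea: a named scenario maps to a sequence of test API calls that the platform fires in order. The endpoint paths, payloads, and base URL below are hypothetical, not Toss's actual APIs.

```ts
// One "button" on the platform corresponds to a named sequence of existing
// Swagger test API calls, so a planner can set up a data state in one click.
type ApiStep = { method: "GET" | "POST"; path: string; body?: unknown };

// Illustrative placeholder endpoints only.
const scenarios: Record<string, ApiStep[]> = {
  "refund-ready-user": [
    { method: "POST", path: "/test/users", body: { type: "freelancer" } },
    { method: "POST", path: "/test/income", body: { year: 2023, amount: 1_200_000 } },
    { method: "POST", path: "/test/refunds/precompute" },
  ],
};

async function runScenario(baseUrl: string, name: string): Promise<void> {
  for (const step of scenarios[name] ?? []) {
    const res = await fetch(`${baseUrl}${step.path}`, {
      method: step.method,
      headers: { "Content-Type": "application/json" },
      body: step.body ? JSON.stringify(step.body) : undefined,
    });
    if (!res.ok) throw new Error(`${step.method} ${step.path} failed: ${res.status}`);
  }
}

// "Normal Mode" reduces the whole sequence to a single click.
runScenario("https://qa.example.internal", "refund-ready-user").catch(console.error);
```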

Creating the New Face of Toss

Toss redesigned its brand persona graphics to transition from simple, child-like icons to more professional and inclusive human figures that better represent the brand's identity. This update aims to project a more trustworthy and intelligent image while ensuring the visual language is prepared for a global, multi-cultural audience. By balancing iconic simplicity with diverse representation, the new design system maintains brand consistency across various screen sizes and service contexts.

### Refining Proportions for Professionalism

* The team adjusted the vertical facial ratio to move away from a "child-like" impression, finding a balance that suggests maturity and intelligence without losing the icon's friendly nature.
* The placement of the eyes, nose, and mouth was meticulously tuned to maintain an iconic look while increasing the perceived level of trust.
* Structural improvements were made to the body, specifically refining the curves where the neck and shoulders meet to eliminate the unnatural "blocky" feel of previous versions.
* A short turtleneck was selected as the default attire to provide a clean, professional, and sophisticated look that works across different UI environments.

### Achieving Gender-Neutral Hairstyles

* The design team aimed for "neutrality" in hair design to prevent the characters from being categorized into specific gender roles.
* Several iterations were tested, including high-density detailed styles (which were too complex) and simple line-separated styles (which lacked visual density when scaled up).
* The final selection focuses on a clean silhouette that follows the head line while adding enough volume to ensure the graphic feels complete and high-quality at any size.

### Implementing Universal Skin Tones and Diversity

* To support Toss's expansion into global markets, the team moved away from a single skin tone that could be interpreted as a specific race.
* While a "neutral yellow" (similar to standard emojis) was considered, it was ultimately rejected because it felt inconsistent and jarring when displayed in larger formats within the app.
* Instead of a single "neutral" color, the team defined a palette of five distinct skin tones based on universal emoji standards.
* New guidelines were established to mix these different skin tones in scenes with multiple characters, fostering a sense of inclusivity and representation that reflects a diverse user base.

The evolution of the Toss persona illustrates that as a service grows, its visual language must move beyond simple aesthetics to address broader values like trust and inclusivity. Moving forward, the design system will continue to expand to ensure that no user feels excluded by age, gender, or race.

Managing Thousands of API/Batch Servers

Toss Payments manages thousands of API and batch server configurations that handle trillions of won in transactions, where a single typo in a JVM setting can lead to massive financial infrastructure failure. To solve the risks associated with manual "copy-paste" workflows and configuration duplication, the team developed a sophisticated system that treats configuration as code. By implementing layered architectures and dynamic templates, they created a testable, unified environment capable of managing complex hybrid cloud setups with minimal human error.

## Overlay Architecture for Hierarchical Control

* The team implemented a layered configuration system consisting of `global`, `cluster`, `phase`, and `application` levels.
* Settings are resolved by priority, where lower-level layers override higher-level defaults, allowing servers to inherit common settings while maintaining specific overrides.
* This structure allows the team to control environment-specific behaviors, such as disabling canary deployments in development environments, from a single centralized directory.
* The directory structure maps files 1:1 to their respective layers, ensuring that naming conventions drive the CI/CD application process.

## Solving Duplication with Template Patterns

* Standard YAML overlays often fail when dealing with long strings or arrays, such as `JVM_OPTION`, because changing a single value usually requires redefining the entire block.
* To prevent the proliferation of nearly identical environment variables, the team introduced a template pattern using placeholders like `{{MAX_HEAP}}`.
* Developers can modify specific parameters at the application layer while the core string remains defined at the global layer, significantly reducing the risk of typos.
* This approach ensures that critical settings, like G1GC parameters or heap region sizes, remain consistent across the infrastructure unless explicitly changed.

## Dynamic and Conditional Configuration Logic

* The system allows for "evolutionary" configurations where Python scripts can be injected to generate dynamic values, such as random JMX ports or data fetched from remote APIs.
* Advanced conditional logic was added to handle complex deployment scenarios, enabling environment variables to change their values automatically based on the target cluster name (e.g., different profiles for AWS vs. IDC).
* By treating configuration as a living codebase, the team can adapt to new infrastructure requirements without abandoning their core architectural principles.

## Reliable Batch Processing through Simplicity

* For batch operations handling massive settlement volumes, the team prioritized "appropriate technology" and simplicity to minimize failure points.
* They chose Jenkins for its low learning curve and reliability, despite its lack of native GitOps support.
* To address inconsistencies in manual UI entries and varying Java versions across machines, they standardized the batch infrastructure to ensure that high-stakes financial calculations are executed in a controlled, predictable environment.

The most effective way to manage large-scale infrastructure is to transition from static, duplicated configuration files to a dynamic, code-centric system. By combining an overlay architecture for hierarchy and a template pattern for granular changes, organizations can achieve the flexibility needed for hybrid clouds while maintaining the strict safety standards required for financial systems.
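A minimal sketch of the two mechanisms described above, layer priority and `{{MAX_HEAP}}`-style placeholders, assuming flat key-value layers and made-up JVM values (the real option strings and tooling differ).

```ts
// Resolve an environment by overlaying layers in priority order, then fill
// {{PLACEHOLDER}} tokens from the merged result.
type Layer = Record<string, string>;

// Later layers win: global < cluster < phase < application.
function resolve(...layers: Layer[]): Layer {
  const merged: Layer = Object.assign({}, ...layers);
  for (const [key, value] of Object.entries(merged)) {
    // Substitute {{NAME}} placeholders using the merged map itself.
    merged[key] = value.replace(/\{\{(\w+)\}\}/g, (_m: string, name: string) =>
      merged[name] ?? `{{${name}}}`,
    );
  }
  return merged;
}

// Illustrative values only.
const globalLayer: Layer = {
  JVM_OPTION: "-Xmx{{MAX_HEAP}} -XX:+UseG1GC -XX:G1HeapRegionSize={{REGION_SIZE}}",
  MAX_HEAP: "4g",
  REGION_SIZE: "16m",
};
const appLayer: Layer = { MAX_HEAP: "8g" }; // one application needs a bigger heap

console.log(resolve(globalLayer, appLayer).JVM_OPTION);
// -Xmx8g -XX:+UseG1GC -XX:G1HeapRegionSize=16m
```

The core `JVM_OPTION` string stays defined once at the global layer; the application layer only touches the single parameter it actually needs to change.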

Rethinking Design Systems

The Toss Design System (TDS) team argues that as organizations scale, design systems often become a source of friction rather than efficiency, leading teams to bypass them through "forking" or "detaching" components. To prevent this, TDS treats the design system as a product that must adapt to user demand rather than a set of rigid constraints to be enforced. By shifting from a philosophy of control to one of flexible expansion, they ensure that the system remains a helpful tool rather than an obstacle.

### The Limits of Control and System Fragmentation

* When a design system is too rigid, product teams often fork packages to make minor adjustments, which breaks the link to central updates and creates UI inconsistencies.
* Treating "system bypasses" as user errors is ineffective; instead, they should be viewed as unmet needs in the system's "supply."
* The goal of a modern design system should be to reduce the reason to bypass the system by providing natural extension points.

### Comparing Flat and Compound API Patterns

* **Flat Pattern:** These components hide internal structures and use props to manage variations (e.g., `title`, `description`). While easy to use, they suffer from "prop bloat" as more edge cases are added, making long-term maintenance difficult.
* **Compound Pattern:** This approach provides sub-components (e.g., `Card.Header`, `Card.Body`) for the user to assemble manually. This offers high flexibility for unexpected layouts but increases the learning curve and the amount of boilerplate code required.

### The Hybrid API Strategy

* TDS employs a hybrid approach, offering both Flat APIs for common, simple use cases and Compound APIs for complex, customized needs.
* Developers can choose a `FlatCard` for speed or a compound `Card` when they need to inject custom elements like badges or unique button placements.
* To avoid the burden of maintaining two separate codebases, TDS uses a "primitive" layer where the Flat API is simply a pre-assembled version of the Compound components.

Design systems should function as guardrails that guide developers toward consistency, rather than fences that stop them from solving product-specific problems. By providing flexible architecture that supports exceptions, a system can maintain its relevance and ensure that teams stay within the ecosystem even as their requirements evolve.
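A minimal sketch of the hybrid pattern with illustrative component names (the actual TDS API is not shown in the post): the Flat API is just a pre-assembled composition of the same compound primitives.

```tsx
import * as React from "react";

// Primitive layer.
function CardRoot({ children }: { children: React.ReactNode }) {
  return <section className="card">{children}</section>;
}
function CardHeader({ children }: { children: React.ReactNode }) {
  return <header className="card-header">{children}</header>;
}
function CardBody({ children }: { children: React.ReactNode }) {
  return <div className="card-body">{children}</div>;
}

// Compound API: callers assemble the primitives themselves.
export const Card = Object.assign(CardRoot, { Header: CardHeader, Body: CardBody });

// Flat API: a pre-assembled composition of the same primitives, so there is
// only one underlying codebase to maintain.
export function FlatCard({ title, description }: { title: string; description: string }) {
  return (
    <Card>
      <Card.Header>{title}</Card.Header>
      <Card.Body>{description}</Card.Body>
    </Card>
  );
}

// Usage: FlatCard for the common case, the compound Card when a badge or
// unusual layout is needed.
export const Simple = () => <FlatCard title="Balance" description="1,000 KRW" />;
export const Custom = () => (
  <Card>
    <Card.Header>
      Balance <span className="badge">New</span>
    </Card.Header>
    <Card.Body>1,000 KRW</Card.Body>
  </Card>
);
```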

Tax Refund Automation: AI

At Toss Income, QA Manager Suho Jung successfully automated complex E2E testing for diverse tax refund services by leveraging AI as specialized virtual team members. By shifting from manual coding to a "human-as-orchestrator" model, a single person achieved the productivity of a four-to-five-person automation team within just five months. This approach overcame the inherent brittleness of testing long, React-based flows that are subject to frequent policy changes and external system dependencies.

### Challenges in Tax Service Automation

The complexity of tax refund services presented unique hurdles that made traditional manual automation unsustainable:

* **Multi-Step Dependencies:** Each refund flow averages 15–20 steps involving internal systems, authentication providers, and HomeTax scraping servers, where a single timing glitch can fail the entire test.
* **Frequent UI and Policy Shifts:** Minor UI updates or new tax laws required total scenario reconfigurations, making hard-coded tests obsolete almost immediately.
* **Environmental Instability:** Issues such as "Target closed" errors during scraping, differing domain environments, and React-specific hydration delays caused constant test flakiness.

### Building an AI-Driven QA Team

Rather than using AI as a simple autocomplete tool, the project assigned specific "personas" to different AI models to handle distinct parts of the lifecycle:

* **SDET Agent (Claude Sonnet 4.5):** Acted as the lead developer, responsible for designing the Page Object Model (POM) architecture, writing test logic, and creating utility functions.
* **Documentation Specialist:** Automatically generated daily retrospectives and updated technical guides by analyzing daily git commits.
* **Git Master:** Managed commit history and PR descriptions to ensure high-quality documentation of the project's evolution.
* **Pair Programmers (Cursor & Codex):** Handled real-time troubleshooting, type errors, and comparative analysis of different test scripts.

### Technical Solutions for React and Policy Logic

The team implemented several sophisticated technical strategies to ensure test stability:

* **React Interaction Readiness:** To solve "Element is not clickable" errors, they developed a strategy that waits not just for visibility, but for event handlers to bind to the DOM (hydration).
* **Safe Interaction Fallbacks:** A standard `click` utility was created that attempts a Playwright click, then a native keyboard 'Enter' press, and finally a JS dispatch to ensure interactions succeed even during UI transitions.
* **Dynamic Consent Flow Utility:** A specialized system was built to automatically detect and handle varying "Terms of Service" agreements across different sub-services (Tax Secretary, Hidden Refund, etc.) through a single unified function.
* **Test Isolation:** Automated scripts were used to prevent `userNo` (test ID) collisions, ensuring 35+ complex scenarios could run in parallel without data interference.

### Integrated Feedback and Reporting

The automation was integrated directly into internal communication channels to create a tight feedback loop:

* **Messenger Notifications:** Every test run sends a report including execution time, test IDs, and environment data to the team's messenger.
* **Automated Failure Analysis:** When a test fails, the AI automatically posts the error log, the specific failed step, a tracking EventID, and a screenshot as a thread reply for immediate debugging.
* **Human-AI Collaboration:** This structure shifted the QA's role from writing code to discussing failures and policy changes within the messenger threads.

The success of this 5-month experiment suggests that for high-complexity environments, the future of QA lies in "AI Orchestration." Instead of focusing on writing selectors, QA engineers should focus on defining problems and managing the AI agents that build the architecture.
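A minimal sketch of the kind of "safe click" fallback utility described above, written against Playwright's public API; the timeout values and the exact order of checks in the team's utility are assumptions.

```ts
import { expect, type Locator, type Page } from "@playwright/test";

// Try a normal Playwright click first, then fall back to keyboard activation,
// and finally dispatch a synthetic click event.
export async function safeClick(page: Page, locator: Locator): Promise<void> {
  await expect(locator).toBeVisible();
  try {
    // 1) Normal click: Playwright waits for actionability before clicking.
    await locator.click({ timeout: 3_000 });
  } catch {
    try {
      // 2) Keyboard fallback: can succeed while a React transition is still
      //    settling and the element rejects pointer input.
      await locator.focus();
      await page.keyboard.press("Enter");
    } catch {
      // 3) Last resort: dispatch a click event directly on the element.
      await locator.dispatchEvent("click");
    }
  }
}
```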

Automating Service Vulnerability Analysis Using

Toss has developed a high-precision automated vulnerability analysis system by integrating Large Language Models (LLMs) with traditional security testing tools. By evolving their architecture from a simple prompt-based approach to a multi-agent system utilizing open-source models and static analysis, the team achieved over 95% accuracy in threat detection. This project demonstrates that moving beyond a technical proof-of-concept requires solving real-world constraints such as context window limits, output consistency, and long-term financial sustainability.

### Navigating Large Codebases with MCP

* Initial attempts to use RAG (Retrieval Augmented Generation) and repository compression tools failed because the LLM could not maintain complex code relationships within token limits.
* The team implemented a "SourceCode Browse MCP" (Model Context Protocol) which allows the LLM agent to dynamically query the codebase.
* By indexing the code, the agent can perform specific tool calls to find function definitions or variable usages only when necessary, effectively bypassing context window restrictions.

### Ensuring Consistency via SAST Integration

* Testing revealed that standalone LLMs produced inconsistent results, often missing known vulnerabilities or generating hallucinations across different runs.
* To solve this, the team integrated Semgrep, a Static Application Security Testing (SAST) tool, to identify all potential "Source-to-Sink" paths.
* Semgrep was chosen over CodeQL due to its lighter resource footprint and faster execution, acting as a structured roadmap that ensures the LLM analyzes every suspicious input path without omission.

### Optimizing Costs with Multi-Agent Architectures

* Analyzing every possible code path identified by SAST tools was prohibitively expensive due to high token consumption.
* The workflow was divided among three specialized agents: a Discovery Agent to filter out irrelevant paths, an Analysis Agent to perform deep logic checks, and a Verification Agent to confirm findings.
* This "sieve" strategy ensured that the most resource-intensive analysis was only performed on high-probability vulnerabilities, significantly reducing operational costs.

### Transitioning to Open Models for Sustainability

* Scaling the system to hundreds of services and daily PRs made proprietary cloud models financially unviable.
* After benchmarking models like Llama 3.1 and GPT-OSS, the team selected **Qwen3:30B** for its 100% coverage rate and high true-positive accuracy in vulnerability detection.
* To bridge the performance gap between open-source and proprietary models, the team utilized advanced prompt engineering, one-shot learning, and enforced structured JSON outputs to improve reliability.

To build a production-ready AI security tool, teams should focus on the synergy between specialized open-source models and traditional static analysis tools. This hybrid approach provides a cost-effective and sustainable way to achieve enterprise-grade accuracy while maintaining full control over the analysis infrastructure.
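A rough sketch of the "sieve" flow under stated assumptions: invoking `semgrep --json` and reading its `results` array reflects the real CLI output, but the ruleset choice, the prompts, and the `callLlm` client are placeholders rather than the team's actual pipeline.

```ts
import { execFileSync } from "node:child_process";

interface Finding { path: string; start: { line: number }; check_id: string }

// Run Semgrep and parse its JSON report into candidate source-to-sink paths.
function runSemgrep(target: string): Finding[] {
  const out = execFileSync("semgrep", ["--config", "auto", "--json", target], {
    encoding: "utf8",
    maxBuffer: 64 * 1024 * 1024,
  });
  return JSON.parse(out).results as Finding[];
}

// Three-stage sieve: cheap discovery filter, expensive analysis, verification.
async function analyze(
  target: string,
  callLlm: (prompt: string) => Promise<string>, // assumed model client
): Promise<Finding[]> {
  const confirmed: Finding[] = [];
  for (const f of runSemgrep(target)) {
    // Discovery agent: discard clearly irrelevant paths.
    const relevant = await callLlm(
      `Is ${f.check_id} at ${f.path}:${f.start.line} worth deep analysis? Answer yes or no.`,
    );
    if (!/yes/i.test(relevant)) continue;
    // Analysis agent: deep source-to-sink reasoning on the remaining paths.
    const verdict = await callLlm(`Analyze the source-to-sink path for ${f.check_id} in ${f.path}.`);
    // Verification agent: double-check the analysis before reporting.
    const check = await callLlm(`Verify this finding and answer confirmed or rejected:\n${verdict}`);
    if (/confirmed/i.test(check)) confirmed.push(f);
  }
  return confirmed;
}
```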

Creating Up-to-

Managing complex multi-page onboarding funnels often leads to documentation that quickly becomes decoupled from the actual codebase, creating confusion for developers. To solve this, the Toss team developed an automated system that uses static code analysis to generate funnel flowcharts that are never outdated. By treating the source code as the "Source of Truth," they successfully transformed hard-to-track navigation logic into a synchronized, visual map.

### The Limitations of Manual Documentation

* Manual diagrams fail to scale when a funnel contains high-frequency branching, such as the 82 distinct conditions found across 39 onboarding pages.
* Traditional documentation becomes obsolete within days of a code change because developers rarely prioritize updating external diagrams during rapid feature iterations.
* Complex conditional logic (e.g., branching based on whether a user is a representative or an agent) makes manual flowcharts cluttered and difficult to read.

### Static Analysis via AST

* The team chose static analysis over runtime analysis to capture all possible navigation paths simultaneously without the need to execute every branch of the code.
* They utilized the `ts-morph` library to parse TypeScript source code into an Abstract Syntax Tree (AST), which represents the code structure in a way the compiler understands.
* This method allows for a comprehensive scan of the project to identify every instance of navigation calls like `router.push()` or `router.replace()`.

### Engineering the Navigation Edge Data Structure

* A "Navigation Edge" data structure was designed to capture more than just the destination; it includes the navigation method, query parameters, and the exact line number in the source code.
* The system records the "context" of a transition by traversing the AST upwards from a navigation call to find the parent `if` statements or ternary operators, effectively documenting the business logic behind the path.
* By distinguishing between `push` (which adds to browser history) and `replace` (which does not), the documentation provides insights into the intended user experience and "back button" behavior.

### Tracking Hidden Navigation and Constants

* **Custom Hook Analysis:** Since navigation logic is often abstracted into hooks, the tool scans `import` declarations to follow and analyze logic within external hook files.
* **Constant Resolution:** Because developers use constants (e.g., `URLS.PAYMENT_METHOD`) rather than raw strings, the system parses the project's constant definition files to map these variables back to their actual URL paths.
* **Source Attribution:** The system flags whether a transition originated directly from a page component or an internal hook, making it easier for developers to locate the source of a specific funnel behavior.

### Conclusion

For teams managing complex user journeys, automating documentation through static analysis is a powerful way to eliminate technical debt and synchronization errors. By integrating this extraction logic into the development workflow, the codebase remains the definitive reference point while providing stakeholders with a clear, automated visual of the user experience.
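A minimal sketch of the extraction using `ts-morph`. The `router.push`/`router.replace` matching and the recorded fields (method, target, line, enclosing condition) mirror the summary; the file globs and the exact shape of the edge object are assumptions.

```ts
import { Project, SyntaxKind } from "ts-morph";

interface NavigationEdge {
  file: string;
  line: number;
  method: string;     // "push" | "replace"
  target: string;     // raw argument text, e.g. URLS.PAYMENT_METHOD
  condition?: string; // nearest enclosing if-condition, if any
}

const project = new Project({ tsConfigFilePath: "tsconfig.json" });
const edges: NavigationEdge[] = [];

for (const sourceFile of project.getSourceFiles(["src/**/*.ts", "src/**/*.tsx"])) {
  for (const call of sourceFile.getDescendantsOfKind(SyntaxKind.CallExpression)) {
    const expr = call.getExpression().getText();
    if (expr !== "router.push" && expr !== "router.replace") continue;
    edges.push({
      file: sourceFile.getFilePath(),
      line: call.getStartLineNumber(),
      method: expr.split(".")[1],
      target: call.getArguments()[0]?.getText() ?? "<unknown>",
      // Record the business-logic context by walking up to the nearest if.
      condition: call
        .getFirstAncestorByKind(SyntaxKind.IfStatement)
        ?.getExpression()
        .getText(),
    });
  }
}

console.log(JSON.stringify(edges, null, 2));
```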

Toss's AI Technology Recognized

Toss ML Engineer Jin-woo Lee presents FedLPA, a novel Federated Learning algorithm accepted at NeurIPS 2025 that addresses the critical challenges of data sovereignty and non-uniform data distributions. By allowing AI models to learn from localized data without transferring sensitive information across borders, this research provides a technical foundation for expanding services like Toss Face Pay into international markets with strict privacy regulations.

### The Challenge of Data Sovereignty in Global AI

* Traditional AI development requires centralizing data on a single server, which is often impossible due to international privacy laws and data sovereignty regulations.
* Federated Learning offers a solution by sending the model to the user's device (client) rather than moving the data, ensuring raw biometric information never leaves the local environment.
* Standard Federated Learning fails in real-world scenarios where data is non-IID (not independent and identically distributed), meaning user patterns in different countries or regions vary significantly.

### Overcoming Limitations in Category Discovery

* Existing models assume all users share similar data distributions and that all data classes are known beforehand, which leads to performance degradation when encountering new demographics.
* FedLPA incorporates Generalized Category Discovery (GCD) to identify both known classes and entirely "novel classes" (e.g., new fraud patterns or ethnic features) that were not present in the initial training set.
* This approach prevents the model from becoming obsolete as it encounters new environments, allowing it to adapt to local characteristics autonomously.

### The FedLPA Three-Step Learning Pipeline

* **Confidence-guided Local Structure Discovery (CLSD):** The system builds a similarity graph by comparing feature vectors of local data. It refines these connections using "high-confidence" samples—data points the model is certain about—to strengthen the quality of the relational map.
* **InfoMap Clustering:** Instead of requiring a human to pre-define the number of categories, the algorithm uses the InfoMap community detection method. This allows the client to automatically estimate the number of unique categories within its own local data through random walks on the similarity graph.
* **Local Prior Alignment (LPA):** The model uses self-distillation to ensure consistent predictions across different views of the same data. Most importantly, an LPA regularizer forces the model's prediction distribution to align with the "Empirical Prior" discovered in the clustering phase, preventing the model from becoming biased toward over-represented classes.

### Business Implications and Strategic Value

* **Regulatory Compliance:** FedLPA removes technical barriers to entry for markets like the EU or Southeast Asia by maintaining high model performance while strictly adhering to local data residency requirements.
* **Hyper-personalization:** Financial services such as Fraud Detection Systems (FDS) and Credit Scoring Systems (CSS) can be trained on local patterns, allowing for more accurate detection of region-specific scams or credit behaviors.
* **Operational Efficiency:** By enabling models to self-detect and learn from new patterns without manual labeling or central intervention, the system significantly reduces the cost and time required for global maintenance.
Implementing localized Federated Learning architectures like FedLPA is a recommended strategy for tech organizations seeking to scale AI services internationally while navigating the complex landscape of global privacy regulations and diverse data distributions.
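The summary describes the Local Prior Alignment regularizer only in words. As a generic illustration of what "aligning the prediction distribution with an empirical prior" typically looks like, and not necessarily the exact objective used in FedLPA, a KL-based alignment term can be written as:

$$
\bar{p} \;=\; \frac{1}{N}\sum_{i=1}^{N} p_\theta(y \mid x_i),
\qquad
\mathcal{L}_{\mathrm{LPA}} \;=\; \mathrm{KL}\!\left(\hat{\pi} \,\middle\|\, \bar{p}\right) \;=\; \sum_{c} \hat{\pi}_c \log \frac{\hat{\pi}_c}{\bar{p}_c},
$$

where $\hat{\pi}$ is the empirical class prior estimated by InfoMap clustering on the client's local data and $\bar{p}$ is the model's average predicted class distribution over that data. Minimizing a term of this form discourages the local model from collapsing onto over-represented classes.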

Customers Never Wait: How to

Toss Payments addressed the challenge of serving rapidly growing transaction data within a microservices architecture (MSA) by evolving their data platform from simple Elasticsearch indexing to a robust CQRS pattern. While Apache Druid initially provided high-performance time-series aggregation and significant cost savings, the team eventually integrated StarRocks to overcome limitations in data consistency and complex join operations. This architectural journey highlights the necessity of balancing real-time query performance with operational scalability and domain decoupling.

### Transitioning to MSA and Early Search Solutions

* The shift from a monolithic structure to MSA decoupled application logic but created "data silos" where joining ledgers across domains became difficult.
* The initial solution utilized Elasticsearch to index specific fields for merchant transaction lookups and basic refunds.
* As transaction volumes doubled between 2022 and 2024, the need for complex OLAP-style aggregations led to the adoption of a CQRS (Command Query Responsibility Segregation) architecture.

### Adopting Apache Druid for Time-Series Data

* Druid was selected for its optimization toward time-series data, offering low-latency aggregation for massive datasets.
* It provided a low learning curve by supporting Druid SQL and featured automatic bitmap indexing for all columns, including nested JSON keys.
* The system decoupled reads from writes, allowing the data team to serve billions of records without impacting the primary transaction databases' resources.

### Data Ingestion: Message Publishing over CDC

* The team chose a message publishing approach via Kafka rather than Change Data Capture (CDC) to minimize domain dependency.
* In this model, domain teams publish finalized data packets, reducing the data team's need to maintain complex internal business logic for over 20 different payment methods.
* This strategy simplified system dependencies and leveraged Druid's ability to automatically index incoming JSON fields.

### Infrastructure and Cost Optimization in AWS

* The architecture separates computing and storage, using AWS S3 for deep storage to keep costs low.
* Performance was optimized by using instances with high-performance local storage instead of network-attached EBS, resulting in up to 9x faster I/O.
* The team utilized Spot Instances for development and testing environments, contributing to a monthly cloud cost reduction of approximately 50 million KRW.

### Operational Challenges and Druid's Limitations

* **Idempotency and Consistency:** Druid struggled with native idempotency, requiring complex "Merge on Read" logic to handle duplicate messages or state changes.
* **Data Fragmentation:** Transaction cancellations often targeted old partitions, causing fragmentation; the team implemented a 60-second detection process to trigger automatic compaction.
* **Join Constraints:** While Druid supports joins, its capabilities are limited, making it difficult to link complex lifecycles across payment, purchase, and settlement domains.

### Hybrid Search and Rollup Performance

* To ensure high-speed lookups across 10 billion records, a hybrid architecture was built: Elasticsearch handles specific keyword searches to retrieve IDs, which are then used to fetch full details from Druid.
* Druid's "Rollup" feature was utilized to pre-aggregate data at ingestion time.
* Implementing Rollup reduced average query response times from tens of seconds to under 1 second, representing a 99% performance improvement for aggregate views.

### Moving Toward StarRocks

* To solve Druid's limitations regarding idempotency and multi-table joins, Toss Payments began transitioning to StarRocks.
* StarRocks provides a more stable environment for managing inconsistent events and simplifies the data flow by aligning with existing analytical infrastructure.
* This shift supports the need for a "Unified Ledger" that can track the entire lifecycle of a transaction—from payment to net profit—across disparate database sources.
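A rough sketch of the hybrid lookup described above, assuming a standard Elasticsearch `_search` request and Druid's SQL endpoint (`/druid/v2/sql` on the broker); the hostnames, index, table, and column names are made up for illustration.

```ts
// 1) Resolve a keyword search to transaction IDs in Elasticsearch,
// 2) then aggregate the full records for those IDs in Druid.
async function searchTransactions(keyword: string) {
  const esRes = await fetch("http://elasticsearch:9200/transactions/_search", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      size: 100,
      _source: false,
      query: { match: { merchant_name: keyword } },
    }),
  });
  const ids: string[] = (await esRes.json()).hits.hits.map((h: { _id: string }) => h._id);
  if (ids.length === 0) return [];

  // Quote and escape IDs for the SQL IN list.
  const inList = ids.map((id) => `'${id.replace(/'/g, "''")}'`).join(",");
  const druidRes = await fetch("http://druid-broker:8082/druid/v2/sql", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      query: `SELECT transaction_id, SUM(amount) AS amount
              FROM transactions
              WHERE transaction_id IN (${inList})
              GROUP BY transaction_id`,
    }),
  });
  return druidRes.json();
}
```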

Painting the Wheels of a Moving

Toss Design System (TDS) underwent its first major color system overhaul in seven years to address deep-seated issues with perceptual inconsistency and fragmented cross-platform management. By transitioning to a perceptually uniform color space and an automated token pipeline, the team established a scalable infrastructure capable of supporting the brand's rapid expansion into global markets and diverse digital environments.

### Legacy Issues in Color Consistency

* **Uneven luminosity across hues:** Colors sharing the same numerical value (e.g., Grey 100 and Blue 100) exhibited different perceptual brightness levels, leading to "patchy" layouts when used together.
* **Discrepancies between Light and Dark modes:** Specific colors, such as Teal 50, appeared significantly more vibrant in dark mode than in light mode, forcing designers to manually customize colors for different themes.
* **Accessibility hurdles:** Low-contrast colors often became invisible on low-resolution devices or virtual environments, failing to meet consistent accessibility standards.

### Technical Debt and Scaling Barriers

* **Interconnected palettes:** Because the color scales were interdependent, modifying a single color required re-evaluating the entire palette across all hues and both light/dark modes.
* **Fragmentation of truth:** Web, native apps, and design editors managed tokens independently, leading to "token drift" where certain colors existed on some platforms but not others.
* **Business expansion pressure:** As Toss moved toward becoming a "super-app" and entering global markets, the manual process of maintaining design consistency became a bottleneck for development speed.

### Implementing Perceptually Uniform Color Spaces

* **Adopting OKLCH:** Toss shifted from traditional HSL models to OKLCH to ensure that colors with the same lightness values are perceived as equally bright by the human eye.
* **Automated color logic:** The team developed an automation logic that extracts accessible color combinations (backgrounds, text, and assets) for any input color, allowing third-party mini-apps to maintain brand identity without sacrificing accessibility.
* **Chroma clamping:** To ensure compatibility with standard RGB displays, the system utilizes chroma clamping to maintain intended hue and lightness even when hardware limitations arise.

### Refined Visual Correction and Contrast

* **Solving the "Dark Yellow Problem":** Since mathematically consistent yellow often appears muddy or loses its "yellowness" at higher contrast levels, the team applied manual visual corrections to preserve the color's psychological impact.
* **APCA-based Dark Mode optimization:** Utilizing the Advanced Perceptual Contrast Algorithm (APCA), the team increased contrast ratios in dark mode to compensate for human optical illusions and improve legibility at low screen brightness.

### Designer-Led Automation Pipeline

* **Single Source of Truth:** By integrating Token Studio (Figma plugin) with GitHub, the team created a unified repository where design changes are synchronized across all platforms simultaneously.
* **Automated deployment:** Designers can now commit changes and generate pull requests directly; pre-processing scripts then transform these tokens into platform-specific code for web, iOS, and Android without requiring manual developer intervention.

The transition to a token-based, automated color system demonstrates that investing in foundational design infrastructure is essential for long-term scalability. For organizations managing complex, multi-platform products, adopting perceptually uniform color spaces like OKLCH can significantly reduce design debt and improve the efficiency of cross-functional teams.
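A small sketch of the kind of token pre-processing step described above: one OKLCH-based source of truth is transformed into web and native outputs. The token names and values are illustrative, not the actual TDS palette or pipeline.

```ts
// Lightness 0–1, chroma, hue in degrees.
type OklchToken = { l: number; c: number; h: number };

// Single source of truth (illustrative values).
const tokens: Record<string, OklchToken> = {
  "grey-100": { l: 0.95, c: 0.01, h: 260 },
  "blue-100": { l: 0.95, c: 0.05, h: 255 }, // same perceived lightness as grey-100
  "blue-500": { l: 0.62, c: 0.19, h: 255 },
};

// Web output: CSS custom properties using the native oklch() color function.
function toCss(all: Record<string, OklchToken>): string {
  return Object.entries(all)
    .map(([name, { l, c, h }]) => `  --color-${name}: oklch(${l} ${c} ${h});`)
    .join("\n");
}

// Native output: a flat JSON payload that a mobile build step could consume.
function toNativeJson(all: Record<string, OklchToken>): string {
  return JSON.stringify(all, null, 2);
}

console.log(`:root {\n${toCss(tokens)}\n}`);
console.log(toNativeJson(tokens));
```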

Legacy Settlement Overhaul: From

Toss Payments recently overhauled its 20-year-old legacy settlement system to overcome deep-seated technical debt and prepare for massive transaction growth. By shifting from monolithic SQL queries and aggregated data to a granular, object-oriented architecture, the team significantly improved system maintainability, traceability, and batch processing performance. The transition focused on breaking down complex dependencies and ensuring that every transaction is verifiable and reproducible.

### Replacing Monolithic SQL with Object-Oriented Logic

* The legacy system relied on a "giant common query" filled with nested `DECODE`, `CASE WHEN`, and complex joins, making it nearly impossible to identify the impact of small changes.
* The team applied a "Divide and Conquer" strategy, splitting the massive query into distinct domains and refined sub-functions.
* Business logic was moved from the database layer into Kotlin-based objects (e.g., `SettlementFeeCalculator`), making business rules explicit and easier to test.
* This modular approach allowed for "Incremental Migration," where specific features (like exchange rate conversions) could be upgraded to the new system independently.

### Improving Traceability through Granular Data Modeling

* The old system stored data in an aggregated state (Sum), which prevented developers from tracing errors back to specific transactions or reusing data for different reporting needs.
* The new architecture manages data at the minimum transaction unit (1:1), ensuring that every settlement result corresponds to a specific transaction.
* "Setting Snapshots" were introduced to store the exact contract conditions (fee rates, VAT status) at the time of calculation, allowing the system to reconstruct the context of past settlements.
* A state-based processing model was implemented to enable selective retries for failed transactions, significantly reducing recovery time compared to the previous "all-or-nothing" transaction approach.

### Optimizing High-Resolution Data and Query Performance

* Managing data at the transaction level led to an explosion in data volume, necessitating specialized database strategies.
* The team implemented date-based Range Partitioning and composite indexing on settlement dates to maintain high query speeds despite the increased scale.
* To balance write performance and read needs, they created "Query-specific tables" that offload the processing burden from the main batch system.
* Complex administrative queries were delegated to a separate high-performance data serving platform, maintaining a clean separation between core settlement logic and flexible data analysis.

### Resolving Batch Performance and I/O Bottlenecks

* The legacy batch system struggled with long processing times that scaled poorly with transaction growth due to heavy I/O and single-threaded processing.
* I/O was minimized by caching merchant contract information in memory at the start of a batch step, eliminating millions of redundant database lookups.
* The team optimized the `ItemProcessor` in Spring Batch by implementing bulk lookups (using a Wrapper structure) to handle multiple records at once rather than querying the database for every individual item.

This modernization demonstrates that scaling a financial system requires moving beyond "convenient" aggregations toward a granular, state-driven architecture. By decoupling business logic from the database and prioritizing data traceability, Toss Payments has built a foundation capable of handling the next generation of transaction volumes.
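The post describes Kotlin objects such as `SettlementFeeCalculator`; the sketch below restates the snapshot idea in TypeScript for illustration, with the field names, rounding, and VAT handling assumed rather than taken from the actual system.

```ts
// Contract terms captured at calculation time, stored 1:1 with each result so
// any past settlement can be reconstructed and selectively retried.
interface SettingSnapshot {
  feeRate: number;      // e.g. 0.029 for 2.9%
  vatRate: number;      // e.g. 0.1 for 10% VAT
  vatIncluded: boolean; // whether the fee already includes VAT
  capturedAt: string;   // when the contract terms were snapshotted
}

interface Transaction { id: string; amount: number }

interface SettlementResult {
  transactionId: string;
  fee: number;
  vat: number;
  payout: number;
  snapshot: SettingSnapshot;
}

// Settle a single transaction against its snapshot (minimum transaction unit).
function settle(tx: Transaction, snapshot: SettingSnapshot): SettlementResult {
  const fee = Math.round(tx.amount * snapshot.feeRate);
  const vat = snapshot.vatIncluded ? 0 : Math.round(fee * snapshot.vatRate);
  return { transactionId: tx.id, fee, vat, payout: tx.amount - fee - vat, snapshot };
}

// Usage: because every input that produced a result is stored alongside it,
// the result can be re-derived or retried without re-reading live contracts.
const result = settle(
  { id: "tx-1", amount: 50_000 },
  { feeRate: 0.029, vatRate: 0.1, vatIncluded: false, capturedAt: "2024-01-01T00:00:00Z" },
);
console.log(result);
```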

Improving Work Efficiency: Revisiting

The Toss Research Platform team argues that operational efficiency is best achieved by decomposing complex workflows into granular, atomic actions rather than attempting massive systemic overhauls. By systematically questioning the necessity of each step and prioritizing improvements based on stakeholder impact rather than personal workload, teams can eliminate significant waste through incremental automation. This approach demonstrates that even minor reductions in repetitive manual tasks can lead to substantial gains in team-wide productivity as an organization scales.

### Granular Action Mapping

* Break down workflows into specific physical or digital actions—such as clicks, data entries, and channel switches—rather than high-level phases.
* Document the "Who, Where, What, and Why" for every individual step to identify exactly where friction occurs.
* Include exception cases and edge scenarios in the process map to uncover hidden gaps in the current operating model.

### Questioning Necessity and Identifying Automation Targets

* Apply a critical filter to every mapped action by asking, "Why is this necessary?" to eliminate redundant tasks like manual cross-platform notifications.
* Distinguish between essential human-centric tasks and mechanical actions, such as calendar entry creation, that are ripe for automation.
* Address "micro-inefficiencies" that appear insignificant in isolation but aggregate into major resource drains when repeated multiple times daily across a large team.

### Stakeholder-Centric Prioritization

* Shift the criteria for optimization from personal convenience to the impact on the broader organization.
* Rank improvements based on three specific metrics: the number of people affected, the downstream influence on other workflows, and the total cumulative time consumed.
* Recognize that automating a "small" task for an operator can unlock significant time and clarity for dozens of participants and observers.

### Incremental Implementation and Risk Mitigation

* Avoid the "all-or-nothing" automation trap by deploying partial solutions that address solvable segments of a process immediately.
* Utilize designated test periods for process changes to monitor for risks, such as team members missing interviews due to altered notification schedules.
* Gather continuous feedback from stakeholders during small-scale experiments, allowing for iterative adjustments or quick reversals before a full rollout.

To scale operations effectively, start by breaking your current workload into its smallest possible components and identifying the most frequent manual repetitions. True efficiency often comes from these small, validated adjustments and consistent feedback loops rather than waiting to build a perfect, fully automated end-to-end system.

Enhancing Data Literacy for

Toss's Business Data Team addressed the lack of centralized insights into their business customer (BC) base by building a standardized Single Source of Truth (SSOT) data mart and an iterative Monthly BC Report. This initiative successfully unified fragmented data across business units like Shopping, Ads, and Pay, enabling consistent data-driven decision-making and significantly raising the organization's overall data literacy.

## Establishing a Single Source of Truth (SSOT)

- Addressed the inefficiency of fragmented data across various departments by integrating disparate datasets into a unified, enterprise-wide data mart.
- Standardized the definition of an "active" Business Customer through cross-functional communication and a deep understanding of how revenue and costs are generated in each service domain.
- Eliminated communication overhead by ensuring all stakeholders used a single, verified dataset rather than conflicting numbers from different business silos.

## Designing the Monthly BC Report for Actionable Insights

- Visualized monthly revenue trends by segmenting customers into specific tiers and categories, such as New, Churn, and Retained, to identify where growth or attrition was occurring.
- Implemented Cohort Retention metrics by business unit to measure platform stickiness and help teams understand which services were most effective at retaining business users.
- Provided granular raw data lists for high-revenue customers showing significant growth or churn, allowing operational teams to identify immediate action points.
- Refined reporting metrics through in-depth interviews with Product Owners (POs), Sales Leaders, and Domain Heads to ensure the data addressed real-world business questions.

## Technical Architecture and Validation

- Built the core SSOT data mart using Airflow for scalable data orchestration and workflow management.
- Leveraged Jenkins to handle the batch processing and deployment of the specific data layers required for the reporting environment.
- Integrated Tableau with SQL-based fact aggregations to automate the monthly refresh of charts and dashboards, ensuring the report remains a "living" document.
- Conducted "collective intelligence" verification meetings to check metric definitions, units, and visual clarity, ensuring the final report was intuitive for all users.

## Driving Organizational Change and Data Literacy

- Sparked a surge in data demand, leading to follow-up projects such as daily real-time tracking, Cross-Domain Activation analysis, and deeper funnel analysis for BC registrations.
- Transitioned the organizational culture from passive data consumption to active utilization, with diverse roles—including Strategy Managers and Business Marketers—now using BC data to prove their business impact.
- Maintained an iterative approach where the report format evolves every month based on stakeholder feedback, ensuring the data remains relevant to the shifting needs of the business.

Establishing a centralized data culture requires more than just technical infrastructure; it requires a commitment to iterative feedback and clear communication. By moving from fragmented silos to a unified reporting standard, data analysts can transform from simple "number providers" into strategic partners who drive company-wide literacy and growth.