Toss / AI

6 posts

toss

Will Developers Be Replaced by AI?

The current AI hype cycle is a significant economic bubble: massive infrastructure investments of $560 billion far outweigh the modest $35 billion in generated revenue. Drawing parallels to the 1995 dot-com era, however, the author argues that while short-term expectations are overblown, the long-term transformation of the developer role is inevitable. The conclusion is that developers won't be replaced but will instead evolve into "Code Creative Directors" who manage AI through the lens of technical abstraction and delegation.

### The Economic Bubble and Amara's Law

* The industry is experiencing a 16:1 imbalance between AI investment and revenue, with 95% of generative AI implementations reportedly failing to deliver clear efficiency improvements.
* Amara's Law suggests that we are overestimating AI's short-term impact while potentially underestimating its long-term necessity.
* Much of the current "AI-driven" job market contraction is actually a result of companies cutting personnel costs to fund expensive GPU infrastructure and AI research.

### Jevons Paradox and the Evolution of Roles

* Jevons Paradox indicates that as the "cost" of producing code drops due to AI efficiency, the total demand for software and the complexity of systems will paradoxically increase.
* The developer's identity is shifting from "code producer" to "system architect," focusing on agent orchestration, result verification, and high-level design.
* AI functions as a "power tool" similar to game engines, allowing small teams to achieve professional-grade output while amplifying the capabilities of senior engineers.

### Delegation as a Form of Abstraction

* Delegating a task to AI is an act of "work abstraction": choosing which low-level details a developer can afford to ignore.
* The technical boundary of what is "hard to delegate" is constantly shifting; for example, a complex RAG (Retrieval-Augmented Generation) pipeline built for GPT-4 might become obsolete with the release of a more capable model like GPT-5.
* The focus for developers must shift from "what is easy to delegate" to "what *should* be delegated," distinguishing between routine boilerplate and critical human judgment.

### The Risks of Premature Abstraction

* Abstraction does not eliminate complexity; it simply moves it into the future. If the underlying assumptions of an AI-generated system change, the abstraction "leaks" or breaks.
* Sudden shifts in scaling (traffic surges), regulation (GDPR updates), or security (zero-day vulnerabilities) expose the limitations of AI-delegated work, requiring senior intervention.
* Poorly managed AI delegation can lead to "abstraction debt," where the cost of fixing a broken AI-generated system exceeds the cost of having written it manually from the start.

To thrive in this environment, developers should embrace AI not as a replacement but as a layer of abstraction. Success requires mastering the ability to define clear boundaries for AI: delegating routine CRUD operations and boilerplate while retaining human control over architecture, security, and complex business logic.

toss

Tax Refund Automation: A Journal of AI-Driven UI Test Automation

At Toss Income, QA Manager Suho Jung successfully automated complex E2E testing for diverse tax refund services by leveraging AI as specialized virtual team members. By shifting from manual coding to a "human-as-orchestrator" model, a single person achieved the productivity of a four-to-five-person automation team within just five months. This approach overcame the inherent brittleness of testing long, React-based flows that are subject to frequent policy changes and external system dependencies.

### Challenges in Tax Service Automation

The complexity of tax refund services presented unique hurdles that made traditional manual automation unsustainable:

* **Multi-Step Dependencies:** Each refund flow averages 15–20 steps involving internal systems, authentication providers, and HomeTax scraping servers, where a single timing glitch can fail the entire test.
* **Frequent UI and Policy Shifts:** Minor UI updates or new tax laws required total scenario reconfigurations, making hard-coded tests obsolete almost immediately.
* **Environmental Instability:** Issues such as "Target closed" errors during scraping, differing domain environments, and React-specific hydration delays caused constant test flakiness.

### Building an AI-Driven QA Team

Rather than using AI as a simple autocomplete tool, the project assigned specific "personas" to different AI models to handle distinct parts of the lifecycle:

* **SDET Agent (Claude Sonnet 4.5):** Acted as the lead developer, responsible for designing the Page Object Model (POM) architecture, writing test logic, and creating utility functions.
* **Documentation Specialist:** Automatically generated daily retrospectives and updated technical guides by analyzing daily git commits.
* **Git Master:** Managed commit history and PR descriptions to ensure high-quality documentation of the project's evolution.
* **Pair Programmers (Cursor & Codex):** Handled real-time troubleshooting, type errors, and comparative analysis of different test scripts.

### Technical Solutions for React and Policy Logic

The team implemented several sophisticated technical strategies to ensure test stability:

* **React Interaction Readiness:** To solve "Element is not clickable" errors, they developed a strategy that waits not just for visibility but for event handlers to bind to the DOM (hydration).
* **Safe Interaction Fallbacks:** A standard `click` utility attempts a Playwright click, then a native keyboard 'Enter' press, and finally a JS dispatch, ensuring interactions succeed even during UI transitions (see the sketch after this summary).
* **Dynamic Consent Flow Utility:** A specialized system automatically detects and handles varying "Terms of Service" agreements across different sub-services (Tax Secretary, Hidden Refund, etc.) through a single unified function.
* **Test Isolation:** Automated scripts prevent `userNo` (test ID) collisions, ensuring 35+ complex scenarios can run in parallel without data interference.

### Integrated Feedback and Reporting

The automation was integrated directly into internal communication channels to create a tight feedback loop:

* **Messenger Notifications:** Every test run sends a report including execution time, test IDs, and environment data to the team's messenger.
* **Automated Failure Analysis:** When a test fails, the AI automatically posts the error log, the specific failed step, a tracking EventID, and a screenshot as a thread reply for immediate debugging.
* **Human-AI Collaboration:** This structure shifted the QA's role from writing code to discussing failures and policy changes within the messenger threads.

The success of this five-month experiment suggests that for high-complexity environments, the future of QA lies in "AI orchestration." Instead of focusing on writing selectors, QA engineers should focus on defining problems and managing the AI agents that build the architecture.
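The post describes the fallback chain in prose only; the team's actual utility is not shown. As a minimal, hypothetical sketch, here is how such a strategy could look in Playwright's Python API. The name `safe_click`, the timeout, and the hydration comment are assumptions for illustration, not Toss's implementation.

```python
# Minimal, hypothetical sketch of the fallback click strategy described
# above, using Playwright's sync Python API. Names and timeouts are
# illustrative; the team's real utility is not shown in the post.
from playwright.sync_api import Error as PlaywrightError
from playwright.sync_api import Page


def safe_click(page: Page, selector: str, timeout_ms: float = 5000) -> None:
    """Click `selector`, falling back through progressively blunter methods."""
    locator = page.locator(selector)

    # Wait for visibility first. A production version would also wait for
    # React hydration (e.g. page.wait_for_function(...) on an app-specific
    # readiness flag) so that event handlers are actually bound.
    locator.wait_for(state="visible", timeout=timeout_ms)

    # 1) Preferred: a normal Playwright click with actionability checks.
    try:
        locator.click(timeout=timeout_ms)
        return
    except PlaywrightError:
        pass

    # 2) Fallback: focus the element and press Enter, which can fire the
    # bound handler even while an animation briefly obscures the element.
    try:
        locator.focus()
        page.keyboard.press("Enter")
        return
    except PlaywrightError:
        pass

    # 3) Last resort: dispatch a synthetic click event in the page itself,
    # bypassing Playwright's actionability checks entirely.
    locator.dispatch_event("click")
```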

toss

Automating Service Vulnerability Analysis with LLMs #1

Toss has developed a high-precision automated vulnerability analysis system by integrating Large Language Models (LLMs) with traditional security testing tools. By evolving the architecture from a simple prompt-based approach to a multi-agent system built on open-source models and static analysis, the team achieved over 95% accuracy in threat detection. This project demonstrates that moving beyond a technical proof-of-concept requires solving real-world constraints such as context window limits, output consistency, and long-term financial sustainability.

### Navigating Large Codebases with MCP

* Initial attempts to use RAG (Retrieval-Augmented Generation) and repository compression tools failed because the LLM could not maintain complex code relationships within token limits.
* The team implemented a "SourceCode Browse MCP" (Model Context Protocol) which allows the LLM agent to dynamically query the codebase.
* By indexing the code, the agent can perform specific tool calls to find function definitions or variable usages only when necessary, effectively bypassing context window restrictions.

### Ensuring Consistency via SAST Integration

* Testing revealed that standalone LLMs produced inconsistent results, often missing known vulnerabilities or generating hallucinations across different runs.
* To solve this, the team integrated Semgrep, a Static Application Security Testing (SAST) tool, to identify all potential "source-to-sink" paths.
* Semgrep was chosen over CodeQL due to its lighter resource footprint and faster execution, acting as a structured roadmap that ensures the LLM analyzes every suspicious input path without omission.

### Optimizing Costs with Multi-Agent Architectures

* Analyzing every possible code path identified by SAST tools was prohibitively expensive due to high token consumption.
* The workflow was divided among three specialized agents: a Discovery Agent to filter out irrelevant paths, an Analysis Agent to perform deep logic checks, and a Verification Agent to confirm findings (a minimal sketch of this pipeline follows this summary).
* This "sieve" strategy ensured that the most resource-intensive analysis was only performed on high-probability vulnerabilities, significantly reducing operational costs.

### Transitioning to Open Models for Sustainability

* Scaling the system to hundreds of services and daily PRs made proprietary cloud models financially unviable.
* After benchmarking models like Llama 3.1 and GPT-OSS, the team selected **Qwen3:30B** for its 100% coverage rate and high true-positive accuracy in vulnerability detection.
* To bridge the performance gap between open-source and proprietary models, the team utilized advanced prompt engineering, one-shot learning, and enforced structured JSON outputs to improve reliability.

To build a production-ready AI security tool, teams should focus on the synergy between specialized open-source models and traditional static analysis tools. This hybrid approach provides a cost-effective and sustainable way to achieve enterprise-grade accuracy while maintaining full control over the analysis infrastructure.
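To make the "sieve" concrete, here is a minimal, hypothetical sketch of a Semgrep-to-LLM pipeline. The Semgrep CLI and its `--json` output are real; the `llm()` helper, the prompts, and the agent-splitting logic are placeholders for whatever model-serving stack a team actually uses. None of this is Toss's code.

```python
# Hypothetical sketch of the SAST-guided multi-agent "sieve" described
# above. Semgrep's CLI and --json output are real; llm() is a placeholder
# for the serving stack (e.g. a hosted Qwen3:30B forced to emit JSON).
import json
import subprocess


def semgrep_findings(repo_path: str) -> list:
    """Run Semgrep over a repository and return its findings as JSON."""
    result = subprocess.run(
        ["semgrep", "scan", "--config", "p/security-audit", "--json", repo_path],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)["results"]


def llm(prompt: str) -> dict:
    """Placeholder: send a prompt to the model and parse its JSON reply."""
    raise NotImplementedError


def analyze_repo(repo_path: str) -> list:
    confirmed = []
    for finding in semgrep_findings(repo_path):
        snippet = finding["extra"]["lines"]  # code along the flagged path

        # Discovery Agent: a cheap pass that discards irrelevant paths.
        triage = llm('Reply as JSON {"relevant": true/false}. Is this '
                     'source-to-sink path plausibly exploitable?\n' + snippet)
        if not triage["relevant"]:
            continue

        # Analysis Agent: expensive deep reasoning, run only on survivors.
        report = llm("Trace the data flow and describe any vulnerability "
                     "as JSON.\n" + snippet)

        # Verification Agent: independent confirmation to cut hallucinations.
        verdict = llm('Reply as JSON {"confirmed": true/false}. Verify this '
                      'finding:\n' + json.dumps(report))
        if verdict["confirmed"]:
            confirmed.append(report)
    return confirmed
```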

toss

Toss's AI Technology Recognized at NeurIPS 2025, the World's Most Prestigious AI Conference: The FedLPA Research

Toss ML Engineer Jin-woo Lee presents FedLPA, a novel Federated Learning algorithm accepted at NeurIPS 2025 that addresses the critical challenges of data sovereignty and non-uniform data distributions. By allowing AI models to learn from localized data without transferring sensitive information across borders, this research provides a technical foundation for expanding services like Toss Face Pay into international markets with strict privacy regulations.

### The Challenge of Data Sovereignty in Global AI

* Traditional AI development requires centralizing data on a single server, which is often impossible due to international privacy laws and data sovereignty regulations.
* Federated Learning offers a solution by sending the model to the user's device (client) rather than moving the data, ensuring raw biometric information never leaves the local environment.
* Standard Federated Learning fails in real-world scenarios where data is non-IID (not Independent and Identically Distributed), meaning user patterns in different countries or regions vary significantly.

### Overcoming Limitations in Category Discovery

* Existing models assume all users share similar data distributions and that all data classes are known beforehand, which leads to performance degradation when encountering new demographics.
* FedLPA incorporates Generalized Category Discovery (GCD) to identify both known classes and entirely "novel classes" (e.g., new fraud patterns or ethnic features) that were not present in the initial training set.
* This approach prevents the model from becoming obsolete as it encounters new environments, allowing it to adapt to local characteristics autonomously.

### The FedLPA Three-Step Learning Pipeline

* **Confidence-guided Local Structure Discovery (CLSD):** The system builds a similarity graph by comparing feature vectors of local data. It refines these connections using "high-confidence" samples (data points the model is certain about) to strengthen the quality of the relational map.
* **InfoMap Clustering:** Instead of requiring a human to pre-define the number of categories, the algorithm uses the InfoMap community detection method. This allows the client to automatically estimate the number of unique categories within its own local data through random walks on the similarity graph.
* **Local Prior Alignment (LPA):** The model uses self-distillation to ensure consistent predictions across different views of the same data. Most importantly, an LPA regularizer forces the model's prediction distribution to align with the "empirical prior" discovered in the clustering phase, preventing the model from becoming biased toward over-represented classes (a toy sketch of this idea follows this summary).

### Business Implications and Strategic Value

* **Regulatory Compliance:** FedLPA removes technical barriers to entry for markets like the EU or Southeast Asia by maintaining high model performance while strictly adhering to local data residency requirements.
* **Hyper-personalization:** Financial services such as Fraud Detection Systems (FDS) and Credit Scoring Systems (CSS) can be trained on local patterns, allowing for more accurate detection of region-specific scams or credit behaviors.
* **Operational Efficiency:** By enabling models to self-detect and learn from new patterns without manual labeling or central intervention, the system significantly reduces the cost and time required for global maintenance.

Implementing localized Federated Learning architectures like FedLPA is a recommended strategy for tech organizations seeking to scale AI services internationally while navigating the complex landscape of global privacy regulations and diverse data distributions.
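The post summarizes the LPA step in prose, and the paper's exact loss is not reproduced here. The following toy PyTorch sketch rests entirely on my own assumptions about the formulation: it shows only the two ingredients named above, a self-distillation consistency term between two views and a KL penalty aligning the batch-average prediction with the empirical prior from clustering. Names like `empirical_prior` and the weight `lam` are illustrative, not FedLPA's actual notation.

```python
# Toy sketch in the spirit of FedLPA's prior-alignment step. The paper's
# actual loss may differ; this only illustrates pulling the model's average
# predictions toward the class prior discovered by local clustering.
import torch
import torch.nn.functional as F


def lpa_loss(logits_v1: torch.Tensor,
             logits_v2: torch.Tensor,
             empirical_prior: torch.Tensor,
             lam: float = 1.0) -> torch.Tensor:
    """logits_v1/logits_v2: (batch, num_classes) logits for two augmented
    views of the same samples; empirical_prior: (num_classes,) class
    frequencies estimated from InfoMap clusters on this client."""
    p1 = F.softmax(logits_v1, dim=-1)
    p2 = F.softmax(logits_v2, dim=-1)

    # Self-distillation consistency: the two views should agree.
    consistency = F.kl_div(p1.log(), p2.detach(), reduction="batchmean")

    # Prior alignment: the batch-average prediction distribution should
    # match the locally discovered class prior, preventing collapse onto
    # over-represented classes.
    avg_pred = p1.mean(dim=0).clamp_min(1e-8)
    prior = empirical_prior.clamp_min(1e-8)
    alignment = torch.sum(prior * (prior.log() - avg_pred.log()))

    return consistency + lam * alignment
```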

toss

Toss Next ML Challenge: A Retrospective on Hosting the Ad Click Prediction (pCTR) ML Competition

Toss recently hosted the "Toss Next ML Challenge," a large-scale competition focused on predicting advertisement click-through rates (CTR) using real-world, anonymized data from the Toss app. By tasking over 2,600 participants with developing high-performance models under real-time serving constraints, the event surfaced innovative technical approaches to feature engineering and model ensembling.

### Designing a Real-World CTR Prediction Task

* The competition required participants to predict the probability of a user clicking a display ad based on a dataset of 10.7 million training samples.
* Data included anonymized features such as age, gender, ad inventory IDs, and historical user behavior.
* A primary technical requirement was real-time serving: models had to be optimized for fast inference to function within a live service environment.

### Overcoming Anonymization with Sequence Engineering

* To maintain data privacy while allowing external access, Toss provided anonymized features in a single flattened table, which limited participants' ability to perform traditional data joins.
* A complex, raw "sequence" feature was intentionally left unprocessed to serve as a differentiator for high-performing teams.
* Top-tier participants demonstrated extreme persistence by deriving up to 37 unique variables from this single sequence, including transition probabilities, unique token counts, and sequence lengths (a small illustrative sketch follows this summary).

### Winning Strategies and Technical Trends

* All of the top 30 teams used boosting-tree models (such as XGBoost or LightGBM), while deep learning was used only by a subset of participants.
* One standout solution used a massive ensemble of 260 different models, providing a fresh perspective on the limits of ensemble learning for predictive accuracy.
* Performance was largely driven by the ability to extract meaningful signals from anonymized data through rigorous cross-validation and creative feature interactions.

The results of the Toss Next ML Challenge suggest that even when anonymization strips away domain-specific context, meticulous feature engineering and robust tree-based architectures remain the gold standard for tabular data. For ML engineers, the competition underscores that the key to production-ready models lies in balancing complex ensembling with the strict latency requirements of real-time serving.
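None of the winning code appears in the post, so the sketch below is a small, hypothetical pandas example of the kind of sequence-derived features it mentions (sequence length, unique token count, a transition probability). The comma-separated `seq` encoding and all column names are assumptions, not the competition's actual schema.

```python
# Hypothetical sketch of sequence feature engineering of the kind the top
# teams used. The comma-separated "seq" encoding and column names are
# assumptions, not the competition's actual data format.
from collections import Counter

import pandas as pd


def sequence_features(seq: str) -> dict:
    tokens = seq.split(",")
    n = len(tokens)

    # Basic shape features.
    feats = {
        "seq_len": n,
        "unique_tokens": len(set(tokens)),
        "last_token": tokens[-1],
    }

    # Transition probability of the most frequent adjacent pair: how often
    # the user repeats the same step-to-step move within the sequence.
    pairs = Counter(zip(tokens, tokens[1:]))
    feats["top_transition_prob"] = max(pairs.values()) / (n - 1) if n > 1 else 0.0
    return feats


# Toy usage: expand one raw sequence column into several model features.
df = pd.DataFrame({"seq": ["a,b,a,b,c", "x,x,x", "a"]})
features = df["seq"].apply(sequence_features).apply(pd.Series)
df = pd.concat([df, features], axis=1)
print(df)
```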

toss

In an Era When Anyone Can Do Research: A UX Researcher's Survival Guide

In an era where AI moderators and non-researchers handle the bulk of data collection, the role of the UX researcher has shifted from technical specialist to strategic guide. The core value of the researcher now lies in "UX Leadership": the ability to frame problems, align team perspectives, and define the fundamental identity of a product. By bridging the gap between business goals and user needs, researchers ensure that products solve real problems rather than just chasing metrics or technical feasibility.

### Setting the Framework in the Idea Phase

When starting a new project, a researcher's primary task is to establish the "boundaries of the puzzle" by shifting the team's focus from business impact to user value.

* **Case - AI Signal:** For a service that interprets stock market events using AI, the team initially focused on business metrics like retention and news consumption.
* **Avoiding "Metric Traps":** A researcher intervenes to prevent fatigue-inducing UX (e.g., excessive notifications to boost CTR) by defining the "North Star" as the specific problem the user is trying to solve.
* **The Checklist:** Once the user problem and value are defined, they serve as a persistent checklist for every design iteration and action item.

### Aligning Team Direction for Product Improvements

When a product already exists but needs improvement, different team members often have scattered, subjective opinions on what to fix. The researcher structures these thoughts into a cohesive direction.

* **Case - Stock Market Calendar:** While the team suggested UI changes like "it doesn't look like a calendar," the researcher refocused the effort on the user's ultimate goal: making better investment decisions.
* **Defining Success Criteria:** The team agreed on a "good usage" standard based on three stages: Awareness (recognizing issues) → Understanding (why it matters) → Preparation (adjusting investment plans).
* **Identifying Obstacles:** By identifying specific friction points, such as the lack of information hierarchy or the difficulty of interpreting complex indicators, the researcher moves the project from "simple UI cleanup" to "essential tool development."

### Redefining Product Identity During Stagnation

When a product's growth stalls, the issue often isn't a specific UI bug but a fundamental mismatch between the product's identity and its environment.

* **Case - Toss Securities PC:** Despite being functional, the PC version struggled because it initially tried to copy the "mobile simplicity" of the app.
* **Contextual Analysis:** Research revealed that while mobile users value speed and portability, PC users require an environment for deep analysis, multi-window comparisons, and deliberate decision-making.
* **Consensus through Synthesis:** The researcher integrates data, user interviews, and market trends into workshops to help the team decide where the product should "live" in the market. This process creates team-wide alignment on a new strategic direction rather than just fixing features.

The modern UX researcher must move beyond "crafting the tool" (interviewing and data gathering) and toward "UX Leadership." True expertise involves maintaining a broad view of the industry and product ecosystem, structuring team discussions to reach consensus, and ensuring that every product decision is rooted in a clear understanding of the user's context and goals.