Toss

36 posts

toss.tech

toss

Metric Review, Driving Execution

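Hello, I'm Jongik Park (박종익), and I lead the Data Platform Team at Toss Place. "The insight is clearly there, so why is execution so slow?" If you work in a data organization, you run into this question often. Analyses pile up and dashboards fill in, yet the pace at which the product or business actually changes often falls short of expectations. We wrestled with the same problem for a long time, and Metric Review is what grew out of it. Today, why we started Metric Review and how…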

toss

Automating Service Vulnerability Analysis using LLM #2

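*This post is based on work carried out on a research and development network. Hello, I'm Sangyoung Pyo (표상영), a Security Researcher at Toss. In the previous post, I briefly introduced the problems we ran into while automating service vulnerability analysis with LLMs, along with the solutions we found. Three months have already passed since I wrote it. In just those few months, AI's ability to analyze vulnerabilities has risen to a remarkably high level. With the technology advancing this steeply, my own attitude toward AI and my thinking about it have changed a great deal. In this post…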

toss

Foreign User Research: Why

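Do you know how Korea's financial system looks to foreigners? Search for “Korean Banking” on Reddit, the well-known US community, and you can see exactly the impression Korea's financial system leaves on them. They say it is hard to understand without someone's help and that the overall experience feels complicated. Perhaps that is why many foreign users could not use Toss properly even after signing up. If Toss's vision is “finance for everyone,” we believed foreigners should not be left out of it. So that foreigners, too, can comfortably…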

toss

From Intern to Solo Designer: Growth

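Hello, I'm Nuri Jeon (전누리), a Product Designer at Toss Bank. In this post, I'd like to share my experience of designing an experiment for the first time after joining Toss as an intern. My first assignment was to raise the sign-up conversion rate of Toss Bank non-members. Designing an experiment for the first time, I found three things hardest. 1️⃣ There are several stages with heavy drop-off, so where should I start improving? 2️⃣ Many experiments have already been run, so what more can I do? 3️⃣ How do I form a hypothesis that holds up? How I worked through these three questions…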

toss

The Software 3.0

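Is your team using the same LLM? Many development teams are adopting LLMs today, but viewed coldly, it is closer to 'every engineer for themselves.' Even with the same model and the same IDE, the difference in output is extreme. One engineer, with a strong grasp of 'Context Engineering,' gives the LLM a precise role and finishes a complex refactoring in 10 minutes. Another repeats simple questions and answers, wrestles with hallucinations, and wastes an hour. For example, the same reposi…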

toss

Easy-to-use Toss Front SDK

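Hello, I'm Juham Lee (이주함), a Frontend Developer at Toss Place. I develop the external integration SDK (Software Development Kit) for Toss Front (hereafter "Front"), the payment terminal built in-house at Toss Place. With this SDK, you can integrate Toss service data to develop the plugin app you want and run it on Front. In other words, it is a structure that can be extended without limit through third-party integration, driven by external partners' development rather than internal development. In this post…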

toss

From Perimeter Security to Zero

Toss Payments transformed its security infrastructure from a vulnerable, single-layered legacy system into a robust "Defense in Depth" architecture spanning hybrid IDC and AWS environments. By integrating advanced perimeter defense, internal server monitoring, and container runtime security, the team established a comprehensive framework that prioritizes visibility and continuous verification. This four-year journey demonstrates that modern security requires moving beyond simple boundary protection toward a proactive, multi-layered strategy that assumes breaches can occur.

### Perimeter Defense and SSL/TLS Visibility

* Addressed the critical visibility gap in legacy systems by implementing dedicated SSL/TLS decryption tools, allowing the team to analyze encrypted traffic for hidden malicious payloads.
* Established a hybrid security architecture using a combination of physical DDoS protection, IPS, and WAF in IDC environments, complemented by AWS WAF and AI-based GuardDuty in the cloud.
* Developed a collaborative merchant response process that moves beyond simple IP blocking; the system automatically detects malicious traffic from partners and provides them with detailed vulnerability reports and remediation guides (e.g., specific SQL injection points).

### Internal Network Security and "Assume Breach" Monitoring

* Implemented **Wazuh**, an open-source security platform, in IDC environments to monitor lateral movement, collect centralized logs, and perform file integrity checks across diverse operating systems.
* Leveraged **AWS GuardDuty** for intelligent threat detection in the cloud, focusing on malware scanning for EC2 instances and monitoring for suspicious process activities.
* Established automated detection for privilege escalation and unauthorized access to sensitive system files, such as tracking instances where root privileges are obtained to modify the `/etc/passwd` file.
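The file integrity checks mentioned above can be illustrated with a small baseline-and-compare sketch. This is not how Wazuh or the Toss Payments setup is implemented; it only shows the general idea, and the function names `snapshot` and `changed_files` are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Optional


def snapshot(paths: list[Path]) -> dict[str, Optional[str]]:
    """Map each path to the SHA-256 of its contents (None if unreadable)."""
    digests: dict[str, Optional[str]] = {}
    for path in paths:
        try:
            digests[str(path)] = hashlib.sha256(path.read_bytes()).hexdigest()
        except OSError:
            digests[str(path)] = None
    return digests


def changed_files(
    baseline: dict[str, Optional[str]], current: dict[str, Optional[str]]
) -> list[str]:
    """Return the paths whose digest differs from the baseline, sorted."""
    return sorted(p for p, digest in current.items() if baseline.get(p) != digest)


if __name__ == "__main__":
    # Demo on a throwaway file standing in for a sensitive file such as /etc/passwd.
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "passwd"
        target.write_text("root:x:0:0:root:/root:/bin/sh\n")
        baseline = snapshot([target])
        target.write_text(
            "root:x:0:0:root:/root:/bin/sh\nintruder:x:0:0::/:/bin/sh\n"
        )
        for path in changed_files(baseline, snapshot([target])):
            print(f"ALERT: {path} changed since baseline")
```

A real deployment would also record who made the change and with which privileges, which a hash comparison alone cannot tell you.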
### Container Runtime Security as the Final Defense

* Adopted **Falco**, a CNCF-hosted runtime security tool, to protect Kubernetes environments by monitoring system calls (syscalls) in real-time.
* Configured specific security rules to detect "container escape" attempts, unauthorized access to sensitive files like `/etc/shadow`, and the execution of new or suspicious binaries within running containers.
* Integrated **Falcosidekick** to manage security events efficiently, ensuring that anomalous behaviors at the container level are instantly routed to the security team for response.

### Zero Trust and Continuous Verification

* Shifted toward a Zero Trust model for the internal work network to ensure that all users and devices are continuously verified regardless of their location.
* Focused on implementing dynamic access control and the principle of least privilege to minimize the potential impact of credential theft or device compromise.

Organizations operating in hybrid cloud environments should move away from relying on a single perimeter and instead adopt a multi-layered defense strategy. True security resilience is achieved by gaining deep visibility into encrypted traffic and maintaining granular monitoring at the server and container levels to intercept threats that inevitably bypass initial defenses.

toss

6 Principles to Increase Marketing

Toss, a leader in the Korean fintech space, demonstrates that high marketing performance can be achieved without resorting to aggressive or deceptive copy. By analyzing hundreds of A/B tests within their app, they have identified specific UX writing patterns that prioritize user trust while significantly boosting engagement. The core conclusion is that clarity, psychological ease, and guaranteed rewards consistently outperform complex value propositions and exaggerated claims.

### The Power of One Core Message

* Focusing on a single, immediate action is more effective than listing multiple service benefits.
* In one test, replacing a complex benefit-driven headline with a simple "Take a 10-question test" resulted in a 10x increase in click-through rates (CTR).
* Complexity creates friction; users are more likely to engage when they understand exactly what the next step entails without distractions.

### Prioritizing Guaranteed Rewards

* Users show a stronger preference for "guaranteed small wins" over "potential big wins."
* A campaign promising a "Minimum 100 won" reward saw 20x more exposure than one promising "Up to 1 million won," as large numbers can trigger skepticism or feel unattainable.
* Phrases like "You will definitely get 1" outperform "Get as many as you want" because they provide a concrete promise rather than a vague possibility.

### Reducing Cognitive Load Through "Light" Language

* The choice of verbs significantly impacts the perceived effort of a task.
* Using "Prepare for travel insurance" instead of "Sign up for travel insurance" reduces the psychological burden, as "sign up" implies a long, bureaucratic process.
* "Light" verbs make the service feel faster and easier to complete, encouraging immediate action.

### Strategic Information Framing

* Clearly defining the nature of information (whether it is a "collection," a "list," or "new") helps users categorize the value quickly.
* Highlighting that a feature is "new" rather than explaining the specific benefits of the feature increased CTR by 6x.
* Using terms like "View collection" for loan products provides a sense of organized efficiency that appeals to users looking for consolidated information.

### Specificity in Action and Conditions

* Ambiguity leads to hesitation; providing exact numbers (e.g., "4 missions" or "8 blanks") increases conversion rates.
* Specifying the number of tasks makes a goal feel attainable and removes the fear of an open-ended time commitment.
* Quantifying the effort required (e.g., "takes 3 minutes") allows users to make an instant, frictionless decision to participate.

### Utilizing Intuitive, Everyday Experiences

* Copy that mirrors real-life physical actions is more intuitive for users.
* Changing a button from "View answer" to "Pick an answer" (accompanied by a stamp emoji) for an OX quiz significantly increased engagement by making the digital action feel more tactile and familiar.
* Leveraging common vocabulary ensures that users do not have to "translate" marketing speak into practical reality.

To maximize conversion, designers and writers should move away from broad marketing claims and toward radical specificity. By removing ambiguity and promising certain, low-effort outcomes, you can build a more effective and honest user experience.

toss

Welcoming the Era of

The tech industry is shifting from Software 1.0 (explicit logic) and 2.0 (neural networks) into Software 3.0, where natural language prompts and autonomous agents act as the primary programming interface. While Large Language Models (LLMs) are the engines of this era, they require a "Harness," a structured environment of tools and protocols, to perform real-world tasks effectively. This evolution does not render traditional engineering obsolete; instead, it demonstrates that robust architectural principles like layered design and separation of powers are essential for building reliable AI agents.

### The Evolution of Software 3.0

* Software 1.0 is defined by explicit "How" logic written in languages like Python or Java, while Software 2.0 focuses on weights and data in neural networks.
* Software 3.0, popularized by Andrej Karpathy, moves to "What" logic, where natural language prompts drive the execution.
* The "Harness" concept is critical: just as a horse needs a harness to be useful to a human, an LLM needs tools (CLI, API access, file systems) to move from a chatbot to a functional agent like Claude Code.

### Mapping Agent Architecture to Traditional Layers

* **Slash Commands as Controllers:** Tools like `/review` or `/refactor` act as entry points for user requests, similar to REST controllers in Spring or Express.
* **Sub-agents as the Service Layer:** Sub-agents coordinate multiple skills and maintain independent context, mirroring how services orchestrate domain objects and repositories.
* **Skills as Domain Components:** Following the Single Responsibility Principle (SRP), individual skills should handle one clear task (e.g., "generating tests") to prevent logic bloat.
* **MCP as Infrastructure/Adapters:** The Model Context Protocol (MCP) functions like the Repository or Adapter pattern, abstracting external systems like databases and APIs from the core logic.
* **CLAUDE.md as Configuration:** Project-specific rules and tech stacks are stored in metadata files, acting as the `package.json` or `pom.xml` of the agent environment.

### From Exceptions to Questions

* Traditional 1.0 software must have every branch of logic predefined; if an unknown state is reached, the system throws an exception or fails.
* Software 3.0 introduces Human-in-the-Loop (HITL), where "Exceptions" become "Questions," allowing the agent to ask for clarification on high-risk or ambiguous tasks.
* Effective agent design requires identifying when to act autonomously (reversible, low-risk tasks) versus when to delegate decisions to a human (deployments, deletions, or high-cost API calls).

### Managing Constraints: Tokens and Complexity

* In Software 3.0, tokens represent the "memory" (RAM) of the system; large codebases can lead to "token explosion," causing context overflow or high costs.
* Deterministic logic should be moved to external scripts rather than being interpreted by the LLM every time to save tokens and ensure consistency.
* To avoid "Skill Explosion" (similar to Class Explosion), developers should use "Progressive Disclosure," providing the agent with a high-level entry point and only loading detailed task knowledge when specifically required.

Traditional software engineering expertise, specifically in cohesion, coupling, and abstraction, is the most valuable asset when transitioning to Software 3.0. By treating prompt engineering and agent orchestration with the same architectural rigor as 1.0 code, developers can build agents that are scalable, maintainable, and truly useful.
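The "Exceptions become Questions" idea can be sketched as a small gate that decides whether an agent acts on its own or hands the decision to a human. The `Task` fields, the `decide` function, and the cost threshold are all assumptions made for illustration; the original post does not prescribe a specific policy.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ACT = "act"  # the agent proceeds autonomously
    ASK = "ask"  # the agent turns the situation into a question for a human


@dataclass(frozen=True)
class Task:
    name: str
    reversible: bool                 # can the effect be undone cheaply?
    high_risk: bool = False          # e.g. deployments or deletions
    estimated_cost_usd: float = 0.0  # e.g. expected API spend


def decide(task: Task, cost_limit_usd: float = 1.0) -> Decision:
    """Ask a human unless the task is reversible, low-risk, and cheap."""
    if task.high_risk or not task.reversible:
        return Decision.ASK
    if task.estimated_cost_usd > cost_limit_usd:
        return Decision.ASK
    return Decision.ACT


if __name__ == "__main__":
    for task in (
        Task("rename a local variable", reversible=True),
        Task("deploy to production", reversible=False, high_risk=True),
        Task("bulk-call a paid API", reversible=True, estimated_cost_usd=25.0),
    ):
        print(f"{task.name}: {decide(task).value}")
```

Defaulting to `ASK` whenever any single condition fails keeps the policy conservative: the agent has to earn autonomy on every axis instead of averaging them.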

toss

Will developers be replaced by AI?

The current AI hype cycle is a significant economic bubble where massive infrastructure investments of $560 billion far outweigh the modest $35 billion in generated revenue. However, drawing parallels to the 1995 dot-com era, the author argues that while short-term expectations are overblown, the long-term transformation of the developer role is inevitable. The conclusion is that developers won't be replaced but will instead evolve into "Code Creative Directors" who manage AI through the lens of technical abstraction and delegation.

### The Economic Bubble and Amara’s Law

* The industry is experiencing a 16:1 imbalance between AI investment and revenue, with 95% of generative AI implementations reportedly failing to deliver clear efficiency improvements.
* Amara’s Law suggests that we are overestimating AI's short-term impact while potentially underestimating its long-term necessity.
* Much of the current "AI-driven" job market contraction is actually a result of companies cutting personnel costs to fund expensive GPU infrastructure and AI research.

### Jevons Paradox and the Evolution of Roles

* Jevons Paradox indicates that as the "cost" of producing code drops due to AI efficiency, the total demand for software and the complexity of systems will paradoxically increase.
* The developer’s identity is shifting from "code producer" to "system architect," focusing on agent orchestration, result verification, and high-level design.
* AI functions as a "power tool" similar to game engines, allowing small teams to achieve professional-grade output while amplifying the capabilities of senior engineers.

### Delegation as a Form of Abstraction

* Delegating a task to AI is an act of "work abstraction," which involves choosing which low-level details a developer can afford to ignore.
* The technical boundary of what is "hard to delegate" is constantly shifting; for example, a complex RAG (Retrieval-Augmented Generation) pipeline built for GPT-4 might become obsolete with the release of a more capable model like GPT-5.
* The focus for developers must shift from "what is easy to delegate" to "what *should* be delegated," distinguishing between routine boilerplate and critical human judgment.

### The Risks of Premature Abstraction

* Abstraction does not eliminate complexity; it simply moves it into the future. If the underlying assumptions of an AI-generated system change, the abstraction "leaks" or breaks.
* Sudden shifts in scaling (traffic surges), regulation (GDPR updates), or security (zero-day vulnerabilities) expose the limitations of AI-delegated work, requiring senior intervention.
* Poorly managed AI delegation can lead to "abstraction debt," where the cost of fixing a broken AI-generated system exceeds the cost of having written it manually from the start.

To thrive in this environment, developers should embrace AI not as a replacement, but as a layer of abstraction. Success requires mastering the ability to define clear boundaries for AI: delegating routine CRUD operations and boilerplate while retaining human control over architecture, security, and complex business logic.

toss

Toss Income QA Platform: The Beginning

Toss's QA team developed an internal "QA Platform" to solve the high barrier to entry associated with using Swagger for manual testing and data setup. By transforming complex, multi-step API calls into a simple, button-based GUI, the team successfully empowered non-QA members to perform self-verification. This shift effectively moved quality assurance from a final-stage bottleneck to a continuous, integrated part of the development process, significantly increasing product delivery speed.

### Lowering the Barrier to Test APIs

* Existing Swagger documentation was functionally complete but difficult for developers or planners to use due to the need for manual JSON editing and sequential API execution.
* The QA Platform does not create new APIs; instead, it provides a GUI layer over existing Swagger Test APIs to make them accessible without technical documentation.
* The system offers two distinct interfaces: "Normal Mode" for simplified, one-click testing and "Swagger Mode" for granular control over request bodies and parameters.

### From Manual Clicks to Automation and Management

* Phase 1 focused on visual accessibility, allowing users to trigger complex data states via buttons rather than manual API orchestration.
* Phase 2 integrates existing automation scripts into the platform, removing the need for local environment setups and allowing anyone to execute automated test suites.
* The final phase aims to transition into a comprehensive Test Management System (TMS) tailored to the team's specific workflow, reducing reliance on third-party external tools.

### Redefining Quality as a Design Choice

* By reducing the time and mental effort required to run a test, verification became a frequent, daily habit for the entire product team rather than a chore for the QA department.
* Lowering the "cost" of testing replaced guesswork with data-driven confidence, allowing the team to move faster during development.
* This initiative reflects a philosophical shift where quality is no longer viewed as a final checklist item but as a core structural element designed into the development lifecycle.

The primary takeaway for engineering teams is that the speed of a product is often limited by the friction of its testing process. By building internal tools that democratize testing capabilities, making them available to anyone regardless of their technical role, organizations can eliminate verification delays and foster a culture where quality is a shared responsibility.
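To make the "one button instead of several sequential API calls" idea concrete, here is a minimal sketch of a named scenario that replays a fixed list of test-API calls. The scenario name, the endpoint paths, and the `run_scenario` helper are hypothetical and do not come from the Toss platform; the HTTP client is injected so the sketch runs without a network.

```python
from typing import Callable

# A client is any callable that takes (method, path, body) and returns a response dict.
ApiClient = Callable[[str, str, dict], dict]

# Each "button" maps to an ordered list of existing test-API calls (hypothetical paths).
SCENARIOS: dict[str, list[tuple[str, str, dict]]] = {
    "user_with_income_record": [
        ("POST", "/test/users", {"name": "qa-user"}),
        ("POST", "/test/income-records", {"amount": 1000}),
    ],
}


def run_scenario(name: str, call_api: ApiClient) -> list[dict]:
    """Run every step of a scenario in order and collect the responses."""
    return [call_api(method, path, body) for method, path, body in SCENARIOS[name]]


if __name__ == "__main__":
    def fake_client(method: str, path: str, body: dict) -> dict:
        # Stand-in for a real HTTP client; it only echoes the request.
        return {"status": 200, "request": f"{method} {path}"}

    for response in run_scenario("user_with_income_record", fake_client):
        print(response)
```

Because the steps are plain data, the same table could back both a one-click "Normal Mode" button and a more detailed view that exposes each request body for editing.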

toss

The story of how I destroyed

Toss Payments modernized its inherited legacy infrastructure by building an OpenStack-based private cloud to operate alongside public cloud providers in an Active-Active hybrid configuration. By overcoming extreme technical debt (including servers burdened with nearly 2,000 manual routing entries), the team achieved a cloud-agnostic deployment environment that ensures high availability and cost efficiency. The transformation demonstrates how a small team can successfully implement complex open-source infrastructure through automation and the rigorous technical internalization of Cluster API and OpenStack.

### The Challenge of Legacy Networking

- The inherited infrastructure relied on server-side routing rather than network equipment, meaning every server carried its own routing table.
- Some legacy servers contained 1,997 individual routing entries, making manual management nearly impossible and preventing efficient scaling.
- Initial attempts to solve this via public cloud (AWS) faced limitations, including rising costs due to exchange rates, lack of deep visibility for troubleshooting, and difficulties in disaster recovery (DR) configuration between public and on-premise environments.

### Scaling OpenStack with a Two-Person Team

- Despite having only two engineers with no prior OpenStack experience, the team chose the open-source platform to maintain 100% control over the infrastructure.
- The team internalized the technology by installing three different versions of OpenStack dozens of times and simulating various failure scenarios.
- Automation was prioritized using Ansible and Terraform to manage the lifecycle of VMs and load balancers, enabling new instance creation in under 10 seconds.
- Deep technical tuning was applied, such as modifying the source code of the Octavia load balancer to output custom log formats required for their specific monitoring needs.

### High Availability and Monitoring Strategy

- To ensure reliability, the team built three independent OpenStack clusters operating in an Active-Active configuration.
- This architecture allows for immediate traffic redirection if a specific cluster fails, minimizing the impact on service availability.
- A comprehensive monitoring stack was implemented using Zabbix, Prometheus, Mimir, and Grafana to collect and visualize every essential metric across the private cloud.

### Managing Kubernetes with Cluster API

- To replicate the convenience of Public Cloud PaaS (like EKS), the team implemented Cluster API to manage the Kubernetes lifecycle.
- Cluster API treats Kubernetes clusters themselves as resources within a management cluster, allowing for standardized and rapid deployment across the private environment.
- This approach ensures that developers can deploy applications without needing to distinguish between the underlying cloud providers, fulfilling the goal of "cloud-agnostic" infrastructure.

### Practical Recommendation

For organizations dealing with massive technical debt or high public cloud costs, the Toss Payments model suggests that a "Private-First" hybrid approach is viable even with limited headcount. The key is to avoid proprietary black-box solutions and instead invest in the technical internalization of open-source tools like OpenStack and Cluster API, backed by an "infrastructure-as-code" philosophy to ensure scalability and reliability.

toss

Creating Toss's new face

Toss redesigned its brand persona graphics to transition from simple, child-like icons to more professional and inclusive human figures that better represent the brand's identity. This update aims to project a more trustworthy and intelligent image while ensuring the visual language is prepared for a global, multi-cultural audience. By balancing iconic simplicity with diverse representation, the new design system maintains brand consistency across various screen sizes and service contexts.

### Refining Proportions for Professionalism

* The team adjusted the vertical facial ratio to move away from a "child-like" impression, finding a balance that suggests maturity and intelligence without losing the icon's friendly nature.
* The placement of the eyes, nose, and mouth was meticulously tuned to maintain an iconic look while increasing the perceived level of trust.
* Structural improvements were made to the body, specifically refining the curves where the neck and shoulders meet to eliminate the unnatural "blocky" feel of previous versions.
* A short turtleneck was selected as the default attire to provide a clean, professional, and sophisticated look that works across different UI environments.

### Achieving Gender-Neutral Hairstyles

* The design team aimed for "neutrality" in hair design to prevent the characters from being categorized into specific gender roles.
* Several iterations were tested, including high-density detailed styles (which were too complex) and simple line-separated styles (which lacked visual density when scaled up).
* The final selection focuses on a clean silhouette that follows the head line while adding enough volume to ensure the graphic feels complete and high-quality at any size.

### Implementing Universal Skin Tones and Diversity

* To support Toss's expansion into global markets, the team moved away from a single skin tone that could be interpreted as a specific race.
* While a "neutral yellow" (similar to standard emojis) was considered, it was ultimately rejected because it felt inconsistent and jarring when displayed in larger formats within the app.
* Instead of a single "neutral" color, the team defined a palette of five distinct skin tones based on universal emoji standards.
* New guidelines were established to mix these different skin tones in scenes with multiple characters, fostering a sense of inclusivity and representation that reflects a diverse user base.

The evolution of the Toss persona illustrates that as a service grows, its visual language must move beyond simple aesthetics to address broader values like trust and inclusivity. Moving forward, the design system will continue to expand to ensure that no user feels excluded by age, gender, or race.

toss

Managing Thousands of API/

Toss Payments manages thousands of API and batch server configurations that handle trillions of won in transactions, where a single typo in a JVM setting can lead to massive financial infrastructure failure. To solve the risks associated with manual "copy-paste" workflows and configuration duplication, the team developed a sophisticated system that treats configuration as code. By implementing layered architectures and dynamic templates, they created a testable, unified environment capable of managing complex hybrid cloud setups with minimal human error.

## Overlay Architecture for Hierarchical Control

* The team implemented a layered configuration system consisting of `global`, `cluster`, `phase`, and `application` levels.
* Settings are resolved by priority, where lower-level layers override higher-level defaults, allowing servers to inherit common settings while maintaining specific overrides.
* This structure allows the team to control environment-specific behaviors, such as disabling canary deployments in development environments, from a single centralized directory.
* The directory structure maps files 1:1 to their respective layers, ensuring that naming conventions drive the CI/CD application process.

## Solving Duplication with Template Patterns

* Standard YAML overlays often fail when dealing with long strings or arrays, such as `JVM_OPTION`, because changing a single value usually requires redefining the entire block.
* To prevent the proliferation of nearly identical environment variables, the team introduced a template pattern using placeholders like `{{MAX_HEAP}}`.
* Developers can modify specific parameters at the application layer while the core string remains defined at the global layer, significantly reducing the risk of typos.
* This approach ensures that critical settings, like G1GC parameters or heap region sizes, remain consistent across the infrastructure unless explicitly changed.
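The overlay and template ideas above can be combined in a short sketch. It assumes each layer is a flat key/value mapping and that placeholders use the `{{NAME}}` form quoted in the text; the `resolve` and `render` helpers, the `CANARY_ENABLED` key, and the sample values are hypothetical and are not the team's actual implementation.

```python
import re

# Layers from most general to most specific; later layers win.
LAYER_ORDER = ["global", "cluster", "phase", "application"]
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def resolve(layers: dict[str, dict[str, str]]) -> dict[str, str]:
    """Merge the layers so that more specific layers override general defaults."""
    merged: dict[str, str] = {}
    for name in LAYER_ORDER:
        merged.update(layers.get(name, {}))
    return merged


def render(template: str, values: dict[str, str]) -> str:
    """Fill {{NAME}} placeholders; fail loudly if a value is missing."""
    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in values:
            raise KeyError(f"no value for placeholder {key}")
        return values[key]

    return PLACEHOLDER.sub(substitute, template)


if __name__ == "__main__":
    layers = {
        "global": {
            "JVM_OPTION": "-Xmx{{MAX_HEAP}} -XX:+UseG1GC",  # defined once
            "MAX_HEAP": "2g",
            "CANARY_ENABLED": "true",
        },
        "phase": {"CANARY_ENABLED": "false"},  # e.g. a development phase
        "application": {"MAX_HEAP": "4g"},     # one service needs more heap
    }
    settings = resolve(layers)
    print(render(settings["JVM_OPTION"], settings))
    print(settings["CANARY_ENABLED"])
```

The application layer changes only `MAX_HEAP`; the long `JVM_OPTION` string is never retyped, which is the point of the template pattern. Raising on a missing placeholder is a deliberate choice here, since a silently empty heap setting is exactly the kind of typo-class failure the approach is meant to prevent.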
## Dynamic and Conditional Configuration Logic

* The system allows for "evolutionary" configurations where Python scripts can be injected to generate dynamic values, such as random JMX ports or data fetched from remote APIs.
* Advanced conditional logic was added to handle complex deployment scenarios, enabling environment variables to change their values automatically based on the target cluster name (e.g., different profiles for AWS vs. IDC).
* By treating configuration as a living codebase, the team can adapt to new infrastructure requirements without abandoning their core architectural principles.

## Reliable Batch Processing through Simplicity

* For batch operations handling massive settlement volumes, the team prioritized "appropriate technology" and simplicity to minimize failure points.
* They chose Jenkins for its low learning curve and reliability, despite its lack of native GitOps support.
* To address inconsistencies in manual UI entries and varying Java versions across machines, they standardized the batch infrastructure to ensure that high-stakes financial calculations are executed in a controlled, predictable environment.

The most effective way to manage large-scale infrastructure is to transition from static, duplicated configuration files to a dynamic, code-centric system. By combining an overlay architecture for hierarchy and a template pattern for granular changes, organizations can achieve the flexibility needed for hybrid clouds while maintaining the strict safety standards required for financial systems.
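As a rough illustration of the dynamic and conditional values described in this post, the sketch below lets an environment variable be a plain string, a generator function, or a rule keyed on the cluster name. The `ByCluster` type, the `materialize` helper, the `APP_PROFILE` variable, the cluster-name prefixes, and the port range are all assumptions for the example, not details from the original system.

```python
import random
from dataclasses import dataclass
from typing import Callable, Union


@dataclass(frozen=True)
class ByCluster:
    """Pick a value from the first rule whose prefix matches the cluster name."""
    rules: tuple[tuple[str, str], ...]  # (cluster-name prefix, value)
    default: str

    def resolve(self, cluster: str) -> str:
        for prefix, value in self.rules:
            if cluster.startswith(prefix):
                return value
        return self.default


def random_jmx_port(low: int = 20000, high: int = 29999) -> int:
    """Dynamic value: a random port in a hypothetical range."""
    return random.randint(low, high)


EnvValue = Union[str, ByCluster, Callable[[], object]]


def materialize(env: dict[str, EnvValue], cluster: str) -> dict[str, str]:
    """Turn static, conditional, and generated entries into plain strings."""
    resolved: dict[str, str] = {}
    for key, value in env.items():
        if isinstance(value, ByCluster):
            resolved[key] = value.resolve(cluster)
        elif callable(value):
            resolved[key] = str(value())
        else:
            resolved[key] = value
    return resolved


if __name__ == "__main__":
    env: dict[str, EnvValue] = {
        "LOG_LEVEL": "info",
        "APP_PROFILE": ByCluster(rules=(("aws-", "aws"), ("idc-", "idc")), default="local"),
        "JMX_PORT": random_jmx_port,
    }
    print(materialize(env, "aws-prod-1"))
    print(materialize(env, "idc-prod-1"))
```

Keeping the conditional rule as data, separate from the generator functions, means the cluster-specific branches can be unit-tested without running any injected script.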