토스 / ai-agent

3 posts

toss

Software 3.0 시대, Harness를 통한 조직 생산성 저점 높이기 (opens in new tab)

당신의 팀은 같은 LLM을 쓰고 있나요? 현재 많은 개발팀이 LLM을 도입하고 있지만, 냉정하게 들여다보면 그것은 '각자도생'에 가깝습니다. 같은 모델, 같은 IDE를 쓰는데도 결과물의 차이는 극심합니다. 어떤 엔지니어는 '컨텍스트 엔지니어링(Context Engineering)'에 대한 높은 이해도로 LLM에게 정확한 역할을 부여해 10분 만에 복잡한 리팩토링을 끝냅니다. 반면, 어떤 엔지니어는 단순한 질문과 답변을 반복하며 할루시네이션과 씨름하느라 1시간을 허비하죠. 예를 들어, 같은 레포지…

toss

소프트웨어 3.0 시대를 맞이하며 (opens in new tab)

The tech industry is shifting from Software 1.0 (explicit logic) and 2.0 (neural networks) into Software 3.0, where natural language prompts and autonomous agents act as the primary programming interface. While Large Language Models (LLMs) are the engines of this era, they require a "Harness"—a structured environment of tools and protocols—to perform real-world tasks effectively. This evolution does not render traditional engineering obsolete; instead, it demonstrates that robust architectural principles like layered design and separation of powers are essential for building reliable AI agents. ### The Evolution of Software 3.0 * Software 1.0 is defined by explicit "How" logic written in languages like Python or Java, while Software 2.0 focuses on weights and data in neural networks. * Software 3.0, popularized by Andrej Karpathy, moves to "What" logic, where natural language prompts drive the execution. * The "Harness" concept is critical: just as a horse needs a harness to be useful to a human, an LLM needs tools (CLI, API access, file systems) to move from a chatbot to a functional agent like Claude Code. ### Mapping Agent Architecture to Traditional Layers * **Slash Commands as Controllers:** Tools like `/review` or `/refactor` act as entry points for user requests, similar to REST controllers in Spring or Express. * **Sub-agents as the Service Layer:** Sub-agents coordinate multiple skills and maintain independent context, mirroring how services orchestrate domain objects and repositories. * **Skills as Domain Components:** Following the Single Responsibility Principle (SRP), individual skills should handle one clear task (e.g., "generating tests") to prevent logic bloat. * **MCP as Infrastructure/Adapters:** The Model Context Protocol (MCP) functions like the Repository or Adapter pattern, abstracting external systems like databases and APIs from the core logic. * **CLAUDE.md as Configuration:** Project-specific rules and tech stacks are stored in metadata files, acting as the `package.json` or `pom.xml` of the agent environment. ### From Exceptions to Questions * Traditional 1.0 software must have every branch of logic predefined; if an unknown state is reached, the system throws an exception or fails. * Software 3.0 introduces Human-in-the-Loop (HITL), where "Exceptions" become "Questions," allowing the agent to ask for clarification on high-risk or ambiguous tasks. * Effective agent design requires identifying when to act autonomously (reversible, low-risk tasks) versus when to delegate decisions to a human (deployments, deletions, or high-cost API calls). ### Managing Constraints: Tokens and Complexity * In Software 3.0, tokens represent the "memory" (RAM) of the system; large codebases can lead to "token explosion," causing context overflow or high costs. * Deterministic logic should be moved to external scripts rather than being interpreted by the LLM every time to save tokens and ensure consistency. * To avoid "Skill Explosion" (similar to Class Explosion), developers should use "Progressive Disclosure," providing the agent with a high-level entry point and only loading detailed task knowledge when specifically required. Traditional software engineering expertise—specifically in cohesion, coupling, and abstraction—is the most valuable asset when transitioning to Software 3.0. By treating prompt engineering and agent orchestration with the same architectural rigor as 1.0 code, developers can build agents that are scalable, maintainable, and truly useful.

toss

세금 환급 자동화 : AI-driven UI 테스트 자동화 일지 (opens in new tab)

At Toss Income, QA Manager Suho Jung successfully automated complex E2E testing for diverse tax refund services by leveraging AI as specialized virtual team members. By shifting from manual coding to a "human-as-orchestrator" model, a single person achieved the productivity of a four-to-five-person automation team within just five months. This approach overcame the inherent brittleness of testing long, React-based flows that are subject to frequent policy changes and external system dependencies. ### Challenges in Tax Service Automation The complexity of tax refund services presented unique hurdles that made traditional manual automation unsustainable: * **Multi-Step Dependencies:** Each refund flow averages 15–20 steps involving internal systems, authentication providers, and HomeTax scraping servers, where a single timing glitch can fail the entire test. * **Frequent UI and Policy Shifts:** Minor UI updates or new tax laws required total scenario reconfigurations, making hard-coded tests obsolete almost immediately. * **Environmental Instability:** Issues such as "Target closed" errors during scraping, differing domain environments, and React-specific hydration delays caused constant test flakiness. ### Building an AI-Driven QA Team Rather than using AI as a simple autocomplete tool, the project assigned specific "personas" to different AI models to handle distinct parts of the lifecycle: * **SDET Agent (Claude Sonnet 4.5):** Acted as the lead developer, responsible for designing the Page Object Model (POM) architecture, writing test logic, and creating utility functions. * **Documentation Specialist:** Automatically generated daily retrospectives and updated technical guides by analyzing daily git commits. * **Git Master:** Managed commit history and PR descriptions to ensure high-quality documentation of the project’s evolution. * **Pair Programmers (Cursor & Codex):** Handled real-time troubleshooting, type errors, and comparative analysis of different test scripts. ### Technical Solutions for React and Policy Logic The team implemented several sophisticated technical strategies to ensure test stability: * **React Interaction Readiness:** To solve "Element is not clickable" errors, they developed a strategy that waits not just for visibility, but for event handlers to bind to the DOM (Hydration). * **Safe Interaction Fallbacks:** A standard `click` utility was created that attempts a Playwright click, then a native keyboard 'Enter' press, and finally a JS dispatch to ensure interactions succeed even during UI transitions. * **Dynamic Consent Flow Utility:** A specialized system was built to automatically detect and handle varying "Terms of Service" agreements across different sub-services (Tax Secretary, Hidden Refund, etc.) through a single unified function. * **Test Isolation:** Automated scripts were used to prevent `userNo` (test ID) collisions, ensuring 35+ complex scenarios could run in parallel without data interference. ### Integrated Feedback and Reporting The automation was integrated directly into internal communication channels to create a tight feedback loop: * **Messenger Notifications:** Every test run sends a report including execution time, test IDs, and environment data to the team's messenger. * **Automated Failure Analysis:** When a test fails, the AI automatically posts the error log, the specific failed step, a tracking EventID, and a screenshot as a thread reply for immediate debugging. * **Human-AI Collaboration:** This structure shifted the QA's role from writing code to discussing failures and policy changes within the messenger threads. The success of this 5-month experiment suggests that for high-complexity environments, the future of QA lies in "AI Orchestration." Instead of focusing on writing selectors, QA engineers should focus on defining problems and managing the AI agents that build the architecture.