*This post is based on work carried out in a research and development network. Hello, I'm 표상영, a Security Researcher at Toss. In the previous post, I briefly introduced the problems we ran into while automating service vulnerability analysis with LLMs, along with our solutions. Three months have already passed since that post, and in just a few months AI's vulnerability analysis capability has reached a remarkably high level. With technology advancing this quickly, my own attitude toward and thinking about AI have changed a great deal as well. In this post…
Toss has developed a high-precision automated vulnerability analysis system by integrating large language models (LLMs) with traditional security testing tools. By evolving its architecture from a simple prompt-based approach to a multi-agent system built on open-source models and static analysis, the team achieved over 95% accuracy in threat detection. The project demonstrates that moving beyond a technical proof of concept requires solving real-world constraints such as context window limits, output consistency, and long-term financial sustainability.
### Navigating Large Codebases with MCP
* Initial attempts to use RAG (Retrieval Augmented Generation) and repository compression tools failed because the LLM could not maintain complex code relationships within token limits.
* The team implemented a "SourceCode Browse MCP" (Model Context Protocol) which allows the LLM agent to dynamically query the codebase.
* By indexing the code, the agent can perform specific tool calls to find function definitions or variable usages only when necessary, effectively bypassing context window restrictions.
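The browse-style workflow can be sketched as a minimal tool interface the agent would call on demand. This is a simplified illustration only: the class and method names below are hypothetical, and the actual SourceCode Browse MCP implementation is not described in the post.

```python
import re
from pathlib import Path


class CodeIndex:
    """Tiny stand-in for a SourceCode-Browse-style MCP server: instead of
    receiving the whole repository in its context window, the agent asks
    targeted questions and gets back only the matching locations."""

    def __init__(self, root: str):
        # Index every Python file under the repository root.
        self.files = {p: p.read_text() for p in Path(root).rglob("*.py")}

    def find_definition(self, name: str) -> list[str]:
        """Return 'file:line' locations where a function or class is defined."""
        pattern = re.compile(rf"^\s*(def|class)\s+{re.escape(name)}\b")
        hits = []
        for path, text in self.files.items():
            for lineno, line in enumerate(text.splitlines(), 1):
                if pattern.match(line):
                    hits.append(f"{path}:{lineno}")
        return hits

    def find_usages(self, name: str) -> list[str]:
        """Return 'file:line' locations where a symbol is referenced."""
        hits = []
        for path, text in self.files.items():
            for lineno, line in enumerate(text.splitlines(), 1):
                if name in line and not line.lstrip().startswith(("def ", "class ")):
                    hits.append(f"{path}:{lineno}")
        return hits
```

Exposed as MCP tools, calls like `find_definition` let the agent pull in only the few dozen lines it actually needs for the current reasoning step, which is what sidesteps the token limit.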
### Ensuring Consistency via SAST Integration
* Testing revealed that standalone LLMs produced inconsistent results, often missing known vulnerabilities or generating hallucinations across different runs.
* To solve this, the team integrated Semgrep, a Static Application Security Testing (SAST) tool, to identify all potential "Source-to-Sink" paths.
* Semgrep was chosen over CodeQL for its lighter resource footprint and faster execution; its findings act as a structured roadmap that ensures the LLM analyzes every suspicious input path without omission.
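One way this "roadmap" role can work is to flatten the SAST report into an explicit checklist the LLM must work through. The sketch below parses a Semgrep-style JSON report (the field names follow Semgrep's `--json` output; the grouping and ordering logic is my own assumption, not the team's actual pipeline):

```python
import json


def extract_findings(semgrep_json: str) -> list[dict]:
    """Flatten a Semgrep JSON report into a checklist of locations the LLM
    must analyze, so no suspicious input path is silently skipped."""
    report = json.loads(semgrep_json)
    findings = []
    for result in report.get("results", []):
        findings.append({
            "rule": result["check_id"],
            "file": result["path"],
            "line": result["start"]["line"],
            "message": result["extra"]["message"],
        })
    # Deterministic ordering makes run-to-run coverage comparable,
    # which is the consistency property the standalone LLM lacked.
    return sorted(findings, key=lambda f: (f["file"], f["line"]))
```

Because the checklist is produced deterministically by the SAST tool, two analysis runs over the same commit start from the same set of candidate paths, regardless of LLM variance.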
### Optimizing Costs with Multi-Agent Architectures
* Analyzing every possible code path identified by SAST tools was prohibitively expensive due to high token consumption.
* The workflow was divided among three specialized agents: a Discovery Agent to filter out irrelevant paths, an Analysis Agent to perform deep logic checks, and a Verification Agent to confirm findings.
* This "sieve" strategy ensured that the most resource-intensive analysis was only performed on high-probability vulnerabilities, significantly reducing operational costs.
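The sieve can be expressed as a short pipeline in which each stage only sees what the previous stage passed. In this sketch the three agents are stand-in callables (in practice each would wrap an LLM call); the names and threshold are illustrative assumptions:

```python
from dataclasses import dataclass


@dataclass
class Finding:
    """A candidate source-to-sink path surfaced by the SAST stage."""
    path: str


def run_pipeline(findings, discovery, analysis, verification, threshold=0.5):
    """Sieve strategy: cheap filtering first, expensive reasoning last.

    - discovery(f) -> float: quick relevance score (cheap, runs on everything)
    - analysis(f) -> bool: deep logic check (expensive, runs on candidates)
    - verification(f) -> bool: final confirmation (runs on suspects only)
    """
    candidates = [f for f in findings if discovery(f) >= threshold]
    suspects = [f for f in candidates if analysis(f)]
    return [f for f in suspects if verification(f)]
```

Token spend concentrates in `analysis` and `verification`, so discarding most paths at the cheap `discovery` stage is what drives the cost reduction.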
### Transitioning to Open Models for Sustainability
* Scaling the system to hundreds of services and daily PRs made proprietary cloud models financially unviable.
* After benchmarking models like Llama 3.1 and GPT-OSS, the team selected **Qwen3:30B** for its 100% coverage rate and high true-positive accuracy in vulnerability detection.
* To close the performance gap between open-source and proprietary models, the team relied on advanced prompt engineering, one-shot examples in the prompt, and enforced structured JSON outputs to improve reliability.
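Enforcing structured output typically means validating every response against an expected schema and re-prompting on failure. The sketch below shows that pattern; the schema keys (`vulnerable`, `cwe`, `reasoning`) and the `call_model` wrapper are hypothetical, not the team's actual contract:

```python
import json

# Assumed response schema for a single verdict (illustrative only).
REQUIRED_KEYS = {"vulnerable", "cwe", "reasoning"}


def parse_verdict(raw: str) -> dict:
    """Validate the model's response against the expected JSON shape;
    reject anything malformed instead of trusting free-form text."""
    verdict = json.loads(raw)  # raises a ValueError subclass on non-JSON
    if not isinstance(verdict, dict):
        raise ValueError("response must be a JSON object")
    missing = REQUIRED_KEYS - verdict.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if not isinstance(verdict["vulnerable"], bool):
        raise ValueError("'vulnerable' must be a boolean")
    return verdict


def query_with_retry(call_model, prompt: str, retries: int = 3) -> dict:
    """Re-prompt the model until it emits valid JSON, or give up."""
    last_error = None
    for _ in range(retries):
        try:
            return parse_verdict(call_model(prompt))
        except ValueError as e:
            last_error = e
    raise RuntimeError(f"no valid JSON after {retries} attempts: {last_error}")
```

A retry loop like this is one way a smaller open model's occasional formatting slips can be absorbed without lowering the reliability of the overall pipeline.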
To build a production-ready AI security tool, teams should focus on the synergy between specialized open-source models and traditional static analysis tools. This hybrid approach provides a cost-effective and sustainable way to achieve enterprise-grade accuracy while maintaining full control over the analysis infrastructure.