toss

Automating Service Vulnerability Analysis Using

Toss has developed a high-precision automated vulnerability analysis system by integrating Large Language Models (LLMs) with traditional security testing tools. By evolving their architecture from a simple prompt-based approach to a multi-agent system utilizing open-source models and static analysis, the team achieved over 95% accuracy in threat detection. This project demonstrates that moving beyond a technical proof-of-concept requires solving real-world constraints such as context window limits, output consistency, and long-term financial sustainability.

Navigating Large Codebases with MCP

  • Initial attempts to use RAG (Retrieval-Augmented Generation) and repository compression tools failed because the LLM could not keep track of complex relationships across the code within token limits.
  • The team implemented a "SourceCode Browse MCP" (Model Context Protocol) server, which lets the LLM agent query the codebase dynamically.
  • Because the code is indexed, the agent can issue targeted tool calls to look up function definitions or variable usages only when it needs them, so the prompt stays within the context window instead of holding the whole repository.
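The indexing idea above can be sketched in a few lines. This is a minimal illustration, not Toss's implementation: it indexes Python sources with the standard `ast` module and exposes two lookups an agent could call as tools. The names `CodeIndex`, `find_definition`, `find_usages`, and `handle_tool_call`, the tool-call dictionary shape, and the sample files are all hypothetical, and no actual MCP server or SDK is involved.

```python
import ast
from dataclasses import asdict, dataclass

# Tiny stand-in repository (hypothetical file names and contents).
FILES = {
    "app/handlers.py": (
        "from app.db import run_query\n"
        "\n"
        "def get_user(request):\n"
        "    user_id = request.args['id']\n"
        "    return run_query(user_id)\n"
    ),
    "app/db.py": (
        "def run_query(user_id):\n"
        "    return f\"SELECT * FROM users WHERE id = {user_id}\"\n"
    ),
}


@dataclass
class Definition:
    name: str
    file: str
    start_line: int
    end_line: int
    source: str


class CodeIndex:
    """Parses every file once and answers narrow questions about it."""

    def __init__(self, files: dict[str, str]):
        self.trees = {path: ast.parse(text) for path, text in files.items()}
        self.definitions: dict[str, list[Definition]] = {}
        for path, tree in self.trees.items():
            lines = files[path].splitlines()
            for node in ast.walk(tree):
                if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    body = "\n".join(lines[node.lineno - 1 : node.end_lineno])
                    self.definitions.setdefault(node.name, []).append(
                        Definition(node.name, path, node.lineno, node.end_lineno, body)
                    )

    def find_definition(self, name: str) -> list[Definition]:
        """Return only the source of the named function, not the whole file."""
        return self.definitions.get(name, [])

    def find_usages(self, name: str) -> list[tuple[str, int]]:
        """Return (file, line) pairs where the bare name is referenced."""
        hits = []
        for path, tree in self.trees.items():
            for node in ast.walk(tree):
                if isinstance(node, ast.Name) and node.id == name:
                    hits.append((path, node.lineno))
        return sorted(hits)


def handle_tool_call(index: CodeIndex, call: dict) -> list[dict]:
    """Dispatch a tool request (assumed shape: {"tool": ..., "name": ...})."""
    if call["tool"] == "find_definition":
        return [asdict(d) for d in index.find_definition(call["name"])]
    if call["tool"] == "find_usages":
        return [{"file": f, "line": n} for f, n in index.find_usages(call["name"])]
    raise ValueError(f"unknown tool: {call['tool']}")


if __name__ == "__main__":
    index = CodeIndex(FILES)
    print(handle_tool_call(index, {"tool": "find_usages", "name": "run_query"}))
    print(handle_tool_call(index, {"tool": "find_definition", "name": "run_query"}))
```

Each call returns a few lines of code rather than a whole file, which is what lets the agent follow a call chain across a large repository without exhausting its token budget.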

Ensuring Consistency via SAST Integration

  • Testing revealed that a standalone LLM produced inconsistent results: from one run to the next it could miss known vulnerabilities or hallucinate findings.
  • To solve this, the team integrated Semgrep, a Static Application Security Testing (SAST) tool, to identify all potential "Source-to-Sink" paths.
  • Semgrep was chosen over CodeQL for its lighter resource footprint and faster execution. Its output serves as a structured roadmap, ensuring that the LLM analyzes every suspicious input path without omission.
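One way to picture the "roadmap" role is a small loader that turns SAST findings into a fixed, de-duplicated work list and reports anything the LLM has not yet reviewed. The sketch below is illustrative only: the sample payload is invented, the field names (`results`, `check_id`, `path`, `start.line`, `extra.message`) are assumed to follow the shape of Semgrep's `--json` output and should be verified against the Semgrep version in use, and `Candidate`, `load_candidates`, and `unreviewed` are hypothetical names.

```python
import json
from dataclasses import dataclass

# Invented sample in the assumed shape of `semgrep --json` output.
SAMPLE_SEMGREP_JSON = """
{
  "results": [
    {"check_id": "rules.sql-injection", "path": "app/db.py",
     "start": {"line": 12}, "end": {"line": 12},
     "extra": {"message": "User input reaches a SQL query"}},
    {"check_id": "rules.open-redirect", "path": "app/auth.py",
     "start": {"line": 40}, "end": {"line": 41},
     "extra": {"message": "User input reaches a redirect"}},
    {"check_id": "rules.sql-injection", "path": "app/db.py",
     "start": {"line": 12}, "end": {"line": 12},
     "extra": {"message": "User input reaches a SQL query"}}
  ]
}
"""


@dataclass(frozen=True)
class Candidate:
    rule: str
    path: str
    line: int
    message: str


def load_candidates(raw: str) -> list[Candidate]:
    """Parse findings into a de-duplicated list with a deterministic order."""
    data = json.loads(raw)
    unique = set()
    for result in data.get("results", []):
        unique.add(
            Candidate(
                rule=result["check_id"],
                path=result["path"],
                line=result["start"]["line"],
                message=result.get("extra", {}).get("message", ""),
            )
        )
    return sorted(unique, key=lambda c: (c.path, c.line, c.rule))


def unreviewed(candidates: list[Candidate], reviewed: set[Candidate]) -> list[Candidate]:
    """Coverage check: every candidate must be reviewed before a run is done."""
    return [c for c in candidates if c not in reviewed]


if __name__ == "__main__":
    todo = load_candidates(SAMPLE_SEMGREP_JSON)
    for candidate in todo:
        print(candidate)
    # Pretend the LLM has only looked at the first candidate so far.
    print("still to review:", unreviewed(todo, {todo[0]}))
```

Because the work list comes from the static analyzer rather than from the model, two runs over the same commit start from the same set of paths, and a skipped path shows up as a non-empty `unreviewed` result instead of passing silently.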

Optimizing Costs with Multi-Agent Architectures

  • Analyzing every possible code path identified by SAST tools was prohibitively expensive due to high token consumption.
  • The workflow was divided among three specialized agents: a Discovery Agent to filter out irrelevant paths, an Analysis Agent to perform deep logic checks, and a Verification Agent to confirm findings.
  • This "sieve" strategy ensures that the most resource-intensive analysis runs only on paths with a high probability of being real vulnerabilities, which significantly reduces operational costs.
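The sieve can be sketched as three stages applied in order, each seeing only what the previous one let through. This is a toy illustration under loud assumptions: the three "agents" here are simple string heuristics standing in for LLM calls, and `CodePath`, `Stats`, `run_pipeline`, and the sample paths are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CodePath:
    source: str   # where the data comes from
    sink: str     # where it ends up
    snippet: str  # code at the sink


@dataclass
class Stats:
    calls: dict[str, int] = field(default_factory=dict)

    def record(self, stage: str) -> None:
        self.calls[stage] = self.calls.get(stage, 0) + 1


# Each function stands in for an LLM-backed agent (assumption: real agents
# would call a model; these heuristics only make the control flow runnable).
def discovery_agent(path: CodePath) -> bool:
    """Cheap filter: drop paths that are obviously sanitized."""
    return "sanitize(" not in path.snippet


def analysis_agent(path: CodePath) -> bool:
    """Expensive step: look for string-formatted queries passed to execute()."""
    return "execute(" in path.snippet and "%" in path.snippet


def verification_agent(path: CodePath) -> bool:
    """Final check: confirm the source is actually attacker-controlled."""
    return path.source.startswith("request.")


def run_pipeline(paths: list[CodePath], stats: Stats) -> list[CodePath]:
    findings = []
    for path in paths:
        stats.record("discovery")
        if not discovery_agent(path):
            continue
        stats.record("analysis")
        if not analysis_agent(path):
            continue
        stats.record("verification")
        if verification_agent(path):
            findings.append(path)
    return findings


PATHS = [
    CodePath("request.args", "cursor.execute", 'cursor.execute("SELECT %s" % user_id)'),
    CodePath("request.args", "cursor.execute", 'cursor.execute("SELECT %s" % sanitize(user_id))'),
    CodePath("request.form", "logger.info", "logger.info(name)"),
    CodePath("config.DEFAULT_ID", "cursor.execute", 'cursor.execute("SELECT %s" % default_id)'),
]

if __name__ == "__main__":
    stats = Stats()
    confirmed = run_pipeline(PATHS, stats)
    print(stats.calls)
    print([p.source for p in confirmed])
```

In this sample, four paths enter discovery, three reach analysis, two reach verification, and one is reported. The cost saving in a real system comes from the same shape: the cheapest check runs on everything, and the most expensive one runs on the smallest set.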

Transitioning to Open Models for Sustainability

  • Scaling the system to hundreds of services and daily PRs made proprietary cloud models financially unviable.
  • After benchmarking models like Llama 3.1 and GPT-OSS, the team selected Qwen3:30B for its 100% coverage rate and high true-positive accuracy in vulnerability detection.
  • To bridge the performance gap between open-source and proprietary models, the team used advanced prompt engineering and one-shot learning, and enforced structured JSON output to improve reliability.
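A one-shot prompt combined with strict output validation might look like the following. This is a sketch under stated assumptions, not the team's prompts or schema: the field names in `REQUIRED_FIELDS`, the example in `ONE_SHOT_EXAMPLE`, and the retry policy are hypothetical, and the model is any callable that maps a prompt string to a reply string (a scripted fake is used here so the block runs without a model server).

```python
import json
from typing import Callable

# Hypothetical output schema: field name -> required Python type.
REQUIRED_FIELDS = {"vulnerable": bool, "category": str, "reason": str}

# A single worked example shown to the model (one-shot).
ONE_SHOT_EXAMPLE = (
    "Code:\n"
    'cursor.execute("SELECT * FROM users WHERE id = %s" % user_id)\n'
    "Answer:\n"
    '{"vulnerable": true, "category": "sql_injection", '
    '"reason": "user_id is formatted into the query string"}'
)


def build_prompt(snippet: str) -> str:
    return (
        "Decide whether the code is vulnerable. "
        "Reply with a single JSON object with keys "
        "vulnerable (boolean), category (string), reason (string).\n\n"
        f"{ONE_SHOT_EXAMPLE}\n\n"
        f"Code:\n{snippet}\nAnswer:\n"
    )


def parse_verdict(raw: str) -> dict:
    """Reject anything that is not a JSON object with the expected fields."""
    data = json.loads(raw)  # raises json.JSONDecodeError, a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("reply is not a JSON object")
    for key, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(key), expected_type):
            raise ValueError(f"missing or mistyped field: {key}")
    return data


def analyze(snippet: str, model: Callable[[str], str], max_attempts: int = 3) -> dict:
    """Ask the model, retrying when the reply fails validation."""
    prompt = build_prompt(snippet)
    for _ in range(max_attempts):
        try:
            return parse_verdict(model(prompt))
        except ValueError:
            continue
    raise RuntimeError("model never produced a valid verdict")


def make_flaky_model() -> Callable[[str], str]:
    """Fake model: chatty free text first, valid JSON on the second try."""
    replies = iter([
        "Sure! Here is my analysis of the code...",
        '{"vulnerable": true, "category": "command_injection", '
        '"reason": "filename is passed to a shell"}',
    ])
    return lambda prompt: next(replies)


if __name__ == "__main__":
    verdict = analyze('os.system("cat " + filename)', make_flaky_model())
    print(verdict["category"], verdict["vulnerable"])
```

Validating every reply against a fixed schema means a malformed answer triggers a retry instead of flowing into downstream agents, which matters more for smaller open models whose output format is less stable.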

To build a production-ready AI security tool, teams should focus on the synergy between specialized open-source models and traditional static analysis tools. This hybrid approach provides a cost-effective and sustainable way to achieve enterprise-grade accuracy while maintaining full control over the analysis infrastructure.