devops


toss

How I Tore Down Our Legacy

Toss Payments modernized its inherited legacy infrastructure by building an OpenStack-based private cloud that operates alongside public cloud providers in an Active-Active hybrid configuration. By overcoming extreme technical debt, including servers burdened with nearly 2,000 manual routing entries, the team achieved a cloud-agnostic deployment environment that ensures high availability and cost efficiency. The transformation demonstrates how a small team can successfully implement complex open-source infrastructure through automation and the rigorous technical internalization of Cluster API and OpenStack.

### The Challenge of Legacy Networking

- The inherited infrastructure relied on server-side routing rather than network equipment, meaning every server carried its own routing table.
- Some legacy servers contained 1,997 individual routing entries, making manual management nearly impossible and preventing efficient scaling.
- Initial attempts to solve this via public cloud (AWS) faced limitations, including rising costs due to exchange rates, a lack of deep visibility for troubleshooting, and difficulties in disaster recovery (DR) configuration between public and on-premise environments.

### Scaling OpenStack with a Two-Person Team

- Despite having only two engineers with no prior OpenStack experience, the team chose the open-source platform to maintain 100% control over the infrastructure.
- The team internalized the technology by installing three different versions of OpenStack dozens of times and simulating various failure scenarios.
- Automation was prioritized using Ansible and Terraform to manage the lifecycle of VMs and load balancers, enabling new instance creation in under 10 seconds (a Python sketch of this kind of VM-lifecycle automation follows this summary).
- Deep technical tuning was applied, such as modifying the source code of the Octavia load balancer to output custom log formats required for their specific monitoring needs.

### High Availability and Monitoring Strategy

- To ensure reliability, the team built three independent OpenStack clusters operating in an Active-Active configuration.
- This architecture allows for immediate traffic redirection if a specific cluster fails, minimizing the impact on service availability.
- A comprehensive monitoring stack was implemented using Zabbix, Prometheus, Mimir, and Grafana to collect and visualize every essential metric across the private cloud.

### Managing Kubernetes with Cluster API

- To replicate the convenience of public cloud PaaS offerings (such as EKS), the team implemented Cluster API to manage the Kubernetes lifecycle.
- Cluster API treats Kubernetes clusters themselves as resources within a management cluster, allowing for standardized and rapid deployment across the private environment (see the second sketch below).
- This approach ensures that developers can deploy applications without needing to distinguish between the underlying cloud providers, fulfilling the goal of "cloud-agnostic" infrastructure.

### Practical Recommendation

For organizations dealing with massive technical debt or high public cloud costs, the Toss Payments model suggests that a "private-first" hybrid approach is viable even with limited headcount. The key is to avoid proprietary black-box solutions and instead invest in the technical internalization of open-source tools like OpenStack and Cluster API, backed by an infrastructure-as-code philosophy to ensure scalability and reliability.
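The post describes driving VM and load-balancer lifecycles with Ansible and Terraform; as a language-neutral illustration of the same idea, here is a minimal Python sketch using the openstacksdk library. The cloud name, image, flavor, and network below are placeholders, not Toss's actual configuration.

```python
# Minimal sketch of VM-lifecycle automation against an OpenStack private cloud.
# Assumes openstacksdk is installed and clouds.yaml defines a cloud named
# "private-cloud"; image/flavor/network names are illustrative placeholders.
import openstack


def create_app_server(name: str) -> None:
    # Credentials and endpoints come from clouds.yaml / environment variables.
    conn = openstack.connect(cloud="private-cloud")

    # Create the instance and block until it reaches ACTIVE.
    server = conn.create_server(
        name=name,
        image="ubuntu-22.04",
        flavor="m1.medium",
        network="service-net",
        wait=True,
        timeout=120,
    )
    print(f"created {server.name} ({server.id}), status={server.status}")


def delete_app_server(name: str) -> None:
    conn = openstack.connect(cloud="private-cloud")
    # Delete by name or ID and wait for the resource to disappear.
    conn.delete_server(name, wait=True, timeout=120)


if __name__ == "__main__":
    create_app_server("payments-api-01")
    delete_app_server("payments-api-01")
```

Wrapping calls like these in idempotent playbooks or Terraform modules is what lets a two-person team hand out new instances in seconds without touching routing tables by hand.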

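The Cluster API point is easiest to see in code: a workload cluster is just another declarative resource applied to the management cluster. Below is a minimal sketch, assuming the `kubernetes` Python client and an OpenStack infrastructure provider; the names and API versions are illustrative, not Toss's actual manifests.

```python
# Minimal sketch: registering a new workload cluster as a Cluster API resource
# in the management cluster. Assumes the `kubernetes` Python client, a
# kubeconfig pointing at the management cluster, and an already-defined
# OpenStackCluster object; names and API versions are illustrative only.
from kubernetes import client, config

# Declarative description of the workload cluster (normally applied as YAML).
cluster_manifest = {
    "apiVersion": "cluster.x-k8s.io/v1beta1",
    "kind": "Cluster",
    "metadata": {"name": "payments-cluster-01", "namespace": "default"},
    "spec": {
        "clusterNetwork": {"pods": {"cidrBlocks": ["192.168.0.0/16"]}},
        # Points at the OpenStack infrastructure provider object for this cluster.
        "infrastructureRef": {
            "apiVersion": "infrastructure.cluster.x-k8s.io/v1alpha7",
            "kind": "OpenStackCluster",
            "name": "payments-cluster-01",
        },
    },
}


def main() -> None:
    config.load_kube_config()  # credentials for the management cluster
    api = client.CustomObjectsApi()
    # Creating the custom resource is all it takes; the Cluster API controllers
    # reconcile it into real VMs, networks, and a running Kubernetes cluster.
    api.create_namespaced_custom_object(
        group="cluster.x-k8s.io",
        version="v1beta1",
        namespace="default",
        plural="clusters",
        body=cluster_manifest,
    )


if __name__ == "__main__":
    main()
```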
naver

Naver Integrated Search LLM DevOps

Naver’s Integrated Search team is transitioning from manual fault response to an automated system using LLM Agents to manage the increasing complexity of search infrastructure. By integrating Large Language Models into the DevOps pipeline, the system evolves through accumulated experience, moving beyond simple alert monitoring to intelligent diagnostic analysis and action recommendation.

### Limitations of Traditional Fault Response

* **Complex Search Flows:** Naver’s search architecture involves multiple interdependent layers, which makes manual root cause analysis slow and prone to human error.
* **Fragmented Context:** Existing monitoring requires developers to manually synthesize logs and metrics from disparate telemetry sources, leading to high cognitive load during outages.
* **Delayed Intervention:** Human-led responses often suffer from a "detection-to-action" lag, especially during high-traffic periods or subtle service regressions.

### Architecture of DevOps Agent v1

* **Initial Design:** Focused on automating basic data gathering and providing preliminary textual reports to engineers.
* **Infrastructure Integration:** Built using a specialized software stack designed to bridge frontend (FE) and backend (BE) telemetry within the search infrastructure.
* **Standardized Logic:** The v1 agent operated on a fixed set of instructions to perform predefined diagnostic tasks when triggered by specific system alarms.

### Evolution to DevOps Agent v2

* **Overcoming v1 Limitations:** The first iteration struggled with maintaining deep context and providing diverse actionable insights, necessitating a more robust agentic structure.
* **Enhanced Memory and Learning:** v2 incorporates a more sophisticated architecture that allows the agent to reference historical failure data and learn from past incident resolutions.
* **Advanced Tool Interaction:** The system was upgraded to handle more complex tool-calling capabilities, allowing the agent to interact more deeply with internal infrastructure APIs.

### System Operations and Evaluation

* **Trigger Queue Management:** Implements a queuing system to efficiently process and prioritize multiple concurrent system alerts without overwhelming the diagnostic pipeline.
* **Anomaly Detection:** Utilizes advanced detection methods to distinguish between routine traffic fluctuations and genuine service anomalies that require LLM intervention (see the sketch after this summary).
* **Rigorous Evaluation:** The agent’s performance is measured through a dedicated evaluation framework that assesses the accuracy of its diagnoses against known ground-truth incidents.

### Scaling and Future Challenges

* **Context Expansion:** Efforts are focused on integrating a wider range of metadata and environmental context to provide a holistic view of system health.
* **Action Recommendation:** The system is moving toward suggesting specific recovery actions, such as rollbacks or traffic rerouting, rather than just identifying the problem.
* **Sustainability:** Ensuring the DevOps Agent remains maintainable and cost-effective as the underlying search infrastructure and LLM models continue to evolve.

Organizations managing high-scale search traffic should consider LLM-based agents as integrated infrastructure components rather than standalone tools. Moving from reactive monitoring to a proactive, experience-based agent system is essential for reducing the mean time to recovery (MTTR) in complex distributed environments.
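The post does not disclose which detection method Naver uses, so the following is a generic, minimal sketch of the anomaly gate and trigger queue idea: keep a rolling baseline per metric and only enqueue an incident for the LLM agent when the latest value deviates strongly from that baseline. The thresholds and function names are hypothetical.

```python
# Hypothetical sketch of an "anomaly gate" in front of an LLM DevOps agent:
# keep a rolling baseline per metric and only enqueue an incident for the agent
# when the latest value deviates strongly from that baseline. Naver's actual
# detection method is not public; thresholds and names here are illustrative.
from collections import deque
from queue import Queue
from statistics import mean, stdev

WINDOW = 60          # samples kept per metric (e.g., one per scrape interval)
Z_THRESHOLD = 4.0    # how many standard deviations counts as "anomalous"

history: dict[str, deque[float]] = {}
trigger_queue: Queue[dict] = Queue()   # consumed by the diagnostic agent


def observe(metric: str, value: float) -> None:
    """Record a sample and enqueue an incident if it looks anomalous."""
    window = history.setdefault(metric, deque(maxlen=WINDOW))
    if len(window) >= 10:                      # need some baseline first
        mu, sigma = mean(window), stdev(window)
        if sigma > 0 and abs(value - mu) / sigma > Z_THRESHOLD:
            trigger_queue.put({"metric": metric, "value": value, "baseline": mu})
    window.append(value)


if __name__ == "__main__":
    for v in [120, 118, 121, 119, 122, 120, 118, 121, 119, 120, 450]:
        observe("search_latency_ms", v)
    while not trigger_queue.empty():
        print("incident for agent:", trigger_queue.get())
```

Gating on something like this keeps routine traffic fluctuations from reaching the LLM pipeline, so the queue only carries events that are worth the cost of a full diagnostic run.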

line

LY's Tech Conference, 'Tech-Verse 2025'

LY Corporation’s Tech-Verse 2025 conference highlighted the company's strategic pivot toward becoming an AI-centric organization through the "Catalyst One Platform" initiative. By integrating the disparate infrastructures of LINE and Yahoo! JAPAN into a unified private cloud, the company aims to achieve massive cost efficiencies while accelerating the deployment of AI agents across its entire service ecosystem. This transformation focuses on empowering engineers with AI-driven development tools to foster rapid innovation and deliver a seamless, "WOW" experience for global users.

### Infrastructure Integration and the Catalyst One Platform

To address the redundancies following the merger of LINE and Yahoo! JAPAN, LY Corporation is consolidating its technical foundations into a single internal ecosystem known as the Catalyst One Platform.

* **Private Cloud Advantage:** The company maintains its own private cloud to achieve a four-fold cost reduction compared to public cloud alternatives, managed by a lean team of 700 people supporting 500,000 servers.
* **Unified Architecture:** The integration spans several layers, including Infrastructure (Project "DC-Hub"), Cloud (Project "Flava"), and specialized Data and AI platforms.
* **Next-Generation Cloud "Flava":** This platform integrates existing services to enhance VM specifications, VPC networking, and high-performance object storage (Ceph and Dragon).
* **Information Security:** A dedicated "SafeOps" framework is being implemented to provide governance and security across all integrated services, ensuring a safer environment for user data.

### AI Strategy and Service Agentization

A core pillar of LY’s strategy is the "AI Agentization" of all its services, moving beyond simple features to proactive, personalized assistance.

* **Scaling GenAI:** Generative AI has already been integrated into 44 different services within the group.
* **Personalized Agents:** The company is developing the capacity to generate millions of specialized agents that can be linked together to support the unique needs of individual users.
* **Agent Ecosystem:** The goal is to move from a standard platform model to one where every user interaction is mediated by an intelligent agent.

### AI-Driven Development Transformation

Beyond user-facing services, LY is fundamentally changing how its engineers work by deploying internal AI development solutions to all staff starting in July.

* **Code and Test Automation:** Proof of Concept (PoC) results showed a 96% accuracy rate for "Code Assist" and a 97% reduction in time for "Auto Test" procedures.
* **RAG Integration:** The system utilizes Retrieval-Augmented Generation (RAG) to leverage internal company knowledge and guidelines, ensuring high-quality, context-aware development support (see the sketch after this summary).
* **Efficiency Gains:** By automating repetitive tasks, the company intends for engineers to shift their focus from maintenance to creative service improvement and innovation.

The successful integration of these platforms and the aggressive adoption of AI-driven development tools suggest that LY Corporation is positioning itself to be a leader in the "AI-agent" era. For technical organizations, LY's model serves as a case study in how large-scale mergers can leverage private cloud infrastructure to fund and accelerate a company-wide AI transition.
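LY's actual Code Assist stack is not public, but the RAG pattern it describes (retrieve internal guidelines, prepend them to the model prompt) can be sketched generically. In the minimal Python example below, `embed()` is a placeholder standing in for whatever embedding model the real system uses, and the guideline texts are invented.

```python
# Generic sketch of the RAG pattern described for internal "Code Assist":
# embed internal guideline snippets once, retrieve the most similar ones for a
# developer query, and prepend them to the LLM prompt. The embed() function is
# a placeholder; LY's actual models, stores, and prompts are not public.
import hashlib

import numpy as np


def embed(text: str) -> np.ndarray:
    """Placeholder embedding: replace with a real embedding-model call."""
    seed = int(hashlib.md5(text.encode()).hexdigest(), 16) % (2**32)
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)


GUIDELINES = [
    "All internal HTTP APIs must use the common auth middleware.",
    "Batch jobs should emit metrics to the shared Prometheus gateway.",
    "New services must define SLOs before production rollout.",
]
GUIDELINE_VECS = np.stack([embed(g) for g in GUIDELINES])


def build_prompt(question: str, top_k: int = 2) -> str:
    """Retrieve the top-k most similar guidelines and prepend them as context."""
    scores = GUIDELINE_VECS @ embed(question)   # cosine similarity (unit vectors)
    best = np.argsort(scores)[::-1][:top_k]
    context = "\n".join(f"- {GUIDELINES[i]}" for i in best)
    return f"Internal guidelines:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    print(build_prompt("How should my new service expose metrics?"))
```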