- **AWS Weekly Roundup: Claude Opus 4.7 in Amazon Bedrock, AWS Interconnect GA, and more** (April 20, 2026): Last week I had the honor of delivering a commencement speech at the University of Namur (uNamur) for their 2025 graduation ceremony. Standing in front of freshly minted compute…
- **GitLab 18.11: CI Expert and Data Analyst AI agents target development gaps** (April 16, 2026, 5 min read): Set up CI and query your software development lifecycle data with two new GitLab Duo Agent Platform foundational agents available in GitLab 18.11…
- **Vulnerability resolution generally available on GitLab Duo Agent Platform** (Ultimate tier; GitLab Self-Managed, GitLab.com, and GitLab Dedicated): Agentic SAST Vulnerability Resolution is now generally available in GitLab 18.11 on…
- **A guide to the breaking changes in GitLab 19.0** (April 15, 2026, 12 min read): GitLab 19.0 removes several deprecated features. Learn what's changing, which changes affect your deployment, and how to prepare before upgrading. GitLab 17.0 shipped with 80 bre…
- **5 ways GitLab pipeline logic solves real engineering problems** (April 9, 2026, 14 min read): Learn how to scale CI/CD with composable patterns for monorepos, microservices, environments, and governance. Most CI/CD tools can run a…
- **AWS Weekly Roundup: AWS DevOps Agent & Security Agent GA, Product Lifecycle updates, and more** (April 6, 2026): Last week, I visited the AWS Hong Kong User Group with my team. Hong Kong has a small but strong community, and their energy and passion are high. They recently started a ne…
- Hello, we are Minjae Park, Jeongho Son, and Changkwon Jeong, and we operate the data pipelines and data platform for the advertising systems of LINE services. The LINE advertising platform (LINE Ads) serves tens of billions of ads per day, and internally we collect and process close to a hundred billion data records. The LINE Ads data pipeline team collects, processes, stores, and forwards ad result data in real time to improve ad efficiency. While processing this data, event validity (abuse…
- **Our First 2026 Heroes Cohort Is Here!**: We're thrilled to celebrate three exceptional developer community leaders as AWS Heroes. These individuals represent the heart of what makes the AWS community so vibrant. In addition to sharing technical knowledge, they build connections, fo…
- **10 AI prompts to speed your team's software delivery** (March 4, 2026, 7 min read): Eliminate review backlogs, security delays, and coordination overhead with ready-to-use AI prompts covering every stage of the software lifecycle.
- **AI can detect vulnerabilities, but who governs risk?** (February 27, 2026, 4 min read): AI-assisted vulnerability detection is developing fast, but the harder challenges of enforcement, governance, and supply chain security require a holistic platform.
- **GitLab Duo Agent Platform with Claude accelerates development** (February 26, 2026, 5 min read): Learn how to leverage external AI models like Anthropic's Claude to automate everything from code generation to pipeline creation directly within GitLab.
- **New GitLab metrics and registry features help reduce CI/CD bottlenecks** (February 25, 2026, 5 min read): See how CI/CD Job Performance Metrics and Container Virtual Registry, currently in beta, help platform teams quickly spot slow jobs and simplify multi-registry conta…
- Hello, I'm Dongwon Lee from the LINE NEXT DevOps team. I handle Kubernetes-based infrastructure operations, CI/CD, monitoring, and incident response, and I have recently been learning about and experimenting with using AI to improve development productivity and automation. While testing various AI models and tools, I keep asking how AI can be integrated naturally into the team's entire development process. In this article, AI at LINE NEXT…
Toss Payments modernized its inherited legacy infrastructure by building an OpenStack-based private cloud to operate alongside public cloud providers in an Active-Active hybrid configuration. By overcoming extreme technical debt—including servers burdened with nearly 2,000 manual routing entries—the team achieved a cloud-agnostic deployment environment that ensures high availability and cost efficiency. The transformation demonstrates how a small team can successfully implement complex open-source infrastructure through automation and the rigorous technical internalization of Cluster API and OpenStack.
### The Challenge of Legacy Networking
- The inherited infrastructure relied on server-side routing rather than network equipment, meaning every server carried its own routing table.
- Some legacy servers contained 1,997 individual routing entries, making manual management nearly impossible and preventing efficient scaling.
- Initial attempts to solve this via public cloud (AWS) faced limitations, including rising costs due to exchange rates, lack of deep visibility for troubleshooting, and difficulties in disaster recovery (DR) configuration between public and on-premise environments.
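To make the scale of that routing debt concrete, auditing the per-server tables is usually the first step. Below is a minimal sketch that counts non-default entries in captured `ip route show` output; the sample output is invented for illustration, not taken from a Toss Payments host:

```python
# Invented `ip route show` output; a real legacy host in the article
# carried 1,997 such entries.
SAMPLE_OUTPUT = """\
default via 10.0.0.1 dev eth0
10.1.0.0/24 via 10.0.0.2 dev eth0
10.1.1.0/24 via 10.0.0.2 dev eth0
192.168.5.0/24 via 10.0.0.3 dev eth1
"""

def count_static_routes(ip_route_output: str) -> int:
    """Return the number of non-default routing entries."""
    lines = [line for line in ip_route_output.splitlines() if line.strip()]
    return sum(1 for line in lines if not line.startswith("default"))

print(count_static_routes(SAMPLE_OUTPUT))  # → 3
```

Run across a fleet, a script like this makes it obvious which hosts can never be managed by hand.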
### Scaling OpenStack with a Two-Person Team
- Despite having only two engineers with no prior OpenStack experience, the team chose the open-source platform to maintain 100% control over the infrastructure.
- The team internalized the technology by installing three different versions of OpenStack dozens of times and simulating various failure scenarios.
- Automation was prioritized using Ansible and Terraform to manage the lifecycle of VMs and load balancers, enabling new instance creation in under 10 seconds.
- Deep technical tuning was applied, such as modifying the source code of the Octavia load balancer to output custom log formats required for their specific monitoring needs.
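The Octavia change is worth sketching. Octavia's amphora load balancers emit haproxy-style access logs, and the team patched the source so the output matched what their monitoring expected. The snippet below only illustrates the idea of a custom log template; the field names are hypothetical, and the real patch lives in Octavia's haproxy configuration templates rather than application Python:

```python
# Hypothetical custom access-log template of the kind the team needed.
CUSTOM_LOG_FORMAT = (
    "{timestamp} lb={lb_id} frontend={frontend} backend={backend} "
    "status={status} bytes={bytes_out} rt_ms={response_ms}"
)

def render_access_log(record: dict) -> str:
    """Render one connection record using the custom template."""
    return CUSTOM_LOG_FORMAT.format(**record)

sample = {
    "timestamp": "2026-04-01T09:00:00Z",
    "lb_id": "lb-01",
    "frontend": "https-443",
    "backend": "web-pool",
    "status": 200,
    "bytes_out": 5120,
    "response_ms": 12,
}
print(render_access_log(sample))
```

The point of owning the source is exactly this: when the stock format lacks a field your dashboards need, you can add it instead of working around it.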
### High Availability and Monitoring Strategy
- To ensure reliability, the team built three independent OpenStack clusters operating in an Active-Active configuration.
- This architecture allows for immediate traffic redirection if a specific cluster fails, minimizing the impact on service availability.
- A comprehensive monitoring stack was implemented using Zabbix, Prometheus, Mimir, and Grafana to collect and visualize every essential metric across the private cloud.
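The Active-Active behavior can be sketched as a weight table derived from health checks. The cluster names and the even-split policy below are invented for illustration; the article does not specify the exact routing layer in front of the clusters:

```python
def healthy_clusters(health: dict) -> list:
    """Clusters currently eligible to receive traffic."""
    return sorted(name for name, ok in health.items() if ok)

def split_weights(clusters: list) -> dict:
    """Spread traffic evenly across the surviving clusters."""
    if not clusters:
        raise RuntimeError("no healthy OpenStack cluster available")
    share = 1.0 / len(clusters)
    return {name: share for name in clusters}

# All three clusters healthy: each takes a third of the traffic.
print(split_weights(healthy_clusters({"os-a": True, "os-b": True, "os-c": True})))
# One cluster fails its health check: its share is redistributed at once.
print(split_weights(healthy_clusters({"os-a": True, "os-b": False, "os-c": True})))
```

Because every cluster serves live traffic at all times, failover is just a weight change, not a cold standby promotion.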
### Managing Kubernetes with Cluster API
- To replicate the convenience of managed Kubernetes services in the public cloud (like EKS), the team implemented Cluster API to manage the Kubernetes lifecycle.
- Cluster API treats Kubernetes clusters themselves as resources within a management cluster, allowing for standardized and rapid deployment across the private environment.
- This approach ensures that developers can deploy applications without needing to distinguish between the underlying cloud providers, fulfilling the goal of "cloud-agnostic" infrastructure.
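Concretely, a Cluster API cluster starts from a declarative manifest applied to the management cluster. The sketch below assembles a minimal `Cluster` resource referencing an OpenStack infrastructure provider; the names are illustrative, and the provider's API version string varies by release:

```python
import json

def cluster_manifest(name: str, namespace: str = "default") -> dict:
    """Minimal Cluster API `Cluster` resource referencing an OpenStackCluster."""
    return {
        "apiVersion": "cluster.x-k8s.io/v1beta1",
        "kind": "Cluster",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "infrastructureRef": {
                # Release-dependent; check your provider's installed version.
                "apiVersion": "infrastructure.cluster.x-k8s.io/v1alpha7",
                "kind": "OpenStackCluster",
                "name": name,
            },
        },
    }

# Serialized as JSON here; `kubectl apply` accepts JSON as well as YAML.
print(json.dumps(cluster_manifest("payments-prod"), indent=2))
```

Swapping the `infrastructureRef` for another provider's kind is what makes the same workflow portable across clouds.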
### Practical Recommendation
For organizations dealing with massive technical debt or high public cloud costs, the Toss Payments model suggests that a "private-first" hybrid approach is viable even with limited headcount. The key is to avoid proprietary black-box solutions and instead invest in the technical internalization of open-source tools like OpenStack and Cluster API, backed by an infrastructure-as-code philosophy to ensure scalability and reliability.
Naver’s Integrated Search team is transitioning from manual fault response to an automated system using LLM Agents to manage the increasing complexity of search infrastructure. By integrating Large Language Models into the DevOps pipeline, the system evolves through accumulated experience, moving beyond simple alert monitoring to intelligent diagnostic analysis and action recommendation.
### Limitations of Traditional Fault Response
* **Complex Search Flows:** Naver’s search architecture involves multiple interdependent layers, which makes manual root cause analysis slow and prone to human error.
* **Fragmented Context:** Existing monitoring requires developers to manually synthesize logs and metrics from disparate telemetry sources, leading to high cognitive load during outages.
* **Delayed Intervention:** Human-led responses often suffer from a "detection-to-action" lag, especially during high-traffic periods or subtle service regressions.
### Architecture of DevOps Agent v1
* **Initial Design:** Focused on automating basic data gathering and providing preliminary textual reports to engineers.
* **Infrastructure Integration:** Built using a specialized software stack designed to bridge frontend (FE) and backend (BE) telemetry within the search infrastructure.
* **Standardized Logic:** The v1 agent operated on a fixed set of instructions to perform predefined diagnostic tasks when triggered by specific system alarms.
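In effect, v1 behaves like a dispatch table: alarm type in, predefined diagnostic routine out. A minimal sketch, with alarm types and routines invented for illustration:

```python
def collect_latency_metrics(alarm: dict) -> str:
    return f"collected latency metrics for {alarm['service']}"

def pull_error_logs(alarm: dict) -> str:
    return f"pulled recent error logs for {alarm['service']}"

# Fixed v1-style mapping: each alarm type triggers one predefined task.
DIAGNOSTICS = {
    "HIGH_LATENCY": collect_latency_metrics,
    "ERROR_SPIKE": pull_error_logs,
}

def handle_alarm(alarm: dict) -> str:
    task = DIAGNOSTICS.get(alarm["type"])
    if task is None:
        return "no predefined diagnostic; escalate to on-call"
    return task(alarm)

print(handle_alarm({"type": "HIGH_LATENCY", "service": "search-fe"}))
```

The rigidity is visible in the structure itself: any alarm outside the table falls straight back to a human, which is the gap v2 set out to close.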
### Evolution to DevOps Agent v2
* **Overcoming v1 Limitations:** The first iteration struggled to maintain deep context and provide diverse actionable insights, necessitating a more robust agentic structure.
* **Enhanced Memory and Learning:** v2 incorporates a more sophisticated architecture that allows the agent to reference historical failure data and learn from past incident resolutions.
* **Advanced Tool Interaction:** The system was upgraded to handle more complex tool-calling capabilities, allowing the agent to interact more deeply with internal infrastructure APIs.
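The combination of incident memory and tool calling can be sketched as follows. The memory store, symptoms, and tools below are invented stand-ins for Naver's internal infrastructure APIs, and the exact-match recall keeps the sketch simple where a real agent would use semantic retrieval:

```python
# Invented incident memory and internal tool APIs.
PAST_INCIDENTS = [
    {"symptom": "timeout", "root_cause": "thread pool exhaustion"},
    {"symptom": "5xx spike", "root_cause": "bad deploy"},
]

TOOLS = {
    "thread pool exhaustion": lambda: "fetched thread dump",
    "bad deploy": lambda: "listed recent deployments",
}

def recall(symptom: str):
    """Look up a similar past incident (exact match for this sketch)."""
    for incident in PAST_INCIDENTS:
        if incident["symptom"] == symptom:
            return incident["root_cause"]
    return None

def diagnose(symptom: str) -> str:
    hypothesis = recall(symptom)
    if hypothesis in TOOLS:
        return f"hypothesis: {hypothesis}; {TOOLS[hypothesis]()}"
    return "no prior experience; fall back to generic data gathering"

print(diagnose("timeout"))  # → hypothesis: thread pool exhaustion; fetched thread dump
```

The key difference from v1 is that the tool choice is driven by a hypothesis recalled from experience, not by a fixed alarm-to-task table.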
### System Operations and Evaluation
* **Trigger Queue Management:** Implements a queuing system to efficiently process and prioritize multiple concurrent system alerts without overwhelming the diagnostic pipeline.
* **Anomaly Detection:** Utilizes advanced detection methods to distinguish between routine traffic fluctuations and genuine service anomalies that require LLM intervention.
* **Rigorous Evaluation:** The agent’s performance is measured through a dedicated evaluation framework that assesses the accuracy of its diagnoses against known ground-truth incidents.
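The queueing and anomaly-gating pieces above can be sketched together: a priority queue orders concurrent alerts, and a simple z-score check decides whether an alert reflects a genuine anomaly worth LLM diagnosis. Thresholds, metric history, and alert names are illustrative, and the article does not state which detection method Naver actually uses:

```python
import heapq
import statistics

def is_anomaly(history, current, z_threshold=3.0):
    """Flag values more than z_threshold standard deviations from the mean."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(current - mean) > z_threshold * stdev

# Invented error-rate history; lower priority number is processed first.
history = [100.0, 102.0, 98.0, 101.0, 99.0]
queue = []
heapq.heappush(queue, (2, "disk usage warning", 101.5))
heapq.heappush(queue, (1, "search error spike", 160.0))

while queue:
    _, name, value = heapq.heappop(queue)
    route = "LLM diagnosis" if is_anomaly(history, value) else "routine handling"
    print(f"{name} -> {route}")
```

Gating on a statistical check before invoking the LLM keeps routine fluctuations from consuming the expensive diagnostic pipeline.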
### Scaling and Future Challenges
* **Context Expansion:** Efforts are focused on integrating a wider range of metadata and environmental context to provide a holistic view of system health.
* **Action Recommendation:** The system is moving toward suggesting specific recovery actions, such as rollbacks or traffic rerouting, rather than just identifying the problem.
* **Sustainability:** Ensuring the DevOps Agent remains maintainable and cost-effective as the underlying search infrastructure and LLM models continue to evolve.
### Practical Recommendation
Organizations managing high-scale search traffic should consider LLM-based agents as integrated infrastructure components rather than standalone tools. Moving from reactive monitoring to a proactive, experience-based agent system is essential for reducing mean time to recovery (MTTR) in complex distributed environments.