Hello. I'm Inoue, in charge of private cloud infrastructure at LY Corporation. LY Corporation's massive traffic and data are supported by a large-scale private cloud that we develop and operate in-house. We are currently consolidating two huge cloud platforms — 'Verda', used by the former LINE Corporation, and 'YNW (IaaS (infrastructure as a service))', used by the former Yahoo Japan Corporation — into our next-generation cloud platform, Fl…
The truly programmable SASE platform 2026-03-02 Abe Carryl Every organization approaches security through a unique lens, shaped by their tooling, requirements, and history. No two environments look the same, and none stay static for long. We believe the platforms that protect th…
Introduction: Hello. I'm Park Young-hee from Cloud Service CBU, in charge of the private cloud for development services. LY Corporation builds and operates an internal private cloud to provide the infrastructure and platforms needed for service development, and we are consolidating the cloud services that Yahoo! JAPAN and LINE each used before the merger into LY Corporation into a single platform. The name of the new unified private cloud is 'Flava'. In this article, I'd like to discuss how the cloud industry as a whole will evolve…
AWS Weekly Roundup: Amazon Bedrock agent workflows, Amazon SageMaker private connectivity, and more (February 2, 2026) Over the past week, we passed Laba festival, a traditional marker in the Chinese calendar that signals the final stretch leading up to the Lunar New Year. For m…
Toss Payments modernized its inherited legacy infrastructure by building an OpenStack-based private cloud to operate alongside public cloud providers in an Active-Active hybrid configuration. By overcoming extreme technical debt—including servers burdened with nearly 2,000 manual routing entries—the team achieved a cloud-agnostic deployment environment that ensures high availability and cost efficiency. The transformation demonstrates how a small team can successfully implement complex open-source infrastructure through automation and the rigorous technical internalization of Cluster API and OpenStack.
### The Challenge of Legacy Networking
- The inherited infrastructure relied on server-side routing rather than network equipment, meaning every server carried its own routing table.
- Some legacy servers contained 1,997 individual routing entries, making manual management nearly impossible and preventing efficient scaling.
- Initial attempts to solve this via public cloud (AWS) faced limitations, including rising costs due to exchange rates, lack of deep visibility for troubleshooting, and difficulties in disaster recovery (DR) configuration between public and on-premise environments.
### Scaling OpenStack with a Two-Person Team
- Despite having only two engineers with no prior OpenStack experience, the team chose the open-source platform to maintain 100% control over the infrastructure.
- The team internalized the technology by installing three different versions of OpenStack dozens of times and simulating various failure scenarios.
- Automation was prioritized using Ansible and Terraform to manage the lifecycle of VMs and load balancers, enabling new instance creation in under 10 seconds.
- Deep technical tuning was applied, such as modifying the source code of the Octavia load balancer to output custom log formats required for their specific monitoring needs.
### High Availability and Monitoring Strategy
- To ensure reliability, the team built three independent OpenStack clusters operating in an Active-Active configuration.
- This architecture allows for immediate traffic redirection if a specific cluster fails, minimizing the impact on service availability.
- A comprehensive monitoring stack was implemented using Zabbix, Prometheus, Mimir, and Grafana to collect and visualize every essential metric across the private cloud.
### Managing Kubernetes with Cluster API
- To replicate the convenience of managed Kubernetes services in the public cloud (such as EKS), the team adopted Cluster API to manage the Kubernetes lifecycle.
- Cluster API treats Kubernetes clusters themselves as resources within a management cluster, allowing for standardized and rapid deployment across the private environment.
- This approach ensures that developers can deploy applications without needing to distinguish between the underlying cloud providers, fulfilling the goal of "cloud-agnostic" infrastructure.
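To make the Cluster API model above concrete, here is a minimal sketch of what a cluster definition can look like when the infrastructure provider is OpenStack. This is an illustrative manifest, not Toss Payments' actual configuration — resource names are made up, and the provider API versions vary by Cluster API and CAPO release.

```yaml
# Illustrative Cluster API manifest: the management cluster treats this
# Cluster object as a resource and reconciles a workload cluster from it.
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: payments-cluster-01        # hypothetical cluster name
  namespace: default
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["192.168.0.0/16"]
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: payments-cluster-01-control-plane
  infrastructureRef:
    # The OpenStack provider (CAPO) supplies this resource; its API
    # version depends on the installed provider release.
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha7
    kind: OpenStackCluster
    name: payments-cluster-01
```

Because every cluster is just another declarative resource in the management cluster, standing up a new Kubernetes cluster becomes an `apply` rather than a bespoke installation — which is what makes the "EKS-like" experience possible on private infrastructure.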
### Practical Recommendation
For organizations dealing with massive technical debt or high public cloud costs, the Toss Payments model suggests that a "private-first" hybrid approach is viable even with limited headcount. The key is to avoid proprietary black-box solutions and instead invest in the technical internalization of open-source tools like OpenStack and Cluster API, backed by an infrastructure-as-code philosophy to ensure scalability and reliability.
Toss Payments manages thousands of API and batch server configurations that handle trillions of won in transactions, where a single typo in a JVM setting can lead to massive financial infrastructure failure. To solve the risks associated with manual "copy-paste" workflows and configuration duplication, the team developed a sophisticated system that treats configuration as code. By implementing layered architectures and dynamic templates, they created a testable, unified environment capable of managing complex hybrid cloud setups with minimal human error.
## Overlay Architecture for Hierarchical Control
* The team implemented a layered configuration system consisting of `global`, `cluster`, `phase`, and `application` levels.
* Settings are resolved by priority, where lower-level layers override higher-level defaults, allowing servers to inherit common settings while maintaining specific overrides.
* This structure allows the team to control environment-specific behaviors, such as disabling canary deployments in development environments, from a single centralized directory.
* The directory structure maps files 1:1 to their respective layers, ensuring that naming conventions drive the CI/CD application process.
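The priority-based resolution described above can be sketched in a few lines of Python. This is a simplified illustration of the merge semantics, not Toss Payments' actual tooling — the layer contents are hypothetical.

```python
# Minimal sketch of layered configuration resolution: more specific layers
# override more general ones, key by key, in a fixed priority order.
from functools import reduce


def resolve(*layers: dict) -> dict:
    """Merge layers in priority order; the last (most specific) layer wins."""
    return reduce(lambda acc, layer: {**acc, **layer}, layers, {})


# Hypothetical settings for one application:
global_cfg = {"canary": True, "log_level": "INFO", "heap": "4g"}
cluster_cfg = {"log_level": "WARN"}
phase_cfg = {"canary": False}   # e.g. canary deployments disabled in dev
app_cfg = {"heap": "8g"}

merged = resolve(global_cfg, cluster_cfg, phase_cfg, app_cfg)
print(merged)  # {'canary': False, 'log_level': 'WARN', 'heap': '8g'}
```

The real system resolves files on disk rather than in-memory dicts, but the principle is the same: a server's effective configuration is the union of its layers, with the application layer having the final say.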
## Solving Duplication with Template Patterns
* Standard YAML overlays often fail when dealing with long strings or arrays, such as `JVM_OPTION`, because changing a single value usually requires redefining the entire block.
* To prevent the proliferation of nearly identical environment variables, the team introduced a template pattern using placeholders like `{{MAX_HEAP}}`.
* Developers can modify specific parameters at the application layer while the core string remains defined at the global layer, significantly reducing the risk of typos.
* This approach ensures that critical settings, like G1GC parameters or heap region sizes, remain consistent across the infrastructure unless explicitly changed.
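The template pattern above can be sketched as follows. The placeholder syntax mirrors the article's `{{MAX_HEAP}}` example, but the JVM option string, the `render` helper, and the parameter values are illustrative assumptions, not the team's actual implementation.

```python
import re

# The global layer defines the full JVM option string exactly once, with
# placeholders for the few values applications are allowed to vary.
JVM_OPTION_TEMPLATE = (
    "-Xms{{MAX_HEAP}} -Xmx{{MAX_HEAP}} "
    "-XX:+UseG1GC -XX:G1HeapRegionSize={{REGION_SIZE}}"
)


def render(template: str, params: dict) -> str:
    """Substitute {{NAME}} placeholders; a missing parameter raises KeyError."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: params[m.group(1)], template)


# The application layer supplies only the parameters, never the whole string:
jvm_option = render(JVM_OPTION_TEMPLATE, {"MAX_HEAP": "8g", "REGION_SIZE": "16m"})
print(jvm_option)
# -Xms8g -Xmx8g -XX:+UseG1GC -XX:G1HeapRegionSize=16m
```

Because the surrounding flags live in one place, a typo can only occur in the short parameter values — and a forgotten parameter fails loudly at render time instead of silently shipping a malformed JVM flag.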
## Dynamic and Conditional Configuration Logic
* The system allows for "evolutionary" configurations where Python scripts can be injected to generate dynamic values, such as random JMX ports or data fetched from remote APIs.
* Advanced conditional logic was added to handle complex deployment scenarios, enabling environment variables to change their values automatically based on the target cluster name (e.g., different profiles for AWS vs. IDC).
* By treating configuration as a living codebase, the team can adapt to new infrastructure requirements without abandoning their core architectural principles.
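The conditional and dynamic ideas above might look roughly like this in Python. Everything here is a hedged sketch — the cluster naming convention, the profile names, and the JMX port range are invented for illustration.

```python
import random


def spring_profile(cluster_name: str) -> str:
    """Pick a deployment profile from the target cluster's name (illustrative
    convention: clusters prefixed 'aws-' run in AWS, everything else in IDC)."""
    return "aws" if cluster_name.startswith("aws-") else "idc"


def jmx_port() -> int:
    """Dynamic value: a random JMX port drawn from an agreed range."""
    return random.randint(20000, 29999)


# Environment variables generated at deploy time for one target cluster:
env = {
    "SPRING_PROFILES_ACTIVE": spring_profile("aws-payments-prod"),
    "JMX_PORT": str(jmx_port()),
}
print(env["SPRING_PROFILES_ACTIVE"])  # aws
```

The point is not the specific logic but that it lives in reviewable, testable code: the same definition produces the right value for every cluster, instead of each server's variables being edited by hand.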
## Reliable Batch Processing through Simplicity
* For batch operations handling massive settlement volumes, the team prioritized "appropriate technology" and simplicity to minimize failure points.
* They chose Jenkins for its low learning curve and reliability, despite its lack of native GitOps support.
* To address inconsistencies in manual UI entries and varying Java versions across machines, they standardized the batch infrastructure to ensure that high-stakes financial calculations are executed in a controlled, predictable environment.
The most effective way to manage large-scale infrastructure is to transition from static, duplicated configuration files to a dynamic, code-centric system. By combining an overlay architecture for hierarchy and a template pattern for granular changes, organizations can achieve the flexibility needed for hybrid clouds while maintaining the strict safety standards required for financial systems.
At Ignite, Microsoft just announced Team customizations and imaging for Microsoft Dev Box. The goal for this feature is to improve developer productivity and happiness by reducing the time it takes to set up and maintain development environments. Team customizations began as an i…