gitops

2 posts

toss

Managing Thousands of API/Batch Servers

Toss Payments manages thousands of API and batch server configurations that handle trillions of won in transactions, where a single typo in a JVM setting can lead to a massive financial infrastructure failure. To eliminate the risks of manual "copy-paste" workflows and configuration duplication, the team built a system that treats configuration as code. By combining a layered architecture with dynamic templates, they created a testable, unified environment that manages complex hybrid cloud setups with minimal human error.

## Overlay Architecture for Hierarchical Control

* The team implemented a layered configuration system consisting of `global`, `cluster`, `phase`, and `application` levels.
* Settings are resolved by priority: lower-level layers override higher-level defaults, so servers inherit common settings while keeping specific overrides (see the resolution sketch after this summary).
* This structure lets the team control environment-specific behavior, such as disabling canary deployments in development environments, from a single centralized directory.
* The directory structure maps files 1:1 to their respective layers, so naming conventions drive the CI/CD application process.

## Solving Duplication with Template Patterns

* Standard YAML overlays often fail with long strings or arrays, such as `JVM_OPTION`, because changing a single value requires redefining the entire block.
* To prevent the proliferation of nearly identical environment variables, the team introduced a template pattern using placeholders like `{{MAX_HEAP}}` (sketched after this summary).
* Developers modify specific parameters at the application layer while the core string stays defined at the global layer, significantly reducing the risk of typos.
* This keeps critical settings, such as G1GC parameters or heap region sizes, consistent across the infrastructure unless explicitly changed.

## Dynamic and Conditional Configuration Logic

* The system allows "evolutionary" configurations: Python scripts can be injected to generate dynamic values, such as random JMX ports or data fetched from remote APIs.
* Conditional logic handles complex deployment scenarios, letting environment variables change automatically based on the target cluster name (e.g., different profiles for AWS vs. IDC); a brief sketch follows this summary.
* By treating configuration as a living codebase, the team can adapt to new infrastructure requirements without abandoning their core architectural principles.

## Reliable Batch Processing through Simplicity

* For batch operations handling massive settlement volumes, the team prioritized "appropriate technology" and simplicity to minimize failure points.
* They chose Jenkins for its low learning curve and reliability, despite its lack of native GitOps support.
* To address inconsistencies from manual UI entries and varying Java versions across machines, they standardized the batch infrastructure so that high-stakes financial calculations run in a controlled, predictable environment.

The most effective way to manage large-scale infrastructure is to move from static, duplicated configuration files to a dynamic, code-centric system. Combining an overlay architecture for hierarchy with a template pattern for granular changes gives organizations the flexibility needed for hybrid clouds while maintaining the strict safety standards required for financial systems.
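To make the overlay resolution concrete, here is a minimal sketch in Python, assuming each layer is loaded as a plain dictionary. The layer names come from the post; the deep-merge semantics and sample values are illustrative assumptions, not Toss's actual implementation.

```python
# Minimal sketch of overlay resolution: later (more specific) layers
# override earlier (more general) ones. Layer names follow the post;
# merge semantics and sample values are assumptions for illustration.
from functools import reduce

def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge `override` into `base`; override wins on conflicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# One dict per layer, ordered from most general to most specific.
global_layer      = {"canary": True,  "env": {"TZ": "Asia/Seoul"}}
cluster_layer     = {"env": {"CLOUD": "aws"}}
phase_layer       = {"canary": False}          # e.g., disable canary in dev
application_layer = {"env": {"MAX_HEAP": "4g"}}

resolved = reduce(deep_merge,
                  [global_layer, cluster_layer, phase_layer, application_layer])
# -> {'canary': False, 'env': {'TZ': 'Asia/Seoul', 'CLOUD': 'aws', 'MAX_HEAP': '4g'}}
```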
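The template pattern can likewise be sketched in a few lines, assuming placeholders are simple `{{NAME}}` tokens. The `JVM_OPTION` string, `{{MAX_HEAP}}` placeholder, and G1GC flags echo the post; the exact template syntax and default values are assumptions.

```python
# Sketch of the template pattern: the long JVM_OPTION string is defined
# once at the global layer with {{PLACEHOLDER}} tokens, and each
# application overrides only the parameters it cares about.
# Template syntax and default values are illustrative assumptions.
import re

GLOBAL_JVM_OPTION = (
    "-Xms{{MAX_HEAP}} -Xmx{{MAX_HEAP}} "
    "-XX:+UseG1GC -XX:G1HeapRegionSize={{REGION_SIZE}}"
)

GLOBAL_DEFAULTS = {"MAX_HEAP": "2g", "REGION_SIZE": "8m"}

def render(template: str, overrides: dict) -> str:
    """Fill {{NAME}} placeholders from defaults merged with app-level overrides."""
    values = {**GLOBAL_DEFAULTS, **overrides}
    return re.sub(r"\{\{(\w+)\}\}", lambda m: values[m.group(1)], template)

# An application overrides only the heap size; the G1GC flags stay intact.
print(render(GLOBAL_JVM_OPTION, {"MAX_HEAP": "8g"}))
# -Xms8g -Xmx8g -XX:+UseG1GC -XX:G1HeapRegionSize=8m
```

Because the template lives once at the global layer, a typo in the G1GC block can no longer be introduced by an application-level copy-paste.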
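Finally, a brief sketch of the dynamic and conditional pieces. The ideas here (injected Python scripts, random JMX ports, AWS-vs-IDC switching by cluster name) come from the post; the function names and the exact selection rule are assumptions.

```python
# Sketch of dynamic and conditional configuration. The concepts come
# from the post; function names and selection rules are assumptions.
import random

def jmx_port() -> int:
    """Dynamic value: pick a random JMX port from an assumed range."""
    return random.randint(20000, 29999)

def spring_profile(cluster_name: str) -> str:
    """Conditional value: derive the profile from the target cluster name."""
    return "aws" if cluster_name.startswith("aws-") else "idc"

env = {
    "JMX_PORT": str(jmx_port()),
    "SPRING_PROFILES_ACTIVE": spring_profile("aws-prod-1"),  # -> "aws"
}
```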

line

Connecting Thousands of LY Corporation Services

LY Corporation developed a centralized control plane using Central Dogma to manage service-to-service communication across its vast, heterogeneous infrastructure of physical machines, virtual machines, and Kubernetes clusters. By adopting the industry-standard xDS protocol, the new system resolves the interoperability and scaling limitations of the legacy platform while providing a robust GitOps-based workflow. This architecture lets the company connect thousands of services with high reliability and sophisticated traffic control.

## Limitations of the Legacy System

The previous control plane faced several architectural bottlenecks that hindered developer productivity and system flexibility:

* **Tight Coupling:** The system was heavily dependent on a specific internal project management tool (PMC), making it difficult to support modern containerized environments like Kubernetes.
* **Proprietary Schemas:** Communication relied on custom message schemas, which created interoperability issues between different clients and versions.
* **Lack of Dynamic Registration:** The legacy setup could not handle dynamic endpoint registration effectively, functioning more as a static registry than a service mesh control plane.
* **Limited Traffic Control:** It could not perform complex routing tasks, such as canary releases or advanced client-side load balancing, across diverse infrastructures.

## Central Dogma as a Control Plane

To solve these issues, the team leveraged Central Dogma, a Git-based repository service for textual configuration, as the foundation for a new control plane:

* **xDS Protocol Integration:** The new control plane implements the industry-standard xDS protocol, ensuring compatibility with Envoy and other modern data plane proxies.
* **GitOps Workflow:** Using Central Dogma's mirroring features, developers manage service configurations and traffic policies safely through Pull Requests in external Git repositories.
* **High Reliability:** The system inherits Central Dogma's native strengths, including multi-datacenter replication, high availability, and a robust authorization system.
* **Schema Evolution:** The control plane automatically transforms legacy metadata into standard xDS resources, allowing a smooth transition from the old infrastructure to the new service mesh.

## Dynamic Service Discovery and Registration

The architecture provides automated ways to manage service endpoints across different environments (two sketches follow this summary):

* **Kubernetes Endpoint Plugin:** A dedicated plugin watches for changes in Kubernetes services and automatically updates the xDS resource tree in Central Dogma.
* **Automated API Registration:** gRPC and HTTP APIs (e.g., `RegisterLocalityLbEndpoint`) allow services to register themselves dynamically during startup.
* **Advanced Traffic Features:** The control plane supports zone-aware routing, circuit breakers, automatic retries, and "slow start" mechanisms for new endpoints.

## Evolution Toward Sidecar-less Service Mesh

A major focus of the project is improving the developer experience by reducing the operational overhead of the data plane:

* **Sidecar-less Options:** The team is working toward providing service mesh benefits without requiring a sidecar proxy for every pod, which reduces resource consumption and simplifies debugging.
* **Unified Control:** Central Dogma acts as a single source of truth for both proxy-based and proxyless service mesh configurations, ensuring consistent policy enforcement across the organization.

For organizations managing large-scale, heterogeneous infrastructure, transitioning to an xDS-compliant control plane backed by a reliable Git-based configuration store is highly recommended. This approach balances the need for high-speed dynamic updates with the safety and auditability of GitOps, ultimately allowing for a more scalable and developer-friendly service mesh.
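To illustrate the Kubernetes endpoint plugin described above, here is a minimal watcher sketch using the official Kubernetes Python client. The watch loop uses real client APIs; the translation into an endpoint list and the `push_to_central_dogma` hook are hypothetical stand-ins for the plugin's actual update of the xDS resource tree.

```python
# Sketch of a Kubernetes endpoint watcher: stream Endpoints changes and
# translate them into an xDS-style endpoint list. The watch loop uses the
# real kubernetes Python client; push_to_central_dogma is a hypothetical
# hook standing in for the plugin's actual resource-tree update.
from kubernetes import client, config, watch

def push_to_central_dogma(name: str, resource: dict) -> None:
    """Hypothetical: commit the updated resource to the config repository."""
    print(f"would update {name}: {resource}")

def main() -> None:
    config.load_incluster_config()  # or config.load_kube_config() outside a pod
    v1 = client.CoreV1Api()
    for event in watch.Watch().stream(v1.list_endpoints_for_all_namespaces):
        endpoints = event["object"]
        addresses = [
            {"address": addr.ip, "port_value": port.port}
            for subset in (endpoints.subsets or [])
            for addr in (subset.addresses or [])
            for port in (subset.ports or [])
        ]
        push_to_central_dogma(endpoints.metadata.name, {"endpoints": addresses})

if __name__ == "__main__":
    main()
```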
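As a companion sketch, the xDS resource that a dynamic registration might produce can be shown as plain data. The field names follow the standard Envoy `ClusterLoadAssignment` schema; the service name, locality, and address are illustrative assumptions, and the post does not publish the actual `RegisterLocalityLbEndpoint` signature.

```python
# Sketch of the xDS resource a dynamic registration might produce.
# Field names follow the standard envoy ClusterLoadAssignment schema;
# the service name, locality, and address are illustrative assumptions.
cluster_load_assignment = {
    "@type": "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment",
    "cluster_name": "payment-service",
    "endpoints": [
        {
            # Locality enables the zone-aware routing mentioned in the post.
            "locality": {"region": "kr", "zone": "zone-a"},
            "lb_endpoints": [
                {
                    "endpoint": {
                        "address": {
                            "socket_address": {
                                "address": "10.0.1.23",
                                "port_value": 8080,
                            }
                        }
                    }
                }
            ],
        }
    ],
}
```

A service starting up would register something of roughly this shape with the control plane, which would then merge it into the resource tree served to xDS clients.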