k8s

7 posts

toss

How I Tore Down Our Legacy

Toss Payments modernized its inherited legacy infrastructure by building an OpenStack-based private cloud that operates alongside public cloud providers in an Active-Active hybrid configuration. By overcoming extreme technical debt—including servers burdened with nearly 2,000 manual routing entries—the team achieved a cloud-agnostic deployment environment that ensures high availability and cost efficiency. The transformation demonstrates how a small team can successfully implement complex open-source infrastructure through automation and the rigorous technical internalization of Cluster API and OpenStack.

### The Challenge of Legacy Networking

- The inherited infrastructure relied on server-side routing rather than network equipment, meaning every server carried its own routing table.
- Some legacy servers contained 1,997 individual routing entries, making manual management nearly impossible and preventing efficient scaling.
- Initial attempts to solve this with public cloud (AWS) faced limitations, including rising costs due to exchange rates, a lack of deep visibility for troubleshooting, and difficulty configuring disaster recovery (DR) between public and on-premise environments.

### Scaling OpenStack with a Two-Person Team

- Despite having only two engineers with no prior OpenStack experience, the team chose the open-source platform to maintain 100% control over the infrastructure.
- The team internalized the technology by installing three different versions of OpenStack dozens of times and simulating various failure scenarios.
- Automation was prioritized using Ansible and Terraform to manage the lifecycle of VMs and load balancers, enabling new instance creation in under 10 seconds (a minimal sketch of programmatic instance creation follows at the end of this summary).
- Deep technical tuning was applied, such as modifying the source code of the Octavia load balancer to output the custom log formats required by their monitoring setup.

### High Availability and Monitoring Strategy

- To ensure reliability, the team built three independent OpenStack clusters operating in an Active-Active configuration.
- This architecture allows traffic to be redirected immediately if a specific cluster fails, minimizing the impact on service availability.
- A comprehensive monitoring stack was implemented using Zabbix, Prometheus, Mimir, and Grafana to collect and visualize every essential metric across the private cloud.

### Managing Kubernetes with Cluster API

- To replicate the convenience of public cloud PaaS offerings (like EKS), the team implemented Cluster API to manage the Kubernetes lifecycle.
- Cluster API treats Kubernetes clusters themselves as resources within a management cluster, allowing standardized and rapid deployment across the private environment.
- This approach ensures that developers can deploy applications without needing to distinguish between the underlying cloud providers, fulfilling the goal of "cloud-agnostic" infrastructure.

### Practical Recommendation

For organizations dealing with massive technical debt or high public cloud costs, the Toss Payments model suggests that a private-first hybrid approach is viable even with limited headcount. The key is to avoid proprietary black-box solutions and instead invest in the technical internalization of open-source tools like OpenStack and Cluster API, backed by an infrastructure-as-code philosophy to ensure scalability and reliability.
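The post does not include code, but the kind of VM lifecycle automation it describes can be illustrated with a minimal openstacksdk sketch. The cloud name, image, flavor, and network names below are placeholders, and the team's actual tooling is Ansible and Terraform rather than this script.

```python
# Minimal sketch of automated instance creation on an OpenStack private cloud.
# Assumes a clouds.yaml entry named "private-cloud" and placeholder image,
# flavor, and network names; the original team drives this lifecycle with
# Ansible and Terraform, not this script.
import openstack

conn = openstack.connect(cloud="private-cloud")  # reads credentials from clouds.yaml

image = conn.compute.find_image("ubuntu-22.04")          # hypothetical image name
flavor = conn.compute.find_flavor("m1.medium")           # hypothetical flavor name
network = conn.network.find_network("service-network")   # hypothetical network name

server = conn.compute.create_server(
    name="api-node-01",
    image_id=image.id,
    flavor_id=flavor.id,
    networks=[{"uuid": network.id}],
)

# Block until the instance reaches ACTIVE; the post cites sub-10-second creation
# on their tuned clusters, but the timeout here is deliberately generous.
server = conn.compute.wait_for_server(server, wait=120)
print(f"{server.name} is {server.status}")
```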

toss

Managing Thousands of API/Batch Servers

Toss Payments manages thousands of API and batch server configurations that handle trillions of won in transactions, where a single typo in a JVM setting can lead to a massive financial infrastructure failure. To eliminate the risks of manual "copy-paste" workflows and configuration duplication, the team built a system that treats configuration as code. By combining a layered architecture with dynamic templates, they created a testable, unified environment capable of managing complex hybrid cloud setups with minimal human error.

## Overlay Architecture for Hierarchical Control

* The team implemented a layered configuration system consisting of `global`, `cluster`, `phase`, and `application` levels.
* Settings are resolved by priority, where lower-level layers override higher-level defaults, allowing servers to inherit common settings while keeping specific overrides.
* This structure lets the team control environment-specific behaviors, such as disabling canary deployments in development environments, from a single centralized directory.
* The directory structure maps files 1:1 to their respective layers, so naming conventions drive the CI/CD application process.

## Solving Duplication with Template Patterns

* Standard YAML overlays often break down for long strings or arrays, such as `JVM_OPTION`, because changing a single value usually requires redefining the entire block.
* To prevent the proliferation of nearly identical environment variables, the team introduced a template pattern using placeholders like `{{MAX_HEAP}}` (a minimal merge-and-substitute sketch follows at the end of this summary).
* Developers can modify specific parameters at the application layer while the core string remains defined at the global layer, significantly reducing the risk of typos.
* This approach keeps critical settings, such as G1GC parameters or heap region sizes, consistent across the infrastructure unless explicitly changed.

## Dynamic and Conditional Configuration Logic

* The system allows for "evolutionary" configurations in which injected Python scripts generate dynamic values, such as random JMX ports or data fetched from remote APIs.
* Advanced conditional logic handles complex deployment scenarios, letting environment variables change automatically based on the target cluster name (e.g., different profiles for AWS vs. IDC).
* By treating configuration as a living codebase, the team can adapt to new infrastructure requirements without abandoning their core architectural principles.

## Reliable Batch Processing through Simplicity

* For batch operations handling massive settlement volumes, the team prioritized "appropriate technology" and simplicity to minimize failure points.
* They chose Jenkins for its low learning curve and reliability, despite its lack of native GitOps support.
* To address inconsistencies from manual UI entries and varying Java versions across machines, they standardized the batch infrastructure so that high-stakes financial calculations run in a controlled, predictable environment.

The most effective way to manage large-scale infrastructure is to move from static, duplicated configuration files to a dynamic, code-centric system. By combining an overlay architecture for hierarchy with a template pattern for granular changes, organizations can achieve the flexibility needed for hybrid clouds while maintaining the strict safety standards required of financial systems.
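To make the overlay-plus-template idea concrete, here is a small, self-contained sketch. The layer names come from the post; the merge order, example values, and the `resolve` helper are illustrative assumptions, not the team's actual implementation.

```python
# Illustrative sketch of the overlay + template pattern described above.
# Layers are merged global -> cluster -> phase -> application, so more specific
# layers override shared defaults, and {{PLACEHOLDER}} tokens inside long
# shared strings (like JVM_OPTION) are expanded last.
import re

LAYERS = ["global", "cluster", "phase", "application"]  # priority: low -> high

def resolve(config_by_layer: dict[str, dict]) -> dict:
    """Merge layers in priority order, then expand {{...}} placeholders."""
    merged: dict = {}
    for layer in LAYERS:
        merged.update(config_by_layer.get(layer, {}))

    def expand(value):
        if not isinstance(value, str):
            return value
        return re.sub(r"\{\{(\w+)\}\}", lambda m: str(merged[m.group(1)]), value)

    return {key: expand(value) for key, value in merged.items()}

# Hypothetical example: JVM_OPTION is defined once at the global layer; only
# MAX_HEAP is overridden per application, avoiding copy-pasted option strings.
config = {
    "global": {
        "MAX_HEAP": "2g",
        "JVM_OPTION": "-Xmx{{MAX_HEAP}} -XX:+UseG1GC -XX:G1HeapRegionSize=16m",
    },
    "phase": {"CANARY_ENABLED": "false"},
    "application": {"MAX_HEAP": "8g"},
}

print(resolve(config)["JVM_OPTION"])
# -Xmx8g -XX:+UseG1GC -XX:G1HeapRegionSize=16m
```

The point of the pattern is that the risky, hand-typed part of the string lives in exactly one place; applications only touch the narrow parameters they actually need to change.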

daangn

Easily Operating Karrot

This post from the Daangn (Karrot) search platform team details their journey optimizing Elasticsearch operations on Kubernetes with ECK. While the initial migration to ECK reduced deployment times, the team faced critical latency spikes during rolling restarts due to cold caches and high traffic volumes. To achieve a "deploy anytime" environment, they developed a data-node warm-up system that ensures nodes are performance-ready before they begin handling live search requests.

## Scaling Challenges and Operational Constraints

- Over two years, Daangn's search infrastructure expanded from a single cluster to four specialized clusters, with peak traffic jumping from 1,000 to over 10,000 QPS.
- The initial strategy of avoiding peak hours for deployments became a bottleneck, as the window for safe updates narrowed while total deployment time across all clusters exceeded six hours.
- Manual monitoring became a necessity rather than an option, as engineers had to verify traffic conditions and latency graphs before and during every ArgoCD sync.

## The Hazards of Rolling Restarts in Elasticsearch

- Standard Kubernetes rolling restarts are problematic for stateful systems because a "Ready" Pod is not the same as a performant Pod; Elasticsearch relies heavily on memory-resident caches (page cache, query cache, field data cache).
- A version update of the Elastic Operator once triggered an unintended rolling restart that caused a 60% error rate and 3-second latency spikes, because new nodes had to fetch all data from disk.
- When a node restarts, the cluster enters a "Yellow" state in which the remaining replicas must handle 100% of the traffic, creating a single point of failure and increasing the load on the surviving nodes.

## Strategy for Reliable Node Warm-up

- The primary goal was to keep p99 latency stable during restarts, regardless of whether the deployment happens during peak traffic hours.
- The solution is a warm-up system that pre-loads frequently accessed data into the filesystem and Elasticsearch-internal caches before the node is allowed to join the load balancer.
- By executing representative search queries against a newly started node, the system ensures the necessary segments are already in the page cache, preventing the disk I/O thrashing that typically follows a cold start (a conceptual sketch follows at the end of this summary).

## Implementation Goals

- Automate the validation of node readiness beyond simple health checks to include performance readiness.
- Eliminate the need for human "eyes-on-glass" monitoring during the 90-minute deployment cycles.
- Maintain high availability and a consistent user experience even while shards are being reallocated and replicas are temporarily unassigned.

To run a truly resilient search platform on Kubernetes, it is critical to recognize that for stateful applications, "available" is not the same as "ready." A customized warm-up controller or equivalent logic is a recommended practice for any high-traffic Elasticsearch environment, because it decouples deployment schedules from traffic patterns.
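The post describes the warm-up system conceptually rather than with code. The sketch below only illustrates the general idea—replay representative queries directly against a restarted node and hold it back from traffic until latency settles. The node address, index, query terms, and threshold are placeholders; Daangn's actual controller logic is not shown in the source.

```python
# Conceptual warm-up sketch: replay representative queries against a freshly
# restarted Elasticsearch data node so its page cache and query caches are
# populated before the node starts taking live search traffic.
# Node address, index, query terms, and the latency threshold are illustrative.
import time
import requests

NODE = "http://es-data-7:9200"   # hypothetical node address
INDEX = "articles"               # hypothetical index
WARMUP_QUERIES = [               # placeholder "representative" queries
    {"query": {"match": {"title": "중고거래"}}},
    {"query": {"match": {"title": "자전거"}}},
]
MAX_LATENCY_MS = 100             # illustrative readiness threshold

def node_is_warm(rounds: int = 5) -> bool:
    """Replay the query set and report whether the slowest run is fast enough."""
    latencies = []
    for _ in range(rounds):
        for body in WARMUP_QUERIES:
            started = time.monotonic()
            resp = requests.post(
                f"{NODE}/{INDEX}/_search",
                json=body,
                params={"preference": "_local"},  # keep the search on this node's shards
                timeout=5,
            )
            resp.raise_for_status()
            latencies.append((time.monotonic() - started) * 1000)
    return max(latencies) <= MAX_LATENCY_MS

if __name__ == "__main__":
    while not node_is_warm():
        time.sleep(5)  # keep warming until latency stabilizes
    print("node is performance-ready; safe to admit to the load balancer")
```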

line

Why did Athenz engineers take on the

Security platform engineer Jung-woo Kim details his transition from a specialized Athenz developer to a "Kubestronaut," a CNCF designation awarded to those who master the entire Kubernetes ecosystem. By systematically obtaining five distinct certifications, he argues that deep, practical knowledge of container orchestration is essential for building secure, scalable access control systems in private cloud environments. His journey shows that moving beyond application-level expertise to master cluster administration and security directly improves architectural design and operational troubleshooting.

## The Kubestronaut Framework

* The title is awarded by the Cloud Native Computing Foundation (CNCF) to individuals who pass five certification exams: CKA, CKAD, CKS, KCNA, and KCSA.
* The CKA (Administrator), CKAD (Application Developer), and CKS (Security Specialist) exams are performance-based, requiring candidates to solve real-world problems in a live terminal environment rather than answer multiple-choice questions.
* Success demands a combination of deep technical knowledge, speed, and accuracy, as practitioners must configure clusters and resolve failures under strict time constraints.
* The remaining associate-level exams (KCNA and KCSA) provide a theoretical foundation in cloud-native security and ecosystem standards.

## A Progressive Path to Technical Mastery

* **CKAD (Application Developer):** The initial focus was on mastering the deployment of Athenz—an open-source auth system—so that it runs efficiently from a developer's perspective. Preparation involved rigorous use of tools like killer.sh to simulate the high-pressure exam environment.
* **CKA (Administrator):** To manage multi-cluster environments and understand the components that make Kubernetes function, the author moved to the administrator level, gaining insight into how the various services interact within a cluster.
* **CKS (Security Specialist):** Given his background in security, this was the most critical and difficult stage, focusing on cluster hardening, vulnerability analysis, and strict network policies to keep the entire infrastructure resilient.

## Organizational Impact and Open Source Governance

* The certifications provided a clearer understanding of open-source governance, specifically how Special Interest Groups (SIGs) and pull request (PR) workflows drive massive projects like Kubernetes.
* This technical depth was applied to a high-stakes project providing Athenz services in a Bare Metal as a Service (BMaaS) environment, allowing for more stable and efficient architecture design.
* The learning process was supported by corporate initiatives, including access to Udemy Business for technical training and a hybrid work culture that allowed for consistent early-morning study habits.

To reach expert-level proficiency in complex systems like Kubernetes, engineers should adopt the "Ubo-cheonri" philosophy of slow but steady progress: even one minute of study or a single GitHub commit per day eventually adds up to mastery of cloud-native architecture at the highest level. For those managing enterprise-grade infrastructure, the Kubestronaut path is highly recommended, as it turns theoretical knowledge into a broad, practical vision for system design.

line

Connecting Thousands of LY Corporation Services

LY Corporation developed a centralized control plane on top of Central Dogma to manage service-to-service communication across its vast, heterogeneous infrastructure of physical machines, virtual machines, and Kubernetes clusters. By adopting the industry-standard xDS protocol, the new system resolves the interoperability and scaling limitations of the legacy platform while providing a robust GitOps-based workflow. This architecture lets the company connect thousands of services with high reliability and sophisticated traffic control.

## Limitations of the Legacy System

The previous control plane faced several architectural bottlenecks that hindered developer productivity and system flexibility:

* **Tight Coupling:** The system was heavily dependent on a specific internal project management tool (PMC), making it difficult to support modern containerized environments like Kubernetes.
* **Proprietary Schemas:** Communication relied on custom message schemas, which created interoperability issues between different clients and versions.
* **Lack of Dynamic Registration:** The legacy setup could not handle dynamic endpoint registration effectively, functioning more as a static registry than a true service mesh control plane.
* **Limited Traffic Control:** It could not perform complex routing tasks, such as canary releases or advanced client-side load balancing, across diverse infrastructures.

## Central Dogma as a Control Plane

To solve these issues, the team leveraged Central Dogma, a Git-based repository service for textual configuration, as the foundation for a new control plane:

* **xDS Protocol Integration:** The new control plane implements the industry-standard xDS protocol, ensuring compatibility with Envoy and other modern data plane proxies.
* **GitOps Workflow:** Using Central Dogma's mirroring features, developers manage service configurations and traffic policies safely through pull requests in external Git repositories.
* **High Reliability:** The system inherits Central Dogma's native strengths, including multi-datacenter replication, high availability, and a robust authorization system.
* **Schema Evolution:** The control plane automatically transforms legacy metadata into standard xDS resources, allowing a smooth transition from the old infrastructure to the new service mesh.

## Dynamic Service Discovery and Registration

The architecture provides automated ways to manage service endpoints across different environments:

* **Kubernetes Endpoint Plugin:** A dedicated plugin watches for changes in Kubernetes services and automatically updates the xDS resource tree in Central Dogma.
* **Automated API Registration:** The system provides gRPC and HTTP APIs (e.g., `RegisterLocalityLbEndpoint`) that allow services to register themselves dynamically at startup (an illustrative xDS resource is sketched at the end of this summary).
* **Advanced Traffic Features:** The new control plane supports zone-aware routing, circuit breakers, automatic retries, and "slow start" for new endpoints.

## Evolution Toward a Sidecar-less Service Mesh

A major focus of the project is improving the developer experience by reducing the operational overhead of the data plane:

* **Sidecar-less Options:** The team is working toward providing service mesh benefits without requiring a sidecar proxy for every pod, which reduces resource consumption and simplifies debugging.
* **Unified Control:** Central Dogma acts as a single source of truth for both proxy-based and proxyless service mesh configurations, ensuring consistent policy enforcement across the entire organization.

For organizations managing large-scale, heterogeneous infrastructure, transitioning to an xDS-compliant control plane backed by a reliable Git-based configuration store is highly recommended. This approach balances the need for high-speed dynamic updates with the safety and auditability of GitOps, ultimately allowing for a more scalable and developer-friendly service mesh.
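To make the registration flow more tangible, here is a sketch of the kind of xDS resource that dynamic endpoint registration ultimately produces. The field names follow Envoy's public `endpoint.v3.ClusterLoadAssignment` proto; the service name, zone, and address are placeholders, and the exact request shape of the internal `RegisterLocalityLbEndpoint` API is not described in the source post.

```python
# Sketch of the EDS resource a registered endpoint ends up in: a
# ClusterLoadAssignment grouping endpoints by locality, which enables the
# zone-aware routing mentioned above. Values are placeholders; the field
# layout follows Envoy's public endpoint.v3 proto, rendered as a dict.
cluster_load_assignment = {
    "@type": "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment",
    "cluster_name": "payment-api",                       # hypothetical service name
    "endpoints": [
        {
            "locality": {"region": "jp", "zone": "zone-a"},  # drives zone-aware routing
            "lb_endpoints": [
                {
                    "endpoint": {
                        "address": {
                            "socket_address": {
                                "address": "10.0.12.34",     # pod or VM address
                                "port_value": 8080,
                            }
                        }
                    },
                    "load_balancing_weight": 1,
                }
            ],
        }
    ],
}
```

Whether an endpoint arrives via the Kubernetes plugin or the startup registration API, the control plane serves the resulting resource to Envoy proxies or proxyless clients over the standard xDS discovery streams.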

naver

Replacing a DB CDC replication tool that processes

Naver Pay transitioned its core database replication system from a legacy tool to "ergate," a high-performance CDC (Change Data Capture) solution built on Apache Flink and Spring. The overhaul was designed to improve maintainability for backend developers while resolving the rigid schema dependencies that previously caused operational bottlenecks. By leveraging a modern stream-processing architecture, the system now handles massive transaction volumes with sub-second latency and improved reliability.

### Limitations of the Legacy System

* **Maintenance Barriers:** The previous tool, mig-data, was written in pure Java by database core specialists, making it difficult for ordinary backend developers to maintain or extend.
* **Strict Schema Dependency:** Developers had to follow a rigid DDL execution order (Target DB before Source DB) to avoid replication halts, complicating database operations.
* **Blocking Failures:** Because the legacy system prioritized bi-directional data integrity, a single failed record could stall the entire replication pipeline for a shard.
* **Operational Risk:** Recovery procedures were manual and restricted to a small group of specialists, increasing time-to-recovery during outages.

### Technical Architecture and Stack

* **Apache Flink (LTS 2.0.0):** Selected for its high availability, low latency, and native Kafka integration, letting the team focus on replication logic rather than infrastructure.
* **Kubernetes Session Mode:** Used to manage 12 concurrent jobs (6 replication, 6 verification) through a single JobManager endpoint for streamlined monitoring and deployment.
* **Hybrid Framework Approach:** High-speed replication logic is isolated in Flink, while complex recovery modules are written in Spring (Kotlin) to leverage developer familiarity.
* **Data Pipeline:** The system captures MySQL binlogs via `nbase-cdc`, publishes them to Kafka, and uses Flink `jdbc-sink` jobs to apply changes to the Target DBs (nBase-T and Oracle).

### Three-Tier Operational Model: Replication, Verification, and Recovery

* **Real-time Replication:** Processes incoming Kafka records and appends custom metadata columns (`ergate_yn`, `rpc_time`) to track the replication source and the original commit time.
* **Delayed Verification:** A dedicated "verifier" Flink job consumes the same Kafka topic with a 2-minute delay to check Target DB consistency against the source record (a conceptual sketch follows at the end of this summary).
* **Secondary Logic:** To prevent false positives from rapid successive updates, the verifier re-queries the Source DB live if a mismatch is initially detected.
* **Multi-Stage Recovery:**
  * **Automatic Short-term:** Retries transient failures after 5 minutes.
  * **Automatic Long-term:** Uses batch processes to resolve persistent discrepancies.
  * **Manual:** Provides an admin interface for developers to trigger targeted reconciliations via API.

### Improvements in Schema Management and Performance

* **DDL Independence:** With query and schema caching, ergate allows Source and Target tables to be updated in any order without halting the pipeline.
* **Performance Scaling:** The new system is designed to handle 10x the current peak QPS, ensuring stability even during high-traffic events such as major sales or promotions.
* **Metadata Tracking:** Dedicated replication identifiers make it easy to distinguish automated replication from manual force-sync actions during troubleshooting.
The ergate project demonstrates that a hybrid architecture—combining the high-throughput processing of Apache Flink with the robust logic handling of Spring—is highly effective for mission-critical financial systems. Organizations managing large-scale data replication should consider decoupling complex recovery logic from the main processing stream to ensure both performance and developer productivity.
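The real verifier runs as a Flink job; the Python loop below only illustrates the delayed-verification idea: consume the same CDC topic with a roughly 2-minute lag, compare the Target DB row against the record, and re-query the Source DB before flagging a mismatch. The topic name, record shape (`after`, `rpc_time` as epoch seconds), and the `query_*`/`report_mismatch` helpers are placeholders, not Naver Pay's actual implementation.

```python
# Conceptual sketch of the delayed "verifier" described above (illustrative only;
# the production system is a Flink job, not this loop).
import json
import time

from confluent_kafka import Consumer

VERIFY_DELAY_SEC = 120  # the 2-minute verification delay cited in the post

# Placeholder lookups; the real jobs verify against nBase-T / Oracle targets.
def query_target(table: str, pk: str) -> dict | None: ...
def query_source(table: str, pk: str) -> dict | None: ...
def report_mismatch(record: dict) -> None: ...

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",     # placeholder brokers
    "group.id": "ergate-verifier",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["ergate-cdc-events"])  # placeholder topic name

def verify(record: dict) -> None:
    target_row = query_target(record["table"], record["pk"])
    if target_row == record["after"]:          # assumed after-image field
        return
    # Secondary check: a rapid follow-up update may already have changed the
    # source row, so re-query the Source DB before treating this as an error.
    source_row = query_source(record["table"], record["pk"])
    if target_row != source_row:
        report_mismatch(record)

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    record = json.loads(msg.value())
    lag = time.time() - record["rpc_time"]     # original commit time (assumed epoch secs)
    if lag < VERIFY_DELAY_SEC:
        time.sleep(VERIFY_DELAY_SEC - lag)     # enforce the verification delay
    verify(record)
```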

line

Flexible Multi-site Architecture Designed

LINE NEXT optimized its web server infrastructure by moving from fragmented, manual Nginx setups to a centralized native-Nginx multi-site architecture. By consolidating global configuration and automating the deployment pipeline with Ansible, the team reduced service launch lead times by over 80% while regaining access to advanced features such as GeoIP and real client IP tracking. The new setup scales to more than 100 subdomains across diverse global services with high reliability and minimal manual overhead.

## Evolution of Nginx Infrastructure

* **PMC-based Structure**: The initial phase relied on a Project Management Console pushing files via `rsync` over SSH; this created security risks and led to fragmented, siloed configurations that were difficult to maintain.
* **Ingress Nginx Structure**: To improve speed, the team moved to Kubernetes-based Ingress using Helm charts, which automated domain and certificate settings but limited the use of native Nginx modules and complicated retrieval of real client IP addresses.
* **Native Nginx Multi-site Structure**: The current hybrid approach uses native Nginx managed by Ansible, combining the speed of configuration-driven setups with the flexibility of advanced modules such as GeoIP, alongside Loki for log collection.

## Configuration Integration and Multi-site Management

* **Master Configuration Extraction**: Common directives such as timeouts, keep-alive settings, and log formats were extracted into a master Nginx configuration file to eliminate redundancy across services.
* **Hierarchical Directory Structure**: Inspired by Apache, the team adopted a `sites-available` layout in which the `server` blocks for different services and environments (alpha, beta, production) live in separate files.
* **Operational Efficiency**: This integrated structure lets a single Nginx instance serve multiple sites simultaneously, significantly reducing the time required to add and deploy new service domains.

## Automated Deployment with Ansible

* **Standardized Workflow**: Manual processes were replaced with Ansible playbooks that handle everything from cloning the latest configuration from Git to extracting environment-specific files.
* **Safety and Validation**: The automated pipeline includes mandatory Nginx syntax verification (`nginx -t`) and process status checks to confirm stability before a deployment is finalized.
* **Rolling Deployments**: To minimize service impact, updates are pushed sequentially across servers, and the process automatically halts if an error is detected at any stage of the rollout (a simplified sketch follows at the end of this summary).

To manage a rapidly expanding portfolio of global services, infrastructure teams should adopt a configuration-as-code model that separates common master settings from service-specific logic. Pairing an automation tool like Ansible with a native Nginx multi-site structure provides the balance between rapid deployment and the granular control needed for complex logging and security requirements.
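The actual pipeline is an Ansible playbook; the sketch below only mirrors its safety gates in plain Python: validate each server with `nginx -t`, reload, and stop the rollout at the first failing host. The host names are placeholders, and the Git clone and config sync steps are omitted.

```python
# Simplified stand-in for the rolling deployment described above: for each
# server in order, validate the synced configuration with `nginx -t`, reload
# Nginx, and abort the whole rollout on the first failure. Host names are
# placeholders; the real pipeline is an Ansible playbook that also clones the
# configuration from Git and checks process status.
import subprocess
import sys

SERVERS = ["web-01", "web-02", "web-03"]  # placeholder inventory

def run_on(host: str, command: str) -> bool:
    """Run a remote command over ssh; return True on success."""
    result = subprocess.run(["ssh", host, command], capture_output=True, text=True)
    if result.returncode != 0:
        print(f"[{host}] FAILED: {command}\n{result.stderr}", file=sys.stderr)
    return result.returncode == 0

for host in SERVERS:  # rolling: one server at a time
    # (configuration sync from Git happens before this step and is omitted here)
    if not (run_on(host, "nginx -t") and run_on(host, "systemctl reload nginx")):
        sys.exit(f"rollout halted at {host}; remaining servers left untouched")
    print(f"[{host}] validated and reloaded")
```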