high-availability

4 posts

toss

How I Tore Down Our Legacy

Toss Payments modernized its inherited legacy infrastructure by building an OpenStack-based private cloud to operate alongside public cloud providers in an Active-Active hybrid configuration. By overcoming extreme technical debt, including servers burdened with nearly 2,000 manual routing entries, the team achieved a cloud-agnostic deployment environment that ensures high availability and cost efficiency. The transformation demonstrates how a small team can successfully implement complex open-source infrastructure through automation and the rigorous technical internalization of Cluster API and OpenStack.

### The Challenge of Legacy Networking

- The inherited infrastructure relied on server-side routing rather than network equipment, meaning every server carried its own routing table.
- Some legacy servers contained 1,997 individual routing entries, making manual management nearly impossible and preventing efficient scaling.
- Initial attempts to solve this via public cloud (AWS) faced limitations, including rising costs due to exchange rates, lack of deep visibility for troubleshooting, and difficulties in disaster recovery (DR) configuration between public and on-premise environments.

### Scaling OpenStack with a Two-Person Team

- Despite having only two engineers with no prior OpenStack experience, the team chose the open-source platform to maintain 100% control over the infrastructure.
- The team internalized the technology by installing three different versions of OpenStack dozens of times and simulating various failure scenarios.
- Automation was prioritized using Ansible and Terraform to manage the lifecycle of VMs and load balancers, enabling new instance creation in under 10 seconds (a provisioning sketch follows this summary).
- Deep technical tuning was applied, such as modifying the source code of the Octavia load balancer to output custom log formats required for their specific monitoring needs.

### High Availability and Monitoring Strategy

- To ensure reliability, the team built three independent OpenStack clusters operating in an Active-Active configuration.
- This architecture allows for immediate traffic redirection if a specific cluster fails, minimizing the impact on service availability.
- A comprehensive monitoring stack was implemented using Zabbix, Prometheus, Mimir, and Grafana to collect and visualize every essential metric across the private cloud.

### Managing Kubernetes with Cluster API

- To replicate the convenience of public cloud PaaS (like EKS), the team implemented Cluster API to manage the Kubernetes lifecycle.
- Cluster API treats Kubernetes clusters themselves as resources within a management cluster, allowing for standardized and rapid deployment across the private environment (see the second sketch below).
- This approach ensures that developers can deploy applications without needing to distinguish between the underlying cloud providers, fulfilling the goal of "cloud-agnostic" infrastructure.

### Practical Recommendation

For organizations dealing with massive technical debt or high public cloud costs, the Toss Payments model suggests that a "private-first" hybrid approach is viable even with limited headcount. The key is to avoid proprietary black-box solutions and instead invest in the technical internalization of open-source tools like OpenStack and Cluster API, backed by an infrastructure-as-code philosophy to ensure scalability and reliability.
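The article itself automates VM and load-balancer lifecycles with Ansible and Terraform; as a rough illustration of the same provisioning step, here is a minimal Python sketch using the openstacksdk client. The cloud profile, image, flavor, network, and server names are assumptions for the example, not Toss's actual configuration.

```python
import openstack

# Connect using a clouds.yaml profile; "private-cloud-1" is an illustrative name.
conn = openstack.connect(cloud="private-cloud-1")

# Look up a pre-baked image, flavor, and service network (placeholder names).
image = conn.compute.find_image("ubuntu-22.04-base")
flavor = conn.compute.find_flavor("m1.medium")
network = conn.network.find_network("service-net")

# Boot a VM and wait until Nova reports it as ACTIVE.
server = conn.compute.create_server(
    name="payment-api-01",
    image_id=image.id,
    flavor_id=flavor.id,
    networks=[{"uuid": network.id}],
)
server = conn.compute.wait_for_server(server)
print(server.status)
```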
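For the Cluster API side, the second sketch below registers a `Cluster` custom resource in a management cluster through the official Kubernetes Python client, which is the sense in which "clusters themselves are resources." The resource names, namespace, and provider `apiVersion` values are assumptions that depend on the installed Cluster API and OpenStack provider versions; this is not the article's actual manifest.

```python
from kubernetes import client, config

config.load_kube_config()  # credentials for the management cluster
api = client.CustomObjectsApi()

# Minimal Cluster API object referencing an OpenStack infrastructure provider.
# Names and apiVersions are illustrative and depend on the installed provider versions.
cluster = {
    "apiVersion": "cluster.x-k8s.io/v1beta1",
    "kind": "Cluster",
    "metadata": {"name": "workload-1", "namespace": "default"},
    "spec": {
        "clusterNetwork": {"pods": {"cidrBlocks": ["192.168.0.0/16"]}},
        "controlPlaneRef": {
            "apiVersion": "controlplane.cluster.x-k8s.io/v1beta1",
            "kind": "KubeadmControlPlane",
            "name": "workload-1-control-plane",
        },
        "infrastructureRef": {
            "apiVersion": "infrastructure.cluster.x-k8s.io/v1alpha7",
            "kind": "OpenStackCluster",
            "name": "workload-1",
        },
    },
}

# Creating the custom object asks the management cluster to provision a new workload cluster.
api.create_namespaced_custom_object(
    group="cluster.x-k8s.io",
    version="v1beta1",
    namespace="default",
    plural="clusters",
    body=cluster,
)
```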

line

Replacing the Payment System DB Handling

The LINE Billing Platform successfully migrated its large-scale payment database from Nbase-T to Vitess to handle high-traffic global transactions. While initially exploring gRPC for its performance reputation, the team transitioned to the MySQL protocol to ensure stability and reduce CPU overhead within their Java-based environment. This implementation demonstrates how Vitess can manage complex sharding requirements while maintaining high availability through automated recovery tools.

### Protocol Selection and Implementation

- The team initially attempted to use the gRPC protocol but encountered `http2: frame too large` errors and significant CPU overhead during performance testing.
- Manual mapping of query results to Java objects proved cumbersome with the Vitess gRPC client, leading to a shift toward the more mature and recommended MySQL protocol.
- Using the MySQL protocol allowed the team to leverage standard database drivers while benefiting from Vitess's routing capabilities via VTGate.

### Keyspace Architecture and Data Routing

- The system utilizes a dual-keyspace strategy: a "Global Keyspace" for unsharded metadata and a "Service Keyspace" for sharded transaction data.
- The Global Keyspace manages sharding keys using a "sequence" table type to ensure unique, auto-incrementing identifiers across the platform.
- The Service Keyspace is partitioned into N shards using a hash-based Vindex, which distributes coin balances and transaction history (a VSchema sketch follows this summary).
- VTGate automatically routes queries to the correct shard by analyzing the sharding key in the `WHERE` clause or `INSERT` statement, minimizing cross-shard overhead.

### MySQL Compatibility and Transaction Logic

- Vitess maintains `REPEATABLE READ` isolation for single-shard transactions, while multi-shard transactions default to `READ COMMITTED`.
- Advanced features like Two-Phase Commit (2PC) are available for handling distributed transactions across multiple shards.
- Query execution plans are analyzed using `VEXPLAIN` and `VTEXPLAIN`, often managed through the VTAdmin web interface for better visibility.
- Certain limitations apply, such as temporary tables only being supported in unsharded keyspaces and specific unsupported SQL cases documented in the Vitess core.

### Automated Operations and Monitoring

- The team employs VTOrc (based on Orchestrator) to automatically detect and repair database failures, such as unreachable primaries or replication stops.
- Monitoring is centralized via Prometheus, which scrapes metrics from VTOrc, VTGate, and VTTablet components at dedicated ports (e.g., 16000).
- Real-time alerts are routed through Slack and email, using `tablet_alias` to specifically identify which MySQL node or VTTablet is experiencing issues.
- A web-based recovery dashboard provides a history of automated fixes, allowing operators to track the health of the cluster over time.

For organizations migrating high-traffic legacy systems to a cloud-native sharding solution, prioritizing the MySQL protocol over gRPC is recommended for better compatibility with existing application frameworks and reduced operational complexity.
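To make the dual-keyspace layout more concrete, here is a minimal VSchema sketch for the sharded service keyspace, expressed as a Python dict. The table, column, and sequence names are hypothetical rather than LINE's actual schema; in a real deployment the JSON would be applied to the keyspace with `vtctldclient ApplyVSchema`.

```python
import json

# Hypothetical VSchema for the sharded "service" keyspace. The hash vindex on user_id
# lets VTGate route single-shard queries, and the auto_increment block pulls IDs from a
# sequence table defined (with type "sequence") in the unsharded global keyspace.
service_vschema = {
    "sharded": True,
    "vindexes": {
        "hash": {"type": "hash"},
    },
    "tables": {
        "coin_transaction": {
            "column_vindexes": [
                {"column": "user_id", "name": "hash"},
            ],
            "auto_increment": {
                "column": "transaction_id",
                "sequence": "coin_transaction_seq",
            },
        },
    },
}

print(json.dumps(service_vschema, indent=2))
```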

line

Milvus: Building a

LINE VOOM transitioned its recommendation system from a batch-based offline process to a real-time infrastructure to solve critical content freshness issues. By adopting Milvus, an open-source vector database, the team enabled the immediate indexing and searching of new video content as soon as it is uploaded. This implementation ensures that time-sensitive posts are recommended to users without the previous 24-hour delay, significantly enhancing user engagement.

### Limitations of the Legacy Recommendation System

* The original system relied on daily offline batch processing for embedding generation and similarity searches.
* New content, such as holiday greetings or trending sports clips, suffered from a "lack of immediacy," often taking up to a full day to appear in user feeds.
* To improve user experience, the team needed to shift from offline candidate pools to an online system capable of real-time Approximate Nearest Neighbor (ANN) searches.

### Selecting Milvus as the Vector Database

* The team evaluated Milvus and Qdrant based on performance, open-source status, and on-premise compatibility.
* Milvus was selected due to its superior performance, handling 2,406 requests per second compared to Qdrant's 326, with lower query latency (1 ms vs. 4 ms).
* Key architectural advantages of Milvus included the separation of storage and computing, support for both stream and batch inserts, and a diverse range of supported in-memory index types.

### Reliability Verification via Chaos Testing

* Given the complexity of Milvus clusters, the team performed chaos testing by intentionally injecting failures like pod kills and scaling events.
* Tests revealed critical vulnerabilities: killing the `Querycoord` led to collection release and search failure, while losing the `Etcd` quorum caused total metadata loss.
* These findings highlighted the need for robust high-availability (HA) configurations to prevent service interruptions during component failures.

### High Availability (HA) Implementation Strategies

* **Collection-Level HA:** To prevent search failures during coordinator issues, the team implemented a dual-writing system where embeddings are recorded in two separate collections simultaneously.
* **Alias Switching:** Client applications use an "alias" to reference collections; if the primary collection becomes unavailable, the system instantly switches the alias to the backup collection to minimize downtime (a client-side sketch follows this summary).
* **Coordinator-Level HA:** To eliminate single points of failure, coordinators (such as `Indexcoord`) were configured in an Active-Standby mode, ensuring a backup is always ready to take over management tasks.

To successfully deploy a large-scale real-time recommendation engine, it is critical to select a vector database that decouples storage from compute and to implement multi-layered high-availability strategies, such as dual-collection writing and active-standby coordinators, to ensure production stability.
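The dual-collection write and alias switch can be sketched with the pymilvus client as below. The host, collection names, and field layout (an int64 primary key plus a float-vector field) are assumptions for illustration; the real ingestion pipeline presumably performs the dual write server-side of the recommendation service.

```python
from pymilvus import Collection, connections, utility

connections.connect(alias="default", host="milvus.internal", port="19530")

PRIMARY, BACKUP, ALIAS = "video_embeddings_a", "video_embeddings_b", "video_embeddings"

# One-time setup: point the read alias at the primary collection.
utility.create_alias(collection_name=PRIMARY, alias=ALIAS)

def insert_embedding(post_id: int, vector: list[float]) -> None:
    """Dual write: record every new embedding in both collections."""
    for name in (PRIMARY, BACKUP):
        # Column-based insert matching the assumed (id, embedding) schema.
        Collection(name).insert([[post_id], [vector]])

def fail_over_to_backup() -> None:
    """Repoint the alias to the backup collection; readers keep using ALIAS unchanged."""
    utility.alter_alias(collection_name=BACKUP, alias=ALIAS)
```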

line

Replacing the Database of a Payment System

The LINE Billing Platform team recently migrated its core payment database from Nbase-T to Vitess to address rising licensing costs while maintaining the high availability required for financial transactions. After a rigorous Proof of Concept (PoC) evaluating Apache ShardingSphere, TiDB, and Vitess, the team selected Vitess for its mature sharding capabilities and its ability to provide a stable, scalable environment on bare-metal infrastructure. This migration ensures the platform can handle large-scale traffic efficiently without the financial burden of proprietary license fees.

### Evaluation of Alternative Sharding Solutions

Before settling on Vitess, the team analyzed other prominent distributed database technologies to determine their fit for a high-stakes payment system:

* **Apache ShardingSphere:** While it offers flexible Proxy and JDBC layers, it was excluded because it requires significant manual effort for data resharding and rebalancing. The management overhead for implementing shard-key logic across various components (API, batch, admin) was deemed too high.
* **TiDB:** This MySQL-compatible distributed database uses a decoupled architecture consisting of TiDB (SQL layer), PD (metadata management), and TiKV (row-based storage). Its primary advantage is automatic rebalancing and the lack of a required shard key, which significantly reduces DBA operational costs.
* **Nbase-T:** The legacy system provided the highest performance efficiency per resource unit; however, the shift from a free to a paid licensing model necessitated the move to an open-source alternative.

### Vitess Architecture and Core Components

Vitess was chosen for its proven track record at companies like YouTube and GitHub, offering a robust abstraction layer that makes a clustered database appear as a single instance to the application. The system relies on several specialized components:

* **VTGate:** A proxy server that routes queries to the correct VTTablet, manages distributed transactions, and hides the physical topology of the database from the application (a client-side sketch follows this summary).
* **VTTablet:** A sidecar process running alongside each MySQL instance that manages query execution, data replication, and connection pooling.
* **VTOrc and Topology Server:** High availability is managed by VTOrc (an automated failover tool), while metadata regarding shard locations and node status is synchronized via a topology server using ZooKeeper or etcd.

### PoC Performance and Environment Setup

The team conducted performance testing by simulating real payment API scenarios (a mix of reads and writes) on standardized hardware (8 vCPU, 16 GB RAM).

* **Comparison Metrics:** The tests focused on Transactions Per Second (TPS) and resource utilization as thread counts increased.
* **Infrastructure Strategy:** Because payment systems cannot tolerate even brief failover delays, the team opted for a bare-metal deployment rather than a containerized one to ensure maximum stability and performance.
* **Resource Efficiency:** While Nbase-T showed the best raw efficiency, Vitess demonstrated the necessary scalability and management features required to replace the legacy system effectively within the new cost constraints.

### Practical Recommendation

For organizations managing critical core systems that require horizontal scaling without proprietary lock-in, Vitess is a highly recommended solution. While it requires a deep understanding of its various components (like VTGate and VTTablet) and careful configuration of its topology server, the trade-off is a mature, cloud-native-ready architecture that supports massive scale and automated failover on both bare-metal and cloud environments.
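Because VTGate speaks the MySQL protocol, the application connects to the whole sharded cluster as if it were a single MySQL server. The sketch below illustrates that idea with a Python client (the team's actual stack is Java); the host, port, keyspace, table, and credential values are assumptions for the example.

```python
import pymysql

# Connect to VTGate exactly as to a plain MySQL server; the keyspace acts as the database.
conn = pymysql.connect(
    host="vtgate.internal",
    port=15306,              # commonly used for VTGate's MySQL listener; deployment-specific
    user="billing_app",
    password="example-password",
    database="service_keyspace",
)

try:
    with conn.cursor() as cur:
        # VTGate uses the sharding key (user_id) to route this INSERT to a single shard.
        cur.execute(
            "INSERT INTO coin_transaction (user_id, amount, currency) VALUES (%s, %s, %s)",
            (31337, 500, "JPY"),
        )
        conn.commit()

        # A WHERE clause on the sharding key keeps the read single-shard as well.
        cur.execute(
            "SELECT amount, currency FROM coin_transaction WHERE user_id = %s",
            (31337,),
        )
        print(cur.fetchall())
finally:
    conn.close()
```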