hybrid-cloud

2 posts

toss

From Perimeter Security to Zero

Toss Payments transformed its security infrastructure from a vulnerable, single-layered legacy system into a robust "Defense in Depth" architecture spanning hybrid IDC and AWS environments. By integrating advanced perimeter defense, internal server monitoring, and container runtime security, the team established a comprehensive framework that prioritizes visibility and continuous verification. This four-year journey demonstrates that modern security requires moving beyond simple boundary protection toward a proactive, multi-layered strategy that assumes breaches can occur.

### Perimeter Defense and SSL/TLS Visibility

* Addressed the critical visibility gap in legacy systems by implementing dedicated SSL/TLS decryption tools, allowing the team to analyze encrypted traffic for hidden malicious payloads.
* Established a hybrid security architecture using a combination of physical DDoS protection, IPS, and WAF in IDC environments, complemented by AWS WAF and AI-based GuardDuty in the cloud.
* Developed a collaborative merchant response process that moves beyond simple IP blocking; the system automatically detects malicious traffic from partners and provides them with detailed vulnerability reports and remediation guides (e.g., specific SQL injection points).

### Internal Network Security and "Assume Breach" Monitoring

* Implemented **Wazuh**, an open-source security platform, in IDC environments to monitor lateral movement, collect centralized logs, and perform file integrity checks across diverse operating systems.
* Leveraged **AWS GuardDuty** for intelligent threat detection in the cloud, focusing on malware scanning for EC2 instances and monitoring for suspicious process activities.
* Established automated detection for privilege escalation and unauthorized access to sensitive system files, such as tracking instances where root privileges are obtained to modify the `/etc/passwd` file.
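The file integrity checks mentioned above (for example, noticing that `/etc/passwd` was modified) can be illustrated with a minimal sketch. This is not how Wazuh is implemented or configured; it is a toy baseline-and-compare check, and the function names `file_digest`, `snapshot`, and `changed_files` are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def snapshot(paths):
    """Record a digest for every watched path that currently exists."""
    return {str(p): file_digest(Path(p)) for p in paths if Path(p).is_file()}


def changed_files(baseline, paths):
    """Return watched paths whose state differs from the baseline.

    A path counts as changed if the file was modified, deleted,
    or created since the baseline was taken.
    """
    current = snapshot(paths)
    watched = set(baseline) | set(current)
    return sorted(p for p in watched if baseline.get(p) != current.get(p))
```

In practice one would take `snapshot(["/etc/passwd"])` once on a known-good host, then call `changed_files` on a schedule and raise an alert for every path it returns. A real agent would also need to record who made the change and with which privileges, which this sketch does not attempt.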
### Container Runtime Security as the Final Defense

* Adopted **Falco**, a CNCF-hosted runtime security tool, to protect Kubernetes environments by monitoring system calls (syscalls) in real-time.
* Configured specific security rules to detect "container escape" attempts, unauthorized access to sensitive files like `/etc/shadow`, and the execution of new or suspicious binaries within running containers.
* Integrated **Falcosidekick** to manage security events efficiently, ensuring that anomalous behaviors at the container level are instantly routed to the security team for response.

### Zero Trust and Continuous Verification

* Shifted toward a Zero Trust model for the internal work network to ensure that all users and devices are continuously verified regardless of their location.
* Focused on implementing dynamic access control and the principle of least privilege to minimize the potential impact of credential theft or device compromise.

Organizations operating in hybrid cloud environments should move away from relying on a single perimeter and instead adopt a multi-layered defense strategy. True security resilience is achieved by gaining deep visibility into encrypted traffic and maintaining granular monitoring at the server and container levels to intercept threats that inevitably bypass initial defenses.
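The Zero Trust ideas above (continuous verification, dynamic access control, least privilege) can be made concrete with a small sketch. Everything here is an assumption for illustration: the `AccessRequest` fields, the `GRANTS` table, and the `is_allowed` function are hypothetical and do not describe the system Toss Payments built.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRequest:
    user: str
    mfa_verified: bool      # identity verified for this session
    device_compliant: bool  # device posture check passed (assumed signal)
    resource: str
    action: str


# Hypothetical least-privilege grants: each user holds only the
# (resource, action) pairs they need; anything not listed is denied.
GRANTS = {
    "alice": {("payments-db", "read")},
    "bob": {("payments-db", "read"), ("payments-db", "write")},
}


def is_allowed(req: AccessRequest) -> bool:
    """Evaluate each request on its own; network location is never an input."""
    if not req.mfa_verified:
        return False
    if not req.device_compliant:
        return False
    return (req.resource, req.action) in GRANTS.get(req.user, set())
```

The design choice worth noting is the default: a request is denied unless identity, device posture, and an explicit grant all check out, so a stolen credential on a non-compliant device gains nothing.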

toss

The story of how I destroyed

Toss Payments modernized its inherited legacy infrastructure by building an OpenStack-based private cloud to operate alongside public cloud providers in an Active-Active hybrid configuration. By overcoming extreme technical debt, including servers burdened with nearly 2,000 manual routing entries, the team achieved a cloud-agnostic deployment environment that ensures high availability and cost efficiency. The transformation demonstrates how a small team can successfully implement complex open-source infrastructure through automation and the rigorous technical internalization of Cluster API and OpenStack.

### The Challenge of Legacy Networking

- The inherited infrastructure relied on server-side routing rather than network equipment, meaning every server carried its own routing table.
- Some legacy servers contained 1,997 individual routing entries, making manual management nearly impossible and preventing efficient scaling.
- Initial attempts to solve this via public cloud (AWS) faced limitations, including rising costs due to exchange rates, lack of deep visibility for troubleshooting, and difficulties in disaster recovery (DR) configuration between public and on-premise environments.

### Scaling OpenStack with a Two-Person Team

- Despite having only two engineers with no prior OpenStack experience, the team chose the open-source platform to maintain 100% control over the infrastructure.
- The team internalized the technology by installing three different versions of OpenStack dozens of times and simulating various failure scenarios.
- Automation was prioritized using Ansible and Terraform to manage the lifecycle of VMs and load balancers, enabling new instance creation in under 10 seconds.
- Deep technical tuning was applied, such as modifying the source code of the Octavia load balancer to output custom log formats required for their specific monitoring needs.
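The lifecycle automation described above rests on a declarative idea: state what should exist, compare it with what does exist, and apply only the difference. The sketch below illustrates that comparison step in plain Python. It is not Terraform's or Ansible's implementation, and the `plan` function and the resource names in the usage note are hypothetical.

```python
def plan(desired: dict, actual: dict) -> dict:
    """Compare desired and actual resources and list the changes needed.

    Both arguments map a resource name to its settings (a plain dict).
    """
    return {
        "create": sorted(desired.keys() - actual.keys()),
        "delete": sorted(actual.keys() - desired.keys()),
        "update": sorted(
            name
            for name in desired.keys() & actual.keys()
            if desired[name] != actual[name]
        ),
    }
```

For example, if the desired state lists `web-1`, `web-2`, and `lb-1` while the running environment has `web-1` (with different settings), `lb-1`, and a leftover `old-vm`, the plan creates `web-2`, deletes `old-vm`, and updates `web-1`. Running the same plan against an environment that already matches produces no changes, which is what makes repeated runs safe.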
### High Availability and Monitoring Strategy

- To ensure reliability, the team built three independent OpenStack clusters operating in an Active-Active configuration.
- This architecture allows for immediate traffic redirection if a specific cluster fails, minimizing the impact on service availability.
- A comprehensive monitoring stack was implemented using Zabbix, Prometheus, Mimir, and Grafana to collect and visualize every essential metric across the private cloud.

### Managing Kubernetes with Cluster API

- To replicate the convenience of Public Cloud PaaS (like EKS), the team implemented Cluster API to manage the Kubernetes lifecycle.
- Cluster API treats Kubernetes clusters themselves as resources within a management cluster, allowing for standardized and rapid deployment across the private environment.
- This approach ensures that developers can deploy applications without needing to distinguish between the underlying cloud providers, fulfilling the goal of "cloud-agnostic" infrastructure.

### Practical Recommendation

For organizations dealing with massive technical debt or high public cloud costs, the Toss Payments model suggests that a "Private-First" hybrid approach is viable even with limited headcount. The key is to avoid proprietary black-box solutions and instead invest in the technical internalization of open-source tools like OpenStack and Cluster API, backed by a "code-as-infrastructure" philosophy to ensure scalability and reliability.
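The Active-Active behavior described above, where traffic is redirected as soon as one of the three clusters fails, can be sketched as a router that rotates across clusters and skips any marked unhealthy. This is a simplified illustration under assumed names (`ActiveActiveRouter`, `mark_down`, `mark_up`, `pick`); the summary does not say how the team actually detects failures or steers traffic.

```python
import itertools


class ActiveActiveRouter:
    """Round-robin across clusters, skipping any marked unhealthy."""

    def __init__(self, clusters):
        self.clusters = list(clusters)
        self.healthy = set(self.clusters)
        self._cycle = itertools.cycle(self.clusters)

    def mark_down(self, cluster):
        """Stop sending traffic to a failed cluster."""
        self.healthy.discard(cluster)

    def mark_up(self, cluster):
        """Resume sending traffic to a recovered cluster."""
        if cluster in self.clusters:
            self.healthy.add(cluster)

    def pick(self):
        """Return the next healthy cluster, or raise if none are left."""
        for _ in range(len(self.clusters)):
            candidate = next(self._cycle)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy cluster available")
```

Because all clusters serve traffic in normal operation, losing one only shrinks the rotation; there is no standby to promote, which is the main practical difference from an Active-Standby setup.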