jenkins

2 posts

toss

Managing Thousands of API/Batch Servers

Toss Payments manages thousands of API and batch server configurations that handle trillions of won in transactions, where a single typo in a JVM setting can lead to a massive financial infrastructure failure. To mitigate the risks of manual "copy-paste" workflows and configuration duplication, the team developed a system that treats configuration as code. By combining a layered architecture with dynamic templates, they created a testable, unified environment capable of managing complex hybrid cloud setups with minimal human error.

## Overlay Architecture for Hierarchical Control

* The team implemented a layered configuration system consisting of `global`, `cluster`, `phase`, and `application` levels.
* Settings are resolved by priority: lower-level layers override higher-level defaults, so servers inherit common settings while keeping specific overrides (a Python sketch of this merge-and-substitute flow follows this summary).
* This structure lets the team control environment-specific behavior, such as disabling canary deployments in development environments, from a single centralized directory.
* The directory structure maps files 1:1 to their respective layers, so naming conventions drive the CI/CD application process.

## Solving Duplication with Template Patterns

* Standard YAML overlays often fail for long strings or arrays, such as `JVM_OPTION`, because changing a single value usually requires redefining the entire block.
* To prevent the proliferation of nearly identical environment variables, the team introduced a template pattern using placeholders like `{{MAX_HEAP}}`.
* Developers can modify specific parameters at the application layer while the core string remains defined at the global layer, significantly reducing the risk of typos.
* This approach keeps critical settings, like G1GC parameters or heap region sizes, consistent across the infrastructure unless explicitly changed.

## Dynamic and Conditional Configuration Logic

* The system allows "evolutionary" configurations in which Python scripts can be injected to generate dynamic values, such as random JMX ports or data fetched from remote APIs (see the conditional-logic sketch after this summary).
* Conditional logic handles complex deployment scenarios, letting environment variables change automatically based on the target cluster name (e.g., different profiles for AWS vs. IDC).
* By treating configuration as a living codebase, the team can adapt to new infrastructure requirements without abandoning their core architectural principles.

## Reliable Batch Processing through Simplicity

* For batch operations handling massive settlement volumes, the team prioritized "appropriate technology" and simplicity to minimize failure points.
* They chose Jenkins for its low learning curve and reliability, despite its lack of native GitOps support.
* To eliminate inconsistencies from manual UI entries and varying Java versions across machines, they standardized the batch infrastructure so that high-stakes financial calculations run in a controlled, predictable environment.

The most effective way to manage large-scale infrastructure is to move from static, duplicated configuration files to a dynamic, code-centric system. By combining an overlay architecture for hierarchy with a template pattern for granular changes, organizations can achieve the flexibility needed for hybrid clouds while maintaining the strict safety standards financial systems require.
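To make the overlay and template ideas concrete, here is a minimal Python sketch of the merge-and-substitute flow. The layer names mirror the post (`global`, `cluster`, `phase`, `application`), but the merge order, the `{{MAX_HEAP}}`/`{{REGION_SIZE}}` placeholders, and all concrete values are illustrative assumptions, not Toss's actual implementation.

```python
import re

# Layers listed from lowest to highest priority; later layers override
# earlier ones, so `application` wins over `global` for the same key.
# All values here are illustrative, not Toss's real settings.
LAYERS = {
    "global": {
        # The long JVM string lives once, at the global layer, with
        # placeholders for the values that legitimately vary.
        "JVM_OPTION": "-Xmx{{MAX_HEAP}} -XX:+UseG1GC "
                      "-XX:G1HeapRegionSize={{REGION_SIZE}}",
        "MAX_HEAP": "4g",
        "REGION_SIZE": "8m",
        "CANARY_ENABLED": "true",
    },
    "cluster": {},
    "phase": {
        # e.g., a dev phase turns canary deployments off.
        "CANARY_ENABLED": "false",
    },
    "application": {
        # An application overrides only the one parameter it needs.
        "MAX_HEAP": "8g",
    },
}

PRIORITY = ["global", "cluster", "phase", "application"]
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")

def resolve(layers: dict) -> dict:
    """Merge layers by priority, then expand {{NAME}} placeholders."""
    merged: dict = {}
    for name in PRIORITY:
        merged.update(layers.get(name, {}))
    # Expand templates against the merged result so an application-level
    # MAX_HEAP flows into the globally defined JVM_OPTION string.
    return {
        key: PLACEHOLDER.sub(lambda m: merged[m.group(1)], value)
        for key, value in merged.items()
    }

if __name__ == "__main__":
    env = resolve(LAYERS)
    print(env["JVM_OPTION"])      # -Xmx8g -XX:+UseG1GC -XX:G1HeapRegionSize=8m
    print(env["CANARY_ENABLED"])  # false
```

The key property is that the long `JVM_OPTION` string is defined exactly once; an application that needs more heap overrides one small value rather than re-pasting the whole flag list, which is precisely the typo risk the template pattern removes.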
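The conditional-logic sketch below illustrates the dynamic side: a JMX port generated at render time and a profile switch keyed on the target cluster's name. The function names, the `aws-` prefix convention, and the `SPRING_PROFILES_ACTIVE` output are assumptions for illustration, not details from the post.

```python
import random

def random_jmx_port(low: int = 10000, high: int = 20000) -> int:
    """Stand-in for an injected script that picks a JMX port at render time."""
    return random.randint(low, high)

def profile_for_cluster(cluster_name: str) -> str:
    """Derive an environment profile from the target cluster's name.

    Assumes a naming convention where AWS clusters are prefixed 'aws-';
    everything else is treated as on-premise IDC.
    """
    return "aws" if cluster_name.startswith("aws-") else "idc"

def render(cluster_name: str) -> dict:
    # Values are computed when the configuration is rendered for a target,
    # so one source tree serves every cluster without duplicated files.
    return {
        "JMX_PORT": str(random_jmx_port()),
        "SPRING_PROFILES_ACTIVE": profile_for_cluster(cluster_name),
    }

if __name__ == "__main__":
    print(render("aws-payment-prod"))  # e.g. {'JMX_PORT': '13472', 'SPRING_PROFILES_ACTIVE': 'aws'}
    print(render("idc-batch-prod"))    # profile resolves to 'idc'
```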

toss

Enhancing Data Literacy for

Toss's Business Data Team addressed the lack of centralized insight into their business customer (BC) base by building a standardized Single Source of Truth (SSOT) data mart and an iterative Monthly BC Report. This initiative unified fragmented data across business units like Shopping, Ads, and Pay, enabling consistent data-driven decision-making and significantly raising the organization's overall data literacy.

## Establishing a Single Source of Truth (SSOT)

- Addressed the inefficiency of fragmented data across departments by integrating disparate datasets into a unified, enterprise-wide data mart.
- Standardized the definition of an "active" Business Customer through cross-functional communication and a deep understanding of how revenue and costs are generated in each service domain.
- Eliminated communication overhead by ensuring all stakeholders used a single, verified dataset rather than conflicting numbers from different business silos.

## Designing the Monthly BC Report for Actionable Insights

- Visualized monthly revenue trends by segmenting customers into tiers and categories, such as New, Churn, and Retained, to identify where growth or attrition was occurring (a sketch of this segmentation follows this summary).
- Implemented cohort retention metrics by business unit to measure platform stickiness and help teams understand which services retained business users most effectively.
- Provided granular raw-data lists of high-revenue customers showing significant growth or churn, allowing operational teams to identify immediate action points.
- Refined reporting metrics through in-depth interviews with Product Owners (POs), Sales Leaders, and Domain Heads to ensure the data addressed real-world business questions.

## Technical Architecture and Validation

- Built the core SSOT data mart using Airflow for scalable data orchestration and workflow management (see the pipeline sketch after this summary).
- Leveraged Jenkins for batch processing and deployment of the data layers required by the reporting environment.
- Integrated Tableau with SQL-based fact aggregations to automate the monthly refresh of charts and dashboards, keeping the report a "living" document.
- Conducted "collective intelligence" verification meetings to check metric definitions, units, and visual clarity, ensuring the final report was intuitive for all users.

## Driving Organizational Change and Data Literacy

- Sparked a surge in data demand, leading to follow-up projects such as daily real-time tracking, Cross-Domain Activation analysis, and deeper funnel analysis for BC registrations.
- Shifted the organizational culture from passive data consumption to active utilization, with diverse roles, including Strategy Managers and Business Marketers, now using BC data to demonstrate their business impact.
- Maintained an iterative approach in which the report format evolves every month based on stakeholder feedback, keeping the data relevant to the shifting needs of the business.

Establishing a centralized data culture requires more than technical infrastructure; it requires a commitment to iterative feedback and clear communication. By moving from fragmented silos to a unified reporting standard, data analysts can transform from simple "number providers" into strategic partners who drive company-wide literacy and growth.
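As a rough illustration of how such a mart refresh might be orchestrated, here is a minimal Airflow DAG sketch (assuming Airflow 2.4+ for the `schedule` argument). The DAG id, task names, and function bodies are hypothetical; the post states only that Airflow orchestrates the mart and Jenkins handles batch deployment, not how the jobs are written.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_bc_activity(**_):
    # Hypothetical: pull per-domain BC activity (Shopping, Ads, Pay)
    # from the warehouse into a staging area.
    ...

def build_ssot_mart(**_):
    # Hypothetical: apply the shared "active BC" definition and merge
    # the staged domain tables into the unified SSOT mart.
    ...

def refresh_report_facts(**_):
    # Hypothetical: rebuild the SQL fact aggregations that the
    # Tableau-based Monthly BC Report reads from.
    ...

with DAG(
    dag_id="bc_ssot_monthly",  # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule="@monthly",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_bc_activity",
                             python_callable=extract_bc_activity)
    build = PythonOperator(task_id="build_ssot_mart",
                           python_callable=build_ssot_mart)
    facts = PythonOperator(task_id="refresh_report_facts",
                           python_callable=refresh_report_facts)

    extract >> build >> facts  # the mart must exist before facts are rebuilt
```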
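And a minimal sketch of the New/Churn/Retained segmentation described above, using pandas on a toy dataset. The column names and the month-over-month comparison rule are assumptions; the post does not spell out the exact metric definitions.

```python
import pandas as pd

# Toy monthly activity: one row per (month, business customer).
activity = pd.DataFrame({
    "month":   ["2024-01", "2024-01", "2024-02", "2024-02", "2024-02"],
    "bc_id":   ["A",       "B",       "A",       "C",       "D"],
    "revenue": [100,        80,        120,       50,        30],
})

def segment(activity: pd.DataFrame, month: str, prev_month: str) -> pd.DataFrame:
    """Label each BC as New / Retained / Churn for `month`.

    Rule assumed here: active in both months -> Retained; active only in
    the current month -> New; active only in the previous month -> Churn.
    """
    current = set(activity.loc[activity["month"] == month, "bc_id"])
    previous = set(activity.loc[activity["month"] == prev_month, "bc_id"])
    rows = (
        [(bc, "Retained") for bc in sorted(current & previous)]
        + [(bc, "New") for bc in sorted(current - previous)]
        + [(bc, "Churn") for bc in sorted(previous - current)]
    )
    return pd.DataFrame(rows, columns=["bc_id", "segment"])

print(segment(activity, "2024-02", "2024-01"))
#   bc_id   segment
# 0     A  Retained
# 1     C       New
# 2     D       New
# 3     B     Churn
```

In the real report these labels would feed the per-segment revenue trend charts; the point here is only the month-over-month set logic behind the tiers.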