jvm

2 posts

toss

Managing Thousands of API/Batch Servers

Toss Payments manages thousands of API and batch server configurations that handle trillions of won in transactions, where a single typo in a JVM setting can lead to massive financial infrastructure failure. To solve the risks associated with manual "copy-paste" workflows and configuration duplication, the team developed a sophisticated system that treats configuration as code. By implementing layered architectures and dynamic templates, they created a testable, unified environment capable of managing complex hybrid cloud setups with minimal human error.

## Overlay Architecture for Hierarchical Control

* The team implemented a layered configuration system consisting of `global`, `cluster`, `phase`, and `application` levels.
* Settings are resolved by priority: more specific layers such as `application` override defaults from more general layers such as `global`, allowing servers to inherit common settings while maintaining specific overrides.
* This structure allows the team to control environment-specific behaviors, such as disabling canary deployments in development environments, from a single centralized directory.
* The directory structure maps files 1:1 to their respective layers, ensuring that naming conventions drive the CI/CD application process.

## Solving Duplication with Template Patterns

* Standard YAML overlays often fail when dealing with long strings or arrays, such as `JVM_OPTION`, because changing a single value usually requires redefining the entire block.
* To prevent the proliferation of nearly identical environment variables, the team introduced a template pattern using placeholders like `{{MAX_HEAP}}`.
* Developers can modify specific parameters at the application layer while the core string remains defined at the global layer, significantly reducing the risk of typos.
* This approach ensures that critical settings, like G1GC parameters or heap region sizes, remain consistent across the infrastructure unless explicitly changed.
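
As a rough illustration of how the overlay layers and the `{{MAX_HEAP}}` placeholder could work together, the Java sketch below merges layers in priority order and then fills placeholders in the merged `JVM_OPTION` string. The class `LayeredConfig`, its methods, the `REGION_SIZE` key, and all sample values are hypothetical; the summary does not show the team's actual implementation, which appears to be built on YAML files.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: merge config layers, then fill {{NAME}} placeholders. */
final class LayeredConfig {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{\\{(\\w+)\\}\\}");

    /** Later layers win, so pass them from most general to most specific. */
    static Map<String, String> merge(List<Map<String, String>> layers) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> layer : layers) {
            merged.putAll(layer);
        }
        return merged;
    }

    /** Replaces each {{NAME}} with values.get(NAME); fails loudly if a value is missing. */
    static String render(String template, Map<String, String> values) {
        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuilder out = new StringBuilder();
        while (matcher.find()) {
            String value = values.get(matcher.group(1));
            if (value == null) {
                throw new IllegalArgumentException("No value for placeholder " + matcher.group(1));
            }
            matcher.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The long option string is defined once, at the most general layer.
        Map<String, String> global = Map.of(
            "JVM_OPTION", "-Xmx{{MAX_HEAP}} -XX:+UseG1GC -XX:G1HeapRegionSize={{REGION_SIZE}}",
            "MAX_HEAP", "2g",
            "REGION_SIZE", "8m");
        Map<String, String> cluster = Map.of();
        Map<String, String> phase = Map.of();
        // The application layer overrides only the one value it cares about.
        Map<String, String> application = Map.of("MAX_HEAP", "4g");

        Map<String, String> merged = merge(List.of(global, cluster, phase, application));
        // prints: -Xmx4g -XX:+UseG1GC -XX:G1HeapRegionSize=8m
        System.out.println(render(merged.get("JVM_OPTION"), merged));
    }
}
```

Throwing on a missing placeholder value is a deliberate choice in this sketch: an unfilled `{{...}}` token reaching a JVM command line is exactly the kind of silent typo the template pattern is meant to prevent.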

## Dynamic and Conditional Configuration Logic

* The system allows for "evolutionary" configurations where Python scripts can be injected to generate dynamic values, such as random JMX ports or data fetched from remote APIs.
* Advanced conditional logic was added to handle complex deployment scenarios, enabling environment variables to change their values automatically based on the target cluster name (e.g., different profiles for AWS vs. IDC).
* By treating configuration as a living codebase, the team can adapt to new infrastructure requirements without abandoning their core architectural principles.

## Reliable Batch Processing through Simplicity

* For batch operations handling massive settlement volumes, the team prioritized "appropriate technology" and simplicity to minimize failure points.
* They chose Jenkins for its low learning curve and reliability, despite its lack of native GitOps support.
* To address inconsistencies in manual UI entries and varying Java versions across machines, they standardized the batch infrastructure to ensure that high-stakes financial calculations are executed in a controlled, predictable environment.

The most effective way to manage large-scale infrastructure is to transition from static, duplicated configuration files to a dynamic, code-centric system. By combining an overlay architecture for hierarchy and a template pattern for granular changes, organizations can achieve the flexibility needed for hybrid clouds while maintaining the strict safety standards required for financial systems.
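
The cluster-based conditional values described under "Dynamic and Conditional Configuration Logic" can be pictured with a small sketch. The summary says the team injects Python scripts for dynamic values; Java is used here only for illustration, and the class `ClusterConditional`, the `aws-` and `idc-` name prefixes, and the profile values are all assumptions, not the team's actual rules.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: pick a value for one setting based on the target cluster name. */
final class ClusterConditional {
    // Insertion order is kept so that the first registered matching prefix wins.
    private final Map<String, String> valueByPrefix = new LinkedHashMap<>();
    private final String fallback;

    ClusterConditional(String fallback) {
        this.fallback = fallback;
    }

    /** Registers a value to use when the cluster name starts with the given prefix. */
    ClusterConditional when(String clusterPrefix, String value) {
        valueByPrefix.put(clusterPrefix, value);
        return this;
    }

    /** Returns the value of the first matching rule, or the fallback if none match. */
    String resolve(String clusterName) {
        for (Map.Entry<String, String> rule : valueByPrefix.entrySet()) {
            if (clusterName.startsWith(rule.getKey())) {
                return rule.getValue();
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        ClusterConditional profile = new ClusterConditional("default")
            .when("aws-", "aws")
            .when("idc-", "idc");
        System.out.println(profile.resolve("aws-live-1")); // prints: aws
        System.out.println(profile.resolve("idc-dev-2"));  // prints: idc
        System.out.println(profile.resolve("local"));      // prints: default
    }
}
```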

naver

Beyond the Side Effects of API-

JVM applications often suffer from initial latency spikes because the Just-In-Time (JIT) compiler requires a "warm-up" period to optimize frequently executed code into machine language. While traditional strategies rely on simulated API calls to trigger this optimization, these methods often introduce side effects like data pollution, log noise, and increased maintenance overhead. This new approach advocates for a library-centric warm-up that targets core execution paths and dependencies directly, ensuring high performance from the first real request without the risks of full-scale API simulation.

### Limitations of Traditional API-Based Warm-up

* **Data and State Pollution:** Simulated API calls can inadvertently trigger database writes, send notifications, or pollute analytics data, requiring complex logic to bypass these side effects.
* **Maintenance Burden:** As business logic and API signatures change, developers must constantly update the warm-up scripts or "dummy" requests to match the current application state.
* **Operational Risk:** Relying on external dependencies or complex internal services during the warm-up phase can lead to deployment failures if the mock environment is not perfectly aligned with production.

### The Library-Centric Warm-up Strategy

* **Targeted Optimization:** Instead of hitting the entry-point controllers, the focus shifts to warming up heavy third-party libraries and internal utility classes (e.g., JSON parsers, encryption modules, and DB drivers).
* **Internal Execution Path:** By directly invoking methods within the application's service or infrastructure layer during the startup phase, the JIT compiler can reach "Tier 4" (C2) optimization for critical code blocks.
* **Decoupled Logic:** Because the warm-up targets underlying libraries rather than specific business endpoints, the logic remains stable even when the high-level API changes.
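
A minimal sketch of the library-centric idea follows, assuming JDK classes (`MessageDigest` and `Base64`) as stand-ins for the heavy libraries a real service would warm up. The class `LibraryWarmup` and the iteration count are hypothetical, and running the loop does not by itself guarantee that any method reaches C2 compilation; that depends on the JVM and its settings.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

/** Hypothetical sketch: exercise side-effect-free library code before serving traffic. */
final class LibraryWarmup {
    /**
     * Repeatedly hashes and encodes throwaway payloads. The returned checksum
     * depends on every iteration, so the work cannot be optimized away as unused.
     */
    static long warmUp(int iterations) {
        MessageDigest sha256;
        try {
            sha256 = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is not available", e);
        }
        Base64.Encoder encoder = Base64.getEncoder();
        long checksum = 0;
        for (int i = 0; i < iterations; i++) {
            byte[] payload = ("warmup-" + i).getBytes(StandardCharsets.UTF_8);
            byte[] digest = sha256.digest(payload);
            checksum += encoder.encodeToString(digest).length();
        }
        return checksum;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        // The iteration count is an assumption; tune it against real measurements.
        long checksum = warmUp(20_000);
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        System.out.println("warm-up checksum=" + checksum + " elapsedMs=" + elapsedMillis);
    }
}
```

Nothing in the loop touches a database, a network endpoint, or an application log, which is the property that separates this style of warm-up from simulated API calls.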

### Implementation and Performance Verification

* **Reflection and Hooks:** The implementation uses application startup hooks to execute intensive code paths, ensuring the JVM is "hot" before the load balancer begins directing traffic to the instance.
* **JIT Compilation Monitoring:** Success is measured by tracking the number of JIT-compiled methods and the time taken to reach a stable state, specifically targeting the reduction of "cold" execution time.
* **Latency Improvements:** Empirical data shows a significant reduction in P99 latency during the first few minutes of deployment, as the most CPU-intensive library functions are already pre-optimized.

### Advantages and Practical Constraints

* **Safer Deployments:** Removing the need for simulated network requests makes the deployment process more robust and prevents accidental side effects in downstream systems.
* **Granular Control:** Developers can selectively warm up only the most performance-sensitive parts of the application, saving startup time compared to a full-system simulation.
* **Incomplete Path Coverage:** A primary limitation is that library-only warming may miss specific branch optimizations that occur only during full end-to-end request processing.

To achieve the best balance between safety and performance, engineering teams should prioritize warming up shared infrastructure libraries and high-overhead utilities. While it may not cover 100% of the application's execution paths, a library-based approach provides a more maintainable and lower-risk foundation for JVM performance tuning than traditional request-based methods.
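
One way to approximate the "time taken to reach a stable state" measurement is sketched below, using the JDK's standard `CompilationMXBean`. That bean reports accumulated compilation time, not the number of compiled methods, so this covers only part of what the summary describes. The class `JitSettleProbe` and the rule that "stable" means no change across a few consecutive polls are assumptions made for illustration.

```java
import java.lang.management.CompilationMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical sketch: wait until accumulated JIT compilation time stops growing. */
final class JitSettleProbe {
    /** Accumulated JIT compilation time in ms, or -1 if this JVM does not report it. */
    static long compilationMillis() {
        CompilationMXBean bean = ManagementFactory.getCompilationMXBean();
        if (bean == null || !bean.isCompilationTimeMonitoringSupported()) {
            return -1;
        }
        return bean.getTotalCompilationTime();
    }

    /**
     * Polls until compilation time is unchanged for quietSamples consecutive polls,
     * or until timeoutMillis has passed. Returns the elapsed time in ms.
     */
    static long awaitQuiet(int quietSamples, long pollMillis, long timeoutMillis) {
        long start = System.nanoTime();
        long last = compilationMillis();
        int quiet = 0;
        while (quiet < quietSamples && elapsedMillis(start) < timeoutMillis) {
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                break;
            }
            long now = compilationMillis();
            quiet = (now == last) ? quiet + 1 : 0;
            last = now;
        }
        return elapsedMillis(start);
    }

    private static long elapsedMillis(long startNanos) {
        return (System.nanoTime() - startNanos) / 1_000_000;
    }

    public static void main(String[] args) {
        long settledAfter = awaitQuiet(5, 200, 30_000);
        System.out.println("JIT quiet after ms=" + settledAfter
            + " totalCompileMs=" + compilationMillis());
    }
}
```

A readiness check could call `awaitQuiet` after the warm-up routine and before the instance is registered with the load balancer. For per-method detail, HotSpot's `-XX:+PrintCompilation` flag logs each compilation event.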