State of Routing in Model Serving. By Nipun Kumar, Rajat Shah, Peter Chng. Introduction: This is the first blog post in a multi-part series that shares technical insights into how our ML model serving infrastructure powers several personalized experiences at scale…
By Khayyam Guliyev, Duarte Nunes, Ming Chen, Justin Jaffray. As Datadog continues to scale, the volume, complexity, and cardinality of the metrics we ingest and store steadily grow by orders of magnitude. This growth pushes the boundaries of our core timeseries database, the internal sy…
The LINE Billing Platform successfully migrated its large-scale payment database from Nbase-T to Vitess to handle high-traffic global transactions. While initially exploring gRPC for its performance reputation, the team transitioned to the MySQL protocol to ensure stability and reduce CPU overhead within their Java-based environment. This implementation demonstrates how Vitess can manage complex sharding requirements while maintaining high availability through automated recovery tools.
### Protocol Selection and Implementation
- The team initially attempted to use the gRPC protocol but encountered `http2: frame too large` errors and significant CPU overhead during performance testing.
- Manual mapping of query results to Java objects proved cumbersome with the Vitess gRPC client, leading to a shift toward the more mature and recommended MySQL protocol.
- Using the MySQL protocol allowed the team to leverage standard database drivers while benefiting from Vitess's routing capabilities via VTGate.
### Keyspace Architecture and Data Routing
- The system utilizes a dual-keyspace strategy: a "Global Keyspace" for unsharded metadata and a "Service Keyspace" for sharded transaction data.
- The Global Keyspace manages sharding keys using a "sequence" table type to ensure unique, auto-incrementing identifiers across the platform.
- The Service Keyspace is partitioned into $N$ shards using a hash-based Vindex, which distributes coin balances and transaction history across the shards.
- VTGate automatically routes queries to the correct shard by analyzing the sharding key in the `WHERE` clause or `INSERT` statement, minimizing cross-shard overhead.
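
The routing idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `keyspace_id` uses a truncated SHA-256 as a stand-in and is not the function Vitess's hash Vindex actually applies, and the helper names (`shard_ranges`, `route`) are hypothetical. It shows how a sharding key maps to a 64-bit keyspace ID and how that ID selects one of $N$ contiguous ranges.

```python
import hashlib


def shard_ranges(n_shards: int) -> list:
    """Split the 64-bit keyspace-ID space into n contiguous [lo, hi) ranges."""
    space = 1 << 64
    step = space // n_shards
    return [
        (i * step, space if i == n_shards - 1 else (i + 1) * step)
        for i in range(n_shards)
    ]


def keyspace_id(sharding_key: int) -> int:
    """Map a sharding key to a 64-bit keyspace ID.

    Assumption: truncated SHA-256 stands in for the real hash Vindex function.
    """
    digest = hashlib.sha256(sharding_key.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:8], "big")


def route(sharding_key: int, ranges: list) -> int:
    """Return the index of the shard whose range contains the key's keyspace ID."""
    kid = keyspace_id(sharding_key)
    for index, (lo, hi) in enumerate(ranges):
        if lo <= kid < hi:
            return index
    raise ValueError("keyspace ID fell outside every shard range")


if __name__ == "__main__":
    ranges = shard_ranges(4)
    for user_id in (1, 2, 3, 42):
        print(user_id, "-> shard", route(user_id, ranges))
```

Because the shard is derived from the key alone, a query that carries the sharding key can be sent to exactly one shard, which is the property that keeps cross-shard work to a minimum.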
### MySQL Compatibility and Transaction Logic
- Vitess maintains `REPEATABLE READ` isolation for single-shard transactions, while transactions that span multiple shards get `READ COMMITTED` semantics by default.
- Advanced features like Two-Phase Commit (2PC) are available for handling distributed transactions across multiple shards.
- Query execution plans are analyzed using `VEXPLAIN` and `VTEXPLAIN`, often managed through the VTAdmin web interface for better visibility.
- Certain limitations apply, such as temporary tables only being supported in unsharded keyspaces and specific unsupported SQL cases documented in the Vitess core.
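
To make the Two-Phase Commit bullet concrete, here is a minimal in-memory sketch of the general protocol. It is not Vitess's implementation: the `Shard` class, its `fail_prepare` switch, and the `two_phase_commit` coordinator are all hypothetical names used for illustration. The point is the shape of the protocol: every shard must stage its changes successfully before any shard makes them visible.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Shard:
    """Toy shard holding balances; hypothetical, for illustration only."""
    name: str
    balances: dict
    fail_prepare: bool = False       # simulate a shard that cannot prepare
    _pending: Optional[dict] = None  # staged state awaiting commit

    def prepare(self, changes: dict) -> bool:
        """Phase 1: stage the changes and vote yes (True) or no (False)."""
        if self.fail_prepare:
            return False
        staged = dict(self.balances)
        for account, delta in changes.items():
            staged[account] = staged.get(account, 0) + delta
            if staged[account] < 0:  # reject overdrafts
                return False
        self._pending = staged
        return True

    def commit(self) -> None:
        """Phase 2: make the staged state visible."""
        if self._pending is not None:
            self.balances = self._pending
            self._pending = None

    def rollback(self) -> None:
        """Discard any staged state."""
        self._pending = None


def two_phase_commit(work: list) -> bool:
    """Apply (shard, changes) pairs atomically: all shards commit or none do."""
    prepared = []
    for shard, changes in work:
        if shard.prepare(changes):
            prepared.append(shard)
        else:
            for p in prepared:
                p.rollback()
            return False
    for shard in prepared:
        shard.commit()
    return True
```

For example, moving 30 coins from an account on one shard to an account on another either updates both balances or leaves both untouched, depending on whether every `prepare` call succeeds.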
### Automated Operations and Monitoring
- The team employs VTOrc (based on Orchestrator) to automatically detect and repair database failures, such as an unreachable primary or stopped replication.
- Monitoring is centralized via Prometheus, which scrapes metrics from VTOrc, VTGate, and VTTablet components at dedicated ports (e.g., 16000).
- Real-time alerts are routed through Slack and email, using `tablet_alias` to specifically identify which MySQL node or VTTablet is experiencing issues.
- A web-based recovery dashboard provides a history of automated fixes, allowing operators to track the health of the cluster over time.
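
As a rough sketch of how an alert could use `tablet_alias` to point at a specific node, the snippet below parses an alias and builds a message. It assumes the common `<cell>-<uid>` alias layout (for example `zone1-0000000100`); the function names, the problem label, and the message format are hypothetical and do not come from the team's actual alerting setup.

```python
from typing import NamedTuple


class TabletAlias(NamedTuple):
    cell: str
    uid: int


def parse_tablet_alias(alias: str) -> TabletAlias:
    """Split an alias of the assumed form '<cell>-<uid>' into its parts."""
    # rpartition splits on the last hyphen, so cell names may contain hyphens.
    cell, sep, uid = alias.rpartition("-")
    if not sep or not cell or not uid.isdigit():
        raise ValueError(f"unexpected tablet alias: {alias!r}")
    return TabletAlias(cell, int(uid))


def format_alert(problem: str, alias: str) -> str:
    """Build a one-line alert that names the affected tablet."""
    tablet = parse_tablet_alias(alias)
    return f"[{problem}] tablet {alias} (cell={tablet.cell}, uid={tablet.uid})"


if __name__ == "__main__":
    print(format_alert("UnreachablePrimary", "zone1-0000000100"))
```

Including the cell and numeric ID in the message lets an operator go straight from a Slack or email alert to the matching VTTablet.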
For organizations migrating high-traffic legacy systems to a cloud-native sharding solution, prioritizing the MySQL protocol over gRPC is recommended for better compatibility with existing application frameworks and reduced operational complexity.
LY Corporation’s ABC Studio developed a specialized retail Merchant system by leveraging Domain-Driven Design (DDD) to overcome the functional limitations of a legacy food-delivery infrastructure. The project demonstrates that the primary value of DDD lies not just in technical implementation, but in aligning organizational structures and team responsibilities with domain boundaries. By focusing on the roles and responsibilities of the system rather than just the code, the team created a scalable platform capable of supporting diverse consumer interfaces.
### Redefining the Retail Domain
* The legacy system treated retail items like restaurant entries, creating friction for specialized retail services; the new system was built to be a standalone platform.
* The team narrowed the domain focus to five core areas: Shop, Item, Category, Inventory, and Order.
* Sales-specific logic, such as coupons and promotions, was delegated to external "Consumer Platforms," allowing the Merchant system to serve as a high-performance information provider.
### Clean Architecture and Modular Composition
* The system utilizes Clean Architecture to ensure domain entities remain independent of external frameworks, which also provided a manageable learning curve for new team members.
* Services are split into two distinct modules: "API" modules for receiving external requests and "Engine" modules for processing business logic.
* Communication between these modules is handled asynchronously via gRPC and Apache Kafka, using the Decaton library to increase throughput while maintaining a low partition count.
* The architecture prioritizes eventual consistency, allowing for high responsiveness and scalability across the platform.
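
The throughput point above can be illustrated with a small sketch of key-ordered concurrent processing: records from a single partition are fanned out to several worker lanes, and records with the same key always land in the same lane so their relative order is preserved. This is a plain-Python illustration of the general idea, not Decaton's API (Decaton is a Java library); `process_partition`, `handler`, and the lane layout are hypothetical.

```python
import queue
import threading
import zlib


def process_partition(records, handler, concurrency=4):
    """Process one partition's (key, value) records with several workers.

    Records sharing a key go to the same lane, so per-key order is kept
    while different keys are handled in parallel.
    """
    lanes = [queue.Queue() for _ in range(concurrency)]

    def worker(lane):
        while True:
            item = lane.get()
            if item is None:  # sentinel: no more records for this lane
                return
            key, value = item
            handler(key, value)

    threads = [threading.Thread(target=worker, args=(lane,)) for lane in lanes]
    for t in threads:
        t.start()

    # A stable hash of the key picks the lane.
    for key, value in records:
        lanes[zlib.crc32(key.encode()) % concurrency].put((key, value))

    for lane in lanes:
        lane.put(None)
    for t in threads:
        t.join()


if __name__ == "__main__":
    seen = {}
    lock = threading.Lock()

    def record_event(key, value):
        with lock:
            seen.setdefault(key, []).append(value)

    events = [(f"shop-{i % 3}", i) for i in range(12)]
    process_partition(events, record_event)
    print(seen)
```

With this shape, throughput can grow by raising `concurrency` rather than by adding partitions, which matches the stated goal of keeping the partition count low.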
### Global Collaboration and Conway’s Law
* Development was split between teams in Korea (Core Domain) and Japan (System Integration and BFF), requiring a shared understanding of domain boundaries.
* Architectural Decision Records (ADR) were implemented to document critical decisions and prevent "knowledge drift" during long-term collaboration.
* The organizational structure was intentionally designed to mirror the system architecture, with specific teams (Core, Link, BFF, and Merchant Link) assigned to distinct domain layers.
* This alignment, reflecting Conway’s Law, ensures that changes to external consumer platforms have minimal impact on the stable core domain logic.
Successful DDD adoption requires moving beyond technical patterns like hexagonal architecture and focusing on establishing a shared understanding of roles across the organization. By structuring teams to match domain boundaries, companies can build resilient systems where the core business logic remains protected even as the external service ecosystem evolves.
By Fabiana Scala, Tali Gutman. Like many other organizations, Datadog has long relied on the convenience of a large, shared relational database across multiple teams. This pattern is pervasive in the industry because it works well for many workloads, and continues to work well even at…
By Arun Parthiban, Sesh Nalla, Cecilia Wat-Kim. It is a trusted premise of software engineering that we build large systems from smaller components, each of which can be designed and tested with a high degree of confidence. Some design problems, though, only become evident at the syst…
By Joe McCourt, Sagar Mohite, Austin Lai. How do we surface the rich stories hidden within our users' observability data? We can use percentiles to communicate performance for a specific percentage of cases, but for the full shape of performance, we use distribution metrics. These metr…
By Laurent Bernaille, David Lentz. This story began when a routine update to one of our critical services caused a rise in errors. It looked like a simple issue: logs pointed to DNS and our metrics indicated that the impact to users was very low. But weeks later, our engineers were st…