central-dogma

2 posts

line

Introducing a New A/B

LY Corporation has developed an advanced A/B testing system that moves beyond simple random assignment to support dynamic user segmentation. By integrating a dedicated targeting system with a high-performance experiment assigner, the platform allows for precise experiments tailored to specific user characteristics and behaviors. This architecture enables data-driven decisions that are more relevant to localized or specialized user groups rather than relying on broad averages.

## Limitations of Traditional A/B Testing

* General A/B test systems typically rely on random assignment, such as applying a hash function to a user ID (`hash(id) % 2`), which is simple and cost-effective.
* While random assignment reduces selection bias, it is insufficient for hypotheses that only apply to specific cohorts, such as "iOS users living in Osaka."
* Advanced systems solve this by shifting from general testing across an entire user base to personalized testing for specific segments.

## Architecture of the Targeting System

* The system processes massive datasets including user information, mobile device data, and application activity stored in HDFS.
* Apache Spark is used to execute complex conditional operations (such as unions, intersections, and subtractions) to refine user segments.
* Segment data is written to Object Storage and then cached in Redis using a `{user_id}-{segment_id}` key format to ensure low-latency lookups during live requests.

## A/B Test Management and Assignment

* The system utilizes "Central Dogma" as a configuration repository where operators and administrators define experiment parameters.
* A Test Group Assigner orchestrates the process: when a client makes a request, the assigner retrieves experiment info and checks the user's segment membership in Redis.
* Once a user is assigned to a specific group (e.g., Test Group 1), the system serves the corresponding content and logs the event to a data store for dashboard visualization and analysis.

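
The assignment flow described above (segment check first, then group selection) can be sketched roughly as follows. This is a minimal illustration, not the actual assigner: the in-memory set stands in for the Redis cache, and the experiment dictionary, its field names, the function names, and the per-experiment salt are all assumptions. Only the `{user_id}-{segment_id}` key format and the `hash(id) % n` idea come from the summary. `hashlib` is used because Python's built-in `hash()` for strings varies between processes.

```python
import hashlib
from typing import Optional

# Hypothetical in-memory stand-in for the Redis segment cache.
# Keys follow the "{user_id}-{segment_id}" format from the summary.
segment_cache = {"u1-osaka_ios", "u2-osaka_ios"}


def in_segment(user_id: str, segment_id: str) -> bool:
    """Membership check; against Redis this would be a key lookup."""
    return f"{user_id}-{segment_id}" in segment_cache


def bucket(user_id: str, experiment_id: str, num_groups: int) -> int:
    """Stable version of hash(id) % n, salted per experiment (an assumption)."""
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_groups


def assign(user_id: str, experiment: dict) -> Optional[str]:
    """Return a group name, or None if the user is outside the target segment."""
    if not in_segment(user_id, experiment["segment_id"]):
        return None
    groups = experiment["groups"]
    return groups[bucket(user_id, experiment["id"], len(groups))]


# Hypothetical experiment definition, as it might be read from a config store.
experiment = {
    "id": "exp-42",
    "segment_id": "osaka_ios",
    "groups": ["control", "test_1"],
}

print(assign("u1", experiment))  # one of the two groups, stable across runs
print(assign("u9", experiment))  # None: u9 is not in the segment
```

Checking the segment before hashing keeps users outside the target cohort out of the experiment entirely, so they neither see a variation nor dilute the logged results.
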
## Strategic Use Cases and Future Plans

* **Content Recommendation:** Testing different Machine Learning models to see which performs better for a specific user demographic.
* **Targeted Incentives:** Limiting shopping discount experiments to "light users," as coupons may not significantly change the behavior of "heavy users."
* **Onboarding Optimization:** Restricting UI tests to new users only, ensuring that existing users' experiences remain uninterrupted.
* **Platform Expansion:** Future goals include building a unified admin interface for the entire lifecycle of an experiment and expanding the system to cover all services within LY Corporation.

For organizations looking to optimize user experience, transitioning from random assignment to dynamic segmentation is essential for high-precision product development. Caching segment data in a high-performance store like Redis is critical to maintaining low latency when serving experimental variations in real time.
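
Segments such as "light users" or "iOS users living in Osaka" are set expressions over user attributes. The sketch below uses plain Python sets as a stand-in for the Spark jobs mentioned in the summary. All user IDs, attribute sets, and segment names are hypothetical; only the union, intersection, and subtraction operations and the `{user_id}-{segment_id}` key format come from the source.

```python
# Hypothetical user-ID sets; in the described system these would be
# derived by Spark from user, device, and activity data in HDFS.
ios_users = {"u1", "u2", "u3", "u5"}
osaka_users = {"u2", "u3", "u4"}
shoppers = {"u1", "u2", "u3", "u4", "u5"}
heavy_shoppers = {"u3", "u5"}
new_users = {"u4"}

segments = {
    "osaka_ios": ios_users & osaka_users,         # intersection
    "light_shoppers": shoppers - heavy_shoppers,  # subtraction
    "new_or_osaka": new_users | osaka_users,      # union
}


def cache_keys(segments: dict) -> list:
    """Flatten segment membership into "{user_id}-{segment_id}" cache keys."""
    return sorted(
        f"{user_id}-{segment_id}"
        for segment_id, members in segments.items()
        for user_id in members
    )


for key in cache_keys(segments):
    print(key)
```

Storing one key per (user, segment) pair means a live request only needs a single existence check per experiment, at the cost of more keys as segments multiply.
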

line

Connecting Thousands of LY Corporation Services

LY Corporation developed a centralized control plane using Central Dogma to manage service-to-service communication across its vast, heterogeneous infrastructure of physical machines, virtual machines, and Kubernetes clusters. By adopting the industry-standard xDS protocol, the new system resolves the interoperability and scaling limitations of its legacy platform while providing a robust GitOps-based workflow. This architecture enables the company to connect thousands of services with high reliability and sophisticated traffic control capabilities.

## Limitations of the Legacy System

The previous control plane environment faced several architectural bottlenecks that hindered developer productivity and system flexibility:

* **Tight Coupling:** The system was heavily dependent on a specific internal project management tool (PMC), making it difficult to support modern containerized environments like Kubernetes.
* **Proprietary Schemas:** Communication relied on custom message schemas, which created interoperability issues between different clients and versions.
* **Lack of Dynamic Registration:** The legacy setup could not handle dynamic endpoint registration effectively, functioning more as a static registry than a functional service mesh control plane.
* **Limited Traffic Control:** It lacked the ability to perform complex routing tasks, such as canary releases or advanced client-side load balancing, across diverse infrastructures.

## Central Dogma as a Control Plane

To solve these issues, the team leveraged Central Dogma, a Git-based repository service for textual configuration, to act as the foundation for a new control plane:

* **xDS Protocol Integration:** The new control plane implements the industry-standard xDS protocol, ensuring seamless compatibility with Envoy and other modern data plane proxies.
* **GitOps Workflow:** By utilizing Central Dogma’s mirroring features, developers can manage service configurations and traffic policies safely through Pull Requests in external Git repositories.
* **High Reliability:** The system inherits Central Dogma’s native strengths, including multi-datacenter replication, high availability, and a robust authorization system.
* **Schema Evolution:** The control plane automatically transforms legacy metadata into standard xDS resources, allowing for a smooth transition from old infrastructure to the new service mesh.

## Dynamic Service Discovery and Registration

The architecture provides automated ways to manage service endpoints across different environments:

* **Kubernetes Endpoint Plugin:** A dedicated plugin watches for changes in Kubernetes services and automatically updates the xDS resource tree in Central Dogma.
* **Automated API Registration:** The system provides gRPC and HTTP APIs (e.g., `RegisterLocalityLbEndpoint`) that allow services to register themselves dynamically during the startup process.
* **Advanced Traffic Features:** The new control plane supports sophisticated features like zone-aware routing, circuit breakers, automatic retries, and "slow start" mechanisms for new endpoints.

## Evolution Toward Sidecar-less Service Mesh

A major focus of the project is improving the developer experience by reducing the operational overhead of the data plane:

* **Sidecar-less Options:** The team is working toward providing service mesh benefits without requiring a sidecar proxy for every pod, which reduces resource consumption and simplifies debugging.
* **Unified Control:** Central Dogma acts as a single source of truth for both proxy-based and proxyless service mesh configurations, ensuring consistent policy enforcement across the entire organization.

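
To make the endpoint registration step under Dynamic Service Discovery more concrete, the sketch below groups registered endpoints by zone into a structure shaped like Envoy's xDS `ClusterLoadAssignment` resource, which is the kind of textual configuration a Git-based store could hold. The field names follow that Envoy message; the cluster name, zones, addresses, and the helper function are hypothetical. The source does not show the actual request shape of `RegisterLocalityLbEndpoint` or how the control plane stores the result, so none of that is modeled here.

```python
import json
from collections import defaultdict


def build_cluster_load_assignment(cluster_name, registrations):
    """Group (zone, host, port) tuples into a ClusterLoadAssignment-shaped dict."""
    by_zone = defaultdict(list)
    for zone, host, port in registrations:
        by_zone[zone].append(
            {
                "endpoint": {
                    "address": {
                        "socket_address": {"address": host, "port_value": port}
                    }
                }
            }
        )
    return {
        "cluster_name": cluster_name,
        # One entry per locality; sorted so the output is stable for diffs.
        "endpoints": [
            {"locality": {"zone": zone}, "lb_endpoints": lb_endpoints}
            for zone, lb_endpoints in sorted(by_zone.items())
        ],
    }


# Hypothetical registrations, e.g. collected as services start up.
registrations = [
    ("zone-a", "10.0.0.1", 8080),
    ("zone-b", "10.0.1.1", 8080),
    ("zone-a", "10.0.0.2", 8080),
]

cla = build_cluster_load_assignment("checkout", registrations)
print(json.dumps(cla, indent=2))
```

Grouping endpoints by locality is what lets a data plane apply zone-aware routing: the client can prefer endpoints whose `zone` matches its own and fall back to other localities when none are healthy.
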
For organizations managing large-scale, heterogeneous infrastructure, transitioning to an xDS-compliant control plane backed by a reliable Git-based configuration store is highly recommended. This approach balances the need for high-speed dynamic updates with the safety and auditability of GitOps, ultimately allowing for a more scalable and developer-friendly service mesh.