Netflix has developed a Real-Time Distributed Graph (RDG) to unify member interaction data across its expanding business verticals, including streaming, live events, and mobile gaming. By transitioning from siloed microservice data to a graph-based model, the company can perform low-latency, relationship-centric queries that were previously hindered by expensive manual joins and data fragmentation. The resulting system lets Netflix track user journeys across devices and platforms in real time, providing a foundation for deeper personalization and pattern detection.
### Challenges of Data Isolation in Microservices
* While Netflix’s microservices architecture facilitates independent scaling and service decomposition, it inherently leads to data isolation where each service manages its own storage.
* Data scientists and engineers previously had to "stitch" together disparate data from various databases and the central data warehouse, which was a slow and manual process.
* The RDG moves away from table-based models to a relationship-centric model, allowing efficient "hops" across nodes without complex denormalization (a minimal traversal sketch follows this list).
* This flexibility allows the system to adapt to new business entities (like live sports or games) without requiring massive schema re-architectures.
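The relationship-centric model is easiest to see in code. The following is a minimal, self-contained Java sketch; the adjacency-map structure and edge labels such as `WATCHED` are illustrative assumptions rather than Netflix's actual RDG schema. It shows why a "hop" is just an adjacency lookup instead of a cross-database join:

```java
import java.util.*;

// Toy relationship-centric store: each traversal ("hop") is a local adjacency
// lookup, with no join table or denormalized view required.
public class HopSketch {
    // node id -> (relationship label -> neighbor node ids)
    static final Map<String, Map<String, List<String>>> adjacency = new HashMap<>();

    static void addEdge(String from, String label, String to) {
        adjacency.computeIfAbsent(from, k -> new HashMap<>())
                 .computeIfAbsent(label, k -> new ArrayList<>())
                 .add(to);
    }

    static List<String> hop(String node, String label) {
        return adjacency.getOrDefault(node, Map.of()).getOrDefault(label, List.of());
    }

    public static void main(String[] args) {
        addEdge("member:m123", "WATCHED", "title:t1");
        addEdge("member:m123", "PLAYED",  "game:g7");
        addEdge("title:t1",    "PART_OF", "liveEvent:e42");

        // Two hops: member -> titles watched -> live events those titles belong to.
        for (String title : hop("member:m123", "WATCHED")) {
            System.out.println("m123 watched " + title + " -> events: "
                               + hop(title, "PART_OF"));
        }
    }
}
```

Adding a new entity type (a live event, a game) only introduces new node and edge labels; the traversal code and storage layout stay the same, which is what makes schema re-architecture unnecessary.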
### Real-Time Ingestion and Normalization
* The ingestion layer is designed to capture events from diverse upstream sources, including Change Data Capture (CDC) from databases and request/response logs (a hypothetical event envelope is sketched after this list).
* Netflix utilizes its internal data pipeline, Keystone, to funnel these high-volume event streams into the processing framework.
* The system must handle "Internet scale" data, ensuring that events from millions of members are captured as they happen to maintain an up-to-date view of the graph.
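To make the ingestion step concrete, here is a hypothetical event envelope, not the actual Keystone schema, illustrating how CDC rows and request/response logs from different microservices can be wrapped in one shape before entering the processing layer:

```java
// Hypothetical ingestion envelope: a common wrapper around heterogeneous
// upstream events so the downstream stream processor sees one shape.
public class IngestEnvelope {
    public enum SourceType { CDC, REQUEST_LOG }

    public final SourceType sourceType;   // where the event came from
    public final String sourceService;    // owning microservice
    public final String payloadJson;      // raw, source-specific body
    public final long eventTimeMillis;    // when the interaction actually happened

    public IngestEnvelope(SourceType sourceType, String sourceService,
                          String payloadJson, long eventTimeMillis) {
        this.sourceType = sourceType;
        this.sourceService = sourceService;
        this.payloadJson = payloadJson;
        this.eventTimeMillis = eventTimeMillis;
    }

    public static void main(String[] args) {
        // A CDC row from a playback service and a request log from a games service
        // enter the pipeline in the same envelope, ready for normalization.
        IngestEnvelope cdc = new IngestEnvelope(SourceType.CDC, "playback-service",
            "{\"memberId\":\"m123\",\"titleId\":\"t1\",\"op\":\"INSERT\"}", 1717171717000L);
        IngestEnvelope log = new IngestEnvelope(SourceType.REQUEST_LOG, "games-service",
            "{\"memberId\":\"m123\",\"gameId\":\"g7\",\"action\":\"INSTALL\"}", 1717171800000L);
        System.out.println(cdc.sourceService + " and " + log.sourceService + " share one schema");
    }
}
```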
### Stream Processing with Apache Flink
* Netflix uses Apache Flink as the core stream processing engine to handle the transformation of raw events into graph entities.
* Incoming data undergoes normalization to ensure a standardized format, regardless of which microservice or business vertical the data originated from.
* The pipeline performs data enrichment, joining incoming streams with auxiliary metadata to provide a comprehensive context for each interaction.
* The final step of the processing layer maps these enriched events into a graph structure of nodes (entities) and edges (relationships), which are then emitted to the system's storage layer, as in the pipeline sketch below.
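The steps above map naturally onto Flink's DataStream API. The following is a minimal sketch, not Netflix's production job: the class names (`Interaction`, `GraphEdge`), the pipe-delimited input format, and the in-memory source and `print()` sink standing in for Keystone and the graph store are all assumptions.

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class RdgPipelineSketch {

    /** Normalized interaction, regardless of originating microservice. */
    public static class Interaction {
        public String memberId;
        public String entityId;   // title, live event, or game
        public String action;     // e.g. "PLAY", "JOIN_LIVE", "INSTALL"
        public long timestamp;
        public Interaction() {}
        public Interaction(String memberId, String entityId, String action, long ts) {
            this.memberId = memberId; this.entityId = entityId;
            this.action = action; this.timestamp = ts;
        }
    }

    /** A directed edge in the graph: member --action--> entity. */
    public static class GraphEdge {
        public String fromNode, toNode, label;
        public long timestamp;
        public GraphEdge() {}
        public GraphEdge(String from, String to, String label, long ts) {
            this.fromNode = from; this.toNode = to; this.label = label; this.timestamp = ts;
        }
        @Override public String toString() {
            return fromNode + " -[" + label + "]-> " + toNode + " @" + timestamp;
        }
    }

    /** Normalization: parse a raw pipe-delimited event into the shared schema. */
    public static class Normalize implements MapFunction<String, Interaction> {
        @Override public Interaction map(String raw) {
            String[] f = raw.split("\\|");
            return new Interaction(f[0], f[1], f[2], Long.parseLong(f[3]));
        }
    }

    /** Graph mapping: emit one edge (and implicitly two nodes) per interaction. */
    public static class ToEdges implements FlatMapFunction<Interaction, GraphEdge> {
        @Override public void flatMap(Interaction in, Collector<GraphEdge> out) {
            out.collect(new GraphEdge("member:" + in.memberId,
                                      "entity:" + in.entityId,
                                      in.action, in.timestamp));
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in for the Keystone-fed stream of raw CDC / request-log events.
        DataStream<String> raw = env.fromElements(
            "m123|title987|PLAY|1717171717",
            "m123|liveEvent42|JOIN_LIVE|1717171800");

        raw.map(new Normalize())
           .flatMap(new ToEdges())
           .print();              // stand-in for the graph storage sink

        env.execute("rdg-pipeline-sketch");
    }
}
```

An enrichment stage joining each interaction with auxiliary metadata would sit between `Normalize` and `ToEdges`; it is omitted here to keep the sketch small.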
### Practical Conclusion
Organizations operating with a highly decoupled microservices architecture should consider a graph-based ingestion strategy to overcome the limitations of data silos. By leveraging stream processing tools like Apache Flink to build a real-time graph, engineering teams can provide stakeholders with the ability to discover hidden relationships and cross-domain insights that are often lost in traditional data warehouses.
Netflix manages the massive surge of concurrent users during live events by utilizing a hybrid strategy of prefetching and real-time broadcasting to deliver synchronized recommendations. By decoupling data delivery from the live trigger, the system avoids the "thundering herd" effect that would otherwise overwhelm cloud infrastructure during record-breaking broadcasts. This architecture ensures that millions of global devices receive timely updates and visual cues without requiring linear, inefficient scaling of compute resources.
### The Constraint Optimization Problem
To maintain a seamless experience, Netflix engineers balance three primary technical constraints: time to update, request throughput, and compute cardinality; a back-of-the-envelope sketch of how they interact follows this list.
* **Time:** The specific duration required to coordinate and push a recommendation update to the entire global fleet.
* **Throughput:** The maximum capacity of cloud services to handle incoming requests without service degradation.
* **Cardinality:** The variety and complexity of unique requests necessary to serve personalized updates to different user segments.
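A rough, hypothetical calculation (all numbers are assumptions, not Netflix figures) shows how cardinality dominates the update window when every device asks for its own payload at kickoff:

```java
// Hypothetical back-of-the-envelope comparison of a reactive fetch-at-kickoff
// model with a prefetch-and-broadcast model. Every constant below is assumed.
public class UpdateWindowEstimate {

    /** Seconds to update the fleet: compute each unique variant once, then deliver. */
    static double secondsToUpdate(long devices, long uniqueVariants,
                                  long variantsComputedPerSec, long deliveriesPerSec) {
        double computeTime  = (double) uniqueVariants / variantsComputedPerSec;
        double deliveryTime = (double) devices / deliveriesPerSec;
        return computeTime + deliveryTime;
    }

    public static void main(String[] args) {
        long devices = 100_000_000L;   // assumed concurrent global devices

        // Reactive: personalized payloads are computed and fetched at kickoff, so
        // request cardinality is roughly the fleet size and every delivery is a full payload.
        double reactive = secondsToUpdate(devices, devices, 500_000L, 2_000_000L);

        // Prefetch + broadcast: variants were materialized and cached in advance;
        // kickoff needs only one tiny, identical trigger, which existing push
        // connections can fan out far faster than full payload fetches.
        double broadcast = secondsToUpdate(devices, 1L, 500_000L, 20_000_000L);

        System.out.printf("reactive fetch at kickoff: ~%.0f s%n", reactive);   // ~250 s
        System.out.printf("prefetch + broadcast:      ~%.0f s%n", broadcast);  // ~5 s
    }
}
```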
### Two-Phase Recommendation Delivery
The system splits the delivery process into two distinct stages to smooth out traffic spikes and ensure high availability; a simplified client-side sketch follows the list.
* **Prefetching Phase:** While members browse the app normally before an event, the system downloads materialized recommendations, metadata, and artwork into the device's local cache.
* **Broadcasting Phase:** When the event begins, a low-cardinality "at least once" message is broadcast to all connected devices, triggering them to display the already-cached content instantaneously.
* **Traffic Smoothing:** This approach eliminates the need for massive, real-time data fetches at the moment of kickoff, distributing the heavy lifting of data transfer over a longer period.
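The client-side view of the two phases can be sketched as follows. Class and method names here are hypothetical, not Netflix's client code; the sketch also shows why an "at least once" trigger is safe, since rendering from the cache is idempotent.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified device-side cache: heavy data arrives during normal browsing,
// and the kickoff trigger only flips what is already on the device.
public class LiveEventClientSketch {

    /** Materialized recommendation row cached on the device ahead of the event. */
    record CachedPromo(String eventId, String title, String artworkUrl) {}

    private final Map<String, CachedPromo> cache = new ConcurrentHashMap<>();

    /** Prefetch phase: runs while the member browses, well before kickoff. */
    void prefetch(CachedPromo promo) {
        cache.put(promo.eventId(), promo);
    }

    /** Broadcast phase: the trigger carries only an event id; duplicate
     *  deliveries simply re-render the same cached content. */
    void onBroadcastTrigger(String eventId) {
        CachedPromo promo = cache.get(eventId);
        if (promo != null) {
            System.out.println("Showing LIVE badge + artwork for: " + promo.title());
        } else {
            System.out.println("Cache miss, falling back to a normal fetch for " + eventId);
        }
    }

    public static void main(String[] args) {
        LiveEventClientSketch client = new LiveEventClientSketch();
        client.prefetch(new CachedPromo("e42", "Big Live Fight", "https://example.com/art.jpg"));
        client.onBroadcastTrigger("e42");   // kickoff: instant, no heavy fetch needed
        client.onBroadcastTrigger("e42");   // duplicate delivery is a harmless re-render
    }
}
```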
### Live State Management and UI Synchronization
A dedicated Live State Management (LSM) system tracks event schedules in real time to ensure the user interface stays perfectly in sync with the production.
* **Dynamic Adjustments:** If a live event is delayed or ends early, the LSM adjusts the broadcast triggers to preserve accuracy and prevent "spoilers" or dead links (see the scheduler sketch after this list).
* **Visual Cues:** The UI utilizes "Live" badging and dynamic artwork transitions to signal urgency and guide users toward the stream.
* **Frictionless Playback:** For members already on a title’s detail page, the system can trigger an automatic transition into the live player the moment the broadcast begins, reducing navigation latency.
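A small scheduler sketch, with illustrative names rather than Netflix's actual LSM API, shows how tracking the live schedule translates into rescheduling or cancelling the fleet-wide trigger:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.*;

// Hypothetical live-state tracker: the broadcast trigger follows the best-known
// start time, so a delayed or early-ending event never produces stale UI.
public class LiveStateSketch {
    enum EventState { UPCOMING, LIVE, ENDED }

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private ScheduledFuture<?> pendingTrigger;
    private volatile EventState state = EventState.UPCOMING;

    /** (Re)schedule the fleet-wide broadcast for the current best-known start time. */
    void scheduleTrigger(Instant startTime) {
        if (pendingTrigger != null) pendingTrigger.cancel(false);  // event slipped: drop old trigger
        long delayMs = Math.max(0, Duration.between(Instant.now(), startTime).toMillis());
        pendingTrigger = scheduler.schedule(() -> {
            state = EventState.LIVE;
            System.out.println("Broadcast: flip UI to LIVE badge / auto-transition to player");
        }, delayMs, TimeUnit.MILLISECONDS);
    }

    /** Early end: cancel anything pending and tell devices to drop the live treatment. */
    void endEvent() {
        if (pendingTrigger != null) pendingTrigger.cancel(false);
        state = EventState.ENDED;
        System.out.println("Broadcast: remove LIVE badge, avoid dead links and spoilers");
    }

    public static void main(String[] args) throws Exception {
        LiveStateSketch lsm = new LiveStateSketch();
        lsm.scheduleTrigger(Instant.now().plusSeconds(1));   // planned kickoff
        lsm.scheduleTrigger(Instant.now().plusSeconds(2));   // kickoff delayed: reschedule
        Thread.sleep(2500);
        lsm.endEvent();
        lsm.scheduler.shutdown();
    }
}
```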
### Practical Conclusion
To support global-scale live events, technical teams should prioritize edge-heavy strategies that pre-position assets on client devices. By shifting from a reactive request-response model to a proactive prefetch-and-trigger model, platforms can maintain high performance and reliability even during the most significant traffic peaks.