Netflix / microservices

4 posts

netflix

How Temporal Powers Reliable Cloud Operations at Netflix | by Netflix Technology Blog | Dec, 2025 | Netflix TechBlog

Netflix has significantly enhanced the reliability of its global continuous delivery platform, Spinnaker, by adopting Temporal for durable execution of cloud operations. By migrating away from a fragile, polling-based orchestration model between its internal services, the engineering team reduced transient deployment failures from 4% to 0.0001%. This shift lets developers write complex, long-running operational logic as standard code while the underlying platform handles state persistence and fault recovery.

### Limitations of Legacy Orchestration

* **The polling bottleneck:** Originally, Netflix's orchestration engine (Orca) communicated with its cloud interface (Clouddriver) via a synchronous POST request followed by continuous polling of a GET endpoint to track task status.
* **State fragility:** Clouddriver's internal orchestration engine relied on in-memory state or volatile Redis storage, so if a Clouddriver instance crashed mid-operation, the deployment state was often lost, leading to "zombie" tasks or failed deployments.
* **Manual error handling:** Developers had to implement retry logic, exponential backoff, and state checkpointing by hand for every cloud operation, which was both error-prone and difficult to maintain.

### Transitioning to Durable Execution with Temporal

* **Abstraction of failures:** Temporal provides a "durable execution" platform in which the state of a workflow—including local variables and thread stacks—is automatically persisted. Code can run "as if failures don't exist," because the system resumes exactly where it left off after a process crash or network interruption.
* **Workflows and Activities:** Netflix re-architected cloud operations into Temporal Workflows (orchestration logic) and Activities (idempotent units of work, such as calling an AWS API). This separation keeps the orchestration logic deterministic while external side effects are handled reliably (a minimal sketch of the pattern follows this summary).
* **Eliminating polling:** Temporal's signaling and long-running execution capabilities let Netflix replace the heavy overhead of thousands of services polling for status updates with a push-based, event-driven model.

### Impact on Cloud Operations

* **Dramatic reliability gains:** The most significant outcome was the near-elimination of transient failures, moving from a 4% failure rate to 0.0001% and ensuring that critical updates to the Open Connect CDN and Live streaming infrastructure execute with high confidence.
* **Developer productivity:** Using Temporal's SDKs, Netflix engineers now write standard Java or Go code to define complex deployment strategies (such as canary releases or blue-green deployments) without building custom state machines or management layers.
* **Operational visibility:** Temporal provides a native UI and a history audit log for every workflow, giving operators deep visibility into exactly which step of a deployment failed and why, along with the ability to manually retry specific failed steps when necessary.

For organizations managing complex, distributed cloud infrastructure, a durable execution framework like Temporal moves the burden of state management and fault tolerance from the application layer to the platform, allowing engineers to focus on business logic rather than the mechanics of distributed-systems failure.
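The Workflow/Activity split described above maps naturally onto Temporal's Java SDK. The following is a minimal, illustrative sketch rather than Netflix's actual code: the `DeployWorkflow` and `CloudActivities` names, the activity methods, and the traffic-shift loop are assumptions made for this example; only the SDK annotations and option builders are standard Temporal APIs.

```java
import io.temporal.activity.ActivityInterface;
import io.temporal.activity.ActivityOptions;
import io.temporal.common.RetryOptions;
import io.temporal.workflow.Workflow;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;

import java.time.Duration;

// Hypothetical activities: idempotent units of work with external side effects
// (e.g. cloud provider API calls). Temporal retries them automatically on failure.
@ActivityInterface
interface CloudActivities {
  String createServerGroup(String application, String region);
  void waitForInstancesUp(String serverGroupId);
  void shiftTraffic(String serverGroupId, int percent);
}

// Hypothetical workflow: deterministic orchestration logic. Local variables survive
// process crashes because Temporal persists the workflow state and replays history.
@WorkflowInterface
public interface DeployWorkflow {
  @WorkflowMethod
  void deploy(String application, String region);
}

class DeployWorkflowImpl implements DeployWorkflow {

  private final CloudActivities activities =
      Workflow.newActivityStub(
          CloudActivities.class,
          ActivityOptions.newBuilder()
              .setStartToCloseTimeout(Duration.ofMinutes(10))
              .setRetryOptions(
                  RetryOptions.newBuilder()
                      .setInitialInterval(Duration.ofSeconds(5))
                      .setBackoffCoefficient(2.0)
                      .setMaximumAttempts(10)
                      .build())
              .build());

  @Override
  public void deploy(String application, String region) {
    // Each activity result is durably recorded; if the worker crashes after
    // createServerGroup completes, the workflow resumes here without re-running it.
    String serverGroupId = activities.createServerGroup(application, region);
    activities.waitForInstancesUp(serverGroupId);

    // Gradual traffic shift: the loop index is part of the persisted workflow
    // state, so no external checkpointing is needed.
    for (int percent : new int[] {10, 50, 100}) {
      activities.shiftTraffic(serverGroupId, percent);
      Workflow.sleep(Duration.ofMinutes(5)); // durable timer, survives worker restarts
    }
  }
}
```

Because each activity result and timer is recorded in the workflow history, a worker crash between steps does not lose progress: Temporal replays the history on another worker and continues from the last completed step, which is the property the summary credits for the drop in transient failures.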

netflix

Netflix Live Origin. Xiaomei Liu, Joseph Lynch, Chris Newton | by Netflix Technology Blog | Dec, 2025 | Netflix TechBlog

The Netflix Live Origin is a specialized, multi-tenant microservice designed to bridge the gap between cloud-based live streaming pipelines and the Open Connect content delivery network. By operating as an intelligent broker, it manages content selection across redundant regional pipelines to ensure that only valid, high-quality segments are distributed to client devices. This architecture allows Netflix to achieve high resilience and stream integrity through server-side failover and deterministic segment selection.

### Multi-Pipeline and Multi-Region Awareness

* The origin server mitigates common live streaming defects, such as missing segments, timing discontinuities, and short segments containing missing video or audio samples.
* It leverages independent, redundant streaming pipelines across different AWS regions to ensure high availability; if one pipeline fails or produces a defective segment, the origin selects a valid candidate from an alternate path.
* Implementation of epoch locking at the cloud encoder level allows the origin to interchangeably select segments from various pipelines.
* The system uses lightweight media inspection at the packager level to generate metadata, which the origin then uses to perform deterministic candidate selection.

### Stream Distribution and Protocol Integration

* The service operates on AWS EC2 instances and utilizes standard HTTP protocol features for communication.
* Upstream packagers use HTTP PUT requests to push segments into storage at specific URLs, while the downstream Open Connect network retrieves them via GET requests.
* The architecture is optimized for a manifest design that uses segment templates and constant segment durations, which reduces the need for frequent manifest refreshes.

### Open Connect Streaming Optimization

* While Netflix's Open Connect Appliances (OCAs) were originally optimized for VOD, the Live Origin extends nginx proxy-caching functionality to meet live-specific requirements.
* OCAs are provided with Live Event Configuration data, including Availability Start Times and initial segment numbers, to determine the legitimate range of segments for an event (a small sketch of this range check follows this summary).
* This predictive modeling allows the CDN to reject requests for objects outside the valid range immediately, reducing unnecessary traffic and load on the origin.

By decoupling the live streaming pipeline from the distribution network through this specialized origin layer, Netflix can maintain a high level of fault tolerance and stream stability. This approach minimizes client-side complexity by handling failovers and segment selection on the server side, ensuring a seamless experience for viewers of live events.
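With constant segment durations, the "legitimate range" check described above reduces to simple arithmetic on the Availability Start Time, the initial segment number, and the elapsed wall-clock time. The class below is an illustrative sketch under those assumptions; the class, method, and parameter names are invented for the example and do not reflect Netflix's internal implementation.

```java
import java.time.Duration;
import java.time.Instant;

// Illustrative only: computes the range of segment numbers that should exist for a
// live event with a constant segment duration, in the spirit of the "legitimate
// range" check an edge cache could apply before proxying a request to the origin.
public final class LiveSegmentRange {

  public static long latestAvailableSegment(
      Instant availabilityStartTime, long initialSegmentNumber,
      Duration segmentDuration, Instant now) {
    long elapsedMs = Duration.between(availabilityStartTime, now).toMillis();
    if (elapsedMs < 0) {
      return initialSegmentNumber - 1; // event has not started: nothing is valid yet
    }
    return initialSegmentNumber + elapsedMs / segmentDuration.toMillis();
  }

  // A cache node could reject a request up front if the segment number falls outside
  // [initialSegmentNumber, latestAvailableSegment], instead of forwarding it upstream.
  public static boolean isLegitimate(
      long requestedSegment, Instant availabilityStartTime,
      long initialSegmentNumber, Duration segmentDuration, Instant now) {
    long latest = latestAvailableSegment(
        availabilityStartTime, initialSegmentNumber, segmentDuration, now);
    return requestedSegment >= initialSegmentNumber && requestedSegment <= latest;
  }

  public static void main(String[] args) {
    Instant start = Instant.parse("2025-12-01T20:00:00Z");
    Duration segDur = Duration.ofSeconds(2);
    Instant now = start.plusSeconds(90); // 90 s into the event -> segments 1..46 exist
    System.out.println(isLegitimate(1_000, start, 1, segDur, now)); // false: too far ahead
    System.out.println(isLegitimate(40, start, 1, segDur, now));    // true
  }
}
```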

netflix

How and Why Netflix Built a Real-Time Distributed Graph: Part 1 — Ingesting and Processing Data Streams at Internet Scale | by Netflix Technology Blog | Netflix TechBlog

Netflix has developed a Real-Time Distributed Graph (RDG) to unify member interaction data across its expanding business verticals, including streaming, live events, and mobile gaming. By transitioning from siloed microservice data to a graph-based model, the company can perform low-latency, relationship-centric queries that were previously hindered by expensive manual joins and data fragmentation. The resulting system enables Netflix to track user journeys across devices and platforms in real time, providing a foundation for deeper personalization and pattern detection.

### Challenges of Data Isolation in Microservices

* While Netflix's microservices architecture facilitates independent scaling and service decomposition, it inherently leads to data isolation, where each service manages its own storage.
* Data scientists and engineers previously had to "stitch" together disparate data from various databases and the central data warehouse, which was a slow and manual process.
* The RDG moves away from table-based models to a relationship-centric model, allowing for efficient "hops" across nodes without the need for complex denormalization.
* This flexibility allows the system to adapt to new business entities (like live sports or games) without requiring massive schema re-architectures.

### Real-Time Ingestion and Normalization

* The ingestion layer is designed to capture events from diverse upstream sources, including Change Data Capture (CDC) feeds from databases and request/response logs.
* Netflix utilizes its internal data pipeline, Keystone, to funnel these high-volume event streams into the processing framework.
* The system must handle Internet-scale data, ensuring that events from millions of members are captured as they happen to maintain an up-to-date view of the graph.

### Stream Processing with Apache Flink

* Netflix uses Apache Flink as the core stream processing engine to handle the transformation of raw events into graph entities.
* Incoming data undergoes normalization to ensure a standardized format, regardless of which microservice or business vertical the data originated from.
* The pipeline performs data enrichment, joining incoming streams with auxiliary metadata to provide comprehensive context for each interaction.
* The final step of the processing layer maps these enriched events into a graph structure of nodes (entities) and edges (relationships), which are then emitted to the system's storage layer (a minimal Flink sketch of this mapping follows this summary).

### Practical Conclusion

Organizations operating with a highly decoupled microservices architecture should consider a graph-based ingestion strategy to overcome the limitations of data silos. By leveraging stream processing tools like Apache Flink to build a real-time graph, engineering teams can give stakeholders the ability to discover hidden relationships and cross-domain insights that are often lost in traditional data warehouses.
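The event-to-edge mapping step can be sketched with Flink's Java DataStream API. This is a toy example, not Netflix's pipeline: the `PlaybackEvent` and `GraphEdge` types, the "WATCHED" relationship, and the in-memory source are assumptions standing in for Keystone-fed streams and the real graph schema.

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Hypothetical event and edge types; the real schemas and sources are internal to Netflix.
class PlaybackEvent {
  public String profileId;
  public String titleId;
  public long timestampMs;

  public PlaybackEvent() {}

  public PlaybackEvent(String profileId, String titleId, long timestampMs) {
    this.profileId = profileId;
    this.titleId = titleId;
    this.timestampMs = timestampMs;
  }
}

class GraphEdge {
  public String sourceNode;   // e.g. "profile:<id>"
  public String targetNode;   // e.g. "title:<id>"
  public String relationship; // e.g. "WATCHED"
  public long timestampMs;

  @Override
  public String toString() {
    return sourceNode + " -[" + relationship + "]-> " + targetNode + " @" + timestampMs;
  }
}

public class RdgIngestSketch {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // In production this would be a Keystone-managed stream of CDC and request-log
    // events; a bounded in-memory source keeps the sketch self-contained.
    DataStream<PlaybackEvent> events =
        env.fromElements(new PlaybackEvent("p1", "t42", System.currentTimeMillis()));

    // Normalize each raw event and map it into a graph edge; the identifiers on
    // either end of the edge imply the graph's nodes (entities).
    DataStream<GraphEdge> edges =
        events.map(
            new MapFunction<PlaybackEvent, GraphEdge>() {
              @Override
              public GraphEdge map(PlaybackEvent ev) {
                GraphEdge edge = new GraphEdge();
                edge.sourceNode = "profile:" + ev.profileId;
                edge.targetNode = "title:" + ev.titleId;
                edge.relationship = "WATCHED";
                edge.timestampMs = ev.timestampMs;
                return edge;
              }
            });

    // In production this sink would write to the RDG storage layer.
    edges.print();
    env.execute("rdg-ingest-sketch");
  }
}
```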

netflix

Scaling Muse: How Netflix Powers Data-Driven Creative Insights at Trillion-Row Scale | by Netflix Technology Blog | Netflix TechBlog

Netflix's Muse platform has evolved from a simple dashboard into a high-scale Online Analytical Processing (OLAP) system that processes trillions of rows to provide creative insights for promotional media. To meet growing demands for complex audience affinity analysis and advanced filtering, the engineering team modernized the data serving layer by moving beyond basic batch pipelines. By integrating HyperLogLog sketches for approximate counting and leveraging in-memory precomputed aggregates, the system now delivers low-latency performance and high data accuracy at immense scale.

### Approximate Counting with HyperLogLog (HLL) Sketches

To track metrics like unique impressions and qualified plays without the massive overhead of comparing billions of profile IDs, Muse uses the Apache DataSketches library (a small example follows this summary).

* The system trades a small margin of error (approximately 0.8% with a logK of 17) for significant gains in processing speed and memory efficiency.
* Sketches are built during Druid ingestion using the HLLSketchBuild aggregator with rollup enabled to reduce data volume.
* In the Spark ETL process, all-time aggregates are maintained by merging new daily HLL sketches into existing ones using the `hll_union` function.

### Utilizing Hollow for In-Memory Aggregates

To reduce the query load on the Druid cluster, Netflix uses Hollow, an internally developed open-source tool designed for high-density, near-cache data sets.

* Muse stores precomputed, all-time aggregates—such as lifetime impressions per asset—within Hollow's in-memory data structures.
* When a user requests "all-time" data, the application retrieves the results from the Hollow cache instead of forcing Druid to scan months or years of historical segments.
* This approach significantly lowers latency for the most common queries and frees up Druid resources for more complex, dynamic filtering tasks.

### Optimizing the Druid Data Layer

Efficient data retrieval from Druid is critical for supporting the application's advanced grouping and filtering capabilities.

* The team transitioned from hash-based partitioning to range-based partitioning on frequently filtered dimensions like `video_id` to improve data locality and pruning.
* Background compaction tasks merge small segments into larger ones, reducing metadata overhead and improving scan speeds across the cluster.
* Specific tuning was applied to the Druid broker and historical nodes, including adjusting processing threads and buffer sizes to handle the high-concurrency demands of the Muse UI.

### Validation and Data Accuracy

Because the move to HLL sketches introduces approximation, the team implemented rigorous validation processes to ensure the data remains actionable.

* Internal debugging tools compare results from the new architecture against the "ground truth" provided by legacy batch systems.
* Continuous monitoring ensures that HLL error rates remain within the expected 1–2% range and that data stays consistent across different time grains.

For organizations building large-scale OLAP applications, the Muse architecture demonstrates that performance bottlenecks can often be solved by combining approximate data structures with specialized in-memory caches to offload heavy computations from the primary database.
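The sketch-and-merge pattern above can be illustrated with the Apache DataSketches Java library. This is a toy example under stated assumptions: the data, the two "daily" sketches, and the class name are made up for illustration, while `HllSketch` and `Union` are the library's HLL types; lgK = 17 mirrors the "logK of 17" mentioned above.

```java
import org.apache.datasketches.hll.HllSketch;
import org.apache.datasketches.hll.Union;

// Illustrative approximate unique counting with DataSketches HLL sketches.
public class UniqueImpressionsExample {
  public static void main(String[] args) {
    int lgK = 17; // sketch precision parameter; ~0.8% relative error at this setting

    // One sketch per day of impressions (profile IDs are synthetic for the example).
    HllSketch day1 = new HllSketch(lgK);
    HllSketch day2 = new HllSketch(lgK);
    for (int i = 0; i < 1_000_000; i++) {
      day1.update("profile-" + i);
    }
    for (int i = 500_000; i < 1_500_000; i++) {
      day2.update("profile-" + i);
    }

    // Merge the daily sketches into an all-time aggregate, analogous to the
    // hll_union step in the Spark ETL described above.
    Union allTime = new Union(lgK);
    allTime.update(day1);
    allTime.update(day2);

    // Expect roughly 1.5 million uniques; the sketch is tiny compared with the
    // raw set of profile IDs it summarizes.
    System.out.printf("approx unique profiles: %.0f%n", allTime.getResult().getEstimate());
  }
}
```

The merged sketch plays the role of the all-time aggregate that Muse keeps in Hollow: it is small enough to cache in memory and query directly, so "all-time" requests never have to scan historical Druid segments.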