microservices

8 posts

netflix

How Temporal Powers Reliable Cloud Operations at Netflix | by Netflix Technology Blog | Dec, 2025 | Netflix TechBlog

Netflix has significantly enhanced the reliability of its global continuous delivery platform, Spinnaker, by adopting Temporal for durable execution of cloud operations. By migrating away from a fragile, polling-based orchestration model between its internal services, the engineering team reduced transient deployment failures from 4% to 0.0001%. This shift has allowed developers to write complex, long-running operational logic as standard code while the underlying platform handles state persistence and fault recovery.

### Limitations of Legacy Orchestration

* **The Polling Bottleneck:** Originally, Netflix's orchestration engine (Orca) communicated with its cloud interface (Clouddriver) via a synchronous POST request followed by continuous polling of a GET endpoint to track task status.
* **State Fragility:** Clouddriver utilized an internal orchestration engine that relied on in-memory state or volatile Redis storage, meaning if a Clouddriver instance crashed mid-operation, the deployment state was often lost, leading to "zombie" tasks or failed deployments.
* **Manual Error Handling:** Developers had to manually implement complex retry logic, exponential backoffs, and state checkpointing for every cloud operation, which was both error-prone and difficult to maintain.

### Transitioning to Durable Execution with Temporal

* **Abstraction of Failures:** Temporal provides a "Durable Execution" platform where the state of a workflow—including local variables and thread stacks—is automatically persisted. This allows code to run "as if failures don’t exist," as the system can resume exactly where it left off after a process crash or network interruption.
* **Workflows and Activities:** Netflix re-architected cloud operations into Temporal Workflows (orchestration logic) and Activities (idempotent units of work like calling an AWS API). This separation ensures that the orchestration logic remains deterministic while external side effects are handled reliably (a minimal sketch of this split follows at the end of this summary).
* **Eliminating Polling:** By using Temporal’s signaling and long-running execution capabilities, Netflix moved away from the heavy overhead of thousands of services polling for status updates, replacing it with a push-based, event-driven model.

### Impact on Cloud Operations

* **Dramatic Reliability Gains:** The most significant outcome was the near-elimination of transient failures, moving from a 4% failure rate to 0.0001%, ensuring that critical updates to the Open Connect CDN and Live streaming infrastructure are executed with high confidence.
* **Developer Productivity:** Using Temporal’s SDKs, Netflix engineers can now write standard Java or Go code to define complex deployment strategies (like canary releases or blue-green deployments) without building custom state machines or management layers.
* **Operational Visibility:** Temporal provides a native UI and history audit log for every workflow, giving operators deep visibility into exactly which step of a deployment failed and why, along with the ability to retry specific failed steps manually if necessary.

For organizations managing complex, distributed cloud infrastructure, adopting a durable execution framework like Temporal is highly recommended. It moves the burden of state management and fault tolerance from the application layer to the platform, allowing engineers to focus on business logic rather than the mechanics of distributed systems failure.
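The post describes the Workflow/Activity split in prose only. As a rough illustration of the pattern, here is a minimal sketch using the public Temporal Java SDK; `DeployWorkflow`, `CloudActivities`, and the deployment steps are hypothetical stand-ins, not Netflix's actual Spinnaker or Clouddriver interfaces.

```java
// Minimal sketch of the Workflow/Activity split using the public Temporal Java SDK.
// DeployWorkflow, CloudActivities, and the deployment steps are hypothetical examples.
import io.temporal.activity.ActivityInterface;
import io.temporal.activity.ActivityOptions;
import io.temporal.common.RetryOptions;
import io.temporal.workflow.Workflow;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;

import java.time.Duration;

@ActivityInterface
interface CloudActivities {
  // Activities hold the side effects (e.g., cloud provider API calls) and should be idempotent.
  String createServerGroup(String cluster, String imageId);
  void waitUntilHealthy(String serverGroup);
  void shiftTraffic(String fromServerGroup, String toServerGroup);
}

@WorkflowInterface
interface DeployWorkflow {
  @WorkflowMethod
  void deploy(String cluster, String imageId);
}

class DeployWorkflowImpl implements DeployWorkflow {
  // Retries and timeouts are declared once; state persistence is handled by the platform.
  private final CloudActivities activities = Workflow.newActivityStub(
      CloudActivities.class,
      ActivityOptions.newBuilder()
          .setStartToCloseTimeout(Duration.ofMinutes(10))
          .setRetryOptions(RetryOptions.newBuilder().setMaximumAttempts(5).build())
          .build());

  @Override
  public void deploy(String cluster, String imageId) {
    // Plain, deterministic orchestration code; if the worker crashes here,
    // Temporal resumes the workflow from its persisted history.
    String newGroup = activities.createServerGroup(cluster, imageId);
    activities.waitUntilHealthy(newGroup);
    activities.shiftTraffic(cluster + "-current", newGroup);
  }
}
```

The orchestration method reads as straight-line code; retries, timeouts, and checkpointing live in the activity options and the Temporal service rather than in hand-written polling loops.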

netflix

Netflix Live Origin. Xiaomei Liu, Joseph Lynch, Chris Newton | by Netflix Technology Blog | Dec, 2025 | Netflix TechBlog

The Netflix Live Origin is a specialized, multi-tenant microservice designed to bridge the gap between cloud-based live streaming pipelines and the Open Connect content delivery network. By operating as an intelligent broker, it manages content selection across redundant regional pipelines to ensure that only valid, high-quality segments are distributed to client devices. This architecture allows Netflix to achieve high resilience and stream integrity through server-side failover and deterministic segment selection.

### Multi-Pipeline and Multi-Region Awareness

* The origin server mitigates common live streaming defects, such as missing segments, timing discontinuities, and short segments that are missing video or audio samples.
* It leverages independent, redundant streaming pipelines across different AWS regions to ensure high availability; if one pipeline fails or produces a defective segment, the origin selects a valid candidate from an alternate path.
* Implementation of epoch locking at the cloud encoder level allows the origin to interchangeably select segments from various pipelines.
* The system uses lightweight media inspection at the packager level to generate metadata, which the origin then uses to perform deterministic candidate selection (illustrated in the sketch at the end of this summary).

### Stream Distribution and Protocol Integration

* The service operates on AWS EC2 instances and utilizes standard HTTP protocol features for communication.
* Upstream packagers use HTTP PUT requests to push segments into storage at specific URLs, while the downstream Open Connect network retrieves them via GET requests.
* The architecture is optimized for a manifest design that uses segment templates and constant segment durations, which reduces the need for frequent manifest refreshes.

### Open Connect Streaming Optimization

* While Netflix’s Open Connect Appliances (OCAs) were originally optimized for VOD, the Live Origin extends nginx proxy-caching functionality to meet live-specific requirements.
* OCAs are provided with Live Event Configuration data, including Availability Start Times and initial segment numbers, to determine the legitimate range of segments for an event.
* This predictive modeling allows the CDN to reject requests for objects outside the valid range immediately, reducing unnecessary traffic and load on the origin.

By decoupling the live streaming pipeline from the distribution network through this specialized origin layer, Netflix can maintain a high level of fault tolerance and stream stability. This approach minimizes client-side complexity by handling failovers and segment selection on the server side, ensuring a seamless experience for viewers of live events.
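The article does not publish the selection logic itself. Purely as an illustration of what deterministic candidate selection over packager metadata could look like, here is a small sketch; the `SegmentCandidate` fields and validity rules are assumptions, not the Live Origin's actual implementation.

```java
// Illustrative sketch (not Netflix code) of deterministic candidate selection:
// given the same per-pipeline segment metadata, every origin instance picks the same candidate.
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// All candidates passed to the selector are assumed to describe the same segmentNumber,
// produced by redundant pipelines in different regions.
record SegmentCandidate(String pipelineId, long segmentNumber, double durationSec,
                        boolean hasVideoGap, boolean hasAudioGap) {

  // A candidate is valid if it carries no missing samples and has the expected duration.
  boolean isValid(double expectedDurationSec) {
    return !hasVideoGap && !hasAudioGap
        && Math.abs(durationSec - expectedDurationSec) < 0.01;
  }
}

class SegmentSelector {
  // Deterministic: filter out defective candidates, then break ties by a stable key
  // (here, lexicographic pipeline id) so redundant origins agree on the same choice.
  Optional<SegmentCandidate> select(List<SegmentCandidate> candidates, double expectedDurationSec) {
    return candidates.stream()
        .filter(c -> c.isValid(expectedDurationSec))
        .min(Comparator.comparing(SegmentCandidate::pipelineId));
  }
}
```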

line

Why did Athenz engineers take on the

Security platform engineer Jung-woo Kim details his transition from a specialized Athenz developer to a "Kubestronaut," a prestigious CNCF designation awarded to those who master the entire Kubernetes ecosystem. By systematically obtaining five distinct certifications, he argues that deep, practical knowledge of container orchestration is essential for building secure, scalable access control systems in private cloud environments. His journey demonstrates that moving beyond application-level expertise to master cluster administration and security directly improves architectural design and operational troubleshooting.

## The Kubestronaut Framework

* The title is awarded by the Cloud Native Computing Foundation (CNCF) to individuals who pass five specific certification exams: CKA, CKAD, CKS, KCNA, and KCSA.
* The CKA (Administrator), CKAD (Application Developer), and CKS (Security Specialist) exams are performance-based, requiring candidates to solve real-world technical problems in a live terminal environment rather than answering multiple-choice questions.
* Success in these exams demands a combination of deep technical knowledge, speed, and accuracy, as practitioners must configure clusters and resolve failures under strict time constraints.
* The remaining Associate-level exams (KCNA and KCSA) provide a theoretical foundation in cloud-native security and ecosystem standards.

## A Progressive Path to Technical Mastery

* **CKAD (Application Developer):** The initial focus was on mastering the deployment of Athenz—an open-source auth system—ensuring it runs efficiently from a developer's perspective. Preparation involved rigorous use of tools like killer.sh to simulate high-pressure environments.
* **CKA (Administrator):** To manage multi-cluster environments and understand the underlying components that make Kubernetes function, the author moved to the administrator level, gaining insight into how various services interact within the cluster.
* **CKS (Security Specialist):** Given his background in security, this was the most critical and difficult stage, focusing on cluster hardening, vulnerability analysis, and implementing strict network policies to ensure the entire infrastructure remains resilient.

## Organizational Impact and Open Source Governance

* Obtaining these certifications provided a clearer understanding of open-source governance, specifically how Special Interest Groups (SIGs) and pull request (PR) workflows drive massive projects like Kubernetes.
* This technical depth was applied to a high-stakes project providing Athenz services in a Bare Metal as a Service (BMaaS) environment, allowing for more stable and efficient architecture design.
* The learning process was supported by corporate initiatives, including access to Udemy Business for technical training and a hybrid work culture that allowed for consistent, early-morning study habits.

To achieve expert-level proficiency in complex systems like Kubernetes, engineers should adopt the "Ubo-cheonri" philosophy—making slow but steady progress. Starting with even one minute of study or a single GitHub commit per day can eventually lead to mastering the highest levels of cloud-native architecture. For those managing enterprise-grade infrastructure, pursuing the Kubestronaut path is highly recommended as it transforms theoretical knowledge into a broad, practical vision for system design.

toss

From Legacy Payment Ledger to Scalable

Toss Payments successfully modernized a 20-year-old legacy payment ledger by transitioning to a decoupled, MySQL-based architecture designed for high scalability and consistency. By implementing strategies like INSERT-only immutability and event-driven domain isolation, they overcame structural limitations such as the inability to handle split payments. Ultimately, the project demonstrates that robust system design must be paired with resilient operational recovery mechanisms to manage the complexities of large-scale financial migrations.

### Legacy Ledger Challenges

* **Inconsistent Schemas:** Different payment methods used entirely different table structures; for instance, a table named `REFUND` unexpectedly contained only account transfer data rather than all refund types.
* **Domain Coupling:** Multiple domains (settlement, accounting, and payments) shared the same tables and columns, meaning a single schema change required impact analysis across several teams.
* **Structural Limits:** A rigid 1:1 relationship between a payment and its method prevented the implementation of modern features like split payments or "Dutch pay" models.

### New Ledger Architecture

* **Data Immutability:** The system shifted from updating existing rows to an **INSERT-only** principle, ensuring a reliable audit trail and preventing database deadlocks.
* **Event-Driven Decoupling:** Instead of direct database access, the system uses Kafka to publish payment events, allowing independent domains to consume data without tight coupling.
* **Payment-Approval Separation:** By separating the "Payment" (the transaction intent) from the "Approval" (the specific financial method), the system now supports multiple payment methods per transaction.

### Safe Migration and Data Integrity

* **Asynchronous Mirroring:** To maintain zero downtime, data was initially written to the legacy system and then asynchronously loaded into the new MySQL ledger.
* **Resource Tuning:** Developers used dedicated migration servers within the same AWS Availability Zone to minimize latency and implemented **Bulk Inserts** to handle hundreds of millions of rows efficiently.
* **Verification Batches:** A separate batch process ran every five minutes against a Read-Only (RO) database to identify and correct any data gaps caused by asynchronous processing failures.

### Operational Resilience and Incident Response

* **Query Optimization:** During a load spike, the MySQL optimizer chose "Full Scans" over indexes; the team resolved this by adding SQL hints and keeping the last five Docker image versions on hand for rapid rollbacks.
* **Network Cancellation:** To handle timeouts between Toss and external card issuers, the system automatically sends cancellation requests and synchronizes states.
* **Timeout Standardization:** Discrepancies between microservices were resolved by calculating the maximum processing time of approval servers and aligning all upstream timeout settings to prevent merchant response mismatches.
* **Reliable Event Delivery:** While using the **Outbox pattern** for events, the team added log-based recovery (Elasticsearch and local disk) and idempotency keys in event headers to handle both missing and duplicate messages (see the sketch at the end of this summary).

For organizations tackling significant technical debt, this transition highlights that initial design is only half the battle. True system reliability comes from building "self-healing" structures—such as automated correction batches and standardized timeout chains—that can survive the unpredictable nature of live production environments.
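The write-up describes the INSERT-only principle and the Outbox pattern without code. The following JDBC sketch shows the general shape under assumed table names (`approval_entry`, `outbox`) and a made-up payload; it is not Toss Payments' schema, and the relay that reads the outbox table and publishes to Kafka is omitted.

```java
// Minimal sketch (assumptions, not Toss's code) of INSERT-only ledger writes plus the Outbox pattern:
// the ledger entry and the outbound event are written in the same database transaction,
// and a separate relay later publishes outbox rows to Kafka with the event id as an idempotency key.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.UUID;

class LedgerWriter {
  // Table and column names are hypothetical.
  void recordApproval(Connection conn, String paymentId, String method, long amount) throws Exception {
    conn.setAutoCommit(false);
    try (PreparedStatement ledger = conn.prepareStatement(
             "INSERT INTO approval_entry (payment_id, method, amount) VALUES (?, ?, ?)");
         PreparedStatement outbox = conn.prepareStatement(
             "INSERT INTO outbox (event_id, topic, payload) VALUES (?, ?, ?)")) {

      // INSERT-only: state changes append new rows instead of updating existing ones.
      ledger.setString(1, paymentId);
      ledger.setString(2, method);
      ledger.setLong(3, amount);
      ledger.executeUpdate();

      // The event id doubles as the idempotency key carried in the event header,
      // so consumers can drop duplicates delivered by the relay or by log-based recovery.
      outbox.setString(1, UUID.randomUUID().toString());
      outbox.setString(2, "payment.approved");
      outbox.setString(3, "{\"paymentId\":\"" + paymentId + "\",\"amount\":" + amount + "}");
      outbox.executeUpdate();

      conn.commit();
    } catch (Exception e) {
      conn.rollback();
      throw e;
    }
  }
}
```

Because the ledger row and the outbox row commit atomically, a consumer that sees the event can trust that the corresponding approval entry exists, and the idempotency key lets it discard duplicates.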

netflix

How and Why Netflix Built a Real-Time Distributed Graph: Part 1 — Ingesting and Processing Data Streams at Internet Scale | by Netflix Technology Blog | Netflix TechBlog

Netflix has developed a Real-Time Distributed Graph (RDG) to unify member interaction data across its expanding business verticals, including streaming, live events, and mobile gaming. By transitioning from siloed microservice data to a graph-based model, the company can perform low-latency, relationship-centric queries that were previously hindered by expensive manual joins and data fragmentation. The resulting system enables Netflix to track user journeys across various devices and platforms in real time, providing a foundation for deeper personalization and pattern detection.

### Challenges of Data Isolation in Microservices

* While Netflix’s microservices architecture facilitates independent scaling and service decomposition, it inherently leads to data isolation where each service manages its own storage.
* Data scientists and engineers previously had to "stitch" together disparate data from various databases and the central data warehouse, which was a slow and manual process.
* The RDG moves away from table-based models to a relationship-centric model, allowing for efficient "hops" across nodes without the need for complex denormalization.
* This flexibility allows the system to adapt to new business entities (like live sports or games) without requiring massive schema re-architectures.

### Real-Time Ingestion and Normalization

* The ingestion layer is designed to capture events from diverse upstream sources, including Change Data Capture (CDC) from databases and request/response logs.
* Netflix utilizes its internal data pipeline, Keystone, to funnel these high-volume event streams into the processing framework.
* The system must handle "Internet scale" data, ensuring that events from millions of members are captured as they happen to maintain an up-to-date view of the graph.

### Stream Processing with Apache Flink

* Netflix uses Apache Flink as the core stream processing engine to handle the transformation of raw events into graph entities.
* Incoming data undergoes normalization to ensure a standardized format, regardless of which microservice or business vertical the data originated from.
* The pipeline performs data enrichment, joining incoming streams with auxiliary metadata to provide a comprehensive context for each interaction.
* The final step of the processing layer involves mapping these enriched events into a graph structure of nodes (entities) and edges (relationships), which are then emitted to the system's storage layer (a rough Flink sketch of this mapping step follows at the end of this summary).

### Practical Conclusion

Organizations operating with a highly decoupled microservices architecture should consider a graph-based ingestion strategy to overcome the limitations of data silos. By leveraging stream processing tools like Apache Flink to build a real-time graph, engineering teams can provide stakeholders with the ability to discover hidden relationships and cross-domain insights that are often lost in traditional data warehouses.
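The post stays at the architecture level; the sketch below shows roughly what the final mapping step could look like with the open-source Apache Flink DataStream API. `RawEvent`, `GraphEdge`, and the field names are invented stand-ins for Netflix's internal schemas, and a real job would consume Keystone-backed streams rather than in-memory elements.

```java
// Rough sketch of the event-to-edge mapping step using the Apache Flink DataStream API.
// RawEvent and GraphEdge are hypothetical stand-ins for Netflix's internal schemas.
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class GraphIngestJob {

  // Normalized upstream event (e.g., from CDC or request/response logs).
  public static class RawEvent {
    public String profileId;
    public String entityId;
    public String action;   // e.g., "PLAYED", "CLICKED"
    public long timestampMs;
    public RawEvent() {}
    public RawEvent(String profileId, String entityId, String action, long timestampMs) {
      this.profileId = profileId; this.entityId = entityId; this.action = action; this.timestampMs = timestampMs;
    }
  }

  // Edge between two graph nodes (member profile -> entity), ready for the storage layer.
  public static class GraphEdge {
    public String fromNode;
    public String toNode;
    public String relation;
    public long timestampMs;
    public GraphEdge() {}
    @Override public String toString() {
      return fromNode + " -[" + relation + "]-> " + toNode + " @" + timestampMs;
    }
  }

  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // In production this source would be the Keystone pipeline; fromElements keeps the sketch runnable.
    DataStream<RawEvent> events = env.fromElements(
        new RawEvent("profile-1", "title-42", "PLAYED", 1700000000000L));

    // Map each normalized, enriched event into a node/edge structure.
    DataStream<GraphEdge> edges = events.map(new MapFunction<RawEvent, GraphEdge>() {
      @Override
      public GraphEdge map(RawEvent e) {
        GraphEdge edge = new GraphEdge();
        edge.fromNode = "profile:" + e.profileId;
        edge.toNode = "entity:" + e.entityId;
        edge.relation = e.action;
        edge.timestampMs = e.timestampMs;
        return edge;
      }
    });

    edges.print(); // Stand-in for the sink that emits edges to the graph storage layer.
    env.execute("rdg-ingest-sketch");
  }
}
```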

line

Pushsphere: The Secret to

LINE developed Pushsphere to overcome the inherent instability and rate-limiting challenges of delivering high-volume push notifications via providers like APNs and FCM. By implementing a sophisticated gateway architecture rather than relying on naive retry logic, the system ensures reliable delivery even during massive traffic spikes or regional emergencies. This approach has successfully stabilized the messaging pipeline, drastically reducing operational overhead and system-wide failures.

## Limitations of Standard Push Architectures

* External push providers are frequently unstable, exhibiting misbehaving instances, sudden disconnections, and unpredictable timeouts.
* Naive retry strategies often lead to "retry storms," which quickly exhaust rate-limit quotas and result in HTTP 429 (Too Many Requests) errors.
* At massive scales, manual management of hundreds of server connections becomes impossible, necessitating automated decisions on when to abandon or switch between faulty nodes.

## Unified Gateway Design and High-Performance Transport

* Pushsphere provides a single entry point for all push platforms, abstracting the complexities of mTLS for Apple and OAuth 2.0 for Firebase.
* The system is built on the Armeria microservice framework and utilizes Netty for high-performance, non-blocking communication within the Java Virtual Machine.
* The architecture includes a client library and gateway server that support zone-aware routing, ensuring low latency and efficient traffic distribution across data centers.

## Intelligent Retry and Load Balancing Strategies

* The "retry-aware" load balancer uses a Round Robin base strategy but is designed to skip previously attempted endpoints during a retry cycle to avoid repeated failures on faulty nodes (a minimal sketch of this idea follows at the end of this summary).
* Quota-aware logic monitors rate limits in real time, preventing the system from retrying endpoints that are nearing their capacity.
* These smarter traffic distribution rules balance high delivery success rates with the preservation of provider quotas, preventing service-wide blocking.

## Resilient Endpoint Management via Circuit Breakers

* Pushsphere assigns a dedicated circuit breaker to every endpoint to report success and failure rates continuously.
* When a circuit opens due to frequent failures, the unhealthy endpoint is immediately removed from the active pool and replaced with a fresh candidate from a DNS-refreshed pool.
* This automated replacement mechanism maintains a consistent pool of healthy endpoints, allowing the system to remain stable without manual intervention during hardware or network degradations.

Pushsphere has transformed LINE's notification infrastructure, reducing annual on-call alerts from over 30 to just four, despite implementing stricter monitoring thresholds. For developers managing high-volume messaging services, adopting a gateway-based approach with automated circuit breaking and quota awareness is a proven path to achieving carrier-grade reliability.
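The post describes the retry-aware load balancer conceptually. The fragment below sketches just the "skip already-attempted endpoints" idea in plain Java, with hypothetical names and none of Pushsphere's Armeria integration, circuit breaking, or quota accounting.

```java
// Illustrative sketch (not LINE's implementation) of a retry-aware round-robin selector:
// on a retry, endpoints already attempted for this request are skipped so the retry
// lands on a different node instead of hammering the same faulty one.
import java.util.List;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

class RetryAwareRoundRobin {
  private final List<String> endpoints;       // healthy endpoints, refreshed elsewhere (e.g., DNS + circuit breakers)
  private final AtomicInteger cursor = new AtomicInteger();

  RetryAwareRoundRobin(List<String> endpoints) {
    this.endpoints = endpoints;
  }

  // attempted: endpoints already tried for this particular push request.
  String next(Set<String> attempted) {
    for (int i = 0; i < endpoints.size(); i++) {
      String candidate = endpoints.get(Math.floorMod(cursor.getAndIncrement(), endpoints.size()));
      if (!attempted.contains(candidate)) {
        return candidate;
      }
    }
    // Every endpoint was already attempted; surface the failure rather than retry-storming.
    throw new IllegalStateException("no untried endpoints left for this request");
  }
}
```

A caller would record each endpoint it has tried for a given push request in the `attempted` set before asking for the next candidate.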

netflix

Scaling Muse: How Netflix Powers Data-Driven Creative Insights at Trillion-Row Scale | by Netflix Technology Blog | Netflix TechBlog

Netflix’s Muse platform has evolved from a simple dashboard into a high-scale Online Analytical Processing (OLAP) system that processes trillions of rows to provide creative insights for promotional media. To meet growing demands for complex audience affinity analysis and advanced filtering, the engineering team modernized the data serving layer by moving beyond basic batch pipelines. By integrating HyperLogLog sketches for approximate counting and leveraging in-memory precomputed aggregates, the system now delivers low-latency performance and high data accuracy at an immense scale.

### Approximate Counting with HyperLogLog (HLL) Sketches

To track metrics like unique impressions and qualified plays without the massive overhead of comparing billions of profile IDs, Muse utilizes the Apache DataSketches library (a minimal usage sketch follows at the end of this summary).

* The system trades a small margin of error (approximately 0.8% with a logK of 17) for significant gains in processing speed and memory efficiency.
* Sketches are built during Druid ingestion using the HLLSketchBuild aggregator with rollup enabled to reduce data volume.
* In the Spark ETL process, all-time aggregates are maintained by merging new daily HLL sketches into existing ones using the `hll_union` function.

### Utilizing Hollow for In-Memory Aggregates

To reduce the query load on the Druid cluster, Netflix uses Hollow, its open-source tool designed for high-density, near-cache data sets.

* Muse stores precomputed, all-time aggregates—such as lifetime impressions per asset—within Hollow’s in-memory data structures.
* When a user requests "all-time" data, the application retrieves the results from the Hollow cache instead of forcing Druid to scan months or years of historical segments.
* This approach significantly lowers latency for the most common queries and frees up Druid resources for more complex, dynamic filtering tasks.

### Optimizing the Druid Data Layer

Efficient data retrieval from Druid is critical for supporting the application’s advanced grouping and filtering capabilities.

* The team transitioned from hash-based partitioning to range-based partitioning on frequently filtered dimensions like `video_id` to improve data locality and pruning.
* Background compaction tasks are utilized to merge small segments into larger ones, reducing metadata overhead and improving scan speeds across the cluster.
* Specific tuning was applied to the Druid broker and historical nodes, including adjusting processing threads and buffer sizes to handle the high-concurrency demands of the Muse UI.

### Validation and Data Accuracy

Because the move to HLL sketches introduces approximation, the team implemented rigorous validation processes to ensure the data remained actionable.

* Internal debugging tools were developed to compare results from the new architecture against the "ground truth" provided by legacy batch systems.
* Continuous monitoring ensures that HLL error rates remain within the expected 1–2% range and that data remains consistent across different time grains.

For organizations building large-scale OLAP applications, the Muse architecture demonstrates that performance bottlenecks can often be solved by combining approximate data structures with specialized in-memory caches to offload heavy computations from the primary database.
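The post names the Apache DataSketches HLL primitives but includes no code. The example below uses that open-source library directly with the lgK = 17 setting mentioned above; the profile IDs and counts are fabricated, and the in-process union merely stands in for the Spark-side `hll_union` merge and the Druid `HLLSketchBuild` aggregator.

```java
// Minimal sketch of approximate unique counting with the Apache DataSketches HLL library,
// mirroring the "logK of 17" configuration mentioned in the post. IDs and counts are made up.
import org.apache.datasketches.hll.HllSketch;
import org.apache.datasketches.hll.Union;

public class ImpressionCounter {
  private static final int LG_K = 17; // small relative error in exchange for tiny, fixed memory use

  public static void main(String[] args) {
    // Daily sketch: each unique profile id is fed into the HLL sketch instead of being stored.
    HllSketch today = new HllSketch(LG_K);
    for (int i = 0; i < 1_000_000; i++) {
      today.update("profile-" + i);
    }

    // All-time aggregate built previously (e.g., loaded from the prior ETL run).
    HllSketch allTimeSoFar = new HllSketch(LG_K);
    for (int i = 500_000; i < 1_500_000; i++) {
      allTimeSoFar.update("profile-" + i);
    }

    // Union the daily sketch into the all-time aggregate (conceptually what hll_union does in the ETL).
    Union union = new Union(LG_K);
    union.update(allTimeSoFar);
    union.update(today);

    double estimate = union.getResult().getEstimate();
    System.out.printf("approximate all-time uniques: %.0f (true value: 1,500,000)%n", estimate);
  }
}
```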

coupang

Coupang SCM Workflow: Developing

Coupang has developed an internal SCM Workflow platform to streamline the complex data and operational needs of its Supply Chain Management team. By implementing low-code and no-code functionalities, the platform enables developers, data scientists, and business analysts to build data pipelines and launch services without the traditional bottlenecks of manual development.

### Addressing Inefficiencies in SCM Data Management

* The SCM team manages a massive network of suppliers and fulfillment centers (FCs) where demand forecasting and inventory distribution require constant data feedback.
* Traditionally, non-technical stakeholders like business analysts (BAs) relied heavily on developers to build or modify data pipelines, leading to high communication costs and slower response times to changing business requirements.
* The new platform aims to simplify the complexity found in traditional tools like Jenkins, Airflow, and Jupyter Notebooks, providing a unified interface for data creation and visualization.

### Democratizing Access with the No-code Data Builder

* The "Data Builder" allows users to perform data queries, extraction, and system integration through a visual interface rather than writing backend code.
* It provides seamless access to a wide array of data sources used across Coupang, including Redshift, Hive, Presto, Aurora, MySQL, Elasticsearch, and S3.
* Users can construct workflows by creating "nodes" for specific tasks—such as extracting inventory data from Hive or calculating transfer quantities—and linking them together to automate complex decisions like inter-center product transfers (a hypothetical sketch of this node model follows at the end of this summary).

### Expanding Capabilities through Low-code Service Building

* The platform functions as a "Service Builder," allowing users to expand domains and launch simple services without building entirely new infrastructure from scratch.
* This approach enables developers to focus on high-level algorithm development while allowing data scientists to apply and test new models directly within the production environment.
* By reducing the need for code changes to reflect new requirements, the platform significantly increases the agility of the SCM pipeline.

Organizations managing complex, data-driven ecosystems can significantly reduce operational friction by adopting low-code/no-code platforms. Empowering non-technical stakeholders to handle data processing and service integration not only accelerates innovation but also allows engineering resources to be redirected toward core architectural challenges.
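The article does not detail the platform's internals, so the following is purely a hypothetical illustration of the node-and-link idea: task nodes wired into a small workflow and executed in order, with each node able to read the outputs of the nodes before it. None of the names reflect Coupang's actual design.

```java
// Hypothetical illustration (not Coupang's code) of a no-code "Data Builder" style workflow:
// named task nodes are linked into an ordered pipeline, and each node can read the
// outputs of the nodes that ran before it.
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

class DataBuilderWorkflow {

  // A node wraps one task, e.g. "extract inventory from Hive" or "compute transfer quantities".
  record Node(String name, Function<Map<String, Object>, Object> task) {}

  private final Map<String, Node> nodes = new LinkedHashMap<>();

  DataBuilderWorkflow add(String name, Function<Map<String, Object>, Object> task) {
    nodes.put(name, new Node(name, task));
    return this;
  }

  // Executes nodes in the order they were linked, accumulating outputs by node name.
  Map<String, Object> run() {
    Map<String, Object> outputs = new LinkedHashMap<>();
    for (Node node : nodes.values()) {
      outputs.put(node.name(), node.task().apply(outputs));
    }
    return outputs;
  }

  public static void main(String[] args) {
    Map<String, Object> result = new DataBuilderWorkflow()
        .add("inventoryByFc", prior -> Map.of("FC-1", 120, "FC-2", 30)) // stand-in for a Hive query node
        .add("transferPlan", prior -> {
          @SuppressWarnings("unchecked")
          Map<String, Integer> inv = (Map<String, Integer>) prior.get("inventoryByFc");
          return "move " + (inv.get("FC-1") - inv.get("FC-2")) / 2 + " units FC-1 -> FC-2";
        })
        .run();
    System.out.println(result.get("transferPlan")); // prints: move 45 units FC-1 -> FC-2
  }
}
```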