orchestration

3 posts

netflix

How Temporal Powers Reliable Cloud Operations at Netflix | by Netflix Technology Blog | Dec, 2025 | Netflix TechBlog

Netflix has significantly enhanced the reliability of its global continuous delivery platform, Spinnaker, by adopting Temporal for durable execution of cloud operations. By migrating away from a fragile, polling-based orchestration model between its internal services, the engineering team reduced transient deployment failures from 4% to 0.0001%. This shift allows developers to write complex, long-running operational logic as standard code while the underlying platform handles state persistence and fault recovery.

### Limitations of Legacy Orchestration

* **The Polling Bottleneck:** Originally, Netflix's orchestration engine (Orca) communicated with its cloud interface (Clouddriver) via a synchronous POST request followed by continuous polling of a GET endpoint to track task status.
* **State Fragility:** Clouddriver's internal orchestration engine relied on in-memory state or volatile Redis storage, so if a Clouddriver instance crashed mid-operation, the deployment state was often lost, leading to "zombie" tasks or failed deployments.
* **Manual Error Handling:** Developers had to manually implement retry logic, exponential backoff, and state checkpointing for every cloud operation, which was both error-prone and difficult to maintain.

### Transitioning to Durable Execution with Temporal

* **Abstraction of Failures:** Temporal provides a "durable execution" platform in which the state of a workflow, including local variables and thread stacks, is automatically persisted. This allows code to run "as if failures don’t exist," since the system can resume exactly where it left off after a process crash or network interruption.
* **Workflows and Activities:** Netflix re-architected cloud operations into Temporal Workflows (orchestration logic) and Activities (idempotent units of work, such as calling an AWS API). This separation keeps the orchestration logic deterministic while external side effects are handled reliably.
* **Eliminating Polling:** Temporal’s signaling and long-running execution capabilities let Netflix replace the heavy overhead of thousands of services polling for status updates with a push-based, event-driven model.

### Impact on Cloud Operations

* **Dramatic Reliability Gains:** The most significant outcome was the near-elimination of transient failures, from a 4% failure rate to 0.0001%, ensuring that critical updates to the Open Connect CDN and Live streaming infrastructure execute with high confidence.
* **Developer Productivity:** Using Temporal’s SDKs, Netflix engineers can now write standard Java or Go code to define complex deployment strategies (such as canary releases or blue-green deployments) without building custom state machines or management layers.
* **Operational Visibility:** Temporal provides a native UI and a history audit log for every workflow, giving operators deep visibility into exactly which step of a deployment failed and why, along with the ability to manually retry specific failed steps when necessary.

For organizations managing complex, distributed cloud infrastructure, adopting a durable execution framework like Temporal is highly recommended. It moves the burden of state management and fault tolerance from the application layer to the platform, letting engineers focus on business logic rather than the mechanics of distributed-systems failure.
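To make the workflow/activity split concrete, here is a minimal pure-Python sketch. It is not the Temporal SDK (the article describes Netflix using Temporal's Java and Go SDKs); `run_activity`, `push_config`, and `deploy_workflow` are hypothetical names used only to show how retries and exponential backoff move out of application code and into the platform.

```python
import time

class TransientError(Exception):
    """Stands in for a network blip or a throttled cloud API call."""

def run_activity(fn, *args, max_attempts=5, base_delay=0.01):
    # Platform-side retry loop: the workflow author never writes this.
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args)
        except TransientError:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff

attempts = {"push_config": 0}

def push_config(region):
    # An idempotent activity: one unit of work with external side effects.
    attempts["push_config"] += 1
    if attempts["push_config"] < 3:  # fail transiently twice, then succeed
        raise TransientError("network blip")
    return f"config pushed to {region}"

def deploy_workflow(regions):
    # Deterministic orchestration logic written as ordinary sequential code.
    return [run_activity(push_config, region) for region in regions]

results = deploy_workflow(["us-east-1"])
```

In Temporal itself, the equivalent retry policy is declared on the activity and the server persists workflow state across process crashes, so even the retry loop above disappears from user code.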

naver

VLOps: Event-driven MLOps & Omni-Evaluator

Naver’s VLOps framework introduces an event-driven approach to MLOps, designed to overcome the rigidity of traditional pipeline-based systems like Kubeflow. By shifting from a monolithic pipeline structure to a system governed by autonomous sensors and typed messages, Naver has achieved a highly decoupled, scalable environment for multimodal AI development. This architecture allows seamless functional expansion and cross-cloud compatibility, ultimately simplifying the transition from model training to large-scale evaluation and deployment.

### Event-Driven MLOps Architecture

* Operations such as training, evaluation, and deployment are defined as "Typed Messages," which serve as the primary units of communication within the system.
* An "Event Sensor" acts as the core logic hub, autonomously detecting these messages and triggering the corresponding tasks without requiring a predefined, end-to-end pipeline.
* The system eliminates complex version management of entire pipelines: new features can be integrated simply by adding new message types.
* This approach keeps the evaluation and deployment systems loosely coupled, facilitating easier maintenance and infrastructure flexibility.

### Omni-Evaluator and Unified Benchmarking

* The Omni-Evaluator is a centralized platform that integrates various evaluation engines and benchmarks into a single workflow.
* It supports real-time monitoring of model performance, allowing researchers to track progress during the training and validation phases.
* It is designed specifically to handle the complexities of multimodal LLMs, providing a standardized environment for diverse testing scenarios.
* User-driven triggers are supported, so developers can initiate specific evaluation cycles manually when necessary.
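In code, the typed-message-plus-sensor pattern might look like the following pure-Python sketch. The message classes and `EventSensor` here are hypothetical stand-ins for illustration, not Naver's actual VLOps API; the point is that adding a capability means registering a new message type, not versioning a pipeline.

```python
from dataclasses import dataclass

# Hypothetical typed messages: the units of communication in the system.
@dataclass
class TrainRequested:
    model: str

@dataclass
class EvalRequested:
    model: str
    benchmark: str

class EventSensor:
    """Routes each typed message to its registered handler. Extending the
    system means registering a new message type; no end-to-end pipeline is
    edited or re-versioned."""

    def __init__(self):
        self._handlers = {}

    def on(self, msg_type, handler):
        self._handlers[msg_type] = handler

    def dispatch(self, msg):
        return self._handlers[type(msg)](msg)

sensor = EventSensor()
sensor.on(TrainRequested, lambda m: f"training {m.model}")
sensor.on(EvalRequested, lambda m: f"evaluating {m.model} on {m.benchmark}")

# Handlers stay decoupled: the sender only emits a message.
out = sensor.dispatch(EvalRequested(model="vlm-base", benchmark="mmmu"))
```

In a production system the dispatch would cross a message broker rather than a dict lookup, but the coupling story is the same: producers and consumers agree only on message types.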
### VLOps Dashboard and User Experience

* The VLOps Dashboard acts as a central hub where users can manage the entire ML lifecycle without deep knowledge of the underlying orchestration logic.
* Users can trigger complex pipelines simply by issuing a message, abstracting away the technical difficulties of cloud infrastructure.
* The dashboard provides a visual interface for monitoring events, message flows, and evaluation results, improving overall transparency for data scientists and researchers.

For organizations managing large-scale multimodal models, moving toward an event-driven architecture is highly recommended. This model reduces the overhead of maintaining rigid pipelines and allows engineering teams to focus on model quality rather than infrastructure orchestration.

aws

Build multi-step applications and AI workflows with AWS Lambda durable functions | AWS News Blog

AWS Lambda durable functions introduce a simplified way to manage complex, long-running workflows directly within the standard Lambda experience. Using a checkpoint-and-replay mechanism, developers can now write sequential code for multi-step processes, with state management and retries handled automatically and without external orchestration services. The feature also significantly reduces the cost of long-running tasks by allowing functions to suspend execution for up to one year without incurring compute charges during idle periods.

### Durable Execution Mechanism

* The system uses a "durable execution" model based on checkpointing and replay to maintain state across function restarts.
* When a function is interrupted or resumes from a pause, Lambda re-executes the handler from the beginning but skips already-completed operations by referencing saved checkpoints.
* This architecture ensures that business logic remains resilient to failures and can survive execution-environment recycles.
* Execution state can be maintained for extended periods, supporting workflows that require human intervention or long-duration external processes.

### Programming Primitives and SDK

* The feature requires including a new open-source durable execution SDK in the function code.
* **Steps:** The `context.step()` method defines specific blocks of logic that the system checkpoints and automatically retries upon failure.
* **Wait:** The `context.wait()` primitive lets the function terminate and release compute resources while waiting for a specified duration, resuming only when the time elapses.
* **Callbacks:** Developers can use `create_callback()` to pause execution until an external event, such as an API response or a manual approval, is received.
* **Advanced Control:** The SDK includes `wait_for_condition()` for polling external statuses, plus `parallel()` and `map()` operations for managing concurrent execution paths.
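The replay-with-skip behavior can be sketched with a toy `Context` whose `step()` mimics checkpoint-and-skip semantics. This is an illustration of the mechanism only, not the real SDK's types or signatures, and `charge_card` and `send_receipt` are hypothetical step names.

```python
class Context:
    """Toy stand-in for the durable execution context (not the real SDK)."""

    def __init__(self, checkpoints=None):
        self.checkpoints = dict(checkpoints or {})  # durable store stand-in
        self.executed = []                          # steps that actually ran

    def step(self, name, fn):
        if name in self.checkpoints:     # replay: skip completed work
            return self.checkpoints[name]
        result = fn()                    # first run: do the work...
        self.executed.append(name)
        self.checkpoints[name] = result  # ...and checkpoint the result
        return result

def handler(context):
    # Sequential business logic; each step is checkpointed independently.
    charge = context.step("charge_card", lambda: "charge-123")
    receipt = context.step("send_receipt", lambda: f"receipt for {charge}")
    return receipt

# First execution runs and checkpoints both steps.
first = Context()
handler(first)

# After a crash or pause, the handler re-runs from the top against the
# saved checkpoints and reaches the same result without redoing any step.
replay = Context(first.checkpoints)
result = handler(replay)
```

Because the handler is replayed from the beginning, step bodies must have their side effects confined to `step()` calls; code outside steps should be deterministic, which is the same constraint the real mechanism implies.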
### Configuration and Setup

* Durable execution must be enabled when the Lambda function is created; it cannot be retroactively enabled for existing functions.
* Once enabled, the function keeps the same event-handler structure and service integrations as a standard Lambda function.
* The environment is optimized for high-reliability use cases like payment processing, AI agent orchestration, and complex order management.

AWS Lambda durable functions represent a major shift for developers who need the power of stateful orchestration but prefer to keep their logic within a single code-based environment. They are highly recommended for building AI workflows and multi-step business processes where state persistence and cost-efficiency are critical requirements.