durable-execution

2 posts

netflix

How Temporal Powers Reliable Cloud Operations at Netflix | by Netflix Technology Blog | Dec, 2025 | Netflix TechBlog

Netflix has significantly improved the reliability of its global continuous delivery platform, Spinnaker, by adopting Temporal for durable execution of cloud operations. By migrating away from a fragile, polling-based orchestration model between its internal services, the engineering team reduced transient deployment failures from 4% to 0.0001%. Developers can now write complex, long-running operational logic as standard code while the underlying platform handles state persistence and fault recovery.

### Limitations of Legacy Orchestration

* **The Polling Bottleneck:** Originally, Netflix's orchestration engine (Orca) communicated with its cloud interface (Clouddriver) via a synchronous POST request followed by continuous polling of a GET endpoint to track task status.
* **State Fragility:** Clouddriver used an internal orchestration engine that relied on in-memory state or volatile Redis storage. If a Clouddriver instance crashed mid-operation, the deployment state was often lost, leading to "zombie" tasks or failed deployments.
* **Manual Error Handling:** Developers had to implement retry logic, exponential backoff, and state checkpointing by hand for every cloud operation, which was both error-prone and difficult to maintain.

### Transitioning to Durable Execution with Temporal

* **Abstraction of Failures:** Temporal provides a "Durable Execution" platform in which a workflow's state, including its local variables and its position in the code, is preserved automatically: Temporal records each workflow's event history and replays it to rebuild that state. This allows code to run "as if failures don’t exist," because the system can resume exactly where it left off after a process crash or network interruption.
* **Workflows and Activities:** Netflix re-architected cloud operations into Temporal Workflows (orchestration logic) and Activities (idempotent units of work like calling an AWS API). This separation keeps the orchestration logic deterministic while external side effects are handled reliably.
* **Eliminating Polling:** By using Temporal’s signaling and long-running execution capabilities, Netflix moved away from the heavy overhead of thousands of services polling for status updates, replacing them with a push-based, event-driven model.

### Impact on Cloud Operations

* **Dramatic Reliability Gains:** The most significant outcome was the near-elimination of transient failures, from a 4% failure rate to 0.0001%, so that critical updates to the Open Connect CDN and Live streaming infrastructure are executed with high confidence.
* **Developer Productivity:** Using Temporal’s SDKs, Netflix engineers can now write standard Java or Go code to define complex deployment strategies (like canary releases or blue-green deployments) without building custom state machines or management layers.
* **Operational Visibility:** Temporal provides a native UI and a history audit log for every workflow, giving operators visibility into exactly which step of a deployment failed and why, along with the ability to retry specific failed steps manually if necessary.

For organizations managing complex, distributed cloud infrastructure, adopting a durable execution framework like Temporal is highly recommended. It moves the burden of state management and fault tolerance from the application layer to the platform, allowing engineers to focus on business logic rather than the mechanics of distributed systems failure.
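
The Workflow/Activity split and the retry handling described above can be illustrated with a small toy model. This is a minimal sketch in plain Python, not Temporal's SDK and not Netflix's code: `FakeCloud`, `run_activity`, `deploy_workflow`, and the `deploy-42` token are all hypothetical names chosen for illustration. It shows an idempotent activity being retried with exponential backoff, which is the kind of logic a durable execution platform takes over from application code.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict


class TransientError(Exception):
    """A failure expected to clear on retry (e.g. a throttled API call)."""


@dataclass
class FakeCloud:
    """Hypothetical stand-in for a cloud API; fails the first `failures` calls."""
    failures: int
    calls: int = 0
    resources: Dict[str, str] = field(default_factory=dict)

    def create_server_group(self, token: str, name: str) -> str:
        self.calls += 1
        if self.calls <= self.failures:
            raise TransientError("throttled")
        # Idempotency: the same token always maps to the same resource,
        # so a repeated call cannot create a duplicate.
        return self.resources.setdefault(token, name)


def run_activity(fn: Callable[[], str], max_attempts: int = 5,
                 base_delay: float = 0.1,
                 sleep: Callable[[float], None] = time.sleep) -> str:
    """Retry `fn` with exponential backoff (assumed policy, for illustration)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")


def deploy_workflow(cloud: FakeCloud,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    """Orchestration logic: deterministic, side effects only inside activities."""
    return run_activity(
        lambda: cloud.create_server_group(token="deploy-42", name="api-v2"),
        sleep=sleep,
    )


# Record the backoff delays instead of actually sleeping.
delays = []
cloud = FakeCloud(failures=2)
result = deploy_workflow(cloud, sleep=delays.append)
print(result, cloud.calls, delays)
```

In this sketch the workflow function contains no retry or timing code of its own; that concern lives entirely in `run_activity`, which mirrors, in a very simplified way, how the platform rather than the developer owns failure handling.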

aws

Build multi-step applications and AI workflows with AWS Lambda durable functions | AWS News Blog

AWS Lambda durable functions introduce a simplified way to manage complex, long-running workflows directly within the standard Lambda experience. Using a checkpoint-and-replay mechanism, developers can write multi-step processes as sequential code, with state management and retries handled automatically and no external orchestration service required. The feature significantly reduces the cost of long-running tasks by allowing functions to suspend execution for up to one year without incurring compute charges during idle periods.

### Durable Execution Mechanism

* The system uses a "durable execution" model based on checkpointing and replay to maintain state across function restarts.
* When a function is interrupted or resumes from a pause, Lambda re-executes the handler from the beginning but skips already-completed operations by referencing saved checkpoints.
* This architecture keeps business logic resilient to failures and lets it survive execution environment recycles.
* The execution state can be maintained for extended periods, supporting workflows that require human intervention or long-duration external processes.

### Programming Primitives and SDK

* The feature requires the inclusion of a new open-source durable execution SDK in the function code.
* **Steps:** The `context.step()` method defines specific blocks of logic that the system checkpoints and automatically retries upon failure.
* **Wait:** The `context.wait()` primitive allows the function to terminate and release compute resources while waiting for a specified duration, resuming only when the time elapses.
* **Callbacks:** Developers can use `create_callback()` to pause execution until an external event, such as an API response or a manual approval, is received.
* **Advanced Control:** The SDK includes `wait_for_condition()` for polling external statuses and `parallel()` or `map()` operations for managing concurrent execution paths.
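
The checkpoint-and-replay behaviour described above can be modelled in a few lines of plain Python. This is a toy sketch, not the real durable execution SDK: `ToyContext` and its `step(name, fn)` signature are assumptions made for illustration, and keying checkpoints by a step name is a simplification. It only shows the core idea that a re-run of the handler starts from the top but skips steps whose results are already checkpointed.

```python
from typing import Any, Callable, Dict


class ToyContext:
    """Hypothetical stand-in for a durable context; NOT the real SDK."""

    def __init__(self, checkpoints: Dict[str, Any]) -> None:
        # In this model the checkpoint store outlives a single invocation.
        self.checkpoints = checkpoints

    def step(self, name: str, fn: Callable[[], Any]) -> Any:
        if name in self.checkpoints:
            # Replay: the step already completed, so reuse its saved result.
            return self.checkpoints[name]
        result = fn()                    # first run: execute the step
        self.checkpoints[name] = result  # then checkpoint its result
        return result


calls = {"reserve": 0, "charge": 0}
crash_once = {"armed": True}


def reserve() -> str:
    calls["reserve"] += 1
    return "reservation-1"


def charge() -> str:
    calls["charge"] += 1
    if crash_once["armed"]:
        # Simulate the execution environment being recycled mid-workflow.
        crash_once["armed"] = False
        raise RuntimeError("execution environment recycled")
    return "charge-1"


def handler(ctx: ToyContext) -> str:
    # Sequential code; each step is checkpointed independently.
    reservation = ctx.step("reserve", reserve)
    payment = ctx.step("charge", charge)
    return f"{reservation}/{payment}"


store: Dict[str, Any] = {}
try:
    handler(ToyContext(store))           # first invocation is interrupted
except RuntimeError:
    pass
outcome = handler(ToyContext(store))     # replay from the beginning
print(outcome, calls)
```

On the second invocation `reserve` is not called again because its result is already in the store, while `charge` runs a second time because it never completed. Side effects inside a step should therefore be safe to repeat.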
### Configuration and Setup

* Durable execution must be enabled when the Lambda function is created; it cannot be enabled retroactively for existing functions.
* Once enabled, the function keeps the same event handler structure and service integrations as a standard Lambda function.
* The feature targets high-reliability use cases such as payment processing, AI agent orchestration, and complex order management.

AWS Lambda durable functions represent a major shift for developers who need the power of stateful orchestration but prefer to keep their logic within a single code-based environment. They are highly recommended for building AI workflows and multi-step business processes where state persistence and cost-efficiency are critical requirements.