cloud-operations

2 posts

netflix

How Temporal Powers Reliable Cloud Operations at Netflix | by Netflix Technology Blog | Dec, 2025 | Netflix TechBlog

Netflix has significantly enhanced the reliability of its global continuous delivery platform, Spinnaker, by adopting Temporal for durable execution of cloud operations. By migrating away from a fragile, polling-based orchestration model between its internal services, the engineering team successfully reduced transient deployment failures from 4% to a remarkable 0.0001%. This shift has allowed developers to write complex, long-running operational logic as standard code while the underlying platform handles state persistence and fault recovery.

### Limitations of Legacy Orchestration

* **The Polling Bottleneck:** Originally, Netflix's orchestration engine (Orca) communicated with its cloud interface (Clouddriver) via a synchronous POST request followed by continuous polling of a GET endpoint to track task status.
* **State Fragility:** Clouddriver utilized an internal orchestration engine that relied on in-memory state or volatile Redis storage, meaning if a Clouddriver instance crashed mid-operation, the deployment state was often lost, leading to "zombie" tasks or failed deployments.
* **Manual Error Handling:** Developers had to manually implement complex retry logic, exponential backoffs, and state checkpointing for every cloud operation, which was both error-prone and difficult to maintain.

### Transitioning to Durable Execution with Temporal

* **Abstraction of Failures:** Temporal provides a "Durable Execution" platform where the state of a workflow, including local variables and thread stacks, is automatically persisted. This allows code to run "as if failures don't exist," as the system can resume exactly where it left off after a process crash or network interruption.
* **Workflows and Activities:** Netflix re-architected cloud operations into Temporal Workflows (orchestration logic) and Activities (idempotent units of work like calling an AWS API). This separation ensures that the orchestration logic remains deterministic while external side effects are handled reliably.
* **Eliminating Polling:** By using Temporal's signaling and long-running execution capabilities, Netflix moved away from the heavy overhead of thousands of services polling for status updates, replacing them with a push-based, event-driven model.

### Impact on Cloud Operations

* **Dramatic Reliability Gains:** The most significant outcome was the near-elimination of transient failures, moving from a 4% failure rate to 0.0001%, ensuring that critical updates to the Open Connect CDN and Live streaming infrastructure are executed with high confidence.
* **Developer Productivity:** Using Temporal's SDKs, Netflix engineers can now write standard Java or Go code to define complex deployment strategies (like canary releases or blue-green deployments) without building custom state machines or management layers.
* **Operational Visibility:** Temporal provides a native UI and history audit log for every workflow, giving operators deep visibility into exactly which step of a deployment failed and why, along with the ability to retry specific failed steps manually if necessary.

For organizations managing complex, distributed cloud infrastructure, adopting a durable execution framework like Temporal is highly recommended. It moves the burden of state management and fault tolerance from the application layer to the platform, allowing engineers to focus on business logic rather than the mechanics of distributed systems failure.
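To make the idea of resuming "exactly where it left off" concrete, here is a minimal, standard-library-only sketch of the journal-and-replay pattern that durable execution platforms build on. It is not Temporal's SDK and not Netflix's implementation: `Journal`, `FakeCloud`, `deploy_workflow`, and the step names are all hypothetical, and a real platform would add backoff, timeouts, and versioning on top of this.

```python
import json
import os
import tempfile


class TransientError(Exception):
    """A failure worth retrying, e.g. a throttled cloud API call."""


class WorkerCrash(Exception):
    """Stands in for the worker process dying mid-operation."""


class Journal:
    """Durable record of completed steps, stored as a JSON file (hypothetical)."""

    def __init__(self, path):
        self.path = path
        self.results = {}
        if os.path.exists(path):
            with open(path) as f:
                self.results = json.load(f)

    def run_activity(self, name, fn, *args, max_attempts=3):
        # Replay: a step that already completed returns its recorded result
        # instead of repeating its side effect.
        if name in self.results:
            return self.results[name]
        for attempt in range(1, max_attempts + 1):
            try:
                result = fn(*args)
                break
            except TransientError:
                if attempt == max_attempts:
                    raise
                # A real system would sleep with exponential backoff here.
        # Persist atomically so a crash never leaves a half-written journal.
        self.results[name] = result
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.results, f)
        os.replace(tmp, self.path)
        return result


class FakeCloud:
    """Toy stand-in for a cloud API that records every call it receives."""

    def __init__(self, crash_on=None, flaky=()):
        self.calls = []
        self.crash_on = crash_on
        self.flaky = set(flaky)  # steps that fail once with a TransientError

    def _call(self, name):
        self.calls.append(name)
        if name in self.flaky:
            self.flaky.discard(name)
            raise TransientError(name)
        if name == self.crash_on:
            raise WorkerCrash(name)

    def bake_image(self, app):
        self._call("bake_image")
        return "ami-" + app

    def create_server_group(self, image):
        self._call("create_server_group")
        return "asg-" + image

    def enable_traffic(self, group):
        self._call("enable_traffic")
        return True


def deploy_workflow(journal, cloud):
    # Orchestration logic reads as ordinary sequential code.
    image = journal.run_activity("bake_image", cloud.bake_image, "app-v2")
    group = journal.run_activity("create_server_group", cloud.create_server_group, image)
    journal.run_activity("enable_traffic", cloud.enable_traffic, group)
    return group


def demo():
    path = os.path.join(tempfile.mkdtemp(), "journal.json")
    first = FakeCloud(crash_on="enable_traffic", flaky=["bake_image"])
    try:
        deploy_workflow(Journal(path), first)
    except WorkerCrash:
        pass  # the "process" died; only the journal file survives
    second = FakeCloud()
    group = deploy_workflow(Journal(path), second)  # fresh worker, same journal
    return first.calls, second.calls, group
```

In `demo()`, the second worker skips the two journaled steps and re-runs only `enable_traffic`, the step that was in flight when the first worker crashed. Because that step may already have partially executed, the sketch only works if each step is idempotent, which matches the summary's description of Activities as idempotent units of work.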

aws

New and enhanced AWS Support plans add AI capabilities to expert guidance | AWS News Blog

AWS has announced a major transformation of its support plans, moving from a reactive model to a proactive, AI-driven approach for issue prevention and workload optimization. By integrating AI-powered capabilities with deep technical expertise, these enhanced plans aim to help organizations identify potential operational risks before they impact business performance. This new tier-based structure provides businesses with varying levels of contextual assistance, ranging from intelligent automated recommendations to direct access to specialized engineering teams.

### Business Support+

* Introduces intelligent, AI-powered assistance designed to provide contextual recommendations for developers, startups, and small businesses.
* Features a seamless transition from AI tools to human experts, with critical case response times reduced to 30 minutes, twice as fast as previous standards.
* Provides personalized workload optimization suggestions based on the user's specific environment via a low-cost monthly subscription.

### Enterprise Support

* Assigns a designated Technical Account Manager (TAM) who utilizes data-driven insights and AI tools to mitigate risks and identify optimization opportunities.
* Grants access to the AWS Security Incident Response service at no additional fee, centralizing the tracking, monitoring, and investigation of security events.
* Guarantees a 15-minute response time for production-critical issues, with support engineers receiving AI-generated context to ensure faster, more personalized resolution.
* Includes access to hands-on workshops and interactive programs to foster continuous technical growth within the organization.

### Unified Operations Support

* Provides the highest level of context-aware assistance through a dedicated core team including a TAM, a Domain Engineer, and a Senior Billing and Account Specialist.
* Delivers industry-leading 5-minute response times for critical incidents, supported by around-the-clock monitoring and AI-powered proactive risk identification.
* Offers on-demand access to specialized experts in migration, incident management, and security through the customer's preferred collaboration channels.

These updates reflect AWS's commitment to using generative AI to shorten resolution times and provide more personalized architectural guidance. Organizations should evaluate their operational complexity and required response times to select the plan that best aligns with their mission-critical cloud needs.