data-pipelines

1 post

netflix

100X Faster: How We Supercharged Netflix Maestro’s Workflow Engine | by Netflix Technology Blog | Netflix TechBlog

Netflix has significantly optimized Maestro, its horizontally scalable workflow orchestrator, to meet the evolving demands of low-latency use cases such as live events, advertising, and gaming. By redesigning the core engine to move from a polling-based architecture to a high-performance event-driven model, the team achieved a 100x speedup, reducing workflow overhead from several seconds to mere milliseconds and markedly improving developer productivity and system efficiency.

### Limitations of the Legacy Architecture

The original Maestro architecture was built on a three-layer system that, while scalable, introduced significant latency during execution.

* **Polling Latency:** The internal flow engine invoked execution functions at fixed intervals, creating a "speedbump" where tasks waited seconds before a worker picked them up.
* **Execution Overhead:** Translating complex workflow graphs into parallel flows and sequentially chained tasks added internal processing time that hindered sub-hourly and ad-hoc workloads.
* **Concurrency Issues:** Weak guarantees from the internal flow engine occasionally led to race conditions in which a single step could be executed by multiple workers simultaneously.

### Transitioning to an Event-Driven Engine

To meet these low-latency demands, Netflix replaced the traditional flow engine with a custom, high-performance execution model.

* **Direct Dispatching:** The engine moved away from periodic polling in favor of an event-driven mechanism that triggers state transitions instantly (the two dispatch models are contrasted in the first sketch below).
* **State Machine Optimization:** The new design manages the lifecycle of workflows and steps through a more streamlined state machine, ensuring fast transitions between "start," "restart," "stop," and "pause" actions (see the second sketch below).
* **Reduced Data Latency:** The team optimized data access patterns for internal state storage, cutting the time required to write Maestro data to the database during high-volume executions.

### Scalability and Functional Improvements

The redesign not only improved speed but also strengthened the engine's ability to handle massive, complex data pipelines.

* **Isolation Layers:** The engine maintains strict isolation between the Maestro step runtime (integrated with Spark and Trino) and the underlying execution logic.
* **Support for Heterogeneous Workflows:** The supercharged engine continues to support massive workflows with hundreds of thousands of jobs while providing the low latency required for iterative development cycles.
* **Reliability Guarantees:** By moving to a more robust internal event bus, the system eliminated the race conditions found in the previous distributed job queue implementation (illustrated in the last sketch below).

For organizations managing large-scale data or ML workflows, moving toward an event-driven orchestration model is essential for supporting sub-hourly execution and low-latency ad-hoc queries. These performance improvements are now available in the Maestro open-source project for wider community adoption.
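To make the polling "speedbump" concrete, here is a minimal Java sketch, not Maestro's actual code: `pollingWorker`, `eventDrivenWorker`, and the in-memory queue are illustrative assumptions. It contrasts interval polling, where a task can sit idle for up to a full poll interval, with blocking event-driven dispatch, where the worker wakes the moment a task arrives.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical illustration of the latency gap between polling and
 * event-driven dispatch. Names and structure are assumptions, not Maestro's API.
 */
public class DispatchComparison {

    /** Polling worker: an enqueued task may wait up to pollIntervalMs. */
    static void pollingWorker(BlockingQueue<Runnable> queue, long pollIntervalMs)
            throws InterruptedException {
        while (!Thread.currentThread().isInterrupted()) {
            Runnable task = queue.poll();        // check once per interval
            if (task != null) {
                task.run();
            } else {
                Thread.sleep(pollIntervalMs);    // the "speedbump": idle wait
            }
        }
    }

    /** Event-driven worker: take() blocks and wakes immediately on enqueue. */
    static void eventDrivenWorker(BlockingQueue<Runnable> queue)
            throws InterruptedException {
        while (!Thread.currentThread().isInterrupted()) {
            Runnable task = queue.take();        // dispatched the moment it arrives
            task.run();
        }
    }

    public static void main(String[] args) throws Exception {
        BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
        Thread worker = new Thread(() -> {
            try { eventDrivenWorker(queue); } catch (InterruptedException ignored) { }
        });
        worker.start();

        long enqueuedAt = System.nanoTime();
        queue.put(() -> System.out.printf("dispatch latency: %d us%n",
                TimeUnit.NANOSECONDS.toMicros(System.nanoTime() - enqueuedAt)));

        Thread.sleep(100);
        worker.interrupt();
    }
}
```

In a real orchestrator the queue would be durable and distributed rather than in-memory, but the asymmetry is the same: polling latency scales with the interval, while event-driven dispatch pays only the wake-up cost.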
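The post does not show Maestro's internal state machine, but a table-driven design like the following hypothetical sketch captures the idea: every (state, action) transition for the "start," "restart," "stop," and "pause" actions is explicit, so an illegal or duplicate transition fails fast instead of leaving a workflow in an ambiguous state. All state and action names here are assumptions.

```java
import java.util.EnumMap;
import java.util.Map;

/**
 * Hypothetical sketch of a workflow lifecycle state machine with explicit,
 * table-driven transitions. States and actions are assumptions inferred from
 * the actions the post mentions, not Maestro's actual model.
 */
public class WorkflowStateMachine {

    enum State { CREATED, RUNNING, PAUSED, STOPPED }
    enum Action { START, RESTART, STOP, PAUSE, RESUME }

    // Legal (state, action) -> next-state transitions, declared up front.
    private static final Map<State, Map<Action, State>> TRANSITIONS =
            new EnumMap<>(State.class);
    static {
        put(State.CREATED, Action.START,   State.RUNNING);
        put(State.RUNNING, Action.PAUSE,   State.PAUSED);
        put(State.RUNNING, Action.STOP,    State.STOPPED);
        put(State.PAUSED,  Action.RESUME,  State.RUNNING);
        put(State.PAUSED,  Action.STOP,    State.STOPPED);
        put(State.STOPPED, Action.RESTART, State.RUNNING);
    }

    private static void put(State from, Action action, State to) {
        TRANSITIONS.computeIfAbsent(from, s -> new EnumMap<>(Action.class))
                   .put(action, to);
    }

    private State current = State.CREATED;

    /** Applies an action, returning the new state or failing fast on an illegal move. */
    public synchronized State apply(Action action) {
        State next = TRANSITIONS.getOrDefault(current, Map.of()).get(action);
        if (next == null) {
            throw new IllegalStateException(current + " does not accept " + action);
        }
        current = next;
        return current;
    }
}
```

Keeping the transition table as data rather than scattered conditionals is what makes the lifecycle easy to reason about and fast to evaluate on each event.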
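Finally, one common way to eliminate the "one step, many workers" race the post describes is to make step ownership an atomic claim. This hypothetical sketch uses an in-process compare-and-set; in a distributed engine the equivalent would typically be a conditional update on the step's row in the database.

```java
import java.util.concurrent.atomic.AtomicReference;

/**
 * Hypothetical sketch of single-owner step dispatch: a worker must win an
 * atomic compare-and-set on the step's owner field before executing it, so
 * two workers can never run the same step concurrently. Names are assumptions.
 */
public class StepClaim {

    private final AtomicReference<String> owner = new AtomicReference<>(null);

    /** Returns true only for the single worker that wins the claim. */
    public boolean tryClaim(String workerId) {
        return owner.compareAndSet(null, workerId);
    }

    public static void main(String[] args) {
        StepClaim step = new StepClaim();
        // Simulate two workers receiving the same dispatch event.
        System.out.println("worker-1 claimed: " + step.tryClaim("worker-1")); // true
        System.out.println("worker-2 claimed: " + step.tryClaim("worker-2")); // false
    }
}
```

Only one of the two simulated workers wins the claim, so the step body runs exactly once even if the dispatch event is delivered twice.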