100X Faster: How We Supercharged Netflix Maestro’s Workflow Engine (Netflix TechBlog)
Netflix has significantly optimized Maestro, its horizontally scalable workflow orchestrator, to meet the evolving demands of low-latency use cases like live events, advertising, and gaming. By redesigning the core engine to transition from a polling-based architecture to a high-performance event-driven model, the team achieved a 100x increase in speed. This evolution reduced workflow overhead from several seconds to mere milliseconds, drastically improving developer productivity and system efficiency.
Limitations of the Legacy Architecture
The original Maestro architecture was built on a three-layer system that, while scalable, introduced significant latency during execution.
- Polling Latency: The internal flow engine relied on workers polling for ready tasks at fixed intervals, creating a "speedbump" in which a task could sit for seconds before any worker picked it up (see the sketch after this list).
- Execution Overhead: The process of translating complex workflow graphs into parallel flows and sequentially chained tasks added internal processing time that hindered sub-hourly and ad-hoc workloads.
- Concurrency Issues: A lack of strong guarantees from the internal flow engine occasionally led to race conditions, where a single step might be executed by multiple workers simultaneously.
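To make the polling cost concrete, here is a minimal Java sketch of the pattern described above, using hypothetical names rather than Maestro's actual internals. A task enqueued just after a poll tick waits up to a full interval before any worker sees it, which is exactly the "speedbump" latency.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical illustration of interval-based polling; not Maestro's code.
public class PollingWorker {
    private static final long POLL_INTERVAL_MS = 1000; // the "speedbump"
    private final Queue<Runnable> readyTasks = new ConcurrentLinkedQueue<>();

    public void submit(Runnable task) {
        readyTasks.add(task); // nothing happens until the next poll tick
    }

    public void runLoop() throws InterruptedException {
        while (!Thread.currentThread().isInterrupted()) {
            Runnable task = readyTasks.poll();
            if (task != null) {
                task.run();
            } else {
                Thread.sleep(POLL_INTERVAL_MS); // worst-case pickup latency
            }
        }
    }
}
```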
Transitioning to an Event-Driven Engine
To meet the requirements of its most latency-sensitive use cases, Netflix replaced the traditional flow engine with a custom, high-performance execution model.
- Direct Dispatching: The engine moved away from periodic polling in favor of an event-driven mechanism that triggers state transitions as soon as an event arrives (sketched after this list).
- State Machine Optimization: The new design manages the lifecycle of workflows and steps through a more streamlined state machine, ensuring faster transitions between "start," "restart," "stop," and "pause" actions.
- Reduced Data Latency: The team optimized data access patterns for internal state storage, reducing the time required to write Maestro data to the database during high-volume executions.
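The following is a minimal Java sketch, with assumed names and a deliberately simplified lifecycle, of how event-driven dispatch can replace polling: a blocking queue wakes the dispatcher the instant an event is published, and a small transition function models the start/restart/stop/pause lifecycle.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of event-driven dispatch; illustrative only,
// not Maestro's actual state machine.
public class EventDrivenEngine {
    enum State { CREATED, RUNNING, PAUSED, STOPPED }
    enum Action { START, RESTART, PAUSE, STOP }

    // Simplified lifecycle table in the spirit of the streamlined
    // state machine described above.
    private static State transition(State current, Action action) {
        switch (action) {
            case START:
            case RESTART: return State.RUNNING;
            case PAUSE:   return current == State.RUNNING ? State.PAUSED : current;
            case STOP:    return State.STOPPED;
            default:      return current;
        }
    }

    private final BlockingQueue<Action> events = new LinkedBlockingQueue<>();
    private volatile State state = State.CREATED;

    public void publish(Action action) {
        events.add(action); // wakes the dispatcher instantly, no polling
    }

    public void dispatchLoop() throws InterruptedException {
        while (state != State.STOPPED) {
            Action action = events.take(); // blocks; zero idle polling
            state = transition(state, action);
        }
    }
}
```

Because take() blocks until an event arrives, the idle loop costs nothing and pickup latency drops from a full polling interval to effectively the thread wakeup time, which is the milliseconds-scale overhead the redesign targets.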
Scalability and Functional Improvements
The redesign not only improved speed but also strengthened the engine's ability to handle massive, complex data pipelines.
- Isolation Layers: The engine maintains strict isolation between the Maestro step runtime (integrated with Spark and Trino) and the underlying execution logic.
- Support for Heterogeneous Workflows: The supercharged engine continues to support massive workflows with hundreds of thousands of jobs while providing the low latency required for iterative development cycles.
- Reliability Guarantees: By moving to a more robust internal event bus, the system eliminated the race conditions found in the previous distributed job queue implementation (an illustrative ownership-claim sketch follows this list).
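As one illustrative way to reason about the single-execution guarantee (not necessarily how Maestro enforces it, since its real guarantee rests on the event bus and persisted state), exactly-once pickup can be modeled as an atomic ownership claim: the first worker to claim a step id runs it, and duplicate deliveries become no-ops.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical in-memory model of an exactly-one-worker claim.
public class StepOwnership {
    private final ConcurrentMap<String, String> owners = new ConcurrentHashMap<>();

    /** Returns true only for the first worker that claims the step. */
    public boolean tryClaim(String stepId, String workerId) {
        return owners.putIfAbsent(stepId, workerId) == null;
    }

    public void runStep(String stepId, String workerId, Runnable step) {
        if (tryClaim(stepId, workerId)) {
            step.run(); // only the claiming worker executes the step
        }
        // other workers simply skip: the step is already owned elsewhere
    }
}
```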
For organizations managing large-scale data or ML workflows, moving toward an event-driven orchestration model is essential for supporting sub-hourly execution and low-latency ad-hoc runs. These performance improvements are now available in the Maestro open-source project for wider community adoption.