workflow-orchestration

2 posts

netflix

Netflix's Metaflow Spin: Faster ML Development | Netflix TechBlog

Netflix has introduced Spin, a new functionality within the Metaflow framework designed to significantly accelerate the iterative development cycle for ML and AI workflows. By bridging the gap between the interactive speed of notebooks and the production-grade reliability of versioned workflows, Spin allows developers to experiment in stateful increments without the latency of full restarts. This enhancement keeps the "prototype to production" pipeline fluid while maintaining the deterministic execution and explicit state management that Metaflow provides at scale.

### The Nature of ML and AI Iteration

* ML and AI development is distinct from traditional software engineering because it involves large, mutable datasets and computationally expensive, stochastic processes.
* State management is a primary concern in this domain, as reloading data or recomputing transformations for every minor code change creates a prohibitively slow feedback loop.
* While notebooks like Jupyter or Marimo excel at preserving in-memory state for fast exploration, they often lead to "hidden state" problems and non-deterministic results due to out-of-order cell execution.

### Metaflow as a State-Aware Framework

* Metaflow uses the `@step` decorator to define checkpoint boundaries where the framework automatically persists all instance variables as versioned artifacts.
* The framework's `resume` command allows developers to restart execution from a specific step, cloning previous state to avoid recomputing successful upstream tasks (see the sketch after this summary).
* This architecture addresses notebook limitations by making execution order explicit and deterministic while keeping state fully discoverable and versioned.

### Introducing Spin for Rapid Development

* Spin is a new feature introduced in Metaflow 2.19 that further reduces the friction of the iterative development loop.
* It aims to provide the near-instant feedback of a notebook environment while operating within the structure of a production-ready Metaflow workflow.
* The tool helps developers manage the stateful nature of ML development, allowing quick, incremental experimentation without losing continuity between code iterations.

To improve data science productivity and reduce "waiting time" during the development phase, engineering teams should look to adopt Metaflow 2.19 and integrate Spin into their experimentation workflows.
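The step-boundary checkpointing and `resume` behavior described above can be seen in a minimal flow. The sketch below uses only core Metaflow (`FlowSpec`, `@step`); the workload inside each step is a hypothetical stand-in, and the Spin-specific commands added in 2.19 are not shown since the post's summary does not detail them.

```python
# train_flow.py — minimal sketch of Metaflow's step-boundary state persistence.
from metaflow import FlowSpec, step

class TrainFlow(FlowSpec):

    @step
    def start(self):
        # Instance variables assigned in a step are persisted as
        # versioned artifacts when the step completes.
        self.rows = list(range(1_000_000))  # stand-in for an expensive data load
        self.next(self.train)

    @step
    def train(self):
        # After editing only this step, resuming reuses the persisted
        # `self.rows` artifact instead of recomputing `start`.
        self.model_sum = sum(self.rows)     # stand-in for model training
        self.next(self.end)

    @step
    def end(self):
        print(f"result: {self.model_sum}")

if __name__ == "__main__":
    TrainFlow()
```

`python train_flow.py run` executes the full flow; after changing only `train`, `python train_flow.py resume train` clones the successful `start` task's artifacts rather than recomputing them, which is the slow-feedback-loop problem Spin pushes further toward notebook-speed iteration.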

netflix

100X Faster: How We Supercharged Netflix Maestro's Workflow Engine | Netflix TechBlog

Netflix has significantly optimized Maestro, its horizontally scalable workflow orchestrator, to meet the evolving demands of low-latency use cases like live events, advertising, and gaming. By redesigning the core engine to transition from a polling-based architecture to a high-performance event-driven model, the team achieved a 100x increase in speed. This evolution reduced workflow overhead from several seconds to mere milliseconds, drastically improving developer productivity and system efficiency.

### Limitations of the Legacy Architecture

The original Maestro architecture was built on a three-layer system that, while scalable, introduced significant latency during execution.

* **Polling Latency:** The internal flow engine invoked execution functions at fixed intervals, creating a "speedbump" where tasks waited seconds to be picked up by workers.
* **Execution Overhead:** Translating complex workflow graphs into parallel flows and sequentially chained tasks added internal processing time that hindered sub-hourly and ad-hoc workloads.
* **Concurrency Issues:** Weak guarantees from the internal flow engine occasionally led to race conditions, where a single step might be executed by multiple workers simultaneously.

### Transitioning to an Event-Driven Engine

To meet its most demanding latency requirements, Netflix replaced the traditional flow engine with a custom, high-performance execution model (a conceptual sketch follows this summary).

* **Direct Dispatching:** The engine moved away from periodic polling in favor of an event-driven mechanism that triggers state transitions instantly.
* **State Machine Optimization:** The new design manages the lifecycle of workflows and steps through a more streamlined state machine, ensuring faster transitions between the "start," "restart," "stop," and "pause" actions.
* **Reduced Data Latency:** The team optimized data access patterns for internal state storage, reducing the time required to write Maestro data to the database during high-volume executions.

### Scalability and Functional Improvements

The redesign not only improved speed but also strengthened the engine's ability to handle massive, complex data pipelines.

* **Isolation Layers:** The engine maintains strict isolation between the Maestro step runtime (integrated with Spark and Trino) and the underlying execution logic.
* **Support for Heterogeneous Workflows:** The supercharged engine continues to support massive workflows with hundreds of thousands of jobs while providing the low latency required for iterative development cycles.
* **Reliability Guarantees:** By moving to a more robust internal event bus, the system eliminated the race conditions found in the previous distributed job queue implementation.

For organizations managing large-scale data or ML workflows, moving toward an event-driven orchestration model is essential for supporting sub-hourly execution and low-latency ad-hoc queries. These performance improvements are now available in the Maestro open-source project for wider community adoption.
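The core architectural shift is from fixed-interval polling to instant event dispatch. The following is a minimal, self-contained Python sketch of that difference under stated assumptions: the three-step workflow and function names are hypothetical, and this illustrates the general pattern rather than Maestro's actual Java internals.

```python
import queue
import time

# Hypothetical linear workflow: each step maps to its successor (None = done).
WORKFLOW = {"extract": "transform", "transform": "load", "load": None}

def run_step(step):
    print(f"running {step}")
    return WORKFLOW[step]  # next step in the chain, or None when finished

def run_polling(interval=1.0):
    """Legacy-style model: workers wake on a fixed interval, so every
    step transition waits up to `interval` seconds before pickup."""
    pending = ["extract"]
    start = time.monotonic()
    while pending:
        time.sleep(interval)          # the per-transition "speedbump"
        nxt = run_step(pending.pop(0))
        if nxt:
            pending.append(nxt)
    print(f"polling total: {time.monotonic() - start:.2f}s")

def run_event_driven():
    """Event-driven model: a completion event dispatches the next step
    immediately, with no fixed wait between state transitions."""
    events = queue.Queue()
    events.put("extract")
    start = time.monotonic()
    while not events.empty():
        nxt = run_step(events.get())  # transition fires as soon as the event lands
        if nxt:
            events.put(nxt)
    print(f"event-driven total: {time.monotonic() - start:.4f}s")

if __name__ == "__main__":
    run_polling()      # ~3s for three steps: interval-bound
    run_event_driven() # effectively instant: work-bound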