The Lean Orchestration Manifesto: Narrowing Orchestrator Scope Creates Better ML Development

In this Lean Orchestration Manifesto, we advocate for a more streamlined approach to ML pipelines by narrowing the scope of orchestrators. Orchestrators should be used to schedule, log, and retry tasks, but should not be the primary runtime of your production ML pipelines.

Paul Yang

ML @ šŸƒā€ā™€ļøRunhousešŸ 

Published August 22, 2024

Overview

Orchestrators are the workhorses of the AI/ML world, used widely in both training and offline inference to schedule tasks, ensure reliable execution, define resource usage, help engineers debug, and quite a bit more. However, orchestrators have become overburdened as their flexibility is abused to manage heterogeneous compute, dependencies, and state in ML systems.

Concretely, most organizations (maybe yours as well) have a multi-week research-to-production process, a horrifying inefficiency that we do not see anywhere else in software development. It is driven by the need to repackage research code into some type of DAG-based system and then debug it for production on production hardware and data. But orchestrators, which were not designed for iteration or runtime control, are an imperfect tool for receiving this ML pipeline code.

But we can deploy faster, improve developer experience, and enable debuggability by putting a limit on what a ā€œlean orchestratorā€ does:

Orchestrators Should:

  • Schedule execution runs
  • Observe and log execution
  • Provide reliability and robustness

Orchestrators Should Not:

  • Serve as your primary runtime
  • Be used for debugging
  • Be a monolithic ML application

In a ā€œlean orchestratorā€ context, use the orchestrator to schedule and log calls to remote compute. The actual definition of what to run is not packed into the nodes of the orchestrator DAG, but rather written in normal Python code and classes, and dispatched to remote compute with Runhouse.
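
To make the shape of a lean orchestrator node concrete, here is a minimal sketch; the task body, saved function name, and arguments are purely illustrative, and it assumes the function was previously deployed and saved with Runhouse:

```python
import runhouse as rh

# A lean orchestrator task: no ML logic lives here. It loads a function that was
# previously deployed with Runhouse and calls it, so the orchestrator is only
# responsible for scheduling, logging, and retrying this call.
def preprocess_task():
    remote_preproc = rh.function(name="preproc_fn")  # hypothetical saved name
    return remote_preproc(dataset="s3://my-bucket/raw", filters=["dedupe"])
```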

To make this more concrete and show you how lean orchestration works, we will walk through three key benefits: 1) extremely simple research to production, 2) total control over compute, and 3) collaborative reuse of tasks. If you would like to see the example rendered in a code example before diving into the benefits, check out our Torch training example.

Why Does Lean Orchestration Work?

Eliminate Research to Production Overhead

In ā€œheavyā€ orchestration, code must be packed into an orchestrator pipeline after research to reach production, but the process of debugging that pipeline working is incredibly painful. Iteration requires building and resubmitting a new pipeline and then often executing the task in full. This is extremely slow. Local testing when pipelines need GPUs or distributed compute is impossible, and while itā€™s sometimes possible to tunnel into each of the pipeline nodes on production hardware, that is not a real development experience. And, more abstractly, doing substantial code iteration within orchestrator pipelines is the wrong pattern to begin with! Thereā€™s no such thing as ā€œresearch frontend codeā€ and ā€œproduction frontend codeā€ and a multi-week process to translate between them.

The correct pattern should enable researchers to have fast local iteration while being production-ready at the end of the research process with no extra repackaging for pipelines. This is achieved by Runhouse, which lets researchers instantly send or redeploy code written in local IDEs for remote execution on full-scale cloud compute. Practically, usage during local development feels identical to writing a local function and calling or updating it. But for production, an orchestrator can make an identical call to deploy or use the same remote function. No rewriting is necessary.
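
As a rough sketch of that loop (the cluster name, instance type, and function are illustrative), each local edit is simply re-dispatched and called again, the same way a local function would be:

```python
import runhouse as rh

def preprocess(dataset, filters):
    # Normal Python, written and edited in a local IDE or notebook
    ...

# Allocate or reuse remote compute (name and instance type are illustrative)
gpu = rh.cluster(name="rh-a10", instance_type="A10G:1")

# After each local edit, re-dispatch and call again; the call site is identical
# whether it runs from a notebook, an IDE, or later from an orchestrator task
remote_preprocess = rh.function(preprocess).to(gpu)
result = remote_preprocess("s3://my-bucket/raw", ["dedupe"])
```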

[Diagram: research and production]
Code lives as code outside of the orchestrator. Control over compute and dispatch is syntactically identical within orchestrator tasks and in developersā€™ local IDEs.

Debugging in Production

Once in production, pipelines can still break on specific runs. Runhouse remote objects can be persisted and can even be stateful, so debugging a pipeline is as simple as accessing the remote objects by name and interacting with them. The interactive development that was possible during research becomes interactive debugging in production, against a specific run.
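
For example, reattaching to a persisted remote function from a notebook might look like this sketch (the saved name and the failing input are illustrative):

```python
import runhouse as rh

# Reattach to the persisted remote function by the name it was saved under
remote_preproc = rh.function(name="preproc_fn")

# Re-run just the failing call interactively, against the same remote compute
# and state that the production run used
debug_output = remote_preproc("s3://my-bucket/raw/2024-08-22", ["dedupe"])
```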

Total Compute Control

Orchestrator tasks can request specific resources, but this is not particularly flexible and is defined in configuration rather than being programmable. More fundamentally, thereā€™s no particular reason why my orchestrator, as a scheduler, should care whether preprocessing happens with or without a GPU, or on Google Cloud or AWS. Rather than specifying resource requirements as part of an orchestrator task, compute should be adaptable within the code of the pipeline itself.

Runhouse is designed to let you programmatically control compute requirements. Tasks can be executed depending on which clusters have available compute and which clusters are the cheapest. Or, build control logic within the task to spin up bigger and bigger boxes upon out-of-memory errors as a failover. Especially as ML teams build more and more pipelines, the fine-grained control that Runhouse offers over compute leads to significant cost savings compared to the inflexible way that orchestrators define resources and run.
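
Here is a sketch of what that failover logic could look like; the instance types are illustrative, and the broad exception handling is a simplification (a real version would check specifically for out-of-memory errors):

```python
import runhouse as rh

def run_with_failover(fn, *args, instance_types=("A10G:1", "A100:1", "A100:4")):
    """Try progressively larger boxes when a run fails (e.g. on out-of-memory)."""
    for i, instance_type in enumerate(instance_types):
        cluster = rh.cluster(name=f"failover-{i}",
                             instance_type=instance_type,
                             provider="cheapest")
        try:
            return rh.function(fn).to(cluster)(*args)
        except Exception:
            # Simplification: a real implementation would confirm this is an OOM failure
            continue
    raise RuntimeError("All instance sizes failed")
```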

Multi-Cluster Support

Total compute control means being able to run across clusters, regions, or even multiple clouds. The configs and compute control live in code; the orchestrator blindly invokes the task without caring about which remote resources run it. Whether you have startup credits you want to spend in a different cloud, or youā€™re quota-limited by operating within a single one, the ability to run multi-cloud leads to gains in efficiency and cost.
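
As an illustrative sketch, the provider can simply be a value computed in the pipeline code; the `credits_remaining` helper below is hypothetical, and the provider strings follow the `provider` argument shown in the example at the end of this post:

```python
import runhouse as rh

def preprocess(dataset, filters):
    ...  # the same function regardless of which cloud runs it

def credits_remaining(provider):
    ...  # hypothetical helper that checks remaining startup credits for a cloud

# Pick a cloud in code: burn startup credits first, otherwise take the cheapest option
provider = "gcp" if credits_remaining("gcp") else "cheapest"

cluster = rh.cluster(name="preproc-cluster", instance_type="A10G:1", provider=provider)
remote_preprocess = rh.function(preprocess).to(cluster)
```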

[Diagram: multi-cluster support for execution]

Service Reuse and Sharing

At organizations like Meta, weā€™ve seen an embeddings function reproduced in full across dozens of independent pipelines. This creates significant overhead: the same work is re-done and the same function is maintained many times over, while cross-team reproducibility is lost. Moreover, functional improvements or operational changes like logging might be added to one instance but not across all usages. Rather than packaging code into orchestrators, moving pipeline steps into microservice-like modules enables collaboration and reproducibility.

Runhouseā€™s dispatch makes it easy to turn code into microservices at scale, either living on persisted compute or living as code (unserialized) that is deployable on demand, near-instantly. If an embeddings function has been dispatched to remote compute once and saved, everyone who needs it can just load the function by name: `embed_fn = rh.function(name="embed_fn")`. Multiple executions across multiple pipelines behave the same way. Moreover, any improvements or changes only need to be implemented once, and the platform team can make major upgrades to key functionality in one place.
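
For example (a sketch: the cluster and model details are illustrative, and the exact save call is an assumption, but the load-by-name pattern is the one shown above):

```python
import runhouse as rh

# Platform team: deploy and share the embeddings function once
def embed(texts):
    ...  # load a model and return embeddings

gpu = rh.cluster(name="embed-cluster", instance_type="A10G:1")
remote_embed = rh.function(embed).to(gpu)
remote_embed.save(name="embed_fn")  # assumption: persist the deployed function under a shared name

# Any other pipeline or teammate: load it by name instead of re-implementing it
embed_fn = rh.function(name="embed_fn")
vectors = embed_fn(["a sentence to embed"])
```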

[Diagram: multi-node sharing of resources]

Moving Towards Lean Orchestrators

Weā€™ve observed that when an organizationā€™s ML team becomes more sophisticated, with more than a dozen active ML projects, the current tools and systems begin to break down. Challenges in time to production and operating in production become a material cost and slow down the ML flywheel.

Lean orchestration is easy to adopt because it operates within existing tool stacks (keep your orchestrator), within your cloud (donā€™t buy new compute), and can be adopted incrementally (start with new projects). It is only one piece of the puzzle, but it can lead to material savings in both time and compute costs; organizations typically see an immediate reduction of 50% or more in cloud bills through flexibility and more intelligent execution.

Hereā€™s how you get started: (Full Example - https://www.run.house/examples/airflow-model-training-example-pytorch-mnist)

  1. Bring up a remote cluster (using code)
  2. Send any local Python function to that remote cluster in 1 line, setting up a service
  3. Call that service as if it were a local function from anywhere, whether in your Jupyter notebook or IDE, to do research work.
  4. When you are done, just call that function from the orchestrator in the exact same way.
```python
import runhouse as rh

# Take any Python function or class
def local_preproc_fn(dataset, filters):
    ...

# Easily allocate the resources you want, whether from existing VMs or K8s,
# or fresh from a cloud
my_cluster = rh.cluster(
    name="my_cluster",
    instance_type="A10G:1",
    provider="cheapest"
)

# Deploy your function to your compute (not pickling!)
remote_preproc_fn = rh.function(local_preproc_fn).to(my_cluster)

# Call it normally, but it's running remotely. This can be done from within a
# local IDE or notebook when iterating on the function, but called in an
# identical way from an orchestrator node
processed_data = remote_preproc_fn(dataset, filters)
```
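
To illustrate step 4, here is a sketch of how a recent Airflow 2.x task might wrap the same call; the DAG wiring and the import path are hypothetical, and the linked example walks through a full Airflow pipeline:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
import runhouse as rh

from my_project.preprocessing import local_preproc_fn  # hypothetical import of the research code

def preprocess_task():
    # Exactly the same dispatch-and-call pattern used during research
    my_cluster = rh.cluster(name="my_cluster", instance_type="A10G:1", provider="cheapest")
    remote_preproc_fn = rh.function(local_preproc_fn).to(my_cluster)
    remote_preproc_fn(dataset="s3://my-bucket/raw", filters=["dedupe"])

with DAG("lean_preprocessing", start_date=datetime(2024, 8, 22), schedule=None) as dag:
    PythonOperator(task_id="preprocess", python_callable=preprocess_task)
```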
