How to Adopt Ray into Your ML Platform (Gently)
Many ML teams are looking to adopt Ray. However, the adoption process requires a significant and usually painful infrastructural lift. We explore why this is and propose a better way to get your team productive with Ray.
Ray Has Become a Leading Distributed Framework
What do ML teams at companies like OpenAI, Anthropic, Uber, Shopify, Spotify, Pinterest, and others have in common? They’ve adopted Ray for their machine learning (ML) workloads. Ray provides robust distributed execution and parallelization capabilities, with intuitive abstractions that support most common ML libraries and tools. This enables organizations to scale their ML pipelines and tackle large-scale, complex training and inference use cases.
While Ray has been around for some time, its momentum accelerated significantly in 2022, driven by two relatively independent trends. First, large language models (LLMs) became popular and widely available, and these models were large enough to require distributed computing for both training and inference, which is precisely Ray's specialty. Second, Ray is not solely an LLM phenomenon; it has also gained traction in standard enterprise use cases, such as recommender systems, backed by a growing list of credible adopters. For teams moving beyond single-node training over tabular data, the social proof of these high-profile users validated Ray as a worthwhile investment.
Even though Ray offers significant benefits, migrating existing systems to Ray is a major undertaking. Tactically, setting up Ray clusters and standardizing their launch processes requires substantial infrastructure effort from platform teams, and there is a learning curve for end-user engineers. At a macro level, organizations that adopt Ray frequently treat it as a universal runtime, mandating that all workloads run on Ray clusters. This leads to lock-in and the use of Ray even in scenarios where it might underperform.
In this discussion, we will explore these challenges in more depth and propose a gentler approach for teams to adopt Ray. By adopting Ray incrementally and applying it to the right projects, organizations preserve their flexibility to use other current and future ML frameworks. More broadly, rearchitecting your ML platform every 2–3 years in response to new frameworks is fundamentally unsustainable. The approach we recommend makes Ray simple to adopt today, while keeping your compute and platform flexible and unopinionated.
Why is Ray Powerful?
Ray is a distributed runtime and domain-specific language (DSL) that provides a flexible platform for building and deploying distributed applications. We assume you have some high-level familiarity with Ray. If not, we recommend briefly reviewing their Overview before continuing.
Clearly, there’s something uniquely compelling about Ray, given that so many leading tech innovators have adopted it for at least some of their ML workflows. Among the key strengths most frequently cited by teams, these stand out:
- Ease of parallelism: Ray is designed to make parallel and distributed computing intuitive for Python developers, significantly lowering the barriers to scaling applications. Through Ray's higher-level APIs, developers can parallelize code without dealing with complex infrastructure setups, since Ray handles distribution and communication across nodes (see the sketch after this list). Moreover, work scales from single-node to large multi-node execution without requiring meaningful code changes.
- Rich library ecosystem: Ray provides abstractions for numerous key ML frameworks, supporting end-to-end workflows in the library that your team is already using and familiar with. For instance, in training, Ray works with PyTorch, TensorFlow, Keras, Accelerate, DeepSpeed, and more.
- Portability of the Ray runtime: Code that runs on a Ray cluster in research can run on another cluster in production with minimal changes. There's no more "research-to-production" gap, where a notebook is tossed over a wall to an ML team, who must painstakingly translate research work into a pipeline over weeks or months. This reproducibility is invaluable for teams conducting hundreds of ML experiments and deploying across multiple teams, and is one of the biggest benefits of Ray reported by mature users. However, this benefit comes with the drawback of needing to launch Ray clusters everywhere, for every task, across research and production.
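To make the "ease of parallelism" point concrete, here is a minimal sketch of Ray's task API. The preprocess_file function and file names are hypothetical placeholders; the calls themselves (ray.init, @ray.remote, .remote(), ray.get) are standard Ray.

```python
import time

import ray

ray.init()  # Start a local Ray runtime, or connect to an existing cluster

@ray.remote
def preprocess_file(path):
    # Placeholder for real work (parsing, featurizing, etc.)
    time.sleep(1)
    return f"processed {path}"

# Each .remote() call returns a future immediately; the tasks run in parallel
futures = [preprocess_file.remote(p) for p in ["a.csv", "b.csv", "c.csv"]]

# Block until all tasks finish and gather the results
print(ray.get(futures))
```

The same code runs unchanged whether ray.init() starts a single-node runtime on your laptop or connects to a multi-node cluster, which is exactly the scaling property described above.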
Despite these strengths, our discussions with ML teams have surfaced several misconceptions about Ray. Specifically, Ray is not:
- A long-lived cluster scheduler like Kubernetes: Ray clusters are designed to be more ephemeral, created for specific workloads rather than serving as persistent infrastructure.
- A serverless-like interface for dispatching work: Ray excels at scheduling distributed computations within a cluster but does not provide an external serverless environment.
In concept, adopting Ray into your toolset should be no more difficult than experimenting with any other ML library – it is a handy distributed DSL. In practice, however, there are significant barriers to adoption.
Why Is It Hard to Adopt Ray?
Teams that have adopted Ray have encountered similar challenges on their adoption journey. We’ve spoken to dozens of ML teams using Ray in production, as well as many others looking to adopt it. For teams that are not operating at "big tech" scale, where rearchitecting is almost guaranteed to deliver ROI, Ray introduces operational overhead that can be difficult to justify.
The biggest challenge is the major infrastructure overhaul required to adopt Ray. All systems must be configured to launch Ray clusters. At the same time, researchers need a solid understanding of Ray’s behavior, dependency requirements, distributed execution principles, and, in many cases, a crash course in Kubernetes. On top of that, Ray can be highly disruptive to the development workflow. Ray code typically needs to run directly on the cluster to avoid issues with dependencies and serialization. This leads to restrictions on how developers test and iterate on their code and effectively prevents local execution.
During execution, fault tolerance can also become a challenge. Ray's tightly coupled architecture makes failures difficult to isolate and recover from: when one component in a Ray workload fails, the failure often cascades across the cluster. This makes error handling and recovery significantly harder and frequently results in lost compute and increased troubleshooting time.
Lastly, adding Ray to your stack means adding another layer to an already complex framework ecosystem. Many companies already run distributed frameworks like Spark, PyTorch Distributed, or Dask, and adding Ray to the mix can introduce compatibility challenges and debugging headaches. In cases where Ray interacts with other frameworks, such as running Ray-on-Spark, the potential for complex dependency and environment issues multiplies, with cryptic errors becoming more common. For example, Ray sets the OMP_NUM_THREADS variable inside all its processes, which can lead to unexpected impacts on thread-based workloads.
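As an illustration of how this can bite, the following sketch inspects and overrides OMP_NUM_THREADS inside Ray tasks. The default value you observe depends on your Ray version and the CPUs assigned to the task; the runtime_env env_vars override is standard Ray, but treat the specific values as illustrative.

```python
import os

import ray

ray.init()  # Start a local Ray runtime, or connect to an existing cluster

@ray.remote
def report_omp_threads():
    # Ray sets OMP_NUM_THREADS inside its worker processes, which can throttle
    # thread-based libraries (NumPy, OpenMP kernels) in surprising ways
    return os.environ.get("OMP_NUM_THREADS", "unset")

print("Default in a Ray worker:", ray.get(report_omp_threads.remote()))

# One way to override it: pass env_vars through the task's runtime_env
@ray.remote(runtime_env={"env_vars": {"OMP_NUM_THREADS": "8"}})
def report_with_override():
    return os.environ.get("OMP_NUM_THREADS", "unset")

print("With override:", ray.get(report_with_override.remote()))
```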
Adopting Ray without Upending Your Infrastructure
We advocate a surgical approach to adopting Ray that avoids the high cost and risks of jumping headfirst into a full-scale migration to Ray as a universal runtime:
- Incremental Adoption: Leverage Ray selectively, adding it as a tool within your arsenal rather than committing to it as a universal runtime. Do not enforce a workflow where every task must universally be run on a Ray cluster.
- Send Work to Ray Clusters within Pipelines: By sending work to Ray clusters rather than running your pipelines directly on the Ray clusters, you avoid the need to rebuild your entire production stack around Ray. This minimizes the infrastructure changes required, whereas first-generation Ray adoption typically demands reworking every system to launch and run on Ray clusters.
- Pythonic Error Handling: Handle faults Pythonically from outside the Ray cluster to prevent cascading failures in production and enable safe retry or cluster restart if needed (see the sketch below).
- Launch Ray clusters with Runhouse: Runhouse was built with Ray usage in mind, and is one of the easiest ways to start using Ray. If you have a Kubeconfig or cloud credential, you can start dispatching programs using Ray immediately with no additional infra setup required. We handle all the orchestration, init, and log streaming.
This surgical, incremental approach allows you to benefit from Ray's strengths without the risks of wholesale migration or over-commitment to a single distributed platform. Specifically, one of the key use cases that Runhouse was built to unlock is letting engineers take advantage of Ray via a simple API and abstract compute.
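Before walking through the full example, here is a minimal sketch of the Pythonic error handling pattern from the list above. The run_with_retries helper is hypothetical, and the cluster.restart_server() recovery step is an assumption about your setup; the point is simply that failures inside the Ray cluster surface as ordinary Python exceptions outside it.

```python
# Hypothetical retry helper; names in the usage comment mirror the HPO example below.
def run_with_retries(remote_fn, cluster, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        try:
            # The dispatched function behaves like a normal Python call, so a failure
            # inside the Ray cluster raises an ordinary exception here instead of
            # cascading through the rest of your pipeline
            return remote_fn()
        except Exception as err:
            print(f"Attempt {attempt} failed: {err}")
            if attempt == max_attempts:
                raise
            # Assumed recovery step: bring the cluster back to a clean state before retrying
            cluster.restart_server()

# Usage, reusing names from the HPO example below:
# best_result = run_with_retries(remote_find_minimum, cluster)
```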
Below, we share a hyperparameter optimization (HPO) example:
- The cluster is launched from elastic compute, and the compute requirements are defined in code; there is no separate config file or Ray-cluster launch system.
- The code is written in undecorated Python that takes advantage of Ray as a library but does not worry about the infrastructure around it.
- To take advantage of Ray, we simply call .distribute("ray"). Otherwise, using Ray is identical to calling any other function.
If you'd like to get in touch, just shoot us a quick note at hello@run.house and we'd be happy to chat or share Ray adoption war stories.
```python
import time
from typing import Any, Dict

import runhouse as rh
from ray import train, tune

# Number of workers and jobs for distributed training
NUM_WORKERS = 8
NUM_JOBS = 30

# A simple function for training (simulating time delay)
def train_fn(step, width, height):
    time.sleep(5)  # Simulate time delay
    return (0.1 + width * step / 100) ** (-1) + height * 0.1  # Compute score based on parameters

# Custom Ray Tune Trainable class for distributed training
class Trainable(tune.Trainable):
    def setup(self, config: Dict[str, Any]):
        self.step_num = 0  # Initialize step counter
        self.reset_config(config)

    def reset_config(self, new_config: Dict[str, Any]):
        self._config = new_config  # Update configuration
        return True

    def step(self):
        # Call the training function and increment step
        score = train_fn(self.step_num, **self._config)
        self.step_num += 1
        return {"score": score}

    def cleanup(self):
        super().cleanup()  # Cleanup any resources

    def load_checkpoint(self, checkpoint_dir: str):
        return None  # No checkpointing in this example

    def save_checkpoint(self, checkpoint_dir: str):
        return None  # No checkpointing in this example

# Alternative trainable function to pass directly to Tune's Tuner
def trainable(config):
    # Perform 10 steps of training
    for step_num in range(10):
        from hpo_train_fn import train_fn  # Import the training function dynamically
        score = train_fn(step_num, **config)
        train.report(score=score)  # Report score to Ray Tune

# Function to search for the best hyperparameters using Ray Tune
def find_minimum(num_concurrent_trials=2, num_samples=4, metric_name="score"):
    search_space = {
        "width": tune.uniform(0, 20),  # Hyperparameter search space for width
        "height": tune.uniform(-100, 100),  # Hyperparameter search space for height
    }

    # Set up the Ray Tune Tuner
    tuner = tune.Tuner(
        Trainable,  # Use custom Trainable class
        tune_config=tune.TuneConfig(
            metric=metric_name,  # Metric to optimize
            mode="max",  # Maximize the metric
            max_concurrent_trials=num_concurrent_trials,  # Max concurrent trials
            num_samples=num_samples,  # Number of trials to run
            reuse_actors=True,  # Reuse actors to speed up training
        ),
        param_space=search_space,  # Parameter space to explore
    )
    tuner.fit()  # Run the tuning process
    return tuner.get_results().get_best_result()  # Return the best result

if __name__ == "__main__":
    # Set up a remote cluster in AWS using Runhouse
    cluster = rh.cluster(
        name="rh-cpu",  # Cluster name
        default_env=rh.env(reqs=["ray[tune]"]),  # Environment with Ray Tune installed
        instance_type="CPU:4+",  # Use instances with 4+ CPUs
        provider="aws",  # AWS provider
    ).up_if_not()  # Start the cluster if not already up

    # Distribute the `find_minimum` function to the cluster
    remote_find_minimum = rh.function(find_minimum).to(cluster).distribute("ray")

    # Call the remote function to find the best hyperparameter configuration
    best_result = remote_find_minimum()
```