Launch and Scale with PyTorch Distributed

Runhouse is the fastest and easiest way to launch multi-node, multi-GPU PyTorch workloads on your own compute.

Works with Your Code and Stack

Runhouse makes it extremely easy to deploy research, training, and inference code written in normal Python on elastic compute or Kubernetes.

  • Runhouse works with your training loop or inference pipelines as-is, with no decorators, repackaging, or DSLs.
  • Compute launch and execution are all in Python, so you can apply standard software DevOps practices to your code and pipelines.

Headache-Free Scale Up

Runhouse is the easiest way to go faster with no additional infrastructure overhead or platform-team lift.

  • Scale from single GPU to distributed training with a single line of code.
  • Manage compute and execution with full visibility and control, including direct SSH access into the distributed clusters.
  • Workloads launch and scale within your own cloud accounts.

Runhouse GitHub

Runhouse is a flexible framework for building a true ML Platform.

Launch Multi-Node Clusters Programmatically

Define your compute requirements and specify cloud provider, region, required resources, and more.

import runhouse as rh

# Create a multi-node cluster
gpus_per_node = 1
num_nodes = 16

img = rh.Image(name="runhouse-image").install_packages(
    [
        "torch==2.5.1",
        "torchvision==0.20.1",
        "Pillow==11.0.0",
    ],
)

gpu_cluster = rh.cluster(
    name=f"rh-{num_nodes}x{gpus_per_node}-gpu",
    gpus=f"A100:{gpus_per_node}",
    num_nodes=num_nodes,
    use_spot=True,  # Can use spot instances easily
    provider="aws",
    image=img,
).up_if_not()

Run PyTorch Distributed

Runhouse is the easiest way to start and robustly execute distributed training runs. Scale up with just one line of code.

import runhouse as rh

# Write a regular Python class and send it to our cluster
from resnet_training import ResNet152Trainer

remote_trainer_class = rh.module(ResNet152Trainer).to(gpu_cluster)

# Instantiate a remote instance and call .distribute() to set up PyTorch
remote_trainer = remote_trainer_class().distribute(
    distribution="pytorch",
    replicas_per_node=gpus_per_node,
    num_replicas=gpus_per_node * num_nodes,
)

Flexible and Developer Friendly

Use Runhouse wherever you like; execution is identical whether you call it from notebooks, orchestrators, or CI/CD platforms.

# Call remote objects on the cluster as if they were local.
remote_trainer.train(
    num_epochs=epochs,
    num_classes=1000,
    train_data_path=train_data_path,
    val_data_path=val_data_path,
)

# Make multi-threaded calls against remote instances
remote_trainer.get_status()

Everything you need to get started with Runhouse today.

See an Example

Learn more about the technical details of Runhouse and try integrating the open-source package into your existing Python code. Here's an example of how to deploy Llama3 to EC2 in just a few lines.

See an Example

Talk to Donny (our founder)

We've been building ML platforms and open-source libraries like PyTorch for over a decade. We'd love to chat and get your feedback!

Book Time

Get in touch 👋

Whether you'd like to learn more about Runhouse or need a little assistance trying out the product, we're here to help.