Launch and Scale with PyTorch Distributed

Runhouse is the fastest and easiest way to launch multi-node, multi-GPU PyTorch workloads on your own compute.

Works with Your Code and Stack

Runhouse makes it extremely easy to deploy research, training, and inference code written in normal Python on elastic compute or Kubernetes.

  • Runhouse works with your training loop or inference pipelines as-is, with no decorators, repackaging, or DSLs.
  • Compute launch and execution are all in Python, so you can apply standard software DevOps practices to your code and pipelines.

Headache-Free Scale Up

Runhouse is the easiest way to go faster with no additional infrastructure overhead or platform-team lift.

  • Scale from single GPU to distributed training with a single line of code.
  • Manage compute and execution with full visibility and control, including direct SSH access into the distributed clusters.
  • Workloads launch and scale within your own cloud accounts.

Runhouse GitHub

Runhouse is a flexible framework for building a true ML Platform.

Launch Multi-Node Clusters Programmatically

Define your compute requirements and specify cloud provider, region, required resources, and more.

import runhouse as rh

# Create a multi-node cluster
gpus_per_node = 1
num_nodes = 16

img = rh.Image(name="runhouse-image").install_packages(
    [
        "torch==2.5.1",
        "torchvision==0.20.1",
        "Pillow==11.0.0",
    ],
)

gpu_cluster = rh.cluster(
    name=f"rh-{num_nodes}x{gpus_per_node}-gpu",
    gpus=f"A100:{gpus_per_node}",
    num_nodes=num_nodes,
    use_spot=True,  # Can use spot instances easily
    provider="aws",
    image=img,
).up_if_not()

Run PyTorch Distributed

Runhouse is the easiest way to start and robustly execute distributed training runs. Scale up with just one line of code.

import runhouse as rh

# Write a regular Python class and send it to our cluster
from resnet_training import ResNet152Trainer

remote_trainer_class = rh.module(ResNet152Trainer).to(gpu_cluster)

# Instantiate a remote instance and call .distribute() to set up PyTorch
remote_trainer = remote_trainer_class().distribute(
    distribution="pytorch",
    replicas_per_node=gpus_per_node,
    num_replicas=gpus_per_node * num_nodes,
)

Flexible and Developer Friendly

Use Runhouse wherever you like; execution is identical whether you call it from notebooks, orchestrators, or CI/CD platforms.

# Call remote objects on the cluster as if they were local.
remote_trainer.train(
    num_epochs=epochs,
    num_classes=1000,
    train_data_path=train_data_path,
    val_data_path=val_data_path,
)

# Make multi-threaded calls against remote instances
remote_trainer.get_status()

Everything you need to get started with Runhouse today.

See an Example

Learn more about the technical details of Runhouse and try integrating the open-source package into your existing Python code. Here's an example of how to deploy Llama3 to EC2 in just a few lines.

See an Example

Talk to Donny (our founder)

We've been building ML platforms and open-source libraries like PyTorch for over a decade. We'd love to chat and get your feedback!

Book Time

Get in touch 👋

Whether you'd like to learn more about Runhouse or need a little assistance trying out the product, we're here to help.