Managed AI Training inside Your Own Cloud

Run distributed ML workloads (PyTorch, Ray, Spark, and more) without thinking about infrastructure. Cost optimization, fault tolerance, and push-button scaling come built in.

Pythonic and YAML-free
Multi-cloud + Kubernetes
Magic Iteration and Debugging
Robust Production Toolchain

Powerful compute at your fingertips

Access compute on-demand in code, without any infra setup or magic CLI/YAML incantations to learn. Just sign up, configure your compute pools (all inside your own VPC), and get running.

import runhouse as rh

# Request 4 nodes with 8 GPUs each
my_cluster = rh.cluster(
    name="rh-4x8xA100s",
    gpus="A100:8",
    num_nodes=4,
    image=rh.Image("base_image").from_docker("nvcr.io/nvidia/pytorch:23.10-py3"),
    autostop_mins=60,
).up_if_not()

# Load the compute handle from anywhere to reuse it
my_cluster = rh.cluster(name="rh-4x8xA100s").up_if_not()

Iterate and debug natively

Dispatch any Python function or class to your compute. It acts like local code, passing and returning arguments, propagating exceptions, and streaming prints and logs.

# Define or import a regular Python function
def train(epochs):
    torch.distributed.init_process_group(backend="nccl")
    ...
    return test_loss

# Deploy and distribute your training function,
# hot-syncing up any local code changes
remote_train = rh.fn(train) \
    .to(my_cluster) \
    .distribute("pytorch")

# Call your function as you would locally
res = remote_train(epochs=3)

Distributed ML without the lift

Use distributed frameworks like Ray, PyTorch, Spark, and Dask natively, without the infra legwork of launching and managing the clusters yourself.

# Run distributed data processing over large datasets
process = rh.fn(dask_preproc) \
    .to(rh_8x4_cpus) \
    .distribute('dask')
process(s3_path)

# Execute a multinode PyTorch or Lightning training
trainer = rh.module(PyTorchTraining).to(rh_8x4_gpus)
remote_trainer = trainer().distribute('pytorch')
remote_trainer.load_checkpoint()
remote_trainer.train(epochs=1)

# Use Ray Data for batch inference once the training is done
batch_inference = rh.fn(ray_batch_inference) \
    .to(rh_4x1gpus) \
    .distribute('ray')
batch_inference(s3_path)

Research-to-production, instantly

Runhouse works within regular Python, so it runs identically in your IDE, notebook, CI, or pipeline orchestrator. You can push to production and manage your ML lifecycle with familiar software best practices.

from datetime import timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

import runhouse as rh


def preprocess_dask():
    my_cpus = rh.cluster(name="dask_preproc", num_cpus=64)
    ...

def train(args):
    my_gpus = rh.cluster(name="train_pytorch", gpus="A100:8")
    ...

dag = DAG(
    "pytorch_training_pipeline",
    default_args=default_args,  # default_args defined elsewhere
    description="A simple PyTorch training DAG with multiple steps",
    schedule=timedelta(days=1),
)

preprocess_task = PythonOperator(
    task_id="preprocess_task",
    python_callable=preprocess_dask,
    dag=dag,
)

train_model_task = PythonOperator(
    task_id="train_model_task",
    python_callable=train,
    dag=dag,
)

Durable execution in production

Runhouse is architected to prevent cascading failures and dependency hell, and it gives you unique fault-tolerance and debugging capabilities in production. Escape opaque OOMs and NCCL errors.

def preproc(batch_size):
    ...

memory = 16
cpus = 4
tries = 3
for i in range(tries):
    try:
        compute = rh.cluster(
            name=f"{cpus}xCPU-cluster",
            memory=f"{memory}+",
            num_cpus=cpus,
        )
        remote_preproc = rh.fn(preproc) \
            .to(compute) \
            .distribute("dask")
        res = remote_preproc(batch_size=64)
        compute.teardown()
        break
    except rh.OOM as e:
        compute.teardown()
        memory *= 2
        cpus *= 2

# Load up a production cluster to debug from anywhere
c = rh.cluster(name="8xCPUs")
c.notebook()  # Tunnel back a notebook instantly
c.ssh()
c.ssh_tunnel(port=8787)  # Tunnel back the Dask Dashboard

ML that just runs

Everyone wants to do AI until it's time to do AI infrastructure. There's a better way.

Skip the lift

Runhouse integrates easily with your existing ML code, pipelines, and development workflows.

$ pip install runhouse

"Faster, please"

Runhouse turns Python programmers into 10x ML engineers. Speed up training runs, batch inference, or hyperparameter sweeps instantly with compute autoscaling, and forget "push and pray" debugging.
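For instance, a simple hyperparameter sweep can reuse the same dispatch pattern shown above. This is a minimal sketch under assumed names: the train function body, the cluster name, and the learning-rate grid are illustrative placeholders, and the sweep runs sequentially for brevity.

import runhouse as rh

# Placeholder training function; returns a validation loss
def train(lr):
    ...
    return val_loss

# Illustrative cluster name and shape
sweep_cluster = rh.cluster(name="rh-sweep-a100s", gpus="A100:1", num_nodes=4).up_if_not()
remote_train = rh.fn(train).to(sweep_cluster)

# Try each learning rate on the remote compute and keep the best one
results = {lr: remote_train(lr=lr) for lr in [1e-5, 3e-5, 1e-4]}
best_lr = min(results, key=results.get)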

Graphic showing a laptop with a Python logo connecting to modules and functions running on cloud compute

50% off your compute bill

Runhouse optimizes resource utilization, runs entirely inside your own cloud accounts, and doesn't charge a compute markup.
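As one concrete illustration of where savings come from, the autostop_mins setting from the first snippet tears down idle clusters automatically, and loading an existing cluster by name avoids launching duplicate compute. A minimal sketch; the cluster name and timeout are illustrative:

import runhouse as rh

# Idle compute shuts itself down after 30 minutes instead of burning budget overnight
cluster = rh.cluster(
    name="rh-shared-a100s",
    gpus="A100:8",
    autostop_mins=30,
).up_if_not()

# A teammate or a later pipeline step reuses the same cluster by name
# rather than spinning up a new one
same_cluster = rh.cluster(name="rh-shared-a100s").up_if_not()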

List of cloud provider logos: AWS, Google, Azure, Kubernetes, AWS Lambda, and SageMaker

Use Cases

Training

PyTorch, Ray, Fine-tuning, HPO

LoRA Fine-Tune Llama 3

Batch Inference

Distributed, LLMs, Multi-step

Multi-GPU DeepSeek-R1 with vLLM

Composite Systems

Pipelines, Multi-task, Evaluation

Continual Recommender (DLRM) Training with Ray

Data Processing

Ray, Dask, Spark, Data Apps

Parallel GPU Batch Embeddings

Demolish your research-to-production silos

Finally, powerful AI training that's rapid enough for development and robust enough for production.

Line illustration showing researchers and engineers connected to an AI app or code via lines before and after changes

Without Runhouse:

Research happens on siloed compute, sampled data, and notebook code to enable iterative development. Reaching production requires a slow translation into orchestrators, and the resulting pipelines are hard to debug when errors arise.

Diagram showing a block with "Code development using regular SDLC" above a smaller "Compute" block and an arrow with "Runhouse manages dispatch" between

With Runhouse: Fast Software Development

Code is written and executed identically in research and production. Errors can be debugged on a branch from local IDEs and merged into production using a standard development lifecycle.
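As a sketch of what "identical in research and production" means in practice, the same named cluster and dispatched function can be loaded from a notebook or IDE during development and, unchanged, from an orchestrator task in production. The names below are illustrative; the pattern mirrors the Airflow example above.

import runhouse as rh

def train(epochs):
    ...

# During research, in a notebook or local IDE
cluster = rh.cluster(name="train_pytorch", gpus="A100:8").up_if_not()
remote_train = rh.fn(train).to(cluster).distribute("pytorch")
remote_train(epochs=1)

# In production, the same lines run inside an orchestrator task
def train_task():
    cluster = rh.cluster(name="train_pytorch", gpus="A100:8").up_if_not()
    remote_train = rh.fn(train).to(cluster).distribute("pytorch")
    remote_train(epochs=10)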

Start building on a solid ML foundation.

Book a demo