On your Runhouse cluster, whether you have one node or multiple nodes, you may want to run things in different processes on the cluster.
There are a few key use cases for separating your logic into different processes:
Creating processes that require certain amounts of resources.
Creating processes on specific nodes.
Creating processes with specific environment variables.
General OS process isolation – allowing you to kill things on the cluster without touching other running logic.
You can put your Runhouse Functions/Modules into specific processes, or even run bash commands in specific processes.
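As a purely local analogy for the environment-variable and isolation use cases listed above, the following sketch uses only the Python standard library. It does not call Runhouse at all; `CHILD_CODE` and `run_child` are hypothetical names introduced for illustration. It shows that an environment variable set for one child process is invisible to another, and that a long-running child can be killed without affecting the parent.

```python
import os
import subprocess
import sys

# Child script: report the LOG_LEVEL seen inside the child process.
CHILD_CODE = "import os; print(os.environ.get('LOG_LEVEL', 'unset'))"


def run_child(extra_env=None):
    """Run CHILD_CODE in a fresh Python process with optional extra env vars."""
    env = dict(os.environ)
    env.pop("LOG_LEVEL", None)  # start from a known state
    env.update(extra_env or {})
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Two children, each with its own environment.
plain = run_child()
debug = run_child({"LOG_LEVEL": "DEBUG"})

# A long-running child can be killed without touching the parent process.
sleeper = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
sleeper.kill()
exit_code = sleeper.wait()

print(plain, debug, exit_code)
```

Processes on a Runhouse cluster give you the same kind of separation, with the addition of resource requirements and node placement.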
Let’s set up a basic cluster and some easy logic to send to it.
def see_process_attributes():
    import os
    import time
    import socket

    log_level = os.environ.get("LOG_LEVEL")
    if log_level == "DEBUG":
        print("Debugging...")
    else:
        print("No log level set.")

    # Return the IP that this is scheduled on
    return socket.gethostbyname(socket.gethostname())
import runhouse as rh

cluster = rh.cluster(
    name="multi-gpu-cluster",
    accelerators="A10G:1",
    num_nodes=2,
    provider="aws",
).up_if_not()
I 12-17 13:12:17 provisioner.py:560] Successfully provisioned cluster: multi-gpu-cluster
I 12-17 13:12:18 cloud_vm_ray_backend.py:3402] Run commands not specified or empty.
Clusters
AWS: Fetching availability zones mapping...
NAME               LAUNCHED        RESOURCES                                                                  STATUS  AUTOSTOP  COMMAND
multi-gpu-cluster  a few secs ago  2x AWS(g5.xlarge, {'A10G': 1})                                             UP      (down)    /Users/rohinbhasin/minico...
ml_ready_cluster   1 hr ago        1x AWS(m6i.large, image_id={'us-east-1': 'docker:python:3.12.8-bookwor...  UP      (down)    /Users/rohinbhasin/minico...
We can now create processes based on whatever requirements we want. Covering all the examples above:
# Create some processes with GPU requirements. These will be on different nodes
# since each node only has one GPU, and we'll check that below.
p1 = cluster.ensure_process_created("p1", compute={"GPU": 1})

# This second process will also have an env var set.
p2 = cluster.ensure_process_created("p2", compute={"GPU": 1}, env_vars={"LOG_LEVEL": "DEBUG"})

# We can also send processes to specific nodes if we want
p3 = cluster.ensure_process_created("p3", compute={"node_idx": 1})

cluster.list_processes()
{'default_process': {'name': 'default_process',
'compute': {},
'runtime_env': None,
'env_vars': {}},
'p1': {'name': 'p1',
'compute': {'GPU': 1},
'runtime_env': {},
'env_vars': None},
'p2': {'name': 'p2',
'compute': {'GPU': 1},
'runtime_env': {},
'env_vars': {'LOG_LEVEL': 'DEBUG'}},
'p3': {'name': 'p3',
'compute': {'node_idx': 1},
'runtime_env': {},
'env_vars': None}}
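Since the listing is a plain dictionary keyed by process name, it can be filtered with ordinary Python. The sketch below is a minimal example that works on a literal copy of the output above, not on a live cluster; `processes_requiring` is a hypothetical helper, not part of Runhouse.

```python
def processes_requiring(processes, resource):
    """Return the sorted names of processes whose compute spec mentions `resource`."""
    return sorted(
        name
        for name, spec in processes.items()
        # `compute` may be an empty dict, so fall back to {} defensively.
        if resource in (spec.get("compute") or {})
    )


# Literal copy of the listing shown above.
processes = {
    "default_process": {"name": "default_process", "compute": {}, "runtime_env": None, "env_vars": {}},
    "p1": {"name": "p1", "compute": {"GPU": 1}, "runtime_env": {}, "env_vars": None},
    "p2": {"name": "p2", "compute": {"GPU": 1}, "runtime_env": {}, "env_vars": {"LOG_LEVEL": "DEBUG"}},
    "p3": {"name": "p3", "compute": {"node_idx": 1}, "runtime_env": {}, "env_vars": None},
}

gpu_processes = processes_requiring(processes, "GPU")
pinned_processes = processes_requiring(processes, "node_idx")
print(gpu_processes, pinned_processes)
```

Note that `env_vars` appears as either `None` or `{}` in the listing, so code that reads it should handle both.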
Note that we always create a default_process, which is where all Runhouse Functions/Modules end up if you don’t specify processes when sending them to the cluster. This default_process always lives on the head node of your cluster.
Now, let’s see where these processes ended up, using the utility function we defined above.
remote_f1 = rh.function(see_process_attributes).to(cluster, process=p1)
print(remote_f1())
INFO | 2024-12-17 13:23:01 | runhouse.resources.functions.function:236 | Because this function is defined in a notebook, writing it out to /Users/rohinbhasin/work/notebooks/see_process_attributes_fn.py to make it importable. Please make sure the function does not rely on any local variables, including imports (which should be moved inside the function body). This restriction does not apply to functions defined in normal Python files.
INFO | 2024-12-17 13:23:04 | runhouse.resources.module:507 | Sending module see_process_attributes of type <class 'runhouse.resources.functions.function.Function'> to multi-gpu-cluster
INFO | 2024-12-17 13:23:04 | runhouse.servers.http.http_client:439 | Calling see_process_attributes.call
No log level set.
INFO | 2024-12-17 13:23:04 | runhouse.servers.http.http_client:504 | Time to call see_process_attributes.call: 0.71 seconds
172.31.89.87
remote_f2 = rh.function(see_process_attributes).to(cluster, process=p2)
print(remote_f2())
INFO | 2024-12-17 13:23:32 | runhouse.resources.functions.function:236 | Because this function is defined in a notebook, writing it out to /Users/rohinbhasin/work/notebooks/see_process_attributes_fn.py to make it importable. Please make sure the function does not rely on any local variables, including imports (which should be moved inside the function body). This restriction does not apply to functions defined in normal Python files.
INFO | 2024-12-17 13:23:34 | runhouse.resources.module:507 | Sending module see_process_attributes of type <class 'runhouse.resources.functions.function.Function'> to multi-gpu-cluster
INFO | 2024-12-17 13:23:34 | runhouse.servers.http.http_client:439 | Calling see_process_attributes.call
Debugging...
INFO | 2024-12-17 13:23:35 | runhouse.servers.http.http_client:504 | Time to call see_process_attributes.call: 0.53 seconds
172.31.94.40
We can see that, since each process required one GPU, they were scheduled on different machines: the two calls return different IP addresses. You can also see that the environment variable we set in the second process was propagated, as our logging output is different. Let’s now check that the third process, which we explicitly sent to the second node, is on the second node.
remote_f3 = rh.function(see_process_attributes).to(cluster, process=p3)
print(remote_f3())
INFO | 2024-12-17 13:27:05 | runhouse.resources.functions.function:236 | Because this function is defined in a notebook, writing it out to /Users/rohinbhasin/work/notebooks/see_process_attributes_fn.py to make it importable. Please make sure the function does not rely on any local variables, including imports (which should be moved inside the function body). This restriction does not apply to functions defined in normal Python files.
INFO | 2024-12-17 13:27:08 | runhouse.resources.module:507 | Sending module see_process_attributes of type <class 'runhouse.resources.functions.function.Function'> to multi-gpu-cluster
INFO | 2024-12-17 13:27:08 | runhouse.servers.http.http_client:439 | Calling see_process_attributes.call
No log level set.
INFO | 2024-12-17 13:27:08 | runhouse.servers.http.http_client:504 | Time to call see_process_attributes.call: 0.54 seconds
172.31.94.40
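The placement can also be checked programmatically by grouping process names by the IP address each call returned. This is a small sketch over the three addresses printed above; `group_by_node` is a hypothetical helper, and in a live session the dictionary values would come from `remote_f1()`, `remote_f2()`, and `remote_f3()`.

```python
from collections import defaultdict


def group_by_node(process_ips):
    """Map each IP address to the sorted list of process names that reported it."""
    nodes = defaultdict(list)
    for name, ip in process_ips.items():
        nodes[ip].append(name)
    return {ip: sorted(names) for ip, names in nodes.items()}


# IP addresses copied from the outputs above.
process_ips = {
    "p1": "172.31.89.87",
    "p2": "172.31.94.40",
    "p3": "172.31.94.40",
}

placement = group_by_node(process_ips)
print(placement)
```

Here p1 sits alone on one node, while p2 and p3 share the other.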
Success! We can also run_bash within a specific process, if we want to make sure our bash command runs on the same node as a function we’re running.
cluster.run_bash("ip addr", process=p2)
[[0, '1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000\n link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00\n inet 127.0.0.1/8 scope host lo\n valid_lft forever preferred_lft forever\n inet6 ::1/128 scope host \n valid_lft forever preferred_lft forever\n2: ens5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP group default qlen 1000\n link/ether 12:4c:76:66:e8:bb brd ff:ff:ff:ff:ff:ff\n altname enp0s5\n inet 172.31.94.40/20 brd 172.31.95.255 scope global dynamic ens5\n valid_lft 3500sec preferred_lft 3500sec\n inet6 fe80::104c:76ff:fe66:e8bb/64 scope link \n valid_lft forever preferred_lft forever\n3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default \n link/ether 02:42:ac:9e:2b:8f brd ff:ff:ff:ff:ff:ff\n inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0\n valid_lft forever preferred_lft forever\n', ''], [...]]
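Reading the address out of that raw output by eye is tedious. The sketch below extracts the IPv4 addresses with a regular expression, assuming the result has the shape shown above: a list of `[return_code, stdout, stderr]` entries. That shape is inferred from this one output, not from the Runhouse documentation, and `ipv4_addresses` and `sample_result` are hypothetical names. The sample is an abbreviated version of the output above.

```python
import re

# Matches lines such as "    inet 172.31.94.40/20 ..." but not "inet6 ...".
INET_RE = re.compile(r"^\s*inet (\d+\.\d+\.\d+\.\d+)/\d+", re.MULTILINE)


def ipv4_addresses(run_bash_result):
    """Return one list of IPv4 addresses per [return_code, stdout, stderr] entry."""
    per_entry = []
    for return_code, stdout, _stderr in run_bash_result:
        if return_code != 0:
            # The command failed, so there is nothing reliable to parse.
            per_entry.append([])
            continue
        per_entry.append(INET_RE.findall(stdout))
    return per_entry


# Abbreviated stand-in for the `ip addr` output above.
sample_result = [
    [
        0,
        "1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536\n"
        "    inet 127.0.0.1/8 scope host lo\n"
        "    inet6 ::1/128 scope host \n"
        "2: ens5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001\n"
        "    inet 172.31.94.40/20 brd 172.31.95.255 scope global dynamic ens5\n",
        "",
    ]
]

addresses = ipv4_addresses(sample_result)
print(addresses)
```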
You can see that this ran on the second node: the ens5 interface reports 172.31.94.40, the same address returned by the functions in p2 and p3. Finally, you can also kill processes, which you may want to do if you use asyncio to run long-running functions in a process.
cluster.kill_process(p3)
cluster.list_processes()
{'default_process': {'name': 'default_process',
'compute': {},
'runtime_env': None,
'env_vars': {}},
'p1': {'name': 'p1',
'compute': {'GPU': 1},
'runtime_env': {},
'env_vars': None},
'p2': {'name': 'p2',
'compute': {'GPU': 1},
'runtime_env': {},
'env_vars': {'LOG_LEVEL': 'DEBUG'}}}
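To confirm which processes disappeared, compare the listing taken before the kill with the one taken after. The sketch below works on the process names from the two listings in this guide; `removed_processes` is a hypothetical helper, and only the dictionary keys matter for the comparison.

```python
def removed_processes(before, after):
    """Return the sorted names present in `before` but missing from `after`."""
    return sorted(set(before) - set(after))


# Process names from the two listings shown in this guide.
before_kill = {"default_process": {}, "p1": {}, "p2": {}, "p3": {}}
after_kill = {"default_process": {}, "p1": {}, "p2": {}}

killed = removed_processes(before_kill, after_kill)
print(killed)
```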