[Bug] [Core] Unable to schedule fractional gpu jobs
Search before asking
- I searched the issues and found no similar issues.
Ray Component
Ray Core
What happened + What you expected to happen
Please find the minimal reproducible example below. I’m trying to run the following script on a Ray cluster with two nodes, each with 8 GPUs:
import ray

ray.init(address="auto")

required_gpus = 0.6
n_actors = 10

@ray.remote(num_gpus=required_gpus)
class A:
    def __init__(self, idx):
        self.idx = idx

    def f(self):
        return self.idx

print(ray.cluster_resources())
print("-" * 10)

actors = [A.remote(i) for i in range(n_actors)]
ray.get([a.f.remote() for a in actors])
The program will hang forever with the following message:
$ python test_ray.py
2021-12-07 06:40:15,355 INFO worker.py:843 -- Connecting to existing Ray cluster at address: 172.31.50.151:6379
{'object_store_memory': 308670280089.0, 'CPU': 128.0, 'accelerator_type:V100': 2.0, 'GPU': 16.0, 'memory': 710230653543.0, 'node:172.31.50.151': 1.0, 'node:172.31.53.4': 1.0}
----------
2021-12-07 06:40:33,496 WARNING worker.py:1245 -- The actor or task with ID ffffffffffffffffd01a37602c435349b99b9d1d09000000 cannot be scheduled right now. You can ignore this message if this Ray cluster is expected to auto-scale or if you specified a runtime_env for this actor or task, which may take time to install. Otherwise, this is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increasing the resources available to this Ray cluster.
Required resources for this actor or task: {GPU: 0.600000}, {CPU: 1.000000}
Available resources on this node: {56.000000/64.000000 CPU, 17097537210.009766 GiB/17097537210.009766 GiB memory, 3.200000/8.000000 GPU, 7536779339.990234 GiB/7536779339.990234 GiB object_store_memory, 1.000000/1.000000 accelerator_type:V100, 1.000000/1.000000 node:172.31.50.151}
In total there are 0 pending tasks and 2 pending actors on this node.
Clearly it is possible to schedule 10 actors with required_gpus=0.6 on a 16-GPU cluster.
The program passes when I set required_gpus=0.9 and n_actors=10, or when I set num_gpus=0.25 and n_actors=40.
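For reference, here is a quick back-of-the-envelope check of why these counts should fit (a plain-Python sketch, not Ray code; it assumes each actor's fractional request must be satisfied by a single physical GPU):

import math

def max_actors(total_gpus, gpus_per_actor):
    # Each fractional request must fit on one physical GPU, so a single GPU
    # can host at most floor(1 / gpus_per_actor) such actors.
    return total_gpus * math.floor(1 / gpus_per_actor)

print(max_actors(16, 0.6))   # 16 -> the 10 requested actors should fit
print(max_actors(16, 0.9))   # 16 -> 10 actors fit (this case passes)
print(max_actors(16, 0.25))  # 64 -> 40 actors fit (this case passes)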
I think the bug is caused by the following: after the scheduler has placed 8 required_gpus=0.6 actors on a node, it thinks the node still has 8 - 8 * 0.6 = 3.2 GPUs available, so it tries to schedule another actor onto the same node. In reality no single GPU on that node has 0.6 GPU free (each has only 0.4), so the actor can never fit there.
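A minimal sketch of the suspected accounting difference (hypothetical helper names, not Ray internals): the scheduler appears to compare the request against the node's summed free GPU fraction, whereas feasibility actually requires a single GPU with enough free capacity:

def fits_by_aggregate(free_per_gpu, request):
    # What the scheduler appears to do: compare against the summed free fraction.
    return sum(free_per_gpu) >= request

def fits_on_some_gpu(free_per_gpu, request):
    # What placement actually requires: one GPU with enough free capacity.
    return any(free >= request for free in free_per_gpu)

# After 8 actors with num_gpus=0.6 on an 8-GPU node, each GPU has 0.4 free.
free = [0.4] * 8
print(fits_by_aggregate(free, 0.6))   # True  (3.2 GPUs "available" in total)
print(fits_on_some_gpu(free, 0.6))    # False (no single GPU has 0.6 free)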
Versions / Dependencies
I tried both v1.9.0 and the nightly build; both fail.
Reproduction script
See the example above.
Anything else
No response
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
I think there are 3 possible solutions.
@iycheng do you have bandwidth to fix this for 2.1?