
[Bug] [Core] Unable to schedule fractional gpu jobs


Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Core

What happened + What you expected to happen

Please find a minimal reproducible example below. I’m trying to run the following script on a Ray cluster with two nodes, each with 8 GPUs:

 import ray

 ray.init(address="auto")

 required_gpus = 0.6
 n_actors = 10

 @ray.remote(num_gpus=required_gpus)
 class A:
     def __init__(self, idx):
         self.idx = idx
     def f(self):
         return self.idx

 print(ray.cluster_resources())
 print("-" * 10)

 actors = [A.remote(i) for i in range(n_actors)]
 ray.get([a.f.remote() for a in actors])

The program will hang forever with the following message:

$ python test_ray.py
2021-12-07 06:40:15,355 INFO worker.py:843 -- Connecting to existing Ray cluster at address: 172.31.50.151:6379
{'object_store_memory': 308670280089.0, 'CPU': 128.0, 'accelerator_type:V100': 2.0, 'GPU': 16.0, 'memory': 710230653543.0, 'node:172.31.50.151': 1.0, 'node:172.31.53.4': 1.0}
----------
2021-12-07 06:40:33,496 WARNING worker.py:1245 -- The actor or task with ID ffffffffffffffffd01a37602c435349b99b9d1d09000000 cannot be scheduled right now. You can ignore this message if this Ray cluster is expected to auto-scale or if you specified a runtime_env for this actor or task, which may take time to install.  Otherwise, this is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increasing the resources available to this Ray cluster.
Required resources for this actor or task: {GPU: 0.600000}, {CPU: 1.000000}
Available resources on this node: {56.000000/64.000000 CPU, 17097537210.009766 GiB/17097537210.009766 GiB memory, 3.200000/8.000000 GPU, 7536779339.990234 GiB/7536779339.990234 GiB object_store_memory, 1.000000/1.000000 accelerator_type:V100, 1.000000/1.000000 node:172.31.50.151}
 In total there are 0 pending tasks and 2 pending actors on this node.

Clearly it should be possible to schedule 10 required_gpus=0.6 actors on a 16-GPU cluster.

The program will pass when I set required_gpus=0.9 and n_actors=10, or when I set num_gpus=0.25 and n_actors=40.
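
As a sanity check, here is a rough capacity calculation in plain Python (a sketch of my own, assuming a fractional request below 1 must fit on a single physical GPU): all three configurations should fit on this 2-node, 16-GPU cluster.

 import math

 NODES = 2
 GPUS_PER_NODE = 8

 def max_actors(gpu_fraction):
     # A fractional request below 1 must fit on a single physical GPU,
     # so each GPU can host floor(1 / fraction) such actors.
     per_gpu = math.floor(1 / gpu_fraction)
     return NODES * GPUS_PER_NODE * per_gpu

 print(max_actors(0.6))   # 16 -> 10 actors should fit
 print(max_actors(0.9))   # 16 -> 10 actors fit (and do in practice)
 print(max_actors(0.25))  # 64 -> 40 actors fit (and do in practice)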

I think the bug is caused by the following: after the scheduler has placed 8 required_gpus=0.6 actors on a node, it thinks the node still has 8 - 8 * 0.6 = 3.2 GPUs available, so it tries to schedule the next actor onto the same node. In reality each of that node’s 8 GPUs only has 0.4 left, so no single GPU can fit another 0.6 request and the actor can never be placed there.
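
A small bookkeeping simulation (not Ray code, just an illustration of the mismatch) shows the problem: the node-level total reports 3.2 GPUs free, yet no individual GPU has 0.6 left.

 # State of one 8-GPU node after eight 0.6-GPU actors were placed,
 # one per physical GPU (hypothetical bookkeeping, not Ray internals).
 gpu_free = [1.0 - 0.6 for _ in range(8)]   # 0.4 left on each GPU

 aggregate_free = sum(gpu_free)             # ~3.2, what the node total reports
 fits_somewhere = any(free >= 0.6 for free in gpu_free)

 print(round(aggregate_free, 2))  # 3.2
 print(fits_somewhere)            # False: no single GPU can take another 0.6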

Versions / Dependencies

I tried both v1.9.0 and the nightly version; both fail.

Reproduction script

See the example above.

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 15 (14 by maintainers)

Top GitHub Comments

1 reaction
rkooo567 commented, Feb 7, 2022

I think there are 3 possible solutions.

  1. Only allow values that match the concept of a “resource instance”, for example 0.25, 0.5, or 0.125. This would be the easiest way to avoid confusion, but users wouldn’t be able to specify values such as 0.3 (a rough validation sketch follows this list).
  2. Improve the error messages. This is not trivial since the error messages are currently generated from the autoscaler, which doesn’t know the details of the resource instance information. To implement this, we should raise the error messages directly from the raylet.
  3. Maybe we can modify the abstraction for GPU specification to use something other than num_gpus. num_gpus has the same semantics as num_cpus but completely different behavior, which makes it a pretty bad abstraction. We would need a detailed proposal for this approach.
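
For illustration, option 1 could look roughly like the check below (the function name and error message are made up; this is not Ray’s actual validation code):

 def validate_fractional_gpu(num_gpus):
     """Hypothetical check: only accept fractions that evenly divide one GPU.

     Values such as 0.5, 0.25 or 0.125 pass because 1 / num_gpus is an
     integer; values such as 0.6 or 0.3 would be rejected.
     """
     if num_gpus <= 0 or num_gpus >= 1:
         return  # whole-GPU (or zero) requests are unaffected
     instances_per_gpu = 1 / num_gpus
     if abs(instances_per_gpu - round(instances_per_gpu)) > 1e-9:
         raise ValueError(
             f"num_gpus={num_gpus} does not evenly divide one GPU; "
             "use a value like 0.5, 0.25 or 0.125"
         )

 validate_fractional_gpu(0.25)  # ok
 validate_fractional_gpu(0.6)   # raises ValueError
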
0 reactions
scv119 commented, Sep 1, 2022

@iycheng do you have bandwidth to fix this for 2.1?

