
[tune/placement groups] trial resources not being properly allocated in tune.run


Ray version: nightly (commit 34b5781ae0d46b769e8bac400e0cbad6c60c8fcd), running on Linux with Python 3.7.

I am running a Ray Tune experiment in an interactive Slurm session with 3 GPUs. I have specified resources_per_trial={'cpu': 2, 'gpu': 1} and am running two trials concurrently. However, I am seeing unexpected behavior where both trials are allocated to the same GPU.
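For reference, a minimal sketch of the setup described above (the trainable train_fn and the reporting calls inside it are placeholders, not the actual training code):

import os
import ray
from ray import tune

def train_fn(config):
    # Placeholder trainable: just report which resources this trial sees.
    print("GPU ids:", ray.get_gpu_ids())
    print("Resource ids:", ray.get_resource_ids())
    print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
    tune.report(done=True)

ray.init()

tune.run(
    train_fn,
    num_samples=2,  # two trials, expected to run concurrently on separate GPUs
    resources_per_trial={"cpu": 2, "gpu": 1},
)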

Here is the output of ray.available_resources() prior to the tune.run call:

Available resources: {'accelerator_type:GTX': 1.0, 'memory': 72773644084.0, 'GPU': 3.0, 'node:172.24.67.146': 1.0, 'object_store_memory': 35474418892.0, 'CPU': 32.0}

Inside two different trials, the output of ray.get_gpu_ids() and ray.get_resource_ids() is as follows:

(pid=1040823) GPU ids: ['0']
(pid=1040823) Resource ids: {'CPU_group_4d2371f34840192a944062821327f2ea': [(0, 2.0)], 'GPU_group_0_4d2371f34840192a944062821327f2ea': [(0, 1.0)], 'GPU_group_4d2371f34840192a944062821327f2ea': [(0, 1.0)], 'CPU_group_0_4d2371f34840192a944062821327f2ea': [(0, 2.0)]}
(pid=1040826) GPU ids: ['0']
(pid=1040826) Resource ids: {'CPU_group_4acb05de24ad14cfada9a074c6f6e5f5': [(0, 2.0)], 'GPU_group_4acb05de24ad14cfada9a074c6f6e5f5': [(0, 1.0)], 'GPU_group_0_4acb05de24ad14cfada9a074c6f6e5f5': [(0, 1.0)], 'CPU_group_0_4acb05de24ad14cfada9a074c6f6e5f5': [(0, 2.0)]}

I was able to avoid the issue by setting the following environment variable: os.environ["TUNE_PLACEMENT_GROUP_AUTO_DISABLED"] = "1"
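In case it helps others, a sketch of the workaround (reusing the train_fn placeholder from the sketch above, and assuming the variable just needs to be set before tune.run is invoked):

import os

# Workaround: disable Tune's automatic placement-group usage so trials fall
# back to the legacy resource allocation path.
os.environ["TUNE_PLACEMENT_GROUP_AUTO_DISABLED"] = "1"

from ray import tune

tune.run(train_fn, num_samples=2, resources_per_trial={"cpu": 2, "gpu": 1})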

cc: @richardliaw

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

1 reaction
ericl commented, Mar 18, 2021

This seems to be a regression as of ray==1.2.0

import ray

ray.init(num_gpus=2)

# Task that reports which GPU it was assigned.
@ray.remote(num_gpus=1, num_cpus=0)
def f():
    import os
    return os.environ["CUDA_VISIBLE_DEVICES"]

# Two placement groups, each reserving a single-GPU bundle.
pg1 = ray.util.placement_group([{"GPU": 1}])
pg2 = ray.util.placement_group([{"GPU": 1}])

# With proper isolation, the two tasks should land on different GPUs.
print(ray.get(f.options(placement_group=pg1).remote()))
print(ray.get(f.options(placement_group=pg2).remote()))

1.2: 0 0 (both tasks see GPU 0)

1.1: 1 0 (the tasks see different GPUs)
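For completeness, a small check built on the repro above (assuming two physical GPUs are available): with correct placement-group isolation the two tasks should report different CUDA_VISIBLE_DEVICES values, so this assertion passes on 1.1 but fails on 1.2.

dev1 = ray.get(f.options(placement_group=pg1).remote())
dev2 = ray.get(f.options(placement_group=pg2).remote())
# On ray==1.1 the two values differ; on ray==1.2 both are "0" and this fails.
assert dev1 != dev2, f"both tasks were assigned GPU {dev1}"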

0 reactions
wuisawesome commented, Apr 2, 2021

Closed by #15049
