
[tune/placement groups] trial resources not being properly allocated in tune.run


Ray version: nightly (commit 34b5781ae0d46b769e8bac400e0cbad6c60c8fcd), running on Linux with Python 3.7.

I am running a Ray Tune experiment in an interactive Slurm session with 3 GPUs. I have specified resources_per_trial={'cpu': 2, 'gpu': 1} and am running two trials concurrently. However, I am seeing unexpected behavior where both trials are allocated to the same GPU.
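For reference, a minimal sketch of the setup described above (the trainable train_fn and the reporting calls inside it are placeholders, not the actual training code):

import os
import ray
from ray import tune

def train_fn(config):
    # Placeholder trainable: just report which resources this trial sees.
    print("GPU ids:", ray.get_gpu_ids())
    print("Resource ids:", ray.get_resource_ids())
    print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
    tune.report(done=True)

ray.init()

tune.run(
    train_fn,
    num_samples=2,  # two trials, expected to run concurrently on separate GPUs
    resources_per_trial={"cpu": 2, "gpu": 1},
)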

Here is the output of ray.available_resources() prior to the tune.run call:

Available resources: {'accelerator_type:GTX': 1.0, 'memory': 72773644084.0, 'GPU': 3.0, 'node:172.24.67.146': 1.0, 'object_store_memory': 35474418892.0, 'CPU': 32.0}

Inside two different trials, the output of ray.get_gpu_ids() and ray.get_resource_ids() is as follows:

(pid=1040823) GPU ids: ['0']
(pid=1040823) Resource ids: {'CPU_group_4d2371f34840192a944062821327f2ea': [(0, 2.0)], 'GPU_group_0_4d2371f34840192a944062821327f2ea': [(0, 1.0)], 'GPU_group_4d2371f34840192a944062821327f2ea': [(0, 1.0)], 'CPU_group_0_4d2371f34840192a944062821327f2ea': [(0, 2.0)]}
(pid=1040826) GPU ids: ['0']
(pid=1040826) Resource ids: {'CPU_group_4acb05de24ad14cfada9a074c6f6e5f5': [(0, 2.0)], 'GPU_group_4acb05de24ad14cfada9a074c6f6e5f5': [(0, 1.0)], 'GPU_group_0_4acb05de24ad14cfada9a074c6f6e5f5': [(0, 1.0)], 'CPU_group_0_4acb05de24ad14cfada9a074c6f6e5f5': [(0, 2.0)]}

I was able to avoid the issue by setting the following environment variable: os.environ["TUNE_PLACEMENT_GROUP_AUTO_DISABLED"] = "1"
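In case it helps others, a sketch of the workaround (reusing the train_fn placeholder from the sketch above, and assuming the variable just needs to be set before tune.run is invoked):

import os

# Workaround: disable Tune's automatic placement-group usage so trials fall
# back to the legacy resource allocation path.
os.environ["TUNE_PLACEMENT_GROUP_AUTO_DISABLED"] = "1"

from ray import tune

tune.run(train_fn, num_samples=2, resources_per_trial={"cpu": 2, "gpu": 1})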

cc: @richardliaw

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

1 reaction
ericl commented, Mar 18, 2021

This seems to be a regression as of ray==1.2.0

import ray

ray.init(num_gpus=2)

# Task that reports which GPU it was assigned.
@ray.remote(num_gpus=1, num_cpus=0)
def f():
    import os
    return os.environ["CUDA_VISIBLE_DEVICES"]

# Two placement groups, each reserving a single-GPU bundle.
pg1 = ray.util.placement_group([{"GPU": 1}])
pg2 = ray.util.placement_group([{"GPU": 1}])

# With proper isolation, the two tasks should land on different GPUs.
print(ray.get(f.options(placement_group=pg1).remote()))
print(ray.get(f.options(placement_group=pg2).remote()))

1.2: 0 0 (both tasks see GPU 0)

1.1: 1 0 (the tasks see different GPUs)
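For completeness, a small check built on the repro above (assuming two physical GPUs are available): with correct placement-group isolation the two tasks should report different CUDA_VISIBLE_DEVICES values, so this assertion passes on 1.1 but fails on 1.2.

dev1 = ray.get(f.options(placement_group=pg1).remote())
dev2 = ray.get(f.options(placement_group=pg2).remote())
# On ray==1.1 the two values differ; on ray==1.2 both are "0" and this fails.
assert dev1 != dev2, f"both tasks were assigned GPU {dev1}"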

0 reactions
wuisawesome commented, Apr 2, 2021

Closed by #15049
