[tune/placement groups] trial resources not being properly allocated in tune.run
Ray version: nightly, commit 34b5781ae0d46b769e8bac400e0cbad6c60c8fcd (Linux, Python 3.7)
I am running a Ray Tune experiment in an interactive Slurm session with 3 GPUs. I have specified resources_per_trial={'cpu': 2, 'gpu': 1} and am running two trials concurrently. However, I am seeing unexpected behavior where both trials are allocated to the same GPU.
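Roughly, the setup looks like this (train_fn, its body, and num_samples are simplified placeholders for the actual experiment):

```python
import ray
from ray import tune


def train_fn(config):
    # Placeholder for the actual trainable; the real experiment trains a
    # model on the GPU assigned to this trial.
    tune.report(score=0.0)


ray.init()
print("Available resources:", ray.available_resources())

tune.run(
    train_fn,
    num_samples=2,  # two trials run concurrently
    resources_per_trial={"cpu": 2, "gpu": 1},  # each trial should get its own GPU
)
```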
Here is the output of ray.available_resources() prior to the tune.run call:
Available resources: {'accelerator_type:GTX': 1.0, 'memory': 72773644084.0, 'GPU': 3.0, 'node:172.24.67.146': 1.0, 'object_store_memory': 35474418892.0, 'CPU': 32.0}
Inside two different trials, the output of ray.get_gpu_ids() and ray.get_resource_ids() is as follows:
(pid=1040823) GPU ids: ['0']
(pid=1040823) Resource ids: {'CPU_group_4d2371f34840192a944062821327f2ea': [(0, 2.0)], 'GPU_group_0_4d2371f34840192a944062821327f2ea': [(0, 1.0)], 'GPU_group_4d2371f34840192a944062821327f2ea': [(0, 1.0)], 'CPU_group_0_4d2371f34840192a944062821327f2ea': [(0, 2.0)]}
(pid=1040826) GPU ids: ['0']
(pid=1040826) Resource ids: {'CPU_group_4acb05de24ad14cfada9a074c6f6e5f5': [(0, 2.0)], 'GPU_group_4acb05de24ad14cfada9a074c6f6e5f5': [(0, 1.0)], 'GPU_group_0_4acb05de24ad14cfada9a074c6f6e5f5': [(0, 1.0)], 'CPU_group_0_4acb05de24ad14cfada9a074c6f6e5f5': [(0, 2.0)]}
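For reference, the lines above were printed from inside the trainable, roughly like this (again a simplified placeholder for the actual training function):

```python
import ray
from ray import tune


def train_fn(config):
    # Log which GPU and which placement-group resources this trial's
    # worker process was actually assigned.
    print("GPU ids:", ray.get_gpu_ids())
    print("Resource ids:", ray.get_resource_ids())
    tune.report(score=0.0)
```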
I was able to avoid the issue by setting the following environment variable: os.environ["TUNE_PLACEMENT_GROUP_AUTO_DISABLED"] = "1"
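Concretely, the workaround in my driver script looks like this:

```python
import os

# Disable Tune's automatic placement group allocation as a workaround.
# This is set before tune.run() is called.
os.environ["TUNE_PLACEMENT_GROUP_AUTO_DISABLED"] = "1"

# ... then define the trainable and call tune.run() exactly as before.
```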
cc: @richardliaw
This seems to be a regression as of ray==1.2.0. GPU ids assigned to the two concurrent trials in each version:
- 1.2: 0 0
- 1.1: 1 0
Closed by #15049