[RLlib] Trial workers are not co-located
What is the problem?
When using Tune as a scheduler and resource manager for multiple trials, especially when each trial fits on a single Ray node (machine), the expected behaviour is that Tune schedules all of a trial's workers on the same node. In reality, there seems to be mixing and matching across nodes, with RolloutWorker processes from different trials ending up on the same node.
Ray version and other system information (Python version, TensorFlow version, OS): Ray 0.7.2, Docker 19.0.3, PyTorch 1.5.1, Ubuntu 18.04.5
Reproduction (REQUIRED)
Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):
Run a Tune experiment across multiple Ray nodes (I used 10 nodes with 32 CPU cores each, so num_workers=31, leaving one core for the driver) with the following config:
from ray import tune
from ray.rllib.agents.a3c import A3CTrainer, DEFAULT_CONFIG

config = DEFAULT_CONFIG.copy()  # copy so the shared default dict is not mutated
config.update({
    "env": <your own environment of choice>,
    "use_pytorch": True,
    "num_workers": 31,
    "num_envs_per_worker": tune.grid_search([3, 5]),
})

tune.run(
    A3CTrainer,
    name=<experiment name>,
    stop={
        "timesteps_total": 200e6,  # 200 million timesteps
    },
    num_samples=1,
    config=config,
    checkpoint_freq=50,
    max_failures=-1,
)
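To double-check where the RolloutWorkers actually land, a quick diagnostic (a sketch only, assuming an RLlib version where Trainer.workers.foreach_worker() is available; the diag_config variable is just for illustration) is to build one trainer directly and ask every worker for its hostname:

import socket

import ray
from ray.rllib.agents.a3c import A3CTrainer

ray.init(address="auto")  # connect to the running cluster (older Ray versions used redis_address=)

# Direct construction needs concrete values, so replace the grid_search entry.
diag_config = dict(config, num_envs_per_worker=3)
trainer = A3CTrainer(config=diag_config)

# foreach_worker() runs the lambda inside each RolloutWorker process, so
# socket.gethostname() reports the node each worker actually landed on.
hostnames = trainer.workers.foreach_worker(lambda w: socket.gethostname())
print(sorted(set(hostnames)))  # co-located workers should print a single host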
If we cannot run your script, we cannot fix your issue.
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.

OK, this actually works quite well now, running Tune with 27 trials (1 node x 32 cores each). Thanks a lot for your support!
Btw, there’s a high likelihood this is the root cause: https://github.com/ray-project/ray/issues/12546
So GCS actor management basically spreads actors across nodes rather than co-locating them when it is possible. This also explains why object broadcast issues came up: since actors are more spread out, there are more object transfers.
@ffbin will work on this issue shortly; let’s see whether that relieves the problem.
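For anyone hitting this later: on newer Ray versions (1.2+, if I remember correctly) co-location can be requested explicitly through placement groups with a STRICT_PACK strategy. A minimal sketch with a dummy trainable (the bundle sizes are scaled down for illustration; the setup above would use 1 driver bundle plus 31 worker bundles):

# Sketch only, assuming Ray >= 1.2 where Tune accepts a PlacementGroupFactory.
# STRICT_PACK asks Ray to place all of a trial's resource bundles on a single
# node, i.e. the co-location behaviour expected in this issue.
import ray
from ray import tune
from ray.tune.utils.placement_groups import PlacementGroupFactory


def dummy_trainable(config):
    tune.report(done=True)  # stand-in for an RLlib Trainer


ray.init()

# One bundle for the trial driver plus one per worker.
pgf = PlacementGroupFactory(
    [{"CPU": 1}] * 4,
    strategy="STRICT_PACK",
)

tune.run(dummy_trainable, resources_per_trial=pgf)

As far as I know, newer RLlib versions also expose a placement_strategy key in the trainer config that controls the strategy of the placement group the Trainer requests for itself.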