
[RLlib] Trial workers are not co-located


What is the problem?

When using Tune as the scheduler and resource manager for multiple trials, especially when each trial fits on a single Ray node (machine), the expected behaviour is that Tune schedules all of a trial's workers on the same node. In practice, RolloutWorker processes from different trials appear to be mixed and matched across nodes.

Ray version and other system information (Python version, TensorFlow version, OS):

  • Ray 0.7.2
  • Docker 19.0.3
  • PyTorch 1.5.1
  • Ubuntu 18.04.5
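
As a quick sanity check, a minimal sketch (assuming the old RLlib "agents" API that matches the Ray version above, an already-running multi-node cluster, and gym's CartPole-v0 as a stand-in environment) builds a trainer directly and asks each RolloutWorker which host it landed on:

    import socket

    import ray
    from ray.rllib.agents.a3c import A3CTrainer

    ray.init(address="auto")  # connect to the already-running cluster

    # Build a trainer outside of Tune purely to inspect worker placement.
    trainer = A3CTrainer(env="CartPole-v0", config={
        "use_pytorch": True,  # matches the repro config; newer RLlib uses "framework": "torch"
        "num_workers": 4,
    })

    # Ask every RolloutWorker for its hostname; the first entry is the local
    # (driver-side) worker, the rest are the remote worker actors.
    print(trainer.workers.foreach_worker(lambda w: socket.gethostname()))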

Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):

Run a Tune experiment across multiple Ray nodes (I used 10 nodes with 32 CPU cores each, so num_workers=31, leaving 1 core for the driver), with the following config:

    import copy

    from ray import tune
    # Imports assume the old RLlib "agents" API matching the Ray version above.
    from ray.rllib.agents.a3c import A3CTrainer, DEFAULT_CONFIG

    # Copy the default config rather than mutating the module-level dict.
    config = copy.deepcopy(DEFAULT_CONFIG)
    config.update({
        "env": <your own environment of choice>,
        "use_pytorch": True,
        "num_workers": 31,
        "num_envs_per_worker": tune.grid_search([3, 5]),
    })

    tune.run(
        A3CTrainer,
        name=<experiment name>,
        stop={
            "timesteps_total": 200e6,  # 200 million steps
        },
        num_samples=1,
        config=config,
        checkpoint_freq=50,
        max_failures=-1,
    )

If we cannot run your script, we cannot fix your issue.

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.
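
For reference, the co-location behaviour requested here can be expressed explicitly with Tune's placement group support. The sketch below is a minimal illustration rather than a drop-in replacement for the A3C config above: RLlib trainers derive their own bundle layout from num_workers, so it uses a plain function trainable, and the PlacementGroupFactory import path and the resources_per_trial argument can vary between Ray versions:

    from ray import tune
    from ray.tune import PlacementGroupFactory  # older Ray: ray.tune.utils.placement_groups

    def trainable(config):
        # Stand-in for real training logic; returns a final result dict.
        return {"score": 0.0}

    tune.run(
        trainable,
        # One bundle per process: the first is reserved for the trainable itself,
        # the remaining 31 stand in for its rollout workers. STRICT_PACK forces
        # all bundles of a trial onto a single node.
        resources_per_trial=PlacementGroupFactory(
            [{"CPU": 1}] * 32,
            strategy="STRICT_PACK",
        ),
        num_samples=2,
    )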

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 20 (20 by maintainers)

Top GitHub Comments

3 reactions
roireshef commented, Dec 17, 2020

OK, this actually works quite well now, running Tune with 27 trials (one 32-core node each). Thanks a lot for your support!!

2 reactions
rkooo567 commented, Dec 3, 2020

Btw, there’s a high possibility that this is the root cause: https://github.com/ray-project/ray/issues/12546

So GCS actor management basically distributes actors rather than co-locating them, even when co-location is possible. This also explains why object broadcast issues were coming up: since the actors are more spread out, there are more object transfers.

@ffbin will work on this issue shortly; let’s see if that relieves the problem.
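
To make that scheduling behaviour concrete, here is a minimal standalone sketch (assuming a recent Ray version; older releases passed placement_group=pg directly to .options()) contrasting default actor placement with a STRICT_PACK placement group, which forces all of its bundles onto a single node:

    import socket
    from collections import Counter

    import ray
    from ray.util.placement_group import placement_group
    from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

    ray.init(address="auto")

    @ray.remote(num_cpus=1)
    class Probe:
        def host(self):
            return socket.gethostname()

    # Default scheduling: the GCS actor manager is free to spread these actors
    # across the cluster.
    spread = [Probe.remote() for _ in range(8)]
    print(Counter(ray.get([p.host.remote() for p in spread])))

    # STRICT_PACK: all bundles must fit on one node, so actors scheduled into
    # the group end up co-located.
    pg = placement_group([{"CPU": 1}] * 8, strategy="STRICT_PACK")
    ray.get(pg.ready())
    packed = [
        Probe.options(
            scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=pg)
        ).remote()
        for _ in range(8)
    ]
    print(Counter(ray.get([p.host.remote() for p in packed])))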

