[RLlib] Trial workers are not co-located
What is the problem?
When using Tune as a scheduler and resource manager for multiple trials, especially when each trial fits on a single Ray node (machine), the expected behaviour is that Tune schedules all of a trial's workers on the same node. In reality, there seems to be mixing and matching across nodes, with RolloutWorker processes from different trials ending up on the same node.
Ray version and other system information (Python version, TensorFlow version, OS): Ray 0.7.2, Docker 19.0.3, PyTorch 1.5.1, Ubuntu 18.04.5
Reproduction (REQUIRED)
Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):
Run a Tune experiment across multiple Ray nodes (I used 10 nodes with 32 CPU cores each, so num_workers=31, leaving one core for the driver) with the following config:
from ray import tune
from ray.rllib.agents.a3c import A3CTrainer, DEFAULT_CONFIG

config = DEFAULT_CONFIG.copy()  # copy so the shared default dict is not mutated
config.update({
    "env": <your own environment of choice>,
    "use_pytorch": True,
    "num_workers": 31,
    "num_envs_per_worker": tune.grid_search([3, 5]),
})

tune.run(
    A3CTrainer,
    name=<experiment name>,
    stop={
        "timesteps_total": 200e6,  # 200 million timesteps
    },
    num_samples=1,
    config=config,
    checkpoint_freq=50,
    max_failures=-1,
)
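To double-check where the RolloutWorkers actually land, a quick diagnostic (a sketch only, assuming an RLlib version where Trainer.workers.foreach_worker() is available; the diag_config variable is just for illustration) is to build one trainer directly and ask every worker for its hostname:

import socket

import ray
from ray.rllib.agents.a3c import A3CTrainer

ray.init(address="auto")  # connect to the running cluster (older Ray versions used redis_address=)

# Direct construction needs concrete values, so replace the grid_search entry.
diag_config = dict(config, num_envs_per_worker=3)
trainer = A3CTrainer(config=diag_config)

# foreach_worker() runs the lambda inside each RolloutWorker process, so
# socket.gethostname() reports the node each worker actually landed on.
hostnames = trainer.workers.foreach_worker(lambda w: socket.gethostname())
print(sorted(set(hostnames)))  # co-located workers should print a single host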
If we cannot run your script, we cannot fix your issue.
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.

OK, this actually works quite well now, running Tune with 27 trials (1 node x 32 cores each). Thanks a lot for your support!
Btw, there’s a high likelihood this is the root cause: https://github.com/ray-project/ray/issues/12546
So GCS actor management basically spreads actors across nodes rather than co-locating them when it is possible. This also explains why object broadcast issues came up: since actors are more spread out, there are more object transfers.
@ffbin will work on this issue shortly; let’s see whether that relieves the problem.
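For anyone hitting this later: on newer Ray versions (1.2+, if I remember correctly) co-location can be requested explicitly through placement groups with a STRICT_PACK strategy. A minimal sketch with a dummy trainable (the bundle sizes are scaled down for illustration; the setup above would use 1 driver bundle plus 31 worker bundles):

# Sketch only, assuming Ray >= 1.2 where Tune accepts a PlacementGroupFactory.
# STRICT_PACK asks Ray to place all of a trial's resource bundles on a single
# node, i.e. the co-location behaviour expected in this issue.
import ray
from ray import tune
from ray.tune.utils.placement_groups import PlacementGroupFactory


def dummy_trainable(config):
    tune.report(done=True)  # stand-in for an RLlib Trainer


ray.init()

# One bundle for the trial driver plus one per worker.
pgf = PlacementGroupFactory(
    [{"CPU": 1}] * 4,
    strategy="STRICT_PACK",
)

tune.run(dummy_trainable, resources_per_trial=pgf)

As far as I know, newer RLlib versions also expose a placement_strategy key in the trainer config that controls the strategy of the placement group the Trainer requests for itself.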