[rllib] Nearly no parallelization while training PPOAgent
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04
- Ray installed from (source or binary): binary
- Ray version: 0.6.2
- Python version: 3.6.7
- Exact command to reproduce:
Describe the problem
I am training a PPOAgent with a custom single-agent environment on a Kubernetes cluster on AWS with one head node and 3 worker nodes. Each of them has 3500 mCPU requested and is limited to that. When I start the training, usually only the head node seems to use more than one CPU; the 3 worker nodes use at most 1 CPU each. Is this a property of PPO or is this a bug? Or did I just miss something?
Source code / logs
import ray
import ray.rllib.agents.ppo as ppo
from ray.tune.logger import pretty_print
from ray.tune.registry import register_env

def env_creator(env_config):
    # Import inside the creator so remote workers can construct the env as well.
    import gym
    import simple_beer_game
    env = gym.make('SimpleBeerGame-v1')
    return env

# This code is executed on the head node, where redis is running.
ray.init(redis_address='localhost:6379')

config = ppo.DEFAULT_CONFIG.copy()
config['env_config'] = {}
config['gamma'] = 0.9
config['model']['conv_filters'] = None
config['model']['fcnet_activation'] = 'relu'
config['num_workers'] = 1
config['model']['fcnet_hiddens'] = [50, 100, 100]

register_env("SimpleBeerGame", env_creator)
agent = ppo.PPOAgent(config=config, env="SimpleBeerGame")

for i in range(100):
    result = agent.train()
    print(pretty_print(result))  # log each training iteration
Ray is started with this command on the head node:
ray start --block --head --no-ui --redis-port "${REDIS_PORT}" --object-manager-port "${OBJECT_MANAGER_PORT}" --node-manager-port "${NODE_MANAGER_PORT}"
and this command on the worker nodes:
ray start --block --redis-address "${RAY_HEAD_SVC}":"${REDIS_PORT}" --object-manager-port "${OBJECT_MANAGER_PORT}" --node-manager-port "${NODE_MANAGER_PORT}"
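Before kicking off the training, it can help to verify that all worker nodes actually joined the cluster and that their CPUs are visible to Ray. Below is a minimal sketch; it assumes a Ray version that exposes ray.cluster_resources() (the exact API for querying cluster state differs between Ray releases):

import ray

# Connect to the running cluster, same redis address as in the training script.
ray.init(redis_address='localhost:6379')

# If all 3 worker nodes registered with the head node, the reported CPU count
# should be roughly 4 nodes x 3.5 CPUs; if only the head node's CPUs show up,
# the workers never joined and no parallel rollouts can be scheduled there.
print(ray.cluster_resources())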
Screenshot: CPU usage across the nodes while training.
Top GitHub Comments
I just tried this out on a cluster:

rllib train --env=CartPole-v0 --run=PPO --config='{"num_workers": 15, "train_batch_size": 400000}' --redis-address=localhost:6379

and saw ray_PolicyEvaluator processes using CPU on all nodes. However, there was a bias towards the head node until I increased train_batch_size to 400000. This is probably just because with a smaller batch size most of the CPU is used by TensorFlow doing SGD, and that only happens on the head node. So perhaps you just need to increase train_batch_size? I think it's inherent, though, that PPO will have some extra CPU usage on the head node due to its use of synchronous optimization. You can also try out APPO, which uses the IMPALA async strategy.
Ok! Thank you again!