
[rllib] Nearly no parallelization while Training PPOAgent


System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04
  • Ray installed from (source or binary): binary
  • Ray version: 0.6.2
  • Python version: 3.6.7
  • Exact command to reproduce:

Describe the problem

I am training a PPOAgent with a custom single-agent environment on a Kubernetes cluster on AWS with one head node and 3 worker nodes. Each node has 3500 mCPU requested and is limited to that amount. When I start the training, usually only the head node seems to use more than one CPU; the 3 worker nodes use at most 1 CPU each. Is this a property of PPO, or is it a bug? Or did I just miss something?

Source code / logs

import ray
import ray.rllib.agents.ppo as ppo
from ray.tune.logger import pretty_print
from ray.tune.registry import register_env


def env_creator(env_config):
    import gym
    import simple_beer_game
    env = gym.make('SimpleBeerGame-v1')
    return env

# This code is executed on the head node, where redis is running.
ray.init(redis_address='localhost:6379')

config = ppo.DEFAULT_CONFIG.copy()
config['env_config'] = {}
config['gamma'] = 0.9
config['model']['conv_filters'] = None
config['model']['fcnet_activation'] = 'relu'
config['num_workers'] = 1  # number of parallel rollout workers RLlib launches
config['model']['fcnet_hiddens'] = [50, 100, 100]
register_env("SimpleBeerGame", env_creator)

agent = ppo.PPOAgent(config=config, env="SimpleBeerGame")

for i in range(100):
    result = agent.train()

Ray is started with this command on the head node:

ray start --block --head --no-ui --redis-port "${REDIS_PORT}" --object-manager-port "${OBJECT_MANAGER_PORT}" --node-manager-port "${NODE_MANAGER_PORT}"

and this command on the worker nodes:

ray start --block --redis-address "${RAY_HEAD_SVC}":"${REDIS_PORT}" --object-manager-port "${OBJECT_MANAGER_PORT}" --node-manager-port "${NODE_MANAGER_PORT}"
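Not part of the original report, but a quick sanity check before digging into the RLlib config is to confirm that all three worker nodes actually registered with the head node. The sketch below uses the resource-inspection calls from recent Ray releases (ray.cluster_resources() and ray.nodes()); the 0.6.x line used in this issue exposed this differently, so treat it as an illustration of the check rather than a drop-in snippet.

import ray

# Connect to the already-running cluster (newer Ray API; on 0.6.x the call
# was ray.init(redis_address='localhost:6379') as in the script above).
ray.init(address='auto')

# Total resources registered across all nodes, e.g. {'CPU': 14.0, ...}.
# If this only reflects the head node's CPUs, the workers never joined.
print(ray.cluster_resources())

# Per-node view: one dict per machine that joined the cluster.
for node in ray.nodes():
    print(node['NodeManagerAddress'], node['Resources'])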

This screenshot shows the CPU usage while training: [image: CPU usage while training]

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Reactions: 1
  • Comments: 18 (10 by maintainers)

Top GitHub Comments

1 reaction
ericl commented on Feb 1, 2019

I just tried this out on a cluster: rllib train --env=CartPole-v0 --run=PPO --config='{"num_workers": 15, "train_batch_size": 400000}' --redis-address=localhost:6379

And I saw ray_PolicyEvaluator processes using CPU on all nodes. However, there was a bias towards the head node until I increased sample_batch_size to 400000. That's probably just because with a smaller batch size most of the CPU is used by TensorFlow doing SGD, and that only happens on the head node.

So perhaps you just need to increase train_batch_size? I think it's inherent, though, that PPO will use some extra CPU on the head node due to its synchronous optimization. You can also try out APPO, which uses the IMPALA async strategy.
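Applied to the script from the issue, that suggestion amounts to giving RLlib more rollout workers and a larger train batch. A minimal sketch of the relevant changes follows; the worker count of 9 is purely illustrative, while 400000 is the train_batch_size used in the command above.

import ray
import ray.rllib.agents.ppo as ppo
from ray.tune.registry import register_env


def env_creator(env_config):
    # Same custom environment as in the issue.
    import gym
    import simple_beer_game
    return gym.make('SimpleBeerGame-v1')


ray.init(redis_address='localhost:6379')

config = ppo.DEFAULT_CONFIG.copy()
# With num_workers=1 only a single rollout process collects samples, so the
# worker nodes have nothing to do. Roughly one worker per spare CPU in the
# cluster lets sampling fan out across nodes; 9 is an illustrative value.
config['num_workers'] = 9
# A larger train batch shifts proportionally more wall-clock time into the
# parallel sampling phase and less into the head-node-only SGD phase.
config['train_batch_size'] = 400000

register_env("SimpleBeerGame", env_creator)
agent = ppo.PPOAgent(config=config, env="SimpleBeerGame")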

0 reactions
MariusDanner commented on Feb 1, 2019

Ok! Thank you again!
