[rllib] APEX DQN performance regression?
What is the problem?
It says in the pong_apex.yaml tuned config:
# This can be expected to reach 20.8 reward within an hour when using a V100 GPU
# (e.g. p3.2xl instance on AWS, and m4.4xl workers). It also can reach ~21 reward
# within an hour with fewer workers (e.g. 4-8) but less reliably.
I trained this example on an AWS p3.2xlarge instance (4 workers, 8 vec_env per worker) but could not replicate that statement. It took 4.5 hours of training, and 10M timesteps sampled and trained on, to reach a mean reward of 19.
But maybe this is just the expected behavior with fewer rollout workers? I don’t know what the expected number of samples to convergence is here.
For comparison, training curves for Rainbow in Dopamine show good performance within 10 × 250k = 2.5M timesteps, although the algorithm and hyperparameters certainly aren’t directly comparable.
Here’s a full record of the run: https://app.wandb.ai/zplizzi/test/runs/2dthszrq?workspace=user-zplizzi
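Regarding the worker-count question above: the throughput claim in the tuned config assumes the default Ape-X scale (num_workers defaulted to 32 at the time, I believe, hence the fleet of m4.4xl workers). A minimal sketch for inspecting those defaults; the exact values depend on your Ray version, so treat the numbers in the comments as assumptions:
from ray.rllib.agents import dqn

# Inspect the defaults the tuned example builds on; values vary by Ray version.
defaults = dqn.apex.APEX_DEFAULT_CONFIG
print(defaults["num_workers"])                 # 32 in nightlies of that era, I believe
print(defaults["target_network_update_freq"])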
Ray version and other system information (Python version, TensorFlow version, OS):
- Ray nightly wheels as of earlier today
- TensorFlow 1.14.0
- Ubuntu 16.04
Reproduction
Here’s the exact script used for training. All parameters are directly from the tuned example:
from ray import tune
from ray.rllib.agents import dqn
from ray.rllib.agents.dqn import ApexTrainer
from wandb.ray import WandbLogger

# Start from the Ape-X defaults, then apply the overrides from the tuned example.
config = dqn.apex.APEX_DEFAULT_CONFIG.copy()
config["env"] = "PongNoFrameskip-v4"
config["monitor"] = True
config["env_config"]["wandb"] = {"project": "test", "monitor_gym": True}
config["target_network_update_freq"] = 50000
config["num_workers"] = 4
config["num_envs_per_worker"] = 8
config["gamma"] = 0.99
config["lr"] = 0.0001

tune.run(
    ApexTrainer,
    loggers=[WandbLogger],
    config=config,
)
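Alternatively, the tuned example can be run unmodified, so the worker count matches what the quoted comment assumes. A sketch, assuming the yaml lives at the usual tuned_examples path in the Ray checkout (this mirrors what rllib train -f does, as far as I understand):
import yaml
from ray import tune

# Load the tuned example spec as-is; the path is an assumption, adjust to your checkout.
with open("ray/rllib/tuned_examples/pong-apex.yaml") as f:
    experiments = yaml.safe_load(f)

tune.run_experiments(experiments)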
Top GitHub Comments
Got it, thanks!
For what it’s worth, I re-ran that test with a few modified hyperparameters (all else the same), and it’s performing much better: almost done at 5M training steps / 1.5M env steps / 40 minutes on the same machine. But I could imagine that the original hyperparameters give better wall-clock performance for the 32-worker case they were designed for.
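Purely as an illustrative sketch (hypothetical values, not the ones from the run above), these are the knobs one would typically adjust when scaling Ape-X down from the 32-worker default:
from ray.rllib.agents import dqn

config = dqn.apex.APEX_DEFAULT_CONFIG.copy()
# Hypothetical adjustments for a small-worker setup; not the actual values used.
config["num_workers"] = 4
config["buffer_size"] = 1000000               # smaller replay buffer
config["learning_starts"] = 10000             # begin learning sooner at lower sample throughput
config["target_network_update_freq"] = 50000  # more frequent target syncs (in train steps)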
Sure, here’s a link to the run: https://app.wandb.ai/zplizzi/test/runs/ayuuhixr?workspace=user-zplizzi