[rllib] APEX DQN performance regression?
What is the problem?
It says in the pong_apex.yaml tuned config:
# This can be expected to reach 20.8 reward within an hour when using a V100 GPU
# (e.g. p3.2xl instance on AWS, and m4.4xl workers). It also can reach ~21 reward
# within an hour with fewer workers (e.g. 4-8) but less reliably.
I trained this example on an AWS p3.2xlarge instance (4 workers, 8 vec_env per worker) but could not replicate that statement. It took 4.5 hours of training, and 10M timesteps sampled and trained on, to reach a mean reward of 19.
But maybe this is just the expected behavior with fewer rollout workers? I don’t know what the expected number of samples to convergence is here.
For comparison, training curves for Rainbow in Dopamine show good performance within 10 × 250k = 2.5M timesteps, although the algorithm and hyperparameters certainly aren’t directly comparable.
Here’s a full record of the run: https://app.wandb.ai/zplizzi/test/runs/2dthszrq?workspace=user-zplizzi
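Regarding the worker-count question above: the throughput claim in the tuned config assumes the default Ape-X scale (num_workers defaulted to 32 at the time, I believe, hence the fleet of m4.4xl workers). A minimal sketch for inspecting those defaults; the exact values depend on your Ray version, so treat the numbers in the comments as assumptions:
from ray.rllib.agents import dqn

# Inspect the defaults the tuned example builds on; values vary by Ray version.
defaults = dqn.apex.APEX_DEFAULT_CONFIG
print(defaults["num_workers"])                 # 32 in nightlies of that era, I believe
print(defaults["target_network_update_freq"])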
Ray version and other system information (Python version, TensorFlow version, OS):
- Ray nightly wheels as of earlier today
- TensorFlow 1.14.0
- Ubuntu 16.04
Reproduction
Here’s the exact script used for training. All parameters are directly from the tuned example:
from ray import tune
from ray.rllib.agents import dqn
from ray.rllib.agents.dqn import ApexTrainer
from wandb.ray import WandbLogger

# Start from the Ape-X defaults, then apply the overrides from the tuned example.
config = dqn.apex.APEX_DEFAULT_CONFIG.copy()
config["env"] = "PongNoFrameskip-v4"
config["monitor"] = True
config["env_config"]["wandb"] = {"project": "test", "monitor_gym": True}
config["target_network_update_freq"] = 50000
config["num_workers"] = 4
config["num_envs_per_worker"] = 8
config["gamma"] = 0.99
config["lr"] = 0.0001

tune.run(
    ApexTrainer,
    loggers=[WandbLogger],
    config=config,
)
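Alternatively, the tuned example can be run unmodified, so the worker count matches what the quoted comment assumes. A sketch, assuming the yaml lives at the usual tuned_examples path in the Ray checkout (this mirrors what rllib train -f does, as far as I understand):
import yaml
from ray import tune

# Load the tuned example spec as-is; the path is an assumption, adjust to your checkout.
with open("ray/rllib/tuned_examples/pong-apex.yaml") as f:
    experiments = yaml.safe_load(f)

tune.run_experiments(experiments)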
Top GitHub Comments
Got it, thanks!
For what it’s worth, I re-ran that test with a few modified hyperparameters (all else the same), and it’s performing much better: almost done at 5M training steps / 1.5M env steps / 40 minutes on the same machine. But I could imagine that the original hyperparameters give better wall-clock performance for the 32-worker case they were designed for.
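Purely as an illustrative sketch (hypothetical values, not the ones from the run above), these are the knobs one would typically adjust when scaling Ape-X down from the 32-worker default:
from ray.rllib.agents import dqn

config = dqn.apex.APEX_DEFAULT_CONFIG.copy()
# Hypothetical adjustments for a small-worker setup; not the actual values used.
config["num_workers"] = 4
config["buffer_size"] = 1000000               # smaller replay buffer
config["learning_starts"] = 10000             # begin learning sooner at lower sample throughput
config["target_network_update_freq"] = 50000  # more frequent target syncs (in train steps)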
Sure, here’s a link to the run: https://app.wandb.ai/zplizzi/test/runs/ayuuhixr?workspace=user-zplizzi