[rllib] Atari broken in 0.7.5+ since RLlib chooses wrong neural net model by default
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04
- Ray installed from (source or binary): binary
- Ray version: 0.7.6
- Python version: 3.6.8
- Exact command to reproduce:
python3 train.py -f pong-appo.yaml
using the RLlib train.py script and the tuned APPO Pong YAML file.
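For reference, roughly the same run can be launched from Python via ray.tune; the values below (stop condition, worker counts) are assumptions taken from the comments in the tuned YAML file rather than a verbatim copy of its hyperparameters:

import ray
from ray import tune

ray.init()

# Rough Python equivalent of `train.py -f pong-appo.yaml`; the exact tuned
# hyperparameters live in the YAML file and are not reproduced here.
tune.run(
    "APPO",
    stop={"timesteps_total": 5000000},  # same 5M-step termination as the config
    config={
        "env": "PongNoFrameskip-v4",
        "num_gpus": 1,
        "num_workers": 32,              # assumed from the YAML comment
        "num_envs_per_worker": 8,       # assumed from the YAML comment
    },
)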
Describe the problem
Upon finishing training (termination at 5M steps as in the config), the reward is still around -20, which is the initial reward of a random agent. The comments in the tuned example say
# This can reach 18-19 reward in ~5-7 minutes on a Titan XP GPU
# with 32 workers and 8 envs per worker. IMPALA, when run with
# similar configurations, solved Pong in 10-12 minutes.
# APPO can also solve Pong in 2.5 million timesteps, which is
# 2x more efficient than that of IMPALA.
which I cannot reproduce.
Training seemed to go smoothly and I didn’t see any errors, apart from the RuntimeWarning: Mean of empty slice and RuntimeWarning: invalid value encountered in double_scalars warnings at the beginning of training, which are mentioned in #5520.
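Given the issue title, one way to sanity-check what changed between versions is to build a local trainer and print which model class RLlib instantiated for the policy. The import path and attribute names below (APPOTrainer, get_policy, .model) reflect my understanding of the 0.7.x API and may differ between RLlib versions, so treat this as a sketch rather than an exact recipe:

import ray
from ray.rllib.agents.ppo.appo import APPOTrainer

ray.init()

# Build a driver-only trainer and inspect the default model RLlib picked
# for the Atari observation space.
trainer = APPOTrainer(env="PongNoFrameskip-v4", config={"num_workers": 0})
policy = trainer.get_policy()
print(type(policy.model))  # expected: a vision/conv network for image frames,
                           # not a fully connected net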
Source code / logs
The final training step logs:
Result for APPO_PongNoFrameskip-v4_0:
custom_metrics: {}
date: 2019-10-31_18-56-51
done: true
episode_len_mean: 3710.01
episode_reward_max: -18.0
episode_reward_mean: -20.35
episode_reward_min: -21.0
episodes_this_iter: 88
episodes_total: 5366
experiment_id: e9ccd551521a44e287451f8d87dd7dbe
hostname: test03-vgqp8
info:
learner:
cur_lr: 0.0005000000237487257
entropy: 1.7659618854522705
mean_IS: 1.1852530241012573
model: {}
policy_loss: -0.003545303363353014
var_IS: 0.21974682807922363
var_gnorm: 23.188478469848633
vf_explained_var: 0.0
vf_loss: 0.01947147212922573
learner_queue:
size_count: 12504
size_mean: 14.46
size_quantiles:
- 12.0
- 13.0
- 15.0
- 16.0
- 16.0
size_std: 1.0432641084595982
num_steps_replayed: 0
num_steps_sampled: 5012800
num_steps_trained: 9999200
num_weight_syncs: 12532
sample_throughput: 6554.589
timing_breakdown:
learner_dequeue_time_ms: 0.018
learner_grad_time_ms: 137.841
learner_load_time_ms: .nan
learner_load_wait_time_ms: .nan
optimizer_step_time_ms: 672.661
train_throughput: 11854.045
iterations_since_restore: 59
node_ip: 192.168.2.40
num_healthy_workers: 32
off_policy_estimator: {}
pid: 34
policy_reward_max: {}
policy_reward_mean: {}
policy_reward_min: {}
sampler_perf:
mean_env_wait_ms: 10.214430495025196
mean_inference_ms: 1.736408154661836
mean_processing_ms: 0.5789328915422826
time_since_restore: 632.1431384086609
time_this_iter_s: 11.452256441116333
time_total_s: 632.1431384086609
timestamp: 1572548211
timesteps_since_restore: 5012800
timesteps_this_iter: 75200
timesteps_total: 5012800
training_iteration: 59
trial_id: b183a16a
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 0/65 CPUs, 0/1 GPUs, 0.0/193.7 GiB heap, 0.0/39.6 GiB objects
Memory usage on this node: 24.5/60.0 GiB
Result logdir: /root/ray_results/pong-appo
Number of trials: 1 ({'TERMINATED': 1})
TERMINATED trials:
- APPO_PongNoFrameskip-v4_0: TERMINATED, [33 CPUs, 1 GPUs], [pid=34], 632 s, 59 iter, 5012800 ts, -20.4 rew
Top GitHub Comments
It seems it works in Ray 0.7.4, but not in 0.7.5+ (reward stays at <= 2 for Breakout no matter how long it trains).
@edoakes could we make sure this ends up in the latest release?
I am still testing this in https://github.com/ray-project/ray/pull/6093; it looks like APEX might still have some issues that are not completely solved by the patch.
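A possible workaround while this is being fixed is to pin the convolutional model explicitly in the trainer config instead of relying on RLlib's default model choice. The filter spec below is an assumption based on RLlib's documented defaults for 42x42 Atari inputs, not something taken from the tuned YAML:

# Explicitly request the vision network for downscaled Atari frames
# (assumed filter spec; adjust if your input resolution differs).
config = {
    "env": "PongNoFrameskip-v4",
    "model": {
        "dim": 42,  # downscale frames to 42x42 before the conv stack
        "conv_filters": [
            # [num_out_channels, kernel, stride]
            [16, [4, 4], 2],
            [32, [4, 4], 2],
            [256, [11, 11], 1],
        ],
    },
}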