[rllib] Atari broken in 0.7.5+ since RLlib chooses wrong neural net model by default
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04
- Ray installed from (source or binary): binary
- Ray version: 0.7.6
- Python version: 3.6.8
- Exact command to reproduce:
python3 train.py -f pong-appo.yaml
using the RLlib train.py script and the tuned APPO Pong YAML file.
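For reference, roughly the same run can be launched from Python via ray.tune; the values below (stop condition, worker counts) are assumptions taken from the comments in the tuned YAML file rather than a verbatim copy of its hyperparameters:

import ray
from ray import tune

ray.init()

# Rough Python equivalent of `train.py -f pong-appo.yaml`; the exact tuned
# hyperparameters live in the YAML file and are not reproduced here.
tune.run(
    "APPO",
    stop={"timesteps_total": 5000000},  # same 5M-step termination as the config
    config={
        "env": "PongNoFrameskip-v4",
        "num_gpus": 1,
        "num_workers": 32,              # assumed from the YAML comment
        "num_envs_per_worker": 8,       # assumed from the YAML comment
    },
)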
Describe the problem
Upon finishing training (termination at 5M steps as in the config), the reward is still around -20, which is the initial reward of a random agent. The comments in the tuned example say
# This can reach 18-19 reward in ~5-7 minutes on a Titan XP GPU
# with 32 workers and 8 envs per worker. IMPALA, when run with
# similar configurations, solved Pong in 10-12 minutes.
# APPO can also solve Pong in 2.5 million timesteps, which is
# 2x more efficient than that of IMPALA.
which I cannot reproduce.
Training seemed to go smoothly and I didn’t see any errors, apart from the RuntimeWarning: Mean of empty slice and RuntimeWarning: invalid value encountered in double_scalars warnings at the beginning of training, which are mentioned in #5520.
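Given the issue title, one way to sanity-check what changed between versions is to build a local trainer and print which model class RLlib instantiated for the policy. The import path and attribute names below (APPOTrainer, get_policy, .model) reflect my understanding of the 0.7.x API and may differ between RLlib versions, so treat this as a sketch rather than an exact recipe:

import ray
from ray.rllib.agents.ppo.appo import APPOTrainer

ray.init()

# Build a driver-only trainer and inspect the default model RLlib picked
# for the Atari observation space.
trainer = APPOTrainer(env="PongNoFrameskip-v4", config={"num_workers": 0})
policy = trainer.get_policy()
print(type(policy.model))  # expected: a vision/conv network for image frames,
                           # not a fully connected net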
Source code / logs
The final training step logs:
Result for APPO_PongNoFrameskip-v4_0:
custom_metrics: {}
date: 2019-10-31_18-56-51
done: true
episode_len_mean: 3710.01
episode_reward_max: -18.0
episode_reward_mean: -20.35
episode_reward_min: -21.0
episodes_this_iter: 88
episodes_total: 5366
experiment_id: e9ccd551521a44e287451f8d87dd7dbe
hostname: test03-vgqp8
info:
learner:
cur_lr: 0.0005000000237487257
entropy: 1.7659618854522705
mean_IS: 1.1852530241012573
model: {}
policy_loss: -0.003545303363353014
var_IS: 0.21974682807922363
var_gnorm: 23.188478469848633
vf_explained_var: 0.0
vf_loss: 0.01947147212922573
learner_queue:
size_count: 12504
size_mean: 14.46
size_quantiles:
- 12.0
- 13.0
- 15.0
- 16.0
- 16.0
size_std: 1.0432641084595982
num_steps_replayed: 0
num_steps_sampled: 5012800
num_steps_trained: 9999200
num_weight_syncs: 12532
sample_throughput: 6554.589
timing_breakdown:
learner_dequeue_time_ms: 0.018
learner_grad_time_ms: 137.841
learner_load_time_ms: .nan
learner_load_wait_time_ms: .nan
optimizer_step_time_ms: 672.661
train_throughput: 11854.045
iterations_since_restore: 59
node_ip: 192.168.2.40
num_healthy_workers: 32
off_policy_estimator: {}
pid: 34
policy_reward_max: {}
policy_reward_mean: {}
policy_reward_min: {}
sampler_perf:
mean_env_wait_ms: 10.214430495025196
mean_inference_ms: 1.736408154661836
mean_processing_ms: 0.5789328915422826
time_since_restore: 632.1431384086609
time_this_iter_s: 11.452256441116333
time_total_s: 632.1431384086609
timestamp: 1572548211
timesteps_since_restore: 5012800
timesteps_this_iter: 75200
timesteps_total: 5012800
training_iteration: 59
trial_id: b183a16a
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 0/65 CPUs, 0/1 GPUs, 0.0/193.7 GiB heap, 0.0/39.6 GiB objects
Memory usage on this node: 24.5/60.0 GiB
Result logdir: /root/ray_results/pong-appo
Number of trials: 1 ({'TERMINATED': 1})
TERMINATED trials:
- APPO_PongNoFrameskip-v4_0: TERMINATED, [33 CPUs, 1 GPUs], [pid=34], 632 s, 59 iter, 5012800 ts, -20.4 rew
Top GitHub Comments
It seems it works in Ray 0.7.4, but not in 0.7.5+ (reward stays at <= 2 for Breakout no matter how long it trains).
@edoakes could we make sure this ends up in the latest release?
I am still testing this in https://github.com/ray-project/ray/pull/6093; it looks like APEX might still have some issues that are not completely solved by the patch.
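A possible workaround while this is being fixed is to pin the convolutional model explicitly in the trainer config instead of relying on RLlib's default model choice. The filter spec below is an assumption based on RLlib's documented defaults for 42x42 Atari inputs, not something taken from the tuned YAML:

# Explicitly request the vision network for downscaled Atari frames
# (assumed filter spec; adjust if your input resolution differs).
config = {
    "env": "PongNoFrameskip-v4",
    "model": {
        "dim": 42,  # downscale frames to 42x42 before the conv stack
        "conv_filters": [
            # [num_out_channels, kernel, stride]
            [16, [4, 4], 2],
            [32, [4, 4], 2],
            [256, [11, 11], 1],
        ],
    },
}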