
Default hyperparameters for `run_atari.py` (using P-DDDQN) fail with Pong and Breakout (log files attached)

See original GitHub issue

The default hyperparameters of baselines/baselines/deepq/experiments/run_atari.py, which is presumably the script we should be using for DQN-based models, fail to gain any noticeable reward on both Breakout and Pong. I’ve attached log files and the steps to reproduce below; the main reason I’m filing this is that it probably makes sense for the provided scripts to have working default hyperparameters. Or, alternatively, perhaps list the ones that work somewhere? Reading run_atari.py, it seems like the number of steps is a bit low and the replay buffer should be 10x larger, but I don’t think that alone will fix the issue, since Pong should be able to learn quickly with this kind of setup.

I know this is probably not the top priority right now, but in theory this is easy to fix (just run it with the correct hyperparameters), and it would be great for users, since running even 10 million steps (the current default) can take over 10 hours on a decent personal workstation. If you’re in the process of refactoring this code, is there any chance you can take this feedback into account? Thank you!

Steps to reproduce:

  • Use a machine with Ubuntu 16.04.
  • I doubt this matters, but I’m also using an NVIDIA Titan X GPU with Pascal.
  • Install baselines as of commit 36ee5d17071424f30071bcdc72ff11b18c577529
  • I used a Python 3.5 virtual environment with TensorFlow 1.8.0.
  • Enter the experiments directory: cd baselines/baselines/deepq/experiments/
  • Finally, run python run_atari.py with either PongNoFrameskip-v4 or BreakoutNoFrameskip-v4 as the --env argument. I kept all other parameters at their default values, so this was prioritized dueling double DQN (a consolidated command sketch follows this list).
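
For convenience, here is a rough consolidation of the steps above into shell commands; the editable pip install line is the usual baselines setup and is an assumption, not copied from the report:

git clone https://github.com/openai/baselines.git
cd baselines
git checkout 36ee5d17071424f30071bcdc72ff11b18c577529
pip install -e .
cd baselines/deepq/experiments/
python run_atari.py --env PongNoFrameskip-v4    # or --env BreakoutNoFrameskip-v4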

By default the logger in baselines will create log.txt, progress.csv, and monitor.csv files that contain information about training runs. Here are the Breakout and Pong log files:

breakout_log.txt pong_log.txt

Since GitHub doesn’t allow uploading csv files, here are the monitor.csv files for Breakout and then for Pong:

https://www.dropbox.com/s/ibl8lvub2igr9kw/breakout_monitor.csv?dl=0 https://www.dropbox.com/s/yuf3din6yjb2swl/pong_monitor.csv?dl=0

Finally, here are the progress.csv files for Breakout and then for Pong:

https://www.dropbox.com/s/79emijmnsdcjm37/breakout_progress.csv?dl=0 https://www.dropbox.com/s/b817wnlyyyriti9/pong_progress.csv?dl=0

Issue Analytics

  • State: open
  • Created: 5 years ago
  • Reactions: 3
  • Comments: 34 (4 by maintainers)

Top GitHub Comments

3 reactions
DanielTakeshi commented, Sep 26, 2018

@Michalos88 @skpenn @vpj @uotter Unfortunately it looks like the refactored code still runs into the same issue. The refactoring helpfully makes the interface uniform, but I am guessing there are still some issues with the core DQN algorithm here. I’ll split this into three parts.

First Attempt

Using commit 4402b8eba67ed472325e9e5d49caa73c605609cf of baselines and the same machine as described in my earlier message, I ran this command for PDD-DQN:

python -m baselines.run --alg=deepq --env=PongNoFrameskip-v4

because that is what they tell us to run in the README: https://github.com/openai/baselines/tree/master/baselines/deepq

Unfortunately I get -20.7. The logs:

log.txt https://www.dropbox.com/s/8klj70brmhfp4i5/monitor.csv?dl=0 https://www.dropbox.com/s/2fn05ze4op2z0mn/progress.csv?dl=0

Granted, these are with the following hyperparameters:

Logging to /tmp/openai-2018-09-25-10-59-49-863956
env_type: atari
Training deepq on atari:PongNoFrameskip-v4 with arguments 
{'target_network_update_freq': 1000, 'gamma': 0.99, 'lr': 0.0001, 'dueling': True, 'prioritized_replay_alpha': 0.6, 'checkpoint_freq': 10000, 'learning_starts': 10000, 'train_freq': 4, 'checkpoint_path': None, 'exploration_final_eps': 0.01, 'prioritized_replay': True, 'network': 'conv_only', 'buffer_size': 10000, 'exploration_fraction': 0.1}

and with just 1M time steps by default. I think one needs around 10M steps, for instance, and a larger buffer size.
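
For anyone who wants to tweak these values without going through baselines.run, here is a minimal Python sketch of calling deepq.learn directly with the same keyword arguments shown in the dump above, plus more steps and a bigger buffer. The make_atari/wrap_deepmind env construction, the total_timesteps keyword name, and the specific enlarged values are my assumptions about this version of the API, not something verified against the code:

from baselines import deepq
from baselines.common.atari_wrappers import make_atari, wrap_deepmind

# Single Atari env with DeepMind-style preprocessing (assumption).
env = wrap_deepmind(make_atari('PongNoFrameskip-v4'), frame_stack=True, scale=False)

deepq.learn(
    env,
    network='conv_only',
    lr=1e-4,
    total_timesteps=int(1e7),          # assumed keyword name; ~10M steps instead of the 1M default
    buffer_size=100000,                # 10x the 10k default shown above
    exploration_fraction=0.1,
    exploration_final_eps=0.01,
    train_freq=4,
    learning_starts=10000,
    target_network_update_freq=1000,
    gamma=0.99,
    prioritized_replay=True,
    prioritized_replay_alpha=0.6,
    dueling=True,
)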

Second Attempt

I then tried hyperparameters similar to those I used with an older baselines commit (from roughly a year ago), with which PDD-DQN easily gets at least +20 on Pong.

This is what I next ran with different hyperparameters:

(py3-baselines-sep2018) daniel@takeshi:~/baselines-sandbox$ python -m baselines.run --alg=deepq --env=PongNoFrameskip-v4 --num_timesteps=5e7 --buffer_size=50000 --lr=5e-4
Logging to /tmp/openai-2018-09-25-13-50-13-182778
Logging to /tmp/openai-2018-09-25-13-50-13-223205
env_type: atari
Training deepq on atari:PongNoFrameskip-v4 with arguments 
{'exploration_fraction': 0.1, 'learning_starts': 10000, 'checkpoint_path': None, 'lr': 0.0005, 'target_network_update_freq': 1000, 'dueling': True, 'exploration_final_eps': 0.01, 'train_freq': 4, 'prioritized_replay': True, 'buffer_size': 50000, 'prioritized_replay_alpha': 0.6, 'checkpoint_freq': 10000, 'gamma': 0.99, 'network': 'conv_only'}

The 5e7 time steps and 50k buffer size put it more in line with what I think the older baselines code used (and what the Nature paper may have used).

The following morning (after running for about 12 hours), I noticed that after about 15M steps the scores were still stuck at -21; PDD-DQN still doesn’t seem to learn anything. I killed the script to avoid having to run 35M more steps. Here are the logs I have:

log.txt https://www.dropbox.com/s/qi1f9de0lhnhw7a/monitor.csv?dl=0 https://www.dropbox.com/s/1odyl2reda7ncuy/progress.csv?dl=0

Note that learning seems to collapse: early on we get plenty of -20s and -19s, which I’d expect, but later it’s almost always -21.
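
To see that collapse at a glance, here is a small sketch for plotting per-episode rewards from one of the monitor.csv files above (baselines’ Monitor wrapper writes a JSON comment on the first line, then columns r, l, t); the file name and window size are placeholders:

import pandas as pd
import matplotlib.pyplot as plt

# Skip the '#{...}' metadata line that baselines' Monitor writes at the top.
df = pd.read_csv('monitor.csv', skiprows=1)

# Smooth the per-episode reward 'r' with a 100-episode moving average.
smoothed = df['r'].rolling(window=100).mean()

plt.plot(smoothed)
plt.xlabel('episode')
plt.ylabel('mean episode reward (100-episode window)')
plt.show()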

Observing the Benchmarks

Note that the Atari benchmarks they publish:

http://htmlpreview.github.io/?https://github.com/openai/baselines/blob/master/benchmarks_atari10M.htm

show that DQN gets a score of -7 on Pong, which is really bad but still better than what I am getting here. (They also show Breakout with a score of just 1…) I am not sure what command line arguments they are using for this, but maybe it’s hidden somewhere in the code that generates the benchmarks?

@pzhokhov Since this is a fairly critical issue, is there any chance the README can be adjusted with a message like:

The DQN-based algorithms currently do not get high scores on the Atari games [see GitHub issues XX, YY, etc]. We are currently investigating this and recommend that users instead use [insert working algorithm here, e.g., PPO2].

I think this might help save some time for people hoping to use the DQN-based algorithms. In the meantime I can help try to figure out what the issue is, and I will also keep using my older version of baselines (from a year ago), which has the working DQN algorithms.

2 reactions
pzhokhov commented, Oct 2, 2018

So, nothing in the commit changes jumped out at me as an obvious source of error; however, I narrowed down the commits between which the breaking change happened. So… 2b0283b9 still works, and 24fe3d657 no longer does. The bad news is that all of those commits are mine, so you know whom to blame 😦 The good news is that hopefully I’ll find the bug soon
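
A rough sketch of how such a bisection can be run between those two commits; the short Pong run used as the pass/fail test is only a placeholder, not necessarily the procedure used here:

git bisect start
git bisect bad 24fe3d657     # known-broken commit
git bisect good 2b0283b9     # last known-good commit
# git now checks out a candidate commit; run a short training job,
# judge the reward curve by eye, and mark the commit accordingly:
python -m baselines.run --alg=deepq --env=PongNoFrameskip-v4 --num_timesteps=1e6
git bisect good     # or: git bisect bad
git bisect reset    # when done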


Top Results From Across the Web

Deep Q-Network (DQN)-I. OpenAI Gym Pong and Wrappers
In this post we will introduce how to code a Deep Q-Network for OpenAI Gym Pong Environment.

Project report IE 534 Deep Learning
Implement one of the following deep reinforcement learning papers for Atari Games in PyTorch: (B) “Deep exploration via bootstrapped DQN”, NIPS, ...

Human level control through deep reinforcement learning
... professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. ...

Reinforcement Learning Tips and Tricks - Stable Baselines
Read about RL and Stable Baselines · Do quantitative experiments and hyperparameter tuning if needed · Evaluate the performance using a separate test ...

DQN stuck at suboptimal policy in Atari Pong task
The updated code can also be found here. No fancy changes like Prioritized Replay Buffer or any secret hyperparameter changes are required. ...
