Some concerns over my results and benchmarks for DQN-based agents.
Hi Adam,
Thanks again for this great library. I recently benchmarked DQN and PD-DQN as a sanity check, and I have a few concerns and questions about the results, so I thought I would check in with you. Some of the rewards I get seem substantially lower than I expected (though maybe ReturnAverage reports the clipped reward?). For reference, here are the DQN, D-DQN, and PD-DQN papers:
- DQN: https://web.stanford.edu/class/psych209/Readings/MnihEtAlHassibis15NatureControlDeepRL.pdf
- D-DQN: https://arxiv.org/pdf/1509.06461.pdf
- PD-DQN (prioritized replay): https://arxiv.org/pdf/1511.05952.pdf
I’m mostly using the prioritized replay paper for reporting results, because that one also includes vanilla DQN.
Installation and setup:
I am using this commit: https://github.com/astooke/rlpyt/commit/75e96cda433626868fd2a30058be67b99bbad810
and then made the following changes with git diff:
(rlpyt-astooke) seita@stout:~/rlpyt (master) $ git diff
diff --git a/examples/example_5.py b/examples/example_5.py
index eac85ab..09f42e0 100644
--- a/examples/example_5.py
+++ b/examples/example_5.py
@@ -13,7 +13,7 @@ from rlpyt.samplers.parallel.gpu.collectors import GpuWaitResetCollector
from rlpyt.envs.atari.atari_env import AtariEnv
from rlpyt.algos.dqn.dqn import DQN
from rlpyt.agents.dqn.atari.atari_dqn_agent import AtariDqnAgent
-from rlpyt.runners.minibatch_rl import MinibatchRlEval
+from rlpyt.runners.minibatch_rl import (MinibatchRlEval, MinibatchRl)
from rlpyt.utils.logging.context import logger_context
@@ -38,16 +38,17 @@ def build_and_train(game="pong", run_ID=0, cuda_idx=None, n_parallel=2):
)
algo = DQN(**config["algo"]) # Run with defaults.
agent = AtariDqnAgent()
- runner = MinibatchRlEval(
+ #runner = MinibatchRlEval(
+ runner = MinibatchRl(
algo=algo,
agent=agent,
sampler=sampler,
- n_steps=50e6,
+ n_steps=10e6,
log_interval_steps=1e3,
affinity=dict(cuda_idx=cuda_idx, workers_cpus=list(range(n_parallel))),
)
name = "dqn_" + game
- log_dir = "example_5"
+ log_dir = "example_5_" + game
with logger_context(log_dir, run_ID, name, config):
runner.train()
There is actually one more change: I named my environment rlpyt-astooke rather than rlpyt, since I am making some edits of my own in a different repository. However, you can see above that:
- I am using MinibatchRl, because I want to see online performance; that's the evaluation metric I am most used to, and the one reported in most Deepmind papers, I think.
- I am using 10M steps, not 50M, because I don’t have Deepmind-level compute.
That’s it. Next, I ran these commands:
python examples/example_5.py --n_parallel 4 --game pong --cuda_idx 0 --run_ID 0
python examples/example_5.py --n_parallel 4 --game pong --cuda_idx 1 --run_ID 1
python examples/example_5.py --n_parallel 4 --game breakout --cuda_idx 0 --run_ID 0
python examples/example_5.py --n_parallel 4 --game breakout --cuda_idx 1 --run_ID 1
python examples/example_5.py --n_parallel 4 --game boxing --cuda_idx 0 --run_ID 0
python examples/example_5.py --n_parallel 4 --game boxing --cuda_idx 1 --run_ID 1
python examples/example_5.py --n_parallel 4 --game space_invaders --cuda_idx 0 --run_ID 0
python examples/example_5.py --n_parallel 4 --game space_invaders --cuda_idx 1 --run_ID 1
python examples/example_5.py --n_parallel 4 --game seaquest --cuda_idx 0 --run_ID 0
python examples/example_5.py --n_parallel 4 --game seaquest --cuda_idx 1 --run_ID 1
This gives me two random seeds for each game.
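For completeness, the same grid of runs could be generated by a small launcher script like the one below (my own sketch, not part of rlpyt; in practice I launched the pairs by hand, two at a time across the two GPUs):
# Hypothetical launcher: same grid as the commands above,
# two run_IDs per game, with run_ID doubling as the GPU index.
import subprocess

GAMES = ["pong", "breakout", "boxing", "space_invaders", "seaquest"]

for game in GAMES:
    for run_id in (0, 1):
        cmd = [
            "python", "examples/example_5.py",
            "--n_parallel", "4",
            "--game", game,
            "--cuda_idx", str(run_id),
            "--run_ID", str(run_id),
        ]
        print(" ".join(cmd))
        subprocess.run(cmd, check=True)  # runs sequentially, unlike my manual launches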
I then decided to re-run one of the Breakout commands, this time for 50M steps, but otherwise using the same setting as above.
Next, I also ran some quick benchmarks for PD-DQN. For that, the only change I make is this:
def build_and_train(game="pong", run_ID=0, cuda_idx=None, n_parallel=2):
config = dict(
env=dict(game=game),
- algo=dict(batch_size=128),
+ algo=dict(batch_size=128, double_dqn=True, prioritized_replay=True),
sampler=dict(batch_T=2, batch_B=32),
)
I.e., just add double_dqn=True and prioritized_replay=True, using the default hyperparameters for DQN.
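For context on what prioritized_replay=True enables (independent of how rlpyt implements it), here is a tiny numpy sketch of the sampling rule from the Schaul et al. paper linked above; the helper name is mine, and the alpha/beta values are the paper's proportional-variant settings, not necessarily rlpyt's defaults:
# Sketch of prioritized sampling (Schaul et al., arXiv:1511.05952), not rlpyt code.
# Priorities p_i (e.g., |TD error| + eps) give sampling probabilities
# P(i) = p_i^alpha / sum_k p_k^alpha, and the bias is corrected with importance
# weights w_i = (N * P(i))^(-beta), normalized by the max weight.
import numpy as np

def per_probs_and_weights(td_errors, alpha=0.6, beta=0.4, eps=1e-6):
    p = np.abs(td_errors) + eps
    probs = p ** alpha / np.sum(p ** alpha)
    weights = (len(p) * probs) ** (-beta)
    return probs, weights / weights.max()

probs, weights = per_probs_and_weights(np.array([0.5, 0.1, 2.0, 0.0]))
batch_idx = np.random.choice(len(probs), size=2, p=probs)  # sample a minibatch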
Data and Plots
My DQN data can be found here, with two runs/seeds for each game in each folder. (There are two Breakout folders: the first is 10M steps; the second is 50M steps, with only one run/seed.)
(rlpyt-astooke) seita@stout:~/rlpyt (master) $ ls -lh data/local/20200105/
total 12K
drwxrwxr-x 4 seita seita 4.0K Jan 5 19:05 example_5_boxing
drwxrwxr-x 4 seita seita 4.0K Jan 5 08:04 example_5_breakout
drwxrwxr-x 4 seita seita 4.0K Jan 5 12:29 example_5_pong
(rlpyt-astooke) seita@stout:~/rlpyt (master) $ ls -lh data/local/20200106/
total 12K
drwxrwxr-x 3 seita seita 4.0K Jan 6 18:09 example_5_breakout
drwxrwxr-x 4 seita seita 4.0K Jan 6 09:19 example_5_seaquest
drwxrwxr-x 4 seita seita 4.0K Jan 6 04:51 example_5_space_invaders
(rlpyt-astooke) seita@stout:~/rlpyt (master) $
A similar directory exists for my PD-DQN run with Breakout.
I then use this simple plotting script, which I put at the top level of the repository.
(rlpyt-astooke) seita@stout:~/rlpyt (master) $ cat plot_csv_training.py
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
plt.style.use('seaborn-darkgrid')
import argparse
import csv
import os
import pickle
import numpy as np
import pandas as pd
from os.path import join
from collections import defaultdict
# matplotlib
titlesize = 33
xsize = 30
ysize = 30
ticksize = 25
legendsize = 25
er_alpha = 0.25
def plot_csv(args, exps, plot_wrt_steps=False):
"""Plot from the progress csv file, only if in `col_to_include`.
The `exp` should NOT include the run_0, etc. suffix, because we want to be
able to average across all of those runs (see the averaging sketch after this script).
"""
col_to_include = [
#'CumSteps',
#'CumTime (s)',
'ReturnAverage',
#'StepsPerSecond',
]
col_to_include = sorted(col_to_include)
progfiles = []
dfs = []
for exp in exps:
progfile = join(exp, 'progress.csv')
progfiles.append(progfile)
df = pd.read_csv(progfile, delimiter = ',')
df = df.reindex(sorted(df.columns), axis=1)
print("loaded csv, with shape {}".format(df.shape))
dfs.append(df)
print("including these {} columns: {}".format(len(col_to_include), col_to_include))
# Next, get the plot set up, one row per statistic?
nrows, ncols = len(col_to_include), 1
fig, ax = plt.subplots(nrows, ncols, squeeze=False, sharey='row',
figsize=(13*ncols,4*nrows))
title = os.path.basename(exp)
k = 10  # average the last k points for the legend label
row = 0
for column in dfs[0]:
if column not in col_to_include:
continue
print(row, column)
# Now take the mean and std.
# Or we can just put a bunch of them together.
for df in dfs:
data = df[column].tolist()
label = 'avg last {}, {:.3f}'.format(k, np.mean(data[-k:]))
if plot_wrt_steps:
csteps = np.array(df['CumSteps'].tolist()) / 1e6
ax[row,0].plot(csteps, data, label=label)
ax[row,0].set_xlabel('CumSteps (1e6)', fontsize=xsize)  # steps are on the x-axis, not y
else:
ax[row,0].plot(data, label=label)
ax[row,0].set_title(column, fontsize=titlesize)
ax[row,0].tick_params(axis='x', labelsize=ticksize)
ax[row,0].tick_params(axis='y', labelsize=ticksize)
leg = ax[row,0].legend(loc="best", ncol=1, prop={'size':legendsize})
for legobj in leg.legendHandles:
legobj.set_linewidth(5.0)
row += 1
plt.tight_layout()
# Do some hacks, very specific to the file directory
_head, _tail = os.path.split(exp) # _tail is 'run_1' if we ran two trials, 'run_0' if one, etc.
_, _script = os.path.split(_head) # e.g., example_5_space_invaders
_gamename = (_script).replace('example_5_','') # hacky
print(_head, _gamename)
figname = join(_head,'{}_rlpyt.png'.format(_gamename))
plt.savefig(figname)
print("Just saved: {}\n".format(figname))
if __name__ == "__main__":
pp = argparse.ArgumentParser()
pp.add_argument('path', type=str)
args = pp.parse_args()
assert args.path is not None
exps = sorted(
[join(args.path,x) for x in os.listdir(args.path) if 'run_' in x]
)
print('\nPlotting:')
for exp in exps:
print('\t',exp)
plot_csv(args, exps, plot_wrt_steps=True)
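As an aside, the docstring above mentions averaging across the run_* seeds, but the script currently just overlays the individual runs. A helper like this hypothetical mean_curve (my own sketch, not in the repo; it assumes the progress.csv files logged roughly the same number of rows) could compute the mean curve instead:
# Hypothetical helper: average one column across the run_* folders.
import numpy as np
import pandas as pd
from os.path import join

def mean_curve(exps, column='ReturnAverage'):
    runs = [pd.read_csv(join(exp, 'progress.csv'))[column].to_numpy() for exp in exps]
    n = min(len(r) for r in runs)              # guard against unequal run lengths
    stacked = np.stack([r[:n] for r in runs])
    return stacked.mean(axis=0), stacked.std(axis=0)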
I just run the following commands to plot:
DQN results:
python plot_csv_training.py data/local/20200105/example_5_pong/
python plot_csv_training.py data/local/20200105/example_5_boxing/
python plot_csv_training.py data/local/20200105/example_5_breakout/ (10M steps)
python plot_csv_training.py data/local/20200106/example_5_breakout/ (50M steps)
python plot_csv_training.py data/local/20200106/example_5_space_invaders/
python plot_csv_training.py data/local/20200106/example_5_seaquest/
PD-DQN:
python plot_csv_training.py data/local/20200110/example_5_breakout/
Results
Here are my results.
DQN
Pong:
Breakout, 10M steps:
Breakout, 50M steps:
Boxing:
Seaquest:
Space Invaders:
PD-DQN
Thoughts / Comments
- Pong: seems fine; it gets 20+ quickly, as usual, with no issues. Running 10M steps is fine, no need for 50M.
- Boxing: gets to roughly 80 points, which is actually slightly higher than the values reported in the PD-DQN appendix (roughly 68-75, depending on which value is used). Boxing has a hard limit of 100, I think, and most learning curves show rapid improvement and then stagnation, like in Pong, so running just 10M steps is probably fine.
- Breakout: my biggest concern. The DQN results seem to be stuck below 100, and running 50M steps did not show any improvement; the reward continued to stagnate. Running PD-DQN seems to make things even worse, with rewards stuck at roughly 50 after 10M steps, and it's unclear whether the curves would increase given more time. The PD-DQN paper reports Breakout getting at least 300+ points for a variety of settings, with the lowest value (around 300) coming from vanilla DQN. That is still much better than what I am seeing. Breakout is another game where most learning curves I see show rapid improvement over the first 10M steps and then stagnation from roughly 10M to 50M steps, so I'm not sure running for 50M steps would help. (And in any case, I did one trial with 50M steps.)
- Seaquest: while my curves are still improving at the 10M mark, the reward is much too low; the PD-DQN paper reports values of roughly 10k to 40k. In addition, random performance gets 215.5 reward.
- Space Invaders: stuck at 60 points, whereas PD-DQN gets several thousand, with a value of 9000+ reported for the proportional PER setting. Also, random performance gets 182.6 points…
Some more thoughts:
- My biggest concern here is Breakout. I’m not sure why the ReturnAverage is so low, especially because Breakout is supposed to be perhaps the second-most reliable Atari environment after Pong.
- It could be a hyper-parameter issue, but the main difference, I think, is that Deepmind's results are reported for the serial, single-thread case with 50M steps. Is that really enough to explain such a large gap in scores?
- Is the policy epsilon decayed in a similar manner to the Deepmind papers, i.e., annealed from 1.0 to 0.1 over the first 1 million steps? (And then perhaps decayed further to 0.01? See the sketch after this list.)
- Another possibility is that the logged return isn't the actual return from the environment but the clipped reward. I ran into this problem in other code bases, so I want to check that this is not the issue here. I know that for Boxing the natural rewards have absolute value 1, just like in Pong, but the other games have larger rewards that need to be clipped.
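For reference, here is the epsilon schedule I have in mind from the DQN papers, as a tiny sketch (my own helper, not rlpyt code; whether rlpyt uses these exact breakpoints is precisely my question):
# Linear epsilon annealing as described in the Nature DQN paper: 1.0 -> 0.1
# over the first 1M steps, then held. (Some follow-ups decay further, e.g.
# toward 0.01; that further decay is the part I am unsure about here.)
def dqn_epsilon(step, anneal_steps=int(1e6), eps_start=1.0, eps_end=0.1):
    frac = min(1.0, step / anneal_steps)
    return eps_start + frac * (eps_end - eps_start)

for s in (0, int(5e5), int(1e6), int(10e6)):
    print(s, round(dqn_epsilon(s), 3))  # 1.0, 0.55, 0.1, 0.1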
I am going to be running some more benchmarks, so stay tuned for additional results …
Top GitHub Comments
Hi @astooke, there is some better news now. I ran the experiment again with one change: using AtariTrajInfo as the TrajInfoCls. I ran it once for 10M steps (run_ID 0) and once for 50M steps (run_ID 1; the 50M step setting is what Deepmind usually uses). My results are in the data/local/20200117 folder.
I ran this command:
Here is the result:
I think we can confidently proceed with the DQN-based algorithms on Atari. I'm happy to close this issue, but I strongly recommend changing all the Atari-based examples to use the correct TrajInfoCls; it doesn't seem like anyone would want to use the default TrajInfo class.
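Concretely, the change I have in mind for the examples is roughly this (a sketch only, following the pattern of the atari_dqn_cpu.py script linked in the next comment; the remaining sampler kwargs mirror example_5, and I have not re-verified the full sampler signature here):
from rlpyt.samplers.parallel.gpu.sampler import GpuSampler
from rlpyt.samplers.parallel.gpu.collectors import GpuWaitResetCollector
from rlpyt.envs.atari.atari_env import AtariEnv, AtariTrajInfo

# Sketch of the one-line change: pass the Atari-aware TrajInfoCls so the logger
# reports the raw game score rather than only the clipped training return.
sampler = GpuSampler(
    EnvCls=AtariEnv,
    env_kwargs=dict(game="breakout"),
    CollectorCls=GpuWaitResetCollector,
    TrajInfoCls=AtariTrajInfo,
    batch_T=2,
    batch_B=32,
)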
@astooke I think I finally figured out a simpler way. Here's the example_5 script, which is what I was using above:
https://github.com/astooke/rlpyt/blob/75e96cda433626868fd2a30058be67b99bbad810/examples/example_5.py#L26-L38
and here’s a script deeper into the file hierarchy that explicitly defines the TrajInfoCls!
https://github.com/astooke/rlpyt/blob/75e96cda433626868fd2a30058be67b99bbad810/rlpyt/experiments/scripts/atari/dqn/train/atari_dqn_cpu.py#L24-L32
When I ran something like that, I got a GameScore reported in the logger. Let me run this to completion and see how that works.
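For anyone skimming this later, here is my understanding of why ReturnAverage looked low, as a minimal toy illustration (my own code, not rlpyt's actual TrajInfo classes):
import numpy as np

# Toy illustration only. A default-style trajectory info accumulates the reward
# the agent trains on, which for Atari is clipped to {-1, 0, +1}, while an
# Atari-aware version also accumulates the raw game score (what the logger
# shows as GameScore when AtariTrajInfo is used).
class ClippedReturnInfo:
    def __init__(self):
        self.Return = 0.0
    def step(self, raw_reward):
        self.Return += float(np.sign(raw_reward))

class GameScoreInfo(ClippedReturnInfo):
    def __init__(self):
        super().__init__()
        self.GameScore = 0.0
    def step(self, raw_reward):
        super().step(raw_reward)
        self.GameScore += raw_reward

info = GameScoreInfo()
for r in [0.0, 200.0, 0.0, 400.0]:   # Seaquest-style raw rewards
    info.step(r)
print(info.Return, info.GameScore)   # 2.0 vs. 600.0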