Some concerns over my results and benchmarks for DQN-based agents.
Hi Adam,
Thanks again for this great library. I recently benchmarked DQN and PD-DQN as a sanity check, and I have a few concerns and questions about the results, so I thought I would check in with you. Some of the rewards I get seem substantially lower than I expected (though maybe ReturnAverage reports the clipped reward?). For reference, here are the DQN, D-DQN, and PD-DQN papers:
- DQN: https://web.stanford.edu/class/psych209/Readings/MnihEtAlHassibis15NatureControlDeepRL.pdf
- D-DQN: https://arxiv.org/pdf/1509.06461.pdf
- PD-DQN (prioritized replay): https://arxiv.org/pdf/1511.05952.pdf
I’m mostly using the prioritized replay paper for reporting results, because that one also includes vanilla DQN.
Installation and setup:
I am using this commit: https://github.com/astooke/rlpyt/commit/75e96cda433626868fd2a30058be67b99bbad810
and then made the following changes with git diff:
(rlpyt-astooke) seita@stout:~/rlpyt (master) $ git diff
diff --git a/examples/example_5.py b/examples/example_5.py
index eac85ab..09f42e0 100644
--- a/examples/example_5.py
+++ b/examples/example_5.py
@@ -13,7 +13,7 @@ from rlpyt.samplers.parallel.gpu.collectors import GpuWaitResetCollector
from rlpyt.envs.atari.atari_env import AtariEnv
from rlpyt.algos.dqn.dqn import DQN
from rlpyt.agents.dqn.atari.atari_dqn_agent import AtariDqnAgent
-from rlpyt.runners.minibatch_rl import MinibatchRlEval
+from rlpyt.runners.minibatch_rl import (MinibatchRlEval, MinibatchRl)
from rlpyt.utils.logging.context import logger_context
@@ -38,16 +38,17 @@ def build_and_train(game="pong", run_ID=0, cuda_idx=None, n_parallel=2):
)
algo = DQN(**config["algo"]) # Run with defaults.
agent = AtariDqnAgent()
- runner = MinibatchRlEval(
+ #runner = MinibatchRlEval(
+ runner = MinibatchRl(
algo=algo,
agent=agent,
sampler=sampler,
- n_steps=50e6,
+ n_steps=10e6,
log_interval_steps=1e3,
affinity=dict(cuda_idx=cuda_idx, workers_cpus=list(range(n_parallel))),
)
name = "dqn_" + game
- log_dir = "example_5"
+ log_dir = "example_5_" + game
with logger_context(log_dir, run_ID, name, config):
runner.train()
There is actually one more change: I named my environment rlpyt-astooke rather than rlpyt, since I am making some edits of my own in a different repository. However, you can see above that:
- I am using MinibatchRl, because I want to see online performance; that's the evaluation metric I am most used to, and the one reported in most Deepmind papers, I think.
- I am using 10M steps, not 50M, because I don’t have Deepmind-level compute.
That’s it. Next, I ran these commands:
python examples/example_5.py --n_parallel 4 --game pong --cuda_idx 0 --run_ID 0
python examples/example_5.py --n_parallel 4 --game pong --cuda_idx 1 --run_ID 1
python examples/example_5.py --n_parallel 4 --game breakout --cuda_idx 0 --run_ID 0
python examples/example_5.py --n_parallel 4 --game breakout --cuda_idx 1 --run_ID 1
python examples/example_5.py --n_parallel 4 --game boxing --cuda_idx 0 --run_ID 0
python examples/example_5.py --n_parallel 4 --game boxing --cuda_idx 1 --run_ID 1
python examples/example_5.py --n_parallel 4 --game space_invaders --cuda_idx 0 --run_ID 0
python examples/example_5.py --n_parallel 4 --game space_invaders --cuda_idx 1 --run_ID 1
python examples/example_5.py --n_parallel 4 --game seaquest --cuda_idx 0 --run_ID 0
python examples/example_5.py --n_parallel 4 --game seaquest --cuda_idx 1 --run_ID 1
This gives me two random seeds for each game.
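For completeness, the same grid of runs could be generated by a small launcher script like the one below (my own sketch, not part of rlpyt; in practice I launched the pairs by hand, two at a time across the two GPUs):
# Hypothetical launcher: same grid as the commands above,
# two run_IDs per game, with run_ID doubling as the GPU index.
import subprocess

GAMES = ["pong", "breakout", "boxing", "space_invaders", "seaquest"]

for game in GAMES:
    for run_id in (0, 1):
        cmd = [
            "python", "examples/example_5.py",
            "--n_parallel", "4",
            "--game", game,
            "--cuda_idx", str(run_id),
            "--run_ID", str(run_id),
        ]
        print(" ".join(cmd))
        subprocess.run(cmd, check=True)  # runs sequentially, unlike my manual launches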
I then decided to re-run one of the Breakout commands, this time for 50M steps, but otherwise using the same setting as above.
Next, I also ran some quick benchmarks for PD-DQN. For that, the only change I make is this:
def build_and_train(game="pong", run_ID=0, cuda_idx=None, n_parallel=2):
config = dict(
env=dict(game=game),
- algo=dict(batch_size=128),
+ algo=dict(batch_size=128, double_dqn=True, prioritized_replay=True),
sampler=dict(batch_T=2, batch_B=32),
)
I.e., just add double_dqn=True and prioritized_replay=True, using the default hyperparameters for DQN.
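For context on what prioritized_replay=True enables (independent of how rlpyt implements it), here is a tiny numpy sketch of the sampling rule from the Schaul et al. paper linked above; the helper name is mine, and the alpha/beta values are the paper's proportional-variant settings, not necessarily rlpyt's defaults:
# Sketch of prioritized sampling (Schaul et al., arXiv:1511.05952), not rlpyt code.
# Priorities p_i (e.g., |TD error| + eps) give sampling probabilities
# P(i) = p_i^alpha / sum_k p_k^alpha, and the bias is corrected with importance
# weights w_i = (N * P(i))^(-beta), normalized by the max weight.
import numpy as np

def per_probs_and_weights(td_errors, alpha=0.6, beta=0.4, eps=1e-6):
    p = np.abs(td_errors) + eps
    probs = p ** alpha / np.sum(p ** alpha)
    weights = (len(p) * probs) ** (-beta)
    return probs, weights / weights.max()

probs, weights = per_probs_and_weights(np.array([0.5, 0.1, 2.0, 0.0]))
batch_idx = np.random.choice(len(probs), size=2, p=probs)  # sample a minibatch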
Data and Plots
My DQN data can be found here, with two runs/seeds for each game in each folder. (There are two Breakout folders: the first is 10M steps; the second is 50M steps, with only one run/seed.)
(rlpyt-astooke) seita@stout:~/rlpyt (master) $ ls -lh data/local/20200105/
total 12K
drwxrwxr-x 4 seita seita 4.0K Jan 5 19:05 example_5_boxing
drwxrwxr-x 4 seita seita 4.0K Jan 5 08:04 example_5_breakout
drwxrwxr-x 4 seita seita 4.0K Jan 5 12:29 example_5_pong
(rlpyt-astooke) seita@stout:~/rlpyt (master) $ ls -lh data/local/20200106/
total 12K
drwxrwxr-x 3 seita seita 4.0K Jan 6 18:09 example_5_breakout
drwxrwxr-x 4 seita seita 4.0K Jan 6 09:19 example_5_seaquest
drwxrwxr-x 4 seita seita 4.0K Jan 6 04:51 example_5_space_invaders
(rlpyt-astooke) seita@stout:~/rlpyt (master) $
A similar directory exists for my PD-DQN run with Breakout.
I then use this simple plotting script, which I put at the top level of the repository.
(rlpyt-astooke) seita@stout:~/rlpyt (master) $ cat plot_csv_training.py
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
plt.style.use('seaborn-darkgrid')
import argparse
import csv
import os
import pickle
import numpy as np
import pandas as pd
from os.path import join
from collections import defaultdict
# matplotlib
titlesize = 33
xsize = 30
ysize = 30
ticksize = 25
legendsize = 25
er_alpha = 0.25
def plot_csv(args, exps, plot_wrt_steps=False):
"""Plot from the progress csv file, only if in `col_to_include`.
The `exp` should NOT include the run_0, etc. suffix, because we want to be
able to average across all of those runs (see the averaging sketch after this script).
"""
col_to_include = [
#'CumSteps',
#'CumTime (s)',
'ReturnAverage',
#'StepsPerSecond',
]
col_to_include = sorted(col_to_include)
progfiles = []
dfs = []
for exp in exps:
progfile = join(exp, 'progress.csv')
progfiles.append(progfile)
df = pd.read_csv(progfile, delimiter = ',')
df = df.reindex(sorted(df.columns), axis=1)
print("loaded csv, with shape {}".format(df.shape))
dfs.append(df)
print("including these {} columns: {}".format(len(col_to_include), col_to_include))
# Next, get the plot set up, one row per statistic?
nrows, ncols = len(col_to_include), 1
fig, ax = plt.subplots(nrows, ncols, squeeze=False, sharey='row',
figsize=(13*ncols,4*nrows))
title = os.path.basename(exp)
k = 10  # average the last k points for the legend label
row = 0
for column in dfs[0]:
if column not in col_to_include:
continue
print(row, column)
# Now take the mean and std.
# Or we can just put a bunch of them together.
for df in dfs:
data = df[column].tolist()
label = 'avg last {}, {:.3f}'.format(k, np.mean(data[-k:]))
if plot_wrt_steps:
csteps = np.array(df['CumSteps'].tolist()) / 1e6
ax[row,0].plot(csteps, data, label=label)
ax[row,0].set_xlabel('CumSteps (1e6)', fontsize=xsize)  # steps are on the x-axis, not y
else:
ax[row,0].plot(data, label=label)
ax[row,0].set_title(column, fontsize=titlesize)
ax[row,0].tick_params(axis='x', labelsize=ticksize)
ax[row,0].tick_params(axis='y', labelsize=ticksize)
leg = ax[row,0].legend(loc="best", ncol=1, prop={'size':legendsize})
for legobj in leg.legendHandles:
legobj.set_linewidth(5.0)
row += 1
plt.tight_layout()
# Do some hacks, very specific to the file directory
_head, _tail = os.path.split(exp) # _tail is 'run_1' if we ran two trials, 'run_0' if one, etc.
_, _script = os.path.split(_head) # e.g., example_5_space_invaders
_gamename = (_script).replace('example_5_','') # hacky
print(_head, _gamename)
figname = join(_head,'{}_rlpyt.png'.format(_gamename))
plt.savefig(figname)
print("Just saved: {}\n".format(figname))
if __name__ == "__main__":
pp = argparse.ArgumentParser()
pp.add_argument('path', type=str)
args = pp.parse_args()
assert args.path is not None
exps = sorted(
[join(args.path,x) for x in os.listdir(args.path) if 'run_' in x]
)
print('\nPlotting:')
for exp in exps:
print('\t',exp)
plot_csv(args, exps, plot_wrt_steps=True)
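As an aside, the docstring above mentions averaging across the run_* seeds, but the script currently just overlays the individual runs. A helper like this hypothetical mean_curve (my own sketch, not in the repo; it assumes the progress.csv files logged roughly the same number of rows) could compute the mean curve instead:
# Hypothetical helper: average one column across the run_* folders.
import numpy as np
import pandas as pd
from os.path import join

def mean_curve(exps, column='ReturnAverage'):
    runs = [pd.read_csv(join(exp, 'progress.csv'))[column].to_numpy() for exp in exps]
    n = min(len(r) for r in runs)              # guard against unequal run lengths
    stacked = np.stack([r[:n] for r in runs])
    return stacked.mean(axis=0), stacked.std(axis=0)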
I just run the following commands to plot:
DQN results:
python plot_csv_training.py data/local/20200105/example_5_pong/
python plot_csv_training.py data/local/20200105/example_5_boxing/
python plot_csv_training.py data/local/20200105/example_5_breakout/ (10M steps)
python plot_csv_training.py data/local/20200106/example_5_breakout/ (50M steps)
python plot_csv_training.py data/local/20200106/example_5_space_invaders/
python plot_csv_training.py data/local/20200106/example_5_seaquest/
PD-DQN:
python plot_csv_training.py data/local/20200110/example_5_breakout/
Results
Here are my results.
DQN
Pong:
Breakout, 10M steps:
Breakout, 50M steps:
Boxing:
Seaquest:
Space Invaders:
PD-DQN
Thoughts / Comments
- Pong: seems fine; it gets 20+ quickly, as usual, with no issues. Running 10M steps is fine, no need for 50M.
- Boxing: gets to roughly 80 points, which is actually slightly higher than the values reported in the PD-DQN appendix (roughly 68-75, depending on which value is used). Boxing has a hard limit of 100, I think, and most learning curves show rapid improvement and then stagnation, like in Pong, so running just 10M steps is probably fine.
- Breakout: my biggest concern. The DQN results seem to be stuck below 100, and running 50M steps did not show any improvement; the reward continued to stagnate. Running PD-DQN seems to make things even worse, with rewards stuck at roughly 50 after 10M steps, and it's unclear whether the curves would increase given more time. The PD-DQN paper reports Breakout getting at least 300+ points for a variety of settings, with the lowest value (around 300) coming from vanilla DQN. That is still much better than what I am seeing. Breakout is another game where most learning curves I see show rapid improvement over the first 10M steps and then stagnation from roughly 10M to 50M steps, so I'm not sure running for 50M steps would help. (And in any case, I did one trial with 50M steps.)
- Seaquest: while my curves are still improving at the 10M mark, the reward is much too low; the PD-DQN paper reports values of roughly 10k to 40k. In addition, random performance gets 215.5 reward.
- Space Invaders: stuck at 60 points, whereas PD-DQN gets several thousand, with a value of 9000+ reported for the proportional PER setting. Also, random performance gets 182.6 points…
Some more thoughts:
- My biggest concern here is Breakout. I’m not sure why the ReturnAverage is so low, especially because Breakout is supposed to be perhaps the second-most reliable Atari environment after Pong.
- It could be a hyper-parameter issue, but the main difference, I think, is that Deepmind's results are reported for the serial, single-thread case with 50M steps. Is that really enough to explain such a large gap in scores?
- Is the policy epsilon decayed in a similar manner to the Deepmind papers, i.e., annealed from 1.0 to 0.1 over the first 1 million steps? (And then perhaps decayed further to 0.01? See the sketch after this list.)
- Another possibility is that the logged return isn't the actual return from the environment but the clipped reward. I ran into this problem in other code bases, so I want to check that this is not the issue here. I know that for Boxing the natural rewards have absolute value 1, just like in Pong, but the other games have larger rewards that need to be clipped.
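For reference, here is the epsilon schedule I have in mind from the DQN papers, as a tiny sketch (my own helper, not rlpyt code; whether rlpyt uses these exact breakpoints is precisely my question):
# Linear epsilon annealing as described in the Nature DQN paper: 1.0 -> 0.1
# over the first 1M steps, then held. (Some follow-ups decay further, e.g.
# toward 0.01; that further decay is the part I am unsure about here.)
def dqn_epsilon(step, anneal_steps=int(1e6), eps_start=1.0, eps_end=0.1):
    frac = min(1.0, step / anneal_steps)
    return eps_start + frac * (eps_end - eps_start)

for s in (0, int(5e5), int(1e6), int(10e6)):
    print(s, round(dqn_epsilon(s), 3))  # 1.0, 0.55, 0.1, 0.1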
I am going to be running some more benchmarks, so stay tuned for additional results …
Top GitHub Comments
Hi @astooke, there is some better news now. I ran the experiment again with one change: using AtariTrajInfo as the TrajInfoCls. I ran it once for 10M steps (run_ID 0) and once for 50M steps (run_ID 1; the 50M step setting is what Deepmind usually uses). My results are in the data/local/20200117 folder.
I ran this command:
Here is the result:
I think we can confidently proceed with the DQN-based algorithms on Atari. I'm happy to close this issue, but I strongly recommend changing all the Atari-based examples to use the correct TrajInfoCls; it doesn't seem like anyone would want to use the default TrajInfo class.
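Concretely, the change I have in mind for the examples is roughly this (a sketch only, following the pattern of the atari_dqn_cpu.py script linked in the next comment; the remaining sampler kwargs mirror example_5, and I have not re-verified the full sampler signature here):
from rlpyt.samplers.parallel.gpu.sampler import GpuSampler
from rlpyt.samplers.parallel.gpu.collectors import GpuWaitResetCollector
from rlpyt.envs.atari.atari_env import AtariEnv, AtariTrajInfo

# Sketch of the one-line change: pass the Atari-aware TrajInfoCls so the logger
# reports the raw game score rather than only the clipped training return.
sampler = GpuSampler(
    EnvCls=AtariEnv,
    env_kwargs=dict(game="breakout"),
    CollectorCls=GpuWaitResetCollector,
    TrajInfoCls=AtariTrajInfo,
    batch_T=2,
    batch_B=32,
)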
@astooke I think I finally figured out a simpler way. Here's the example_5 script, which is what I was using above:
https://github.com/astooke/rlpyt/blob/75e96cda433626868fd2a30058be67b99bbad810/examples/example_5.py#L26-L38
and here’s a script deeper into the file hierarchy that explicitly defines the TrajInfoCls!
https://github.com/astooke/rlpyt/blob/75e96cda433626868fd2a30058be67b99bbad810/rlpyt/experiments/scripts/atari/dqn/train/atari_dqn_cpu.py#L24-L32
When I ran something like that, I got a GameScore reported in the logger. Let me run this to completion and see how that works.
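For anyone skimming this later, here is my understanding of why ReturnAverage looked low, as a minimal toy illustration (my own code, not rlpyt's actual TrajInfo classes):
import numpy as np

# Toy illustration only. A default-style trajectory info accumulates the reward
# the agent trains on, which for Atari is clipped to {-1, 0, +1}, while an
# Atari-aware version also accumulates the raw game score (what the logger
# shows as GameScore when AtariTrajInfo is used).
class ClippedReturnInfo:
    def __init__(self):
        self.Return = 0.0
    def step(self, raw_reward):
        self.Return += float(np.sign(raw_reward))

class GameScoreInfo(ClippedReturnInfo):
    def __init__(self):
        super().__init__()
        self.GameScore = 0.0
    def step(self, raw_reward):
        super().step(raw_reward)
        self.GameScore += raw_reward

info = GameScoreInfo()
for r in [0.0, 200.0, 0.0, 400.0]:   # Seaquest-style raw rewards
    info.step(r)
print(info.Return, info.GameScore)   # 2.0 vs. 600.0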