`evaluate_policy` reports preprocessed reward, whereas `rollout/ep_rew_mean` is unprocessed
**Describe the bug**
There are two kinds of reward that it might make sense to report:
- The unwrapped reward. This is computed by the original environment. It defines the objective of the RL task, and is what would usually be used to compute the policy return when e.g. benchmarking the resulting policy for papers.
- The wrapped reward. This is the reward the RL algorithm actually uses for training, after any reward-modifying environment wrappers such as `VecNormalize` or `AtariPreprocessing` have been applied. We hope the arg max of this is the same as the arg max of the original environment reward, but it need not be. Even if it is, the two can differ substantially, e.g. by rescaling with a positive constant or by added shaping terms (see the sketch below).
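A minimal sketch of the distinction, assuming SB3's `AtariWrapper` and the classic 4-tuple `gym` step API (the environment id is only an example; nothing here is taken from the library's internals):

```python
import gym
from stable_baselines3.common.atari_wrappers import AtariWrapper

# Unwrapped: gym.make alone reports the original game score as reward.
# Wrapped: after AtariWrapper, rewards are clipped to {-1, 0, +1} and an
# episode ends on every life loss (the defaults), which is what DQN trains on.
train_env = AtariWrapper(gym.make("SeaquestNoFrameskip-v4"))

train_env.reset()
_, wrapped_reward, _, _ = train_env.step(train_env.action_space.sample())
assert wrapped_reward in (-1.0, 0.0, 1.0)  # never the raw game score
```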
Currently, `common.evaluation.evaluate_policy` reports the wrapped reward. Most (all?) RL algorithms report the unwrapped reward, e.g. `OffPolicyAlgorithm._dump_logs` in `rollout/ep_rew_mean`. The difference is that `common.evaluation.evaluate_policy` directly records the reward and computes statistics, whereas the RL algorithms rely on the `"r"` and `"l"` keys inserted into the info-dict by `common.Monitor`, which is usually the first wrapper applied to the environment (before any preprocessing); a sketch of this mechanism is below.
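A rough sketch of why the two numbers diverge, again assuming the 4-tuple `gym` step API; the loop below is illustrative and is not the actual `evaluate_policy` or logging code:

```python
import gym
from stable_baselines3.common.atari_wrappers import AtariWrapper
from stable_baselines3.common.monitor import Monitor

# Monitor is innermost (applied before preprocessing), as in the Zoo.
env = AtariWrapper(Monitor(gym.make("SeaquestNoFrameskip-v4")))

obs = env.reset()
wrapped_return, info = 0.0, {}
while "episode" not in info:                # run until the *real* game over
    obs, reward, done, info = env.step(env.action_space.sample())
    wrapped_return += reward                # what evaluate_policy sums: clipped
    if done and "episode" not in info:      # wrapped "done" is often just a lost life
        obs = env.reset()

# Monitor saw the raw rewards and the true episode boundary; its "r"/"l" keys
# are what rollout/ep_rew_mean and ep_len_mean are built from.
print("unwrapped return:", info["episode"]["r"])
print("wrapped (clipped) return:", wrapped_return)
```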
In my opinion we would ideally report both types of reward (and episode length, since wrappers can also affect `done`), in both the training and evaluation environment. But if we had to pick just one, I’d advocate for swapping the two: report the reward used for training during training, and the reward usually used for evaluation for the evaluation environment.
Credit to @ejmichaud for first noticing this discrepancy.
**Code example** This is visible in `train.py` from the Zoo on Atari:
```
$ python train.py --algo dqn --env SeaquestNoFrameskip-v4
# [ ... elided ... ]
Eval num_timesteps=10000, episode_reward=52.00 +/- 48.33
Episode length: 527.20 +/- 139.00
New best mean reward!
----------------------------------
| eval/              |          |
|    mean_ep_length  | 527      |
|    mean_reward     | 52       |
| rollout/           |          |
|    ep_len_mean     | 2.16e+03 |
|    ep_rew_mean     | 87       |
|    exploration_rate| 0.989    |
| time/              |          |
|    episodes        | 20       |
|    fps             | 717      |
|    time_elapsed    | 14       |
|    total_timesteps | 10698    |
----------------------------------
```
Note that not only do `eval/` and `rollout/` disagree on reward per timestep (because `AtariPreprocessing` does reward clipping), they also disagree on the episode length (because `AtariPreprocessing` converts loss of a life into the end of an episode)! See seaquest.log for the full log.
Note `train.py` does do some magic to disable reward normalization in `VecNormalize`, so I don’t think this problem will be visible in e.g. MuJoCo. Conceivably one could do something similar for `AtariWrapper` (set `terminal_on_life_loss=False` and `clip_reward=False`, as sketched below), but doing this for every wrapper seems error-prone, and some wrappers may just not support it out of the box.
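For Atari specifically, a hedged sketch of what such an evaluation env could look like (this is not what `train.py` currently does; the flags are just the two mentioned above):

```python
import gym
from stable_baselines3.common.atari_wrappers import AtariWrapper
from stable_baselines3.common.monitor import Monitor

# Evaluation env whose wrapper leaves reward and episode boundaries untouched,
# so evaluate_policy would report the unwrapped reward and true episode length.
eval_env = AtariWrapper(
    Monitor(gym.make("SeaquestNoFrameskip-v4")),
    terminal_on_life_loss=False,  # keep the true episode boundary
    clip_reward=False,            # keep the original game score
)
```

The observation preprocessing (frame skip, warping) is kept so the policy still receives inputs in the format it was trained on; only the reward and termination behaviour change.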
**Top GitHub Comments**
- yes, it is not possible. And wrapping with a `VecMonitor` won’t solve the issue of reward clipped by lower-level wrappers. But I thought we agreed on “So yes, defaulting to unnormalized reward sounds good and warning the user if the env is not wrapped properly. This would also require some update to the documentation.”
- we cannot, as many envs rely on vec env wrappers like `VecNormalize`, `VecFrameStack` (atari), …
- @araffin Yup I will! Start of the week was super-busy with urgent deadlines but today I will return to this ^^’
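For context, a sketch of the placement issue these comments describe (the env id and wrapper stack are illustrative, not taken from the issue):

```python
import gym
from stable_baselines3.common.vec_env import DummyVecEnv, VecMonitor, VecNormalize

venv = DummyVecEnv([lambda: gym.make("Pendulum-v1")])
venv = VecNormalize(venv)   # rescales rewards for training
venv = VecMonitor(venv)     # outermost wrapper: it can only record the
                            # already-normalized reward, not the original one

# By contrast, a per-env common.Monitor applied before any preprocessing
# (as in the sketches above) records the unwrapped reward in info["episode"].
```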