`evaluate_policy` reports preprocessed reward, whereas `rollout/ep_rew_mean` is unprocessed
**Describe the bug**
There are two kinds of reward that it might make sense to report:
- The unwrapped reward. This is computed by the original environment. It defines the objective of the RL task, and is what would usually be used to compute the policy return when e.g. benchmarking the resulting policy for papers.
- The wrapped reward. This is the reward the RL algorithm actually uses for training, after any reward-modifying environment wrappers such as `VecNormalize` or `AtariPreprocessing` have been applied. We hope the arg max of this is the same as the arg max of the original environment reward, but it need not be. Even if it is, the two can differ substantially, e.g. by rescaling with a positive constant or by added shaping terms (see the sketch below).
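A minimal sketch of the distinction, assuming SB3's `AtariWrapper` and the classic 4-tuple `gym` step API (the environment id is only an example; nothing here is taken from the library's internals):

```python
import gym
from stable_baselines3.common.atari_wrappers import AtariWrapper

# Unwrapped: gym.make alone reports the original game score as reward.
# Wrapped: after AtariWrapper, rewards are clipped to {-1, 0, +1} and an
# episode ends on every life loss (the defaults), which is what DQN trains on.
train_env = AtariWrapper(gym.make("SeaquestNoFrameskip-v4"))

train_env.reset()
_, wrapped_reward, _, _ = train_env.step(train_env.action_space.sample())
assert wrapped_reward in (-1.0, 0.0, 1.0)  # never the raw game score
```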
Currently, `common.evaluation.evaluate_policy` reports the wrapped reward. Most (all?) RL algorithms report the unwrapped reward, e.g. `OffPolicyAlgorithm._dump_logs` in `rollout/ep_rew_mean`. The difference is that `common.evaluation.evaluate_policy` directly records the reward and computes statistics, whereas the RL algorithms rely on the `"r"` and `"l"` keys inserted into the info-dict by `common.Monitor`, which is usually the first wrapper applied to the environment (before any preprocessing); a sketch of this mechanism is below.
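A rough sketch of why the two numbers diverge, again assuming the 4-tuple `gym` step API; the loop below is illustrative and is not the actual `evaluate_policy` or logging code:

```python
import gym
from stable_baselines3.common.atari_wrappers import AtariWrapper
from stable_baselines3.common.monitor import Monitor

# Monitor is innermost (applied before preprocessing), as in the Zoo.
env = AtariWrapper(Monitor(gym.make("SeaquestNoFrameskip-v4")))

obs = env.reset()
wrapped_return, info = 0.0, {}
while "episode" not in info:                # run until the *real* game over
    obs, reward, done, info = env.step(env.action_space.sample())
    wrapped_return += reward                # what evaluate_policy sums: clipped
    if done and "episode" not in info:      # wrapped "done" is often just a lost life
        obs = env.reset()

# Monitor saw the raw rewards and the true episode boundary; its "r"/"l" keys
# are what rollout/ep_rew_mean and ep_len_mean are built from.
print("unwrapped return:", info["episode"]["r"])
print("wrapped (clipped) return:", wrapped_return)
```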
In my opinion we would ideally report both types of reward (and episode length, since wrappers can also affect `done`), in both the training and evaluation environment. But if we had to pick just one, I’d advocate for swapping the two: report the reward used for training during training, and the reward usually used for evaluation for the evaluation environment.
Credit to @ejmichaud for first noticing this discrepancy.
**Code example** This is visible in `train.py` from the Zoo on Atari:
```
$ python train.py --algo dqn --env SeaquestNoFrameskip-v4
# [ ... elided ... ]
Eval num_timesteps=10000, episode_reward=52.00 +/- 48.33
Episode length: 527.20 +/- 139.00
New best mean reward!
----------------------------------
| eval/              |          |
|    mean_ep_length  | 527      |
|    mean_reward     | 52       |
| rollout/           |          |
|    ep_len_mean     | 2.16e+03 |
|    ep_rew_mean     | 87       |
|    exploration_rate| 0.989    |
| time/              |          |
|    episodes        | 20       |
|    fps             | 717      |
|    time_elapsed    | 14       |
|    total_timesteps | 10698    |
----------------------------------
```
Note that not only do `eval/` and `rollout/` disagree on reward per timestep (because `AtariPreprocessing` does reward clipping), they also disagree on the episode length (because `AtariPreprocessing` converts loss of a life into the end of an episode)! See seaquest.log for the full log.
Note `train.py` does do some magic to disable reward normalization in `VecNormalize`, so I don’t think this problem will be visible in e.g. MuJoCo. Conceivably one could do something similar for `AtariWrapper` (set `terminal_on_life_loss=False` and `clip_reward=False`, as sketched below), but doing this for every wrapper seems error-prone, and some wrappers may just not support it out of the box.
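For Atari specifically, a hedged sketch of what such an evaluation env could look like (this is not what `train.py` currently does; the flags are just the two mentioned above):

```python
import gym
from stable_baselines3.common.atari_wrappers import AtariWrapper
from stable_baselines3.common.monitor import Monitor

# Evaluation env whose wrapper leaves reward and episode boundaries untouched,
# so evaluate_policy would report the unwrapped reward and true episode length.
eval_env = AtariWrapper(
    Monitor(gym.make("SeaquestNoFrameskip-v4")),
    terminal_on_life_loss=False,  # keep the true episode boundary
    clip_reward=False,            # keep the original game score
)
```

The observation preprocessing (frame skip, warping) is kept so the policy still receives inputs in the format it was trained on; only the reward and termination behaviour change.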
**Top GitHub Comments**
- yes, it is not possible. And wrapping with a `VecMonitor` won’t solve the issue of reward clipped by lower-level wrappers. But I thought we agreed on “So yes, defaulting to unnormalized reward sounds good and warning the user if the env is not wrapped properly. This would also require some update to the documentation.”
- we cannot, as many envs rely on vec env wrappers like `VecNormalize`, `VecFrameStack` (atari), …
- @araffin Yup I will! Start of the week was super-busy with urgent deadlines but today I will return to this ^^’
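For context, a sketch of the placement issue these comments describe (the env id and wrapper stack are illustrative, not taken from the issue):

```python
import gym
from stable_baselines3.common.vec_env import DummyVecEnv, VecMonitor, VecNormalize

venv = DummyVecEnv([lambda: gym.make("Pendulum-v1")])
venv = VecNormalize(venv)   # rescales rewards for training
venv = VecMonitor(venv)     # outermost wrapper: it can only record the
                            # already-normalized reward, not the original one

# By contrast, a per-env common.Monitor applied before any preprocessing
# (as in the sketches above) records the unwrapped reward in info["episode"].
```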