Gamma in VecNormalize for rms updates.
I have observed that the VecNormalize class updates the reward running mean statistics with what looks to me like a discounted reward:
def step_wait(self):
    obs, rews, news, infos = self.venv.step_wait()
    self.ret = self.ret * self.gamma + rews  # here
    obs = self._obfilt(obs)
    if self.ret_rms:
        self.ret_rms.update(self.ret)
        rews = np.clip(rews / np.sqrt(self.ret_rms.var + self.epsilon), -self.cliprew, self.cliprew)
    return obs, rews, news, infos
I can’t see why this helps at all; I would use rews directly in the rms update. Am I missing something here?
Thanks.
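For reference: starting from self.ret = 0, the line marked "here" unrolls to

ret_t = r_t + gamma * r_{t-1} + gamma^2 * r_{t-2} + ...

so what ret_rms.update sees is an exponentially discounted sum of past rewards rather than the raw rews at each step.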
Issue Analytics
- Created: 5 years ago
- Comments: 9 (7 by maintainers)
Top GitHub Comments
Normalization is important because neural networks trained with Adam (and most other algorithms, as well) don’t learn well with very large or very small targets. Returns are what we really want to normalize, because they are ultimately what drive the value function and policy.
To see why normalizing returns is different from normalizing rewards, take the simple case where the rewards are large but cancel each other out. For example, the rewards might be [100, -101, 102, -99, ...]. In this case the returns will be fairly small even though the rewards are large, so the advantages we feed into the RL algorithm will already be fairly well behaved. In a different environment, the rewards might be [10, 10, 10, 10, ...]. Here the rewards are not that large, but the returns are likely to be on the order of 1000 if we use gamma=0.99.
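To make that scale difference concrete, here is a rough standalone sketch (made-up reward streams, not the library code) that applies the same ret = ret * gamma + rews recurrence to both cases:

import numpy as np

def running_returns(rewards, gamma=0.99):
    # Same recurrence as VecNormalize's step_wait: ret <- ret * gamma + r at every step.
    ret, out = 0.0, []
    for r in rewards:
        ret = ret * gamma + r
        out.append(ret)
    return np.array(out)

cancelling = np.tile([100.0, -101.0, 102.0, -99.0], 250)  # large rewards that roughly cancel out
steady = np.full(1000, 10.0)                              # modest, constant rewards

print(np.abs(running_returns(cancelling)).max())  # ~100: returns stay on the scale of the rewards
print(np.abs(running_returns(steady)).max())      # ~1000: about 10 / (1 - 0.99), far larger than any single reward

Dividing the rewards by the spread of the returns, rather than of the rewards themselves, therefore keeps the quantities the value function and advantages are built from at a sensible scale in both cases.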
Regarding 1), the gamma corresponds to the discount factor used in the RL algorithm. We want to normalize rewards so that the advantages in the RL algorithm have a nice magnitude, which means we should use the same gamma that is used to compute the advantages. The returns are averaged over the entire course of training by ret_rms, so the gamma doesn’t directly affect how fast our normalization coefficient changes.

Regarding 2), subtracting the mean from the rewards would change the dynamics of the environment. For example, if your rewards are all 1, that is very different from your rewards all being 0: in the former case the agent wants to live; in the latter it doesn’t care whether it lives or dies. The reward clipping looks like a bit of a hack, and there’s probably no way to do it truly “correctly”. Centering the clip range at the mean reward wouldn’t quite make sense either, and I can’t think of a perfect way to do it.
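As a rough illustration of that last point, here is a simplified, standalone sketch of scale-only reward normalization in the spirit of VecNormalize; RunningStd and normalize_reward are stand-in names for this sketch, not the baselines API:

import numpy as np

class RunningStd:
    # Simplified stand-in for a running-statistics tracker such as ret_rms:
    # it accumulates mean/variance over every batch of returns it has ever seen.
    def __init__(self):
        self.count, self.mean, self.m2 = 1e-4, 0.0, 0.0  # tiny pseudo-count avoids division by zero

    def update(self, x):
        for v in np.asarray(x, dtype=np.float64).ravel():
            # Welford's online update, one sample at a time
            self.count += 1.0
            delta = v - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (v - self.mean)

    @property
    def var(self):
        return self.m2 / self.count

def normalize_reward(rews, ret, ret_rms, gamma=0.99, cliprew=10.0, epsilon=1e-8):
    # Scale rewards by the running std of the discounted returns, but do NOT subtract a mean:
    # a reward of 0 stays exactly 0, so the "wants to live vs. doesn't care" structure is preserved.
    ret = ret * gamma + rews
    ret_rms.update(ret)
    scaled = np.clip(rews / np.sqrt(ret_rms.var + epsilon), -cliprew, cliprew)
    return scaled, ret

Called once per step with a per-environment ret array, this reproduces the scale-only behaviour discussed above; clipping then just bounds the occasional outlier rather than re-centering anything.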