Gamma in VecNormalize for rms updates.
I have observed that the VecNormalize class updates the reward running mean statistics with what looks to me like a discounted reward:
def step_wait(self):
    obs, rews, news, infos = self.venv.step_wait()
    self.ret = self.ret * self.gamma + rews  # here
    obs = self._obfilt(obs)
    if self.ret_rms:
        self.ret_rms.update(self.ret)
        rews = np.clip(rews / np.sqrt(self.ret_rms.var + self.epsilon), -self.cliprew, self.cliprew)
    return obs, rews, news, infos
I can’t see why this helps at all; I would use rews directly in the rms update. Am I missing something here?
Thanks.
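For reference: starting from self.ret = 0, the line marked "here" unrolls to

ret_t = r_t + gamma * r_{t-1} + gamma^2 * r_{t-2} + ...

so what ret_rms.update sees is an exponentially discounted sum of past rewards rather than the raw rews at each step.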
Issue Analytics
- Created: 5 years ago
- Comments: 9 (7 by maintainers)
Top GitHub Comments
Normalization is important because neural networks trained with Adam (and most other algorithms, as well) don’t learn well with very large or very small targets. Returns are what we really want to normalize, because they are ultimately what drive the value function and policy.
To see why normalizing returns is different from normalizing rewards, take the simple case where the rewards are large but cancel each other out. For example, the rewards might be [100, -101, 102, -99, ...]. In this case the returns will be fairly small even though the rewards are large, so the advantages we feed into the RL algorithm will already be fairly well behaved. In a different environment, the rewards might be [10, 10, 10, 10, ...]. Here the rewards are not that large, but the returns are likely to be on the order of 1000 if we use gamma=0.99.
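To make that scale difference concrete, here is a rough standalone sketch (made-up reward streams, not the library code) that applies the same ret = ret * gamma + rews recurrence to both cases:

import numpy as np

def running_returns(rewards, gamma=0.99):
    # Same recurrence as VecNormalize's step_wait: ret <- ret * gamma + r at every step.
    ret, out = 0.0, []
    for r in rewards:
        ret = ret * gamma + r
        out.append(ret)
    return np.array(out)

cancelling = np.tile([100.0, -101.0, 102.0, -99.0], 250)  # large rewards that roughly cancel out
steady = np.full(1000, 10.0)                              # modest, constant rewards

print(np.abs(running_returns(cancelling)).max())  # ~100: returns stay on the scale of the rewards
print(np.abs(running_returns(steady)).max())      # ~1000: about 10 / (1 - 0.99), far larger than any single reward

Dividing the rewards by the spread of the returns, rather than of the rewards themselves, therefore keeps the quantities the value function and advantages are built from at a sensible scale in both cases.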
Regarding 1), the gamma corresponds to the discount factor used in the RL algorithm. We want to normalize rewards so that the advantages in the RL algorithm have a nice magnitude, which means we should use the same gamma that is used to compute the advantages. The returns are averaged over the entire course of training by ret_rms, so the gamma doesn’t directly affect how fast our normalization coefficient changes.

Regarding 2), subtracting the mean from the rewards would change the dynamics of the environment. For example, if your rewards are all 1, that is very different from your rewards all being 0: in the former case the agent wants to live; in the latter it doesn’t care whether it lives or dies. The reward clipping looks like a bit of a hack, and there’s probably no way to do it truly “correctly”. Centering the clip range at the mean reward wouldn’t quite make sense either, and I can’t think of a perfect way to do it.
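As a rough illustration of that last point, here is a simplified, standalone sketch of scale-only reward normalization in the spirit of VecNormalize; RunningStd and normalize_reward are stand-in names for this sketch, not the baselines API:

import numpy as np

class RunningStd:
    # Simplified stand-in for a running-statistics tracker such as ret_rms:
    # it accumulates mean/variance over every batch of returns it has ever seen.
    def __init__(self):
        self.count, self.mean, self.m2 = 1e-4, 0.0, 0.0  # tiny pseudo-count avoids division by zero

    def update(self, x):
        for v in np.asarray(x, dtype=np.float64).ravel():
            # Welford's online update, one sample at a time
            self.count += 1.0
            delta = v - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (v - self.mean)

    @property
    def var(self):
        return self.m2 / self.count

def normalize_reward(rews, ret, ret_rms, gamma=0.99, cliprew=10.0, epsilon=1e-8):
    # Scale rewards by the running std of the discounted returns, but do NOT subtract a mean:
    # a reward of 0 stays exactly 0, so the "wants to live vs. doesn't care" structure is preserved.
    ret = ret * gamma + rews
    ret_rms.update(ret)
    scaled = np.clip(rews / np.sqrt(ret_rms.var + epsilon), -cliprew, cliprew)
    return scaled, ret

Called once per step with a per-environment ret array, this reproduces the scale-only behaviour discussed above; clipping then just bounds the occasional outlier rather than re-centering anything.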