[Question] Justifying advantage normalization for PPO
Question
For PPO, I understand that advantage normalization (per batch of experiences) is more or less standard practice; I've seen other implementations do it too. However, I find it somewhat unjustified, and here's why.
If we are using GAE, then each advantage is a weighted sum of many TD deltas `delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)`. Suppose most of these deltas are positive (not an unreasonable assumption, especially when training is going well, i.e., when the actions taken are increasingly better than the "average action"). Then the advantages of earlier transitions will be larger than those of later transitions, simply because toward the end of the episode there are fewer TD deltas left to sum.
In this case, normalizing the advantages (which involves subtracting the batch mean) gives the early transitions positive advantages and the late transitions negative ones, which can hurt performance and doesn't make sense intuitively. Moreover, the whole point of policy-gradient methods is that an action with a positive advantage should be encouraged whenever possible; arguments like "give the model something to encourage and something to discourage in every batch of updates" are not convincing enough.
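To make the concern concrete, here is a minimal NumPy sketch (not SB3 code; the constant deltas and the gamma/lambda values are made up) showing that with uniformly positive TD deltas, GAE advantages shrink toward the end of the episode, and subtracting the batch mean flips the sign of the later ones:

```python
import numpy as np

gamma, lam = 0.99, 0.95
T = 10
deltas = np.full(T, 0.5)  # pretend every TD delta r + gamma * V(s') - V(s) is +0.5

# GAE: A_t = sum_k (gamma * lam)^k * delta_{t+k}, computed backwards
advantages = np.zeros(T)
gae = 0.0
for t in reversed(range(T)):
    gae = deltas[t] + gamma * lam * gae
    advantages[t] = gae

print(advantages)                      # decreasing toward the end of the episode
print(advantages - advantages.mean())  # early steps positive, late steps negative
```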
Are there stronger justifications (e.g., papers) for why advantage normalization should be used by default in SB3? Has anyone investigated the practical differences?
A sounder alternative seems to be dividing by the max or the std without subtracting the mean.
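For comparison, here is a hypothetical sketch of the two options, standard per-batch normalization versus rescaling only; neither is quoted from SB3's implementation:

```python
import numpy as np

def normalize_standard(adv: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Common per-batch normalization: re-center and rescale."""
    return (adv - adv.mean()) / (adv.std() + eps)

def normalize_scale_only(adv: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """The alternative suggested above: rescale without re-centering,
    so the sign of each advantage is preserved."""
    return adv / (adv.std() + eps)
```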
Thanks!
Context
I’ve checked this issue but it doesn’t resolve my confusion (it’s not even closed lol):
Checklist
- I have read the documentation (required)
- I have checked that there is no similar issue in the repo (required)
Top GitHub Comments
Sorry for the delay.
@araffin Yes, what I described indeed does not happen when you bootstrap correctly at the final step (I checked the code in stable-baselines3 again, and it does exactly this).
But the problem persists when people don't bootstrap at the final step (in a continuous-control env; in an episodic env, of course, no bootstrap is needed when the task ends gracefully). This happens when people use the one-sample return in place of the advantage. To my knowledge, this is how most people implement their first policy-gradient project (e.g., on CartPole), and it still works.
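To illustrate the bootstrapping point, here is a small sketch (illustrative only, not SB3 code; the helper name and signature are made up) of where the bootstrap term enters the TD deltas:

```python
import numpy as np

def td_deltas(rewards, values, last_value, terminated, gamma=0.99):
    """TD deltas for one rollout segment of length T.

    rewards, values: arrays of length T (values[t] = V(s_t));
    last_value: V(s_T), used to bootstrap only if the episode was NOT
    terminated (e.g., the rollout was cut off by a time limit).
    Dropping that bootstrap term is what biases late-episode advantages
    downward in the scenario described above.
    """
    next_values = np.append(values[1:], 0.0 if terminated else last_value)
    return rewards + gamma * next_values - values
```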
In response to this, maybe a plot would help, but I think it’s quite self-evident. Let me know what you think!
@Miffyli Regarding the empirical study you mentioned, I think it's great. Here's a more mathematical justification for advantage normalization (from the CS258 lecture 6 slide "Critics as state-dependent baselines", for those who are interested):
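(The slide itself isn't reproduced here; presumably it refers to the usual state-dependent-baseline identity, sketched below from memory: subtracting any baseline b(s_t) that depends only on the state leaves the policy gradient unbiased.)

$$
\mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\big]
= b(s_t) \int \nabla_\theta \pi_\theta(a_t \mid s_t)\, \mathrm{d}a_t
= b(s_t)\, \nabla_\theta \int \pi_\theta(a_t \mid s_t)\, \mathrm{d}a_t
= b(s_t)\, \nabla_\theta 1 = 0
$$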
Hope it helps!
I’m digging into this a bit, so let’s keep this issue open, and I will post what I find for future reference.