[Bug] NaN's if a single observation is sampled from PPO's rollout_buffer
🐛 Bug
I am using PPO and adapting parts of it to experiment with an unusual adversary setting. For now this includes an ad-hoc change that gives the rollout buffer a dynamic size. While doing this I found that PPO can run into NaN values whenever it samples a batch of size 1, due to this line in PPO.py, line 170:
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
where advantages.std() over a batch of size 1 is not defined: the sample standard deviation divides by n - 1 = 0, so the result is NaN.
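To see why a single-sample batch breaks the normalization, here is a minimal pure-Python sketch. It assumes, as PyTorch's `Tensor.std()` does by default, that the standard deviation uses Bessel's correction (dividing by n - 1). The helper names `sample_std` and `normalize` are hypothetical and only mimic the tensor expression above.

```python
import math


def sample_std(values):
    """Sample standard deviation with Bessel's correction (divide by n - 1).

    Returns NaN for fewer than two values, mimicking what a tensor library
    reports when the n - 1 denominator is zero (assumption for illustration).
    """
    n = len(values)
    if n < 2:
        return float("nan")
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def normalize(advantages, eps=1e-8):
    """Plain-Python analogue of (adv - adv.mean()) / (adv.std() + eps)."""
    mean = sum(advantages) / len(advantages)
    std = sample_std(advantages)
    return [(a - mean) / (std + eps) for a in advantages]


single = normalize([0.7])      # batch of size 1
pair = normalize([1.0, 3.0])   # batch of size 2

print(single)  # the only entry is NaN: 0.0 / (nan + eps)
print(pair)    # finite values, symmetric around zero
```

Note that the `1e-8` term does not help here: it protects against a standard deviation of zero, not against one that is already NaN.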
While my current scenario is certainly uncommon, this seems like a case that could occur more generally whenever someone uses a rollout buffer of size batch_size + 1 (so that the last minibatch holds a single sample), or wants to use a dynamically sized rollout buffer. It could also easily be accounted for.
To Reproduce
Run PPO with a batch_size of 1, or with a rollout_buffer of size batch_size + 1 (i.e. by setting n_steps to this value).
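The arithmetic behind the second reproduction case can be sketched as follows. The helper `minibatch_sizes` is hypothetical and assumes the buffer is consumed in consecutive chunks of `batch_size`, with the remainder forming a final, smaller minibatch. It is an illustration of the counting, not the library's sampling code.

```python
def minibatch_sizes(buffer_size, batch_size):
    """Sizes of the minibatches when a buffer is consumed in batch_size chunks.

    Assumption: the leftover samples form one last, smaller minibatch.
    """
    sizes = []
    start = 0
    while start < buffer_size:
        sizes.append(min(batch_size, buffer_size - start))
        start += batch_size
    return sizes


batch_size = 64
leftover_case = minibatch_sizes(batch_size + 1, batch_size)
even_case = minibatch_sizes(2 * batch_size, batch_size)

print(leftover_case)  # the last minibatch holds a single sample
print(even_case)      # divides evenly, no leftover minibatch
```

Any buffer size that leaves a remainder of 1 when divided by `batch_size` ends in such a single-sample minibatch, not only `batch_size + 1`.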
Expected behavior
I would suggest either discarding batches when only a single observation is left in the rollout_buffer, or checking the size of the sampled batch and setting the std to either 1 or the minimum value in this case.
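Both suggestions can be sketched in a few lines. The version below is a dependency-free illustration, not the library's code: `safe_normalize` and its `fallback_std` parameter are hypothetical names, and with tensors the same guard would amount to checking `len(advantages) > 1` before normalizing.

```python
import math
import statistics


def safe_normalize(advantages, eps=1e-8, fallback_std=None):
    """Normalize advantages, guarding against a batch with a single sample.

    - fallback_std=None: skip normalization for a single-sample batch.
    - fallback_std=1.0 (or another positive value): center the batch and
      divide by that value instead of the undefined standard deviation.
    """
    if len(advantages) > 1:
        mean = statistics.fmean(advantages)
        std = statistics.stdev(advantages)  # sample std, needs >= 2 points
        return [(a - mean) / (std + eps) for a in advantages]
    if fallback_std is None:
        return list(advantages)  # leave the lone advantage untouched
    mean = advantages[0] if advantages else 0.0
    return [(a - mean) / (fallback_std + eps) for a in advantages]


skipped = safe_normalize([0.7])                    # normalization skipped
centered = safe_normalize([0.7], fallback_std=1.0) # std replaced by 1
regular = safe_normalize([1.0, 3.0])               # usual path

print(skipped, centered, regular)
```

One design point to keep in mind: replacing the std by 1 while still subtracting the mean turns the single advantage into 0, so that sample contributes nothing to the policy gradient. Skipping normalization keeps its signal, but on a different scale than the normalized batches.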
System Info
stable-baselines v. 0.10.0
Checklist
- I have checked that there is no similar issue in the repo (required)
- I have read the documentation (required)
- I have provided a minimal working example to reproduce the bug (required)
Issue Analytics
- State:
- Created 3 years ago
- Comments:16 (15 by maintainers)
Top GitHub Comments
Oh, maybe emails are rate limited, so you will receive these new emails in December and February respectively, lol 😄
On Tue, Oct 11, 2022, 04:26 Hugh Perkins @.***> wrote:
I think there are two issues we can factorize:
For now, I’m just going to leave this code and comments here. Maybe I will submit a PR for either or both of these. My own code works now anyway, and I’ve made the code available that I’m using, via this PR, if anyone else wants to use it.
On Tue, Aug 23, 2022, 09:20 Hugh Perkins @.***> wrote: