
[Bug] NaN's if a single observation is sampled from PPO's rollout_buffer


🐛 Bug

I am using PPO and adapting a few parts of it to experiment with an unusual adversary setting, including, for now in an ad-hoc manner, giving the rollout buffer a dynamic size. Doing this I found that PPO can run into NaN values whenever it samples a batch of size 1, due to this line in PPO.py, line 170:

                advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

where advantages.std() over a batch of size 1 is not defined.
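
The failure mode is easy to reproduce in isolation. A minimal sketch (assuming PyTorch, which stable-baselines3 uses under the hood):

    import torch

    advantages = torch.tensor([0.5])  # a "batch" containing a single advantage value
    normalized = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    print(normalized)                 # tensor([nan]): std() over one element is undefined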

While my current scenario is of course not a common one, this seems like a case that could occur more generally whenever someone uses a rollout buffer of size batch_size + 1, or wants to use a dynamically sized rollout buffer. It could also easily be accounted for.

To Reproduce

Run PPO with a batch_size of 1, or with a rollout buffer of size batch_size + 1 (e.g. by setting n_steps to that value).
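
For example, something along these lines should trigger it (a hedged sketch; the exact hyperparameters are illustrative, and a maintainer posts a fuller example further down the thread):

    from stable_baselines3 import PPO

    # n_steps = batch_size + 1, so the last minibatch of every epoch holds a single sample
    model = PPO("MlpPolicy", "Pendulum-v1", n_steps=65, batch_size=64, verbose=1)
    model.learn(total_timesteps=10_000)  # the losses eventually turn to NaN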

Expected behavior

I would suggest either discarding batches when only a single observation is left in the rollout_buffer, or checking the size of the sampled batch and, in that case, setting the std to either 1 or some minimum value.
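
In code, the second option could look roughly like this (a sketch only; normalize_advantages is an illustrative helper, not something from the library, and the fix that eventually landed may differ in detail):

    import torch

    def normalize_advantages(advantages: torch.Tensor) -> torch.Tensor:
        # With a single sample std() is undefined; fall back to 1.0 so the
        # normalization becomes a harmless no-op instead of producing NaNs.
        std = advantages.std() if len(advantages) > 1 else 1.0
        return (advantages - advantages.mean()) / (std + 1e-8)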

System Info

stable-baselines3 v0.10.0

Checklist

  • I have checked that there is no similar issue in the repo (required)
  • I have read the documentation (required)
  • I have provided a minimal working example to reproduce the bug (required)

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 16 (15 by maintainers)

Top GitHub Comments

1 reaction
hughperkins commented on Oct 11, 2022

Oh, maybe emails are rate limited, so you will receive these new emails in December and February respectively, lol 😄

On Tue, Oct 11, 2022, 04:26, Hugh Perkins wrote:

I sent that email in August 😛

On Tue, Oct 11, 2022, 04:25, Hugh Perkins wrote:

That’s a very old email. Not sure why it is appearing as a comment now…

On Tue, Oct 11, 2022, 04:09, Antonin RAFFIN wrote:

@hughperkins (https://github.com/hughperkins) I'm not sure I understand your comment… this issue was fixed by yourself in #1028 (https://github.com/DLR-RM/stable-baselines3/pull/1028).


1 reaction
hughperkins commented on Aug 23, 2022

I think there are two issues we can factorize:

  • how to avoid the NaNs => the advantage fix you allude to. I agree.
  • should partial batches be skipped => this would be backwards incompatible, so it should be a new option that defaults to not skipping partial batches (see the sketch below).

For now, I'm just going to leave this code and these comments here. Maybe I will submit a PR for either or both of these. My own code works now anyway, and I've made the code I'm using available via this PR, if anyone else wants to use it.
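
As an illustration of the second point, a backwards-compatible skip_partial_batches option could sit in the minibatch loop roughly like this (a sketch only, not SB3's actual RolloutBuffer code; the function name and flag are hypothetical):

    import numpy as np

    def iter_minibatch_indices(buffer_size: int, batch_size: int, skip_partial_batches: bool = False):
        """Yield shuffled index arrays of at most batch_size elements."""
        indices = np.random.permutation(buffer_size)
        for start in range(0, buffer_size, batch_size):
            batch = indices[start:start + batch_size]
            if skip_partial_batches and len(batch) < batch_size:
                continue  # defaulting to False keeps the current behaviour (train on everything)
            yield batch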

On Tue, Aug 23, 2022, 09:20, Hugh Perkins wrote:

My buffer size is not fixed. I collect the rollout myself, using a custom method, so whether the last batch ends up with a single sample is effectively random: it fails about one time in batch_size. I bumped the batch size up to 2048, and the NaNs became less frequent. I thought this was because batch size 2048 stabilized the learning, but it was because, for me, the bug only occurs with probability 1/batch_size.

On Tue, Aug 23, 2022, 08:54, Antonin RAFFIN wrote:

(didn’t see your comment whilst working on the PR).

That's why we recommend discussing and agreeing on the solution in an issue first… I will still try to take a look at your PR in the coming days.

Thinking about it again, I think the best option would be to skip advantage normalization if the batch is of size 1.

But… do we want to train on partial batches?

We usually want to train on all the collected data.

The learning rate will be wrong for the partial batches,

That's true; at the same time, there should be only one partial batch.

if batch size is not a factor, then at each learning stage there is a probability of 1/batch_size of the buffer returning a batch of size 1 as the last batch

This should not be a probability. It returns a batch of size 1 if buffer_size % batch_size == 1. For instance:

    from stable_baselines3 import PPO
    from stable_baselines3.common.env_util import make_vec_env

    # With one env
    n_envs = 1
    n_steps = 64
    batch_size = 63
    # With two
    # n_envs = 2
    # n_steps = 32
    # batch_size = 63

    buffer_size = n_envs * n_steps
    assert buffer_size % batch_size == 1

    env = make_vec_env("Pendulum-v1", n_envs=n_envs)
    # Fix 1: normalize_advantage=False
    PPO("MlpPolicy", env, verbose=1, batch_size=batch_size, n_steps=n_steps).learn(10_000)

