
[Bug] NaN's if a single observation is sampled from PPO's rollout_buffer


🐛 Bug

I am using PPO and adapting a few parts of it to experiment with an unusual adversary setting, including, for now in an ad-hoc manner, giving the rollout buffer a dynamic size. Doing this I found that PPO can run into NaN values whenever it samples a batch of size 1, due to this line in PPO.py, line 170:

                advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

where advantages.std() over a batch of size 1 is not defined.
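
The failure mode is easy to reproduce in isolation. A minimal sketch (assuming PyTorch, which stable-baselines3 uses under the hood):

    import torch

    advantages = torch.tensor([0.5])  # a "batch" containing a single advantage value
    normalized = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    print(normalized)                 # tensor([nan]): std() over one element is undefined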

While my current scenario is of course not a common one, this seems like a case that could occur more generally whenever someone uses a rollout buffer of size batch_size + 1, or wants to use a dynamically sized rollout buffer. It could also easily be accounted for.

To Reproduce

Run PPO with a batch_size of 1, or with a rollout buffer of size batch_size + 1 (e.g. by setting n_steps to that value).
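
For example, something along these lines should trigger it (a hedged sketch; the exact hyperparameters are illustrative, and a maintainer posts a fuller example further down the thread):

    from stable_baselines3 import PPO

    # n_steps = batch_size + 1, so the last minibatch of every epoch holds a single sample
    model = PPO("MlpPolicy", "Pendulum-v1", n_steps=65, batch_size=64, verbose=1)
    model.learn(total_timesteps=10_000)  # the losses eventually turn to NaN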

Expected behavior

I would suggest either discarding batches when only a single observation is left in the rollout_buffer, or checking the size of the sampled batch and, in that case, setting the std to either 1 or some minimum value.
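
In code, the second option could look roughly like this (a sketch only; normalize_advantages is an illustrative helper, not something from the library, and the fix that eventually landed may differ in detail):

    import torch

    def normalize_advantages(advantages: torch.Tensor) -> torch.Tensor:
        # With a single sample std() is undefined; fall back to 1.0 so the
        # normalization becomes a harmless no-op instead of producing NaNs.
        std = advantages.std() if len(advantages) > 1 else 1.0
        return (advantages - advantages.mean()) / (std + 1e-8)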

System Info

stable-baselines3 v0.10.0

Checklist

  • I have checked that there is no similar issue in the repo (required)
  • I have read the documentation (required)
  • I have provided a minimal working example to reproduce the bug (required)

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 16 (15 by maintainers)

Top GitHub Comments

1 reaction
hughperkins commented on Oct 11, 2022

Oh, maybe emails are rate limited, so you will receive these new emails in December and February respectively, lol 😄

On Tue, Oct 11, 2022, 04:26, Hugh Perkins wrote:

I sent that email in August 😛

On Tue, Oct 11, 2022, 04:25, Hugh Perkins wrote:

That’s a very old email. Not sure why it is appearing as a comment now…

On Tue, Oct 11, 2022, 04:09, Antonin RAFFIN wrote:

@hughperkins (https://github.com/hughperkins) I'm not sure I understand your comment… this issue was fixed by yourself in #1028 (https://github.com/DLR-RM/stable-baselines3/pull/1028).


1 reaction
hughperkins commented on Aug 23, 2022

I think there are two issues we can factorize:

  • how to avoid the NaNs => the advantage fix you allude to. I agree.
  • should partial batches be skipped => this would be backwards incompatible, so it should be a new option that defaults to not skipping partial batches (see the sketch below).

For now, I'm just going to leave this code and these comments here. Maybe I will submit a PR for either or both of these. My own code works now anyway, and I've made the code I'm using available via this PR, if anyone else wants to use it.
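
As an illustration of the second point, a backwards-compatible skip_partial_batches option could sit in the minibatch loop roughly like this (a sketch only, not SB3's actual RolloutBuffer code; the function name and flag are hypothetical):

    import numpy as np

    def iter_minibatch_indices(buffer_size: int, batch_size: int, skip_partial_batches: bool = False):
        """Yield shuffled index arrays of at most batch_size elements."""
        indices = np.random.permutation(buffer_size)
        for start in range(0, buffer_size, batch_size):
            batch = indices[start:start + batch_size]
            if skip_partial_batches and len(batch) < batch_size:
                continue  # defaulting to False keeps the current behaviour (train on everything)
            yield batch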

On Tue, Aug 23, 2022, 09:20, Hugh Perkins wrote:

My buffer size is not fixed. I collect the rollout myself, using a custom method, so whether the last batch ends up with a single sample is effectively random: it fails about one time in batch_size. I bumped the batch size up to 2048, and the NaNs became less frequent. I thought this was because batch size 2048 stabilized the learning, but it was because, for me, the bug only occurs with probability 1/batch_size.

On Tue, Aug 23, 2022, 08:54, Antonin RAFFIN wrote:

(didn’t see your comment whilst working on the PR).

That's why we recommend discussing and agreeing on the solution in an issue first… I will still try to take a look at your PR in the coming days.

Thinking about it again, I think the best option would be to skip advantage normalization if the batch is of size 1.

But… do we want to train on partial batches?

We usually want to train on all the collected data.

The learning rate will be wrong for the partial batches,

That's true; at the same time, there should be only one partial batch.

if batch size is not a factor, then at each learning stage there is a probability of 1/batch_size of the buffer returning a batch of size 1 as the last batch

This should not be a probability. It returns a batch of size 1 if buffer_size % batch_size == 1. For instance:

    from stable_baselines3 import PPO
    from stable_baselines3.common.env_util import make_vec_env

    # With one env
    n_envs = 1
    n_steps = 64
    batch_size = 63
    # With two
    # n_envs = 2
    # n_steps = 32
    # batch_size = 63

    buffer_size = n_envs * n_steps
    assert buffer_size % batch_size == 1

    env = make_vec_env("Pendulum-v1", n_envs=n_envs)
    # Fix 1: normalize_advantage=False
    PPO("MlpPolicy", env, verbose=1, batch_size=batch_size, n_steps=n_steps).learn(10_000)

