[Bug] Silent NaNs in PPO Loss Calculation if n_steps=1 and n_envs=1
Bug
This is somewhere between a bug and a request for more informative errors:
When n_steps and n_envs are both set to 1, the batch returned by the rollout buffer here will be of length 1. This makes the advantage calculation return nan values, since the normalization step involves calculating the standard deviation of the advantages, which is undefined for a single element.
I recognize that such a small setting is definitely an edge case (I ran into it during testing, when we were setting all values quite low for speed reasons), so I'm not sure it makes sense to add logic for this case. At a minimum, though, I think it would be beneficial to have some kind of explicit warning that checks whether actions or advantages contain a single element, so that there's a clear indication of the source of the issue, rather than having to follow a breadcrumb trail of nans from some higher abstraction level of code.
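For context, here is a minimal sketch of the failure mode. It is not the SB3 code itself, just the same normalization pattern applied to a single-element tensor:
import torch

# A rollout of a single transition: one advantage value.
advantages = torch.tensor([0.37])

# The usual normalization pattern; torch's std() uses the unbiased estimator,
# which is undefined (nan) for fewer than two elements.
normalized = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

print(advantages.std())   # tensor(nan)
print(normalized)         # tensor([nan])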
To Reproduce
import gym
from stable_baselines3 import PPO

env = gym.make('CartPole-v1')
model = PPO('MlpPolicy', env, verbose=1, n_steps=1)
model.learn(total_timesteps=10)
This will fail with an unclear error:
RuntimeError: invalid multinomial distribution (encountering probability entry < 0)
If you attach a debugger or insert logging statements at ppo.py:170, you'll be able to see that (1) len(advantages) == 1 and consequently (2) advantages.std() is nan. This first surfaces as a visible bug when you try to collect an on-policy rollout after your first training step, since the nan values in the loss propagate into nan parameter values.
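To illustrate the propagation (a standalone sketch, not SB3 code): once a nan enters the loss, a single optimizer step poisons the policy parameters, and the next attempt to sample an action is what finally raises the multinomial error.
import torch

# Illustrative only: a tiny "policy" whose loss has been contaminated with nan.
policy = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(policy.parameters(), lr=0.1)

loss = policy(torch.ones(1, 4)).sum() * float("nan")
loss.backward()           # nan gradients
optimizer.step()          # nan parameters

logits = policy(torch.ones(1, 4))
probs = torch.softmax(logits, dim=-1)
print(probs)              # tensor([[nan, nan]])
# Sampling an action from these probabilities is what surfaces as the
# multinomial RuntimeError quoted above.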
Expected behavior
Either (1) explicit support for training on effective batches of size 1, or (2) a clearer and earlier error when you attempt to construct an algorithm object with n_steps=1 and n_envs=1, informing the user that this case isn't supported.
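For option (2), the check could look something like the sketch below. The function name and attribute layout are hypothetical; this is not necessarily how the eventual fix (linked in the comments further down) was implemented:
# Hypothetical early validation at algorithm-construction time.
# `n_steps` and `n_envs` mirror the constructor/VecEnv parameters of the
# same names.
def _check_rollout_size(n_steps: int, n_envs: int) -> None:
    buffer_size = n_steps * n_envs
    if buffer_size <= 1:
        raise ValueError(
            f"n_steps * n_envs must be greater than 1 (got {buffer_size}) "
            "because advantage normalization needs at least two samples."
        )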
System Info
Describe the characteristics of your environment:
- Describe how the library was installed (pip, docker, source, …): Cloned from a fork of current master, installed via pip
- GPU models and configuration: N/A
- Python version: 3.7.0
- PyTorch version: 1.7.1
- Gym version: 0.17.3
- Versions of any other relevant libraries: N/A
Checklist
- I have checked that there is no similar issue in the repo (required)
- I have read the documentation (required)
- I have provided a minimal working example to reproduce the bug (required)
Top GitHub Comments
Ah, reading over your comment again, I now think we're saying the same thing here, except you're framing it as the last minibatch getting truncated, whereas in the situation I'm describing you can't pull even a single full minibatch from the amount of data present in n_steps * n_envs, so all batches are truncated.
The issue with n_env * n_step == 1 should be fixed now by https://github.com/DLR-RM/stable-baselines3/pull/1028
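For concreteness, a small arithmetic illustration of that point (64 is PPO's default batch_size; the rest is simple arithmetic, not SB3 code):
# With n_steps = n_envs = 1, the rollout buffer holds a single transition,
# so even the first minibatch is truncated down to one sample.
n_steps, n_envs, batch_size = 1, 1, 64
buffer_size = n_steps * n_envs

full_minibatches = buffer_size // batch_size   # 0
leftover_samples = buffer_size % batch_size    # 1
print(full_minibatches, leftover_samples)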