[Bug] Silent NaNs in PPO Loss Calculation if n_steps=1 and n_envs=1
Bug
This is somewhere between a bug and a request for more informative errors:
When n_steps and n_envs are both set to 1, the batch returned by the rollout buffer here will be of length 1. This makes the advantage calculation return nan values, since the normalization step involves calculating the standard deviation of the advantages, which is undefined for a single element.
I recognize that such a small setting is definitely an edge case (I ran into it during testing, when we were setting all values quite low for speed reasons), so I'm not sure it makes sense to add logic for this case. At a minimum, though, I think it would be beneficial to have some kind of explicit warning that checks whether actions or advantages contain a single element, so that there's a clear indication of the source of the issue, rather than having to follow a breadcrumb trail of nans from some higher abstraction level of code.
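For context, here is a minimal sketch of the failure mode. It is not the SB3 code itself, just the same normalization pattern applied to a single-element tensor:
import torch

# A rollout of a single transition: one advantage value.
advantages = torch.tensor([0.37])

# The usual normalization pattern; torch's std() uses the unbiased estimator,
# which is undefined (nan) for fewer than two elements.
normalized = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

print(advantages.std())   # tensor(nan)
print(normalized)         # tensor([nan])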
To Reproduce
import gym
from stable_baselines3 import PPO

env = gym.make('CartPole-v1')
model = PPO('MlpPolicy', env, verbose=1, n_steps=1)
model.learn(total_timesteps=10)
This will fail with an unclear error:
RuntimeError: invalid multinomial distribution (encountering probability entry < 0)
If you attach a debugger or insert logging statements at ppo.py:170, you'll be able to see that (1) len(advantages) == 1 and consequently (2) advantages.std() is nan. This first surfaces as a visible bug when you try to collect an on-policy rollout after your first training step, since the nan values in the loss propagate into nan parameter values.
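To illustrate the propagation (a standalone sketch, not SB3 code): once a nan enters the loss, a single optimizer step poisons the policy parameters, and the next attempt to sample an action is what finally raises the multinomial error.
import torch

# Illustrative only: a tiny "policy" whose loss has been contaminated with nan.
policy = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(policy.parameters(), lr=0.1)

loss = policy(torch.ones(1, 4)).sum() * float("nan")
loss.backward()           # nan gradients
optimizer.step()          # nan parameters

logits = policy(torch.ones(1, 4))
probs = torch.softmax(logits, dim=-1)
print(probs)              # tensor([[nan, nan]])
# Sampling an action from these probabilities is what surfaces as the
# multinomial RuntimeError quoted above.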
Expected behavior
Either (1) explicit support for training on effective batches of size 1, or (2) a clearer and earlier error when you attempt to construct an algorithm object with n_steps=1 and n_envs=1, informing the user that this case isn't supported.
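For option (2), the check could look something like the sketch below. The function name and attribute layout are hypothetical; this is not necessarily how the eventual fix (linked in the comments further down) was implemented:
# Hypothetical early validation at algorithm-construction time.
# `n_steps` and `n_envs` mirror the constructor/VecEnv parameters of the
# same names.
def _check_rollout_size(n_steps: int, n_envs: int) -> None:
    buffer_size = n_steps * n_envs
    if buffer_size <= 1:
        raise ValueError(
            f"n_steps * n_envs must be greater than 1 (got {buffer_size}) "
            "because advantage normalization needs at least two samples."
        )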
System Info
Describe the characteristics of your environment:
- Describe how the library was installed (pip, docker, source, …): Cloned from a fork of current master, installed via pip
- GPU models and configuration: N/A
- Python version: 3.7.0
- PyTorch version: 1.7.1
- Gym version: 0.17.3
- Versions of any other relevant libraries: N/A
Checklist
- I have checked that there is no similar issue in the repo (required)
- I have read the documentation (required)
- I have provided a minimal working example to reproduce the bug (required)
Top GitHub Comments
Ah, reading over your comment again, I now think we're saying the same thing here, except you're framing it as the last minibatch getting truncated, whereas in the situation I'm describing you can't pull even a single full minibatch from the amount of data present in n_steps * n_envs, so all batches are truncated.
The issue with n_env * n_step == 1 should be fixed now by https://github.com/DLR-RM/stable-baselines3/pull/1028
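For concreteness, a small arithmetic illustration of that point (64 is PPO's default batch_size; the rest is simple arithmetic, not SB3 code):
# With n_steps = n_envs = 1, the rollout buffer holds a single transition,
# so even the first minibatch is truncated down to one sample.
n_steps, n_envs, batch_size = 1, 1, 64
buffer_size = n_steps * n_envs

full_minibatches = buffer_size // batch_size   # 0
leftover_samples = buffer_size % batch_size    # 1
print(full_minibatches, leftover_samples)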