Support learn() with total timesteps less than episode length
In an imitation learning project, I'm alternating short calls (low `total_timesteps`) to `PPO2.learn()` with gradient descent updates to the discriminator, among other things.
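Roughly, the training loop looks like the sketch below; `update_discriminator` and `expert_data` are placeholders standing in for the project-specific imitation pieces, not stable-baselines APIs:

```python
import gym
from stable_baselines import PPO2


def update_discriminator(model, expert_data):
    """Placeholder for the project-specific discriminator gradient step."""


expert_data = None  # placeholder for expert demonstrations
env = gym.make("CartPole-v1")
model = PPO2("MlpPolicy", env, n_steps=128, verbose=0)

for _ in range(50):
    # Short PPO2 call: total_timesteps is well below CartPole's 500-step episodes.
    model.learn(total_timesteps=256, reset_num_timesteps=False)
    # Alternate with a discriminator update, as in the imitation setup above.
    update_discriminator(model, expert_data)
```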
The PPO2 updates themselves seem to be working fine. However, logging has a few problems because some logging state is not kept between different calls to `learn()`. Despite the fact that I use `.learn(reset_num_timesteps=False)`, which seems to exist to allow logging across multiple calls to `PPO2.learn()`, we run into these problems:
(1) We initialize a new `Runner` every time we call `learn`, even if `reset_num_timesteps=False`. This forces the environment to `reset()`, thus biasing the logger towards reporting the reward and episode-length means of shorter episodes (longer episodes don't get to finish by the end of the training loop). As an example, when training CartPole, my expert imitation policies (mean return: 500) often show a mean return around 300 in my PPO2 logs.
(2) The `ep_info_buf` is reset every time we call `learn` instead of being saved. This makes the training curve more jagged and again biased toward shorter episodes at the beginning of each new call to `learn()`.
I’m wondering if the maintainers would be interested in some sort of PR that fixes these logging discrepancies.
I propose that `reset_num_timesteps=False` (perhaps rename to `reset_log_state`) should make PPO2 keep the state from (1) and (2).
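A hypothetical sketch of what that could look like inside `learn()`; this is not the actual PPO2 code, and `PPO2Sketch` / `_make_runner` exist only to illustrate the proposed state-keeping:

```python
from collections import deque


class PPO2Sketch:
    """Illustration only: keep the runner and episode-info buffer across learn() calls."""

    def __init__(self):
        self.runner = None
        self.ep_info_buf = None

    def _make_runner(self):
        # Stand-in for Runner(env=self.env, model=self, n_steps=self.n_steps),
        # whose construction resets the environment.
        return object()

    def learn(self, total_timesteps, reset_num_timesteps=True):
        if reset_num_timesteps or self.runner is None:
            self.runner = self._make_runner()      # (1) env only resets here
            self.ep_info_buf = deque(maxlen=100)   # (2) stats only cleared here
        # ... collect rollouts with self.runner, update the policy,
        #     and append finished-episode infos to self.ep_info_buf ...
```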
Top GitHub Comments
After talking to @shwang, the problem is more serious than I thought: it affects learning, not just logging.
`AbstractEnvRunner` calls `env.reset` in the `__init__` method. So any algorithm using `Runner`, when you call `learn` with fewer timesteps than the episode length, will never see the episode end during training! This would cause big problems in environments with a sparse reward at the end of an episode.
Admittedly this use case is rare, so it's fairly low-severity, but I think we should treat learning breaking in this setting as a bug.
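A toy illustration of that failure mode (`FixedHorizonSparseEnv` is made up for this sketch, not a real benchmark): with a 200-step horizon and a reward only on the final step, a loop of `learn()` calls shorter than 200 timesteps never observes a non-zero reward, because each call rebuilds the runner and resets the environment.

```python
import gym
import numpy as np


class FixedHorizonSparseEnv(gym.Env):
    """Reward is zero everywhere except +1 on the final (200th) step."""

    observation_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)
    action_space = gym.spaces.Discrete(2)

    def __init__(self, horizon=200):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return np.zeros(1, dtype=np.float32)

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        reward = 1.0 if done else 0.0  # sparse reward at the end of the episode
        return np.zeros(1, dtype=np.float32), reward, done, {}


# Calling learn(total_timesteps=128) in a loop on this env (as in the snippet
# at the top of the issue) never reaches step 200, so PPO2 only ever sees
# zero rewards and cannot learn anything.
```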
This should be fixed in v3, but needs to be checked, linking https://github.com/DLR-RM/stable-baselines3/issues/1