Support learn() with total timesteps less than episode length
In an imitation learning project, I'm alternating short calls (low `total_timesteps`) to `PPO2.learn()` with gradient descent updates to the discriminator, among other things.
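Roughly, the training loop looks like the sketch below; `update_discriminator` and `expert_data` are placeholders standing in for the project-specific imitation pieces, not stable-baselines APIs:

```python
import gym
from stable_baselines import PPO2


def update_discriminator(model, expert_data):
    """Placeholder for the project-specific discriminator gradient step."""


expert_data = None  # placeholder for expert demonstrations
env = gym.make("CartPole-v1")
model = PPO2("MlpPolicy", env, n_steps=128, verbose=0)

for _ in range(50):
    # Short PPO2 call: total_timesteps is well below CartPole's 500-step episodes.
    model.learn(total_timesteps=256, reset_num_timesteps=False)
    # Alternate with a discriminator update, as in the imitation setup above.
    update_discriminator(model, expert_data)
```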
The PPO2 updates themselves seem to be working fine. However, logging has a few problems because some logging state is not kept between different calls to `learn()`. Despite the fact that I use `.learn(reset_num_timesteps=False)`, which seems to exist to allow logging across multiple calls to `PPO2.learn()`, we run into these problems:
(1) We initialize a new `Runner` every time we call `learn`, even if `reset_num_timesteps=False`. This forces the environment to `reset()`, thus biasing the logger towards reporting the reward and episode-length means of shorter episodes (longer episodes don't get to finish by the end of the training loop). As an example, when training CartPole, my expert imitation policies (mean return: 500) often show a mean return around 300 in my PPO2 logs.
(2) The `ep_info_buf` is reset every time we call `learn` instead of being saved. This makes the training curve more jagged and again biased toward shorter episodes at the beginning of each new call to `learn()`.
I’m wondering if the maintainers would be interested in some sort of PR that fixes these logging discrepancies.
I propose that `reset_num_timesteps=False` (perhaps rename to `reset_log_state`) should make PPO2 keep the state from (1) and (2).
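A hypothetical sketch of what that could look like inside `learn()`; this is not the actual PPO2 code, and `PPO2Sketch` / `_make_runner` exist only to illustrate the proposed state-keeping:

```python
from collections import deque


class PPO2Sketch:
    """Illustration only: keep the runner and episode-info buffer across learn() calls."""

    def __init__(self):
        self.runner = None
        self.ep_info_buf = None

    def _make_runner(self):
        # Stand-in for Runner(env=self.env, model=self, n_steps=self.n_steps),
        # whose construction resets the environment.
        return object()

    def learn(self, total_timesteps, reset_num_timesteps=True):
        if reset_num_timesteps or self.runner is None:
            self.runner = self._make_runner()      # (1) env only resets here
            self.ep_info_buf = deque(maxlen=100)   # (2) stats only cleared here
        # ... collect rollouts with self.runner, update the policy,
        #     and append finished-episode infos to self.ep_info_buf ...
```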
Top GitHub Comments
After talking to @shwang, the problem is more serious than I thought: it affects learning, not just logging.
`AbstractEnvRunner` calls `env.reset` in the `__init__` method. So any algorithm using `Runner`, when you call `learn` with fewer timesteps than the episode length, will never see the episode end during training! This would cause big problems in environments with a sparse reward at the end of an episode.
Admittedly this use case is rare, so it's fairly low-severity, but I think we should treat learning breaking in this setting as a bug.
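A toy illustration of that failure mode (`FixedHorizonSparseEnv` is made up for this sketch, not a real benchmark): with a 200-step horizon and a reward only on the final step, a loop of `learn()` calls shorter than 200 timesteps never observes a non-zero reward, because each call rebuilds the runner and resets the environment.

```python
import gym
import numpy as np


class FixedHorizonSparseEnv(gym.Env):
    """Reward is zero everywhere except +1 on the final (200th) step."""

    observation_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)
    action_space = gym.spaces.Discrete(2)

    def __init__(self, horizon=200):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return np.zeros(1, dtype=np.float32)

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        reward = 1.0 if done else 0.0  # sparse reward at the end of the episode
        return np.zeros(1, dtype=np.float32), reward, done, {}


# Calling learn(total_timesteps=128) in a loop on this env (as in the snippet
# at the top of the issue) never reaches step 200, so PPO2 only ever sees
# zero rewards and cannot learn anything.
```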
This should be fixed in v3, but needs to be checked, linking https://github.com/DLR-RM/stable-baselines3/issues/1